Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider deduplicating digests for hardlinked/symlinked outputs #24365

Open
tjgq opened this issue Nov 18, 2024 · 5 comments
Open

Consider deduplicating digests for hardlinked/symlinked outputs #24365

tjgq opened this issue Nov 18, 2024 · 5 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: feature request

Comments

@tjgq
Copy link
Contributor

tjgq commented Nov 18, 2024

Currently, when an action copies inputs to outputs by hardlinking or symlinking them, we recalculate digests for the output files, which can take a significant performance toll on builds that copy lots of files around. This could in theory be avoided if Bazel was able to recognize that it has seen the target of the link before (and hasn't been modified since).

For hardlinks, this could be as simple as using (st_dev, st_ino, st_mtime, st_ctime, st_size) as the DigestUtils cache key (instead of the current (path, st_ino, st_mtime, st_ctime, st_size)). Symlinks would require a bit more work, but are still doable. We'd have to think very carefully about the correctness implications, though.

cc @woody77 @fangism

@tjgq tjgq added team-Performance Issues for Performance teams type: feature request untriaged P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 18, 2024
@tjgq
Copy link
Contributor Author

tjgq commented Dec 1, 2024

On second thought, this seems doable for symlinks but not hardlinks. Adding an additional hardlink to an inode affects its st_ctime, and the inclusion of st_ctime in the cache key is required to prevent correctness issues such as #14723.

@woody77
Copy link

woody77 commented Dec 4, 2024

Are the cache key paths using the "real path" or the apparent path? (so that for symlinks if you asked it for the real path it would end up naturally matching the same cache key as the source?

@woody77
Copy link

woody77 commented Dec 4, 2024

re #14723 - wouldn't a new file with different contents create a new value for st_ino? or are they opening the existing file for writing and then truncating to length when changing it's contents?

@tjgq
Copy link
Contributor Author

tjgq commented Dec 4, 2024

Are the cache key paths using the "real path" or the apparent path?

Currently the cache is keyed by path; this proposal is to key it by inode instead.

wouldn't a new file with different contents create a new value for st_ino? or are they opening the existing file for writing and then truncating to length when changing it's contents?

Yes, the problematic scenario is where the contents of an existing file are changed while keeping both its size and mtime. We could say "don't do that", but my understanding of the issue is that there are tools out there that do this, and it might be unreasonable to declare them incompatible with Bazel.

@woody77
Copy link

woody77 commented Dec 4, 2024

some testing reveals that opening for writing on linux, with a path that's to an existing file, doesn't create a new inode. OTOH, outputs should always be new inodes if their directories change (but that would possibly require treating outputs and input differently).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: feature request
Projects
None yet
Development

No branches or pull requests

3 participants
@tjgq @woody77 and others