Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rctx.watch_tree() to watch a directory tree #21362

Closed
wants to merge 1 commit into from

Conversation

Wyverald
Copy link
Member

@Wyverald Wyverald commented Feb 15, 2024

  • Added rctx.watch_tree() to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
    • In the future we could add glob patterns to this method.
  • Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.

Work towards #20952.

@Wyverald
Copy link
Member Author

@lberki I know I haven't written tests yet, but it's very late here and this compiles and I wanted to make sure it's not completely ridiculous before I continue to work on it tomorrow.

cc @ismell

@lberki
Copy link
Contributor

lberki commented Feb 15, 2024

The two main questions in my head:

  • Why not on StarlarkBaseExternalContext like watch?
  • WDYT about adopting a glob-like syntax for a bit more consistent API, possible code sharing with actual glob and possibly sharing caches between watching a directory and globbing over it?

@Wyverald
Copy link
Member Author

Why not on StarlarkBaseExternalContext like watch?

mostly just being conservative -- only introducing this in the context we know it's needed in right now. we can add it to mctx later if need be. (whereas 'watch' had to be added to the base context because 'read' was already in the base context)

WDYT about adopting a glob-like syntax for a bit more consistent API, possible code sharing with actual glob and possibly sharing caches between watching a directory and globbing over it?

i think i addressed this somewhat in the other thread, but just to the API point, I don't have a strong feeling as of right now. Do you think having glob patterns are potentially useful enough in this api that we should e.g. change rctx.watch_tree to path.glob instead?

@lberki
Copy link
Contributor

lberki commented Feb 15, 2024

I think the important part is not renaming watch_tree() to glob() but to pass in a list of directories, suitable for eventually amending with an excludes= attribute.

Symmetry is a nice bonus, but what leads me think it's better that way is that glob() is battle-proven and this function does a very similar thing so why not adopt the interface of glob()?

@Wyverald
Copy link
Member Author

I think the important part is not renaming watch_tree() to glob() but to pass in a list of directories, suitable for eventually amending with an excludes= attribute.

I mentioned "renaming rctx.watch_tree() to path.glob()" not because of the method name per se, but because we'll need a StarlarkPath parameter next to the glob patterns anyway. Outright writing something like rctx.glob(['*.txt']) would mean "give me all .txt files in the repo I'm currently fetching into" which is rather useless. What you likely want is to glob in some absolute path or inside some other repo; crucially, "globbing inside some other repo" can't be expressed using glob patterns alone, as you need some label-like syntax to deal with the repo part.

So my current thinking is that we can add support for glob patterns to rctx.watch_tree, so you could do rctx.watch_tree("/some_other_dir", glob=['*.txt'], exclude=['.git']). Or if you wanted to watch a path inside some other repo, you could do rctx.watch_tree(rctx.path(Label("@other_repo//pkg:BUILD")).dirname, glob=..., exclude=...). Both glob and exclude would be optional paramters. WDYT?

Symmetry is a nice bonus, but what leads me think it's better that way is that glob() is battle-proven and this function does a very similar thing so why not adopt the interface of glob()?

One major difference is that glob() doesn't imply it watches the file contents. It almost matches path.readdir() better than rctx.watch_tree(). So flat out adopting the interface of glob() could bring extra confusion, in addition to what I outlined above.

@Wyverald Wyverald marked this pull request as ready for review February 15, 2024 23:12
@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Feb 15, 2024
@Wyverald
Copy link
Member Author

So I did some simple benchmarking using the Bazel project itself (just the bazelbuild/bazel git repo). With a simple repo rule that watches the Bazel source tree in its entirety (including the convenience symlinks going into output base, and the .git directory), totalling ~12K files, on my 2021 M1 macbook pro:

test times (s) avg max min dev
cold build (no running Bazel server) 6.84 7.56 6.42 0.41
hot build (running Bazel server) 1.13 1.20 1.08 0.04
control, cold build (no watching at all) 1.61 1.72 1.58 0.04

so the recursive digesting takes about ~5 seconds. This is probably already an improvement for @ismell since you do benefit from hot builds unlike the CACHE_BUST_DATE thing.

@lberki
Copy link
Contributor

lberki commented Feb 16, 2024

The fact that watch_tree() watches file contents but glob() does not is a good point. I'm fine with anything as long as the syntax can be extended later to be the same as that of glob(). As such, If you insist, I'm fine with the current approach, but how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"])? that would provide the easiest approach to glob-like semantics.

And now, on for the review of the actual code!

@lberki
Copy link
Contributor

lberki commented Feb 16, 2024

The code itself looks quite reasonable, modulo some nits. I have a question about semantics to both @Wyverald and @ismell though: AFAICT this implementation is agnostic to the symlink structure. I.e. if you have files a and b with the same contents and x is a symlink, the tree digest will be the same regardless of x points to a or b.

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

@Wyverald : Are these statements correct?

@Wyverald @ismell : are these properties desirable? Working on Blaze, I'd be more comfortable with being conservative and making the symlink structure affect the checksum, although that'd add quite a bit of complexity since it's not immedately obvious what differences in the symlink tree should matter (e.g. if a tree has an absolute symlink, but to within itself, should the checksum depend on the location of the tree in the file system?)

@lberki
Copy link
Contributor

lberki commented Feb 16, 2024

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

@Wyverald
Copy link
Member Author

AFAICT this implementation is agnostic to the symlink structure.

Indeed; there's even a test case for that!

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

what do you mean by 'content'? as in, the path of the symlink target? then yes, the digest only contains the fact that it's dangling.

@Wyverald
Copy link
Member Author

how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"])?

how does that help with migration, actually? adding the includes parameter when it's available seems better to me.

@lberki
Copy link
Contributor

lberki commented Feb 16, 2024

AFAICT this implementation is agnostic to the symlink structure.

Indeed; there's even a test case for that!

not my sharpest moment

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

what do you mean by 'content'? as in, the path of the symlink target? then yes, the digest only contains the fact that it's dangling.

Yep, that's exactly it. TBH I'd be happier if in that case, there was a refetch on the theory that it's better to be slow than to be correct. Do you think that's feasible? (if @ismell says that this is OK for him, I'll relent, grudgingly)

@lberki
Copy link
Contributor

lberki commented Feb 16, 2024

how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"])?

how does that help with migration, actually? adding the includes parameter when it's available seems better to me.

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

@fmeum
Copy link
Collaborator

fmeum commented Feb 16, 2024

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

@ismell
Copy link

ismell commented Feb 16, 2024

are these properties desirable? Working on Blaze, I'd be more comfortable with being conservative and making the symlink structure affect the checksum, although that'd add quite a bit of complexity since it's not immedately obvious what differences in the symlink tree should matter (e.g. if a tree has an absolute symlink, but to within itself, should the checksum depend on the location of the tree in the file system?)

In our implementation we follow symlinks and hash the underlying file. If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

@Wyverald
Copy link
Member Author

Wyverald commented Feb 16, 2024

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...


@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.


@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)

- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.
@Wyverald Wyverald changed the base branch from wyv-watch-dir to master February 16, 2024 20:55
@Wyverald Wyverald removed request for a team, gregestren and fweikert February 16, 2024 20:55
@lberki
Copy link
Contributor

lberki commented Feb 19, 2024

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...

@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.

@Wyverald -- the question is, if you feel comfortable supporting this API forever, or else pay the migration costs. Let's game out what possible changes we might want to make:

  • Making the interface closer to that of glob() seems to be doable without a costly migration
  • Watching symlink structure (in addition to just the contents of files) can be put behind a flag and I can imagine people wanting both so that wouldn't be a "forever migration flag"
  • Watching file contents, likewise (that, I can't imagine anyone wanting, but still)
  • Unifying it with watch() will very probably not happen since watch_tree() must have a much richer API (includes, excludes, symlink structure awareness, etc.)

So I think we are fine marking this as non-experimental. @fmeum can you come up with a change to watch_tree() that would require a migration?

@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)

@lberki
Copy link
Contributor

lberki commented Feb 19, 2024

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Good point -- I actually ran du in a clean checkout so following symlinks wouldn't have made a difference. Your numbers show an order of magnitude better performance and that ~2GB/sec is very much good enough. It's actually impressive how good that is without any specific effort going into optimization :)

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...

@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.

@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)

@fmeum
Copy link
Collaborator

fmeum commented Feb 19, 2024

@lberki Agreed, the changes I would reasonably expect could just be new parameters or, in the worst case, Bazel flags with a smaller scope.

import javax.annotation.Nullable;

/** A {@link SkyFunction} for {@link DirectoryTreeDigestValue}s. */
public final class DirectoryTreeDigestFunction implements SkyFunction {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that adding this feature turned out to be pretty manageable with decent performance makes me think that we should reevaluate the state of BAZEL_TRACK_SOURCE_DIRECTORIES. It's essentially an undocumented "forever experimental" flag and maybe, as we saw here, stabilizing it wouldn't have to be that much effort.

@lberki What do you think?

@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Feb 20, 2024
@Wyverald Wyverald deleted the wyv-watch-tree branch February 20, 2024 19:03
Wyverald added a commit that referenced this pull request Feb 20, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  -  In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e
Wyverald added a commit that referenced this pull request Feb 20, 2024
- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  -  In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants