Add `rctx.watch_tree()` to watch a directory tree #21362

Wyverald · 2024-02-15T04:06:00Z

- Added rctx.watch_tree() to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to "Use stat() to avoid checking content hashes for repository up-to-dateness checks", #21044)

Work towards #20952.
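
To make the "names plus file contents" idea concrete, here is a minimal Python sketch of a recursive tree digest. It is not the DirectoryTreeDigestFunction from this PR: the function name `tree_digest`, the choice of SHA-256, and the byte layout are all assumptions made for illustration, and symlinks get no special treatment (a dangling symlink would make `open()` raise).

```python
import hashlib
import os


def tree_digest(root: str) -> str:
    """Hypothetical helper: digest every descendant's relative path and,
    for files, its contents. Symlinks are not treated specially here."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        rel_dir = os.path.relpath(dirpath, root)
        # Directory names contribute to the digest even when the directory is empty.
        h.update(b"D\0" + rel_dir.encode() + b"\0")
        for name in sorted(filenames):
            rel = os.path.join(rel_dir, name)
            h.update(b"F\0" + rel.encode() + b"\0")
            with open(os.path.join(dirpath, name), "rb") as f:
                h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()
```

With a digest like this, renaming a file, editing its contents, or adding an empty directory all produce a different value, which is the signal a watcher would use to decide on a refetch.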

Wyverald · 2024-02-15T04:07:26Z

@lberki I know I haven't written tests yet, but it's very late here and this compiles and I wanted to make sure it's not completely ridiculous before I continue to work on it tomorrow.

cc @ismell

lberki · 2024-02-15T08:46:20Z

The two main questions in my head:

Why not on StarlarkBaseExternalContext like watch?
WDYT about adopting a glob-like syntax for a bit more consistent API, possible code sharing with actual glob and possibly sharing caches between watching a directory and globbing over it?

Wyverald · 2024-02-15T13:35:56Z

Why not on StarlarkBaseExternalContext like watch?

mostly just being conservative -- only introducing this in the context we know it's needed in right now. we can add it to mctx later if need be. (whereas 'watch' had to be added to the base context because 'read' was already in the base context)

WDYT about adopting a glob-like syntax for a bit more consistent API, possible code sharing with actual glob and possibly sharing caches between watching a directory and globbing over it?

i think i addressed this somewhat in the other thread, but just to the API point, I don't have a strong feeling as of right now. Do you think having glob patterns is potentially useful enough in this API that we should e.g. change rctx.watch_tree to path.glob instead?

lberki · 2024-02-15T14:07:21Z

I think the important part is not renaming watch_tree() to glob() but to pass in a list of directories, suitable for eventually amending with an excludes= attribute.

Symmetry is a nice bonus, but what leads me to think it's better that way is that glob() is battle-proven and this function does a very similar thing, so why not adopt the interface of glob()?

Wyverald · 2024-02-15T21:11:44Z

I think the important part is not renaming watch_tree() to glob() but to pass in a list of directories, suitable for eventually amending with an excludes= attribute.

I mentioned "renaming rctx.watch_tree() to path.glob()" not because of the method name per se, but because we'll need a StarlarkPath parameter next to the glob patterns anyway. Outright writing something like rctx.glob(['*.txt']) would mean "give me all .txt files in the repo I'm currently fetching into" which is rather useless. What you likely want is to glob in some absolute path or inside some other repo; crucially, "globbing inside some other repo" can't be expressed using glob patterns alone, as you need some label-like syntax to deal with the repo part.

So my current thinking is that we can add support for glob patterns to rctx.watch_tree, so you could do rctx.watch_tree("/some_other_dir", glob=['*.txt'], exclude=['.git']). Or if you wanted to watch a path inside some other repo, you could do rctx.watch_tree(rctx.path(Label("@other_repo//pkg:BUILD")).dirname, glob=..., exclude=...). Both glob and exclude would be optional parameters. WDYT?
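
As a rough illustration of how such optional `glob` and `exclude` parameters might filter a watched tree, here is a Python sketch. Everything in it is an assumption: the parameters are only proposed in this thread, `is_watched` is a hypothetical helper, and Python's `fnmatch` does not have the same semantics as Bazel's glob() (for instance, it has no notion of `**`). The sketch assumes `exclude` patterns match any path component and `glob` patterns match the file's base name.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_watched(rel_path, glob=("*",), exclude=()):
    """Hypothetical filter: decide whether a path relative to the watched
    root would be included. Assumed semantics, not Bazel's."""
    parts = PurePosixPath(rel_path).parts
    # Assumption: an exclude pattern prunes any path with a matching component.
    if any(fnmatch(part, pat) for part in parts for pat in exclude):
        return False
    # Assumption: glob patterns are matched against the base name only.
    return any(fnmatch(parts[-1], pat) for pat in glob)
```

Under these assumptions, `is_watched("docs/a.txt", glob=["*.txt"], exclude=[".git"])` is true, while `.git/notes.txt` and `src/main.py` are both filtered out.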

Symmetry is a nice bonus, but what leads me to think it's better that way is that glob() is battle-proven and this function does a very similar thing, so why not adopt the interface of glob()?

One major difference is that glob() doesn't imply it watches the file contents. It almost matches path.readdir() better than rctx.watch_tree(). So flat out adopting the interface of glob() could bring extra confusion, in addition to what I outlined above.

Wyverald · 2024-02-15T23:51:50Z

So I did some simple benchmarking using the Bazel project itself (just the bazelbuild/bazel git repo). With a simple repo rule that watches the Bazel source tree in its entirety (including the convenience symlinks going into output base, and the .git directory), totalling ~12K files, on my 2021 M1 macbook pro:

| test (times in s) | avg | max | min | dev |
|---|---|---|---|---|
| cold build (no running Bazel server) | 6.84 | 7.56 | 6.42 | 0.41 |
| hot build (running Bazel server) | 1.13 | 1.20 | 1.08 | 0.04 |
| control, cold build (no watching at all) | 1.61 | 1.72 | 1.58 | 0.04 |

so the recursive digesting takes about 5 seconds. This is probably already an improvement for @ismell since you do benefit from hot builds unlike the CACHE_BUST_DATE thing.

lberki · 2024-02-16T07:49:24Z

The fact that watch_tree() watches file contents but glob() does not is a good point. I'm fine with anything as long as the syntax can be extended later to be the same as that of glob(). As such, if you insist, I'm fine with the current approach, but how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"]? That would provide the easiest approach to glob-like semantics.

And now, on for the review of the actual code!

src/main/java/com/google/devtools/build/lib/skyframe/DirectoryTreeDigestFunction.java

lberki · 2024-02-16T08:10:21Z

The code itself looks quite reasonable, modulo some nits. I have a question about semantics to both @Wyverald and @ismell though: AFAICT this implementation is agnostic to the symlink structure. I.e. if you have files a and b with the same contents and x is a symlink, the tree digest will be the same regardless of whether x points to a or b.

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

@Wyverald : Are these statements correct?

@Wyverald @ismell : are these properties desirable? Working on Blaze, I'd be more comfortable with being conservative and making the symlink structure affect the checksum, although that'd add quite a bit of complexity since it's not immediately obvious what differences in the symlink tree should matter (e.g. if a tree has an absolute symlink, but to within itself, should the checksum depend on the location of the tree in the file system?)

lberki · 2024-02-16T08:12:23Z

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Wyverald · 2024-02-16T13:25:06Z

AFAICT this implementation is agnostic to the symlink structure.

Indeed; there's even a test case for that!

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

what do you mean by 'content'? as in, the path of the symlink target? then yes, the digest only contains the fact that it's dangling.

Wyverald · 2024-02-16T13:33:07Z

how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"])?

how does that help with migration, actually? adding the includes parameter when it's available seems better to me.

lberki · 2024-02-16T13:35:48Z

AFAICT this implementation is agnostic to the symlink structure.

Indeed; there's even a test case for that!

not my sharpest moment

Also, if there is a dangling symlink y, the digest will be the same regardless of its content.

what do you mean by 'content'? as in, the path of the symlink target? then yes, the digest only contains the fact that it's dangling.

Yep, that's exactly it. TBH I'd be happier if in that case, there was a refetch on the theory that it's better to be slow than to be incorrect. Do you think that's feasible? (if @ismell says that this is OK for him, I'll relent, grudgingly)

lberki · 2024-02-16T13:37:39Z

how about watch_tree(path=<path>, includes=) and erroring out if includes= is not ["**"])?

how does that help with migration, actually? adding the includes parameter when it's available seems better to me.

Not a lot, just a tiny bit. My line of reasoning was that then you don't need to default includes= to ["**"], but I guess that's not that big a deal?

fmeum · 2024-02-16T14:06:38Z

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

ismell · 2024-02-16T16:04:20Z

are these properties desirable? Working on Blaze, I'd be more comfortable with being conservative and making the symlink structure affect the checksum, although that'd add quite a bit of complexity since it's not immediately obvious what differences in the symlink tree should matter (e.g. if a tree has an absolute symlink, but to within itself, should the checksum depend on the location of the tree in the file system?)

In our implementation we follow symlinks and hash the underlying file. If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

Wyverald · 2024-02-16T19:12:53Z

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...

@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.

@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)
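
The symlink behavior discussed above can be sketched in a few lines of Python. This is an illustration only, not the PR's Java code: `entry_digest` and the marker byte strings are hypothetical, and the sketch covers a single directory entry rather than a whole tree. Symlinks to files are followed and only the target's contents are hashed; a dangling symlink contributes just the fact that it is dangling.

```python
import hashlib
import os


def entry_digest(path: str) -> bytes:
    """Hypothetical per-entry digest that ignores symlink structure."""
    if os.path.isfile(path):  # isfile() follows symlinks
        with open(path, "rb") as f:
            return b"FILE" + hashlib.sha256(f.read()).digest()
    if os.path.islink(path) and not os.path.exists(path):
        # Only the fact of dangling is recorded, not the target path.
        return b"DANGLING"
    return b"OTHER"  # directories and anything else, not covered by this sketch
```

If x points to a and is later repointed to b with identical contents, the digest is unchanged. A dangling y yields the same digest whatever its target path is, but once the missing target is created the entry becomes a file and the digest changes.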

lberki · 2024-02-19T07:16:28Z

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...

@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.

@Wyverald -- the question is whether you feel comfortable supporting this API forever, or else paying the migration costs. Let's game out what possible changes we might want to make:

- Making the interface closer to that of glob() seems to be doable without a costly migration
- Watching symlink structure (in addition to just the contents of files) can be put behind a flag and I can imagine people wanting both so that wouldn't be a "forever migration flag"
- Watching file contents, likewise (that, I can't imagine anyone wanting, but still)
- Unifying it with watch() will very probably not happen since watch_tree() must have a much richer API (includes, excludes, symlink structure awareness, etc.)

So I think we are fine marking this as non-experimental. @fmeum can you come up with a change to watch_tree() that would require a migration?

@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)

lberki · 2024-02-19T07:18:06Z

@lberki

@Wyverald do you know what takes 5 seconds? Initially, I thought that it was because the checksumming and tree traversal run on one thread, but now I know that it's parallelized. My Bazel tree (according to du -sh .) is 1.2GB and 5 seconds to checksum 1.2GB isn't great performance.

Not completely sure, but just noting that du doesn't follow symlinks so it didn't count the output base and everything (whereas rctx.watch_tree does, thanks to the convenience symlinks). $outputBase/external alone reported 5.5GB so that's already much bigger. (EDIT: and bazel-out was 4.3GB. that already totals to 11GB)

Good point -- I actually ran du in a clean checkout so following symlinks wouldn't have made a difference. Your numbers show an order of magnitude better performance and that ~2GB/sec is very much good enough. It's actually impressive how good that is without any specific effort going into optimization :)
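
For reference, the ~2GB/sec figure can be reproduced from numbers quoted in this thread. The short Python calculation below assumes the watched data is the 1.2GB checkout plus the 5.5GB and 4.3GB reached through the convenience symlinks, and that the digesting cost is the difference between the two cold-build averages; the variable names are made up for this sketch.

```python
# Averages from the benchmark earlier in the thread, in seconds.
cold_with_watch_s = 6.84
cold_control_s = 1.61
digest_overhead_s = cold_with_watch_s - cold_control_s

# Assumed total watched size in GB: checkout + $outputBase/external + bazel-out.
watched_gb = 1.2 + 5.5 + 4.3

throughput_gb_per_s = watched_gb / digest_overhead_s
print(f"{digest_overhead_s:.2f} s, {throughput_gb_per_s:.1f} GB/s")  # prints "5.23 s, 2.1 GB/s"
```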

Not a lot, just a tiny bit My line of reasoning was then you don't need to default includes= to ["**"] but I guess that's not that big a deal?

I think that's as small as deals get, haha...

@fmeum

Could we guard the watch_tree API with an experimental flag for 7.1.0? Compared to the other changes, this one looks much more intricate both regarding its API surface and its implementation details.

Not totally against it, but I feel like we've gone through most of the concerns already and this API and its behavior are malleable enough that we can add to it comfortably. Happy to hear @lberki's thoughts on this.

@ismell

If a symlink can't be resolved we hash the symlink contents instead. This way the digest changes if the missing file gets created and the symlink becomes valid.

For the record, you don't need to hash the symlink contents for that -- just storing the fact that it's dangling is enough to cause the digest to change if the missing file gets created. (which is what I'm doing in this PR.)

fmeum · 2024-02-19T07:32:42Z

@lberki Agreed, the changes I would reasonably expect could just be new parameters or, in the worst case, Bazel flags with a smaller scope.

fmeum · 2024-02-19T07:46:30Z

src/main/java/com/google/devtools/build/lib/skyframe/DirectoryTreeDigestFunction.java

+import javax.annotation.Nullable;
+
+/** A {@link SkyFunction} for {@link DirectoryTreeDigestValue}s. */
+public final class DirectoryTreeDigestFunction implements SkyFunction {


The fact that adding this feature turned out to be pretty manageable with decent performance makes me think that we should reevaluate the state of BAZEL_TRACK_SOURCE_DIRECTORIES. It's essentially an undocumented "forever experimental" flag and maybe, as we saw here, stabilizing it wouldn't have to be that much effort.

@lberki What do you think?

- Added `rctx.watch_tree()` to watch a directory tree, which includes all transitive descendants' names, and if they're files, their contents.
  - In the future we could add glob patterns to this method.
- Added a new SkyFunction DirectoryTreeDigestFunction to do the heavy lifting.
  - In the future, for performance, we could try to get this skyfunction to have a mode where it only digests stat(), to use as heuristics (similar to #21044)

Work towards #20952.

Closes #21362.

PiperOrigin-RevId: 608667062
Change-Id: Ibacbb7af4cf4d7628fe8fcf06e2c4fa50e811e4e

meteorcloudy requested a review from SalmaSamy February 15, 2024 09:58

Wyverald force-pushed the wyv-watch-tree branch from 01a297c to 1cf3caa Compare February 15, 2024 19:07

Wyverald force-pushed the wyv-watch-dir branch from da59002 to 551f437 Compare February 15, 2024 20:46

Wyverald force-pushed the wyv-watch-tree branch from 1cf3caa to 89c2a26 Compare February 15, 2024 20:46

Wyverald force-pushed the wyv-watch-tree branch from 89c2a26 to 372bed2 Compare February 15, 2024 23:11

Wyverald marked this pull request as ready for review February 15, 2024 23:12

Wyverald requested a review from lberki as a code owner February 15, 2024 23:12

Wyverald requested review from meteorcloudy and fmeum February 15, 2024 23:12

github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Feb 15, 2024

Wyverald force-pushed the wyv-watch-tree branch from 372bed2 to 5922093 Compare February 15, 2024 23:28

lberki reviewed Feb 16, 2024

Wyverald force-pushed the wyv-watch-tree branch from 5922093 to 50be63b Compare February 16, 2024 20:55

Wyverald requested review from a team, gregestren and fweikert as code owners February 16, 2024 20:55

Wyverald changed the base branch from wyv-watch-dir to master February 16, 2024 20:55

Wyverald removed request for a team, gregestren and fweikert February 16, 2024 20:55

lberki approved these changes Feb 19, 2024

fmeum approved these changes Feb 19, 2024

copybara-service bot closed this in fffa0af Feb 20, 2024

github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Feb 20, 2024

Wyverald deleted the wyv-watch-tree branch February 20, 2024 19:03
