Add notify fs watcher to engine. #9318

hrfuller · 2020-03-16T23:44:26Z

Problem

Watchman is error prone, and requires pants to maintain and communicate with another daemon.

Solution

Implement file event watching with the notify-rs crate. Hook watchman and notify invalidation strategies into the same invalidation method, with different event sources.

Result

Notify and watchman fs event systems operate in tandem. We will follow up to allow users to disable watchman. Because of the way notify can watch more granular paths than watchman, file change events result in less drastic node invalidation, which should lead to faster runs with pantsd.

stuhood

Thanks!

src/rust/engine/Cargo.toml

src/rust/engine/src/nodes.rs

src/rust/engine/src/watch.rs

illicitonion

Super exciting, this looks great!

It would be good to see a few tests for:

Watching a symlink, and {deleting,repointing} it as a well as changing the contents of the file it points at
Watching the target of a symlink
Watching a chain of several symlinks
Creating or deleting a file invalidating the readdir of the containing directory
Deleting or renaming a directory invalidating the reading of files in that directory

src/rust/engine/src/watch.rs

stuhood · 2020-03-17T17:23:21Z

Watching a symlink, and {deleting,repointing} it as a well as changing the contents of the file it points at
Watching the target of a symlink
Watching a chain of several symlinks

@illicitonion : All of these are handled in the Graph rather than here. All the watch implementation needs to do is notify us that the symlink itself changed... we'll track the content lookup as an operation on the destination file. We've never relied on watchman to provide that.

Creating or deleting a file invalidating the readdir of the containing directory
Deleting or renaming a directory invalidating the reading of files in that directory

These would be good to do though. @hrfuller : one thing to note with regard to testing is that because notify doesn't require a separate daemon, it also doesn't require pantsd... we are effectively now using notify "everywhere"... including in all python unit tests that create or delete files under a scheduler. So I think that you should be able to test this in test_fs.py:

pants/tests/python/pants_test/engine/test_fs.py

Lines 232 to 233 in 3c3aaed

    
           def test_files_content_symlink(self): 
        
               self.assert_content(["c.ln/../3.txt"], {"c.ln/../3.txt": b"three\n"})

... by running an operation, deleting/renaming a file, waiting a brief period, and then re-running the operation?

…operations.

Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher.

* Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments

Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths.

stuhood

Thanks a ton! Nothing blocking.

src/rust/engine/src/watch.rs

src/rust/engine/graph/src/lib.rs

src/rust/engine/src/watch.rs

tests/python/pants_test/engine/test_fs.py

illicitonion

Looking good :) I would appreciate a few more tests in rust before merging, as I suspect we may be over-invalidating a fair bit (and possibly under-invalidating in some places) - a few more tests would help set my mind at ease!

src/rust/engine/src/nodes.rs

src/rust/engine/src/watch.rs

illicitonion · 2020-03-20T12:35:06Z

src/rust/engine/src/watch_tests.rs

+use fs::File;
+use testutil::make_file;
+
+#[test]


It would be good to see unit tests here also for:

build_root/dir/subdir/file changing only invalidating build_root/dir/subdir/file

build_root/dir/subdir/file being created or deleted invalidating build_root/dir/subdir/file and build_root/dir/subdir

A symlink target being changed invalidating exactly the symlink

For build_root/dir/subdir/file, subdir being deleted invalidating build_root/dir, build_root/dir/subdir and build_root/dir/subdir/file

build_root being a symlink

Changing the executable bit on a file invalidating that file

Hopefully these are all nicely extractable into common code so that each test is just:

Set up a filesystem

Call a function

Modify the filesystem

Make an assertion

There will be at least one followup patch in this series: I think investing more time in testing there would be reasonable.

But I also think that some of those tests should likely be in other modules: this module should probably only be testing "point" invalidation (since that is all that it actually implements), and then a separate module could test impl VFS for Context.

src/rust/engine/Cargo.toml

Return watch errors as core::Failure all the way to user. Check notify event type when deciding whether to invalidate parents. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod.

stuhood

There is an issue with this revision: see the comment in nodes.rs.

To confirm that we're not missing anything else like that, please ensure that you do something like timing the average of five runs of ./pants list :: (without pantsd) before/after the patch. If it's within 5 or 10% (we should expect some overhead), then we're good.

src/rust/engine/src/nodes.rs

…lazy

with file watching. Format watch and nodes files.

[ci skip-jvm-tests] # No JVM changes made.

…t interfere" This reverts commit 974b3ce. [ci skip-jvm-tests] # No JVM changes made.

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

hrfuller · 2020-03-24T08:57:33Z

Perf numbers
master

( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) 20.21s user 9.29s system 161% cpu 18.318 total
( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) 13.68s user 6.10s system 130% cpu 15.127 total
( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) 11.64s user 3.53s system 107% cpu 14.177 total

This branch

( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) 39.03s user 39.18s system 113% cpu 1:08.82 total
( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) 11.96s user 4.17s system 103% cpu 15.619 total
( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) 13.30s user 6.12s system 107% cpu 18.093 total

A note on these numbers. From the first entry in each list it's clear that we take a significant performance hit for startup runs when we have to watch alot of files. As far as I can tell this just comes down to it being somewhat slow to add watches to files, or possibly to the notify-rs implementation. Could we parallelize more here, if we are waiting for a lock on the watcher too long?

It's also interesting that the overhead from watching files becomes negligible compared to startup time for smaller sets of targets (src/::).

The situation in general gets much better when pants.d is enabled, even if significant amounts of nodes are invalidated.

Interested to hear thoughts on if this is acceptable and how to proceed if its not.

stuhood · 2020-03-24T16:25:01Z

Commented in Slack: it looks like there's an issue with the first line there.

hrfuller · 2020-03-24T18:25:19Z

Updated the first line, and yes unfortunately it doesn't track with the other lines. Either there are some files still hiding in the repo that are being git ignored but not pants ignored, or the way we are watching files is non-linear: we could be watching the same file multiple times, but we shouldn't be if Scandir nodes are run once per dir.

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

src/rust/engine/Cargo.lock

src/rust/engine/Cargo.toml

stuhood · 2020-03-24T22:31:37Z

tests/python/pants_test/engine/test_fs.py

@@ -134,7 +143,7 @@ def test_walk_parent_link(self):

    def test_walk_escaping_symlink(self):
        link = "subdir/escaping"
-        dest = "../../this-is-probably-nonexistent"
+        dest = "../../"


What happened here? AFAICT, it should be fine to leave this behind, because we shouldn't actually get to the point of looking for it.

The test failed because we started propagating errors from notify, in this case that the path it was trying to watch didn't exist.

The dest path still tries to link outside of the build root so the test should be the same.

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

build root at startup. On Linux watch individual directory roots. [ci skip-jvm-tests] # No JVM changes made.

hrfuller · 2020-03-25T03:02:20Z

Revised Perf numbers after recursive watch on OSX: Looks like watching is essentially free now!

This branch:

( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) > /dev/null 17.56s user 6.77s system 147% cpu 16.443 total
( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) > /dev/null 11.01s user 3.15s system 106% cpu 13.343 total
( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) > /dev/null 11.10s user 3.35s system 108% cpu 13.348 total

master

( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) > /dev/null 20.37s user 9.91s system 167% cpu 18.107 total
( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) > /dev/null 13.23s user 5.85s system 130% cpu 14.658 total
( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) > /dev/null 11.12s user 3.39s system 109% cpu 13.203 total

stuhood

Thanks!

It would be good to confirm that we don't see a substantial slowdown on Linux, but based on the investigation yesterday I'm optimistic, and ok doing that post merge.

src/rust/engine/src/watch.rs

hrfuller · 2020-03-25T18:13:22Z

on linux courtesy of @gshuflin we have

[gregs@gregs-dev-laptop pants]$ bash -c "time (for i in {1..5}; do ./v2 --no-enable-pantsd list ::; done ) > /dev/null"
real    0m19.423s
user    0m30.202s
sys     0m9.692s

…tsbuild#9318 by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place. [ci skip-jvm-tests] # No JVM changes made.

… (#9416) * Add a feature gate to disable the engine fs watcher introduced in #9318 by default, to mitigate issues seen in #9415 until a fix is in place.

* Use the notify crate to implement an `InvalidationWatcher` for Graph operations. * Make watch async, and watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher. * Respond to feedback. * Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments * Add rust tests. Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths. * Refactor Python tests. Return watch errors as core::Failure all the way to user. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod. * use futures lock on watcher * Platform specific watching behavior. On Darwin recursively watch the build root at startup. On Linux watch individual directory roots. Co-authored-by: Stu Hood <stuhood@gmail.com>

…tsbuild#9318 (pantsbuild#9416) * Add a feature gate to disable the engine fs watcher introduced in pantsbuild#9318 by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place.

@hrfuller

* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071) We're on an older version of tokio, which doesn't smoothly support usage of async/await. Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures. Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline. Progress toward more usage of async/await! * Add notify fs watcher to engine. (#9318) * Use the notify crate to implement an `InvalidationWatcher` for Graph operations. * Make watch async, and watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher. * Respond to feedback. * Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments * Add rust tests. Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths. * Refactor Python tests. Return watch errors as core::Failure all the way to user. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod. * use futures lock on watcher * Platform specific watching behavior. On Darwin recursively watch the build root at startup. On Linux watch individual directory roots. Co-authored-by: Stu Hood <stuhood@gmail.com> * Ignore notify events for pants_ignore patterns. (#9406) * Create a git ignorer on the context object. Adjust all call sites which create a posix fs to pass in an ignorer. * Ignore fsevents from files that match pants_ignore patterns. * Always pass is_dir = false to ignorer to avoid stat-ing every path the event watch thread sees. * Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416) * Add a feature gate to disable the engine fs watcher introduced in #9318 by default, to mitigate issues seen in #9415 until a fix is in place. * Don't rerun uncachable nodes if they are dirtied while running. (#9452) * Don't rerun uncachable nodes if they are dirtied while running. - Retry dependencies of uncacheable nodes a few times to get a result until we are exhausted from trying too many times. - Bubble uncacheable node retry errors up to the user, tell them things were chaning to much. - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run. * enable the engine fs watcher by default, now that it won't cause issues. Remove execution option override from tests. * use reference to self in stop_walk_predicate closure * invalidate often enough that test doesn't flake * Add a flag to prevent the FsEventService and watchman from starting (#9487) * add --watchman-enable flag * disable watchman when flag is false * Don't wait for the initial watchman event if we aren't using watchman. * check invalidation watcher liveness from scheduler service * Extract a `watch` crate. (#9635) The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test. Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests. [ci skip-jvm-tests] * Simplify Scheduler::execute and unify Graph retry (#9674) Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated. Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to: ``` let maybe_display_handle = Self::maybe_display_initialize(&session); let result = loop { if let Ok(res) = receiver.recv_timeout(refresh_interval) { break Ok(res); } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) { break Err(e); } }; Self::maybe_display_teardown(session, maybe_display_handle); result ``` A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop. * Move file invalidation handling to rust (#9636) A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`: 1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed 2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity 3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph` 4. `--loop` - implemented directly in the `SchedulerService` Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them. The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`: * `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls. * `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing). These features are then used by: * `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed. * invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files. The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`. Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`. * pin tokio at exactly 0.2.13 * fix lint issues * fix mypy typing issues * Move away from the debounced notify watcher #9754 * Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] * As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again! # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] Co-authored-by: Stu Hood <stuhood@gmail.com> Co-authored-by: Stu Hood <stuhood@twitter.com>

hrfuller requested review from stuhood, gshuflin and patliu85 March 16, 2020 23:44

stuhood reviewed Mar 17, 2020

View reviewed changes

src/rust/engine/Cargo.toml Show resolved Hide resolved

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved

illicitonion reviewed Mar 17, 2020

View reviewed changes

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved

stuhood and others added 4 commits March 19, 2020 19:51

Use the notify crate to implement an InvalidationWatcher for Graph …

f07a3df

…operations.

Make watch async, and watcher log to pantsd.log.

8bd1692

Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher.

Respond to feedback.

a7fb332

* Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments

Add rust test.

3bad8b2

Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths.

hrfuller force-pushed the hfuller/impl-notify-watcher branch from ae5c9dc to 3bad8b2 Compare March 20, 2020 03:02

hrfuller requested review from stuhood and illicitonion and removed request for gshuflin and patliu85 March 20, 2020 03:03

stuhood approved these changes Mar 20, 2020

View reviewed changes

illicitonion reviewed Mar 20, 2020

View reviewed changes

gshuflin reviewed Mar 20, 2020

View reviewed changes

src/rust/engine/Cargo.toml Show resolved Hide resolved

Refactor Python tests.

d9dd5de

Return watch errors as core::Failure all the way to user. Check notify event type when deciding whether to invalidate parents. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod.

stuhood reviewed Mar 21, 2020

View reviewed changes

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved

stuhood self-requested a review March 21, 2020 16:33

Henry Fuller added 8 commits March 21, 2020 17:05

Remove notify event specific handling. Get rid of unncessary future::…

6dd0288

…lazy

add warn to scandir run

e916089

use futures lock on watcher

263fc05

Move rust engine target directory to pants cache so it doesn't interfere

974b3ce

with file watching. Format watch and nodes files.

Merge branch 'master' into hfuller/impl-notify-watcher

89a937f

[ci skip-jvm-tests] # No JVM changes made.

Revert "Move rust engine target directory to pants cache so it doesn'…

3c5dc63

…t interfere" This reverts commit 974b3ce. [ci skip-jvm-tests] # No JVM changes made.

no need for futures locks anymore

1602241

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

only watch on scandir nodes

77f734c

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

hrfuller mentioned this pull request Mar 24, 2020

Prototype switching to the notify crate for file invalidation #4999

Closed

empty commit for ci trigger

de31d18

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood reviewed Mar 24, 2020

View reviewed changes

src/rust/engine/Cargo.lock Outdated Show resolved Hide resolved

stuhood reviewed Mar 24, 2020

View reviewed changes

src/rust/engine/Cargo.toml Outdated Show resolved Hide resolved

stuhood reviewed Mar 24, 2020

View reviewed changes

Henry Fuller added 2 commits March 24, 2020 15:39

remove tokio 0.1 from engine dev-dependencies

04b8746

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

Platform specific watching behavior. On Darwin recursively watch the

3f52ee5

build root at startup. On Linux watch individual directory roots. [ci skip-jvm-tests] # No JVM changes made.

hrfuller requested a review from stuhood March 25, 2020 03:02

stuhood approved these changes Mar 25, 2020

View reviewed changes

src/rust/engine/src/watch.rs Show resolved Hide resolved

hrfuller mentioned this pull request Mar 26, 2020

Ignore notify events for pants_ignore patterns. #9406

Merged

hrfuller merged commit 6db9ab4 into pantsbuild:master Mar 26, 2020

hrfuller deleted the hfuller/impl-notify-watcher branch March 27, 2020 20:48

hrfuller mentioned this pull request Mar 29, 2020

Add a feature gate to disable the engine fs watcher introduced in #9318 #9416

Merged

hrfuller mentioned this pull request Apr 2, 2020

Don't rerun uncachable nodes if they are dirtied while running. #9452

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add notify fs watcher to engine. #9318

Add notify fs watcher to engine. #9318

hrfuller commented Mar 16, 2020

stuhood left a comment

illicitonion left a comment

stuhood commented Mar 17, 2020

stuhood left a comment

illicitonion left a comment

illicitonion Mar 20, 2020

stuhood Mar 20, 2020 •

edited

Loading

stuhood left a comment

hrfuller commented Mar 24, 2020 •

edited

Loading

stuhood commented Mar 24, 2020

hrfuller commented Mar 24, 2020

stuhood Mar 24, 2020 •

edited

Loading

hrfuller Mar 24, 2020

hrfuller Mar 24, 2020

hrfuller commented Mar 25, 2020

stuhood left a comment

hrfuller commented Mar 25, 2020

Add notify fs watcher to engine. #9318

Add notify fs watcher to engine. #9318

Conversation

hrfuller commented Mar 16, 2020

Problem

Solution

Result

stuhood left a comment

Choose a reason for hiding this comment

illicitonion left a comment

Choose a reason for hiding this comment

stuhood commented Mar 17, 2020

stuhood left a comment

Choose a reason for hiding this comment

illicitonion left a comment

Choose a reason for hiding this comment

illicitonion Mar 20, 2020

Choose a reason for hiding this comment

stuhood Mar 20, 2020 • edited Loading

Choose a reason for hiding this comment

stuhood left a comment

Choose a reason for hiding this comment

hrfuller commented Mar 24, 2020 • edited Loading

stuhood commented Mar 24, 2020

hrfuller commented Mar 24, 2020

stuhood Mar 24, 2020 • edited Loading

Choose a reason for hiding this comment

hrfuller Mar 24, 2020

Choose a reason for hiding this comment

hrfuller Mar 24, 2020

Choose a reason for hiding this comment

hrfuller commented Mar 25, 2020

stuhood left a comment

Choose a reason for hiding this comment

hrfuller commented Mar 25, 2020

stuhood Mar 20, 2020 •

edited

Loading

hrfuller commented Mar 24, 2020 •

edited

Loading

stuhood Mar 24, 2020 •

edited

Loading