Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add notify fs watcher to engine. #9318

Merged
merged 16 commits into from
Mar 26, 2020

Conversation

hrfuller
Copy link
Member

Problem

Watchman is error prone, and requires pants to maintain and communicate with another daemon.

Solution

Implement file event watching with the notify-rs crate. Hook watchman and notify invalidation strategies into the same invalidation method, with different event sources.

Result

Notify and watchman fs event systems operate in tandem. We will follow up to allow users to disable watchman. Because of the way notify can watch more granular paths than watchman, file change events result in less drastic node invalidation, which should lead to faster runs with pantsd.

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/rust/engine/Cargo.toml Show resolved Hide resolved
src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super exciting, this looks great!

It would be good to see a few tests for:

  • Watching a symlink, and {deleting,repointing} it as a well as changing the contents of the file it points at
  • Watching the target of a symlink
  • Watching a chain of several symlinks
  • Creating or deleting a file invalidating the readdir of the containing directory
  • Deleting or renaming a directory invalidating the reading of files in that directory

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
@stuhood
Copy link
Member

stuhood commented Mar 17, 2020

Watching a symlink, and {deleting,repointing} it as a well as changing the contents of the file it points at
Watching the target of a symlink
Watching a chain of several symlinks

@illicitonion : All of these are handled in the Graph rather than here. All the watch implementation needs to do is notify us that the symlink itself changed... we'll track the content lookup as an operation on the destination file. We've never relied on watchman to provide that.

Creating or deleting a file invalidating the readdir of the containing directory
Deleting or renaming a directory invalidating the reading of files in that directory

These would be good to do though. @hrfuller : one thing to note with regard to testing is that because notify doesn't require a separate daemon, it also doesn't require pantsd... we are effectively now using notify "everywhere"... including in all python unit tests that create or delete files under a scheduler. So I think that you should be able to test this in test_fs.py:

def test_files_content_symlink(self):
self.assert_content(["c.ln/../3.txt"], {"c.ln/../3.txt": b"three\n"})
... by running an operation, deleting/renaming a file, waiting a brief period, and then re-running the operation?

stuhood and others added 4 commits March 19, 2020 19:51
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.
@hrfuller hrfuller force-pushed the hfuller/impl-notify-watcher branch from ae5c9dc to 3bad8b2 Compare March 20, 2020 03:02
@hrfuller hrfuller requested review from stuhood and illicitonion and removed request for gshuflin and patliu85 March 20, 2020 03:03
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton! Nothing blocking.

src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
src/rust/engine/graph/src/lib.rs Outdated Show resolved Hide resolved
src/rust/engine/src/watch.rs Show resolved Hide resolved
src/rust/engine/src/watch.rs Outdated Show resolved Hide resolved
tests/python/pants_test/engine/test_fs.py Outdated Show resolved Hide resolved
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good :) I would appreciate a few more tests in rust before merging, as I suspect we may be over-invalidating a fair bit (and possibly under-invalidating in some places) - a few more tests would help set my mind at ease!

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
src/rust/engine/src/watch.rs Show resolved Hide resolved
use fs::File;
use testutil::make_file;

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to see unit tests here also for:

  • build_root/dir/subdir/file changing only invalidating build_root/dir/subdir/file
  • build_root/dir/subdir/file being created or deleted invalidating build_root/dir/subdir/file and build_root/dir/subdir
  • A symlink target being changed invalidating exactly the symlink
  • For build_root/dir/subdir/file, subdir being deleted invalidating build_root/dir, build_root/dir/subdir and build_root/dir/subdir/file
  • build_root being a symlink
  • Changing the executable bit on a file invalidating that file

Hopefully these are all nicely extractable into common code so that each test is just:

  • Set up a filesystem
  • Call a function
  • Modify the filesystem
  • Make an assertion

Copy link
Member

@stuhood stuhood Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be at least one followup patch in this series: I think investing more time in testing there would be reasonable.

But I also think that some of those tests should likely be in other modules: this module should probably only be testing "point" invalidation (since that is all that it actually implements), and then a separate module could test impl VFS for Context.

Return watch errors as core::Failure all the way to user.
Check notify event type when deciding whether to invalidate parents.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue with this revision: see the comment in nodes.rs.

To confirm that we're not missing anything else like that, please ensure that you do something like timing the average of five runs of ./pants list :: (without pantsd) before/after the patch. If it's within 5 or 10% (we should expect some overhead), then we're good.

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
@stuhood stuhood self-requested a review March 21, 2020 16:33
Henry Fuller added 8 commits March 21, 2020 17:05
with file watching. Format watch and nodes files.
[ci skip-jvm-tests]  # No JVM changes made.
…t interfere"

This reverts commit 974b3ce.

[ci skip-jvm-tests]  # No JVM changes made.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
@hrfuller
Copy link
Member Author

hrfuller commented Mar 24, 2020

Perf numbers
master

  • ( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) 20.21s user 9.29s system 161% cpu 18.318 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) 13.68s user 6.10s system 130% cpu 15.127 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) 11.64s user 3.53s system 107% cpu 14.177 total

This branch

  • ( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) 39.03s user 39.18s system 113% cpu 1:08.82 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) 11.96s user 4.17s system 103% cpu 15.619 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) 13.30s user 6.12s system 107% cpu 18.093 total

A note on these numbers. From the first entry in each list it's clear that we take a significant performance hit for startup runs when we have to watch alot of files. As far as I can tell this just comes down to it being somewhat slow to add watches to files, or possibly to the notify-rs implementation. Could we parallelize more here, if we are waiting for a lock on the watcher too long?

It's also interesting that the overhead from watching files becomes negligible compared to startup time for smaller sets of targets (src/::).

The situation in general gets much better when pants.d is enabled, even if significant amounts of nodes are invalidated.

Interested to hear thoughts on if this is acceptable and how to proceed if its not.

@stuhood
Copy link
Member

stuhood commented Mar 24, 2020

Commented in Slack: it looks like there's an issue with the first line there.

@hrfuller
Copy link
Member Author

Updated the first line, and yes unfortunately it doesn't track with the other lines. Either there are some files still hiding in the repo that are being git ignored but not pants ignored, or the way we are watching files is non-linear: we could be watching the same file multiple times, but we shouldn't be if Scandir nodes are run once per dir.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
src/rust/engine/Cargo.lock Outdated Show resolved Hide resolved
src/rust/engine/Cargo.toml Outdated Show resolved Hide resolved
@@ -134,7 +143,7 @@ def test_walk_parent_link(self):

def test_walk_escaping_symlink(self):
link = "subdir/escaping"
dest = "../../this-is-probably-nonexistent"
dest = "../../"
Copy link
Member

@stuhood stuhood Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here? AFAICT, it should be fine to leave this behind, because we shouldn't actually get to the point of looking for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failed because we started propagating errors from notify, in this case that the path it was trying to watch didn't exist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dest path still tries to link outside of the build root so the test should be the same.

Henry Fuller added 2 commits March 24, 2020 15:39
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
build root at startup. On Linux watch individual directory roots.

[ci skip-jvm-tests]  # No JVM changes made.
@hrfuller
Copy link
Member Author

Revised Perf numbers after recursive watch on OSX: Looks like watching is essentially free now!

This branch:

  • ( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) > /dev/null 17.56s user 6.77s system 147% cpu 16.443 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) > /dev/null 11.01s user 3.15s system 106% cpu 13.343 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) > /dev/null 11.10s user 3.35s system 108% cpu 13.348 total

master

  • ( repeat 5; do; ./v2 --no-enable-pantsd list ::; done; ) > /dev/null 20.37s user 9.91s system 167% cpu 18.107 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list src/::; done; ) > /dev/null 13.23s user 5.85s system 130% cpu 14.658 total
  • ( repeat 5; do; ./v2 --no-enable-pantsd list contrib/::; done; ) > /dev/null 11.12s user 3.39s system 109% cpu 13.203 total

@hrfuller hrfuller requested a review from stuhood March 25, 2020 03:02
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

It would be good to confirm that we don't see a substantial slowdown on Linux, but based on the investigation yesterday I'm optimistic, and ok doing that post merge.

src/rust/engine/src/watch.rs Show resolved Hide resolved
@hrfuller
Copy link
Member Author

on linux courtesy of @gshuflin we have

[gregs@gregs-dev-laptop pants]$ bash -c "time (for i in {1..5}; do ./v2 --no-enable-pantsd list ::; done ) > /dev/null"
real    0m19.423s
user    0m30.202s
sys     0m9.692s

@hrfuller hrfuller merged commit 6db9ab4 into pantsbuild:master Mar 26, 2020
@hrfuller hrfuller deleted the hfuller/impl-notify-watcher branch March 27, 2020 20:48
hrfuller pushed a commit to twitter/pants that referenced this pull request Mar 29, 2020
…tsbuild#9318

by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place.

[ci skip-jvm-tests]  # No JVM changes made.
hrfuller pushed a commit that referenced this pull request Mar 30, 2020
… (#9416)

* Add a feature gate to disable the engine fs watcher introduced in #9318
by default, to mitigate issues seen in #9415 until a fix is in place.
hrfuller pushed a commit to twitter/pants that referenced this pull request May 11, 2020
* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>
hrfuller pushed a commit to twitter/pants that referenced this pull request May 11, 2020
…tsbuild#9318 (pantsbuild#9416)

* Add a feature gate to disable the engine fs watcher introduced in pantsbuild#9318
by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place.
hrfuller pushed a commit that referenced this pull request May 15, 2020
* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071)

We're on an older version of tokio, which doesn't smoothly support usage of async/await.

Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures.

Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline.

Progress toward more usage of async/await!

* Add notify fs watcher to engine. (#9318)

* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>

* Ignore notify events for pants_ignore patterns. (#9406)

* Create a git ignorer on the context object. Adjust all call sites which
create a posix fs to pass in an ignorer.

* Ignore fsevents from files that match pants_ignore patterns.

* Always pass is_dir = false to ignorer to avoid stat-ing every path the
event watch thread sees.

* Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416)

* Add a feature gate to disable the engine fs watcher introduced in #9318
by default, to mitigate issues seen in #9415 until a fix is in place.

* Don't rerun uncachable nodes if they are dirtied while running. (#9452)

* Don't rerun uncachable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake

* Add a flag to prevent the FsEventService and watchman from starting (#9487)

* add --watchman-enable flag
* disable watchman when flag is false
* Don't wait for the initial watchman event if we aren't using watchman.
* check invalidation watcher liveness from scheduler service

* Extract a `watch` crate. (#9635)

The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test.

Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests.

[ci skip-jvm-tests]

* Simplify Scheduler::execute and unify Graph retry (#9674)

Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated.

Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:
```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.

* Move file invalidation handling to rust (#9636)

A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.

* pin tokio at exactly 0.2.13

* fix lint issues

* fix mypy typing issues

* Move away from the debounced notify watcher #9754

* Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

* As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again!
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Stu Hood <stuhood@twitter.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants