Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow reuse of dispatch queues between fsevents in macOS #9009

Merged
merged 1 commit into from
Nov 28, 2023
Merged

fix: allow reuse of dispatch queues between fsevents in macOS #9009

merged 1 commit into from
Nov 28, 2023

Conversation

kevinji
Copy link
Contributor

@kevinji kevinji commented Oct 27, 2023

Previously, the code waiting on dq->dq_finished would get notified in dune_fsevents_dispatch_queue_wait_until_stopped that the dispatch queue had completed running. This would lead to the dispatch queue being marked as Stopped on the OCaml side, even though the dispatch queue could be passed into Fsevents.start again.

This change now tracks the number of fsevents streams in a fsevents_streams counter so that Dispatch_queue.wait_until_stopped will only return when an exception is raised or if all streams have been stopped, restoring the original behavior in the code that used a runloop.

Fixes #9004. Fixes #8910.

@Alizter
Copy link
Collaborator

Alizter commented Oct 27, 2023

Why does this only trigger on M1?

@kevinji
Copy link
Contributor Author

kevinji commented Oct 27, 2023

I think this should apply to any Mac running Dune, not just an M-series chip, but I don’t have an Intel Mac to confirm this.

This fsevents code is only used on Macs.

@rgrinberg rgrinberg requested a review from gridbugs October 30, 2023 18:55
@rgrinberg rgrinberg added this to the 3.12.0 milestone Nov 3, 2023
@emillon emillon mentioned this pull request Nov 10, 2023
26 tasks
@gridbugs
Copy link
Collaborator

Confirming that I can reproduce #9004 and the fix in this PR on an Intel mac running Monterey and an M1 mac running Sonoma.

@kevinji
Copy link
Contributor Author

kevinji commented Nov 13, 2023

I've pushed a new version that has slightly altered behavior from the original PR: now, the dune_fsevents_dispatch_queue_wait_until_stopped call will also return if all fsevents streams have been removed.

I'm running into what seems like a race condition where if the callback returns an exception which then notifies the dune_fsevents_dispatch_queue_wait_until_stopped function via a condition variable, the dispatch queue will sometimes (non-deterministically) still execute another item in the queue before stopping.

Ideally if an exception is seen, the queue should stop processing immediately. I tried inserting

  FSEventStreamStop(t->stream);
  FSEventStreamInvalidate(t->stream);

into the body of dune_fsevents_callback after an exception is seen, but still seem to see some race conditions, so any suggestions would be appreciated.

@gridbugs
Copy link
Collaborator

gridbugs commented Nov 17, 2023

I'm running into what seems like a race condition...

@kevinji can you clarify whether that issue was present before your recent changes to this PR? If not, would it make sense to merge the original form of this PR for the release of 3.12 while we work out the issues with the new code?

I'm trying to get an understanding of how the new changes work but it looks like there are several changes unrelated to the description (e.g. some seem stylistic). Do you mind splitting those into a separate commit so I can be sure I'm paying attention to the right stuff.

emillon added a commit to emillon/opam-repository that referenced this pull request Nov 22, 2023
CHANGES:

- Cherry-pick ocaml/dune#9215 and ocaml/dune#9009 (@emillon)

- dune-build-info: when `version=""` is found in a `META` file, we now return
  `None` as a version string (ocaml/dune#9177, @emillon)
@kevinji
Copy link
Contributor Author

kevinji commented Nov 25, 2023

@gridbugs Apologies for the late response. I've pulled out refactoring changes into #9288 and the test changes into #9289. Once those PRs are merged I can rebase this PR on top of main to clean up the diff.

This PR fixes the bugs mentioned in the initial comment as well as segfaults that I saw when running the tests in test/expect-tests/fsevents. Currently, the race condition I'm experiencing is in the "raise inside callback" test, where the [EXIT] in the output often (but not always) fails to show up.

Let me know if I can provide any other clarification.

Previously, the code waiting on `dq->dq_finished` would get notified in
`dune_fsevents_dispatch_queue_wait_until_stopped` that the dispatch
queue had completed running. This would lead to the dispatch queue being
marked as `Stopped` on the OCaml side, even though the dispatch queue
could be passed into `Fsevents.start` again.

This change now tracks the number of fsevents streams in a
`fsevents_streams` counter so that `Dispatch_queue.wait_until_stopped`
will only return when an exception is raised or if all streams have been
stopped, restoring the original behavior in the code that used a runloop.

Fixes #9004. Fixes #8910.

Signed-off-by: Kevin Ji <1146876+kevinji@users.noreply.github.com>
@gridbugs
Copy link
Collaborator

I'm happy with this. One issue is that the race condition could lead to flaky CI. I now understand why it happens. In dune_fsevents_callback (fsevents_stubs.c) we call the user-provided callback and then if the callback raised an exception we store the exception in dq->v_exn and broadcast dq->dq_finished which causes wait_until_stopped to stop waiting and either return or raise the exception in dq->v_exn if there's an exception stored there. The problem happens when dune_fsevents_stop is called while a callback is being processed. dune_fsevents_stop also broadcasts on dq->dq_finished which wakes the thread that called wait_until_stopped. If the callback raises but doesn't do so until after wait_until_stopped checks for exceptions, the exception never gets re-raised from the c code to the caller of wait_until_stopped.

So the question is what should happen to exceptions that are raised by callbacks after the watcher has been stopped (note that the watcher may have been stopped while the callback was running)? I think it's reasonable to continue the current behaviour of dropping such exceptions, in which case a quick fix for the test would be to delay stopping the watcher until the callback is likely to have completed (I don't feel too bad about sleeping in this test since this test already sleeps in another place):

diff --git a/test/expect-tests/fsevents/fsevents_tests.ml b/test/expect-tests/fsevents/fsevents_tests.ml
index fbc5c6551..ff2fc5ee5 100644
--- a/test/expect-tests/fsevents/fsevents_tests.ml
+++ b/test/expect-tests/fsevents/fsevents_tests.ml
@@ -258,7 +258,8 @@ let%expect_test "raise inside callback" =
       raise Exit)
     (fun () ->
       Io.String_path.write_file "old" "foobar";
-      Io.String_path.write_file "old" "foobar");
+      Io.String_path.write_file "old" "foobar";
+      Unix.sleepf 1.0);
   [%expect {|
     [EXIT]
     exiting. |}]

@emillon
Copy link
Collaborator

emillon commented Nov 28, 2023

Thanks for the review and the fix. In terms of releasing a fix I'll take a flaky test against a regression, so I'll merge this now. We can discuss the fix for the flakiness separately.

@emillon emillon mentioned this pull request Nov 28, 2023
@emillon emillon merged commit e25ba5a into ocaml:main Nov 28, 2023
4 checks passed
@kevinji kevinji deleted the fix-macos-watch branch December 5, 2023 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watch mode crashes on macos on main branch fsevents expect tests fail on macos
5 participants