Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mac] Fix callback exceptions when the watcher is deleted but still receiving events #786

Merged
merged 6 commits into from
May 10, 2021

Conversation

BoboTiG
Copy link
Collaborator

@BoboTiG BoboTiG commented May 6, 2021

The watcher might be cleared on thread stop but the C callback would still send events.
Let's ignore them to prevent unhandled exceptions later:

    AttributeError: 'NoneType' object has no attribute 'is_recursive'
      File "watchdog/observers/fsevents.py", line 299, in callback
        emitter.queue_events(emitter.timeout, events)
      File "watchdog/observers/fsevents.py", line 261, in queue_events
        self._queue_created_event(event, src_path, src_dirname)
      File "watchdog/observers/fsevents.py", line 124, in _queue_created_event
        self.queue_event(DirModifiedEvent(dirname))
      File "watchdog/observers/fsevents.py", line 97, in queue_event
        if self._watch.is_recursive :

Or even:

    AttributeError: 'NoneType' object has no attribute 'path'
      File "watchdog/observers/fsevents.py", line 299, in callback
        emitter.queue_events(emitter.timeout, events)
      File "watchdog/observers/fsevents.py", line 174, in queue_events
        src_path = self._encode_path(event.path)
      File "watchdog/observers/fsevents.py", line 323, in _encode_path
        if isinstance(self.watch.path, bytes):

Co-authored-by: Romain Grasland rgrasland@nuxeo.com

…eceiving events

The watcher might be cleared on thread stop but the C callback would still send events.
Let's ignore them to prevent unhandled exceptions later:

    AttributeError: 'NoneType' object has no attribute 'is_recursive'
      File "watchdog/observers/fsevents.py", line 299, in callback
        emitter.queue_events(emitter.timeout, events)
      File "watchdog/observers/fsevents.py", line 261, in queue_events
        self._queue_created_event(event, src_path, src_dirname)
      File "watchdog/observers/fsevents.py", line 124, in _queue_created_event
        self.queue_event(DirModifiedEvent(dirname))
      File "watchdog/observers/fsevents.py", line 97, in queue_event
        if self._watch.is_recursive :

Or even:

    AttributeError: 'NoneType' object has no attribute 'path'
      File "watchdog/observers/fsevents.py", line 299, in callback
        emitter.queue_events(emitter.timeout, events)
      File "watchdog/observers/fsevents.py", line 174, in queue_events
        src_path = self._encode_path(event.path)
      File "watchdog/observers/fsevents.py", line 323, in _encode_path
        if isinstance(self.watch.path, bytes):

Co-authored-by: Romain Grasland <rgrasland@nuxeo.com>
@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 6, 2021

cc @samschott and @CCP-Aporia.

@samschott
Copy link
Contributor

Ouch, what a nasty exception.

Just a few questions to understand your changes:

  1. Does the use of the lock in on_thread_stop already fix the race condition by itself? Or do we still need the test for self.watch in the callback?
  2. Do we really want to drop those last events? Technically, since they where emitted while the watch was still running, they should still be processed. But this is just nitpicking...

In any case, moving the callback out of that inline definition is an improvement in the first place :)

@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 6, 2021

  1. It is part of the fix, yes. Both the lock and the check are needed :)
  2. Actually, they cannot be processed: the watcher is deleted and queue_events() needs several attributes from self.watch, which no more exists. Even queue_event() needs attributes from self.watch.

I would say pending events are lost in any cases, that's just cleaner now. As the code processing events heavily requires self.watch attributes, it is not worth adding checks everywhere to prevent issues. WDYT?

@samschott
Copy link
Contributor

Those are fair points. It's just that the explicit check for self.watch somehow strikes me as a roundabout way of fixing this. I can't quite put my finger on why though.

Maybe I just find it odd that the supplied callback, wrapped in extension module as watchdog_FSEventStreamCallback, is still invoked after FSEventStreamStop was called, i.e., after:

_fsevents.remove_watch(self.watch)

Similarly, the callback acquires the GIL so calling _fsevents.remove_watch() should block until the callback has finished. So where does the race come from? I could be missing something obvious here. And maybe @CCP-Aporia knows more about this.

I certainly can't argue with the issue that you are seeing or with the success of the fix. So I'm happy with the changes!

@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 6, 2021

My understanding (and it may be wrong) is that the issue happens in the callback, more specifically here:

events = [
_fsevents.NativeEvent(path, inode, event_flags, event_id)
for path, inode, event_flags, event_id in zip(paths, inodes, flags, ids)
]
emitter.queue_events(emitter.timeout, events)

The watcher is stopped while iterating and before sending events to queue_events(). And so the self._watch attribute is set to None. That's a tricky one for sure, and it indeed fixes reported issues. Maybe a better fix is possible, that's why I pinged @CCP-Aporia and you :)

@CCP-Aporia
Copy link
Contributor

I'm in agreement with @samschott that the explicit self.watch checks look like a work around for an underlying deficiency. Let me take a look at the C side of things; it's very possible that the issue is in there.

@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 7, 2021

Of course, go ahead :)
Both tests are showing the issue, it will be easier to have a look then.

@CCP-Aporia
Copy link
Contributor

These tests were great, indeed. Thank you for those! :-) I've fixed the issue in my fork and opened a PR to update your branch here, see #788

BoboTiG and others added 3 commits May 7, 2021 14:01
@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 7, 2021

Thanks a lot @CCP-Aporia 🍾
I will merge that PR and correctly set the author, so you do not loose the credits ;)

If the review is OK for you both, let's merge when the CI will be green.

@BoboTiG BoboTiG closed this May 7, 2021
@BoboTiG BoboTiG reopened this May 7, 2021
Copy link
Contributor

@samschott samschott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @CCP-Aporia, this looks much cleaner :)

So, just to understand why those changes work:

  1. The workaround with setting self._watch = None was added to prevent issues when on_thread_stop is called twice because the C extension did not properly handle the case of watchdog_remove_watch being called twice.
  2. This workaround led to queue_events raising errors when self._watch is None.
  3. It is still possible that queue_events is called while the emitter is being stopped but this no longer raises "unhandled" exceptions.

PyObject *streamref_capsule = PyDict_GetItem(watch_to_stream, watch);
if (!streamref_capsule) {
// A watch might have been removed explicitly before, in which case we can simply early out.
Py_RETURN_NONE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When exactly does this case occur?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it says in the comment - if someone explicitly calls remove_watch twice for the same watch then PyDict_GetItem won't find it again. Not checking for the nullptr return here was the main reason for the work-arounds in Python code. 🙂
Effectively, streamref_capsule being a nullptr caused the assertion failures for the three fsevents related API calls that follow, as well as the error message about returning a value with an exception set.

@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 7, 2021

You are right about the 3rd statement @samschott. Without testing now, I guess it simply works because self._watch is not reset to None. I guess remaining events are then sent with success (it was one of your consideration).

I do not have a strong opinion if we should process those events or not, I am just thinking loud.

@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 7, 2021

To ease working on it, I could split the PR: a smaller one could be merged ASAP to move out the callback from the inline function. Then that PR will only contains tests + C-fix to ease thinking about a potential fix.

@samschott
Copy link
Contributor

samschott commented May 8, 2021

I do not have a strong opinion if we should process those events or not, I am just thinking loud.

I don't really have a strong opinion either. My intuition says that those events should be processed since they are triggered before the emitter is stopped. There also should not be any consequences with unexpected events being dispatched to the handlers after stopping and joining the observer thread. So it's really up to you!

To ease working on it, I could split the PR: a smaller one could be merged ASAP to move out the callback from the inline function.

Personally I am quite happy with the current fix. Is there any more work to do that's worth splitting off? You can of course split things off to have a cleaner commit history.

@CCP-Aporia
Copy link
Contributor

@samschott - your understanding of the fix is correct. 🙂

@BoboTiG - indeed, resetting self._watch to None was the whole work-around which is now completely unnecessary. That kind of reset only introduced edge cases which can be avoid with the fix in the C extension.

@BoboTiG BoboTiG merged commit a70b350 into master May 10, 2021
@BoboTiG BoboTiG deleted the fix-fsevent-callback-no-watcher branch May 10, 2021 12:30
@BoboTiG
Copy link
Collaborator Author

BoboTiG commented May 10, 2021

Thank you both, let's do a new release with that fix :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants