-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1634310: Python: Fix race condition in atexit #854
Conversation
Cc: @hackebrot |
3d42468
to
50c5915
Compare
50c5915
to
486423c
Compare
log.error("socket.gaierror: {}".format(e)) | ||
log.error("socket.gaierror: '{}' {}".format(url, e)) | ||
return False | ||
except OSError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking this is a separate bug -- we get this error if DNS lookup fails. In typical Python fashion the exceptions that conn.getresponse
can throw aren't documented, so we're playing whack-a-mole here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suppose that's fine. Makes me feel icky in the event that anything truly unexpected does happen, but I should probably just get over that.
@@ -105,15 +101,20 @@ def _worker(self): | |||
|
|||
def _shutdown_thread(self): | |||
""" | |||
An atexit handler to tell the worker thread to shutdown and then wait | |||
for 1 seconds for it to finish. | |||
Tell the worker thread to shutdown and then wait for 1 seconds for it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, when going from mobile to the desktop world, we should try to be a bit more careful at shutdown (which is something we didn't have to pay much attention to, up to now!). We should probably:
- when shutdown is initiated, disallow all the recording APIs;
- terminate pending uploads;
- wait for ping I/O (if any).
So, in general, perform a clean telemetry shutdown as we currently do for legacy telemetry on Firefox desktop. This is not something that should be tackled in this bug, but it could be worthwhile filing a bug about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- let's save that for another bug. It's tricky, because CLI doesn't quite work the same as desktop.
log.error("socket.gaierror: {}".format(e)) | ||
log.error("socket.gaierror: '{}' {}".format(url, e)) | ||
return False | ||
except OSError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:
?
data_dir = None # type: Optional[Path] | ||
if not clear_stores: | ||
Glean._destroy_data_dir = False | ||
data_dir = Glean._data_dir | ||
|
||
Glean._reset() | ||
Dispatcher._testing_mode = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this moved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glean._reset
waits for the worker thread to complete, and how that happens depends on the value of _testing_mode
. If _testing_mode
changes before this happens, it won't wait on the worker thread, because it assumes all work is being done on the main thread.
I can add a comment to this effect here.
Thank you for resolving this bug this quickly, @mdboom @Dexterp37! 👩🚀 |
Fixes a race condition in the
atexit
handlers.Glean currently has two atexit handlers: (a) to make sure the thread worked completes all of its tasks, and (b) that (among other things) deletes the data directory if it's a tmpdir. atexit handlers are run sequentially on the main thread, but the ordering is based on the order in which they are registered, which is somewhat non-deterministic in Glean.
If (b) runs before (a), the data directory is deleted, and then any operations that might be waiting the thread queue will fail with "Database not found".
The fix is to combine the atexit handlers into one, and join on the thread queue before deleting the tempdir.
This raised another issue in my mind that using a tempdir by default is probably not a good choice, and we are seeing this bug in burnham only because burnham doesn't override the data dir (as other "real" apps, such as mozregression have done). Changing the default to a retained directory probably makes sense, and I created bug 1634410 to track that work.