Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1634310: Python: Fix race condition in atexit #854

Merged
merged 4 commits into from
May 4, 2020

Conversation

mdboom
Copy link
Contributor

@mdboom mdboom commented Apr 30, 2020

Fixes a race condition in the atexit handlers.

Glean currently has two atexit handlers: (a) to make sure the thread worked completes all of its tasks, and (b) that (among other things) deletes the data directory if it's a tmpdir. atexit handlers are run sequentially on the main thread, but the ordering is based on the order in which they are registered, which is somewhat non-deterministic in Glean.

If (b) runs before (a), the data directory is deleted, and then any operations that might be waiting the thread queue will fail with "Database not found".

The fix is to combine the atexit handlers into one, and join on the thread queue before deleting the tempdir.

This raised another issue in my mind that using a tempdir by default is probably not a good choice, and we are seeing this bug in burnham only because burnham doesn't override the data dir (as other "real" apps, such as mozregression have done). Changing the default to a retained directory probably makes sense, and I created bug 1634410 to track that work.

@auto-assign auto-assign bot requested a review from Dexterp37 April 30, 2020 19:20
@mdboom
Copy link
Contributor Author

mdboom commented Apr 30, 2020

Cc: @hackebrot

@mdboom mdboom force-pushed the fix-race-condition branch from 3d42468 to 50c5915 Compare April 30, 2020 21:06
@mdboom mdboom force-pushed the fix-race-condition branch from 50c5915 to 486423c Compare April 30, 2020 21:17
log.error("socket.gaierror: {}".format(e))
log.error("socket.gaierror: '{}' {}".format(url, e))
return False
except OSError as e:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking this is a separate bug -- we get this error if DNS lookup fails. In typical Python fashion the exceptions that conn.getresponse can throw aren't documented, so we're playing whack-a-mole here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suppose that's fine. Makes me feel icky in the event that anything truly unexpected does happen, but I should probably just get over that.

@@ -105,15 +101,20 @@ def _worker(self):

def _shutdown_thread(self):
"""
An atexit handler to tell the worker thread to shutdown and then wait
for 1 seconds for it to finish.
Tell the worker thread to shutdown and then wait for 1 seconds for it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, when going from mobile to the desktop world, we should try to be a bit more careful at shutdown (which is something we didn't have to pay much attention to, up to now!). We should probably:

  • when shutdown is initiated, disallow all the recording APIs;
  • terminate pending uploads;
  • wait for ping I/O (if any).

So, in general, perform a clean telemetry shutdown as we currently do for legacy telemetry on Firefox desktop. This is not something that should be tackled in this bug, but it could be worthwhile filing a bug about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah -- let's save that for another bug. It's tricky, because CLI doesn't quite work the same as desktop.

log.error("socket.gaierror: {}".format(e))
log.error("socket.gaierror: '{}' {}".format(url, e))
return False
except OSError as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:?

data_dir = None # type: Optional[Path]
if not clear_stores:
Glean._destroy_data_dir = False
data_dir = Glean._data_dir

Glean._reset()
Dispatcher._testing_mode = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this moved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glean._reset waits for the worker thread to complete, and how that happens depends on the value of _testing_mode. If _testing_mode changes before this happens, it won't wait on the worker thread, because it assumes all work is being done on the main thread.

I can add a comment to this effect here.

@mdboom mdboom merged commit 848bd1b into mozilla:master May 4, 2020
@mdboom mdboom deleted the fix-race-condition branch May 4, 2020 13:25
@hackebrot
Copy link
Contributor

Thank you for resolving this bug this quickly, @mdboom @Dexterp37! 👩‍🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants