Bug 1634310: Python: Fix race condition in atexit #854

mdboom · 2020-04-30T19:20:40Z

Fixes a race condition in the atexit handlers.

Glean currently has two atexit handlers: (a) one that makes sure the worker thread completes all of its tasks, and (b) one that (among other things) deletes the data directory if it's a tmpdir. atexit handlers run sequentially on the main thread, in the reverse of the order in which they were registered, and that registration order is somewhat non-deterministic in Glean.

If (b) runs before (a), the data directory is deleted, and then any operations still waiting in the thread queue will fail with "Database not found".

The fix is to combine the atexit handlers into one, and join on the thread queue before deleting the tempdir.
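For illustration, a minimal sketch of the combined-handler idea, not the actual Glean code. `Worker`, `make_exit_handler`, and `destroy_data_dir` are hypothetical names chosen for this example; the only point is the ordering inside the single handler.

```python
import atexit
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class Worker:
    """Hypothetical stand-in for Glean's worker thread and its task queue."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            task = self._queue.get()
            try:
                if task is None:  # sentinel: stop the loop
                    return
                task()
            finally:
                self._queue.task_done()

    def launch(self, task):
        self._queue.put(task)

    def shutdown(self, timeout=1.0):
        # Queue the sentinel behind any pending tasks, then wait for the thread.
        self._queue.put(None)
        self._thread.join(timeout)


def make_exit_handler(worker, data_dir, destroy_data_dir):
    def _on_exit():
        # Step (a): drain the queue first, while data_dir still exists.
        worker.shutdown()
        # Step (b): only now delete the temporary directory.
        if destroy_data_dir:
            shutil.rmtree(str(data_dir), ignore_errors=True)

    return _on_exit


worker = Worker()
data_dir = Path(tempfile.mkdtemp())
on_exit = make_exit_handler(worker, data_dir, destroy_data_dir=True)
# One handler instead of two, so registration order no longer matters.
atexit.register(on_exit)
```

Because both steps live in one function, (b) cannot run ahead of (a) regardless of when the handler was registered.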

This raised another issue in my mind: using a tempdir by default is probably not a good choice, and we are seeing this bug in burnham only because burnham doesn't override the data dir (as other "real" apps, such as mozregression, have done). Changing the default to a retained directory probably makes sense, and I created bug 1634410 to track that work.

mdboom · 2020-04-30T19:20:54Z

Cc: @hackebrot

mdboom · 2020-04-30T21:18:44Z

glean-core/python/glean/net/http_client.py

-            log.error("socket.gaierror: {}".format(e))
+            log.error("socket.gaierror: '{}' {}".format(url, e))
+            return False
+        except OSError as e:


Strictly speaking this is a separate bug -- we get this error if DNS lookup fails. In typical Python fashion the exceptions that conn.getresponse can throw aren't documented, so we're playing whack-a-mole here.

Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:?

Yeah, I suppose that's fine. Makes me feel icky in the event that anything truly unexpected does happen, but I should probably just get over that.
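A rough sketch of what the fall-through case could look like, assuming an uploader built on `http.client`. `try_upload` and its parameters are hypothetical and do not mirror the real `http_client.py`; the specific handlers are kept so the log still says which failure occurred.

```python
import http.client
import logging
import socket
from urllib.parse import urlparse

log = logging.getLogger(__name__)


def try_upload(url, data, timeout=10.0):
    """Hypothetical helper: POST data to url; return True on 2xx, never raise."""
    conn = None
    try:
        parsed = urlparse(url)
        if parsed.scheme == "https":
            conn_class = http.client.HTTPSConnection
        else:
            conn_class = http.client.HTTPConnection
        conn = conn_class(parsed.hostname, parsed.port, timeout=timeout)
        conn.request("POST", parsed.path or "/", body=data)
        response = conn.getresponse()
        return 200 <= response.status < 300
    except socket.gaierror as e:
        # DNS lookup failed.
        log.error("socket.gaierror: '%s' %s", url, e)
        return False
    except OSError as e:
        # Other socket-level failures (refused, reset, timed out, ...).
        log.error("OSError: '%s' %s", url, e)
        return False
    except Exception as e:
        # Fall-through: anything undocumented or unexpected.
        log.error("Unexpected %s: '%s' %s", type(e).__name__, url, e)
        return False
    finally:
        if conn is not None:
            conn.close()
```

Note that `except Exception` still lets `KeyboardInterrupt` and `SystemExit` propagate, since those derive from `BaseException`.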

Dexterp37 · 2020-05-04T08:45:20Z

glean-core/python/glean/_dispatcher.py

@@ -105,15 +101,20 @@ def _worker(self):

    def _shutdown_thread(self):
        """
-        An atexit handler to tell the worker thread to shutdown and then wait
-        for 1 seconds for it to finish.
+        Tell the worker thread to shutdown and then wait for 1 seconds for it


In general, when going from mobile to the desktop world, we should try to be a bit more careful at shutdown (which is something we didn't have to pay much attention to, up to now!). We should probably:

when shutdown is initiated, disallow all the recording APIs;

terminate pending uploads;

wait for ping I/O (if any).

So, in general, perform a clean telemetry shutdown as we currently do for legacy telemetry on Firefox desktop. This is not something that should be tackled in this bug, but it could be worthwhile filing a bug about this.

Yeah -- let's save that for another bug. It's tricky, because CLI doesn't quite work the same as desktop.

Dexterp37 · 2020-05-04T08:47:16Z

glean-core/python/glean/net/http_client.py

-            log.error("socket.gaierror: {}".format(e))
+            log.error("socket.gaierror: '{}' {}".format(url, e))
+            return False
+        except OSError as e:


Since we don't really want this to ever throw, is there any value in just changing this to except Exception as e:?

Dexterp37 · 2020-05-04T08:47:28Z

glean-core/python/glean/testing/__init__.py

    data_dir = None  # type: Optional[Path]
    if not clear_stores:
        Glean._destroy_data_dir = False
        data_dir = Glean._data_dir

    Glean._reset()
+    Dispatcher._testing_mode = True


Why was this moved?

Glean._reset waits for the worker thread to complete, and how that happens depends on the value of _testing_mode. If _testing_mode changes before this happens, it won't wait on the worker thread, because it assumes all work is being done on the main thread.

I can add a comment to this effect here.
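To make the ordering concern concrete, here is a simplified sketch under stated assumptions: this `Dispatcher` is a hypothetical stand-in, and `wait_for_pending` and `reset_for_test` are invented names, not Glean APIs. It only shows why the flag has to flip after the wait.

```python
import queue
import threading


class Dispatcher:
    """Hypothetical, simplified stand-in for Glean's Dispatcher."""

    _testing_mode = False
    _queue = queue.Queue()
    _thread = None

    @classmethod
    def _worker(cls):
        while True:
            task = cls._queue.get()
            try:
                task()
            finally:
                cls._queue.task_done()

    @classmethod
    def launch(cls, task):
        if cls._testing_mode:
            task()  # testing mode: run synchronously on the calling thread
            return
        if cls._thread is None:
            cls._thread = threading.Thread(target=cls._worker, daemon=True)
            cls._thread.start()
        cls._queue.put(task)

    @classmethod
    def wait_for_pending(cls):
        # In testing mode nothing is assumed to be on the worker thread,
        # so the wait is skipped entirely.
        if not cls._testing_mode:
            cls._queue.join()


def reset_for_test():
    # Wait first, while _testing_mode is still False ...
    Dispatcher.wait_for_pending()
    # ... and only then switch to synchronous execution.
    Dispatcher._testing_mode = True
```

Swapping the two lines in `reset_for_test` would skip the wait and leave any queued task racing with the test that follows.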

hackebrot · 2020-05-05T10:45:47Z

Thank you for resolving this bug this quickly, @mdboom @Dexterp37! 👩‍🚀

Bug 1634310: Python: Fix race condition in atexit

951f065

auto-assign bot requested a review from Dexterp37 April 30, 2020 19:20

mdboom force-pushed the fix-race-condition branch from 3d42468 to 50c5915 Compare April 30, 2020 21:06

Attempt to fix Windows issue

486423c

mdboom force-pushed the fix-race-condition branch from 50c5915 to 486423c Compare April 30, 2020 21:17

mdboom commented Apr 30, 2020

Dexterp37 reviewed May 4, 2020

mdboom added 2 commits May 4, 2020 08:08

Add comment about _testing_mode

1d5f9c2

Add fall-through exception case

f87939e

Dexterp37 approved these changes May 4, 2020

mdboom merged commit 848bd1b into mozilla:master May 4, 2020

mdboom deleted the fix-race-condition branch May 4, 2020 13:25
