Add mutex to ensure only single thread at pybind->c++ layer #973

phoebusm · 2023-10-18T12:40:04Z

Reference Issues/PRs

#974

What does this implement or fix?

#973 (comment)

Any other comments?

Checklist

Checklist for code changes...

Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

vasil-pashov · 2023-10-18T14:17:03Z

cpp/arcticdb/version/version_store_api.cpp

@@ -1122,4 +1125,11 @@ void PythonVersionStore::force_delete_symbol(const StreamId& stream_id) {
 delete_all_for_stream(store(), stream_id, true);
 }

+std::lock_guard<std::mutex> PythonVersionStore::ensure_single_thread_cpp_pybind_entry(){ 
+ py::gil_scoped_release release;


Why is there a gil release call here? IMO this does a bit more than the name implies.

Sorry I should have put more details in the PR's description.
Every pybind function called from python naturally hold the GIL. Most of the function will hold the GIL in the entire lifespan. But not for batch_read. batch_read will release GIL at some point, so the folly threads which get its "tasks" can acquire the GIL to call some python function. Which, if the python script is multithreaded, another pybind function can be called. In this case, def read() is called. def read() also use folly future too. Both function share the same pool of thread. So,

batch_read thread which does not have the GIL, has exhausted all the threads in the pool. And those threads are all waiting for the GIL.

read thread which does have the GIL, waits for the result of the future, which will never return as there is no thread available in the pool
Therefore, deadlock occurs.
As a short term fix, a mutex is added to ensure only single thread at pybind->c++ layer. However, as I mention above, every pybind already has the GIL. So when the new function called is waiting for the mutex, the GIL is needed to be released. It enables the other running thread (def batch_read) and its task runners can acquire GIL.

Can you also add this information as a comment above the function (preferably in doxygen style)?

poodlewars · 2023-10-18T15:10:49Z

Please could you,

Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.
Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?
Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)
Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?

phoebusm · 2023-10-18T16:57:46Z

Please could you,

Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.

Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?

Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)

Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?

Exactly
Correct me if I am wrong, but I believe C++ mutex is intraprocess only. <- Ok just realise something. Let me do more tests
Yea it passes both multithreading and multiprocessing test <- Ok just realise something. Let me do more tests
Yes this is something I should do

I think I have misunderstood the urgency of the fix. I thought it is an urgent task so I only fix the functions being reported in this PR. That's why this PR is drafted.
Pending task for a complete PR:

Mutex for all pybind function
Test
Benchmark

If above tasks are required before merging, I will like to add WIP to this PR or simply close it until the above tasks are ready

poodlewars · 2023-10-18T19:16:37Z

Please could you,

Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.

Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?

Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)

Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?
1. Exactly

2. Correct me if I am wrong, but I believe C++ mutex is intraprocess only. <- Ok just realise something. Let me do more tests

3. Yea it passes both multithreading and multiprocessing test

4. Yes this is something I should do
I think I have misunderstood the urgency of the fix. I thought it is an urgent task so I only fix the functions being reported in this PR. That's why this PR is drafted. Pending task for a complete PR:
1. Mutex for all pybind function

2. Test

3. Benchmark
If above tasks are required before merging, I will like to add WIP to this PR or simply close it until the above tasks are ready

Thanks. It would be good to confirm the urgency with the users, I'm not sure whether they have a good workaround or not. If they're in a rush, I think it's OK just to add this to read and read batch, and we can follow up immediately for the other methods. Otherwise, let's do them all at once and save ourselves an extra release.

I think it would be good to have the multithreading and processing tests whether they are in a rush or not, we should be careful that we are not making things worse.

phoebusm · 2023-10-24T01:07:01Z

No deterioration of performance observed after adding the lock:
Average time taken for read
Before adding the lock: 0.027s
After adding the lock: 0.023s

poodlewars · 2023-10-24T10:17:22Z

python/tests/integration/arcticdb/test_arctic_batch.py

+ q.put(e)
+
+ read_th_count = 1
+ batch_read_th_count = 1


Increase these thread counts?

What is the point of the queue (which isn't synchronised correctly), why not just let the exception bubble up and fail the test?

The exception from the "sub-thread" does not give any error during the tests unless it is caught then re-thrown in the main thread, according to my test

I can increase the thread count. Yet according to my test, current setting is sufficient to repeating the issue. I would like to keep the additional time needed for running the test as minimum as possible.

Sounds good, the minimal amount to repro the issue is fine.

poodlewars · 2023-10-24T10:20:04Z

cpp/arcticdb/version/version_store_api.hpp

@@ -344,6 +344,8 @@ class PythonVersionStore : public LocalVersionedEngine {
 bool prune_previous_versions);

 void delete_snapshot_sync(const SnapshotId& snap_name, const VariantKey& snap_key);
+
+ [[nodiscard]] static std::lock_guard<std::mutex> ensure_single_thread_cpp_pybind_entry();


I'm confused, I can't see where this is defined, or used?

It is used as py::call_guard<SingleThreadMutexHolder>() while defining pybind function.

m.def("foo", foo, py::call_guard<T>());

is equal to

m.def("foo", [](args...) { T scope_guard; return foo(args...); // forwarded arguments });

Oops I see your point now. Removing

poodlewars · 2023-10-24T10:22:19Z

cpp/arcticdb/storage/python_bindings.cpp

@@ -148,7 +149,7 @@ void register_bindings(py::module& storage, py::exception<arcticdb::ArcticExcept
 }, py::arg("library_path"), py::arg("throw_on_failure") = true)
 .def("remove_library_config", [](const LibraryManager& library_manager, std::string_view library_path){
 return library_manager.remove_library_config(LibraryPath{library_path, '.'});
- })
+ }, py::call_guard<SingleThreadMutexHolder>())


Why does this method need the mutex, and not the other methods in this class?

I think I should have put the explanation in the description of this PR.
So basically, my understanding is, this will be short-lived patch. So my idea is keeping it safe for the short term:

No major rehaul

Add mutex to all functions in PythonVersionStore

If functions use folly::Future, which, in my eyes, can be converted async task easily. So
remove_key is called by remove_library_config, which uses folly::Future, mutex is necessary

poodlewars · 2023-10-24T10:25:53Z

cpp/arcticdb/version/python_bindings.cpp

- &PythonVersionStore::write_partitioned_dataframe,
- "Write a dataframe to the store")
+ &PythonVersionStore::write_partitioned_dataframe, 
+ py::call_guard<SingleThreadMutexHolder>(), "Write a dataframe to the store")


Did you consider alternative approaches like subclassing py::class_ to always insert the call guard on def? Might be more robust as more methods are added. What do you think?

I have thought about that but the call_guard is only needed for functions involves folly::future, which are not the majority of the case. So at the end of the day, developers still need to differentiate whether they need the guard or not. Then, in your case, assigning the new subclass of py::class_. To me, it is not a big different to adding the py::call_guard.
Or you mean just simply for PythonVersionStore?

I was thinking just for PythonVersionStore, but I can see that my idea might add about as much complexity as it removes. Let's leave it as-is for now.

poodlewars · 2023-10-24T10:26:42Z

cpp/arcticdb/async/python_bindings.cpp

@@ -8,6 +8,7 @@
 #include <arcticdb/python/python_utils.hpp>
 #include <arcticdb/async/python_bindings.hpp>
 #include <arcticdb/async/task_scheduler.hpp>
+#include <arcticdb/util/pybind_mutex.hpp>


Not used, is it?

Oops sorry will be remove

poodlewars

Worried about forking? Then new process will get the mutex state from the static one in pybind_mutex.hpp, which may be locked. How will it ever get unlocked? Do we need some mechanism to have a per-process mutex? Does pybind have anything to help us here?

phoebusm · 2023-10-31T10:09:34Z

Worried about forking? Then new process will get the mutex state from the static one in pybind_mutex.hpp, which may be locked. How will it ever get unlocked? Do we need some mechanism to have a per-process mutex? Does pybind have anything to help us here?

Handling for multiprocess added

phoebusm added the bug Something isn't working label Oct 18, 2023

phoebusm self-assigned this Oct 18, 2023

phoebusm mentioned this pull request Oct 18, 2023

Deadlock when batch_read and read simultaneously #974

Closed

vasil-pashov reviewed Oct 18, 2023

View reviewed changes

phoebusm changed the title ~~Add mutex to ensure only single thread at pybind->c++ layer~~ WIP: Add mutex to ensure only single thread at pybind->c++ layer Oct 19, 2023

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from cb0a748 to 86d19b9 Compare October 24, 2023 00:25

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 86d19b9 to e41d7f7 Compare October 24, 2023 01:08

poodlewars reviewed Oct 24, 2023

View reviewed changes

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 6b23a16 to 9af41f0 Compare October 24, 2023 13:32

phoebusm changed the title ~~WIP: Add mutex to ensure only single thread at pybind->c++ layer~~ Add mutex to ensure only single thread at pybind->c++ layer Oct 24, 2023

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 9af41f0 to 862e811 Compare October 24, 2023 14:36

poodlewars approved these changes Oct 24, 2023

View reviewed changes

poodlewars requested changes Oct 24, 2023

View reviewed changes

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 96e8441 to edb6a50 Compare October 27, 2023 15:18

poodlewars approved these changes Oct 28, 2023

View reviewed changes

Phoebus Mak added 2 commits October 30, 2023 09:46

Add mutex to ensure only single thread at pybind->c++ layer

3c49ee9

Add multiprocess handling

2af003f

phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from edb6a50 to 2af003f Compare October 30, 2023 09:46

phoebusm merged commit d633f9b into master Oct 31, 2023
98 checks passed

phoebusm deleted the bugfix/batch_read_read_deadlock branch October 31, 2023 10:09

phoebusm mentioned this pull request Oct 31, 2023

Bugfix/backport deadlock tmp fix #1021

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add mutex to ensure only single thread at pybind->c++ layer #973

Add mutex to ensure only single thread at pybind->c++ layer #973

phoebusm commented Oct 18, 2023 •

edited

Loading

vasil-pashov Oct 18, 2023

phoebusm Oct 18, 2023 •

edited

Loading

vasil-pashov Oct 19, 2023

poodlewars commented Oct 18, 2023 •

edited

Loading

phoebusm commented Oct 18, 2023 •

edited

Loading

poodlewars commented Oct 18, 2023

phoebusm commented Oct 24, 2023

poodlewars Oct 24, 2023

phoebusm Oct 24, 2023 •

edited

Loading

phoebusm Oct 24, 2023

poodlewars Oct 24, 2023

poodlewars Oct 24, 2023

phoebusm Oct 24, 2023

phoebusm Oct 24, 2023

poodlewars Oct 24, 2023

phoebusm Oct 24, 2023

poodlewars Oct 24, 2023

phoebusm Oct 24, 2023 •

edited

Loading

poodlewars Oct 24, 2023

poodlewars Oct 24, 2023

phoebusm Oct 24, 2023

poodlewars left a comment •

edited

Loading

phoebusm commented Oct 31, 2023

Add mutex to ensure only single thread at pybind->c++ layer #973

Add mutex to ensure only single thread at pybind->c++ layer #973

Conversation

phoebusm commented Oct 18, 2023 • edited Loading

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Checklist

Choose a reason for hiding this comment

phoebusm Oct 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poodlewars commented Oct 18, 2023 • edited Loading

phoebusm commented Oct 18, 2023 • edited Loading

poodlewars commented Oct 18, 2023

phoebusm commented Oct 24, 2023

Choose a reason for hiding this comment

phoebusm Oct 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phoebusm Oct 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poodlewars left a comment • edited Loading

Choose a reason for hiding this comment

phoebusm commented Oct 31, 2023

phoebusm commented Oct 18, 2023 •

edited

Loading

phoebusm Oct 18, 2023 •

edited

Loading

poodlewars commented Oct 18, 2023 •

edited

Loading

phoebusm commented Oct 18, 2023 •

edited

Loading

phoebusm Oct 24, 2023 •

edited

Loading

phoebusm Oct 24, 2023 •

edited

Loading

poodlewars left a comment •

edited

Loading