Add mutex to ensure only single thread at pybind->c++ layer #973
@@ -1122,4 +1125,11 @@ void PythonVersionStore::force_delete_symbol(const StreamId& stream_id) {
delete_all_for_stream(store(), stream_id, true);
}

std::lock_guard<std::mutex> PythonVersionStore::ensure_single_thread_cpp_pybind_entry(){
py::gil_scoped_release release;
Why is there a gil release call here? IMO this does a bit more than the name implies.
Sorry, I should have put more details in the PR's description.
Every pybind function called from Python naturally holds the GIL. Most functions hold the GIL for their entire lifespan, but not batch_read. batch_read releases the GIL at some point so that the folly threads which pick up its tasks can acquire the GIL to call Python functions. If the Python script is multithreaded, another pybind function can then be called. In this case, def read() is called. def read() also uses folly futures. Both functions share the same pool of threads. So:
- The batch_read thread, which does not hold the GIL, has exhausted all the threads in the pool, and those threads are all waiting for the GIL.
- The read thread, which does hold the GIL, waits for the result of its future, which will never return because there is no thread available in the pool.
Therefore, deadlock occurs.
As a short-term fix, a mutex is added to ensure only a single thread is at the pybind->C++ layer. However, as I mentioned above, every pybind function already holds the GIL. So while a newly called function is waiting for the mutex, the GIL needs to be released. This lets the other running thread (def batch_read) and its task runners acquire the GIL.
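The pool-exhaustion scenario described above can be reproduced in miniature. The sketch below is a pure-Python stand-in, not ArcticDB code: a `threading.Lock` plays the role of the GIL, a `ThreadPoolExecutor` plays the shared folly pool, and every name in it (`demo_pool_exhaustion`, `task_needing_gil`, `read_task`, `POOL_SIZE`) is hypothetical. A timeout is used so the stall is observed instead of hanging forever.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 2  # hypothetical; the real pool size is not stated in the PR


def demo_pool_exhaustion():
    gil = threading.Lock()  # stand-in for the GIL
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)  # stand-in for the shared pool

    def task_needing_gil():
        # Like a batch_read task: the worker blocks until it gets the "GIL".
        with gil:
            return "batch task done"

    def read_task():
        # Like the work read() hands to the pool; it needs no lock here.
        return "read done"

    gil.acquire()  # the "read" caller enters holding the GIL stand-in
    try:
        # batch_read has already filled every worker with tasks waiting on the lock.
        batch_futures = [pool.submit(task_needing_gil) for _ in range(POOL_SIZE)]
        # read() queues its own work behind them and waits, still holding the lock.
        read_future = pool.submit(read_task)
        try:
            read_future.result(timeout=0.5)
            stalled = False
        except FutureTimeout:
            stalled = True  # without the timeout this wait would never return
    finally:
        gil.release()  # giving up the lock lets the workers make progress

    results = [f.result(timeout=5) for f in batch_futures]
    read_result = read_future.result(timeout=5)
    pool.shutdown()
    return stalled, results, read_result
```

In this sketch the stall clears as soon as the lock holder releases the lock, which mirrors why the fix releases the GIL while a caller waits on the mutex.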
Can you also add this information as a comment above the function (preferably in doxygen style)?
Please could you,
I thought the plan was to add this to all the functions. Is the idea that that will be a follow-up?
I think I misunderstood the urgency of the fix. I thought it was an urgent task, so I only fixed the functions reported in this PR. That's why this PR is a draft.
If the above tasks are required before merging, I would like to mark this PR as WIP or simply close it until they are ready.
Thanks. It would be good to confirm the urgency with the users; I'm not sure whether they have a good workaround or not. If they're in a rush, I think it's OK just to add this to read and read batch, and we can follow up immediately for the other methods. Otherwise, let's do them all at once and save ourselves an extra release. I think it would be good to have the multithreading and multiprocessing tests whether they are in a rush or not; we should be careful that we are not making things worse.
cb0a748
to
86d19b9
Compare
No deterioration of performance observed after adding the lock:
86d19b9
to
e41d7f7
Compare
q.put(e)

read_th_count = 1
batch_read_th_count = 1
Increase these thread counts?
What is the point of the queue (which isn't synchronised correctly), why not just let the exception bubble up and fail the test?
According to my tests, an exception raised in a "sub-thread" does not cause any test failure unless it is caught and then re-raised in the main thread.
I can increase the thread count. Yet according to my tests, the current setting is sufficient to reproduce the issue. I would like to keep the additional time needed to run the test to a minimum.
Sounds good, the minimal amount to repro the issue is fine.
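The point about sub-thread exceptions can be illustrated with a small standalone sketch. It is not the test code from this PR: `run_and_reraise` and its helpers are hypothetical names, and the approach shown (collect exceptions in a `queue.Queue`, join all threads, then re-raise in the main thread) is just one way to make a worker failure fail the test.

```python
import queue
import threading


def run_and_reraise(targets):
    """Run each callable in its own thread; re-raise the first captured error."""
    errors = queue.Queue()  # queue.Queue is safe to share between threads

    def guarded(fn):
        def runner():
            try:
                fn()
            except Exception as exc:
                # An uncaught exception would only be reported by the thread
                # itself; it would not be raised in the caller.
                errors.put(exc)
        return runner

    threads = [threading.Thread(target=guarded(fn)) for fn in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # every worker has finished before the queue is inspected

    if not errors.empty():
        raise errors.get()
```

Joining every thread before reading the queue avoids checking for errors while a worker is still running.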
@@ -344,6 +344,8 @@ class PythonVersionStore : public LocalVersionedEngine {
bool prune_previous_versions);

void delete_snapshot_sync(const SnapshotId& snap_name, const VariantKey& snap_key);

[[nodiscard]] static std::lock_guard<std::mutex> ensure_single_thread_cpp_pybind_entry();
I'm confused, I can't see where this is defined, or used?
It is used as py::call_guard<SingleThreadMutexHolder>() when defining a pybind function.
m.def("foo", foo, py::call_guard<T>());
is equivalent to
m.def("foo", [](args...) {
    T scope_guard;
    return foo(args...); // forwarded arguments
});
Oops I see your point now. Removing
@@ -148,7 +149,7 @@ void register_bindings(py::module& storage, py::exception<arcticdb::ArcticExcept
}, py::arg("library_path"), py::arg("throw_on_failure") = true)
.def("remove_library_config", [](const LibraryManager& library_manager, std::string_view library_path){
return library_manager.remove_library_config(LibraryPath{library_path, '.'});
- })
+ }, py::call_guard<SingleThreadMutexHolder>())
Why does this method need the mutex, and not the other methods in this class?
I think I should have put the explanation in the description of this PR.
Basically, my understanding is that this will be a short-lived patch, so my idea is to keep it safe for the short term:
- No major overhaul
- Add the mutex to all functions in PythonVersionStore
- Add the mutex to functions that use folly::Future, which, in my eyes, can easily be converted to async tasks. remove_key is called by remove_library_config, which uses folly::Future, so the mutex is necessary
- &PythonVersionStore::write_partitioned_dataframe,
- "Write a dataframe to the store")
+ &PythonVersionStore::write_partitioned_dataframe,
+ py::call_guard<SingleThreadMutexHolder>(), "Write a dataframe to the store")
Did you consider alternative approaches like subclassing py::class_ to always insert the call guard on def? Might be more robust as more methods are added. What do you think?
I have thought about that, but the call_guard is only needed for functions that involve folly::Future, which are not the majority of cases. So at the end of the day, developers still need to decide whether they need the guard or not and then, in your case, use the new subclass of py::class_. To me, that is not much different from adding the py::call_guard.
Or do you mean just for PythonVersionStore?
I was thinking just for PythonVersionStore, but I can see that my idea might add about as much complexity as it removes. Let's leave it as-is for now.
@@ -8,6 +8,7 @@
#include <arcticdb/python/python_utils.hpp>
#include <arcticdb/async/python_bindings.hpp>
#include <arcticdb/async/task_scheduler.hpp>
#include <arcticdb/util/pybind_mutex.hpp>
Not used, is it?
Oops, sorry, it will be removed.
6b23a16
to
9af41f0
Compare
9af41f0
to
862e811
Compare
Worried about forking: the new process will get the mutex state from the static one in pybind_mutex.hpp, which may be locked. How will it ever get unlocked? Do we need some mechanism to have a per-process mutex? Does pybind have anything to help us here?
96e8441
to
edb6a50
Compare
edb6a50
to
2af003f
Compare
Handling for multiprocess added
Reference Issues/PRs
#974
What does this implement or fix?
#973 (comment)
Any other comments?
Checklist
Checklist for code changes...