Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mutex to ensure only single thread at pybind->c++ layer #973

Merged
merged 2 commits into from
Oct 31, 2023

Conversation

phoebusm
Copy link
Collaborator

@phoebusm phoebusm commented Oct 18, 2023

Reference Issues/PRs

#974

What does this implement or fix?

#973 (comment)

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@phoebusm phoebusm added the bug Something isn't working label Oct 18, 2023
@phoebusm phoebusm self-assigned this Oct 18, 2023
@@ -1122,4 +1125,11 @@ void PythonVersionStore::force_delete_symbol(const StreamId& stream_id) {
delete_all_for_stream(store(), stream_id, true);
}

std::lock_guard<std::mutex> PythonVersionStore::ensure_single_thread_cpp_pybind_entry(){
py::gil_scoped_release release;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a gil release call here? IMO this does a bit more than the name implies.

Copy link
Collaborator Author

@phoebusm phoebusm Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I should have put more details in the PR's description.
Every pybind function called from python naturally hold the GIL. Most of the function will hold the GIL in the entire lifespan. But not for batch_read. batch_read will release GIL at some point, so the folly threads which get its "tasks" can acquire the GIL to call some python function. Which, if the python script is multithreaded, another pybind function can be called. In this case, def read() is called. def read() also use folly future too. Both function share the same pool of thread. So,

  • batch_read thread which does not have the GIL, has exhausted all the threads in the pool. And those threads are all waiting for the GIL.
  • read thread which does have the GIL, waits for the result of the future, which will never return as there is no thread available in the pool
    Therefore, deadlock occurs.
    As a short term fix, a mutex is added to ensure only single thread at pybind->c++ layer. However, as I mention above, every pybind already has the GIL. So when the new function called is waiting for the mutex, the GIL is needed to be released. It enables the other running thread (def batch_read) and its task runners can acquire GIL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add this information as a comment above the function (preferably in doxygen style)?

@poodlewars
Copy link
Collaborator

poodlewars commented Oct 18, 2023

Please could you,

  • Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.
  • Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?
  • Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)
  • Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?

@phoebusm
Copy link
Collaborator Author

phoebusm commented Oct 18, 2023

Please could you,

  • Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.
  • Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?
  • Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)
  • Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?

  1. Exactly
  2. Correct me if I am wrong, but I believe C++ mutex is intraprocess only. <- Ok just realise something. Let me do more tests
  3. Yea it passes both multithreading and multiprocessing test <- Ok just realise something. Let me do more tests
  4. Yes this is something I should do

I think I have misunderstood the urgency of the fix. I thought it is an urgent task so I only fix the functions being reported in this PR. That's why this PR is drafted.
Pending task for a complete PR:

  1. Mutex for all pybind function
  2. Test
  3. Benchmark

If above tasks are required before merging, I will like to add WIP to this PR or simply close it until the above tasks are ready

@poodlewars
Copy link
Collaborator

Please could you,

  • Explain why we need the GIL part? I assume this is something about what happens if we hold the GIL in the thread blocked by the new mutex, which might deadlock the thread holding the mutex if that has previously dropped the GIL and is now waiting to re-acquire it.
  • Confirm how this works with Python multiprocessing? Is there a new mutex for each sub-process?
  • Add tests for this for both multiprocessing (should permit the concurrency) and multithreading (should forbid it)
  • Share any performance benchmarking showing the impact of this when not using Python multi-threading

I thought the plan was to add this to all the functions. Is the idea that that will be a follow up?

1. Exactly

2. Correct me if I am wrong, but I believe C++ mutex is intraprocess only. <- Ok just realise something. Let me do more tests

3. Yea it passes both multithreading and multiprocessing test

4. Yes this is something I should do

I think I have misunderstood the urgency of the fix. I thought it is an urgent task so I only fix the functions being reported in this PR. That's why this PR is drafted. Pending task for a complete PR:

1. Mutex for all pybind function

2. Test

3. Benchmark

If above tasks are required before merging, I will like to add WIP to this PR or simply close it until the above tasks are ready

Thanks. It would be good to confirm the urgency with the users, I'm not sure whether they have a good workaround or not. If they're in a rush, I think it's OK just to add this to read and read batch, and we can follow up immediately for the other methods. Otherwise, let's do them all at once and save ourselves an extra release.

I think it would be good to have the multithreading and processing tests whether they are in a rush or not, we should be careful that we are not making things worse.

@phoebusm phoebusm changed the title Add mutex to ensure only single thread at pybind->c++ layer WIP: Add mutex to ensure only single thread at pybind->c++ layer Oct 19, 2023
@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from cb0a748 to 86d19b9 Compare October 24, 2023 00:25
@phoebusm
Copy link
Collaborator Author

No deterioration of performance observed after adding the lock:
Average time taken for read
Before adding the lock: 0.027s
After adding the lock: 0.023s

@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 86d19b9 to e41d7f7 Compare October 24, 2023 01:08
q.put(e)

read_th_count = 1
batch_read_th_count = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increase these thread counts?

What is the point of the queue (which isn't synchronised correctly), why not just let the exception bubble up and fail the test?

Copy link
Collaborator Author

@phoebusm phoebusm Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception from the "sub-thread" does not give any error during the tests unless it is caught then re-thrown in the main thread, according to my test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can increase the thread count. Yet according to my test, current setting is sufficient to repeating the issue. I would like to keep the additional time needed for running the test as minimum as possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, the minimal amount to repro the issue is fine.

@@ -344,6 +344,8 @@ class PythonVersionStore : public LocalVersionedEngine {
bool prune_previous_versions);

void delete_snapshot_sync(const SnapshotId& snap_name, const VariantKey& snap_key);

[[nodiscard]] static std::lock_guard<std::mutex> ensure_single_thread_cpp_pybind_entry();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, I can't see where this is defined, or used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used as py::call_guard<SingleThreadMutexHolder>() while defining pybind function.

m.def("foo", foo, py::call_guard<T>());

is equal to

m.def("foo", [](args...) {
    T scope_guard;
    return foo(args...); // forwarded arguments
});

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I see your point now. Removing

@@ -148,7 +149,7 @@ void register_bindings(py::module& storage, py::exception<arcticdb::ArcticExcept
}, py::arg("library_path"), py::arg("throw_on_failure") = true)
.def("remove_library_config", [](const LibraryManager& library_manager, std::string_view library_path){
return library_manager.remove_library_config(LibraryPath{library_path, '.'});
})
}, py::call_guard<SingleThreadMutexHolder>())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this method need the mutex, and not the other methods in this class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I should have put the explanation in the description of this PR.
So basically, my understanding is, this will be short-lived patch. So my idea is keeping it safe for the short term:

  1. No major rehaul
  2. Add mutex to all functions in PythonVersionStore
  3. If functions use folly::Future, which, in my eyes, can be converted async task easily. So
    remove_key is called by remove_library_config, which uses folly::Future, mutex is necessary

&PythonVersionStore::write_partitioned_dataframe,
"Write a dataframe to the store")
&PythonVersionStore::write_partitioned_dataframe,
py::call_guard<SingleThreadMutexHolder>(), "Write a dataframe to the store")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider alternative approaches like subclassing py::class_ to always insert the call guard on def? Might be more robust as more methods are added. What do you think?

Copy link
Collaborator Author

@phoebusm phoebusm Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about that but the call_guard is only needed for functions involves folly::future, which are not the majority of the case. So at the end of the day, developers still need to differentiate whether they need the guard or not. Then, in your case, assigning the new subclass of py::class_. To me, it is not a big different to adding the py::call_guard.
Or you mean just simply for PythonVersionStore?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking just for PythonVersionStore, but I can see that my idea might add about as much complexity as it removes. Let's leave it as-is for now.

@@ -8,6 +8,7 @@
#include <arcticdb/python/python_utils.hpp>
#include <arcticdb/async/python_bindings.hpp>
#include <arcticdb/async/task_scheduler.hpp>
#include <arcticdb/util/pybind_mutex.hpp>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used, is it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry will be remove

@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 6b23a16 to 9af41f0 Compare October 24, 2023 13:32
@phoebusm phoebusm changed the title WIP: Add mutex to ensure only single thread at pybind->c++ layer Add mutex to ensure only single thread at pybind->c++ layer Oct 24, 2023
@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 9af41f0 to 862e811 Compare October 24, 2023 14:36
Copy link
Collaborator

@poodlewars poodlewars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worried about forking? Then new process will get the mutex state from the static one in pybind_mutex.hpp, which may be locked. How will it ever get unlocked? Do we need some mechanism to have a per-process mutex? Does pybind have anything to help us here?

@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from 96e8441 to edb6a50 Compare October 27, 2023 15:18
@phoebusm phoebusm force-pushed the bugfix/batch_read_read_deadlock branch from edb6a50 to 2af003f Compare October 30, 2023 09:46
@phoebusm
Copy link
Collaborator Author

Worried about forking? Then new process will get the mutex state from the static one in pybind_mutex.hpp, which may be locked. How will it ever get unlocked? Do we need some mechanism to have a per-process mutex? Does pybind have anything to help us here?

Handling for multiprocess added

@phoebusm phoebusm merged commit d633f9b into master Oct 31, 2023
98 checks passed
@phoebusm phoebusm deleted the bugfix/batch_read_read_deadlock branch October 31, 2023 10:09
@phoebusm phoebusm mentioned this pull request Oct 31, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants