ARROW-4025: [Python] Avoid pthread_once in ThreadPool #3177
Conversation
cpp/src/arrow/util/thread-pool.cc
Outdated
@@ -294,9 +294,14 @@ std::shared_ptr<ThreadPool> ThreadPool::MakeCpuThreadPool() {
  return pool;
}

ThreadPool* arrow_global_thread_pool = nullptr;
This may not be needed any more, I'll check that.
(this turns out not to be needed after all)
To test this, the manylinux1 base image needs to be rebuilt (in …) and then wheels can be generated with …
Yuck. We don't want to bring in Boost just to work around TensorFlow's standard violations. If you think this should be solved, I encourage you to find another way.
cpp/src/arrow/util/thread-pool.cc
Outdated
ThreadPool* GetCpuThreadPool() {
  static std::shared_ptr<ThreadPool> singleton = ThreadPool::MakeCpuThreadPool();
  return singleton.get();
  if (!arrow_global_thread_pool) {
This is not thread-safe. There is a reason why I used a static singleton, and that reason is the thread safety of its initialization.
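For illustration, a minimal sketch of the difference (Pool is a stand-in for the real ThreadPool; this relies on C++11's guarantee that a function-local static is initialized exactly once, even under concurrent first calls):

```cpp
#include <memory>

struct Pool {
  // stand-in for arrow::ThreadPool
};

// Thread-safe: C++11 guarantees the function-local static is initialized
// exactly once, even if several threads hit the first call concurrently.
Pool* GetPoolSafe() {
  static std::shared_ptr<Pool> singleton = std::make_shared<Pool>();
  return singleton.get();
}

// Racy: two threads can both observe nullptr and both allocate, leaking one
// Pool; the unsynchronized pointer write is also a data race.
Pool* g_pool = nullptr;
Pool* GetPoolRacy() {
  if (g_pool == nullptr) {
    g_pool = new Pool();
  }
  return g_pool;
}
```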
I think we can even demote it to a unique_ptr, since the object is never re-assigned.
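A sketch of that demotion, assuming MakeCpuThreadPool were changed to hand back sole ownership (today it returns a shared_ptr, so this is not a drop-in change):

```cpp
ThreadPool* GetCpuThreadPool() {
  // unique_ptr suffices because the singleton is created once and never
  // reassigned or shared; the function-local static keeps initialization
  // thread-safe. Assumes MakeCpuThreadPool() returns a unique_ptr.
  static std::unique_ptr<ThreadPool> singleton = ThreadPool::MakeCpuThreadPool();
  return singleton.get();
}
```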
Yeah, you are both right (this was just to try out whether removing the static shared_ptr makes a difference).
cpp/src/arrow/util/thread-pool.cc
Outdated
  static std::shared_ptr<ThreadPool> singleton = ThreadPool::MakeCpuThreadPool();
  return singleton.get();
  if (!arrow_global_thread_pool) {
    arrow_global_thread_pool = new ThreadPool();
Creates a memory leak (object is never destroyed, Valgrind will complain).
I'm also -0/-1 on eschewing C++11 future/thread -- that seems hardly sustainable. I think we need to engage the TF / PyTorch communities on this. It feels like they swiped the credit card, but sent the bill to our house.
It might mean that we have to create some kind of a pip wheel to hoist a modern C++ standard library into the environment (for PyTorch / TF to depend on) so they can avoid all this static linking insanity, similar to what's being done in conda now.
Thanks for the feedback @wesm @pitrou. Here are my thoughts:

This problem really needs to be fixed. If we don't fix it, doing deep learning with Arrow via TensorFlow or PyTorch will be severely hindered, and that's a non-starter, certainly for Ray, but I think also for the adoption of pyarrow in general. The workarounds we have had so far work in some cases, but they sometimes fail, and we obviously cannot afford that.

Making TensorFlow and PyTorch manylinux1-compatible won't happen in the near future; it has been discussed several times and is non-trivial because of the need for GPU support and the NVIDIA toolchain. And even if it were fixed tomorrow, older versions of TensorFlow would still be in use for various reasons. So I fear this needs to be fixed on our side. Here are the solutions I can think of:

At least now that we know the problem comes from std::call_once and the futures, we have a way of attacking it :)

Concerning 2), I have a sketch in this PR now. It's not ready yet and there are probably bugs at the moment, but it may give you an idea of how this would look. It would entail the following, more limited API, which nevertheless covers the use cases of the ThreadPool we have at the moment: only functions …

Please let me know your thoughts / if you have further ideas on how to fix it. I'm willing to spend the time to make whatever approach we decide on work, but it really needs to be fixed :)

Best,
I just took a look... Do you realize the thread pool can be called from several threads? If you wait for the thread pool to be idle, you may be waiting much longer than necessary.
What if some other problem comes up? What if loading Gandiva (which is going to bundle LLVM) side by side with TensorFlow also causes crashes?
It may indeed. Though I'm afraid that depends on the exact libstdc++ version TensorFlow builds against?
I understand the sentiment, but it seems the only sane way to fix this on our side is to ask people to use conda packages, not wheels. If wheels are broken, just tell users instead of piling up fragile workarounds in the Arrow codebase. Actually, I would even go as far as asking distutils-sig if the non-conformant wheels can be removed from PyPI. PyPI has no policy for this currently, but it is an obvious detriment to the whole ecosystem that people are allowed to upload rogue wheels and other packages have to deal with the fallout.
By the way, it seems manylinux2010 is almost there now: pypa/manylinux#179 (i.e. it's in pip master, but not any pip releases yet, AFAICT).
cpp/src/arrow/util/memory.h
Outdated
@@ -33,7 +33,10 @@ uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) {

// This function is just for avoiding MinGW-w64 32bit crash.
// See also: https://sourceforge.net/p/mingw-w64/bugs/767/
void* wrap_memcpy(void* dst, const void* src, size_t n) { return memcpy(dst, src, n); }
Status wrap_memcpy(void* dst, const void* src, size_t n) {
I wouldn't wrap this with a different return type. There's a non-negligible chance that gcc can't apply the single-instruction inline-memcpy optimization. This can severely affect performance in code that does memcpy in loops.
I'd also add the static inline keywords.
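A sketch of the shape being suggested (keeping memcpy's own signature and adding static inline, so the compiler can still treat the call as a plain memcpy):

```cpp
#include <cstddef>
#include <cstring>

// Returning memcpy's result and keeping its signature leaves the compiler
// free to inline the copy (often a single instruction for small sizes),
// which switching the return type to Status could prevent.
static inline void* wrap_memcpy(void* dst, const void* src, std::size_t n) {
  return std::memcpy(dst, src, n);
}
```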
cpp/src/arrow/util/thread-pool.cc
Outdated
@@ -145,6 +145,11 @@ Status ThreadPool::Shutdown(bool wait) {
  return Status::OK();
}

void ThreadPool::Wait() {
The typical nomenclature for this is Join.
cpp/src/arrow/util/thread-pool.h
Outdated
}

void Wait();
I'd also expose the WaitFor variant with chrono (see https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for).
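For reference, a minimal sketch of what such a WaitFor could look like (member names are hypothetical; it assumes the pool already tracks outstanding tasks under a mutex and condition variable):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class ThreadPoolWaitSketch {
 public:
  // Block until all submitted tasks have finished, or until the timeout
  // elapses. Returns true if the pool became idle within the timeout.
  template <class Rep, class Period>
  bool WaitFor(std::chrono::duration<Rep, Period> timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_idle_.wait_for(lock, timeout, [this] { return pending_tasks_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_idle_;
  int pending_tasks_ = 0;  // decremented by workers, which notify cv_idle_ at zero
};
```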
cpp/src/plasma/client.cc
Outdated
@@ -650,11 +650,12 @@ Status PlasmaClient::Impl::List(ObjectTable* objects) {
  return ReadListReply(buffer.data(), buffer.size(), objects);
}

static void ComputeBlockHash(const unsigned char* data, int64_t nbytes, uint64_t* hash) {
static Status ComputeBlockHash(const unsigned char* data, int64_t nbytes, uint64_t* hash) {
Why?
cpp/src/arrow/util/thread-pool.h
Outdated

Status st = SpawnReal(detail::packaged_task_wrapper<Result>(std::move(task)));
// Submit a callable for execution.
void Submit(std::function<Status()> func) {
I wouldn't enforce that every function must return Status; a thread pool shouldn't care about the logical data. The original code didn't enforce this; what's returned is only whether submission failed.
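A sketch of that separation (names are hypothetical): the Status returned by Submit reflects only whether the task could be queued, while the callable itself may return anything, or nothing:

```cpp
#include <functional>
#include <utility>

// Stand-in for arrow::Status.
struct Status {
  static Status OK() { return Status(); }
  bool ok() const { return true; }
};

class PoolSketch {
 public:
  // The callable's own result (if any) is the caller's business; the pool
  // only reports whether the task could be queued.
  template <typename Function>
  Status Submit(Function&& func) {
    return SpawnReal(std::function<void()>(std::forward<Function>(func)));
  }

 private:
  Status SpawnReal(std::function<void()> task) {
    // ... push task onto the queue; return non-OK if the pool is shut down ...
    (void)task;
    return Status::OK();
  }
};
```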
// std::call_once(utf8_initialized, internal::InitializeLargeTable);
// How should we do this? Maybe use
// https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Nifty_Counter
internal::InitializeLargeTable();
Note that you can refactor the utf8 stuff into a simple struct that allocates the array on the heap. Then you use the same static + unique_ptr singleton technique and you'll get rid of the once_flag.
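A minimal sketch of that refactor (names and sizes are hypothetical; the real table and its initialization live in the Arrow utf8 utilities):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the large UTF-8 state table.
struct LargeTable {
  std::vector<uint16_t> entries;
  LargeTable() : entries(9 * 256, 0) {
    // ... fill the table exactly as InitializeLargeTable() does today ...
  }
};

// First use builds the table; the thread-safe initialization of the
// function-local static replaces the std::call_once / once_flag pair, and
// the heap allocation keeps the big array out of static storage.
const LargeTable& GetLargeTable() {
  static std::unique_ptr<LargeTable> table(new LargeTable());
  return *table;
}
```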
Thanks again for the further feedback! It seems to me changing the ThreadPool API is also suboptimal. I'd suggest the following solution:
Then we can leave it to the user which option they prefer. One sane default might be that if Boost is vendored (like in our wheels), boost::future will be used. This will fix the problem not only for Ray but also for vanilla pyarrow wheels. In the future (if manylinux2010 and/or TensorFlow and PyTorch solve the issue), we can remove the flag and just default to std::future.
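A sketch of how such a switch might look; ARROW_USE_BOOST_FUTURE is a made-up flag name, and the real wiring would live in CMake:

```cpp
// Hypothetical compile-time switch between future implementations.
#ifdef ARROW_USE_BOOST_FUTURE
// Depending on the Boost version, boost::future may require defining
// BOOST_THREAD_PROVIDES_FUTURE (or a recent BOOST_THREAD_VERSION).
#include <boost/thread/future.hpp>
template <typename T>
using ArrowFuture = boost::future<T>;
#else
#include <future>
template <typename T>
using ArrowFuture = std::future<T>;
#endif
```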
I don't think it's reasonable for TF / PyTorch to build with ad hoc C++ toolchains and linker flags and expect the rest of the Python community to conform to their decisions with zero consultation. They need to engage other projects on these toolchain issues -- they have the money and the people to put in effort to be compatible with the rest of the ecosystem, so we need to hold them accountable. There aren't that many players in this ecosystem, and we will not allow ourselves to be pushed around. I sympathize with the issues you are having, but we cannot treat these other open source projects as immutable entities. Let's talk with the development teams and come up with a solution so we can all use C++ in a consistent way and have working applications.
For the avoidance of ambiguity, @pcmoritz, I'm expecting you and @robertnishihara to take the lead on engaging with the TF / PyTorch communities on this. Please let us know if we can help.
Somehow we need to create some catalyst to cause the TF and PyTorch developers to take more responsibility for the toolchain problem they have created; this might be one way. Either way we need to escalate the issue and make a lot more people aware of what is going on.
Aside: pyarrow is being installed more frequently than PyTorch and about 1/4 as often as TF on PyPI.
Do you want me to concoct a small workaround using purely atomics that would respect the Google style guide?
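For what it's worth, a minimal sketch of an atomics-only lazy initialization (double-checked compare-and-swap; Pool is a stand-in, and it assumes constructing and discarding a losing candidate is acceptable):

```cpp
#include <atomic>

struct Pool {
  // stand-in for the real thread pool
};

std::atomic<Pool*> g_pool{nullptr};

Pool* GetPool() {
  Pool* existing = g_pool.load(std::memory_order_acquire);
  if (existing != nullptr) {
    return existing;
  }
  // Build a candidate and try to install it atomically; if another thread
  // wins the race, drop ours and use the installed one. No call_once or
  // pthread_once is involved.
  Pool* candidate = new Pool();
  if (g_pool.compare_exchange_strong(existing, candidate,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire)) {
    return candidate;
  }
  delete candidate;
  return existing;
}
```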
I don't think it's worth investing any more effort in this until we open a dialogue about the toolchain issues with the relevant developer communities.
Thanks for all the views! I'll send a post to the TensorFlow and PyTorch lists to make sure this issue is raised prominently. Hopefully the community can agree on a way forward that works for everyone. At least we should make sure that there will be compatibility with manylinux2010 in the future. I'll also prepare a patch for Ray 0.6.1 to make sure this works in the short term for our users.
@wesm thanks a lot for pushing to start a discussion and dialogue around these issues. PyTorch certainly isn't immutable and we don't expect to be treated that way.
Just FYI, PyTorch hasn't statically linked the C++ standard library since January this year -- we've been trying our best to work within the necessary constraints we have (in terms of CUDA and C++11) while supporting as old a Linux distro as we possibly can.
Closing this pending some guidance about how to proceed on creating compatible wheels |
This seems to fix the TensorFlow/PyTorch crash for me.
Some disassembly (thanks for the hint @pitrou) revealed that pthread_once is called by the C++11 future library. Replacing it with Boost seems to fix it (but unfortunately pulls in a number of other Boost dependencies like atomic, chrono, and date_time; I wonder if that is avoidable).
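For anyone who wants to double-check this, a tiny program one could compile and disassemble (e.g. with objdump -d) to look for the pthread_once / __gthread_once reference that std::future pulls in on libstdc++; this is only a sketch, and what actually shows up depends on the standard library and compiler flags:

```cpp
#include <future>
#include <iostream>

int main() {
  // On libstdc++, fulfilling a promise goes through std::call_once on the
  // shared state, which is where the pthread_once dependency comes from.
  std::promise<int> p;
  std::future<int> f = p.get_future();
  p.set_value(42);
  std::cout << f.get() << std::endl;
  return 0;
}
```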