introduce chunk._getitem_r(...), a thread-safe version of chunk._getitem(...) #160
base: master
Conversation
`chunk._getitem(...)` depends on and modifies instance attributes of `chunk`. The consequence is that in multi-threaded applications access to the `chunk` instance has to be controlled by locking, which prevents multi-threaded code based on bcolz from scaling appropriately with the number of processors. The introduced function `chunk._getitem_r(...)` provides the same functionality without requiring the GIL or any locking to protect instance attributes. This allows writing efficient multi-threaded Cython algorithms based on bcolz.
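A minimal sketch of the API difference being proposed, using a toy pure-Python class rather than the real bcolz chunk (the class `ToyChunk` and its attributes are illustrative, not bcolz code):

```python
import threading

class ToyChunk:
    """Toy stand-in for a bcolz chunk (not the real implementation)."""

    def __init__(self, data):
        self._data = bytes(data)
        self._scratch = None           # instance state mutated by _getitem
        self._lock = threading.Lock()  # callers must hold this around _getitem

    def _getitem(self, start, stop):
        # Mutates shared instance state, so concurrent callers need the lock.
        self._scratch = self._data[start:stop]
        return self._scratch

    def _getitem_r(self, start, stop, out):
        # Re-entrant variant: all state lives in the caller-supplied buffer,
        # so no lock (and, in Cython, no GIL) is required.
        out[: stop - start] = self._data[start:stop]
        return stop - start

chunk = ToyChunk(range(16))

# Locked access for the state-mutating version:
with chunk._lock:
    a = chunk._getitem(0, 4)

# Lock-free access for the re-entrant version:
buf = bytearray(4)
chunk._getitem_r(4, 8, buf)
```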
This looks good, thanks. Unfortunately c-blosc only supports being called safely from multiple threads in versions >= 1.5; with earlier versions, c-blosc takes a lock during multi-threaded operation. Also, for reasons that are not completely clear, the c-blosc 1.5 series still has some stability problems with bcolz (e.g. #120, #121), which is why bcolz still ships the c-blosc 1.4 series. The result is that you probably won't be seeing any speedup with this PR (yet, not until c-blosc 1.5 can go into bcolz). At any rate, if you can compile bcolz with c-blosc 1.5 and tell us the speed-ups you are seeing, that would be great. I am a big fan of seeing real improvements before adding new code.
Benchmarking with c-blosc 1.5 is on hold as it depends on resolving Blosc/c-blosc#92. (The c-blosc 1.5 series does not compile, or produces runtime failures, on my machine.)
What is the status of this? Blosc/c-blosc#92 has been resolved, right? Does 1.6 work better on your machine now?
@esc This PR arose from the effort to parallelise factorisation in bquery. I have not worked on it recently but plan to get back to it soonish. c-blosc 1.5.4 does compile on my machine with MSVC, but the test suite crashes sporadically. 1.6 (or more precisely the tip of master) I could not compile: it complained that the new compression algorithm was not available on my machine.
This would probably speed up […]. Would love to see this become a part of […]. I'll try to construct a test of […].
@waylonflinn I am not completely sure that the bottleneck this was supposed to circumvent was real. There were issues with my test database as well as with c-blosc at the time, and I have not investigated the issue with the current c-blosc yet. I have been meaning to look into it, but it is currently a fairly low priority for me. Just wanted to give you a heads up. I am not familiar with […]. I would be very interested in any benchmark results indeed.
@ARF1 thanks for the heads up!
Here's a link to one of the relevant pieces of code: […]
@waylonflinn Thanks. That actually looks very useful to me too! Is there a list of bcolz-based projects anywhere? More and more seem to be popping up. More to the point, I can understand your interest in parallel processing of chunks. I am sure you are already aware of Cython's `prange`. I am not sure, however, that the non-GIL thread-safe version of […]. Still, benchmarks would be really good to have, and your application is probably ideally suited for such an experiment. We would finally know whether this PR is worth keeping active.
@ARF1 you may want to join the bcolz google group: https://groups.google.com/forum/#!forum/bcolz We have been discussing bdot over there for a few days already.
Yes, @ARF1's suggestion of using `prange` is a good one. And he is right, Blosc used to pose a lock itself (similar to the GIL), but with c-blosc >= 1.5 there are a couple of calls (those ending in '_ctx') that are meant to run c-blosc without a lock. I am experimenting with that in branch https://github.com/Blosc/bcolz/tree/c-blosc-1.7 , and the results are promising. Hopefully compression without locks will happen in the next release of bcolz (keeping my fingers crossed). And yes, moving this into the mailing list may be a good thing.
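For readers unfamiliar with the `_ctx` idea: those calls keep all compression state in a per-call context instead of in shared global state, so callers need no lock around them. A rough standard-library analogy (zlib standing in for c-blosc; this is not bcolz code) is that the stateless `zlib.decompress` can be called from many threads concurrently without coordination:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Each compressed chunk is independent, and zlib.decompress keeps no
# shared state between calls (loosely analogous to the c-blosc *_ctx
# calls), so the worker threads need no lock around it.
chunks = [zlib.compress(bytes([i]) * 1000) for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    decompressed = list(pool.map(zlib.decompress, chunks))
```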
One newbie but serious question (it's something I'm struggling with in bquery): how do you ensure the order of chunk processing when processing them in parallel? Because I can imagine a wait mechanism, but wouldn't that potentially cause buffer issues (one very slow chunk followed by very quick chunks)?
I don't think that `prange` allows you to ensure an order, but if you want that, then use regular `range`, right?
@FrancescAlted With regular […] I think @CarstVaartjes was asking for a silver bullet: parallel processing of chunks, in-order results, without waiting... Intuitively this sounds like a classic application for the producer-consumer pattern, but unfortunately with OpenMP (Cython threads) that is not a pattern that lends itself to easy implementation. @CarstVaartjes You might remember I also had this question when I started my multiprocessing branch of […]. When you use […] there are several ways of dealing with this:
I have a semi-working implementation of the thread-local buffer approach in an unpublished multiprocessing […]. Of course, in all cases of bcolz chunk processing, even a sub-optimal multiprocessing solution is likely to be faster than a single-core solution. It's just the scalability with the number of cores that suffers, right?
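One way the in-order question can be handled without an explicit wait mechanism, sketched here in plain Python with the standard library (this is not bcolz or bquery code, and the worker `process` is illustrative): let workers finish in any order and rely on `concurrent.futures.Executor.map`, which buffers completed-but-not-yet-consumed results internally and yields them in submission order:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def process(indexed_chunk):
    """Illustrative per-chunk worker: sleep a random amount to simulate
    chunks that finish out of order, then reduce the chunk."""
    i, chunk = indexed_chunk
    time.sleep(random.uniform(0, 0.01))
    return i, sum(chunk)

chunks = [[i] * 4 for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in submission order regardless of which
    # worker finishes first; a slow chunk simply delays consumption
    # while faster results wait in the executor's internal buffer.
    results = [r for _, r in pool.map(process, enumerate(chunks))]
```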
Use case: visualfabriq/bquery#22