This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

introduce chunk._getitem_r(...), a thread-safe version of chunk._getitem(...) #160

Open · wants to merge 3 commits into master

Conversation


@ARF1 commented Mar 8, 2015

chunk._getitem(...) depends on and modifies instance attributes of chunk.

The consequence is that in multi-threaded applications, access to the chunk
instance has to be controlled by locking, which prevents multi-threaded code
based on bcolz from scaling appropriately with the number of processors.

The introduced function chunk._getitem_r(...) provides the same functionality
without requiring the GIL or any locking to protect instance attributes. This
allows writing efficient multi-threaded Cython algorithms based on bcolz.

Use case: visualfabriq/bquery#22
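
For context, a minimal sketch of the pattern a GIL-free getitem enables with Cython's prange. This is not the PR's actual code: the getitem_r routine, its signature, and the driver below are hypothetical stand-ins.

```cython
from cython.parallel import prange

# Hypothetical stand-in for chunk._getitem_r(...): it touches only its
# arguments, no instance attributes, so it is safe to call without the GIL.
cdef int getitem_r(const void* src, void* dest, size_t nbytes) nogil:
    # ... decompress `src` into `dest`; return 0 on success ...
    return 0

cdef void decompress_all(const void** srcs, void** dests,
                         size_t* sizes, int nchunks):
    cdef int i
    # With no GIL and no shared mutable state, every core can decompress
    # a different chunk at the same time.
    for i in prange(nchunks, nogil=True):
        getitem_r(srcs[i], dests[i], sizes[i])
```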

ARF added 2 commits March 8, 2015 18:48
@FrancescAlted
Member

This looks good, thanks. Unfortunately, c-blosc only supports being called effectively from multiple threads in versions >= 1.5; with earlier versions, c-blosc takes an internal lock during multi-threaded operation.

Also, for reasons that are not completely clear, the c-blosc 1.5 series still has some stability problems with bcolz (e.g. #120, #121), which is why bcolz still ships with the c-blosc 1.4 series. The result is that you probably won't see any speedup from this PR (yet, and until c-blosc 1.5 can go into bcolz).

At any rate, if you can compile bcolz with c-blosc 1.5 and tell us what speed-ups you see, that would be great. I am a big fan of seeing real improvements before adding new code.

@ARF1
Author

ARF1 commented Mar 9, 2015

Benchmarking with c-blosc 1.5 is on hold, as it depends on resolving Blosc/c-blosc#92. (The c-blosc 1.5 series either does not compile or produces runtime failures on my machine.)

@esc
Member

esc commented May 2, 2015

What is the status of this? Blosc/c-blosc#92 has been resolved, right? Does 1.6 work better on your machine now?

@ARF1
Author

ARF1 commented May 6, 2015

@esc This PR arose from the effort to parallelise factorisation in bquery. I have not worked on it recently but plan to get back to it soonish.

c-blosc 1.5.4 does compile on my machine with MSVC, but the test suite sporadically crashes. 1.6 (or more precisely, the tip of master) I could not compile: it complained that the new compression algorithm was not available on my machine.

@waylonflinn
Contributor

This would probably speed up bdot on multicore machines as well. It might even enable greater-than-NumPy speeds under more conditions.

Would love to see this become a part of bcolz.

I'll try to construct a test of bdot with code from this PR, as soon as I can.

@ARF1
Author

ARF1 commented Jun 18, 2015

@waylonflinn I am not completely sure that the bottleneck this was supposed to circumvent was real. There were issues with my test database as well as with c-blosc at the time, and I have not investigated the issue with the current c-blosc yet. I have been meaning to look into it, but it is currently fairly low priority for me. Just wanted to give you a heads-up.

I am not familiar with bdot. Would you mind posting a link to the relevant code so I can get a better idea of the context?

I would be very interested in any benchmark results indeed.

@waylonflinn
Contributor

@ARF1 thanks for the heads up!

bdot is pretty new (I released the first version this weekend). It's an extension like bquery. It allows you to calculate matrix products on matrices stored in carrays.

Here's a link to one of the relevant pieces of code:
https://github.com/pinleague/bdot/blob/master/bdot/carray_ext.pyx#L68
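
For readers unfamiliar with the idea, here is a rough sketch of a chunk-wise matrix product over a carray. This is illustrative only, not bdot's actual implementation; chunked_dot is a made-up name.

```python
import numpy as np
import bcolz

def chunked_dot(carr, mat):
    # carr: a 2-D bcolz carray of shape (n, k); mat: a NumPy array (k, m).
    # Working one chunk of rows at a time keeps only a single
    # decompressed block in memory at once.
    out = np.empty((carr.shape[0], mat.shape[1]), dtype=carr.dtype)
    step = carr.chunklen
    for start in range(0, carr.shape[0], step):
        stop = min(start + step, carr.shape[0])
        # Slicing the carray decompresses one block into a NumPy array.
        out[start:stop] = np.dot(carr[start:stop], mat)
    return out
```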

@ARF1
Author

ARF1 commented Jun 18, 2015

@waylonflinn Thanks. That actually looks very useful to me too! Is there a list of bcolz-based projects anywhere? More and more seem to be popping up.

More to the point, I can understand your interest in parallel processing of chunks. I am sure you are already aware of Cython's prange. If not, there is a multiprocessing WIP PR for bquery that can serve as inspiration. (Hint: with the classic _getitem() you will need nested `with nogil:` and `with gil:` blocks, as in the sketch below.)
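
To make the hint concrete, a hedged sketch of the nested-GIL pattern. process_nogil is a placeholder, and the Python-level _getitem(start, stop) call is assumed; the real method in carray_ext.pyx may not be callable exactly this way.

```cython
from cython.parallel import prange

cdef void process_nogil(int i) nogil:
    pass  # pure-C per-chunk work: runs on all cores in parallel

def process(carr):
    # `carr` stands for a bcolz carray exposing a list of chunks.
    cdef int i
    cdef int n = len(carr.chunks)
    for i in prange(n, nogil=True):
        process_nogil(i)
        with gil:
            # The classic _getitem(...) reads and writes chunk attributes,
            # so the GIL must be re-acquired around it; this section runs
            # serialized and is what _getitem_r(...) would avoid.
            carr.chunks[i]._getitem(0, carr.chunklen)
```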

I am not sure, however, that the non-GIL thread-safe version of _getitem() will make a huge difference in addition to prange: it is my understanding that the current c-blosc version enforces a global lock. In other words, you are likely to avoid the Python GIL only to run into its c-blosc equivalent. That said, I believe c-blosc already uses all available cores for decompressing chunks anyway, unless the chunks are very small.

Still, benchmarks would be really good to have, and your application is probably ideally suited for such an experiment. We would finally know whether this PR is worth keeping active.

@esc
Member

esc commented Jun 19, 2015

@ARF1 you may want to join the bcolz google group:

https://groups.google.com/forum/#!forum/bcolz

We have been discussing bdot over there for a few days already.

@FrancescAlted
Member

Yes, @ARF1's suggestion of using prange is a good one. And he is right: Blosc used to hold a lock itself (similar to the GIL), but with c-blosc >= 1.5 there are a couple of calls (those ending in '_ctx') that are meant to run c-blosc without a lock. I am experimenting with that in the branch https://github.com/Blosc/bcolz/tree/c-blosc-1.7 , and the results are promising. Hopefully compression without locks will happen in the next release of bcolz (keeping my fingers crossed).
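
For reference, a sketch of how the '_ctx' calls can be driven from Cython without the GIL. The blosc_decompress_ctx declaration follows the c-blosc >= 1.5 header; the surrounding driver is illustrative.

```cython
from cython.parallel import prange

cdef extern from "blosc.h" nogil:
    int blosc_decompress_ctx(const void* src, void* dest,
                             size_t destsize, int numinternalthreads)

cdef void decompress_chunks(const void** srcs, void** dests,
                            size_t* destsizes, int nchunks):
    cdef int i
    for i in prange(nchunks, nogil=True):
        # Each _ctx call carries its own context, so no process-global
        # lock is taken; one internal thread per call avoids
        # oversubscribing the cores prange already occupies.
        blosc_decompress_ctx(srcs[i], dests[i], destsizes[i], 1)
```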

And yes, moving this to the mailing list may be a good thing.

@CarstVaartjes
Contributor

One newbie but serious question (it's something I'm struggling with in bquery): how do you ensure the order of chunk processing when running chunks in parallel? I can imagine a wait mechanism, but wouldn't that potentially cause buffer issues (one very slow chunk followed by very quick chunks)?

@FrancescAlted
Member

I don't think that prange allows you to ensure an order, but if you want that, then use a regular range, right?

@ARF1
Author

ARF1 commented Jun 19, 2015

@FrancescAlted With a regular range you of course lose the ability to process your chunks in parallel.

I think @CarstVaartjes was asking for a silver bullet: parallel processing of chunks, in-order results, no waiting... Intuitively this sounds like a classic application for the producer-consumer pattern, but unfortunately with OpenMP (Cython threads) that is not a pattern that lends itself to easy implementation.

@CarstVaartjes You might remember I also had this question when I started my multiprocessing branch of bquery.

When you use i in prange(0, 100) with two threads, thread 1 gets the range (0:50) while thread 2 gets the range (50:100). Within each thread the results are in order, but since the threads run in parallel you end up with your results in a mangled mess.

There are several ways of dealing with this:

  1. Since you have the variable i, you know which chunk your result belongs to, so you can save the result out of order (i.e. using the save equivalent of _getitem()). This is what I did initially with multiprocessing bquery. Not too elegant, since the natural chunksize of your results array might be quite different from that of the source array.
  2. You could save the results into a results array in the order they arrive but remember the order in which they were saved, then reorder in a second step. This deals with the chunksize issue and is easy to implement, but it always sounded like "cheating" to me...
  3. You could buffer the results from each thread in a thread-local buffer with a buffer size equal to the optimal chunksize of the results array, and write these buffers to disk as properly numbered out-of-order chunks of the results array. This also gets around the mismatched-chunksize issue and the wait issue, but it is a bit of a programming/debugging challenge.
  4. You could go the easy road and play with the schedule='dynamic' and chunksize parameters of prange (see the sketch after this list). Depending on the structure of the data you are processing, this can help alleviate the slow-chunk wait issue by assigning "super-chunks" (consisting of multiple chunks) to the threads. If you only wait for each super-chunk, differences in the processing time of individual chunks may (or may not) average out over the super-chunk. Not really a good all-purpose solution, but maybe useful in some particular cases.
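
A minimal sketch combining items 1 and 4 (illustrative names; process_one stands in for the per-chunk decompress-and-aggregate work):

```cython
from cython.parallel import prange

cdef void process_one(int i, double* out) nogil:
    out[0] = 0.0  # stand-in for the real per-chunk work on chunk i

def process(int nchunks, double[:] results):
    cdef int i
    # schedule='dynamic' hands each thread `chunksize` iterations at a
    # time, so one slow chunk no longer stalls a fixed half of the range
    # (item 4); writing through the loop index i keeps every result
    # addressable even though chunks finish out of order (item 1).
    for i in prange(nchunks, nogil=True, schedule='dynamic', chunksize=4):
        process_one(i, &results[i])
```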

I have a semi-working implementation of the thread-local buffer approach in an unpublished multiprocessing bquery branch. Let me know if you are interested in using it as a quarry for code segments.

Of course, in all cases of bcolz chunk processing, even a sub-optimal multiprocessing solution is likely to be faster than a single-core solution. It's just the scalability with the number of cores that suffers, right?
