Parallelised string factorisation #22
Conversation
Shared variables with manual locking:

- hash table
- count
- reverse_keys
- reverse_values
- out_buffer
- chunk_

Shared variables without locking requirement:

- locks

Thread-local variables:

- thread_id
- in_buffer_ptr (points to thread-local buffer)
- out_buffer_ptr (points to thread-local buffer)

Locking scheme:

- For each thread a lock on the hash table (and other associated shared variables) exists.
- Each thread processing a chunk begins by acquiring its own lock on the shared hash table.
- The lock is released when the thread encounters a value that is new to the hash table.
- Once the thread is ready to write to the hash table, it waits to acquire the locks of all threads.
- After the write all locks are released.

---

Uncompressed bcolz timings:

```
--- uncached unique() ---

pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 2.1 s per loop
==> x2.38 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 834 ms per loop
==> x1.05 FASTER than pandas

---- cache_factor ---

bquery master over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.51 s per loop

pull request with 2 threads over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.16 s per loop
==> x2.16 faster than master

pull request with 1 thread over bcolz (persistent):
In [3]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 1.69 s per loop
==> x1.48 faster than master (c.f. x1.48 from single-threaded PR visualfabriq#21)
==> parallel code seems to have no performance penalty on single-core machines
```

Compressed bcolz timings:

```
--- uncached unique() ---

pandas (in-memory):
In [10]: %timeit -r 10 c.unique()
1 loops, best of 10: 881 ms per loop

bquery master over bcolz (persistent):
In [12]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 3.39 s per loop
==> x3.85 slower than pandas

pull request over bcolz (persistent):
In [8]: %timeit -r 10 a.unique('mycol')
1 loops, best of 10: 1.9 s per loop
==> x2.16 slower than pandas

---- cache_factor ---

bquery master over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 4.09 s per loop

pull request with 2 threads over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 2.48 s per loop
==> x1.65 faster than master

pull request with 1 thread over bcolz (persistent):
In [5]: %timeit -r 10 a.cache_factor(['mycol'], refresh=True)
1 loops, best of 10: 3.26 s per loop
==> x1.25 faster than master (c.f. x1.28 from single-threaded PR visualfabriq#21)
```
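For reference, a minimal Python sketch of the locking scheme described above (the real implementation is Cython operating on C-level buffers; `factorize_chunk`, `table` and `reverse_values` are illustrative stand-ins). Each thread holds only its own lock while reading; a writer must collect every lock, which blocks until all readers pause:

```python
import threading

N_THREADS = 4
# One lock per thread; together they guard the shared hash table.
locks = [threading.Lock() for _ in range(N_THREADS)]
table = {}           # shared hash table: value -> factor code
reverse_values = []  # unique values in insertion order

def factorize_chunk(thread_id, chunk, out):
    my_lock = locks[thread_id]
    my_lock.acquire()  # reading the table only needs this thread's lock
    for i, value in enumerate(chunk):
        code = table.get(value)
        if code is None:
            # New value: drop our read lock, then take *all* locks so
            # no thread can read while the table is being mutated.
            my_lock.release()
            for lock in locks:
                lock.acquire()
            # Re-check: another writer may have inserted it meanwhile.
            code = table.setdefault(value, len(table))
            if code == len(reverse_values):
                reverse_values.append(value)
            for lock in locks:
                lock.release()
            my_lock.acquire()  # resume reading under our own lock
        out[i] = code
    my_lock.release()
```

Because every writer acquires the locks in the same global order (and releases its own lock first), two writers cannot deadlock each other.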
timings are similar
@CarstVaartjes Had a slight bug resulting in a faulty factor: chunk results were appended out-of-order, e.g. chunk 0, chunk 50, chunk 1, etc. Possible solutions:
For solution 2: Any preferences or advice?
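Whichever solution is chosen, the underlying problem is the classic one of serialising out-of-order results. A generic sketch for reference (not claiming to be either of the elided solutions above; `emit_in_order` and `write` are hypothetical names):

```python
# Workers can finish chunks in any order; buffer results that arrive
# early and flush them once every preceding chunk has been written.
pending = {}        # chunk index -> result, for chunks that arrived early
next_to_write = 0   # index of the chunk the output expects next

def emit_in_order(chunk_index, result, write):
    global next_to_write
    pending[chunk_index] = result
    while next_to_write in pending:
        write(pending.pop(next_to_write))
        next_to_write += 1
```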
I was wondering about the order with parallelization, actually! (For groupby it's less relevant, but for the factorization it very much is.) I'm going to sleep on it for a night, I think. The great thing about your solution is that it also works for other in-core use cases that we have (the groupby functions), so we should be able to speed those up as well.
exhibits performance issues with unique(), presumably somehow linked to the in-memory labels carray
@CarstVaartjes I managed to find a way to write the carray chunks out-of-order. What is much more painful is that
What I implemented now is forcing
To make matters worse, writing in-memory chunks out-of-order requires a different API. On the whole the code is now in urgent need of some refactoring... but it works... well, sort of: out-of-core factorize() is x2.2 faster than master for uncompressed and x1.75 faster for compressed bcolz in my test case.

What has me completely stumped is that in-memory factorize() is MUCH slower than out-of-core factorize(). It is even slower than the single-threaded code!
Trying to summarize it for myself and thinking out loud:
Discussion points are:
God, this is complicated stuff. Really great insights though. And for most functions on the aggregation side your code could already be used -> I think writing to the same ndarray would still require locking (sort of defeating the parallel purpose there, as each row does write), but instead each thread could have its own array, where we use numpy to add the results of the individual ndarrays together. (For sum and count it will work; for sorted_count_distinct it won't, as that requires sequential reading from a point where you know the previous value was different.)
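A quick numpy sketch of that per-thread-array idea for sum/count-style aggregations (shapes and names are illustrative; as noted above, sorted_count_distinct would not fit this pattern):

```python
import numpy as np

n_threads, n_groups = 4, 10
# Each thread accumulates into its own row, so no locking is needed
# during aggregation; e.g. thread t runs: partials[t, group] += value
partials = np.zeros((n_threads, n_groups), dtype=np.int64)
# ... per-thread aggregation loops fill `partials` here ...
total = partials.sum(axis=0)  # combine the partial results once at the end
```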
broke due to performance debugging
@CarstVaartjes I was implementing one of your ideas and ran into trouble. You wrote:
I am in the process of implementing this and I think it is not possible: since we are using a hash table, the position of the elements in the table does not indicate their order of insertion but rather their hash value. I think the kh_str_t table does not contain the indexes. Am I overlooking a feature of khash here?

@CarstVaartjes @FrancescElies What do you think? It could make for a cleaner implementation since we would not need to carry around the reverse_values pointer. On the other hand, we would have to create a whole slew of hash table implementations (one for each data type) rather than use those already defined.

Re speed: creating the dict from the reverse vector is definitely faster, even for single-threaded operation. See my (now slightly outdated) PR #21 with performance enhancements. This was surprising to me as well: insertion into a dict should take the same time no matter where it happens. I believe the reason for the increased performance is that without Python objects as arguments the helper function can have a nogil signature, which seems to speed things up somehow. Possibly the performance increase manifests only if the GIL is actually released when calling the helper. Or maybe my recollection is just wrong.
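For illustration, a pure-Python sketch of the reverse-vector approach under discussion (a plain dict stands in for the khash table; names are illustrative): record each new value in an insertion-ordered list while factorising, then build the code-to-value dict in one cheap final pass instead of inserting into a Python dict per value.

```python
def factorize(values):
    table = {}           # value -> code (stands in for the khash table)
    reverse_values = []  # unique values in insertion order
    labels = []
    for v in values:
        code = table.get(v)
        if code is None:
            code = len(table)
            table[v] = code
            reverse_values.append(v)
        labels.append(code)
    # one final pass instead of per-value dict insertions during the loop
    reverse = dict(enumerate(reverse_values))
    return labels, reverse
```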
@CarstVaartjes Just wanted to let you know to ignore my previous post. I finally understood how khash works. The keys are indeed stored, in table.keys. Should have been obvious...
@FrancescElies @CarstVaartjes OK, here are a few revisions that can serve as a basis for further discussion:
All revisions pass my own test cases. I finally ferreted out the synchronisation issue I mentioned to @FrancescElies, which was leading to duplicate entries in the reverse dictionary. Performance measurements were done on a 12-character column with 1014 unique values and about 9 million entries in total, with the following commands:
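The exact commands are not preserved above; purely for illustration, a test column with those characteristics could be built along these lines (the bcolz calls follow its documented API, but treat the details as a sketch):

```python
import numpy as np
import bcolz

# ~9 million rows drawn from 1014 unique 12-character strings
rng = np.random.RandomState(0)
uniques = np.array(['%012d' % i for i in range(1014)], dtype='S12')
col = uniques[rng.randint(0, len(uniques), size=9000000)]
ct = bcolz.ctable([col], names=['mycol'], rootdir='mytable', mode='w')
```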
For comparison the current visualfabriq/bquery master (rev 3ec8eb8, 'master') was used as a reference. Uncompressed, in-memory factorisation:
Uncompressed, on-disk factorisation:
Compressed, in-memory factorisation:
Compressed, on-disk factorisation:
Conclusions:
To compare changes disregarding the iterblocks revision with Windows line endings, use:
Hi, apologies for the late answer; we are under some deadline pressure and I am afraid we do not have much spare time at the moment. I just had a first look. Note that I have no practical experience with C++, so please be gentle with my mistakes.

About benchmarking: keeping track of all the different scenarios is going to be difficult. Here are some ideas taken from other projects; maybe we could consider using vbench https://github.com/wesm/pandas/tree/master/vb_suite (pandas) or airspeed velocity http://spacetelescope.github.io/asv/ (under consideration in bcolz, Blosc/bcolz#116).

I saw very nice stuff, even using OMP pragma directives not supported directly in Cython. I once tried something similar but found it a bit tricky to use objects inside prange; it seems like you managed to do that without problems.

Your numba suggestion is also very interesting. This topic will certainly require some time; hopefully that is fine with you.
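To make the asv suggestion concrete, a minimal benchmark file could look like this (asv discovers classes with `time_*` methods; the bquery calls are just the ones from the timings above, so treat the table setup as hypothetical):

```python
class FactorizeSuite(object):
    """asv benchmarks for bquery factorisation (illustrative sketch)."""

    def setup(self):
        import bquery
        self.t = bquery.ctable(rootdir='mytable')  # hypothetical test table

    def time_unique(self):
        self.t.unique('mycol')

    def time_cache_factor(self):
        self.t.cache_factor(['mycol'], refresh=True)
```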