
Multithreaded v1 #37

Closed
wants to merge 17 commits

Conversation

FrancescElies
Contributor

Based on: #24

Most of these modifications come from #22, applied to the int32, int64 and float specific functions (thanks @ARF1).

Non-functional branch; still having some problems.

@ARF1

ARF1 commented May 21, 2015

Oh wow; it seems you are really getting serious about multi-threading. @CarstVaartjes had mentioned that being on the todo list on the "categoricals" issue for bcolz but I thought that was more long-term...

Quick heads up: I have a very strong suspicion that the iterblocks benchmarks in #22 are not representative. I believe I accidentally ran them with an input database that had a MUCH too small chunklen. Since iterblocks is implemented in Python in bcolz, that would obviously be a problem due to too-frequent looping in Python. Rerunning the benchmarks is on my todo list, but I have not had the time yet.
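To make the chunklen concern concrete, here is a minimal pure-Python sketch (a stand-in for bcolz's block iteration, not its actual implementation): the smaller the chunklen, the more often the Python-level loop body runs, so interpreter overhead can dominate a benchmark.

```python
# Hypothetical stand-in for a block iterator: yields successive
# slices of `data`, `chunklen` items at a time.
def iterblocks(data, chunklen):
    for start in range(0, len(data), chunklen):
        yield data[start:start + chunklen]

n = 1_000_000
# With a tiny chunklen the Python-level loop runs orders of magnitude
# more often, so per-iteration interpreter overhead dominates.
small = sum(1 for _ in iterblocks(range(n), chunklen=16))      # 62500 loop iterations
large = sum(1 for _ in iterblocks(range(n), chunklen=65536))   # 16 loop iterations
print(small, large)
```

The real per-block work in bcolz happens in compiled code, so a too-small chunklen inflates exactly the part a benchmark should not be measuring.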

You mention that there are still problems with this branch. Let me know if there is anything I can do to help.

@FrancescElies
Contributor Author

Hi,

thanks for your feedback. I was having problems because I copy-pasted your code without thinking about what was going on. Your branch was really, really helpful; there is no way I could have done this without using it as a reference. (I would have loved to cherry-pick some parts that are exactly the same in both, but I struggled.)

Now it seems to work; at least the tests pass now, though I think some more intensive testing will be needed.
This branch is based on iterblocks, but it could be rebased onto master if in the end it is decided that iterblocks is not the way to go.

For the moment I am working on this, but that could change rapidly; I don't really know, and I cannot give you any specifics.

@ARF1

ARF1 commented May 21, 2015

@FrancescElies I think I am getting far too much credit here. And I feel kind of guilty, because I had a nicer version of the helper function lying around that implemented @CarstVaartjes's suggestion to drop the reverse dict altogether and reconstruct the dict from the hash table:

Recipe for porting to your code:

  1. in the helper function: remove the reverse_values argument (and of course all references to it). It is not needed, all the data is available in the hash table.
  2. in the "non-helper" function: remove the kh_destroy_...(table) line. (table is still needed at this point to construct reverse.)
  3. in the "non-helper" function: remove the for i in range(reverse_values.size()): block and replace with:
    # construct python dict from the hash table and free element memory
    reverse = {}                      # probably not necessary!?
    for j in range(table.n_buckets):
        if not kh_exist_str(table, j):   # adjust function name to hash-table data-type
            continue
        reverse[table.vals[j]] = <char*>table.keys[j]    # adjust casting to hash-table data-type; casting might not be necessary at all for non-string hash-tables
        free(<void*>table.keys[j])           # fairly sure this is not necessary for non-string hash-tables
    kh_destroy_str(table)      # adjust function name to hash-table data-type

You probably still need to tweak this a bit because it was created for the string helper function. (See in-line comments.)
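For reference, the idea of the recipe can be sketched in plain Python (this models the khash table as an ordinary dict mapping key to integer code; in the Cython code the same information lives in table.keys / table.vals):

```python
# Hypothetical pure-Python analogue of the factorization helper.
def factorize(values):
    table = {}            # key -> code, stands in for the khash table
    codes = []
    for v in values:
        code = table.setdefault(v, len(table))
        codes.append(code)
    # Reconstruct the reverse mapping from the table itself, instead of
    # carrying a separate reverse_values vector through the helper.
    reverse = {code: key for key, code in table.items()}
    return codes, reverse

codes, reverse = factorize(["a", "b", "a", "c"])
print(codes)    # [0, 1, 0, 2]
print(reverse)  # {0: 'a', 1: 'b', 2: 'c'}
```

The point is that the hash table already holds every (key, code) pair, so the reverse dict can be rebuilt in one pass at the end rather than maintained incrementally.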

@FrancescElies
Contributor Author

So far, no performance improvements.

@ARF1

ARF1 commented May 21, 2015

No; unless your number of unique values becomes truly huge, I would not expect any performance improvements. C++ vectors are known to be very high-performance data structures.

One situation where you might see minor performance effects is if your factors have large load factors: if almost every item is unique, the vector will grow much more, and re-allocation and copying of existing values will occur more often. But even then, I think its allocated size is grown exponentially (or something similar) to limit the number of re-allocations.
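The geometric-growth behaviour is easy to see with CPython lists, which over-allocate in a similar spirit to C++ `std::vector` (this is an illustration of amortized growth, not of the Cython code in this PR):

```python
import sys

# Count how many times the list's underlying allocation changes while
# appending 100,000 items. Because CPython over-allocates geometrically,
# reallocations happen far less often than appends.
lst = []
reallocations = 0
last_size = sys.getsizeof(lst)
for _ in range(100_000):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last_size:
        reallocations += 1
        last_size = size
print(reallocations)  # far fewer than the 100,000 appends
```

This is why even a high-load-factor input usually costs only a modest number of re-allocation-and-copy passes rather than one per insert.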

I think CarstVaartjes only meant that this would clean up the implementation by eliminating a redundant argument/variable. Personally, I really like the streamlined look of iterblocks...

@FrancescElies
Contributor Author

This has been inactive for a long time now; closing.
