
Multithreaded v1 #37

Closed
wants to merge 17 commits

Conversation

FrancescElies
Contributor

Based on: #24

Most of these modifications come from #22, applied to the int32, int64 and float specific functions (thanks @ARF1).

Non-functional branch; still having some problems.

@ARF1

ARF1 commented May 21, 2015

Oh wow; it seems you are really getting serious about multi-threading. @CarstVaartjes had mentioned that being on the todo list on the "categoricals" issue for bcolz but I thought that was more long-term...

Quick heads up: I have a very strong suspicion that the iterblocks benchmarks in #22 are not representative. I believe I accidentally ran them with an input database that had a MUCH too small chunklen. Since iterblocks is implemented in Python in bcolz, that would obviously be a problem due to too-frequent looping in Python. Rerunning the benchmarks is on my todo list, but I have not had the time yet.
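To make the chunklen concern concrete, here is a minimal pure-Python sketch (a stand-in for bcolz's block iteration, not its actual implementation): the smaller the chunklen, the more often the Python-level loop body runs, so interpreter overhead can dominate a benchmark.

```python
# Hypothetical stand-in for a block iterator: yields successive
# slices of `data`, `chunklen` items at a time.
def iterblocks(data, chunklen):
    for start in range(0, len(data), chunklen):
        yield data[start:start + chunklen]

n = 1_000_000
# With a tiny chunklen the Python-level loop runs orders of magnitude
# more often, so per-iteration interpreter overhead dominates.
small = sum(1 for _ in iterblocks(range(n), chunklen=16))      # 62500 loop iterations
large = sum(1 for _ in iterblocks(range(n), chunklen=65536))   # 16 loop iterations
print(small, large)
```

The real per-block work in bcolz happens in compiled code, so a too-small chunklen inflates exactly the part a benchmark should not be measuring.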

You mention that there are still problems with this branch. Let me know if there is anything I can do to help.

@FrancescElies
Contributor Author

Hi,

thanks for your feedback. I was having problems because I copy-pasted your code without thinking about what was going on. Your branch was really, really helpful; there is no way I could have done this without using it as a reference. (I would have loved to cherry-pick some parts that are exactly the same in both, but I struggled.)

Now it seems to work; at least the tests pass now, though I think some more intensive testing will be needed.
This branch is based on iterblocks, but it could be rebased onto master if in the end it is decided that iterblocks is not the way to go.

For the moment I am working on this, but that could change rapidly; I don't really know, and I cannot give you any specifics.

@ARF1

ARF1 commented May 21, 2015

@FrancescElies I think I am getting far too much credit here. And I feel kind of guilty, because I had a nicer version of the helper function lying around that implemented @CarstVaartjes's suggestion to drop the reverse dict altogether and reconstruct the dict from the hash table:

Recipe for porting to your code:

  1. in the helper function: remove the reverse_values argument (and of course all references to it). It is not needed, all the data is available in the hash table.
  2. in the "non-helper" function: remove the kh_destroy_...(table) line. (table is still needed at this point to construct reverse.)
  3. in the "non-helper" function: remove the for i in range(reverse_values.size()): block and replace with:
    # construct python dict from the hash table and free element memory
    reverse = {}                      # probably not necessary!?
    for j in range(table.n_buckets):
        if not kh_exist_str(table, j):   # adjust function name to hash-table data-type
            continue
        reverse[table.vals[j]] = <char*>table.keys[j]    # adjust casting to hash-table data-type; casting might not be necessary at all for non-string hash-tables
        free(<void*>table.keys[j])           # fairly sure this is not necessary for non-string hash-tables
    kh_destroy_str(table)      # adjust function name to hash-table data-type

You probably still need to tweak this a bit because it was created for the string helper function. (See in-line comments.)
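For reference, the idea of the recipe can be sketched in plain Python (this models the khash table as an ordinary dict mapping key to integer code; in the Cython code the same information lives in table.keys / table.vals):

```python
# Hypothetical pure-Python analogue of the factorization helper.
def factorize(values):
    table = {}            # key -> code, stands in for the khash table
    codes = []
    for v in values:
        code = table.setdefault(v, len(table))
        codes.append(code)
    # Reconstruct the reverse mapping from the table itself, instead of
    # carrying a separate reverse_values vector through the helper.
    reverse = {code: key for key, code in table.items()}
    return codes, reverse

codes, reverse = factorize(["a", "b", "a", "c"])
print(codes)    # [0, 1, 0, 2]
print(reverse)  # {0: 'a', 1: 'b', 2: 'c'}
```

The point is that the hash table already holds every (key, code) pair, so the reverse dict can be rebuilt in one pass at the end rather than maintained incrementally.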

@FrancescElies
Contributor Author

So far, no performance improvements.

@ARF1

ARF1 commented May 21, 2015

No; unless your number of unique values becomes truly huge, I would not expect any performance improvements. C++ vectors are known to be very high-performance data structures.

One situation where you might see minor performance effects is if your factors have large load factors: if almost every item is unique, the vector will grow much more, and re-allocation and copying of existing values will occur more often. But even then, I think its allocated size is grown exponentially (or something similar) to limit the number of re-allocations.
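The geometric-growth behaviour is easy to see with CPython lists, which over-allocate in a similar spirit to C++ `std::vector` (this is an illustration of amortized growth, not of the Cython code in this PR):

```python
import sys

# Count how many times the list's underlying allocation changes while
# appending 100,000 items. Because CPython over-allocates geometrically,
# reallocations happen far less often than appends.
lst = []
reallocations = 0
last_size = sys.getsizeof(lst)
for _ in range(100_000):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last_size:
        reallocations += 1
        last_size = size
print(reallocations)  # far fewer than the 100,000 appends
```

This is why even a high-load-factor input usually costs only a modest number of re-allocation-and-copy passes rather than one per insert.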

I think CarstVaartjes only meant that this would clean up the implementation by eliminating a redundant argument/variable. Personally, I really like the streamlined look of iterblocks...

@FrancescElies
Contributor Author

This has been inactive for a long time now; closing.
