Optimise HashingEncoder for both large and small dataframes #428
Conversation
Thanks for that contribution! It seems to improve the times a lot.
I tried to understand everything and left some comments where further clarification is necessary.
category_encoders/hashing.py
Outdated
@@ -101,9 +102,10 @@ class HashingEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):
    """
    prefit_ordinal = False
    encoding_relation = util.EncodingRelation.ONE_TO_M
    default_int_np_array = np.array(np.zeros((2,2), dtype='int'))
why is this 2x2?
I'm not sure 😅 I am just using it to get the dtype of the default int array in that line, so it could be 1 by 1!
Maybe it would make sense to change it to 1 by 1, as that would be sort of a minimal example. Also probably add a comment that you only need it for the datatype, so people won't wonder why it is there.
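For illustration, a minimal sketch of what that suggestion could look like (the explanatory comment is hypothetical, not from the PR):

```python
import numpy as np

# Only used to look up the platform's default integer dtype;
# a 1x1 array is the minimal example, its contents are never read.
default_int_np_array = np.zeros((1, 1), dtype='int')
```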
category_encoders/hashing.py
Outdated
n_process = []
chunk_size = int(len(np_df)/self.max_process)
ctx = multiprocessing.get_context(self.process_creation_method)
for i in range(0, self.max_process-1):
Splitting this way is not very elegant if the last chunk has to be treated separately. Using numpy's array split could be helpful: https://stackoverflow.com/a/75981560.
Wouldn't it be easier if the `hash_chunk` function would hash a chunk and return an array? Then it wouldn't need the `shm_result` and `shm_offset` parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?
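To illustrate the array-split point, a standalone sketch (the variable names mirror the diff above, but this is not the encoder's code):

```python
import numpy as np

np_df = np.arange(10)
max_process = 3

# Manual chunking as in the diff: the last chunk needs special treatment
# because len(np_df) is usually not a multiple of max_process.
chunk_size = int(len(np_df) / max_process)
manual = [np_df[i * chunk_size:(i + 1) * chunk_size] for i in range(max_process - 1)]
manual.append(np_df[(max_process - 1) * chunk_size:])

# np.array_split handles the uneven remainder by itself:
chunks = np.array_split(np_df, max_process)
# -> [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]
```

Each chunk could then be hashed into its own array and the pieces joined with `np.concatenate`, instead of passing offsets into a shared result around.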
`shm` stands for "shared_memory" - will update the variable name 👍
> Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array. Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?
That's a very good point, I will try that
I"ve seen your implementation of the mulitprocess pool (here https://github.com/bkhant1/category_encoders/compare/all_optis...bkhant1:category_encoders:multiproc_pool?expand=1) and like it a lot. I think it is very clean and you should add it to the PR
one more question: you've removed all the data-lock stuff for multi-processing, right. Was this not necessary? |
Hi @PaulWestenthanner ! Thanks for your review!
It was necessary because all processes were writing to the same queue. With shared memory they write to their own section of the shared memory, so there is no concurrency to manage. I think I replied to all your comments/addressed them. The one outstanding thing is your suggestion to have `hash_chunk` return an array and concatenate the chunks at the end.
It's a little more complex than that, but here's a PR in my fork showing what I tried. It could be a little slower but the code is cleaner. I think I should add this commit to this PR, but let me know what you think!
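For context, a rough sketch of what a pool-based version of the chunk processing could look like (the worker here is a stand-in, not the code from the linked branch):

```python
import multiprocessing

import numpy as np

def hash_chunk(chunk):
    """Stand-in worker: the real function would apply the hashing trick to the chunk."""
    return np.ones((len(chunk), 8), dtype=int)

if __name__ == '__main__':
    np_df = np.arange(20).reshape(10, 2)
    chunks = np.array_split(np_df, 4)                 # no special case for the last chunk
    ctx = multiprocessing.get_context('fork')         # POSIX-only; use 'spawn' on Windows
    with ctx.Pool(processes=4) as pool:
        hashed_chunks = pool.map(hash_chunk, chunks)
    result = np.concatenate(hashed_chunks)            # shape (10, 8)
```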
Thanks for the changes. Looks really cool now.
Could you maybe also add a high-level summary to the changelog in the unreleased section (https://github.com/scikit-learn-contrib/category_encoders/blob/master/CHANGELOG.md)?
It makes for cleaner, easier to read code, at very little performance impact.
I added the …
Looks good! Thanks for this improvement! Makes it both more readable and faster.
Merged commit 5c94e27 into scikit-learn-contrib:master
I used the HashingEncoder recently and found it weird that any call to `fit` or `transform`, even for a dataframe with only 10s of rows and a couple of columns, took at least 2s... I also had quite a large amount of data to encode, and that took a long time.
That got me started on improving the performance of HashingEncoder, and here's the result! There are quite a few changes in there; each individual change should be in its own commit. Here's a summary of the performance gain on my machine (macOS Monterey, i7 2.3 GHz).
The notebook that produced that table can be found here
Proposed Changes
The changes are listed by commit.
Add a simple non-regression HashEncoder test
To make sure I am not breaking it.
In HashingEncoder process the df as a numpy array instead of using apply
It has no direct impact on performance, but it allows accessing the memory layout of the dataframe directly. That in turn allows using shared memory to communicate between processes instead of a data queue, which does improve performance.
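A minimal sketch of the idea, with a toy dataframe and a placeholder in place of the real hashing trick (column names and the placeholder are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'blue', 'red'], 'size': ['S', 'M', 'L']})
n_components = 8

# Work on the raw values instead of going through df.apply row by row.
np_df = df.to_numpy()                                     # plain (n_rows, n_cols) array
result = np.zeros((len(np_df), n_components), dtype=int)  # pre-allocated output
for i, row in enumerate(np_df):
    # Placeholder: the real code hashes each value in the row into a column index.
    result[i, hash(tuple(row)) % n_components] += 1
encoded = pd.DataFrame(result, columns=[f'col_{j}' for j in range(n_components)], index=df.index)
```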
In HashEncoder use shared memory instead of queue for multiprocessing
It is faster to write directly to memory than to have the data transit through a queue.
The multiprocessing method is similar to what it was with queues: the dataframe is split into chunks, and each process applies the hashing trick to its chunk of the dataframe. Instead of writing the result to a queue, it writes it directly into a shared memory segment, which is also the underlying memory of a numpy array that is used to build the output dataframe. Tested on macOS Monterey (2.3 GHz quad-core i7), this is between 1.3 and 1.6 times faster depending on the dataframe size and how many processes are used.
Allow forking processes instead of spawning them and make it the default
This makes the HashEncoder transform method a lot faster on small datasets.
The spawn process creation method creates a new python interpreter from scratch and re-imports all required modules. In a minimal case (only pandas and category_encoders.hashing are imported) this adds a ~2s overhead to any call to transform.
Fork creates a copy of the current process, and that's it. It is unsafe to use with threads, locks, file descriptors, ... but in that case the only thing the forked process will do is process some data and write it to ITS OWN segment of a shared memory. It is a lot faster as pandas doesn't have to be re-imported (around 20ms?).
It might take up more memory, as more than the necessary variables (the largest one by far being the HashEncoder instance, which includes the user dataframe) will be copied. Add the option to use spawn instead of fork to potentially save some memory.
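In standard-library terms the choice comes down to which multiprocessing context is used (a sketch; `process_creation_method` is the parameter visible in the diff above, the worker below is a stand-in):

```python
import multiprocessing

def work(chunk_id):
    print(f'hashing chunk {chunk_id}')   # stand-in for the real per-chunk work

if __name__ == '__main__':
    # 'fork' copies the current process: fast (pandas stays imported), POSIX-only,
    # and unsafe with threads/locks/open file descriptors.
    # 'spawn' starts a fresh interpreter and re-imports every module: the ~2s
    # overhead mentioned above, but safer and available on Windows.
    ctx = multiprocessing.get_context('fork')    # pass 'spawn' to opt out of forking
    processes = [ctx.Process(target=work, args=(i,)) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```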
Remove python 2 check code and use hashlib more efficiently
Python 2 is not supported on master, so the check isn't useful.
Create int indexes from the hashlib bytes digest instead of the hex digest, as it's faster.
Call the md5 hashlib constructor directly instead of new('md5'), which is also faster.
Overall the performance gain is about 40% to 60%.
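A sketch of the kind of change described, with an illustrative value (`n_components` stands in for the number of output columns; both expressions land in the same bucket):

```python
import hashlib

value = 'some category'
n_components = 8

# Before: generic constructor plus hex digest, which goes through a string.
column_old = int(hashlib.new('md5', value.encode('utf-8')).hexdigest(), 16) % n_components

# After: direct md5 constructor plus bytes digest.
column_new = int.from_bytes(hashlib.md5(value.encode('utf-8')).digest(), byteorder='big') % n_components

assert column_old == column_new   # same column index, less conversion overhead
```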