ENH: Implemented performance speedup for binary ReliefF + bug fixes #79
Conversation
# nopython = True will prevent a silent fallback if the function cannot
# be numba compiled for whatever reason.
@jit(nopython=True)
def find_neighbors_binary(
This is the core of the performance improvements for neighbor identification: the neighbor-finding function is compiled with numba, and the calls to it are parallelized in batches.
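Roughly, the pattern is sketched below. The precomputed distance matrix, the use of joblib for the batch parallelism, and the helper names other than find_neighbors_binary are illustrative assumptions, not the exact code in this PR.

# Sketch only: numba-compile the per-row neighbor search and fan the rows
# out in batches. Helper names and the joblib batching are assumptions.
import numpy as np
from numba import jit
from joblib import Parallel, delayed


@jit(nopython=True)  # fail loudly instead of silently falling back to object mode
def find_neighbors_binary(distances, row_idx, n_neighbors):
    """Return indices of the n_neighbors nearest rows to row_idx."""
    order = np.argsort(distances[row_idx])
    return order[1:n_neighbors + 1]  # position 0 is the instance itself


def _neighbors_for_batch(distances, rows, n_neighbors):
    return [find_neighbors_binary(distances, i, n_neighbors) for i in rows]


def find_all_neighbors(distances, n_neighbors, batch_size=256, n_jobs=-1):
    """Batch-parallel neighbor identification over every row."""
    n_rows = distances.shape[0]
    batches = [range(start, min(start + batch_size, n_rows))
               for start in range(0, n_rows, batch_size)]
    per_batch = Parallel(n_jobs=n_jobs)(
        delayed(_neighbors_for_batch)(distances, b, n_neighbors) for b in batches)
    return np.array([nn for batch in per_batch for nn in batch])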
@@ -113,6 +113,42 @@ def find_neighbors_binary(
    return np.array(nn_list)


# interestingly, this is sometimes faster without numba
def compute_score_binary(
This is the core of the performance improvements for score computation - it expresses the computation as a sequence of numpy matrix operations.
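A minimal sketch of what "a sequence of numpy matrix operations" means for the binary case is below; the signature and variable names are assumptions for illustration and the actual compute_score_binary in this PR may differ.

# Sketch of a vectorized binary ReliefF score contribution for one instance.
import numpy as np


def compute_score_binary(X, hit_idx, miss_idx, inst_idx):
    """X: (n_rows, n_attrs) binary feature matrix.
    hit_idx / miss_idx: indices of same-class / different-class neighbors.
    inst_idx: index of the target instance."""
    target = X[inst_idx]                               # (n_attrs,)
    # For binary attributes, "differs" is just inequality, so per-attribute
    # difference counts collapse to column sums of a boolean matrix.
    hit_diffs = (X[hit_idx] != target).sum(axis=0)     # (n_attrs,)
    miss_diffs = (X[miss_idx] != target).sum(axis=0)   # (n_attrs,)
    return miss_diffs / len(miss_idx) - hit_diffs / len(hit_idx)

# Accumulating these contributions over all instances (and dividing by the
# number of rows) gives the familiar ReliefF attribute weights.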
Context
TL;DR: I was able to implement significant performance improvements for ReliefF on binary and discrete data.
For a GAMETES-generated binary-class discrete data file with 3200 rows and 400 attributes, the timings are:
- ReliefF w/ 10 neighbors, before: 59.05 seconds (mean) | 0.73 seconds (std) (n=4 runs)
- ReliefF w/ 10 neighbors, after: 12.68 seconds (mean) | 0.65 seconds (std) (n=4 runs)
Process
I set out to see whether I could speed up the implementation of these algorithms. I had a few hypotheses:
My general plan of action was as follows:
Changes in This PR
For benchmarking I wrote a simple tool that takes some of the common test cases and runs timeit repeatedly. I drop the results of the first run to avoid measuring compilation overhead, which can be quite high for numba when we're using small testing datasets. The tool will print results but also dump them to a csv file in the same directory. To run this tool,
python performance_tests.py
Additionally, you can run a parameter sweep to estimate how performance varies across different parameter values with
python performance_tests.py sweep
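The harness follows the pattern sketched below; performance_tests.py is the real entry point, but the test case, dataset path, and CSV layout shown here are placeholders rather than the script's exact contents.

# Sketch of a timeit-based benchmark that drops the first run to avoid
# counting numba JIT compilation overhead. Names and paths are illustrative.
import csv
import timeit

import numpy as np


def benchmark(label, stmt, setup, repeats=5):
    timings = timeit.repeat(stmt, setup=setup, repeat=repeats, number=1)
    timings = timings[1:]  # first run includes compilation overhead
    return label, float(np.mean(timings)), float(np.std(timings))


if __name__ == "__main__":
    setup = (
        "from skrebate import ReliefF\n"
        "import pandas as pd\n"
        "data = pd.read_csv('data/gametes_binary.csv')\n"  # placeholder path
        "X, y = data.drop('Class', axis=1).values, data['Class'].values"
    )
    rows = [benchmark("ReliefF_10_neighbors",
                      "ReliefF(n_neighbors=10).fit(X, y)", setup)]
    for row in rows:
        print(row)
    with open("performance_results.csv", "w", newline="") as fh:
        csv.writer(fh).writerows([("case", "mean_s", "std_s"), *rows])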
I also wrote a simple shell script to automate the generation of profiling graphs using Graphviz. This is available at
./run_performance_benchmark.sh
and creates a performance csv for the selected test cases, a python cProfile data file, and a png performance graph.
Using these tools I confirmed that the bulk of the runtime was spent finding relevant neighbors for the Relief algorithm and scoring them. This is unsurprising, as those functions are called roughly O(num_rows^2) times and had not been performance optimized.
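The cProfile data file can also be inspected directly without the shell script; a minimal sketch (the .prof filename is an assumption):

import pstats

stats = pstats.Stats("performance_tests.prof")
stats.sort_stats("cumulative").print_stats(15)  # top 15 hot spots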
Additional Notes & Results
In the interests of time I focused on a proof of concept for ReliefF with binary features and discrete data. The main scoring functions in scoring_utils.py were indeed very complex, with a lot of branching and non-performant operations. There is an elegance to the centralization, but I think dedicated scoring functions are not only clearer to a reader but also allow for dedicated optimization. My approach was to create a dedicated compute_score_binary function that uses optimized numpy operations. I also parallelized the selection of neighbors and optimized the neighbor selection function with numba. This required refactoring some code, but I aimed to keep the function contracts the same so that the rest of the codebase would not be affected.
Lastly, all tests were failing when I checked out the repo. I determined this was because Imputer has been removed from sklearn, and I updated the code to the new SimpleImputer. After this change only 5 tests still fail, but looking at recent repo history I don't believe all tests had been passing before. The remaining failures appear to be a more general bug in how the algorithms treat data with a mixture of continuous and discrete attributes; I did not attempt to fix this problem.
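For reference, the deprecation fix is the standard scikit-learn substitution (the constructor arguments shown are just the defaults; whatever strategy the existing code used carries over unchanged):

# Old import (removed from modern scikit-learn):
#   from sklearn.preprocessing import Imputer
# Replacement used in this PR:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()  # strategy/arguments carry over from the old Imputer call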