Multi-threaded implementation #12
Conversation
This looks fantastic! It is going to take me a little while to find the time to work through this properly and review it, but please don't mistake that delay for a lack of excitement about this pull request! Definitely looking forward to digging into this one, and would be extremely happy to get this merged.
@tomwhite I still want to have some time to play around with this, but I have at least gone through and satisfied myself that I understand how it all works, and wanted to get some initial feedback to you as soon as possible.

First and foremost, this looks great! It is at heart very simple (which is a definite positive), and so fairly easy to follow, but very slickly done. I appreciate that you've kept the code very clean and that it matches nicely with the single-threaded version.

On the question of going any further: I think it definitely looks worth doing, and I welcome whatever you view as appropriate for making it fit with the NNDescent object. I would also be happy to discuss options if you like. I haven't had the opportunity to do much benchmarking myself, but it sounds as if the single-threaded case might be worth keeping for machines with 4 cores or fewer (depending on what the overhead is like for the threaded case).

Unless I've missed something, it looks like your threaded version does not currently make use of the RP-trees when initialising the NN-descent. At least in the single-threaded case this has been quite beneficial for performance, and it is possible that it might get the threaded performance matching in the 4-or-fewer-core case. I will have to experiment and try some benchmarking to know for sure.

Lastly, given how well this works, I would be interested in your thoughts on how well this might extend to working with dask (which has a concurrent.futures-like interface) for out-of-core versions of pynndescent?

P.S.: Ignore the Codacy complaint. I turned it on some time ago and was unimpressed, but forgot to turn it off. It's off now.
One other thing to note: since about version 0.38, Numba has supported passing jitted functions as parameters. This obviates the need for the extra layer of indirection currently used to pass functions around.
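(For illustration, the capability looks roughly like this; a generic sketch, not code from this PR:)

```python
from numba import njit

@njit
def euclidean(x, y):
    # A simple jitted metric used only to demonstrate the pattern.
    total = 0.0
    for i in range(x.shape[0]):
        total += (x[i] - y[i]) ** 2
    return total ** 0.5

@njit
def distance_between(metric, x, y):
    # Since Numba ~0.38, a jitted function can be passed directly as
    # an argument to another jitted function.
    return metric(x, y)

# Usage: distance_between(euclidean, np.ones(3), np.zeros(3))
```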
@lmcinnes thanks for looking at the code, and for your kind comments. Let me address your points in turn.

To make it fit with the NNDescent object, I thought having an option on the object to select the threaded implementation would be the simplest approach, and I'm happy to discuss alternatives.

Regarding the existing code path, I agree that it is worth keeping, since it is faster for fewer than 4 cores, and possibly for even more cores with RP-trees. You are correct that I haven't looked at parallelising RP-trees, but I think there are various ways that it would be possible, the simplest being initialising each tree in a separate thread. That could be a later addition, I think. BTW, for benchmarking I've been using these scripts; if they are generally useful I'd be happy to contribute them too.

Regarding Dask, I'm very interested in this direction; in fact, running on a distributed engine like Dask or Spark was my initial motivation for optimising pynndescent (for Scanpy). The MapReduce technique naturally transfers to these engines, although there may be challenges regarding the amount of data that needs to be shuffled for each iteration. It is well worth investigating for the potential gains in scaling to very large datasets, though.

Finally, thanks for the info about the Numba upgrade. I'll rebase this PR to take advantage of being able to pass jitted functions directly.
I have been doing some benchmarking on datasets I expect to be more representative of real data than purely random data, and the performance numbers seem to be comparable to what you describe, so I am fairly confident about this at this point. I am ready to hit the merge button as soon as you like -- did you have anything else you wanted to add before merging, or should we just get this merged and then look at anything else later?
@lmcinnes I don't have anything else to add before merging. I'll look at following sklearn's conventions as a follow-up.
Excellent, thanks for all your work on this! I will likely be porting some or all of this over to umap at some point soon as well.
Thanks @lmcinnes! I'll submit some follow-up PRs that are related to this one. Re umap, are there plans to have it use pynndescent as a dependency at some point? I'd be happy to help in that effort.
I look forward to future PRs! Adding a dependency on pynndescent is tempting from the point of view of removing duplication, but it does mean I would have to get pynndescent into a rather better state (testing, documentation, functionality). Right now umap is both in heavy development and also widely used, so I was hoping to get umap and its associated new features into a good state before setting to work on pynndescent in earnest and getting it to where it needs to be.
Currently, pynndescent doesn't take full advantage of multiple cores on a machine. There are some methods annotated with the Numba parallel option, but these don't provide any noticeable speedup when more cores are available.
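(For context, the Numba parallel option in general looks roughly like the following; this is an illustrative sketch, not code from pynndescent itself:)

```python
import numpy as np
from numba import njit, prange

# Illustrative only: the general shape of Numba's parallel annotation.
@njit(parallel=True)
def row_norms(data):
    n = data.shape[0]
    out = np.empty(n)
    for i in prange(n):  # prange asks Numba to split iterations across cores
        s = 0.0
        for j in range(data.shape[1]):
            s += data[i, j] ** 2
        out[i] = np.sqrt(s)
    return out
```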
This PR allows nearest neighbor computation to take advantage of multiple cores by processing heaps in parallel chunks. The basic idea is an application of the MapReduce approach described in section 2.8 of "Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures" by Dong et al.
Note that in this PR all computation is carried out on a single machine using the standard Python concurrent.futures library, which complements Numba very well, as shown in this example.
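(As a rough illustration of that pattern -- a minimal sketch with invented function names, not the PR's actual code -- a function jitted with `nogil=True` releases the GIL, so ordinary Python threads can run it on separate cores in parallel:)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from numba import njit

# nogil=True lets the jitted function release the GIL, so Python
# threads from a ThreadPoolExecutor can run it concurrently.
@njit(nogil=True)
def chunk_sum(data, start, end):
    total = 0.0
    for i in range(start, end):
        total += data[i]
    return total

def parallel_sum(data, n_threads=4):
    # Split the array into contiguous chunks, one per thread.
    bounds = np.linspace(0, len(data), n_threads + 1).astype(np.int64)
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [
            executor.submit(chunk_sum, data, bounds[i], bounds[i + 1])
            for i in range(n_threads)
        ]
        return sum(f.result() for f in futures)

# Usage: parallel_sum(np.random.rand(1_000_000))
```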
Going into slightly more detail, a distributed heap can be updated in a MapReduce-style operation as follows.
We have a large array of all the heap updates held in memory and a thread pool to update chunks in parallel in two phases, called "Map" and "Reduce". Each Map partition holds an array of heap updates, corresponding to a row in the following diagram.
The Map phase adds updates in the order they were generated (from `heap_push` calls). After all the updates have been added for a partition, the updates for that partition are sorted by row number, and the chunk boundaries corresponding to partitions are found, so that in the Reduce phase all the updates for a given partition can be processed together and applied to the target heap (not illustrated). In the Map phase the array is processed row-wise, and in the Reduce phase it is processed column-wise (although the column bounds differ for each row owing to the different chunk boundaries). The result is an updated distributed heap.
This PR also adds tests that check that the threaded implementation produces the same result as the regular implementation. Since NN-Descent uses a source of randomness, the tests set the random seed to the value of each row index to ensure the same deterministic result is obtained; this per-row seeding is only used for tests. To make threaded runs more reproducible when run normally (outside of tests), each thread is seeded from the initial random seed, so two threaded runs with the same number of threads produce the same result if the same initial seed is used in each case.
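(One way the per-thread seeding could look; a sketch in which deriving each worker's seed as `initial_seed + thread index` is an assumption for illustration, and the PR's exact scheme may differ:)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def run_threaded(initial_seed, n_threads, work):
    # Each worker gets its own RandomState derived deterministically
    # from the initial seed, so two runs with the same seed and the
    # same thread count produce the same per-thread random streams.
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [
            executor.submit(work, np.random.RandomState(initial_seed + i))
            for i in range(n_threads)
        ]
        return [f.result() for f in futures]

# Usage: run_threaded(42, 4, lambda rng: rng.randint(0, 100, size=3).tolist())
# yields identical output on repeated calls with the same arguments.
```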
In terms of performance, the multi-threaded implementation outperforms the regular implementation with 4 cores or more. In one experiment, regular pynndescent took 27 minutes for a million rows (D=128, NN=25), compared to just over 3 minutes with 64 threads.
More work is needed to wire up the implementation to the standard NNDescent object, but I wanted to get feedback on this approach before going any further.