Adjust parallelization of TwoComponentReaction node to significantly reduce memory usage #162
Previously, the TwoComponentReaction node submitted tasks to an executor service as follows: one task for each element in the first input column. Each task then ran the reaction of that element (the reactant) with all reactants of the second input column, so the output of a task was the list of reaction results for all of these pairings. That output needs to be kept in memory until it has been written out.
Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot of results in memory.
The thread pool is configured to use roughly twice as many threads as there are CPU cores, so on a 4-core CPU about 8 tasks run in parallel and at least 8 large results need to be kept in memory at the same time.
Changed with this commit: each reaction is handled as an individual task. While this might increase the bookkeeping overhead, it ensures that far fewer results need to be kept in memory. In practice this showed much better performance, because the operating system doesn't need to manage too much memory (the memory in question is outside of the JVM, since RDKit molecules live in native code accessed via JNI).
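
To illustrate the difference between the two submission schemes, here is a minimal, self-contained sketch. It is not the node's actual code: `PerPairReactionSketch`, `react`, `runPerRow`, `runPerPair` and `await` are hypothetical names, strings stand in for RDKit molecules, and the results are collected into a list only so the two schemes can be compared (the real node writes results out instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerPairReactionSketch {

    // Hypothetical stand-in for running one reaction on a pair of reactants.
    static String react(String a, String b) {
        return a + "+" + b;
    }

    // Old scheme: one task per element of the first column.
    // Each task returns the results for ALL pairings with the second column,
    // so every running task holds a list as long as the second column.
    static List<String> runPerRow(List<String> first, List<String> second, ExecutorService pool) {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (String a : first) {
            futures.add(pool.submit(() -> {
                List<String> rowResults = new ArrayList<>();
                for (String b : second) {
                    rowResults.add(react(a, b));
                }
                return rowResults;
            }));
        }
        List<String> written = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            written.addAll(await(f));
        }
        return written;
    }

    // New scheme: one task per pairing, so each task holds a single result.
    static List<String> runPerPair(List<String> first, List<String> second, ExecutorService pool) {
        List<Future<String>> futures = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                futures.add(pool.submit(() -> react(a, b)));
            }
        }
        List<String> written = new ArrayList<>();
        for (Future<String> f : futures) {
            written.add(await(f));
        }
        return written;
    }

    // Waits for a task and converts checked exceptions into unchecked ones.
    static <T> T await(Future<T> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Mirrors the "~2x as many threads as CPU cores" pool sizing described above.
        int threads = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> first = List.of("A", "B");
            List<String> second = List.of("X", "Y", "Z");
            System.out.println(runPerRow(first, second, pool));
            System.out.println(runPerPair(first, second, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

Both methods produce the same results in the same order. The difference is the size of what each in-flight task holds: a whole row of pairings in `runPerRow` versus a single result in `runPerPair`, at the cost of more submitted tasks.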