Adjust parallelization of TwoComponentReaction node to significantly reduce memory usage #162

Open
wants to merge 1 commit into
base: master

Conversation

chaubold
Contributor

@chaubold chaubold commented Mar 31, 2025

Previously, TwoComponentReaction submitted tasks to an executor service using the following scheme: one task per element of the first input column. Each task performed the reaction of that element (the reactant) with every reactant in the second input column, so its output was the list of reaction results for all of these pairings. That output has to be kept in memory until it has been written out.

Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot of results in memory.

The thread pool is configured to use roughly twice as many threads as there are CPU cores. On a 4-core CPU, for example, 8 tasks run in parallel, so at least 8 large result lists have to be kept in memory at the same time.
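To make the old scheme concrete, here is a minimal sketch of per-row task submission. It is not the node's actual code: `PerRowTasks`, `react`, and the use of strings in place of RDKit molecules are hypothetical stand-ins, and only the "one task per element of the first column" shape and the roughly 2x-cores pool size come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerRowTasks {
    // Hypothetical stand-in for running a reaction on two reactants.
    static String react(String a, String b) {
        return a + "." + b;
    }

    // Pool size of about twice the core count, as described in the text.
    static int poolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    // One task per element of the first column. Each task builds the full
    // list of products against the second column before returning it.
    static List<List<String>> run(List<String> first, List<String> second) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String a : first) {
                Callable<List<String>> task = () -> {
                    List<String> products = new ArrayList<>(second.size());
                    for (String b : second) {
                        products.add(react(a, b));
                    }
                    return products; // whole list stays alive until consumed
                };
                futures.add(pool.submit(task));
            }
            List<List<String>> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, every running task holds a list as long as the second column, so the number of live results scales with the pool size times the number of rows in the second column.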

Changed with this commit: each reaction is handled as an individual task. While this might increase the bookkeeping overhead, it ensures that far fewer results need to be kept in memory. In practice this showed much better performance, because the operating system doesn't need to manage as much memory (this memory is outside of the JVM: the RDKit molecules live in native code accessed via JNI).
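The per-reaction granularity can be sketched as follows. This is an illustration under assumptions, not the code in this commit: `PerPairTasks`, `react`, the `writer` callback, and strings in place of RDKit molecules are all hypothetical. The sliding window of at most `maxInFlight` pending futures is also an assumption of this sketch, since the description does not say how the node limits queued work.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class PerPairTasks {
    // Hypothetical stand-in for running a reaction on two reactants.
    static String react(String a, String b) {
        return a + "." + b;
    }

    // One task per (first, second) pairing. Results are handed to the writer
    // in submission order, and at most maxInFlight futures are pending.
    static void run(List<String> first, List<String> second,
                    int maxInFlight, Consumer<String> writer) {
        int threads = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Deque<Future<String>> window = new ArrayDeque<>();
        try {
            for (String a : first) {
                for (String b : second) {
                    // Drain the oldest result before submitting more work.
                    if (window.size() >= maxInFlight) {
                        writer.accept(window.removeFirst().get());
                    }
                    Callable<String> task = () -> react(a, b);
                    window.addLast(pool.submit(task));
                }
            }
            // Flush whatever is still pending.
            while (!window.isEmpty()) {
                writer.accept(window.removeFirst().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Here each task holds a single result, so the number of results alive at once is bounded by `maxInFlight` rather than by the pool size times the length of the second column. The cost is one future per pairing, which is the bookkeeping overhead mentioned above.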

…reduce memory usage

The TwoComponentReaction submitted tasks for an executor service in the following scheme:
one task for each element in the first input column. The task then performed the reaction
of this element (=reactant) with all reactants of a second input column. So the output of
this task is the list of reaction results of all these pairings. The output needs to be
kept in memory until it has been written out.

Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot
of results in memory.

The thread pool is configured to use ~2x as many threads as there are CPU cores, so if
there's a 4 core CPU this means 8 tasks are running in parallel, so at least 8 large
results need to be kept in memory.

Changed with this commit: each reaction is handled as an individual task. While this might
increase the bookkeeping overhead, it makes sure that far fewer results need to be kept in
memory, which in practice showed much better performance because the operating system doesn't
need to manage as much memory (which is outside of the JVM: RDKit molecules in native code via JNI).