-
Notifications
You must be signed in to change notification settings - Fork 17
Adjust parallelization of TwoComponentReaction node to significantly reduce memory usage #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaubold Sorry for taking a while to complete this review.
This looks good to me; I have left in a few comments that you might want to take into consideration.
org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/Pair.java
Show resolved
Hide resolved
org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/PairIterable.java
Show resolved
Hide resolved
org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/PairIterable.java
Outdated
Show resolved
Hide resolved
org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/PairIterable.java
Outdated
Show resolved
Hide resolved
...odes/src/org/rdkit/knime/nodes/twocomponentreaction2/RDKitTwoComponentReactionNodeModel.java
Outdated
Show resolved
Hide resolved
...odes/src/org/rdkit/knime/nodes/twocomponentreaction2/RDKitTwoComponentReactionNodeModel.java
Outdated
Show resolved
Hide resolved
260fadb
to
b5f436f
Compare
Thanks for the feedback @ptosco! Sorry for all the formatting changes, I should've removed them from the start. I applied the suggested improvements and also added your license file as header of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaubold This looks good to me; I only suggested using KNIME's copyright notice in org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/Pair.java
given you added this file, and also removing a further set of curly brackets in org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/RDKitTwoComponentReactionNodeModel.java
by changing else { if (...) {
into else if (...) {
.
...odes/src/org/rdkit/knime/nodes/twocomponentreaction2/RDKitTwoComponentReactionNodeModel.java
Outdated
Show resolved
Hide resolved
org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/twocomponentreaction2/Pair.java
Outdated
Show resolved
Hide resolved
…reduce memory usage The TwoComponentReaction submitted tasks for an executor service in the following scheme: one task for each element in the first input column. The task then performed the reaction of this element (=reactant) with all reactants of a second input column. So the output of this task is the list of reaction results of all these pairings. The output needs to be kept in memory until it has been written out. Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot of results in memory. The thread pool is configured to use ~2x as many threads as there are CPU cores, so if there's a 4 core CPU this means 8 tasks are running in parallel, so at least 8 large results need to be kept in memory. Changed with this commit: each reaction is handled as individual task. While this might increase the bookkeeping overhead, it makes sure that way fewer results need to be kept in memory, which in practice showed much better performance because the operating system doesn't need to manage a much memory (which is outside of the JVM, but RDKit molecules in C via JNI).
d18a2a2
to
b80988f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaubold Thank you for your contribution and for applying my suggested changes!
@greglandrum I am happy to merge.
Thanks to both of you. Merge away @ptosco ! |
The TwoComponentReaction submitted tasks for an executor service in the following scheme: one task for each element in the first input column. The task then performed the reaction of this element (=reactant) with all reactants of a second input column. So the output of this task is the list of reaction results of all these pairings. The output needs to be kept in memory until it has been written out.
Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot of results in memory.
The thread pool is configured to use ~2x as many threads as there are CPU cores, so if there's a 4 core CPU this means 8 tasks are running in parallel, so at least 8 large results need to be kept in memory.
Changed with this commit: each reaction is handled as individual task. While this might increase the bookkeeping overhead, it makes sure that way fewer results need to be kept in memory, which in practice showed much better performance because the operating system doesn't need to manage too memory (which is outside of the JVM, but RDKit molecules in C via JNI).