RAM issue in MolToDescriptorPipelineElement when standardizer not None #23

Open
JochenSiegWork opened this issue Jun 14, 2024 · 0 comments
Labels
status: backlog (Things we will work on, but not right now)
type: enhancement (New feature or request)

Comments

@JochenSiegWork
Collaborator

I tried to process a data set of 1.4M molecules with a small Pipeline looking like this:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        # MolToNetCharge inherits from MolToDescriptorPipelineElement
        ("net_charge_element", MolToNetCharge()),
    ]
)

This leads to RAM issues because MolPipeline tries to hold the RDKit data structures for all 1.4M molecules in RAM at the same time. This happens because, for instance-based processing, MolPipeline splits the pipeline elements into syncing and non-syncing parts, and an element that requires fitting counts as a syncing part, so all molecules have to be available before it runs.
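To illustrate the difference between the two processing modes, here is a minimal, library-independent sketch. It does not use MolPipeline or RDKit: `parse`, `descriptor`, `LiveCounter`, `run_streaming` and `run_synced` are hypothetical stand-ins, and the dict returned by `parse` only plays the role of a heavy RDKit molecule. The sketch counts how many intermediate objects exist at the same time when each instance is passed through all steps individually versus when a step forces all instances to be materialized first.

```python
from typing import Dict, List, Tuple


class LiveCounter:
    """Tracks how many intermediate objects are alive at the same time."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def acquire(self) -> None:
        self.live += 1
        self.peak = max(self.peak, self.live)

    def release(self) -> None:
        self.live -= 1


def parse(smiles: str, counter: LiveCounter) -> Dict[str, str]:
    """Hypothetical stand-in for SMILES parsing; creates one 'heavy' object."""
    counter.acquire()
    return {"smiles": smiles}


def descriptor(mol: Dict[str, str], counter: LiveCounter) -> int:
    """Hypothetical stand-in for a descriptor; the 'heavy' object is done afterwards."""
    value = len(mol["smiles"])
    counter.release()
    return value


def run_streaming(smiles_list: List[str]) -> Tuple[List[int], int]:
    """Each instance passes through both steps before the next one is parsed."""
    counter = LiveCounter()
    results = [descriptor(parse(s, counter), counter) for s in smiles_list]
    return results, counter.peak


def run_synced(smiles_list: List[str]) -> Tuple[List[int], int]:
    """The second step waits until every instance has been parsed."""
    counter = LiveCounter()
    mols = [parse(s, counter) for s in smiles_list]  # sync point: all objects exist here
    results = [descriptor(m, counter) for m in mols]
    return results, counter.peak


if __name__ == "__main__":
    data = ["C", "CC", "CCO", "c1ccccc1"]
    print(run_streaming(data))
    print(run_synced(data))
```

In the streaming variant the peak number of live objects stays at one regardless of the data set size, while in the synced variant it grows with the number of input molecules, which is the pattern described above.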

In the constructor of MolToDescriptorPipelineElement, _requires_fitting is set to True whenever the standardizer is not None:

if self._standardizer is not None:
    self._requires_fitting = True

The RAM issues can be avoided by explicitly disabling the standardizer:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        ("net_charge_element", MolToNetCharge(standardizer=None)),
    ]
)

It would be better if standardization could be applied without forcing all molecules into RAM at once.
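Until then, one possible user-side way to bound memory is to feed the input in fixed-size chunks, so that at most one chunk of molecules is held in RAM at a time. The sketch below is generic Python and not part of MolPipeline: `iter_chunks`, `process_in_chunks` and the `process_batch` callable are hypothetical names. It also assumes that the results do not depend on which molecules share a chunk, which would not hold if the standardizer is fitted on the data passed to it.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(
    items: Iterable[T],
    process_batch: Callable[[List[T]], Sequence[R]],
    chunk_size: int,
) -> Iterator[R]:
    """Apply process_batch chunk by chunk and yield the results in input order."""
    for chunk in iter_chunks(items, chunk_size):
        # Only the current chunk and its results are alive at this point.
        yield from process_batch(chunk)


if __name__ == "__main__":
    # Toy batch function standing in for a call into the pipeline (assumption).
    lengths = process_in_chunks(["C", "CC", "CCO"], lambda b: [len(s) for s in b], 2)
    print(list(lengths))
```

Here `process_batch` would be a callable that runs the pipeline on a list of SMILES strings. Because `items` can be any iterable, the SMILES can also be read lazily from a file instead of being loaded as one list.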

@c-w-feldmann added the labels "type: enhancement" (New feature or request) and "status: backlog" (Things we will work on, but not right now) on Jun 14, 2024