Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Bug: Invalid SMILES producing a Null output #941

Closed
Richiio opened this issue Jan 9, 2024 · 33 comments · Fixed by #1295
Closed

🐛 Bug: Invalid SMILES producing a Null output #941

Richiio opened this issue Jan 9, 2024 · 33 comments · Fixed by #1295
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers low Can be deprioritized in favor of other issues; can be picked up by new contributors question Further information is requested

Comments

@Richiio
Copy link

Richiio commented Jan 9, 2024

Is your feature request related to a problem? Please describe.

Yes, it is related to a problem. I incorporated a model recently that worked perfectly well for a SMILE with correct input but failed for a SMILES with an incorrect input. Here is a link to the model incorporated https://github.com/ersilia-os/eos1mxi

Describe the problem:

The current model implementation encounters a null output when provided with an inaccurate SMILE.

Describe the solution you'd like.

I am looking at a possible solution where the ersilia model hub checks the input of a user, ensures it is a correct SMILES before attempting to run predictions with the model. This could be part of the checks ran in the github actions ensuring that only valid SMILES are processed by the model.

Describe alternatives you've considered

While implementing input validation in the GitHub Actions workflow is one solution, alternative approaches may include incorporating input validation directly within the model's code or providing a separate input validation endpoint that users can query before submitting SMILES for predictions (This would be for the web UI)

@Richiio Richiio added the enhancement New feature or request label Jan 9, 2024
@GemmaTuron
Copy link
Member

Hi @Richiio

I need to understand this better. Can you please provide an example of the input and output you are getting in the correct and incorrect case? And better explain this: The current model implementation encounters errors when provided with incorrect SMILES input. This can lead to unexpected behavior and inaccurate predictions. which errors does it encounter? what inaccurate predictions? we should not output a prediction if the input is not correct
In principle Ersilia already checks for the validity of the input already so I don't know exactly what else would we want to do here @miquelduranfrigola ?

@Richiio
Copy link
Author

Richiio commented Jan 10, 2024

Hi @GemmaTuron, that was a wrong description on my part. Apologies for that

@miquelduranfrigola
Copy link
Member

Hi @Richiio thanks for the issue.

Can you provide run, in the same input file one good molecule and one bad molecule and attach the output here, as a csv?

@Richiio
Copy link
Author

Richiio commented Jan 10, 2024

@miquelduranfrigola
Input file
input.csv
Corresponding output file
output (5).csv

@miquelduranfrigola
Copy link
Member

miquelduranfrigola commented Jan 10, 2024

OK, thanks @Richiio ! Super useful.
Now I see what you mean.

Let's discuss this with @GemmaTuron and @DhanshreeA , not sure what is the best format we would want. I agree that the current solution, in this case, is not the best. This is probably something we should do discuss in an online meeting.

@miquelduranfrigola
Copy link
Member

@GemmaTuron what is the current status of this?

@GemmaTuron
Copy link
Member

We can add it to the agenda for discussion on Tuesday!

@miquelduranfrigola
Copy link
Member

Also, can we please add a title to this issue to help all of us keep track of it? Thanks!

@Richiio Richiio changed the title 📑 Feature Request: 📑 Feature Request: Invalid SMILES producing a Null output Jan 23, 2024
@DhanshreeA DhanshreeA added question Further information is requested low Can be deprioritized in favor of other issues; can be picked up by new contributors labels Feb 19, 2024
@DhanshreeA
Copy link
Member

I am labeling this low priority since it would be nice to have model output from ersilia cli in a more meaningful format in case of invalid smiles/garbage input, however it is not a breaking issue as of now. We will take this up soon!

@DhanshreeA DhanshreeA changed the title 📑 Feature Request: Invalid SMILES producing a Null output 🐛 Bug: Invalid SMILES producing a Null output May 28, 2024
@DhanshreeA DhanshreeA moved this to Queue in Ersilia Model Hub May 28, 2024
@GemmaTuron
Copy link
Member

Hi @miquelduranfrigola and @DhanshreeA

I thought Ersilia dealt with bad inputs from the start, so what do we want to do here? A more informative return of info for the user?
Similar to this issue #738 we could return a more informative message, but if only one molecule is incorrect in a bunch of molecules I'd rather still run the model and return the null output as is already happening

@miquelduranfrigola
Copy link
Member

Agree. What is more important is that we always return the same number of output rows as input rows.
The problem is with the key. If an input is incorrect, then we cannot obtain a key for it. So what do we do in this case? A null key? An arbitrary key?

@GemmaTuron
Copy link
Member

I would suggest a null key could be helpful so users can quickly identify if a molecule is not correct

@DhanshreeA
Copy link
Member

@miquelduranfrigola agree, we could have a placeholder/dummy key for cases where we could not obtain a key for a molecule, something like UNPOCESSABLE_SMILES_INPUT or something like that.

@miquelduranfrigola
Copy link
Member

I would simply call it UNINDEXED_INPUT or UNPROCESSABLE_INPUT. I would avoid saying SMILES

@GemmaTuron
Copy link
Member

@miquelduranfrigola or @DhanshreeA could you detail a bit how to tackle this? In which section of the Ersilia code should this go? I suggest using UNPROCESSABLE_INPUT both to fill in key and input if they are empty.

@GemmaTuron GemmaTuron moved this to On Hold in Ersilia Model Hub Sep 24, 2024
@miquelduranfrigola
Copy link
Member

In my opinion, this logic should be incorporated in the CompoundIdentifier class (here)

Note that we've found the interesting (a.k.a. annoying) case where the SMILES is valid but it is not possible to produce an InChIKey. I do not recall which SMILES string was that. This is not frequent (first time I see it), but we need to take it into account.

@DhanshreeA
Copy link
Member

Note that we've found the interesting (a.k.a. annoying) case where the SMILES is valid but it is not possible to produce an InChIKey. I do not recall which SMILES string was that. This is not frequent (first time I see it), but we need to take it into account.

@miquelduranfrigola this only happens when apparently, RDKit is not able to process this SMILE, and neither PubChem nor Cactus resolvers return anything.

@miquelduranfrigola
Copy link
Member

OK then this is something relatively easy, then

@DhanshreeA
Copy link
Member

Alright, so for conclusion, we will use UNPROCESSABLE_INPUT for both key, and input, in the CompoundIdentifier class when ersilia is not able to process a SMILES string.

To test this, we should feed ersilia first with a single unprocessable input, and then a couple of unprocessable inputs within otherwise meaningful data.

@miquelduranfrigola
Copy link
Member

Agree. Also, let's document this accordingly. Thanks!

@DhanshreeA
Copy link
Member

I think this could also make for a good first issue.

@DhanshreeA DhanshreeA added the good first issue Good for newcomers label Oct 2, 2024
@musasizivictoria
Copy link
Contributor

Hello @DhanshreeA i would like to give this a shot.
Thanks

@KimFarida
Copy link
Contributor

@DhanshreeA hahaa i know i be tagged you in so many places but this looks pretty good too

@DhanshreeA
Copy link
Member

@musasizivictoria go ahead!

@DhanshreeA DhanshreeA moved this from On Hold to Queued in Ersilia Model Hub Oct 5, 2024
@DhanshreeA DhanshreeA moved this from Queued to In Progress in Ersilia Model Hub Oct 5, 2024
@DhanshreeA
Copy link
Member

I don't see any updates from @musasizivictoria so I am reopening this issue for other applicants.

@musasizivictoria
Copy link
Contributor

musasizivictoria commented Oct 5, 2024

Ooh am sorry @DhanshreeA i did not receive a notificcation for the assignment, kindly reassign me this to kick start Thanks.

@DhanshreeA
Copy link
Member

Okay, @musasizivictoria you can continue working on it but please share updates. Thank you.

@musasizivictoria
Copy link
Contributor

musasizivictoria commented Oct 8, 2024

Okay, @musasizivictoria you can continue working on it but please share updates. Thank you.

Thanks @DhanshreeA I already opened PR
#1295

Kindly let me know another way of sharing update, sorry if i missed it. Do i need to post in slack?

@DhanshreeA
Copy link
Member

Okay, @musasizivictoria you can continue working on it but please share updates. Thank you.

Thanks @DhanshreeA I already opened PR #1295

Kindly let me know another way of sharing update, sorry if i missed it. Do i need to post in slack?

No here is fine, thanks @musasizivictoria!

@musasizivictoria
Copy link
Contributor

musasizivictoria commented Oct 9, 2024

I am progressing on the implementation, but i have some doubt about how the CompoundIdentifier class is used here;

self.identifier = importlib.import_module(

@DhanshreeA
Copy link
Member

I am progressing on the implementation, but i have some doubt about how the CompoundIdentifier class is used here;

ersilia/ersilia/io/types/compound.py

Line 26 in 43bbe1d
self.identifier = importlib.import_module(

@musasizivictoria honestly, I think we can change this part and import the CompoundIdentifier class normally as you would import any other module, instead of using importlib. This way the code would actually be discoverable through inspection tools like Pylance etc.

@musasizivictoria
Copy link
Contributor

Thanks @DhanshreeA here is the testing of the current changes:
Input:
testing.csv

Output:
output.csv

@musasizivictoria
Copy link
Contributor

hello @DhanshreeA ,
I provided feedback for the PR #1295
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers low Can be deprioritized in favor of other issues; can be picked up by new contributors question Further information is requested
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants