Number of molecular descriptors obtained with PaDEL differs from the number of molecules in the molecule.smi file #2

Open
sayalaruano opened this issue Oct 24, 2021 · 7 comments

Comments

@sayalaruano

Hello professor, I'm doing EDA and calculating molecular descriptors for the betalactamase dataset. I replaced duplicated values with their mean, as you suggested, and kept only the molecules that bind to Betalactamase AmpC, which leaves a dataset of 62050 molecules. I then followed the instructions in the description video to calculate molecular descriptors with padelpy, but I obtained descriptors for only 5534 molecules, even though my molecule.smi file has 62050 molecules. Do you know whether PaDEL limits the number of molecules it can process, or could this error come from something in my code? This GitHub repo contains my notebook and all files: https://github.com/sayalaruano/MidtermProject-MLZoomCamp. I added the same comment to the YouTube video of the challenge, just in case. Thanks in advance for your help.

@sayalaruano changed the title from "Number of molecular descriptors obtained with PaDEL differs from the number of my molecules in the molecule.smi file" to "Number of molecular descriptors obtained with PaDEL differs from the number of molecules in the molecule.smi file" on Oct 24, 2021
@wguesdon

I obtained only 1412 rows myself, as can be seen here: https://github.com/wguesdon/beta-lactamase/blob/main/Data_Wrangling_and_EDA.ipynb.
I wonder if we could apply the padelpy method row by row via a lambda function?

@sayalaruano
Author

I just came up with the solution for this error. The mistake was that I had kept some molecules with NaN in the canonical SMILES feature in my dataset, so PaDEL only calculated fingerprints for the molecules above the first NaN. Now I will try to calculate the 12 fingerprints for all molecules. I hope I can calculate all of them.
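A minimal sketch of the fix described above, assuming the data lives in a pandas DataFrame. The toy rows and the column names `canonical_smiles` and `molecule_chembl_id` are assumptions for illustration, not taken from the notebook, and the tab-separated layout of the .smi file is also an assumption. The idea is to drop rows with a missing SMILES value before writing the file that PaDEL reads:

```python
import pandas as pd

# Hypothetical toy data; the real column names in your dataset may differ.
df = pd.DataFrame({
    "canonical_smiles": ["CCO", None, "c1ccccc1", float("nan"), "CC(=O)O"],
    "molecule_chembl_id": ["ID1", "ID2", "ID3", "ID4", "ID5"],
})

# Drop every row whose SMILES value is missing (None or NaN).
clean = df.dropna(subset=["canonical_smiles"]).reset_index(drop=True)
print(f"dropped {len(df) - len(clean)} rows with missing SMILES")

# Write the cleaned molecules as a tab-separated file with no header or index.
smi_path = "molecule.smi"
clean[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    smi_path, sep="\t", index=False, header=False
)
```

After this step, the number of lines in molecule.smi should match `len(clean)`, which gives a quick way to compare against the number of descriptor rows PaDEL returns.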

@wguesdon

Thank you for sharing, it must have been the same issue for me.

@sayalaruano
Author

You're welcome @wguesdon, this is the good part of these collaborative projects :)

@semsem80

Hello sayalaruano,

I have the same problem.
I obtained PubChem molecular descriptors for only 338 molecules, although my molecule.smi file has 64424 molecules.

@sayalaruano
Author

Hello @semsem80, to solve this error you need to delete the molecules with NaN in the canonical_smile feature. I hope this helps; let me know if it works.
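If the .smi file has already been written, a quick check along these lines can show where the missing values are. This is only a sketch: it assumes a tab-separated file with the SMILES string in the first column, and it treats an empty field or the literal text "nan" as missing. The helper name `find_missing_smiles` is made up for this example.

```python
import os
import tempfile


def find_missing_smiles(smi_path):
    """Return the 1-based line numbers whose SMILES field is empty or 'nan'."""
    bad = []
    with open(smi_path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            # Assumption: the SMILES string is the first tab-separated field.
            smiles = line.rstrip("\n").split("\t")[0].strip()
            if smiles == "" or smiles.lower() == "nan":
                bad.append(lineno)
    return bad


# Small demonstration on a temporary file with two bad lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "molecule.smi")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("CCO\tID1\n\tID2\nc1ccccc1\tID3\nnan\tID4\n")
    bad_lines = find_missing_smiles(demo_path)

print(bad_lines)  # [2, 4]
```

If the first reported line number is close to the number of descriptor rows you obtained, a missing SMILES value is a likely cause.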

@semsem80

Hi @sayalaruano, your suggested solution worked, thank you for your help.


3 participants