Improve the ChEMBL drugs dataset #214
Update chembl_drugs to use chembl approved drugs.
fix Unnamed:0 column header
Update test_chembl_drugs
Codecov Report
@@ Coverage Diff @@
## main #214 +/- ##
=======================================
Coverage 91.91% 91.91%
=======================================
Files 46 46
Lines 3835 3835
=======================================
Hits 3525 3525
Misses 310 310
Thanks Shawn. As discussed live, a few points below:
- Edit the original message in this ticket to include fewer details about the PR.
- Edit the docstring as discussed (don't mention medchem, add a markdown link to the notebook, the date of the run, etc.).
- Use Parquet instead of CSV.
- Apply `black` to the code.
We want Parquet here, not CSV.
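For context on the `Unnamed: 0` header fixed in the commits above: this is the name pandas gives to an unlabeled first column, which typically appears when a DataFrame is written to CSV together with its index and then read back. The PR does not show the code that produced the file, so the following is only a minimal sketch assuming pandas was used; the column names and values are placeholders, not the real dataset.

```python
import io

import pandas as pd

# Placeholder rows: the real columns of the dataset are not shown in this PR.
df = pd.DataFrame({"chembl_id": ["ID_1", "ID_2"], "name": ["drug_a", "drug_b"]})

# Default to_csv() writes the index as a first column with an empty header.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
bad_columns = list(pd.read_csv(buf).columns)

# Writing with index=False keeps only the real columns.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
ok_columns = list(pd.read_csv(buf).columns)

print(bad_columns)
print(ok_columns)
```

The same `index=False` argument is accepted by `DataFrame.to_parquet`, which needs `pyarrow` or `fastparquet` installed. Parquet also stores column types alongside the data, so they do not have to be re-inferred on load.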
Thanks @stwhitfield for your first contribution to datamol!
Purpose: For datamol to have a dataset of all approved drugs in ChEMBL that contains metadata columns such as ChEMBL ID, date of approval, etc.
Changelogs
Added chembl_approved_drugs to datamol/data/
Modified dm.data.chembl_drugs() to leverage it
Adapted docstring to explain how it was generated
Modified unit tests
Added notebooks folder with code that generated chembl_approved_drugs.parquet
Checklist:
`feature`, `fix` or `test` (or ask a maintainer to do it for you).