The DrugHunter website publishes many high-quality datasets, notably the yearly Drug Approvals and the monthly Molecules of the Month. These sets are well curated and provide a useful source of validated molecules. Unfortunately, all the molecules are presented only inside pdf pages, which makes extracting them into a computer-readable format a non-trivial task.
This repo provides the tools that allow the extraction of these molecules from the webpage - either from a provided url, or through a workflow that extracts all Molecules of the Month molecules within a given year. All extracted information is exported into a timestamped
The script can also easily be used to extract molecules within pdfs from any provided url. See Usage.
Clone repository and install all the dependencies
git clone https://github.com/deimos1078/drughunter-molecule-extractor
cd drughunter-molecule-extractor
pip install -r requirments.txt
Install modified Molscribe
git clone https://github.com/deimos1078/MolScribe
cd MolScribe
python setup.py install
cd ..
Install modified decimer_segmentation
git clone https://github.com/deimos1078/DECIMER-Image-Segmentation
cd DECIMER-Image-Segmentation
pip install .
cd ..
Feel free to remove the DECIMER and MolScribe repos once they're installed.
If all goes well, the DrugHunter extractor should now be usable.
It is highly recommended to use the version of rdkit specified in the requirements file. Both Decimer and MolScribe use rdkit to generate SMILES, and older versions may lead to unexpected artifacts in the generated SMILES (unwanted hybridization, for example).
Invoking help:
python3 drughunter_extractor.py -h
will produce:
usage: drughunter_extractor.py [-h] [-y YEAR] [-m MONTH] [-u URL] [--seg_dir SEG_DIR] [--decimer_off] [--text] [--direction DIRECTION] [--separator SEPARATOR]
DrugHunter extractor
options:
-h, --help show this help message and exit
-y YEAR, --year YEAR (int) targeted year of drughunter molecules of the month set
-m MONTH, --month MONTH
(str) targeted month range of the molecules of the month set, input either two numbers separated by a dash or a single number (borders of the range are included)
-u URL, --url URL (str) url of webpage with targeted set (in case the format of drughunter url changes, which is likely)
--seg_dir SEG_DIR (str) directory that the segmented segments will be saved into, if unspecified, segments will not be saved
--decimer_off Turns off decimer complementation
--text Turns on text extraction
--direction DIRECTION
Specifies in which direction the text is from the molecules
--separator SEPARATOR
Specifies which separator is used in the document to separate name and target.
-u, --url Specify the url containing pdf (or pdfs) that the script will attempt to extract molecules from
python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/
The script will attempt to access the webpage and then list all links to pdf files found on it. You can either select a specific pdf link or download all of them for the subsequent extraction.
Attempting to download pdf files from https://drughunter.com/resource/2022-drug-approvals/
0: DH-2022-Small-Drug-Approvals-v3.pdf
1: DH-2022-Large-Drug-Approvals-_R1.pdf
2: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
3: DH-2022-First-in-Class-Large-Molecules.pdf
Enter the index of the file name you would like to download.
Enter 'a' to download all pdf files.
Enter 'q' to quit.
Input:
2
Selected file: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf downloaded successfully.
Now simply wait for the script to perform segmentation, recognition and validation. The results will be exported into a csv in the results directory.
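The link-listing step shown in the session above can be sketched as follows. This is a minimal illustration, not the script's own code: it uses only the Python standard library in place of requests and BeautifulSoup, and the class name, function name, sample HTML and example.com URLs are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href="..."> that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links that are not pdfs.
        if href and href.lower().endswith(".pdf"):
            # Relative links are resolved against the page URL.
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Hypothetical helper: return all pdf links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Made-up page content standing in for a downloaded webpage.
sample_html = """
<a href="/files/DH-2022-Small-Drug-Approvals-v3.pdf">Small molecules</a>
<a href="https://example.com/about">About</a>
<a href="docs/report.pdf">Report</a>
"""
links = find_pdf_links(sample_html, "https://example.com/resource/page/")
for index, link in enumerate(links):
    # Mirrors the "index: file name" menu printed by the script.
    print(f"{index}: {link.rsplit('/', 1)[-1]}")
```

In a real run the HTML would come from the downloaded page rather than a string literal, and the selected link would then be fetched and written to disk.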
Use --direction with either "down" or "right" to specify where the text you want to extract sits relative to the molecule.
Use --separator to specify the separator if the text contains a "NAME SEPARATOR TARGET" line.
python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/ --text --direction "down" --separator ','
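To illustrate what the separator does, a "NAME SEPARATOR TARGET" line can be split roughly like this. The function name and the fallback behaviour for lines without a separator are assumptions made for this sketch; the script itself may handle such lines differently.

```python
def split_name_target(line, separator):
    """Split a 'NAME SEPARATOR TARGET' line into a (name, target) pair.

    Hypothetical helper: splits on the first occurrence of the separator only,
    so a target that itself contains the separator is kept whole.
    """
    name, found, target = line.partition(separator)
    if not found:
        # Assumption: with no separator present, the whole line is the name.
        return line.strip(), None
    return name.strip(), target.strip()


print(split_name_target("NAME, TARGET", ","))
print(split_name_target("NAME ONLY", ","))
```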
-y, --year Specify the year that the Molecules of the Month sets you're targeting were published in
Use the --text switch if you'd like the script to attempt to extract text information from the pdfs as well
python3 drughunter_extractor.py --year 2023 --text
Please note that at the time of writing both Decimer and MolScribe generate quite a few warnings. This is expected behaviour, so as long as the script exports results, everything is working as intended.
-m, --month Specify the range of months that the Molecules of the Month sets you're targeting were published in
Ex. I want to download all sets published between February (2) and September (9) of 2022, without text
python3 drughunter_extractor.py --year 2022 --month 2-9
Ex. I want to download the May 2023 set, with text
python3 drughunter_extractor.py --year 2023 --month 5 --text
Ex. I want to download the June 2023 set, without using decimer to complement the results
python3 drughunter_extractor.py --year 2023 --month 6 --decimer_off
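The --month value is either a single number or two numbers separated by a dash, with both borders included. A minimal sketch of how such a value could be interpreted is shown below; the function name and the validation rules (months limited to 1 through 12, start not after end) are assumptions for illustration, not taken from the script.

```python
def parse_month_range(value):
    """Turn '5' or '2-9' into an inclusive list of month numbers.

    Hypothetical helper; raises ValueError for malformed or out-of-range input.
    """
    start_text, dash, end_text = value.partition("-")
    start = int(start_text)
    # A single number is treated as a range of one month.
    end = int(end_text) if dash else start
    if not (1 <= start <= end <= 12):
        raise ValueError(f"invalid month range: {value!r}")
    return list(range(start, end + 1))


print(parse_month_range("2-9"))
print(parse_month_range("5"))
```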
- The webpage is accessed through requests
- BeautifulSoup is used to gather all pdf links on the webpage
- Pdfs are downloaded using those links
- Pdfs are segmented into individual images using decimer-image-segmentation; if a Molecules of the Month set is targeted, segmentation is done using rectangle detection instead
- If --text option is on, the bounding boxes of these segments are used to attempt text extraction as well (using fitz from pymupdf)
- The segments are recognized using MolScribe
- InChIKeys are generated from the recognized SMILES using rdkit
- To assess their validity, the InChIKeys are looked up by connectivity in the UniChem database using chembl_webresource_client
- Segments that were not validated by UniChem are recognized again with Decimer-Image_Transformer. Decimer and MolScribe are good at recognizing different molecules, but Decimer is significantly slower, so it is preferable to run it only on the segments that need it
- The Decimer results are validated in the same way
- The results are filtered so that no duplicate inchikeys are present
- The results are exported into a csv file using a pandas DataFrame
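The recognize, validate, fall back and deduplicate steps above can be sketched in plain Python. This is only an outline under stated assumptions: the three callables are hypothetical stand-ins for a MolScribe wrapper, a Decimer wrapper and a UniChem lookup, and dropping segments that neither model validates is a choice made for this sketch, not necessarily what the script does.

```python
def recognize_with_fallback(segments, primary, fallback, is_valid):
    """Run `primary` on every segment and `fallback` only where validation fails.

    Hypothetical callables:
      primary(segment)   -> InChIKey or None (e.g. a MolScribe wrapper)
      fallback(segment)  -> InChIKey or None (e.g. a Decimer wrapper)
      is_valid(inchikey) -> bool             (e.g. a UniChem lookup)
    Returns a dict mapping each validated InChIKey to the first segment
    that produced it, so duplicates are removed along the way.
    """
    results = {}
    for segment in segments:
        key = primary(segment)
        if key is None or not is_valid(key):
            # Only unvalidated segments pay the cost of the slower model.
            key = fallback(segment)
            if key is None or not is_valid(key):
                continue  # assumption: unvalidated segments are dropped
        results.setdefault(key, segment)
    return results


# Fake recognizers and a fake database, for demonstration only.
primary_out = {"seg1": "KEY-A", "seg2": "BAD", "seg3": "KEY-A", "seg4": None}
fallback_out = {"seg2": "KEY-B", "seg4": None}
known_keys = {"KEY-A", "KEY-B"}
fallback_calls = []


def fake_fallback(segment):
    fallback_calls.append(segment)
    return fallback_out.get(segment)


results = recognize_with_fallback(
    ["seg1", "seg2", "seg3", "seg4"],
    primary_out.get,
    fake_fallback,
    lambda key: key in known_keys,
)
print(results)
print(fallback_calls)
```

The slower recognizer runs only for the two segments the first pass could not validate, which is the point of ordering the models this way.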