DrugHunter Molecule Extractor

The DrugHunter website publishes many high-quality datasets, notably the yearly Drug Approvals and the monthly Molecules of the Month. These sets are well curated and provide a useful source of validated molecules. Unfortunately, all the molecules are presented only inside pdf pages, which makes extracting them into a computer-readable format a non-trivial task.

This repo provides tools for extracting these molecules from the website, either from a provided url or through a workflow that extracts all Molecules of the Month within a given year. All extracted information is exported into a timestamped csv file.

The script can also easily be used to extract molecules within pdfs from any provided url. See Usage.

Installation

Clone repository and install all the dependencies

git clone https://github.com/deimos1078/drughunter-molecule-extractor
cd drughunter-molecule-extractor
pip install -r requirements.txt

Install modified MolScribe

git clone https://github.com/deimos1078/MolScribe
cd MolScribe
python setup.py install
cd ..

Install modified decimer_segmentation

git clone https://github.com/deimos1078/DECIMER-Image-Segmentation
cd DECIMER-Image-Segmentation
pip install .
cd ..

Feel free to remove the DECIMER and MolScribe repos once they're installed.

If all goes well, the DrugHunter extractor should now be usable.

Note about rdkit version

It's highly recommended to use the version of rdkit specified in requirements.txt. Both decimer and molscribe use rdkit to generate smiles, and using older versions may lead to unexpected effects in the resulting smiles (unwanted hybridization, for example).
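One way to check this before running the extractor is to compare the installed rdkit version with the one pinned in requirements.txt. The following is only a minimal sketch: the helper names are hypothetical, and it assumes the requirement is pinned with `==` under the distribution name `rdkit`, which may not match how this repo actually lists it.

```python
import re
from importlib import metadata


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for `package`, or None if not pinned."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*([^\s;#]+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(requirements_text, package="rdkit"):
    """Return (pinned, installed, ok); ok is True only when both versions match.

    Assumption: the distribution is named `rdkit`; adjust `package` if the
    requirements file uses a different name.
    """
    pinned = pinned_version(requirements_text, package)
    installed = installed_version(package)
    return pinned, installed, pinned is not None and pinned == installed
```

Passing the contents of requirements.txt to `check_pin` gives a quick yes/no answer before starting a long extraction run.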

Usage

Invoking help:

python3 drughunter_extractor.py -h

will produce:

usage: drughunter_extractor.py [-h] [-y YEAR] [-m MONTH] [-u URL] [--seg_dir SEG_DIR] [--decimer_off] [--text] [--direction DIRECTION] [--separator SEPARATOR]

DrugHunter extractor

options:
  -h, --help            show this help message and exit
  -y YEAR, --year YEAR  (int) targeted year of drughunter molecules of the month set
  -m MONTH, --month MONTH
                        (str) targeted month range of the molecules of the month set, input either two numbers separated by a dash or a single number (borders of the range are included)
  -u URL, --url URL     (str) url of webpage with targeted set (in case the format of drughunter url changes, which is likely)
  --seg_dir SEG_DIR     (str) directory that the segmented segments will be saved into, if unspecified, segments will not be saved
  --decimer_off         Turns off decimer complementation
  --text                Turns on text extraction
  --direction DIRECTION
                        Specifies in which direction the text is from the molecules
  --separator SEPARATOR
                        Specifies which separator is used in the document to separate name and target.

Extract from url (without text)

-u, --url Specify the url containing pdf (or pdfs) that the script will attempt to extract molecules from

python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/

The script will attempt to access the webpage and then list all links to pdf files on the site. You can either select a specific pdf link or download all of them for the subsequent extraction.

Attempting to download pdf files from https://drughunter.com/resource/2022-drug-approvals/
0: DH-2022-Small-Drug-Approvals-v3.pdf
1: DH-2022-Large-Drug-Approvals-_R1.pdf
2: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
3: DH-2022-First-in-Class-Large-Molecules.pdf
Enter the index of the file name you would like to download.
Enter 'a' to download all pdf files.
Enter 'q' to quit.

Input:

Selected file: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf downloaded successfully.

Now simply wait for the script to perform segmentation, recognition and validation. The results will be exported into a csv in the results directory.
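The link-listing step described above can be sketched roughly as follows. The script itself uses requests and BeautifulSoup (see Workflow and used libraries); this sketch deliberately swaps in the standard library `html.parser` so it runs without extra dependencies, and the names `PdfLinkParser`, `find_pdf_links` and `format_menu` are hypothetical, not taken from the repo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect the targets of <a> tags that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Look at the path only, so a query string does not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return absolute pdf urls found in `html`, in page order, without repeats."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def format_menu(links):
    """Return menu lines such as '0: file.pdf', one per link."""
    return [
        f"{index}: {urlparse(link).path.rsplit('/', 1)[-1]}"
        for index, link in enumerate(links)
    ]
```

In this sketch the html would come from the downloaded page body, and the index the user types selects one entry of the list returned by `find_pdf_links`.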

Extract from url (with text) (recommended only for extracting text out of DrugHunter)

Use --direction with either "down" or "right" to specify where the text you want to extract is located relative to the molecule.

Use --separator to specify the separator if the text information contains a "NAME SEPARATOR TARGET" line.

python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/ --text --direction "down" --separator ','
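To illustrate what the two options mean, here is a small sketch of how a text search area could be derived from a molecule's bounding box and how a "NAME SEPARATOR TARGET" line could be split. This is not the repo's code: the function names and the `extent` parameter are hypothetical, and the box is assumed to be `(x0, y0, x1, y1)` with y growing downward.

```python
def text_region(box, direction, extent):
    """Return the area next to `box` in which text would be looked for.

    Assumptions: box is (x0, y0, x1, y1) with the origin at the top left and
    y growing downward; `extent` is how far the area reaches from the box.
    """
    x0, y0, x1, y1 = box
    if direction == "down":
        # Same width as the molecule, starting at its bottom edge.
        return (x0, y1, x1, y1 + extent)
    if direction == "right":
        # Same height as the molecule, starting at its right edge.
        return (x1, y0, x1 + extent, y1)
    raise ValueError(f"unsupported direction: {direction!r}")


def split_name_target(line, separator):
    """Split a 'NAME SEPARATOR TARGET' line into (name, target).

    Only the first occurrence of the separator is used; if the separator is
    missing, the whole line is treated as the name and the target is None.
    """
    name, found, target = line.partition(separator)
    if not found:
        return line.strip(), None
    return name.strip(), target.strip()
```

With `--direction "down" --separator ','`, a line such as `examplinib, KINASE-X` (a made-up entry) would be split into the name `examplinib` and the target `KINASE-X`.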

Extract all Molecules of the Month within a given year

-y, --year Specify the year that the Molecules of the Month sets you're targeting were published in

Use the --text switch to specify whether you'd like the script to attempt to extract text information from the pdfs as well

python3 drughunter_extractor.py --year 2023 --text

Please note that at the time of writing both Decimer and MolScribe generate quite a few warnings. This is expected behaviour, so as long as the script exports results, everything is working as intended.

Extract Molecules of the Month within a specified month range

-m, --month Specify the range of months that the Molecules of the Month sets you're targeting were published in

Ex. I want to download all sets published between February (2) and September (9) of 2022, without text

python3 drughunter_extractor.py --year 2022 --month 2-9

Ex. I want to download the May 2023 set, with text

python3 drughunter_extractor.py --year 2023 --month 5 --text

Ex. I want to download the June 2023 set, without using decimer to complement the results

python3 drughunter_extractor.py --year 2023 --month 6 --decimer_off
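The --month format described in the help text (a single number, or two numbers separated by a dash, with both borders included) could be parsed along these lines. This is a sketch rather than the script's actual parser: `parse_month_range` is a hypothetical name, and rejecting reversed or out-of-range values is an assumption about sensible behaviour.

```python
def parse_month_range(value):
    """Turn '5' or '2-9' into the list of month numbers it covers (inclusive)."""
    parts = value.strip().split("-")
    if len(parts) == 1:
        start = end = int(parts[0])
    elif len(parts) == 2:
        start, end = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'M' or 'M-N', got {value!r}")
    # Assumption: months are 1..12 and the range must not be reversed.
    if not 1 <= start <= end <= 12:
        raise ValueError(f"month range out of bounds: {value!r}")
    return list(range(start, end + 1))
```

Under these assumptions `--month 2-9` covers February through September, and `--month 5` covers May only.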

Workflow and used libraries

1. The webpage is accessed through requests.
2. BeautifulSoup is used to gather all pdf links on the webpage.
3. Pdfs are downloaded using those links.
4a) Pdfs are segmented into individual images using decimer-image-segmentation.
4b) If the Molecules of the Month set is targeted, segmentation is done using rectangle detection instead.
5. If the --text option is on, the bounding boxes of these segments are used to attempt text extraction as well (using fitz from pymupdf).
6. The segments are recognized using MolScribe.
7. Inchikeys are generated from the recognized smiles using rdkit.
8. The inchikeys are searched for in the Unichem database (connectivity search, using chembl_webresource_client) in order to assess their validity.
9. Segments that were not validated by Unichem are recognized by DECIMER-Image_Transformer. This is done because Decimer and MolScribe are good at recognizing different molecules, and because Decimer is significantly slower, it is preferable to use it only on the necessary portion of the segments.
10. The new results are validated again.
11. The results are filtered so that no duplicate inchikeys are present.
12. The results are exported into a csv file using a pandas dataframe.
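The recognize, validate, fall back and deduplicate part of this workflow can be sketched with plain callables standing in for the real tools. Everything here is illustrative: `primary` stands in for MolScribe plus rdkit, `fallback` for Decimer, and `is_valid` for the Unichem lookup. The function names, the per-segment loop and the decision to keep unvalidated rows are assumptions, not the repo's actual implementation.

```python
def recognize_with_fallback(segments, primary, fallback, is_valid, use_fallback=True):
    """Recognize each segment, calling the slower fallback only when needed.

    primary / fallback: callables mapping a segment to an inchikey (or None).
    is_valid: callable deciding whether an inchikey passes validation.
    use_fallback=False mimics the effect of the --decimer_off switch.
    """
    results = []
    for segment in segments:
        inchikey, source = primary(segment), "primary"
        if use_fallback and not is_valid(inchikey):
            # Only unvalidated segments reach the slower recognizer.
            second = fallback(segment)
            if is_valid(second):
                inchikey, source = second, "fallback"
        results.append(
            {
                "segment": segment,
                "inchikey": inchikey,
                "source": source,
                "validated": is_valid(inchikey),
            }
        )
    return results


def drop_duplicate_inchikeys(results):
    """Keep the first row for every inchikey; rows without a key are all kept."""
    seen = set()
    unique = []
    for row in results:
        key = row["inchikey"]
        if key is not None and key in seen:
            continue
        if key is not None:
            seen.add(key)
        unique.append(row)
    return unique
```

Structuring the step this way keeps the cost of the slower recognizer proportional to the number of segments the first pass could not validate, which is the motivation given in the list above.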

hanzlika/drughunter-molecule-extractor
