A miniproject focused on creating a script for repeated extraction of molecules and attached text information from pdf files published on https://drughunter.com/, primarily on the sets published on https://drughunter.com/molecules-of-the-month/ and https://drughunter.com/resource_category/approved-drug-reviews/ using OCSR.

hanzlika/drughunter-molecule-extractor

DrugHunter Molecule Extractor

The DrugHunter website publishes many high-quality datasets, in particular the yearly Drug Approvals and the monthly Molecules of the Month. These sets are well curated and provide a useful source of validated molecules. Unfortunately, all the molecules are presented strictly within pdf pages, which makes their extraction into a computer-readable format a non-trivial task.

This repo provides tools for extracting these molecules from the website, either from a provided url or through a workflow that extracts all Molecules of the Month molecules published within a given year. All extracted information is exported into a timestamped csv file in the results directory.

The script can also easily be used to extract molecules within pdfs from any provided url. See Usage.

Installation

Clone repository and install all the dependencies

git clone https://github.com/deimos1078/drughunter-molecule-extractor
cd drughunter-molecule-extractor
pip install -r requirments.txt

Install modified MolScribe

git clone https://github.com/deimos1078/MolScribe
cd MolScribe
python setup.py install
cd ..

Install modified decimer_segmentation

git clone https://github.com/deimos1078/DECIMER-Image-Segmentation
cd DECIMER-Image-Segmentation
pip install .
cd ..

Feel free to remove the DECIMER and MolScribe repos once they're installed.

If all goes well, the DrugHunter extractor should now be usable.

Note about rdkit version

It's highly recommended to use the version of rdkit specified in requirments.txt. Both Decimer and MolScribe use rdkit to generate SMILES, and using older versions may lead to unexpected effects in the resulting SMILES (unwanted hybridization, for example).

Usage

Invoking help:

python3 drughunter_extractor.py -h

will produce:

usage: drughunter_extractor.py [-h] [-y YEAR] [-m MONTH] [-u URL] [--seg_dir SEG_DIR] [--decimer_off] [--text] [--direction DIRECTION] [--separator SEPARATOR]

DrugHunter extractor

options:
  -h, --help            show this help message and exit
  -y YEAR, --year YEAR  (int) targeted year of drughunter molecules of the month set
  -m MONTH, --month MONTH
                        (str) targeted month range of the molecules of the month set, input either two numbers separated by a dash or a single number (borders of the range are included)
  -u URL, --url URL     (str) url of webpage with targeted set (in case the format of drughunter url changes, which is likely)
  --seg_dir SEG_DIR     (str) directory that the segmented segments will be saved into, if unspecified, segments will not be saved
  --decimer_off         Turns off decimer complementation
  --text                Turns on text extraction
  --direction DIRECTION
                        Specifies in which direction the text is from the molecules
  --separator SEPARATOR
                        Specifies which separator is used in the document to separate name and target.

Extract from url (without text)

-u, --url Specify the url containing pdf (or pdfs) that the script will attempt to extract molecules from

python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/

The script will attempt to access the webpage and then list all links to pdf files found on it. You can either select a specific pdf link or download all of them for the subsequent extraction.

Attempting to download pdf files from https://drughunter.com/resource/2022-drug-approvals/
0: DH-2022-Small-Drug-Approvals-v3.pdf
1: DH-2022-Large-Drug-Approvals-_R1.pdf
2: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
3: DH-2022-First-in-Class-Large-Molecules.pdf
Enter the index of the file name you would like to download.
Enter 'a' to download all pdf files.
Enter 'q' to quit.

Input:

2
Selected file: EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf
EC-edits_DH-2022-First-in-Class-Small-Molecules-R1..pdf downloaded successfully.

Now simply wait for the script to perform segmentation, recognition and validation. The results will be exported into a csv in the results directory.
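The link-listing step shown above can be sketched in a few lines. This is only an illustration: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies, and the class name `PdfLinkCollector`, the sample HTML and the `example.com` URL are all made up for the example rather than taken from the script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> targets that end in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page url.
            self.links.append(urljoin(self.base_url, href))


# Made-up HTML standing in for a downloaded page (assumption, not real data).
html = '<a href="/files/set-a.pdf">A</a> <a href="/about/">About</a>'
collector = PdfLinkCollector("https://example.com/resource/")
collector.feed(html)

# Print an indexed list of file names, similar in shape to the prompt above.
for index, link in enumerate(collector.links):
    print(f"{index}: {link.rsplit('/', 1)[-1]}")
```

Resolving each link with `urljoin` keeps the later download step simple, since every collected entry is already an absolute url.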

Extract from url (with text) (recommended only for extracting text out of DrugHunter)

Use --direction with either "down" or "right" to specify where the text you're looking to extract is located relative to the molecule.

Use --separator to specify the separator if the text information contains a "NAME SEPARATOR TARGET" line.

python3 drughunter_extractor.py --url https://drughunter.com/resource/2022-drug-approvals/ --text --direction "down" --separator ','
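To make the role of `--separator` more concrete, here is a minimal sketch of how a "NAME SEPARATOR TARGET" line could be split. It assumes the name is everything before the first separator and the target is everything after it; `split_name_and_target` is a hypothetical helper and the sample line is invented, so the script's actual handling may differ.

```python
def split_name_and_target(line: str, separator: str) -> tuple[str, str]:
    """Split 'NAME <separator> TARGET' at the first separator occurrence."""
    name, found, target = line.partition(separator)
    if not found:
        # No separator on this line: treat the whole line as the name.
        return line.strip(), ""
    return name.strip(), target.strip()


# Made-up input used purely for illustration.
print(split_name_and_target("examplinib, kinase X inhibitor", ","))
# ('examplinib', 'kinase X inhibitor')
```

Splitting only at the first occurrence means a target description that itself contains the separator stays in one piece.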

Extract all Molecules of the Month within a given year

-y, --year Specify the year that the Molecules of the Month sets you're targeting were published in

Use the --text switch if you'd like the script to attempt to extract text information from the pdfs as well

python3 drughunter_extractor.py --year 2023 --text

Please note that at the time of writing both Decimer and MolScribe generate quite a few warnings. This is expected behaviour, so as long as the script exports results, everything is working as intended.

Extract Molecules of the Month within a specified month range

-m, --month Specify the range of months that the Molecules of the Month sets you're targeting were published in

Ex. I want to download all sets published between February(2) and September(9) of 2022, but no text

python3 drughunter_extractor.py --year 2022 --month 2-9

Ex. I want to download the May 2023 set, with text

python3 drughunter_extractor.py --year 2023 --month 5 --text

Ex. I want to download the June 2023 set, without using decimer to complement the results

python3 drughunter_extractor.py --year 2023 --month 6 --decimer_off
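The help text describes the `--month` value as either a single number or two numbers separated by a dash, with both borders included. A minimal sketch of parsing such a value is shown here; `parse_month_range` is a hypothetical helper written for illustration, and the 1 to 12 bounds check is an assumption rather than documented behaviour of the script.

```python
def parse_month_range(value: str) -> list[int]:
    """Turn '5' or '2-9' into an inclusive list of month numbers."""
    parts = value.strip().split("-")
    if len(parts) == 1:
        start = end = int(parts[0])
    elif len(parts) == 2:
        start, end = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'M' or 'M-N', got {value!r}")
    # Assumed sanity check: months are 1..12 and the range is not reversed.
    if not (1 <= start <= end <= 12):
        raise ValueError(f"months must satisfy 1 <= start <= end <= 12, got {value!r}")
    return list(range(start, end + 1))


print(parse_month_range("2-9"))  # [2, 3, 4, 5, 6, 7, 8, 9]
print(parse_month_range("5"))    # [5]
```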

Workflow and used libraries

  1. The webpage is accessed through requests
  2. BeautifulSoup is used to gather all pdf links on the webpage
  3. Pdfs are downloaded using those links
  4. Pdfs are segmented into individual images:
     a) by default, using decimer-image-segmentation
     b) if the Molecules of the Month set is targeted, using rectangle detection instead
  5. If the --text option is on, the bounding boxes of these segments are used to attempt text extraction as well (using fitz from pymupdf)
  6. The segments are recognized using MolScribe
  7. InChIKeys are generated from the recognized SMILES using rdkit
  8. The InChIKeys are looked up in the UniChem database (connectivity search) using chembl_webresource_client in order to assess their validity
  9. Segments that were not validated by UniChem are recognized again with DECIMER-Image_Transformer. This is done because Decimer and MolScribe are good at recognizing different molecules, and because Decimer is significantly slower, it is preferable to use it only on the necessary portion of the segments
  10. Validation is run again on the new results
  11. The results are filtered so that no duplicate InChIKeys are present
  12. The results are exported into a csv file using a pandas dataframe
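The control flow behind the recognition, validation and de-duplication steps (fast recognizer first, slower recognizer only on what failed validation, then duplicate removal) can be sketched as follows. Everything in this block is a stand-in: the function names are hypothetical, plain strings play the role of image segments, a small set plays the role of the UniChem lookup, and duplicates are detected on the recognized string itself rather than on an InChIKey computed with rdkit.

```python
from typing import Callable, Optional

Recognizer = Callable[[str], Optional[str]]  # segment id -> SMILES (or None)
Validator = Callable[[str], bool]            # SMILES -> passes validation?


def recognize_with_fallback(
    segments: list[str],
    primary: Recognizer,
    fallback: Recognizer,
    is_valid: Validator,
) -> dict[str, str]:
    """Run the fast recognizer on everything, the slow one only on failures."""
    validated: dict[str, str] = {}
    pending: list[str] = []
    for segment in segments:
        smiles = primary(segment)
        if smiles is not None and is_valid(smiles):
            validated[segment] = smiles
        else:
            pending.append(segment)
    # Second pass: only the segments the primary recognizer could not settle.
    for segment in pending:
        smiles = fallback(segment)
        if smiles is not None and is_valid(smiles):
            validated[segment] = smiles
    return validated


def drop_duplicates(results: dict[str, str]) -> dict[str, str]:
    """Keep the first segment seen for each distinct key."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for segment, key in results.items():
        if key not in seen:
            seen.add(key)
            unique[segment] = key
    return unique


# Toy stand-ins (assumptions): dict lookups act as recognizers, a set as the database.
known = {"CCO", "c1ccccc1"}
primary = {"seg1": "CCO", "seg2": "C?C", "seg3": "CCO"}.get
fallback = {"seg2": "c1ccccc1"}.get
result = drop_duplicates(
    recognize_with_fallback(["seg1", "seg2", "seg3"], primary, fallback, known.__contains__)
)
print(result)  # {'seg1': 'CCO', 'seg2': 'c1ccccc1'}
```

Passing the recognizers and the validator in as callables keeps the expensive second pass limited to the `pending` list, which is the point of running the slower model last.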
