Code and Data for arXiv paper: PolyIE: A Dataset of Information Extraction from Polymer Material Scientific Literature.
This repo is built with Python 3.8. Run the following commands to install the dependencies:
pip install -r ./requirements.txt
pip install git+https://github.com/titipata/scipdf_parser
python -m spacy download en_core_web_sm
pip install PyMuPDF
pip install decimer-segmentation
pip install tensorflow
pip install BeautifulSoup4
cde data download
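After installing, a quick sanity check can confirm the dependencies import cleanly. This is a minimal sketch: the module names follow the packages above, and chemdataextractor is assumed to come from requirements.txt (the `cde` command belongs to it).

```python
# Sanity check: verify the parsing dependencies installed correctly.
import scipdf                # scipdf_parser: PDF text parsing via GROBID
import fitz                  # PyMuPDF: image extraction from PDFs
import decimer_segmentation  # molecular-structure segmentation
import tensorflow            # backend required by decimer-segmentation
import bs4                   # BeautifulSoup4: XML/HTML parsing
import chemdataextractor     # assumed from requirements.txt; `cde` is its CLI
import spacy

spacy.load("en_core_web_sm")  # fails if the model download step was skipped
print("All parsing dependencies are importable.")
```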
The Annotation/Data/Ner folder contains manually annotated articles from the Polymer Solar Cells dataset and the Lithium Batteries dataset. The text in these articles is parsed from PDFs, pre-annotated with noisy labels, and then manually annotated in Doccano.
The annotated labels include Compound Name (CN), Property Name (PN), Property Value (PV), and Condition. To view the annotations in Doccano, create a new dataset and upload the JSONL file.
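Each line of the JSONL file is one Doccano record. As a rough sketch (the file name is hypothetical, and the `label` span format follows Doccano's sequence-labeling export), the spans can be read back like this:

```python
import json

# A Doccano sequence-labeling record looks like (schema assumed):
# {"text": "...", "label": [[start_offset, end_offset, "CN"], ...]}
with open("Annotation/Data/Ner/example.jsonl") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        for start, end, tag in record["label"]:
            print(tag, "->", record["text"][start:end])
```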
The Annotation/Data/Relation folder contains manually annotated articles with relations, built on top of the annotated NER files. The relation annotation connects related entities into an n-ary <CN, PN, PV, Condition> tuple.
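As a rough sketch of how the n-ary tuples can be reassembled (the `entities`/`relations` field names follow Doccano's relation export and are an assumption here, as is the grouping of each tuple around a head entity):

```python
import json
from collections import defaultdict

def tuples_from_record(record):
    """Group related entities into <CN, PN, PV, Condition> tuples.

    Assumes Doccano-style relation exports: "entities" carry
    (id, start_offset, end_offset, label) and "relations" carry
    (from_id, to_id); tuples are grouped by their head entity.
    """
    text = record["text"]
    entities = {e["id"]: e for e in record["entities"]}
    groups = defaultdict(dict)
    for rel in record["relations"]:
        for eid in (rel["from_id"], rel["to_id"]):
            e = entities[eid]
            groups[rel["from_id"]][e["label"]] = text[e["start_offset"]:e["end_offset"]]
    return list(groups.values())

with open("Annotation/Data/Relation/example.jsonl") as f:  # hypothetical file name
    for line in f:
        print(tuples_from_record(json.loads(line)))
```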
Running the extraction pipeline on PDFs:
To run the parse pipeline, you will need to specify an input directory, an output directory, and a mention list. An example command is provided below.
Start GROBID using the provided bash script before parsing PDFs:
bash serve_grobid.sh
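You can verify the server is up before parsing; GROBID exposes a health endpoint (the port assumes scipdf's default GROBID setup):

```python
import requests  # install with `pip install requests` if missing

# GROBID's health endpoint returns "true" once the server is ready.
response = requests.get("http://localhost:8070/api/isalive")
print("GROBID alive:", response.text)
```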
Then run the following command to start the parse pipeline:
python main_pipeline.py --pdf_folder ./Data/PDFs --output_folder ./Data/Output --mention_dict power conversion efficiencies
--pdf_folder specifies the folder that contains PDF files
--output_folder specifies the folder to output the parsed results
--mention_dict specifies the keywords that are used to match property names
The parse pipeline will parse text from the PDF files and extract chemical name mentions, property name mentions, and property value mentions. It will also extract all molecular images from the PDF files.
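For orientation, here is a minimal sketch of the text side of that pipeline, using scipdf to parse a PDF through the running GROBID server and a spaCy PhraseMatcher to find property-name mentions (the PDF path is hypothetical, and the keyword comes from the `--mention_dict` example above):

```python
import scipdf
import spacy
from spacy.matcher import PhraseMatcher

# Parse one PDF into structured sections via the running GROBID server.
article = scipdf.parse_pdf_to_dict("Data/PDFs/example.pdf")  # hypothetical path

# Match property-name mentions against the --mention_dict keywords.
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("PN", [nlp("power conversion efficiencies")])

for section in article["sections"]:
    doc = nlp(section["text"])
    for _, start, end in matcher(doc):
        print(section["heading"], "->", doc[start:end].text)
```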
Install the following dependencies in order to run the baselines:
pip install transformers
pip install torch
pip install numpy
To run BERT-based NER baselines, use the following command:
python ./Baselines/Bert_NER/main.py
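To poke at a token-classification model outside that script, a minimal Hugging Face sketch looks like the following (the checkpoint name is a placeholder, not the baseline's actual model):

```python
from transformers import pipeline

# "path/to/ner-checkpoint" is a placeholder; substitute a checkpoint
# fine-tuned on the PolyIE NER labels (CN, PN, PV, Condition).
ner = pipeline("token-classification", model="path/to/ner-checkpoint",
               aggregation_strategy="simple")
for entity in ner("The PCE of PTB7:PC71BM devices reached 9.2%."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```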
To run the dygiepp, PURE, and drug-combo-extract baselines, navigate to the corresponding folders and follow the instructions in their README files.
To run GPT-related baselines, install the following dependency:
pip install openai
To run GPT-based NER baselines, use the following command:
python ./GPT/baseline_gpt_ner.py
To run GPT-based RE baselines, use the following command:
python ./GPT/baseline_gpt_re.py
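Both GPT scripts follow the same pattern: prompt a chat model with an instruction plus a passage. A minimal sketch of that pattern (assuming openai>=1.0 and an OPENAI_API_KEY in the environment; the prompt wording is illustrative, not the one used in the scripts):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
passage = "The PCE of PTB7:PC71BM devices reached 9.2%."
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; use the model named in the scripts
    messages=[
        {"role": "system",
         "content": "Extract CN, PN, PV, and Condition entities as JSON."},
        {"role": "user", "content": passage},
    ],
)
print(response.choices[0].message.content)
```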
NER baseline results on our manually curated dataset are reported in the paper.