Intelligent Mission and Scientific Instrument Classification (IMaSC): applying NLP approaches to improve information extraction from scientific papers (Foundry A-Team Studies).
Available datasets can be found in the `data` directory. The `microwave_limb_sounder` dataset contains a dump of an Elasticsearch index, whose documents include their parsed text (PDFMiner was used to extract the text from the PDF documents). The dataset also contains some, but not all, of the source PDFs: there are 1109 JSON documents but only 604 PDFs. The PDFs could be used with an alternative means of text extraction, if desired, to generate new machine-readable data for use in modeling.
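For example, a re-extraction pass could use PyMuPDF instead of PDFMiner. A minimal sketch, assuming PyMuPDF is installed (`pip install pymupdf`) and using a hypothetical file path:

```python
import fitz  # PyMuPDF

def extract_pdf_text(path):
    """Extract plain text from a PDF, one page at a time."""
    doc = fitz.open(path)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()

# Hypothetical filename; substitute a PDF that actually exists in the dataset.
text = extract_pdf_text("data/microwave_limb_sounder/example.pdf")
```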
To generate training, validation, and testing sets, run `parser.py` with default inputs. This will generate the three files `training_set.jsonl`, `validation_set.jsonl`, and `testing_set.jsonl` in the `data/microwave_limb_sounder` directory.
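Prodigy's `ner.manual` recipe expects each line of these JSONL files to be a JSON object with a `text` field, so a line should look roughly like the following (the sentence itself is illustrative, not taken from the dataset):

```json
{"text": "The Microwave Limb Sounder (MLS) aboard the Aura satellite measures atmospheric composition."}
```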
Prodigy will allow you to annotate your datasets. Please note that the Prodigy wheel installation path in `requirements.txt` is currently specific to my laptop.
- To start Prodigy, run `prodigy ner.manual name_of_dataset name_of_model ./path/to/dataset.jsonl --label INSTRUMENT,SPACECRAFT` in the `IMaSC` directory. In my case I ran `prodigy ner.manual train_imasc en_core_web_sm ./data/microwave_limb_sounder/training_set.jsonl --label INSTRUMENT,SPACECRAFT`.
- From there, open a browser and enter http://localhost:8080/ in the address bar. Prodigy runs on port 8080 by default.
- To annotate text, click the label you wish to apply and highlight the text you wish to annotate. Prodigy will automatically apply the annotation.
- To remove an annotation, hover over the top left corner of an existing annotation and click the "X".
- Once you have finished annotating a piece of text (you may not need to annotate anything), click the green check mark. If a piece of text is not appropriate for annotation, click the grey "no" symbol to skip it.
- You can also click the grey return arrow to undo an annotation.
- To export annotations as a JSONL file, run `prodigy db-out train_imasc > ./annotations.jsonl`.
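Each exported line carries the accepted entity spans with character offsets. A hand-written illustration of the shape (not actual output from this dataset):

```json
{"text": "UARS/MLS ozone measurements", "spans": [{"start": 0, "end": 4, "label": "SPACECRAFT"}, {"start": 5, "end": 8, "label": "INSTRUMENT"}], "answer": "accept"}
```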
Currently, IMaSC supports labeling of scientific instruments (e.g. MLS) and the spacecraft (e.g. the Aura satellite) that carry them.
Using the directions above, label all instances of scientific instruments and spacecraft in the text.
While reading through the data, look out for acronyms, names containing "Satellite" or something similar (likely spacecraft), and words ending in "meter" or similar (likely instruments).
When you identify a potential token, check the `annotations.md` file to see if it's already listed there. If the term is listed as a spacecraft or instrument, annotate it accordingly. If it's listed as something else ("model" or "other"), ignore it.
If you think you've found a token but it isn't in `annotations.md`, try to classify it on your own (a rough helper script is sketched after this list):
- Use context. Sentences like "aboard the XXXX" or "on the YYYY" might indicate spacecraft, while "measurements taken by ZZZZ" might indicate instruments.
- Often, terms expressed in the form "XXX/YYY", like "UARS/MLS", follow the pattern "SPACECRAFT/INSTRUMENT".
- Google the term along with "NASA", like "MLS NASA", and read through several articles to see what entity the term might be associated with.
- If you're really unsure, it's best to note it somewhere and see if it comes up again; compare the contexts it appears in.
- If you can find information but it is ambiguous, e.g. you can't tell whether something is an experiment or an instrument (such as HALOE), just pick a label and stay consistent with it.
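The surface cues above can also be turned into a quick triage script. A rough sketch (not part of the IMaSC codebase; the patterns simply encode the heuristics listed above, and the trailing "?" marks labels as tentative):

```python
import re

# Tentative patterns derived from the annotation heuristics above.
CANDIDATE_PATTERNS = [
    (re.compile(r"\baboard the ([A-Z][\w-]+)"), "SPACECRAFT?"),
    (re.compile(r"\bmeasurements taken by ([A-Z][\w-]+)"), "INSTRUMENT?"),
    (re.compile(r"\b[A-Z][a-z]+meter\b"), "INSTRUMENT?"),
    (re.compile(r"\b[A-Z]{2,}\s*/\s*[A-Z]{2,}\b"), "SPACECRAFT?/INSTRUMENT?"),
]

def find_candidates(text):
    """Return (matched text, tentative label) pairs for human review."""
    hits = []
    for pattern, label in CANDIDATE_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.group(0), label))
    return hits

print(find_candidates("Ozone measurements taken by MLS aboard the Aura spacecraft."))
```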
Train the model with the following command: `prodigy ner.batch-train train_imasc en_core_web_sm -n 100`. To train a model with only one entity type, run `prodigy ner.batch-train train_imasc en_core_web_sm -n 100 -l ENTITY`.
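Once trained, the model behaves like any other spaCy pipeline. A minimal sketch, assuming the trained model was written to a hypothetical `./models/imasc` directory (e.g. via Prodigy's output option on `ner.batch-train`):

```python
import spacy

# ./models/imasc is a hypothetical output path; adjust to wherever the
# trained model was saved during batch training.
nlp = spacy.load("./models/imasc")

doc = nlp("Measurements taken by MLS aboard the Aura satellite show ozone depletion.")
for ent in doc.ents:
    # Expect INSTRUMENT/SPACECRAFT labels if the model learned them.
    print(ent.text, ent.label_)
```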
A flowchart for how to train your specific model can be found here. About 4000 annotations are needed to train the model.
Included in this repository is a basic API that runs the model on user input data and displays a list of tokens the model found along with coverage data. To use the API, run `python api.py` from the `api_stuff` directory in the terminal. The command line will print a link you can then use to access the API in your default browser.
In your browser, either enter text or drop a PDF (a recommended PDF is provided in this repository) and click “submit.”
Semantic versioning is used for this project. If you are contributing to this project, please use semantic versioning guidelines when submitting your pull request.
Please use the issue tracker to report any unexpected behavior or desired features.
If you would like to contribute to development:
- Fork the repository.
- Create your changes in a branch corresponding to an open issue.
  - Bug fixes should be made in branches with the prefix `fix/`.
  - New capabilities or code improvements should be made in branches with the prefix `feature/`.
- Make a pull request into the repository's `dev` branch.
  - Pull requests should have the prefix [WIP] if they are works in progress.
  - Pull requests should have the prefix [MRG] if they are ready to merge.
- Upon successful completion of the pull request, it will be merged into `master`.
When contributing, please run all existing unit tests, and add new tests as needed when adding new functionality. To run the unit tests, use pytest:

```
python3 -m pytest --cov=IMaSC
```
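New tests follow standard pytest conventions. A minimal, self-contained illustration (the helper below is hypothetical, not an existing IMaSC function):

```python
import json

def load_jsonl_line(line):
    """Hypothetical helper: parse a single line of a JSONL dataset file."""
    return json.loads(line)

def test_load_jsonl_line():
    record = load_jsonl_line('{"text": "MLS flies aboard Aura."}')
    assert record["text"] == "MLS flies aboard Aura."
```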
This project is licensed under the Apache 2.0 license.