This repository contains the PDF to TXT transformation job that runs on every new ArXiv paper. In addition, it retrieves each paper's metadata and sends it to the Dialect Map private API, using one of the following sources:
- The public ArXiv Kaggle dataset.
- The public ArXiv export API.
Python dependencies are specified in multiple files within the reqs directory.
To install all the development packages, as well as the defined commit hooks:
make install-dev
All Python files are formatted using Black, with the custom properties defined in the pyproject.toml file.
make check
Project testing is performed using Pytest. In order to run the tests:
make test
The project contains a main.py module exposing a CLI with several commands:
python3 src/main.py [OPTIONS] [COMMAND] [ARGS]...
This command starts a process that recursively traverses a file system tree of PDF files, converting each one into its TXT equivalent.
ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
---|---|---|---|
--input-files-path | - | Yes | Path to the list of input PDF files |
--output-files-path | - | Yes | Path to store the output TXT files |
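
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `convert_tree` and the `extract_text` callback are hypothetical, and mirroring the input directory layout under the output path is an assumption. The PDF parsing itself is left to the caller, since the library used by this repository is not stated here.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(
    input_root: Path,
    output_root: Path,
    extract_text: Callable[[Path], str],
) -> List[Path]:
    """Write one TXT file under output_root for every PDF under input_root.

    `extract_text` is a hypothetical callback that returns the text of a PDF.
    The output tree mirrors the input tree (an assumption of this sketch).
    """
    written = []
    # rglob walks the tree recursively; sorted() makes the order deterministic.
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        relative = pdf_path.relative_to(input_root)
        # Only the final ".pdf" suffix is swapped, so "2101.00001.pdf"
        # becomes "2101.00001.txt".
        txt_path = (output_root / relative).with_suffix(".txt")
        txt_path.parent.mkdir(parents=True, exist_ok=True)
        txt_path.write_text(extract_text(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Passing the extractor as a callback keeps the traversal testable without real PDF files; any function that takes a path and returns a string can be plugged in.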
This command starts a process that recursively traverses a file system tree of PDF files, sending the metadata of each one to the Dialect Map private API along the way. The process assumes that each PDF is an ArXiv paper whose file name is its ArXiv ID.
ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
---|---|---|---|
--input-files-path | - | Yes | Path to the list of input PDF files |
--input-metadata-urls | - | Yes | URLs to the paper metadata sources |
--gcp-key-path | - | Yes | GCP Service account key path |
--output-api-url | - | Yes | Private API base URL |
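
As a rough illustration of the file-name-as-ID assumption, the sketch below derives paper IDs from a PDF tree and builds a query URL for the public ArXiv export API. The helper names `iter_paper_ids` and `build_query_url` are hypothetical, and the way this repository actually queries its metadata sources or authenticates against the private API is not shown here. The only external detail relied on is the `id_list` and `max_results` query parameters of the ArXiv export API.

```python
from pathlib import Path
from typing import Iterator, List
from urllib.parse import urlencode

# Public ArXiv export API endpoint.
ARXIV_EXPORT_URL = "http://export.arxiv.org/api/query"


def iter_paper_ids(input_root: Path) -> Iterator[str]:
    """Yield one ID per PDF, taking the file name without ".pdf" as the ID."""
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        yield pdf_path.stem


def build_query_url(paper_ids: List[str]) -> str:
    """Build an export API URL requesting the metadata of the given IDs."""
    params = {
        "id_list": ",".join(paper_ids),
        # Ask for as many results as IDs, rather than the API default.
        "max_results": len(paper_ids),
    }
    # safe="," keeps the comma-separated ID list readable in the URL.
    return ARXIV_EXPORT_URL + "?" + urlencode(params, safe=",")
```

The response of that endpoint is an Atom feed, so a real implementation would still need to parse it and map the fields to whatever payload the private API expects.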