This repository has been archived by the owner on Sep 3, 2024. It is now read-only.

PDF to TXT file transformation job


dialect-map/dialect-map-job-text


Dialect map: text job

CI/CD Status Coverage Status MIT license Code style

About

This repository contains the PDF to TXT transformation job that runs for each new ArXiv paper. It also retrieves each paper's metadata from one of the following sources and sends it to the Dialect map private API:

Dependencies

Python dependencies are specified in the files within the reqs directory.

In order to install all the development packages, as well as the defined commit hooks:

make install-dev

Formatting

All Python files are formatted using Black, with the custom properties defined in the pyproject.toml file. In order to check the formatting:

make check

Testing

Project testing is performed using Pytest. In order to run the tests:

make test

CLI 🚀

The project contains a main.py module exposing a CLI with several commands:

python3 src/main.py [OPTIONS] [COMMAND] [ARGS]...
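
As a rough illustration of this command layout, the following sketch declares the two commands and the options listed in this README using the standard library's argparse. It is not the project's actual main.py, which may be built on a different CLI library. The `build_parser` name is hypothetical, and every option is read as a plain string, since this README does not say how multiple metadata URLs are passed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the commands documented in this README (sketch only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # One sub-command must be chosen: text-job or metadata-job.
    commands = parser.add_subparsers(dest="command", required=True)

    text_job = commands.add_parser("text-job")
    text_job.add_argument("--input-files-path", required=True)
    text_job.add_argument("--output-files-path", required=True)

    metadata_job = commands.add_parser("metadata-job")
    metadata_job.add_argument("--input-files-path", required=True)
    # Assumption: kept as a single string; the real format is not documented here.
    metadata_job.add_argument("--input-metadata-urls", required=True)
    metadata_job.add_argument("--gcp-key-path", required=True)
    metadata_job.add_argument("--output-api-url", required=True)
    return parser
```

argparse converts dashes in option names to underscores, so `--input-files-path` is read back as `args.input_files_path`.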

Command: text-job

This command starts a process that recursively traverses a file system tree of PDF files, transforming each one into its TXT equivalent.

| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|---|---|---|---|
| --input-files-path | - | Yes | Path to the list of input PDF files |
| --output-files-path | - | Yes | Path to store the output TXT files |
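
The recursive traversal described above could look roughly like the following sketch. It is not the project's implementation. The function names are hypothetical, the PDF-to-text step is passed in as a `pdf_to_text` callable because this README does not name the conversion library, and mirroring the input directory layout under the output path is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def iter_pdf_jobs(input_root: Path, output_root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf_path, txt_path) pairs for every PDF found under input_root."""
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        relative = pdf_path.relative_to(input_root)
        # Assumption: the output tree mirrors the input tree, with .txt suffixes.
        yield pdf_path, (output_root / relative).with_suffix(".txt")


def run_text_job(
    input_root: Path,
    output_root: Path,
    pdf_to_text: Callable[[Path], str],
) -> List[Path]:
    """Convert each PDF with the supplied callable and write the TXT result."""
    written = []
    for pdf_path, txt_path in iter_pdf_jobs(input_root, output_root):
        txt_path.parent.mkdir(parents=True, exist_ok=True)
        txt_path.write_text(pdf_to_text(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Passing the converter as a parameter keeps the traversal logic independent of whichever PDF library performs the extraction.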

Command: metadata-job

This command starts a process that recursively traverses a file system tree of PDF files, sending their metadata to the Dialect Map private API along the way. The process assumes that each PDF is an ArXiv paper whose file name is its ArXiv ID.

| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|---|---|---|---|
| --input-files-path | - | Yes | Path to the list of input PDF files |
| --input-metadata-urls | - | Yes | URLs to the paper metadata sources |
| --gcp-key-path | - | Yes | GCP Service account key path |
| --output-api-url | - | Yes | Private API base URL |
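
The file naming assumption can be illustrated with a minimal sketch that derives a paper ID from each PDF file name. The function names are hypothetical. The sketch covers only the ID extraction, not the metadata lookup or the API request, and it keeps any version suffix in the file name, since this README does not say whether the job strips it.

```python
from pathlib import Path
from typing import Iterator


def paper_id_from_path(pdf_path: Path) -> str:
    """Return the file name without its .pdf extension, treated as the ArXiv ID."""
    return pdf_path.stem


def iter_paper_ids(input_root: Path) -> Iterator[str]:
    """Recursively walk input_root and yield one ID per PDF file."""
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        yield paper_id_from_path(pdf_path)
```

With this sketch, a file named `2101.00001v2.pdf` would yield the ID `2101.00001v2`.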
