Dialect map: text job

About

This repository contains the PDF-to-TXT transformation job that runs on every new ArXiv paper. In addition, it retrieves the papers' metadata from one of the following sources and sends it to the Dialect Map private API:

The public ArXiv Kaggle dataset.
The public ArXiv export API.

Dependencies

Python dependencies are specified in several files within the reqs directory.

In order to install all the development packages, as well as the defined commit hooks:

make install-dev

Formatting

All Python files are formatted using Black, with the custom properties defined in the pyproject.toml file.

make check

Testing

Project testing is performed using Pytest. In order to run the tests:

make test

CLI 🚀

The project contains a main.py module exposing a CLI with several commands:

python3 src/main.py [OPTIONS] [COMMAND] [ARGS]...

Command: `text-job`

This command starts a process that recursively traverses a file system tree of PDF files, transforming each one into its TXT equivalent.

| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|----------|--------------|----------|-------------|
| `--input-files-path` | - | Yes | Path to the list of input PDF files |
| `--output-files-path` | - | Yes | Path to store the output TXT files |
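
The traversal described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `plan_conversions` and `run_text_job` are hypothetical, the PDF extraction step is passed in as a caller-supplied `extract_text` callable, and mirroring the input directory layout under the output path is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple


def plan_conversions(input_root: Path, output_root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf_path, txt_path) pairs for every PDF under input_root.

    Assumption: the output tree mirrors the input tree, with each
    ".pdf" suffix replaced by ".txt".
    """
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        relative = pdf_path.relative_to(input_root)
        yield pdf_path, (output_root / relative).with_suffix(".txt")


def run_text_job(
    input_root: Path,
    output_root: Path,
    extract_text: Callable[[Path], str],
) -> int:
    """Convert every PDF found and return the number of files written.

    extract_text is a hypothetical hook: any function that takes a PDF
    path and returns its text content.
    """
    written = 0
    for pdf_path, txt_path in plan_conversions(input_root, output_root):
        # Create intermediate directories so nested inputs keep their layout.
        txt_path.parent.mkdir(parents=True, exist_ok=True)
        txt_path.write_text(extract_text(pdf_path), encoding="utf-8")
        written += 1
    return written
```

Keeping the extraction step as a parameter lets the traversal be exercised with a stub, without depending on any particular PDF library.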

Command: `metadata-job`

This command starts a process that recursively traverses a file system tree of PDF files, sending their metadata to the Dialect Map private API along the way. The process assumes that each PDF is an ArXiv paper whose file name is its ArXiv ID.

| ARGUMENT | ENV VARIABLE | REQUIRED | DESCRIPTION |
|----------|--------------|----------|-------------|
| `--input-files-path` | - | Yes | Path to the list of input PDF files |
| `--input-metadata-urls` | - | Yes | URLs to the paper metadata sources |
| `--gcp-key-path` | - | Yes | GCP Service account key path |
| `--output-api-url` | - | Yes | Private API base URL |
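
To illustrate the file-name-as-ID assumption, the sketch below derives paper IDs from the PDF file names and builds a query URL for the public ArXiv export API. The helper names `paper_ids` and `export_api_url` are hypothetical, and batching all IDs into one request is an assumption. The request to the private API is left out, since its routes are not described here.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlencode

# Public ArXiv export API query endpoint.
ARXIV_EXPORT_API = "http://export.arxiv.org/api/query"


def paper_ids(input_root: Path) -> List[str]:
    """Return the sorted ArXiv IDs of all PDFs under input_root.

    Assumption (stated in the command description): each file name,
    minus the ".pdf" suffix, is the paper's ArXiv ID.
    """
    return sorted(pdf.stem for pdf in input_root.rglob("*.pdf"))


def export_api_url(ids: List[str], base_url: str = ARXIV_EXPORT_API) -> str:
    """Build a single export API query URL covering all given IDs."""
    query = urlencode(
        {"id_list": ",".join(ids), "max_results": len(ids)},
        safe=",",  # keep the ID separator readable in the URL
    )
    return f"{base_url}?{query}"
```

For example, a tree containing `2101.00001.pdf` and `2101.00002.pdf` would yield a URL whose `id_list` parameter is `2101.00001,2101.00002`.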


dialect-map/dialect-map-job-text
