arxiv-ml-reviews mainly uses a keyword-based search to extract a list of review articles from arXiv's various categories on machine learning and artificial intelligence.
In a new Python 3.9 virtual environment or container, run `make install` from the project's directory.
Running `python -m arxivmlrev refresh` will rerun the full online search and write the results to `data/articles.csv` and `data/articles.md`.
Use git to check whether the diff of the updated CSV file looks acceptable. If the CSV file is smaller for any reason, the search query failed and should be rerun. Do not run this command excessively, as it burdens the arXiv search server.
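
The size check on the CSV file can be scripted rather than done by eye. The following is a minimal sketch, not part of the project: it compares the row count of the refreshed file against the version at git `HEAD`. The helper names `count_rows`, `committed_text`, and `has_shrunk` are hypothetical, and the sketch assumes the CSV has a single header row.

```python
import csv
import io
import subprocess

CSV_PATH = "data/articles.csv"  # path taken from the text above


def count_rows(csv_text: str) -> int:
    """Count data rows in CSV text, excluding the (assumed) header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return max(len(rows) - 1, 0)


def committed_text(path: str = CSV_PATH) -> str:
    """Return the file's content as of the last commit (requires git)."""
    result = subprocess.run(
        ["git", "show", f"HEAD:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_shrunk(old_text: str, new_text: str) -> bool:
    """True if the refreshed CSV has fewer data rows than the committed one."""
    return count_rows(new_text) < count_rows(old_text)
```

From the project root, `has_shrunk(committed_text(), open(CSV_PATH, encoding="utf-8").read())` would then indicate whether the refresh should be rerun.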
If there is any extraneous new entry in `data/articles.csv`, add a new blacklist entry to `arxivmlrev/_config/articles.csv` and/or `arxivmlrev/_config/terms.csv`. This is expected to be done rarely. Blacklisted entries are those with Presence = 0.
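
To illustrate how the Presence flag separates blacklisted entries from the rest, here is a minimal sketch. Only the `Presence` column and the meaning of 0 come from the text above; the `Term` column name, the sample rows, and the helper `split_by_presence` are assumptions, and the real configuration files may be laid out differently.

```python
import csv
import io

# Hypothetical sample; the real column layout may differ.
SAMPLE = """Term,Presence
survey,1
review,1
book review,0
"""


def split_by_presence(csv_text: str) -> tuple[list[str], list[str]]:
    """Return (other, blacklisted) terms based on the Presence column."""
    other, blacklisted = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Presence = 0 marks a blacklist entry, per the description above.
        target = blacklisted if int(row["Presence"]) == 0 else other
        target.append(row["Term"])
    return other, blacklisted


other, blacklisted = split_by_presence(SAMPLE)
print("blacklisted:", blacklisted)
```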
Before committing these updated configuration files to revision control, consider running `scripts/sort_config_articles.py` and/or `scripts/sort_config_terms.py` respectively. If a configuration file was updated, rerun the refresh command.
Note that a sufficiently long query may cause arXiv to return incomplete results, which would require rearchitecting the search.
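
One possible direction for such a rearchitecture, shown here only as a hedged sketch and not as the project's plan, is to split the search terms into several shorter queries and merge the results afterwards. The function `batch_terms`, the `" OR "` separator, and the character limit are all assumptions; the length at which arXiv starts truncating results is not stated in this document.

```python
def batch_terms(terms: list[str], max_chars: int, sep: str = " OR ") -> list[list[str]]:
    """Greedily group terms so each joined query is at most max_chars long.

    A single term longer than max_chars still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    length = 0
    for term in terms:
        added = len(term) + (len(sep) if current else 0)
        if current and length + added > max_chars:
            # The current batch is full; start a new one with this term.
            batches.append(current)
            current, length = [], 0
            added = len(term)
        current.append(term)
        length += added
    if current:
        batches.append(current)
    return batches


for batch in batch_terms(["survey", "review", "overview", "tutorial"], max_chars=20):
    print(" OR ".join(batch))
```

Results from the separate queries would still need to be merged and deduplicated, and blacklist terms would have to be applied to each batch.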
Running `python -m arxivmlrev refresh-and-publish` will refresh and also conditionally publish the results. Specifically, if the `data/results.csv` file changed but didn't decrease in its number of rows, the command will publish the written markdown file to GitHub per the GitHub-specific configuration in `config.py`. In this configuration file, refer to parameters starting with the prefix `GITHUB_`. The environment variable `GITHUB_ACCESS_TOKEN` is also required.
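
The publish condition can be written as a small predicate. This is a minimal sketch of the rule as described, assuming a single header row; `data_rows` and `should_publish` are hypothetical names, not the project's functions.

```python
import csv
import io


def data_rows(csv_text: str) -> list[list[str]]:
    """Parse CSV text and return its rows without the (assumed) header."""
    return list(csv.reader(io.StringIO(csv_text)))[1:]


def should_publish(old_csv: str, new_csv: str) -> bool:
    """Publish only if the content changed and the row count did not drop."""
    changed = old_csv != new_csv
    not_smaller = len(data_rows(new_csv)) >= len(data_rows(old_csv))
    return changed and not_smaller
```

Under this rule, an unchanged file and a file with fewer rows are both skipped, while a changed file with the same or more rows is published.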
Running `python -m arxivmlrev write-feed` will perform an online search to write the XML file `data/feed.xml`. This file is excluded from git.
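
For readers unfamiliar with what such a feed file contains, here is a minimal sketch that serializes a list of entries into an RSS 2.0 document with the standard library. It assumes RSS 2.0 and a `title`/`link` pair per entry; the function `build_feed` and the sample entry are hypothetical, and the project's actual feed may carry more fields.

```python
import xml.etree.ElementTree as ET


def build_feed(title: str, link: str, entries: list[dict]) -> bytes:
    """Serialize entries into a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


# Placeholder entry for illustration only.
feed = build_feed(
    "Example feed",
    "https://example.com/",
    [{"title": "Example review", "link": "https://example.com/article"}],
)
```

The resulting bytes could then be written out with `pathlib.Path("data/feed.xml").write_bytes(feed)`.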
Running `python -m arxivmlrev write-md` will perform an offline refresh of the markdown file `data/articles.md` from `data/articles.csv`.
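
The offline conversion amounts to rendering CSV rows as markdown. A minimal sketch follows, assuming hypothetical `Title` and `URL` columns and a hypothetical helper `csv_to_markdown`; the real CSV columns and the markdown layout used by the project may differ.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render each CSV row as a markdown bullet linking to the article."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(f"- [{row['Title']}]({row['URL']})")
    return "\n".join(lines) + "\n"


# Placeholder row for illustration only.
sample = "Title,URL\nExample review,https://example.com/article\n"
markdown = csv_to_markdown(sample)
print(markdown, end="")
```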
Running `python -m arxivmlrev publish-md` will publish the markdown file `data/articles.md` to GitHub. This requires GitHub-specific configuration in `config.py`. In this configuration file, refer to parameters starting with the prefix `GITHUB_`. The environment variable `GITHUB_ACCESS_TOKEN` is also required.
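
Before publishing, it can help to verify that both prerequisites are in place. The sketch below collects every parameter with the `GITHUB_` prefix from a configuration object and checks that the access token is set. The `config` stand-in and its parameter names are invented for illustration; only the `GITHUB_` prefix and the `GITHUB_ACCESS_TOKEN` variable come from the text above.

```python
import os
from types import SimpleNamespace

# Stand-in for the project's config module; these parameter names are invented.
config = SimpleNamespace(
    GITHUB_EXAMPLE_REPO="user/repo",
    GITHUB_EXAMPLE_PATH="README.md",
    OTHER_SETTING=1,
)


def github_settings(module) -> dict:
    """Collect attributes whose names start with GITHUB_."""
    return {k: v for k, v in vars(module).items() if k.startswith("GITHUB_")}


def require_token(environ=os.environ) -> str:
    """Return the access token, or raise a clear error if it is unset."""
    token = environ.get("GITHUB_ACCESS_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_ACCESS_TOKEN must be set before publishing")
    return token


print(sorted(github_settings(config)))
```

Since `vars()` also works on imported modules, the same helper could be pointed at the real configuration module.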
Serverless deployment of the RSS feed to Google Cloud Functions is configured. It requires the following files:
- requirements.txt
- main.py (having callable `serve(request: flask.Request) -> Tuple[bytes, int, Dict[str, str]]`)
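
To show the shape of the required callable, here is a minimal sketch of a `serve` function with the signature above. The body and the `Content-Type` header are placeholders chosen for illustration; the project's real `main.py` presumably produces the actual feed. The `flask` import is guarded so that the sketch runs even where Flask is not installed.

```python
from typing import TYPE_CHECKING, Dict, Tuple

if TYPE_CHECKING:  # flask is only needed for type checking in this sketch
    import flask


def serve(request: "flask.Request") -> Tuple[bytes, int, Dict[str, str]]:
    """Return a (body, status, headers) tuple for an HTTP-triggered function."""
    # Placeholder body; a real implementation would return the generated feed.
    body = b'<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel/></rss>'
    headers = {"Content-Type": "application/rss+xml; charset=utf-8"}
    return body, 200, headers
```

Flask accepts a `(body, status, headers)` tuple as a response, which is why the callable can return plain Python values.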
Deployment version updates are not automated. They can be performed manually by editing and saving the function configuration.
These deployment links require access:
- By default, run an incremental update, and provide an option to do a full rerun. An incremental update assumes an unchanged configuration. This requires query results to be sorted by lastUpdatedDate.
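
The incremental update described in the last item could work along these lines. This is a sketch under stated assumptions, not an existing feature: results are assumed to arrive sorted by lastUpdatedDate with the newest first, each result is represented as a dict with a hypothetical `updated` key, and `take_new_results` is a hypothetical helper.

```python
from datetime import datetime


def take_new_results(results: list[dict], last_seen: datetime) -> list[dict]:
    """Collect results updated after last_seen, stopping at the first older one.

    Assumes results are sorted by lastUpdatedDate, newest first.
    """
    fresh = []
    for result in results:
        if result["updated"] <= last_seen:
            break  # everything after this is older, given the sort order
        fresh.append(result)
    return fresh


# Placeholder data for illustration only.
results = [
    {"id": "c", "updated": datetime(2021, 3, 1)},
    {"id": "b", "updated": datetime(2021, 2, 1)},
    {"id": "a", "updated": datetime(2021, 1, 1)},
]
new_ids = [r["id"] for r in take_new_results(results, datetime(2021, 1, 15))]
print(new_ids)
```

Stopping early is what makes the sort order a requirement: without it, older entries could hide newer ones further down the list.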