This is an automation designed to track the publications of top economics journals using the IDEAS RePEc database.
This program uses the BeautifulSoup and Requests modules to scrape the RePEc website for the top journals and download the metadata for their most recent releases. It then stores this data in a .json file that can be used by other automations. The program is designed to be run monthly to keep the data up to date. As of this writing, IDEAS updates its database on the 2nd day of every month.
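Because the data only changes once a month, a scheduled job can skip runs that would fetch nothing new. The following is a minimal sketch of that check, assuming the update day stays on the 2nd as stated above. The helper names `latest_update` and `needs_refresh` are hypothetical and are not part of this repository.

```python
from datetime import date

# Day of the month on which IDEAS is said to refresh its database (from the text above).
UPDATE_DAY = 2


def latest_update(today: date) -> date:
    """Return the most recent update date that falls on or before `today`."""
    if today.day >= UPDATE_DAY:
        return today.replace(day=UPDATE_DAY)
    # Before the update day, the latest update happened in the previous month.
    if today.month == 1:
        return date(today.year - 1, 12, UPDATE_DAY)
    return date(today.year, today.month - 1, UPDATE_DAY)


def needs_refresh(last_run: date, today: date) -> bool:
    """Return True if an update has been published since `last_run`."""
    return last_run < latest_update(today)
```

A run on the update day itself is treated as having captured that month's data, which is an assumption of this sketch.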
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Start by cloning this repository to your local machine:
git clone https://github.com/joseparreiras/retrack
cd retrack
To run the program, you first need to make sure your system satisfies the module requirements. This can be done using the following command:
pip install -r requirements.txt
The modules that are not pre-installed will be installed automatically.
The documentation for the main program can be accessed by running the help command on the terminal:
python get_articles.py -h
This generates the following output:
usage: get_articles.py [-h] [--input INPUT] [--list] [--range] [--output OUTPUT] [--n_months N_MONTHS] [--n_volumes N_VOLUMES] rankings [rankings ...]

positional arguments:
  rankings              journal rankings

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        path to excel input file
  --list, -l            get a list of journals
  --range, -r           get a range of journals
  --output OUTPUT, -o OUTPUT
                        path to output file
  --n_months N_MONTHS, -m N_MONTHS
                        number of months to get
  --n_volumes N_VOLUMES, -v N_VOLUMES
                        number of volumes to get
The file journals.xlsx in the data folder contains the list of the top 500 journals according to the RePEc ranking. This ranking is used to select the journals that will be downloaded. When run, the program automatically gets the top 500 journals and stores their article metadata in the articles.json file in the data folder. The program can be run using the following command in the terminal:
python get_articles.py data/journals.xlsx
The selection of journals is made by passing the rankings argument to the command above. There are three options for selecting journals:
- Selecting a range of journals by their RePEc rank:
Passing 2 arguments along with the option --range or -r will select the journals ranked from the first argument to the second. For example:
python get_articles.py start_rank end_rank -r
Passing 1 argument along with the option --range or -r will select the journals from rank 1 to end_rank. For example:
python get_articles.py end_rank -r
- Selecting a list of journals by their RePEc rank:
Passing a list of arguments along with the option --list or -l will select the journals with the specified ranks. The ranks must be separated by spaces, and the list flag, which must come last, indicates that the ranks are to be interpreted as a list. For example:
python get_articles.py rank1 rank2 rank3 ... -l
The --list option cannot be used together with the --range option and is taken as the default if no option is specified. Therefore, the above command is equivalent to running:
python get_articles.py rank1 rank2 rank3 ...
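The selection rules above can be sketched as a small function. This is an illustration only, not the code in get_articles.py: the name `select_ranks` is hypothetical, and treating both ends of a range as inclusive is an assumption based on the wording "from the first to the second argument".

```python
from typing import List, Sequence


def select_ranks(rankings: Sequence[int], use_range: bool = False) -> List[int]:
    """Expand the positional `rankings` into the list of ranks to fetch."""
    if not use_range:
        # List mode (the default): the ranks are taken exactly as given.
        return list(rankings)
    if len(rankings) == 1:
        # One value with --range: from rank 1 up to end_rank (assumed inclusive).
        return list(range(1, rankings[0] + 1))
    if len(rankings) == 2:
        # Two values with --range: from start_rank to end_rank (assumed inclusive).
        start, end = rankings
        return list(range(start, end + 1))
    raise ValueError("--range expects one or two ranks")
```

With these assumptions, `select_ranks([3, 7], use_range=True)` covers ranks 3 through 7, while `select_ranks([5, 1, 9])` returns the three ranks unchanged.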
The program also takes the following optional arguments:
- --input or -i: This argument is used to specify the path to the source Excel file. The default value is data/journals.xlsx.
- --output or -o: This argument is used to specify the path to the output JSON file. The default value is data/articles.json.
- --n_months or -m: This argument is used to specify the number of months to get. The default value is 1. Setting it to -1 will get all the articles.
- --n_volumes or -v: This argument is used to specify the number of volumes to get. The default value is 3. Setting it to -1 will get all the volumes.
These arguments can be used in any combination. For example, to get the articles from the last 12 months considering the last 6 volumes of each journal and store them in the "data/foo.json" file, run:
python get_articles.py -o data/foo.json -m 12 -v 6
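For readers who want to see how such an interface fits together, here is a minimal argparse sketch that mirrors the help output and the defaults listed above. It is not the parser from get_articles.py. Parsing the ranks as integers and using a mutually exclusive group for --list and --range are assumptions of this sketch, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults described in this README."""
    parser = argparse.ArgumentParser(prog="get_articles.py")
    # Assumption: ranks are integers; the real program may parse them differently.
    parser.add_argument("rankings", nargs="+", type=int, help="journal rankings")
    parser.add_argument("--input", "-i", default="data/journals.xlsx",
                        help="path to excel input file")
    # Assumption: the two modes are declared mutually exclusive.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--list", "-l", action="store_true",
                      help="get a list of journals")
    mode.add_argument("--range", "-r", action="store_true",
                      help="get a range of journals")
    parser.add_argument("--output", "-o", default="data/articles.json",
                        help="path to output file")
    parser.add_argument("--n_months", "-m", type=int, default=1,
                        help="number of months to get")
    parser.add_argument("--n_volumes", "-v", type=int, default=3,
                        help="number of volumes to get")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["1", "10", "-r", "-m", "12"])
```

Here `args.rankings` is `[1, 10]`, `args.range` is `True`, and `args.n_volumes` keeps its default of 3.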
The default input file is journals.xlsx, which contains the top 500 journals according to the RePEc ranking. This file is generated by the top_journals.py program, which can be used to get the top N journals by running the following command:
python top_journals.py N
I use this program to automatically get the latest issues of my desired top journals and add them to my task manager, Things. This is done using Things' new Apple Shortcuts feature, which I used to create this shortcut. This tutorial can be replicated on macOS only. To replicate it, you first need to create an automation that runs this program every month. To do this, open the Automator app and create a new service. Then, add a Run Shell Script action and paste the following code:
cd /path_to_repo/retrack
python get_articles.py other_arguments
shortcuts run "ReTrack" -i data/articles.json
Save this into your Automator iCloud folder. Then, open the Calendar app, create a new event, and schedule it to repeat as you like. Finally, click Alert > Custom, select Open File, then Other, and find the Automator file you just created. This will run the program every time the event is triggered.
If you don't use Things, there is a version of this shortcut that exports the articles to a Markdown file. It can be found here. The Markdown version can also be created with the markdown_export.py file. To do this, change the Automator file to:
cd /path_to_repo/retrack
python get_articles.py other_arguments
python markdown_export.py -i data/articles.json -o out/output_file_name.md
For more information on how to use the markdown export, run:
python markdown_export.py -h
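To give a rough idea of what a JSON-to-Markdown export involves, here is a minimal sketch. It is not the implementation in markdown_export.py. The layout of articles.json is not documented in this README, so the structure assumed here (a mapping from journal name to a list of articles with `title` and `url` keys) is purely hypothetical, as are the function name `to_markdown` and the sample entry.

```python
import json
from typing import Dict, List


def to_markdown(articles_by_journal: Dict[str, List[dict]]) -> str:
    """Render one heading per journal followed by a bullet link per article."""
    lines = []
    for journal, articles in articles_by_journal.items():
        lines.append(f"## {journal}")
        for article in articles:
            lines.append(f"- [{article['title']}]({article['url']})")
        lines.append("")  # blank line after each journal section
    return "\n".join(lines)


# Hypothetical sample data; the real articles.json may be structured differently.
sample = json.loads(
    '{"Hypothetical Journal": '
    '[{"title": "Example Article", "url": "https://example.org/a1"}]}'
)
markdown = to_markdown(sample)
```

Adapting the sketch to the real file would mainly mean changing the keys read inside the loop.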
- Beautiful Soup - Web Scraping
- IDEAS RePEc - Database
- Apple Shortcuts - Automation
- Apple Automator - Automation
- Things - Task Manager
- @joseparreiras - Idea & Initial work