This downloads GHTorrent data, specifically commit messages and pull request comments.

Instructions for running:
- Clone repo
- Build `main.simg`, or pull it from Singularity Hub.
- If running locally, run the following, where `<NUMBER HERE>` is a number from 1 to 200:

      singularity run main.simg -p initialise
      singularity run main.simg -p singghtorrent/analysis/main.py -a <NUMBER HERE>

- If on phoenix, run `sbatch hpc/download_job_array.sh`.
- Raw data is downloaded to `storage/external/ghtorrent` and deleted once processing finishes.
- Interim data is saved to `storage/interim/ghtorrent` and deleted once processing finishes.
- Final files are stored in `storage/processed/`, saved by day.
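
For local runs, the two `singularity run` commands above could be scripted over a range of job numbers. The following is only a sketch, not part of the repo: the helper names (`init_command`, `job_command`, `run_local`) are hypothetical, and it assumes that `initialise` needs to run once before the jobs and that each numbered job can run independently of the others. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
from typing import List

IMAGE = "main.simg"                          # image name from the instructions above
SCRIPT = "singghtorrent/analysis/main.py"    # script path from the instructions above


def init_command() -> List[str]:
    """Command for the initialisation step."""
    return ["singularity", "run", IMAGE, "-p", "initialise"]


def job_command(number: int) -> List[str]:
    """Command for one job; the instructions give the valid range as 1 to 200."""
    if not 1 <= number <= 200:
        raise ValueError(f"job number must be between 1 and 200, got {number}")
    return ["singularity", "run", IMAGE, "-p", SCRIPT, "-a", str(number)]


def run_local(first: int = 1, last: int = 200, dry_run: bool = True) -> List[List[str]]:
    """Print (dry_run=True) or execute initialise followed by jobs first..last.

    Assumption: initialise runs once up front and jobs do not depend on each other.
    """
    commands = [init_command()] + [job_command(n) for n in range(first, last + 1)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises on the first failing command
    return commands


if __name__ == "__main__":
    # Dry run over the first three job numbers only.
    run_local(1, 3, dry_run=True)
```

Calling `run_local(dry_run=False)` would execute all 200 jobs sequentially, which may be slow; on phoenix the `sbatch` job array is likely the better route.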