# SOTorrent Pipeline

The SOTorrent pipeline currently only reads the XML files from the official Stack Overflow data dump into Google BigQuery tables and JSONL files (for export). It will be extended to create the complete SOTorrent dataset.
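The dump files store each record as a single `<row .../>` XML element whose attributes are the column values. As an illustration of this conversion step (a conceptual sketch, not the pipeline's actual code), streaming such a file out as JSONL could look like this:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_jsonl(xml_path, jsonl_path):
    """Convert one dump file (e.g. Posts.xml) to newline-delimited JSON."""
    with open(jsonl_path, "w", encoding="utf-8") as out:
        # iterparse streams the document, so even the largest dump files
        # do not have to fit into memory at once
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "row":
                out.write(json.dumps(dict(elem.attrib)) + "\n")
                elem.clear()  # release the processed element
```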

To run the pipeline in Google Cloud, you need to set the following environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/google-cloud-key.json"
```
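The Google Cloud client libraries read this variable automatically, so no credentials have to be passed in code. For example:

```python
from google.cloud import bigquery

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment;
# no explicit credentials argument is required.
client = bigquery.Client()
```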

First, you need to install the `sotorrent_pipeline` package:

```bash
python3 setup.py install
```

Then, you can upload the XML files to a Google Cloud Storage bucket:

```bash
sotorrent-pipeline --mode upload --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json"
```
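The exact schema of `config.json` is defined by the package itself; the sketch below only illustrates the kind of settings such a configuration typically contains (all keys here are hypothetical, check the package source for the real ones):

```jsonc
{
  // Hypothetical keys, for illustration only.
  "google_cloud_project": "my-gcp-project",
  "bucket_name": "sotorrent-dump",
  "bigquery_dataset": "sotorrent",
  "input_files": ["Posts.xml", "Comments.xml", "PostHistory.xml"]
}
```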

Afterwards, you can run the pipeline itself:

```bash
sotorrent-pipeline --mode pipeline --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json"
```
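Under the hood, importing newline-delimited JSON into BigQuery is a standard load job. A minimal standalone sketch with the `google-cloud-bigquery` client (not the pipeline's actual code; project, bucket, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the table schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/Posts.jsonl",      # placeholder bucket/file
    "my-gcp-project.sotorrent.Posts",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```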

To inspect a failed BigQuery job, you can run:

```bash
sotorrent-pipeline --mode debug --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json" --job_id <JOB_ID>
```
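Conceptually, inspecting a failed job amounts to fetching it by ID and reading its error fields; a minimal sketch with the BigQuery client (not necessarily what the debug mode does internally):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("<JOB_ID>")  # the ID printed by the failed run

print(job.state)         # e.g. 'DONE'
print(job.error_result)  # the error that made the job fail, if any
for error in job.errors or []:
    print(error)         # individual error records
```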

To upload the JSONL files to a Zenodo bucket, you need to set the following environment variable:

```bash
export ZENODO_TOKEN="..."
```
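With the token set, uploading a file to a Zenodo bucket works via the Zenodo REST API as a PUT request against the deposition's bucket URL. A minimal sketch (the bucket URL is a placeholder; the real one is returned in a deposition's `links` when you create or fetch it via the API):

```python
import os
import requests

token = os.environ["ZENODO_TOKEN"]
bucket_url = "https://zenodo.org/api/files/<BUCKET_ID>"  # placeholder

with open("Posts.jsonl", "rb") as fp:
    response = requests.put(
        f"{bucket_url}/Posts.jsonl",
        data=fp,
        params={"access_token": token},
    )
response.raise_for_status()
```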
