# SOTorrent Pipeline

The SOTorrent pipeline currently only reads the XML files from the official Stack Overflow data dump into Google BigQuery tables and JSONL files (for export). It will be extended to create the complete SOTorrent dataset.
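The dump files store each record as a single `<row .../>` XML element whose attributes are the column values. As an illustration of this conversion step (a conceptual sketch, not the pipeline's actual code), streaming such a file out as JSONL could look like this:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_jsonl(xml_path, jsonl_path):
    """Convert one dump file (e.g. Posts.xml) to newline-delimited JSON."""
    with open(jsonl_path, "w", encoding="utf-8") as out:
        # iterparse streams the document, so even the largest dump files
        # do not have to fit into memory at once
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "row":
                out.write(json.dumps(dict(elem.attrib)) + "\n")
                elem.clear()  # release the processed element
```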

To run the pipeline in Google Cloud, you need to set the following environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/google-cloud-key.json"
```
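The Google Cloud client libraries read this variable automatically, so no credentials have to be passed in code. For example:

```python
from google.cloud import bigquery

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment;
# no explicit credentials argument is required.
client = bigquery.Client()
```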

First, you need to install the `sotorrent_pipeline` package:

```bash
python3 setup.py install
```

Then, you can upload the XML files to a Google Cloud Storage bucket:

```bash
sotorrent-pipeline --mode upload --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json"
```
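The exact schema of `config.json` is defined by the package itself; the sketch below only illustrates the kind of settings such a configuration typically contains (all keys here are hypothetical, check the package source for the real ones):

```jsonc
{
  // Hypothetical keys, for illustration only.
  "google_cloud_project": "my-gcp-project",
  "bucket_name": "sotorrent-dump",
  "bigquery_dataset": "sotorrent",
  "input_files": ["Posts.xml", "Comments.xml", "PostHistory.xml"]
}
```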

Afterwards, you can run the pipeline itself:

```bash
sotorrent-pipeline --mode pipeline --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json"
```
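Under the hood, importing newline-delimited JSON into BigQuery is a standard load job. A minimal standalone sketch with the `google-cloud-bigquery` client (not the pipeline's actual code; project, bucket, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the table schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/Posts.jsonl",      # placeholder bucket/file
    "my-gcp-project.sotorrent.Posts",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```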

To inspect a failed BigQuery job, you can run:

```bash
sotorrent-pipeline --mode debug --config_file "/Users/sebastian/git/sotorrent/pipeline/config.json" --job_id <JOB_ID>
```
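Conceptually, inspecting a failed job amounts to fetching it by ID and reading its error fields; a minimal sketch with the BigQuery client (not necessarily what the debug mode does internally):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("<JOB_ID>")  # the ID printed by the failed run

print(job.state)         # e.g. 'DONE'
print(job.error_result)  # the error that made the job fail, if any
for error in job.errors or []:
    print(error)         # individual error records
```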

To upload the JSONL files to a Zenodo bucket, you need to set the following environment variable:

```bash
export ZENODO_TOKEN="..."
```
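With the token set, uploading a file to a Zenodo bucket works via the Zenodo REST API as a PUT request against the deposition's bucket URL. A minimal sketch (the bucket URL is a placeholder; the real one is returned in a deposition's `links` when you create or fetch it via the API):

```python
import os
import requests

token = os.environ["ZENODO_TOKEN"]
bucket_url = "https://zenodo.org/api/files/<BUCKET_ID>"  # placeholder

with open("Posts.jsonl", "rb") as fp:
    response = requests.put(
        f"{bucket_url}/Posts.jsonl",
        data=fp,
        params={"access_token": token},
    )
response.raise_for_status()
```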
