Paper (accepted to MSR'18). Presentation.
This dataset consists of two parts:
- Siva files with Git repositories.
- Index file in CSV format.
Tools:
- pga - explore the dataset, or download its contents easily.
- pga-create - reproduce PGA dataset generation.
- borges-indexer - export a CSV file with metadata from repositories fetched with Borges.
To see the full list of repositories in the dataset or to download it, you will need to install pga. Install Go and then run:
go get github.com/src-d/datasets/PublicGitArchive/pga
Then to list all of the repositories in the dataset, simply run:
pga list
If you'd rather get a detailed dump of the dataset (not including the file contents), you can choose either pga list -f json or pga list -f csv.
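As a rough illustration of working with the CSV dump, the Python sketch below reports the column names and the number of data rows in an index. It assumes only that the first row is a header. The column names and URLs in the sample are made up for the demonstration, and `summarize_index` is a hypothetical helper, not part of pga.

```python
import csv
import io
from typing import Iterable, List, Tuple


def summarize_index(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for a CSV index.

    Assumes the first row is a header. Column names are not hard-coded
    because this document does not specify them.
    """
    reader = csv.reader(lines)
    header = next(reader, [])           # empty input yields an empty header
    row_count = sum(1 for _ in reader)  # every remaining row is one repository
    return header, row_count


# Made-up sample with hypothetical column names, for illustration only.
sample = io.StringIO(
    "URL,LANGS\n"
    "https://example.com/a,Go\n"
    "https://example.com/b,Java\n"
)
columns, count = summarize_index(sample)
print(columns, count)
```

To try it on the real index, one option is to save the output of pga list -f csv to a file, open it with `newline=""` as the `csv` module recommends, and pass the file object to `summarize_index`.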
To download the full dataset, execute:
pga get
Or if you want to download only those repositories containing at least a line of Java code:
pga get -l java
The pga command accepts a -j/--workers argument, which specifies the number of download threads to run; it defaults to 10.
For more information, check the pga documentation, or simply run pga -h.
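For scripted downloads, the options described above can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. It uses just the -l and -j flags mentioned in this document, and it assumes that the two can be combined in a single invocation, which is not stated here. `build_pga_get_command` is a hypothetical helper name.

```python
from typing import List, Optional


def build_pga_get_command(language: Optional[str] = None,
                          workers: Optional[int] = None) -> List[str]:
    """Build an argument list for `pga get`.

    Assumption: -l (language filter) and -j (worker count) may be
    passed together; check `pga -h` before relying on this.
    """
    command = ["pga", "get"]
    if language is not None:
        command += ["-l", language]
    if workers is not None:
        if workers < 1:
            raise ValueError("workers must be a positive integer")
        command += ["-j", str(workers)]
    return command


print(" ".join(build_pga_get_command(language="java", workers=20)))
```

The returned list can be handed to `subprocess.run(command, check=True)` once pga is installed and on the PATH. Keeping the arguments as a list avoids shell quoting issues.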
Refer to pga-create documentation for more details about how PGA is generated.
We understand that some GitHub projects may become private or be deleted over time. Previous dataset snapshots will continue to include such dead code. If you are the author and want to remove your project from all present and future public snapshots, please send a request to datasets@sourced.tech.