GitHunter is a tiny yet powerful crawler infrastructure for collecting OSS projects on GitHub. It queries the GitHub search API and persists the data into a PostgreSQL database.
See here for a description of the collected data.
- Docker
- Golang
- PostgreSQL
To run a dockerized PostgreSQL, check this.
Start a postgres container, following the example command below:
$ docker run \
--name postgres -d \
--restart unless-stopped \
-e POSTGRES_USER=ZJU-SEC \
-e POSTGRES_PASSWORD=<YOUR DB PASSWORD> \
-e POSTGRES_DB=GitHunter \
-p 5432:5432 postgres
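Once the container is up, a client needs a connection string that matches the values passed to `docker run`. The sketch below shows one way to assemble a `postgres://` URL in Go with the standard library only. The `DBConfig` struct and `PostgresURL` function are hypothetical names for illustration, not GitHunter's actual code, and `sslmode=disable` assumes a local container without TLS.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// DBConfig mirrors the database settings used in the docker command above.
// Hypothetical type for illustration; GitHunter may model this differently.
type DBConfig struct {
	Host     string
	User     string
	Password string
	DBName   string
	Port     int
}

// PostgresURL builds a postgres:// connection URL. The user name and
// password are percent-escaped, so characters such as '@' stay intact.
func PostgresURL(c DBConfig) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(c.User, c.Password),
		Host:     net.JoinHostPort(c.Host, strconv.Itoa(c.Port)),
		Path:     "/" + c.DBName,
		RawQuery: "sslmode=disable", // assumption: local container, no TLS
	}
	return u.String()
}

func main() {
	cfg := DBConfig{
		Host:     "localhost",
		User:     "ZJU-SEC",
		Password: "secret", // placeholder, use your own password
		DBName:   "GitHunter",
		Port:     5432,
	}
	fmt.Println(PostgresURL(cfg))
}
```

Building the URL through `net/url` rather than string concatenation avoids broken connection strings when the password contains reserved characters.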
Prepare a config.ini configuration file based on config.ini.tmpl. The configuration options are:
Name | Type | In | Description |
---|---|---|---|
WORKER | integer | APP | Maximum number of parallel workers |
QUEUE_SIZE | integer | APP | Maximum size of the task queue |
LANGUAGE | string | APP | Targeted programming language |
MIN_STAR | integer | APP | Minimum number of stars a repository must have |
GITHUB_TOKEN | string | WEB | GitHub token used to raise the API rate limit |
TRYOUT | integer | WEB | Maximum number of attempts to request a page |
HOST | string | DB | Database host address |
USER | string | DB | Database user name |
PASSWORD | string | DB | Database user password |
DBNAME | string | DB | Database name |
PORT | integer | DB | Database port |
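To make the table concrete, here is a minimal sketch of reading such a file in Go. It assumes that the "In" column names the INI section (`[APP]`, `[WEB]`, `[DB]`), and the sample values are invented for illustration; config.ini.tmpl remains the authoritative layout. `ParseINI` is a hypothetical stdlib-only helper, and GitHunter may well use an INI library instead.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is an assumed config.ini layout; all values are placeholders.
const sample = `
[APP]
WORKER = 8
QUEUE_SIZE = 100
LANGUAGE = Go
MIN_STAR = 100

[WEB]
GITHUB_TOKEN = <YOUR GITHUB TOKEN>
TRYOUT = 3

[DB]
HOST = localhost
USER = ZJU-SEC
PASSWORD = <YOUR DB PASSWORD>
DBNAME = GitHunter
PORT = 5432
`

// ParseINI reads "KEY = VALUE" pairs grouped under [SECTION] headers.
// Blank lines and lines starting with ';' or '#' are skipped.
func ParseINI(text string) (map[string]map[string]string, error) {
	cfg := make(map[string]map[string]string)
	section := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "" || line[0] == ';' || line[0] == '#':
			continue
		case line[0] == '[' && line[len(line)-1] == ']':
			section = strings.TrimSpace(line[1 : len(line)-1])
		default:
			key, value, ok := strings.Cut(line, "=")
			if !ok {
				return nil, fmt.Errorf("line %d: expected KEY = VALUE", n)
			}
			if cfg[section] == nil {
				cfg[section] = make(map[string]string)
			}
			cfg[section][strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return cfg, sc.Err()
}

func main() {
	cfg, err := ParseINI(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Integer options arrive as strings and must be converted explicitly.
	workers, err := strconv.Atoi(cfg["APP"]["WORKER"])
	if err != nil {
		fmt.Println("WORKER must be an integer:", err)
		return
	}
	fmt.Printf("workers=%d db=%s:%s\n", workers, cfg["DB"]["HOST"], cfg["DB"]["PORT"])
}
```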
To build GitHunter:
$ go build GitHunter
To crawl the repositories' metadata:
$ ./GitHunter crawl
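For orientation, the sketch below shows what a request to GitHub's repository search endpoint could look like for a given LANGUAGE, MIN_STAR, and GITHUB_TOKEN. The endpoint, the `language:` and `stars:` qualifiers, and the headers are part of GitHub's public REST API; the function names `SearchURL` and `NewSearchRequest`, and the choice of sorting by stars, are assumptions for illustration and not necessarily what the crawl command does.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// SearchURL builds a repository-search URL for one result page.
// Hypothetical helper; the sort order is an assumption.
func SearchURL(language string, minStar, page int) string {
	v := url.Values{}
	v.Set("q", fmt.Sprintf("language:%s stars:>=%d", language, minStar))
	v.Set("sort", "stars")
	v.Set("order", "desc")
	v.Set("per_page", "100") // 100 is the maximum page size GitHub allows
	v.Set("page", strconv.Itoa(page))
	return "https://api.github.com/search/repositories?" + v.Encode()
}

// NewSearchRequest prepares an authenticated GET request without sending it.
func NewSearchRequest(token, language string, minStar, page int) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, SearchURL(language, minStar, page), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := NewSearchRequest("<YOUR GITHUB TOKEN>", "Go", 100, 1)
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Note that GitHub's search API returns at most 1,000 results per query, so a crawler that wants broader coverage typically has to split the search into narrower queries, for example by star range.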
To clone the repositories:
$ ./GitHunter clone
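How GitHunter schedules its work internally is not described here. As a rough illustration of how settings like WORKER and QUEUE_SIZE are commonly used, the sketch below runs a bounded worker pool: a fixed number of goroutines drain a buffered channel whose capacity caps the number of pending jobs. `RunPool` and the job strings are hypothetical and stand in for real crawl or clone tasks.

```go
package main

import (
	"fmt"
	"sync"
)

// RunPool processes jobs with `workers` goroutines fed from a buffered
// channel of capacity `queueSize`. Results are returned in no particular
// order. workers must be at least 1.
func RunPool(workers, queueSize int, jobs []string, handle func(string) string) []string {
	queue := make(chan string, queueSize)
	results := make(chan string, len(jobs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				results <- handle(job)
			}
		}()
	}

	for _, job := range jobs {
		queue <- job // blocks while the queue is full (backpressure)
	}
	close(queue)
	wg.Wait()
	close(results)

	out := make([]string, 0, len(jobs))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	repos := []string{"owner/a", "owner/b", "owner/c", "owner/d"}
	done := RunPool(2, 2, repos, func(repo string) string {
		return "processed " + repo // placeholder for a crawl or clone step
	})
	fmt.Println(len(done), "repositories handled")
}
```

The buffered queue keeps the producer from racing far ahead of the workers, which bounds memory use when the list of repositories is large.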