GitHunter

GitHunter is a small crawler for collecting open-source projects on GitHub. It queries the GitHub search API and persists the results to a PostgreSQL database.
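
As a rough illustration of the kind of request involved, the sketch below builds a GitHub repository-search URL from a language and a minimum star count, then attaches a token to the request. It is not GitHunter's actual code: the function names (`searchQuery`, `buildSearchURL`, `newSearchRequest`), the sort order, and the page size of 100 are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// searchQuery builds the search qualifier string, e.g. "language:Go stars:>=100".
// (Hypothetical helper; the qualifiers themselves are standard GitHub search syntax.)
func searchQuery(language string, minStar int) string {
	return fmt.Sprintf("language:%s stars:>=%d", language, minStar)
}

// buildSearchURL returns the repository-search URL for one result page.
// Sorting by stars and per_page=100 are assumptions, not confirmed GitHunter settings.
func buildSearchURL(language string, minStar, page int) string {
	q := url.Values{}
	q.Set("q", searchQuery(language, minStar))
	q.Set("sort", "stars")
	q.Set("order", "desc")
	q.Set("per_page", "100")
	q.Set("page", fmt.Sprint(page))
	return "https://api.github.com/search/repositories?" + q.Encode()
}

// newSearchRequest prepares a GET request and, if a token is given,
// adds it so the call counts against the authenticated rate limit.
func newSearchRequest(rawURL, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	u := buildSearchURL("Go", 100, 1)
	req, err := newSearchRequest(u, "")
	if err != nil {
		panic(err)
	}
	// No network call is made here; the request is only constructed and printed.
	fmt.Println(req.Method, req.URL)
}
```

Here the `LANGUAGE` and `MIN_STAR` settings would map onto the `language:` and `stars:>=` qualifiers, and `GITHUB_TOKEN` onto the `Authorization` header.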

Check here for a description of the collected data.

⚙️ Prerequisites

  • Docker
  • Golang
  • PostgreSQL

💡 Dockerized PostgreSQL

For details on running PostgreSQL in Docker, check this.

Start a PostgreSQL container with a command like the following:

$ docker run \
  --name postgres -d \
  --restart unless-stopped \
  -e POSTGRES_USER=ZJU-SEC \
  -e POSTGRES_PASSWORD=<YOUR DB PASSWORD> \
  -e POSTGRES_DB=GitHunter \
  -p 5432:5432 postgres

📄 Configuration

Create a config.ini file based on config.ini.tmpl. The configuration options are:

| Name | Type | In | Description |
| --- | --- | --- | --- |
| WORKER | integer | APP | Maximum number of parallel workers |
| QUEUE_SIZE | integer | APP | Maximum size of the work queue |
| LANGUAGE | string | APP | Target programming language |
| MIN_STAR | integer | APP | Minimum number of stars a repository must have |
| GITHUB_TOKEN | string | WEB | GitHub token used to raise the API rate limit |
| TRYOUT | integer | WEB | Maximum number of retries when requesting a page |
| HOST | string | DB | Database host address |
| USER | string | DB | Database user name |
| PASSWORD | string | DB | Database user password |
| DBNAME | string | DB | Database name |
| PORT | integer | DB | Database port |
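
To make the specification concrete, the sketch below parses a sample configuration and derives a PostgreSQL connection string from the DB settings. It assumes that the "In" column corresponds to INI section names (`[APP]`, `[WEB]`, `[DB]`); the authoritative layout is config.ini.tmpl. The sample values, the helper names `parseINI` and `postgresDSN`, and `sslmode=disable` are illustrative assumptions, not part of GitHunter.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleConfig is a hypothetical config.ini; all values are placeholders.
const sampleConfig = `
[APP]
WORKER = 8
QUEUE_SIZE = 64
LANGUAGE = Go
MIN_STAR = 100

[WEB]
GITHUB_TOKEN = changeme
TRYOUT = 3

[DB]
HOST = localhost
USER = ZJU-SEC
PASSWORD = secret
DBNAME = GitHunter
PORT = 5432
`

// parseINI reads a minimal INI document into section -> key -> value.
// It handles only [section] headers, key = value pairs, and ; or # comments.
func parseINI(src string) map[string]map[string]string {
	cfg := map[string]map[string]string{}
	section := ""
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, ";") || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = strings.TrimSpace(line[1 : len(line)-1])
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue // skip lines that are not key = value pairs
		}
		if cfg[section] == nil {
			cfg[section] = map[string]string{}
		}
		cfg[section][strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return cfg
}

// postgresDSN builds a keyword/value connection string from the DB section.
// sslmode=disable is an assumption that suits a local container only.
func postgresDSN(db map[string]string) string {
	return fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=%s sslmode=disable",
		db["HOST"], db["PORT"], db["USER"], db["PASSWORD"], db["DBNAME"])
}

func main() {
	cfg := parseINI(sampleConfig)
	fmt.Println("workers:", cfg["APP"]["WORKER"])
	fmt.Println("dsn:", postgresDSN(cfg["DB"]))
}
```

The DB values should match the `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` variables passed to the container, and `PORT` should match the published port.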

🛠️ Build

$ go build GitHunter

🚀 Run

To crawl the repositories' metadata:

$ ./GitHunter crawl

To clone the repositories:

$ ./GitHunter clone