Github Repository Classification

This repo contains our GitHub repository classification project for the informatiCup2017 challenge. You can find the PDF report here.

Also have a look at the notebooks Classification_Models.ipynb and Data_Visualization.ipynb.

Requirements

Python 3
Git must be installed and available on the system path

Getting Started

The entry point of the program is main.py. Running python3 main.py starts the test mode, which trains and validates different models. When given a file as a parameter (e.g. python3 main.py data/valset_unclassified.txt), the program classifies all repositories in that file and prints the results to stdout. Saving the trained model with pickle or joblib resulted in an unexplained loss of accuracy when loading it again. That is why we fell back to training the model at the start of the prediction phase.
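The two modes described above can be pictured with a minimal sketch. This is not the code in main.py: the function names (train_model, run_test_mode, classify_file, dispatch), the one-URL-per-line input format, and the placeholder label are all assumptions made for illustration. It only shows the dispatch on the command-line argument and the choice to retrain at the start of every prediction run instead of loading a pickled model.

```python
import sys


def train_model():
    """Hypothetical stand-in for training; returns a predict function."""
    def predict(url):
        # Placeholder label only; a trained model would decide this.
        return "OTHER"
    return predict


def run_test_mode():
    """Hypothetical stand-in for training and validating several models."""
    return "test mode"


def classify_file(path):
    """Train first, then label every repository listed in the file."""
    # Retrain on every run rather than loading a saved model from disk.
    predict = train_model()
    # Assumption: the input file holds one repository URL per line.
    with open(path, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    return [(url, predict(url)) for url in urls]


def dispatch(argv):
    """No argument starts test mode; a file argument starts prediction."""
    if len(argv) < 2:
        return run_test_mode()
    results = classify_file(argv[1])
    for url, label in results:
        print(url, label)  # results go to stdout
    return results


if __name__ == "__main__":
    dispatch(sys.argv)
```

Retraining on each run costs startup time, but it avoids depending on a serialized model whose behavior after loading could not be trusted.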

To speed up training, we provide a CSV file (data/enriched_data.csv) with all feature data for the training dataset precalculated. Let the dataimporter import this CSV file to avoid downloading 20 GB of repositories.
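As a rough illustration of the caching idea, the sketch below reads precomputed feature rows from a CSV file and reports when the file is missing. The function name load_cached_features is hypothetical and is not the project's dataimporter API; the column layout of data/enriched_data.csv is not assumed, since csv.DictReader takes the column names from the header row.

```python
import csv
import os


def load_cached_features(csv_path="data/enriched_data.csv"):
    """Return the precomputed feature rows, or None if the file is missing."""
    if not os.path.exists(csv_path):
        # Without the cache, the repositories would have to be downloaded
        # and the features recomputed.
        return None
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Each row becomes a dict keyed by the CSV header names.
        return list(csv.DictReader(handle))
```

All values come back as strings, so numeric features would still need to be converted before training.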

Categories

Label	Short description
DEV	a repository primarily used for development of a tool, component, application, app, or API
HW repo	a repository primarily used for homework, assignments and other course-related work and code
EDU	a repository primarily used to host tutorials, lectures, educational information and code related to teaching
DOCS	a repository primarily used for tracking and storage of non-educational documents
WEB	a repository primarily used to host static personal websites or blogs
DATA	a repository primarily used to store data sets
OTHER	use this category only if there is no strong correlation to any other repository category, for example, empty repositories


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Github Repository Classification

Requirements

Getting Started

Categories

About

Releases 1

Packages

Contributors 2

Languages

mbornstein/GithubRepoClassification

Folders and files

Latest commit

History

Repository files navigation

Github Repository Classification

Requirements

Getting Started

Categories

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages