Machine learning model that can predict the popularity of GitHub repository just by giving your repo URL in the input. Here, popularity means the number of stars ✨ it can get in the future. So, for data we use scripts to scrap data from github.
Folder Notebooks
contains data and script to extract data, analysis of data or the model creation code. We have used github api and Kaggle to collect the github data stored in the file github_api.csv
and kaggle_data.csv
respectively which has columns repo_name
, star
, fork
, watch
, issue
, tags
, most_used_lang
, discription
, contributors
, license
, and repo_url
.
data_extraction.ipynb
file contains script to extract the information from repositories, analysis.ipynb
file contains cleaning and visualization operations on the dataset. model.ipynb
building a machine learning model that can predict which repositories will gain how much stars
in the future. 😃
- Create an virual environment:
python -m venv "evironment_name"
For more details follow this link.
-
Activate the Environment:
-
For Windows:
."evironment_name"\Scripts\activate
-
For Mac or Linux:
source "evironment_name"/bin/activate
-
-
Install the required dependencies:
pip install -r requirement.txt
- Clone the repository:
git clone https://github.com/pcsingh/Predicting-Repo-popularity.git
- Enter into the directory:
cd Predicting-Repo-popularity
- To extract the github repo data using github api run data_extraction.ipynb notebook.
Github has limits on the number of requests using github api, so you need to use your github token in order to extract data. To generate your github token go to https://github.com/settings/tokens.
GitHub api requires headers for authorization.
header={'Accept':'application/vnd.github.mercy-preview+json',
'visibility':'PUBLIC',
"Authorization": "token PASTE_YOUR_GITHUB_TOKEN_HERE"
}
Replace the PASTE_YOUR_GITHUB_TOKEN_HERE
with your github token.
-
To visualize some insight of the dataset run analysis.ipynb
-
For training the model run model.ipynb file, we have used multiple regressions model, but one with the best R2 score is used for making prediction.
- Run streamlit in order to make prediction using trained model:
streamlit run app.py
Note: Remember to paste the github token in the model.ipynb notebook and app.py file.
Click here to try now..... 🤗