
Github Users Screening

Hey, this is a screening project to search/screen the top N GitHub profiles based on a natural language query. In the current state of AI there are multiple ways of solving such a problem: using pretrained embeddings, training a TF-IDF model, training a BERT embeddings model, training a T5 embeddings model, and many more.

In this repo, we will look into the first two approaches mentioned above and compare their results as well.

  1. Using pretrained embeddings model
  2. Training a FastText + TF-IDF vectorizer model (inspired by NCS from Meta)

Using this code, I am trying to answer the questions below:

  1. Identify the top N GitHub repos based on the provided input query
  2. Provide an explanation for each repo's relevancy
  3. Identify code snippets relevant to the input query from the top N repos

Tasks required to be performed:

  1. Use GitHub APIs to extract repos for each user. Below are the data points extracted per user (see the API sketch after this list):
    1. Public repositories
    2. Files within the repo (.md, .py, .js, .java)
    3. Languages used in the repo
    4. Stars on the repo
    5. Language used by each file
  2. Parse each code file and extract only the relevant blocks of code (see the parsing sketch after this list). Below are the blocks extracted:
    1. Function definitions
    2. Function calls
    3. For loops
  3. Generate embeddings of the textual and code components. (This is different for the 2nd approach)
  4. Aggregate the generated embeddings using a formula to get repo embeddings
  5. Filter on language if mentioned in the natural language query
  6. Filter on stars if provided as input
  7. Generate embeddings of the input query
  8. Get the top n repos similar to the input query using the embeddings
  9. For each repo:
    1. Get an explanation of the repo from its top 3 components, be they text or code, along with the repo's details - repo name, filename, file contents (top 3)
    2. Get the top 2 code snippets similar to the input query using embedding similarity on the code blocks from the repo.
  10. Provide the output in JSON format.
  11. Build an API on top of this
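
The sketch below illustrates the extraction step (task 1) against the public GitHub REST API using `requests`. The endpoints and response fields are standard GitHub ones; the field selection, and the absence of pagination and authentication handling, are simplifications for illustration.

```python
import requests

GITHUB_API = "https://api.github.com"
# A personal access token can be added here to raise the API rate limits.
HEADERS = {"Accept": "application/vnd.github+json"}

def get_user_repos(username: str) -> list[dict]:
    """Fetch a user's public repos and keep only the data points listed above."""
    resp = requests.get(f"{GITHUB_API}/users/{username}/repos", headers=HEADERS)
    resp.raise_for_status()
    repos = []
    for repo in resp.json():
        # The per-repo language breakdown comes from a separate endpoint.
        langs = requests.get(repo["languages_url"], headers=HEADERS).json()
        repos.append({
            "name": repo["full_name"],
            "stars": repo["stargazers_count"],
            "languages": list(langs.keys()),
        })
    return repos
```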

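For the parsing step (task 2), here is a minimal sketch using py-tree-sitter with the grammar repos cloned in the Setup section, assuming the older `Language.build_library` API. The node type names are the tree-sitter-python ones; this is an illustrative sketch, not the repo's exact extraction code.

```python
from tree_sitter import Language, Parser

# Compile the grammar repos cloned in the Setup section into one shared library
# (older py-tree-sitter API; newer releases ship prebuilt language packages instead).
Language.build_library("build/languages.so", ["build/tree-sitter-python"])
PY_LANGUAGE = Language("build/languages.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

# Node types of interest for the Python grammar; other grammars use different names
# (e.g. "method_declaration" in tree-sitter-java).
BLOCK_TYPES = {"function_definition", "call", "for_statement"}

def extract_blocks(source: bytes) -> list[str]:
    """Walk the syntax tree and return the source text of each relevant block."""
    tree = parser.parse(source)
    blocks, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in BLOCK_TYPES:
            blocks.append(source[node.start_byte:node.end_byte].decode())
        stack.extend(node.children)
    return blocks
```
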
Using pretrained embeddings model (Default)

Overview

For this approach, I am using the SBERT pretrained model all-MiniLM-L6-v2, which has also been trained on StackExchange data. To start, I embed the textual and code components extracted from the repo files and then aggregate the embeddings using the L1/Manhattan distance to get their centroid. The assumption is that each repo forms a cluster, and the repo's collective embedding is the centroid of its text and code component embeddings.
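
A minimal sketch of this approach with the `sentence-transformers` library. The component-wise median below is the centroid under L1/Manhattan distance, and cosine similarity is assumed for ranking repos against the query; the function names and structure are illustrative rather than the repo's exact code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def repo_embedding(components: list[str]) -> np.ndarray:
    """Embed each text/code component and take the component-wise median,
    i.e. the centroid under L1/Manhattan distance."""
    embeddings = model.encode(components)  # shape: (n_components, 384)
    return np.median(embeddings, axis=0)

def top_n_repos(query: str, repo_embeddings: dict[str, np.ndarray], n: int = 5):
    """Rank repos by cosine similarity between the query and each repo embedding."""
    q = model.encode(query)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [(name, cosine(q, emb)) for name, emb in repo_embeddings.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:n]
```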

Advantages

  1. Much better results than the FastText + TF-IDF approach
  2. Good results without finetuning or training
  3. Easy to deploy
  4. Little text preprocessing required

Disadvantages

  1. Slow embedding generation on CPUs
  2. Less explainable

Further steps

  1. Adding cross-encoders for reranking of the top n repos
  2. Trying out larger models and using GPUs for inference instead of CPUs

Using FastText + TF-IDF vectorizer model

Overview

For this approach, I am training a FastText model to obtain word embeddings for each component of code and text; these give an embedding array for each word. I then run TF-IDF over the whole corpus to identify the weight of each word, and use those weights to compute the repo embedding: the word embeddings of each sentence are averaged with their TF-IDF weights, and the result is averaged over all sentences/components. Below is the mathematical representation.

$$ \text{repo embedding} = \underset{\text{sentences}}{\text{avg}}\Big(\underset{\text{words}}{\text{avg}}\big(\text{word embedding} \times \text{tf-idf weight}\big)\Big) $$
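
A short sketch of that formula using gensim's FastText and scikit-learn's TfidfVectorizer. The plain whitespace tokenization, the use of corpus-level IDF as the per-word weight, and the hyperparameters are placeholder assumptions, not the repo's exact training code.

```python
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer

def train_models(corpus: list[str]):
    """corpus: one string per text/code component across all repos."""
    tokenized = [doc.lower().split() for doc in corpus]
    ft = FastText(sentences=tokenized, vector_size=100, window=5, min_count=1)
    tfidf = TfidfVectorizer().fit(corpus)
    return ft, tfidf

def repo_embedding(components: list[str], ft: FastText, tfidf: TfidfVectorizer) -> np.ndarray:
    """Weighted average of word vectors per component, then averaged over components."""
    # Corpus-level IDF used as the per-word weight (a simplifying assumption).
    weight = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    component_vecs = []
    for doc in components:
        words = [w for w in doc.lower().split() if w in weight]
        if not words:
            continue
        weighted = [ft.wv[w] * weight[w] for w in words]
        component_vecs.append(np.mean(weighted, axis=0))
    return np.mean(component_vecs, axis=0)
```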

Advantages

  1. Faster embedding generation
  2. Much more explainable

Disadvantages

  1. Finding the best approach to aggregate word embeddings into repo embeddings
  2. The results are not promising

Challenges faced in development

Some of the major challenges were:

  1. Extracting selected code blocks from source code for multiple languages
  2. Identifying the right approach for repo embeddings
  3. Saving the huge corpus of embeddings and fetching only selected embeddings without running out of memory
  4. Using TF-IDF and FastText word embeddings together to generate repo embeddings

Setup

  1. First clone this repo:
git clone [repo-name]
  2. Maintain environment variables in the .env file:
cd [repo-name]
cd config/
cp .env.example .env
  3. Clone the parser repos into the build directory - tree-sitter-python, tree-sitter-java, tree-sitter-javascript:
cd build
git clone git@github.com:tree-sitter/tree-sitter-python.git
git clone git@github.com:tree-sitter/tree-sitter-java.git
git clone git@github.com:tree-sitter/tree-sitter-javascript.git

With Docker

  1. Build the repo image
docker build -t github-repo:v1 .
  2. Run a container for this image
docker run -d -p 80:5000 -v ~/github-user-search/logs:/app/logs --env-file ~/github-user-search/config/.env github-repo:v1
  3. Your API is up; you can hit the API and get your results.

Without Docker

  1. Create a virtual env
pip install virtualenv
python -m virtualenv venv
source venv/bin/activate
  2. Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
  3. Run the Python app
python app.py
