Hey, this is a screening project to search and screen the top N GitHub profiles based on a natural language query. At the current state of AI there are multiple ways to solve such a problem: using pretrained embeddings, training a TF-IDF model, training a BERT embeddings model, training a T5 embeddings model, and many more.
In this repo, we will look into the first two of these approaches and compare their results as well.
- Using a pretrained embeddings model
- Training a FastText + TF-IDF vectorizer model (inspired by NCS from Meta)
Using this code, I am trying to answer the questions below:
- Identification of the top N GitHub repos based on the provided input query
- Provide an explanation for each repo's relevancy
- Identify code snippets relevant to the input query from the top N repos
Tasks required to be performed:
- Use the GitHub APIs to extract repos for every user (see the extraction sketch after this task list). The following data points are extracted per user:
- Public repositories
- Files within the repo (.md, .py, .js, .java)
- Languages used in the repo
- Stars on the repo
- Language used by each file
- Parse through each code file and extract only the relevant blocks of code (see the tree-sitter sketch after this task list). The blocks extracted are:
- Function definition
- Function call
- For loops
- Generate embeddings of the textual and code components (this step differs between the two approaches)
- Aggregate the generated embeddings using an aggregation formula to get repo embeddings
- Filter by language if one is mentioned in the natural language query
- Filter by star count if provided as input
- Generate embeddings of the input query
- Get the top N repos most similar to the input query using the embeddings
- For each repo:
- Get an explanation of the repo by taking its top 3 components (text or code) most similar to the query, along with each component's details - repo name, filename and file contents (top 3)
- Get the top 2 code snippets similar to the input query using embedding similarity over the code blocks from the repo.
- Provide the output in JSON format.
- Build an API on top of this
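A minimal sketch of the extraction step referenced in the task list above, using the GitHub REST API. The endpoints are the real ones, but the helper function, token handling and field selection are illustrative assumptions rather than the exact code in this repo:

```python
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}  # token loaded from the .env file

def extract_user_repos(username):
    """Fetch a user's public repos plus the data points listed above."""
    repos = requests.get(
        f"{GITHUB_API}/users/{username}/repos",
        headers=HEADERS,
        params={"per_page": 100},  # pagination beyond 100 repos is ignored in this sketch
    ).json()
    extracted = []
    for repo in repos:
        full_name = repo["full_name"]
        # Languages used in the repo (bytes of code per language).
        languages = requests.get(f"{GITHUB_API}/repos/{full_name}/languages", headers=HEADERS).json()
        # Walk the default branch tree and keep only the file types parsed later.
        tree = requests.get(
            f"{GITHUB_API}/repos/{full_name}/git/trees/{repo['default_branch']}",
            headers=HEADERS,
            params={"recursive": 1},
        ).json()
        files = [
            node["path"]
            for node in tree.get("tree", [])
            if node["type"] == "blob" and node["path"].endswith((".md", ".py", ".js", ".java"))
        ]
        extracted.append({
            "repo": full_name,
            "stars": repo["stargazers_count"],
            "languages": languages,
            "files": files,
        })
    return extracted
```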
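And a rough sketch of the code-block extraction with tree-sitter, assuming the grammars have been built into build/my-languages.so from the parser repos cloned in the setup section below. The node type names shown are the Python grammar's; the Java and JavaScript grammars use slightly different ones (e.g. method_declaration, for_in_statement):

```python
from tree_sitter import Language, Parser

# Build once from the grammars cloned into build/ (see the setup section below).
Language.build_library(
    "build/my-languages.so",
    ["build/tree-sitter-python", "build/tree-sitter-java", "build/tree-sitter-javascript"],
)
PY_LANGUAGE = Language("build/my-languages.so", "python")

# Block types we keep: function definitions, function calls and for loops.
BLOCK_TYPES = {"function_definition", "call", "for_statement"}

def extract_blocks(source_code: str):
    """Return the source text of every function definition, function call and for loop."""
    parser = Parser()
    parser.set_language(PY_LANGUAGE)
    src = source_code.encode("utf8")
    tree = parser.parse(src)
    blocks, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in BLOCK_TYPES:
            blocks.append(src[node.start_byte:node.end_byte].decode("utf8"))
        stack.extend(node.children)
    return blocks
```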
For the first approach, I am using the SBERT pretrained model all-MiniLM-L6-v2, which has also been trained on a StackExchange dataset. To start, I embed the textual and code components extracted from the repo files and then aggregate the embeddings using the L1/Manhattan distance to get a centroid: each repo is treated as a cluster, and the repo's collective embedding is the centroid of its text and code component embeddings.
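A minimal sketch of this aggregation and the query-time retrieval with sentence-transformers. Taking the coordinate-wise median is one way to read the L1 centroid here, since it minimises the summed Manhattan distance to the component embeddings; the function names are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def repo_embedding(components):
    """components: list of text/code strings extracted from one repo."""
    vectors = model.encode(components)
    # Coordinate-wise median = the point minimising the summed L1 distance
    # to all component embeddings (the repo "centroid").
    return np.median(vectors, axis=0)

def top_n_repos(query, repo_embeddings, n=5):
    """repo_embeddings: dict mapping repo name -> aggregated repo embedding."""
    query_vec = model.encode(query)
    names = list(repo_embeddings)
    matrix = np.stack([repo_embeddings[name] for name in names])
    scores = util.cos_sim(query_vec, matrix)[0]
    ranked = sorted(zip(names, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:n]
```

The same cosine-similarity ranking can be reused within each repo to pick the top 3 components for the explanation and the top 2 code snippets.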
- Much better results than the FastText + TF-IDF approach
- Good results without finetuning or training
- Easy to deploy
- Low text preprocessing required
- Slow embedding generation on CPUs
- Less explainable
- Adding cross-encoders for reranking the top N repos
- Trying out larger models and running inference on GPUs instead of CPUs
For the second approach, I am training a FastText model on the words of each code and text component; these word embeddings give me an embedding array for each word. I then run TF-IDF over the whole corpus to identify the weight of each word in the corpus, which is used to get a weighted-average repo embedding, averaged over the weighted sums of the word embeddings of each sentence. Below is the mathematical representation of the same.
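For a repo $r$ made up of text/code sentences $S_r$, with FastText word vector $v_w$ and corpus weight $\mathrm{tfidf}(w)$ for each word $w$, my reading of the description above is:

$$
e_s = \frac{\sum_{w \in s} \mathrm{tfidf}(w)\, v_w}{\sum_{w \in s} \mathrm{tfidf}(w)},
\qquad
e_r = \frac{1}{\lvert S_r \rvert} \sum_{s \in S_r} e_s
$$

A small sketch of the same computation with gensim and scikit-learn. Using the vectorizer's IDF values as the per-word corpus weight is an assumption on my part, and the helper names are illustrative:

```python
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer

def train_models(corpus):
    """corpus: list of strings (all text and code components across repos)."""
    tokenized = [doc.split() for doc in corpus]
    ft = FastText(sentences=tokenized, vector_size=100, window=5, min_count=1, epochs=10)
    tfidf = TfidfVectorizer().fit(corpus)
    # Per-word corpus weight, here taken as the IDF value of each word (an assumption).
    weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    return ft, weights

def repo_embedding(components, ft, weights):
    """Weighted sentence embeddings, then a plain average over sentences."""
    sentence_vecs = []
    for sentence in components:
        tokens = sentence.split()
        if not tokens:
            continue
        w = np.array([weights.get(t.lower(), 1.0) for t in tokens])
        vecs = np.array([ft.wv[t] for t in tokens])
        sentence_vecs.append((w[:, None] * vecs).sum(axis=0) / w.sum())
    return np.mean(sentence_vecs, axis=0)
```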
- Faster embedding generation
- Much more explainable
- Finding the best approach to get repo embeddings
- The results are not promising
Some of the major challenges are:
- Extracting selected code blocks from source code for multiple languages
- Identifying the right approach for repo embeddings
- Saving the huge corpus of embeddings and fetching only selected embeddings without overflowing the memory
- Using TF-IDF and FastText word embeddings together to generate repo embeddings
- First, clone this repo:
git clone [repo-name]
- Maintain environment variables in the .env file.
cd [repo-name]
cd config/
cp .env.example .env
- Clone the parser repos into the build directory: tree-sitter-python, tree-sitter-java, tree-sitter-javascript
cd build
git clone git@github.com:tree-sitter/tree-sitter-python.git
git clone git@github.com:tree-sitter/tree-sitter-java.git
git clone git@github.com:tree-sitter/tree-sitter-javascript.git
- Build the repo image
docker build -t github-repo:v1 .
- Run a container for this image
docker run -d -p 80:5000 -v ~/github-user-search/logs:/app/logs --env-file ~/github-user-search/config/.env github-repo:v1
- Your API is up; you can hit the API and get your results.
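For example (the route and payload here are illustrative assumptions; check app.py for the actual endpoint):
curl -X POST "http://localhost:80/search" -H "Content-Type: application/json" -d '{"query": "python repos for image classification", "top_n": 5}'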
- Create a virtual env
pip install virtualenv
python -m virtualenv venv
source venv/bin/activate
- Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
- Run the Python app
python app.py