A system based on Retrieval-Augmented Generation (RAG) that helps users resolve their errors and problems, similar to answers on StackOverflow. The assistant uses a database built from StackOverflow/Habr answers.
The app has a web interface in the form of a chat with an AI bot.
If the user's question is related to programming, the LLM agent uses a retriever backed by StackOverflow data to answer, and the user also gets a link to the source.
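As a rough sketch of this routing (the names `is_programming_question`, `retriever`, and `llm` are hypothetical illustrations, not the project's actual API), the flow could look like:

```python
# Hypothetical chat flow: programming questions go through the
# StackOverflow retriever; everything else is answered by the LLM alone.
from dataclasses import dataclass

@dataclass
class Doc:
    text: str  # retrieved StackOverflow answer text
    url: str   # source link returned to the user

def is_programming_question(question: str) -> bool:
    # Stand-in classifier; the real agent would likely decide this itself.
    keywords = ("error", "exception", "traceback", "compile", "python")
    return any(k in question.lower() for k in keywords)

def answer(question: str, llm, retriever) -> str:
    if is_programming_question(question):
        docs: list[Doc] = retriever.search(question)
        reply = llm.generate(question, context=[d.text for d in docs])
        # The user also gets a link to the source, as described above.
        return f"{reply}\n\nSource: {docs[0].url}"
    return llm.generate(question)
```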
Before running the app, create an `.env` file in the root dir and fill it in like the `.env-sample` file.
Please use the `make` utility to run the app:
- If you want to run the app locally, you should create a venv, create the DB, and then run two processes, the server and the UI:

  ```
  make local
  ```

  This target creates the venv, initializes the database (`python -m src.rag.database_init`), starts the server (`python -m src.server`), and launches the UI (`streamlit run src/ui.py`).
- You can also simply run it using `docker-compose`:

  ```
  make compose-run
  ```
Before running the tests, create an `.env` file in the root dir and fill it in like the `.env-sample` file.
Also, `server.py` should be running during testing.
Please use the `make` utility to run the tests:

```
make tests
```
This command will create the datasets for testing. After that, the `helpfulness` and `reference` tests will be run.
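For illustration only, a sketch of what the `reference` check could look like, assuming the server exposes an HTTP chat endpoint; the endpoint path, payload shape, and response fields below are hypothetical, not the project's actual API:

```python
# Hypothetical "reference" test: a programming question should come back
# with a link to its StackOverflow source. Endpoint and fields are assumed.
import requests

SERVER_URL = "http://localhost:8000/chat"  # hypothetical endpoint

def test_answer_contains_reference():
    resp = requests.post(
        SERVER_URL,
        json={"question": "Why does Python raise KeyError on my dict?"},
        timeout=60,
    )
    resp.raise_for_status()
    answer = resp.json()["answer"]  # hypothetical response field
    assert "stackoverflow.com" in answer  # answer must cite its source
```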
Results:
The database for the RAG service is based on a dataset by Ilya Gusev: https://huggingface.co/datasets/IlyaGusev/ru_stackoverflow
This dataset represents a selection of questions, answers, and comments from the ru.stackoverflow.com platform, containing 437,604 rows. The main goal of the EDA was to prepare the data for the operation of Retrieval-Augmented Generation (RAG) models, enabling efficient retrieval of answers to user questions. Details of the EDA process are available in the `eda` branch.
After performing data cleaning and analysis, the database contains:
- Questions: 44,863
- Answers: 61,043
- Vector database: Milvus.
- Embedder/Retriever: BAAI/bge-m3.
- Reranker: BAAI/bge-reranker-v2-m3.
We use a symmetric-search technique: instead of searching for answers relevant to a query, we look for other questions similar to it in the vector database. Each question in the database is linked to its accompanying answers, which we retrieve from the metadata and return as the result of the symmetric search.
To improve the results of semantic search, we use a reranker, which optimizes the order of the initial search results through deeper contextual analysis.
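A minimal sketch of this retrieve-then-rerank flow, assuming a Milvus collection of question embeddings; the collection name, field names, and connection URI are assumptions for illustration, not the project's actual schema:

```python
# Symmetric search sketch: embed the user question, find similar *questions*
# in Milvus, rerank them, and return the answers stored in their metadata.
from FlagEmbedding import BGEM3FlagModel, FlagReranker
from pymilvus import MilvusClient

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus

def retrieve_answers(user_question: str, top_k: int = 20, top_n: int = 3):
    # 1. Dense BGE-M3 embedding of the user question.
    query_vec = embedder.encode([user_question])["dense_vecs"][0].tolist()

    # 2. Symmetric search: look for similar questions, not answers.
    hits = client.search(
        collection_name="questions",  # hypothetical collection name
        data=[query_vec],
        limit=top_k,
        output_fields=["question_text", "answers", "url"],  # hypothetical
    )[0]

    # 3. Rerank candidate questions against the query for deeper
    #    contextual relevance; keep the best top_n.
    pairs = [[user_question, h["entity"]["question_text"]] for h in hits]
    scores = reranker.compute_score(pairs)
    best = sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)[:top_n]

    # 4. Return the linked answers plus source URLs from the metadata.
    return [(h["entity"]["answers"], h["entity"]["url"]) for _, h in best]
```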
As an LLM model, you can use:
- ChatGPT