RAG Pipeline for OpenML

This repository contains the code for the RAG pipeline for OpenML.
Documentation page

Getting started

Clone the repository
Create a virtual environment and activate it (python -m venv venv && source venv/bin/activate)
Run poetry install to install the packages
If you are a developer and already have a data/ folder, skip this step.
- Run training.py (the first time, or whenever you want to update the model). This takes care of basically everything.
Install Ollama (https://ollama.com/) and download the model with ollama pull llama3
Run ./start_local.sh to start all the services, and navigate to http://localhost:8501 to access the Streamlit frontend.
- WAIT for a few minutes for the services to start. The terminal should show "sending first query to avoid cold start". Things will only work after this message.
- Do not open the uvicorn server directly, as it will not work.
- To stop the services, run ./stop_local.sh from a different terminal.
Enjoy :)

CLI access to the API

Sometimes you don't want to use the interface, or you just want to test different parts of the API without any hassle. To that end, you can test each component on its own, as described in the sections below.
Note that each %20 in the URLs stands for a space.

Ollama

This component runs the Ollama server, which is basically an optimized way to run a local LLM. It does nothing on its own; it runs as a background service so that the other components can use the LLM.
You can start it by running cd ollama && ./get_ollama.sh &

LLM Service

This component runs the LLM-based query processing module. It uses the Ollama server to run queries and process them.
You can start it by running cd llm_service && uvicorn llm_service:app --host 0.0.0.0 --port 8081 &
Curl Example : curl http://0.0.0.0:8081/llmquery/find%20me%20a%20mushroom%20dataset%20with%20less%20than%203000%20classes
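The percent-encoding in the URL above does not have to be typed by hand. Below is a minimal Python sketch that builds the same URL from a plain-text query. It assumes the LLM service is listening on port 8081 as started above; build_query_url is a hypothetical helper for illustration and is not part of this repository.

```python
from urllib.parse import quote

# Assumed base URL, copied from the curl example above.
LLM_SERVICE_URL = "http://0.0.0.0:8081/llmquery/"


def build_query_url(base_url: str, query: str) -> str:
    """Append a free-text query to base_url, encoding spaces as %20."""
    # safe="" also encodes "/" so the query stays a single path segment.
    return base_url + quote(query, safe="")


url = build_query_url(
    LLM_SERVICE_URL, "find me a mushroom dataset with less than 3000 classes"
)
print(url)
```

The printed URL can be passed to curl directly or opened with any HTTP client.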

Backend

This component runs the RAG pipeline. It returns a JSON with dataset ids of the OpenML datasets that match the query.
You can start it by running cd backend && uvicorn backend:app --host 0.0.0.0 --port 8000 &
Curl Example : curl http://0.0.0.0:8000/dataset/find%20me%20a%20mushroom%20dataset
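The same request can be made from Python. The sketch below is only an illustration: query_backend is a hypothetical helper, the base URL is copied from the curl example, and the shape of the returned JSON is not documented here, so the function simply returns whatever the backend sends after decoding it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL, copied from the curl example above.
BACKEND_URL = "http://0.0.0.0:8000/dataset/"


def query_backend(query, base_url=BACKEND_URL, opener=urlopen, timeout=60):
    """Send a free-text query to the backend and return the decoded JSON."""
    url = base_url + quote(query, safe="")
    # opener is injectable so the function can be exercised without a server.
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, query_backend("find me a mushroom dataset") should return the decoded response. The timeout of 60 seconds is an arbitrary choice for this sketch.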

Frontend

This component runs the Streamlit frontend. It is the UI that you see when you navigate to http://localhost:8501.
You can start it by running cd frontend && streamlit run ui.py &

Features

RAG

Multi-threaded downloading of OpenML datasets
Combining all information about a dataset to enable searching
Chunking of the data to prevent OOM errors
Hashed entries by document content to prevent duplicate entries to the database
RAG pipeline for OpenML datasets/flows
Specify config["training"] = True/False to switch between training and inference mode
Option to modify the prompt before querying the database (Example: HTML string replacements)
Results are returned as JSON
Easily customizable pipeline
Easily run multiple queries and aggregate the results
LLM based query processing to enable "smart" filters
Streamlit frontend for now
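Two of the features above, hashing entries by content and chunking the data, can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used in this repository: the hash function (SHA-256), the chunk size, and the helper names content_hash, deduplicate and chunked are all chosen here for the example.

```python
import hashlib
from typing import Iterable, Iterator


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(documents: Iterable[str]) -> dict[str, str]:
    """Map content hash -> document, keeping the first copy of each text."""
    unique: dict[str, str] = {}
    for doc in documents:
        # setdefault leaves an existing entry untouched, so repeats are dropped.
        unique.setdefault(content_hash(doc), doc)
    return unique


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = ["iris dataset", "mushroom dataset", "iris dataset", "wine dataset"]
unique_docs = deduplicate(docs)
for batch in chunked(list(unique_docs.values()), size=2):
    print(batch)
```

Using the hash as the key means that inserting the same document twice is a no-op, and processing the unique documents in small batches keeps only one batch in memory at a time.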

Enhancements

One config file to rule them all
GUI search interface using FastAPI with extra search bar for the results
Option for Long Context Reordering of results (https://arxiv.org/abs/2307.03172)
Option for FlashRank Reranking of results
Caching queries in a database for faster retrieval
Auto detect and use GPU/MPS if available for training
Auto retry failed queries for a set number of times (2)
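As a rough illustration of two of the enhancements above, the sketch below shows one way to retry a failing call and one common way to reorder ranked results for long contexts. Both functions are hypothetical and are not taken from this repository. In particular, the sketch assumes that "(2)" means two retries after the first attempt, and that the input to the reordering is sorted from most to least relevant.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], retries: int = 2) -> T:
    """Call func; if it raises, try again up to `retries` more times."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    last_error = None
    for _ in range(retries + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
    # Every attempt failed: re-raise the last error seen.
    raise last_error


def reorder_for_long_context(ranked: Sequence[T]) -> list[T]:
    """Move the most relevant items to both ends, the least to the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    reordered: list[T] = []
    for i, item in enumerate(reversed(ranked)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


print(reorder_for_long_context([1, 2, 3, 4, 5]))
```

The reordering follows the observation in the linked paper that models tend to use information at the start and end of a long context better than information in the middle.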

Example usage

Where do I go from here?

I am a developer and I want to contribute to the project

Please refer to the documentation for
If you have any questions, feel free to ask or post an issue.

I just want to use the pipeline

You can use the pipeline by running the Streamlit frontend. Refer to the getting started section above for more details.

