Small toy utility to help with Kaggle competition onboarding.
It is intended to simplify diving into a new competition and keeping up to date with discussions and comments.
Current functionality is limited to simple questions about the competition & discussion forum info.
Internally, it is built as a RAG (retrieval-augmented generation) tool.
Main current use cases:
- high-level competition overview
- learn about topics in the discussion forum
- leaderboard info
- text & voice-based interaction: a quick & easy way to get information without sifting through many competition pages
Demo video: `demo.mp4`
Prerequisites:

- Python 3.11
- Docker
- RAM available for Docker: ~9GB
- Create API keys and put them in the `.env` file (see `.env_example`):
  - Kaggle API key: how to create
  - Gemini API key (no credit card required; free quota of >1k `gemini-1.5-flash` requests per day): how to create
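For reference, a filled-in `.env` might look like the sketch below. The variable names here are assumptions for illustration; check `.env_example` for the names the app actually expects:

```
# Kaggle API credentials (assumed variable names; see .env_example)
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_api_key

# Gemini API key from Google AI Studio (assumed variable name; see .env_example)
GEMINI_API_KEY=your_gemini_api_key
```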
## Dockerized implementation

A Streamlit UI assistant with hybrid retrieval (via OpenSearch), plus monitoring and feedback collection (via PostgreSQL and Grafana).
- Run `make up` (or `docker compose --env-file .env up`)
- Wait until all services are ready (takes ~10 mins)
- Open localhost:8501 to view the Streamlit UI
- Open the `kaggle-assistant` dashboard from the Grafana UI at localhost:3000/dashboards to view the monitoring dashboard
  - Grafana credentials: login - `admin`, password - `admin` (keep unchanged when prompted)
- To shut down the application, run `make down` (or `docker compose down`)
## Core standalone Python library

It is also possible to run the core logic of the assistant as a plain Python library.
For retrieval, instead of hybrid retrieval via OpenSearch, a simple in-memory lexical search implementation is used.
There is no monitoring or feedback collection in this mode.
See the example in `full_example_simple.py`.
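To illustrate the in-memory lexical search idea (not the actual `index/minsearch.py` implementation, which may differ), here is a minimal TF-IDF sketch of the same concept:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class MiniLexicalIndex:
    """Tiny in-memory lexical index: TF-IDF vectors + cosine similarity."""

    def __init__(self, docs):
        self.docs = docs
        self.vectorizer = TfidfVectorizer()
        self.matrix = self.vectorizer.fit_transform(docs)

    def search(self, query, top_k=3):
        # Score every document against the query and return the best matches
        scores = cosine_similarity(self.vectorizer.transform([query]), self.matrix)[0]
        best = scores.argsort()[::-1][:top_k]
        return [(self.docs[i], float(scores[i])) for i in best]

index = MiniLexicalIndex([
    "The competition deadline is in November 2024.",
    "Submissions are scored against a hidden test set.",
    "A forum thread discusses data augmentation tricks.",
])
print(index.search("When is the submission deadline?"))
```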
The repository contains scraped data for 3 competitions (snapshots from Sep 15/30, 2024) in the `data` directory.
To scrape a new version or a new competition, use the utility function from the library:

```python
from kaggle_competition_assistant import scrape_competition_data

scrape_competition_data(competition_slug='arc-prize-2024', path='data')
```

After calling this function, the scraped data will be saved into the `data/arc-prize-2024` directory and will be ready for ingestion by the assistant.
Depending on the number of pages in the discussion forum, scraping can take some time (~1 hour for `arc-prize-2024`); scraping speed is limited by Kaggle rate limits.
Currently, the following information is scraped:
- Competition overview
- Dataset information
- Leaderboard
- Discussion forum
Key folders and files:

- `data` - scraped data + evaluation data & results
- `kaggle_competition_assistant` - core Python library that contains the assistant (scraping utilities & RAG creation)
  - `scrape` - scraping logic scripts
  - `index/minsearch.py` - simple in-memory lexical search engine
  - `index/opensearch_index.py` - implementation of various retrieval modes via OpenSearch
  - `llm.py` - LLM utilities
  - `assistant.py` - top-level API & logic (ingestion + RAG creation)
- `app` - Streamlit demo app to chat with the assistant and provide feedback
  - `app.py` - Streamlit UI & voice support
  - `assistants.py` - assistants query & relevance evaluation
  - `db.py` - logic for logging requests and responses to PostgreSQL / monitoring
  - `storage_init.py` - script for initializing the database & creating the assistants
  - `install_tts.sh` - script to install text-to-speech dependencies
- `monitoring` - monitoring dashboard
  - `dashboard.json` - dashboard configuration loaded by Grafana
  - `init_dashboard.py` - script to initialize the dashboard from `dashboard.json`
Technologies used:

- Knowledge base creation
  - web scraping: selenium, tenacity
  - API access: Kaggle API
- Retrieval
  - data preprocessing & chunking: langchain
  - lexical & semantic search: OpenSearch
  - document embeddings for semantic search: `sentence-transformers/multi-qa-mpnet-base-cos-v1`
  - reranking: custom implementation of the Reciprocal Rank Fusion (RRF) algorithm (see the sketch after this list)
  - query rewriting: generating rewritten variations of the original user question (using an LLM) to retrieve additional context from the index and potentially improve RAG results (see the second sketch after this list)
- Text generation
  - LLM models for generation: `gemini-1.5-flash-latest` / `gemini-1.5-pro-latest` (free tier via Google AI Studio)
- Voice input (speech-to-text)
- Voice output (text-to-speech)
- Interface
  - demo UI: streamlit
- Feedback collection & monitoring
  - PostgreSQL
  - Grafana
- Containerization
  - Docker
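The RRF reranker mentioned above fuses the lexical and semantic result lists. A minimal sketch of the general RRF formula (score(d) = Σ_r 1/(k + rank_r(d)), commonly with k=60) might look like this; it shows the technique, not necessarily the repo's exact implementation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with 1-based ranks; k=60 is the value suggested in the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a lexical and a semantic result list (ids are illustrative)
lexical = ["d3", "d1", "d7"]
semantic = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([lexical, semantic]))  # d1 and d3 come out on top
```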
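Query rewriting can be sketched with a single LLM call. The snippet below uses the `google-generativeai` package directly and is an assumption about the general flow, not the exact logic in `llm.py`:

```python
import google.generativeai as genai

genai.configure(api_key="...")  # Gemini API key from the .env file
model = genai.GenerativeModel("gemini-1.5-flash")

def rewrite_query(question, n=3):
    """Ask the LLM for paraphrases of the user question.

    Each variant is sent to the index separately, and the per-variant
    results can then be fused (e.g., with RRF) into one context set.
    """
    prompt = (
        f"Rewrite the following question in {n} different ways, "
        f"one rewrite per line, preserving the original meaning:\n{question}"
    )
    response = model.generate_content(prompt)
    variants = [line.strip() for line in response.text.splitlines() if line.strip()]
    return [question] + variants[:n]

print(rewrite_query("How is the private leaderboard computed?"))
```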
See potential future improvements in `TODO.md`.