VladKha/kaggle-competition-assistant

Kaggle Competition Assistant

Description

A small toy utility that helps with Kaggle competition onboarding.
It is intended to simplify diving into a new competition's information and keeping up to date with its discussions and comments.

Current functionality is limited to simple questions about the competition and its discussion forum.
Internally, it is built as a RAG (retrieval-augmented generation) tool.

Main current use cases:

  1. high-level competition overview
  2. learn about topics in the discussion forum
  3. leaderboard info
  4. text- and voice-based interaction: a quick and easy way to get information without sifting through many competition pages

Demo

demo.mp4

Setup & dependencies

  1. Python 3.11
  2. Docker
  3. RAM available for Docker: ~9GB
  4. Create API keys and put them in the .env file (see .env_example)
    • Kaggle API key: how to create
    • Gemini API key (no credit card required, free quota for >1k gemini-1.5-flash requests per day): how to create
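When the core library is used outside Docker, the keys from .env need to reach the process environment somehow. The sketch below shows one way to do that with the standard library only. The variable names in the usage note (KAGGLE_USERNAME, KAGGLE_KEY, GEMINI_API_KEY) are assumptions for illustration; the authoritative names are the ones listed in .env_example.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Blank lines, '#' comments and lines without '=' are skipped.
    Returns the parsed pairs as a dict.
    """
    parsed = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        parsed[key.strip()] = value.strip().strip("'\"")
    os.environ.update(parsed)
    return parsed


def missing_keys(required):
    """Return the names from `required` that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

A script could call load_env_file() once at startup and then check, for example, missing_keys(["KAGGLE_USERNAME", "KAGGLE_KEY", "GEMINI_API_KEY"]) to fail early with a clear message instead of hitting an authentication error later.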

Run the assistant

  1. Dockerized implementation

    Streamlit UI assistant with hybrid retrieval (via OpenSearch), monitoring and feedback collection (via PostgreSQL and Grafana).

    1. make up (or docker compose --env-file .env up)
    2. Wait until all services are ready (takes ~10 mins)
    3. Open localhost:8501 to view the Streamlit UI
    4. To view the monitoring dashboard, open the kaggle-assistant dashboard in the Grafana UI at localhost:3000/dashboards
      • Grafana credentials: login - admin, password - admin (keep unchanged when prompted)
    5. To shut down the application, run make down (or docker compose down)
  2. Core standalone Python library

    It's also possible to run the core logic of the assistant just as a Python library.

    In this mode, retrieval uses a simple in-memory lexical search instead of hybrid retrieval via OpenSearch.

    Monitoring and feedback collection are not available.

    See the example in full_example_simple.py.
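To give a sense of what a simple in-memory lexical search can look like, here is a minimal BM25-style sketch. It is not the library's actual implementation: the class name InMemoryLexicalIndex, the tokenizer, and the k1/b defaults are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InMemoryLexicalIndex:
    """Tiny BM25-style index that keeps all documents in memory (hypothetical)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = list(documents)
        self.k1 = k1
        self.b = b
        self.term_counts = [Counter(tokenize(doc)) for doc in self.documents]
        self.doc_lengths = [sum(c.values()) for c in self.term_counts]
        self.avg_length = sum(self.doc_lengths) / max(len(self.documents), 1)
        # Document frequency: in how many documents each term appears.
        self.doc_freq = Counter()
        for counts in self.term_counts:
            self.doc_freq.update(counts.keys())

    def _idf(self, term):
        n = len(self.documents)
        df = self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query, top_k=3):
        """Return up to top_k (score, document) pairs, best first."""
        query_terms = tokenize(query)
        scored = []
        for doc, counts, length in zip(
            self.documents, self.term_counts, self.doc_lengths
        ):
            score = 0.0
            for term in query_terms:
                tf = counts.get(term, 0)
                if tf == 0:
                    continue
                # Term frequency is damped and normalized by document length.
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_length)
                score += self._idf(term) * tf * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

In a RAG setup, the documents would be text chunks from the scraped competition data, and the top-scoring chunks would be passed to the LLM as context. Lexical search of this kind matches exact terms only, which is the main thing hybrid retrieval improves on.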

Data scraping

The repository contains scraped data for 3 competitions (snapshots from Sep 15/30, 2024) in the data directory.

To scrape a new version / new competition, use the utility function from the library:

from kaggle_competition_assistant import scrape_competition_data

scrape_competition_data(competition_slug='arc-prize-2024', path='data')

After calling this function, the scraped data will be saved into the data/arc-prize-2024 directory and will be ready for ingestion by the assistant.

Depending on the number of pages in the discussion forum, scraping can take some time (~1 hour for arc-prize-2024).
Scraping speed is limited by Kaggle rate limits.
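Because a scrape can run for a long time under rate limits, it may help to wrap the call in a retry helper when scripting several competitions in a row. The sketch below is a generic exponential-backoff wrapper; call_with_backoff is a hypothetical helper, not part of the library, and it assumes that a failed scrape surfaces as a raised exception, which the source does not confirm.

```python
import time


def call_with_backoff(
    func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs
):
    """Call func, retrying with exponential backoff when it raises.

    Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
    The last exception is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Broad on purpose for this sketch; narrow it to the errors
            # actually observed before using it for real.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With the library installed, a call might look like call_with_backoff(scrape_competition_data, competition_slug='arc-prize-2024', path='data', base_delay=30.0). The sleep parameter exists only so the delay can be replaced in tests.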

Currently, the following information is scraped:

  1. Competition overview
  2. Dataset information
  3. Leaderboard
  4. Discussion forum

Project structure

Key folders and files:

Technologies & techniques used

  1. Knowledge base creation
  2. Retrieval
  3. Text generation
    • LLMs for generation: gemini-1.5-flash-latest/gemini-1.5-pro-latest (free tier via Google AI Studio)
  4. Voice input (speech-to-text)
  5. Voice output (text-to-speech)
  6. Interface
  7. Feedback collection & monitoring
  8. Containerization
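As background for the hybrid retrieval used in the Dockerized setup: hybrid retrieval merges a lexical ranking and a vector-similarity ranking into a single result list. One common way to do this is reciprocal rank fusion (RRF), sketched below. This illustrates the general idea only; the source does not say how the project or its OpenSearch configuration combines scores, and the function name and k=60 default are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists in which it
    appears, with rank starting at 1. Higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical document ids returned by a lexical and a vector search.
lexical_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

Documents that appear near the top of both lists end up first, while documents found by only one retriever are still kept further down. RRF uses ranks rather than raw scores, so the two retrievers do not need comparable score scales.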

Documentation

  1. Evaluation
  2. Monitoring
  3. Voice support

TODOs

See potential future improvements in TODO.md