- Ingestion
- Add differentiation between private and public leaderboard.
- Speed up: use proxies or more advanced scraping solutions to overcome Kaggle throttling, and parallelize scraping
- Better discussion structure: preserve parent/child relationships between comments (currently ignored)
- Add kernels (code) + comments
- code via kaggle API: https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md
- comments: scraping?
- Use Meta Kaggle & Meta Kaggle Code data when possible (covers only finished, not community-based competitions)
- Filter out uninformative comments before ingestion (e.g. "Thanks", "Thank you", "I appreciate it", etc.)
- Include embedded images in all texts (currently ignored)
- Add full competition dataset, maybe as a separate regime / index / skill
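
Two of the ingestion items above (keeping the reply structure of discussions, and dropping comments such as "Thanks") could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Comment` record, its `parent_id` field, the phrase list, and the function names are hypothetical and do not reflect the current scraper output.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Comment:
    """Hypothetical flat record; the real scraped schema may differ."""
    id: int
    parent_id: Optional[int]  # None for a top-level comment
    text: str
    children: list["Comment"] = field(default_factory=list)


# Assumed phrase list, seeded from the examples in the roadmap item.
_LOW_VALUE = re.compile(
    r"^(thanks|thank you|thx|i appreciate it)( (a lot|so much|very much))?[\s.!]*$",
    re.IGNORECASE,
)


def is_uninformative(text: str) -> bool:
    """True only when the whole comment is a bare thank-you phrase."""
    return bool(_LOW_VALUE.match(text.strip()))


def build_thread(comments: list[Comment]) -> list[Comment]:
    """Drop uninformative comments, then link the rest into a reply tree.

    A kept comment whose parent was dropped (or is missing) becomes a root,
    so no informative text is lost.
    """
    kept = [c for c in comments if not is_uninformative(c.text)]
    by_id = {c.id: c for c in kept}
    roots: list[Comment] = []
    for c in kept:
        parent = by_id.get(c.parent_id) if c.parent_id is not None else None
        if parent is None:
            roots.append(c)
        else:
            parent.children.append(c)
    return roots


if __name__ == "__main__":
    flat = [
        Comment(1, None, "Try target encoding on the categorical features."),
        Comment(2, 1, "Thanks!"),
        Comment(3, 1, "Did that leak into your CV folds?"),
        Comment(4, 2, "Same here, it worked for me."),
    ]
    for root in build_thread(flat):
        print(root.id, [child.id for child in root.children])
```

Matching the entire comment against the pattern, rather than searching for a phrase inside it, keeps comments that start with "Thanks" but go on to say something useful. A real filter would likely need a longer phrase list or a length/embedding-based heuristic.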
- Retrieval
- Experiment with different embedding models, potentially:
- nomic (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
- multi-qa-MiniLM-L6-cos-v1
- Different retrieval approach: ColBERT/RAGatouille
- Multiple retrievers with retrieval routing / query analysis (see the LangChain example)
- LLM
- Add Ollama support for the LLM
- UI (optional)
- move most of the heavyweight work to the backend and make the UI more responsive
- Evaluation
- More advanced evaluation techniques
- Use existing libraries like RAGAS, Evidently, etc.
- Move evaluation from Jupyter notebooks to scripts / part of the library
- Advanced assistant functionality
- Conversation mode - long-form chat, multiple turns
- Code completion conditioned on context from the competition (e.g. all kernel notebooks, or best kernels, etc.)
- Skills & tool use
- connect with existing Kaggle API
- see also fastkaggle
- add new skills (e.g. "find similar past competitions", "do EDA of the dataset", "build a baseline model", etc.)
- use whole platform data
- all competitions
- platform docs: https://www.kaggle.com/docs/
- Tests
- Library level