Welcome to the online materials for the pre-conference LLM workshop at the International Association of Cancer Registries (IACR) 2025 conference in Izmir, Turkey (https://www.iacr2025.com/pre-conference-workshop), written by Dimitris Katsimpokis and Irene Cara. The workshop features hands-on sessions on text classification and Retrieval-Augmented Generation (RAG).
The workshop consists of two main sessions designed to provide practical experience with modern LLM techniques in healthcare and research contexts.
| Session | Notebook |
|---|---|
| Session 1: Text classification | |
| Session 2: Retrieval-Augmented Generation (RAG) | |
├── data/ # Workshop datasets
├── model/ # Model files
├── notebooks/ # Jupyter notebooks for workshop sessions
│   ├── retrieval_augmented_generation/
│   │   ├── RAG.ipynb # RAG workshop notebook
│   │   └── helper_notebooks/ # Helper notebooks (i.e., data parsing)
│   └── text_classification/
│       └── IACR_text_classification.ipynb # Text classification workshop notebook
├── requirements.txt # Python dependencies
├── LICENSE.md # MIT License
└── readme.md # This file
Click the "Open in Colab" badges above to run the notebooks in Google Colab with pre-configured environments.
Important Note: A Google account is required to run the notebooks in Google Colab.
[TO BE ADDED LATER]
Learn how to implement and fine-tune LLMs for text classification tasks using healthcare data. The session covers:
- TF-IDF vectorization for turning text into a numerical representation
- A pretrained (not fine-tuned) sentence embedding model for contextual embeddings, combined with a random forest classifier
- Few-shot fine-tuning of sentence embedding models with SetFit
- Zero-shot classification with the Hugging Face pipeline
- Evaluation of all of the above models on WHO performance status classification
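To give a feel for the first item in the list above, here is a minimal, dependency-free sketch of TF-IDF vectorization. It is not the notebook's code, which presumably relies on a library implementation. The function names `tokenize` and `tfidf_vectors`, the three example notes, and the choice of a smoothed IDF with L2 normalization are all assumptions made for illustration; the notes are invented and are not taken from the workshop dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of texts."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Hypothetical notes, invented for this sketch (not the workshop data).
notes = [
    "Patient is fully active and working.",
    "Patient is bedridden most of the day.",
    "Patient is active but tires quickly.",
]

vocab, vectors = tfidf_vectors(notes)
print(len(vocab), "vocabulary terms;", len(vectors), "vectors")
```

Words that occur in every note (such as "patient") receive a lower weight than words specific to one note (such as "bedridden"), which is what makes the resulting vectors useful as classifier input.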
Explore RAG techniques for enhancing LLM responses with external knowledge. Topics include:
- How to reduce hallucinations in LLM responses
- Database retrieval mechanisms and similarity search
- Context integration and prompt engineering
- Evaluation of RAG on Adverse Events (CTCAE) and Cancer Incidence in 5 Continents (CI5) data
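The retrieval and context-integration steps listed above can be sketched without any external dependencies. This is only an illustration under stated assumptions: a bag-of-words count vector stands in for a real embedding model, the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical, and the passages are invented placeholders rather than CTCAE or CI5 content. The workshop notebook may be structured quite differently.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': token counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, context):
    """Place the retrieved passages in the prompt ahead of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )


# Invented placeholder passages (not real CTCAE or CI5 text).
passages = [
    "The registry records the date of diagnosis for every tumour.",
    "Adverse events are graded by severity on a numeric scale.",
    "Incidence rates are reported per 100000 person-years.",
]

query = "How are adverse events graded?"
top = retrieve(query, passages, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

Instructing the model to answer only from the retrieved context, and to admit when the context does not contain the answer, is one common way to reduce hallucinations; the resulting prompt would then be passed to an LLM of your choice.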
The workshop utilizes:
- Synthetic clinical note data (data/IACR_PS_workshop.csv)
- CTCAE v4 data - Common Terminology Criteria for Adverse Events (https://evs.nci.nih.gov/ftp1/CTCAE/CTCAE_4.03/Archive/CTCAE_4.0_2009-05-29_QuickReference_8.5x11.pdf)
- Cancer Incidence in 5 Continents (CI5) v12 - cancer incidence data from around the world (https://ci5.iarc.fr/ci5-xii/)
[TO BE ADDED LATER]
This project is licensed under the MIT License - see the LICENSE.md file for details.
Developed by Dimitris Katsimpokis (d.katsimpokis@iknl.nl) and Irene Cara (i.cara@iknl.nl), both working at the Netherlands Comprehensive Cancer Organisation (Integraal Kankercentrum Nederland; IKNL).
For questions or issues during the workshop, please raise an issue in this repository or contact the workshop organizers.