KEDiT

This repository contains the code for the TACL 2025 paper Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation.

Datasets

Wikipedia

Load the Wikipedia dataset using the datasets library in Python. To process and save it in JSONL format, run:

cd data/wizard_of_wikipedia
python process_wiki.py
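
For reference, the sketch below shows what this processing step might look like; the snapshot name, output path, and kept fields are assumptions, and process_wiki.py remains authoritative:

```python
# Hypothetical sketch of the Wikipedia processing step; see process_wiki.py
# for the actual dataset version, filtering, and output fields.
import json

from datasets import load_dataset

# "20220301.en" is an assumed snapshot; the repository may use another one.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

with open("wikipedia.jsonl", "w", encoding="utf-8") as f:
    for article in wiki:
        # Keep only the fields used as knowledge context.
        record = {"title": article["title"], "text": article["text"]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```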

Wizard of Wikipedia

Download the processed Wizard of Wikipedia dataset from this link. Extract the files and organize them as follows:

data
└── wizard_of_wikipedia
    ├── test_random_split.jsonl
    ├── test_topic_split.jsonl
    ├── train.jsonl
    ├── valid_random_split.jsonl
    └── valid_topic_split.jsonl

PubMed-Dialog

The PubMed-Dialog dataset evaluates the model's ability to generate dialogues in specialized domains. It contains multi-turn dialogues generated by GPT-4o based on PubMed article abstracts. Download the processed PubMed-Dialog dataset from this link.

Construction Process:

  1. Data Collection: Selected relevant research articles from PubMed, using abstracts as the knowledge context.
  2. Dialogue Generation: Used specific prompts to guide GPT-4o in generating dialogues.
  3. Iterative Validation: Implemented a three-round evaluation to ensure quality and accuracy.

Dataset Structure:

  • 10,930 dialogues, averaging 4.36 turns per dialogue.
  • Each entry in pubmed_dialog.jsonl includes (see the reading sketch after this list):
    • conversation_id: The unique identifier
    • category: "pubmed_dialog"
    • knowledge: PubMed article abstract
    • conversation: List of dialogue turns (human and assistant)
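
Assuming the schema above, a single record can be inspected as follows (the file path and per-turn format are assumptions):

```python
# Peek at one PubMed-Dialog record (field names from the schema above;
# the exact structure of each turn is an assumption).
import json

with open("data/pubmed_dialog/pubmed_dialog.jsonl", encoding="utf-8") as f:
    example = json.loads(next(f))

print(example["conversation_id"])  # e.g. "train_00001" (ID format assumed)
print(example["category"])         # "pubmed_dialog"
print(example["knowledge"][:200])  # PubMed abstract used as grounding
for turn in example["conversation"]:
    print(turn)                    # alternating human/assistant turns
```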

Data Split:

  • 80% training, 10% validation, 10% testing.
  • Split by conversation_id prefix: "train_", "val_", "test_" (see the sketch below).
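
A minimal sketch of recovering the three splits from the ID prefixes (the repository's loading script presumably handles this internally):

```python
# Group PubMed-Dialog records into splits by their conversation_id prefix.
import json
from collections import defaultdict

splits = defaultdict(list)
with open("data/pubmed_dialog/pubmed_dialog.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # IDs are prefixed with "train_", "val_", or "test_".
        prefix = record["conversation_id"].split("_")[0]
        splits[prefix].append(record)

print({name: len(records) for name, records in splits.items()})
```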

Usage:

To load and preprocess the dataset, use pubmed_dialog.py in data/pubmed_dialog (a loading sketch follows the file tree below):

data
└── pubmed_dialog
    ├── pubmed_dialog.jsonl
    └── pubmed_dialog.py
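
If pubmed_dialog.py follows the standard Hugging Face datasets loading-script convention, the dataset can be loaded as sketched below (the split names and script behavior are assumptions; consult the script itself):

```python
# Load PubMed-Dialog via its loading script (assuming it follows the
# standard Hugging Face `datasets` loading-script convention).
from datasets import load_dataset

dataset = load_dataset("data/pubmed_dialog/pubmed_dialog.py")

# Split names are assumed; check pubmed_dialog.py for the actual ones.
print(dataset)
```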

Running the Code

Phase 1: Knowledge Compression Training

bash scripts/run_train_stage1.sh

Phase 2: Knowledge Integration Training

On Wizard of Wikipedia

bash scripts/run_train_stage2_wow.sh

On PubMed-Dialog

bash scripts/run_train_stage2_pmd.sh
