Local dataset (jsonl) for DPO training #2261

JohannesMessnerAA · 2025-01-15T14:33:57Z

Hi there!
I am trying to do a DPO training run with a local dataset in .jsonl format.
Unfortunately, I am getting the following error:

[rank0]: ValueError: Dataset at position 0 has at least one split: ['None']
[rank0]: Please pick one to interleave with the other datasets, for example: dataset['None']

I specified my dataset like this in the config:

datasets:
  - path: json
    ds_type: json
    data_files: [/path/to/dummy_dataset_simple.jsonl]

My dataset looks like this:

{"prompt":"Say something funny","chosen": "Why did the chicken cross the road?","rejected":"I don't know"}
{"prompt":"Say something sad","chosen":"My cat just died","rejected":"It's my birthday today"}
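
A quick way to rule out the file itself is to check every line before training. The sketch below assumes each record should be a JSON object with string-valued `prompt`, `chosen`, and `rejected` fields, as in the two sample lines above; `check_preference_jsonl` is a hypothetical helper, not part of axolotl.

```python
import json

# Field names taken from the sample records above.
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def check_preference_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means every record passed."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            # A field counts as bad if it is absent or not a string.
            bad = [key for key in REQUIRED_KEYS if not isinstance(record.get(key), str)]
            if bad:
                problems.append((line_number, f"missing or non-string fields: {bad}"))
    return problems
```

An empty result for the file above would suggest the data is fine and the problem lies in how the dataset is configured.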

And this is my full config:

base_model: meta-llama/Meta-Llama-3-8B-Instruct
# optionally might have model_type or tokenizer_type
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: true
load_in_4bit: false
strict: false

chat_template: llama3
rl: dpo
datasets:
- path: json
  ds_type: json
  data_files: [/path/to/dummy_dataset_simple.jsonl]

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
# optimizer: adamw_bnb_8bit
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|end_of_text|>"

Seems like I am misconfiguring something here, but what?

Would appreciate your help!
Thanks a lot!

Answered by NanoCode012

NanoCode012 · 2025-01-16T07:16:18Z

Collaborator

Hey, could you try adding a split?

datasets:
  - path:
    ds_type: json
    data_files: ["/path/to/dummy_dataset_simple.jsonl"]
    split: train

JohannesMessnerAA · 2025-01-16T08:19:34Z

Author

That seems to have worked, thanks!
