
Questions on fine-tuning process #8

Closed
CheongWoong opened this issue Mar 14, 2023 · 5 comments

@CheongWoong

CheongWoong commented Mar 14, 2023

I have three questions regarding the fine-tuning process.

  1. How does the max length hyperparameter work? Does each training sample concatenate multiple examples until it reaches the max length, or does each training sample include only a single example that is padded to the max length?
  2. Is the cross-entropy loss applied to all tokens, including the input tokens (instruction + input), to just the output (response) tokens, or to a weighted sum?
  3. How is a user prompt processed at test time? Is it treated as an example with an empty input field?

Thank you in advance.

@RogerChern

My two cents on the fine-tuning process.
The whole Alpaca dataset is about 6.6M tokens after tokenization with the LLaMA tokenizer.
If the fine-tuning is done in a manner like the StreamDataset in OpenChatKit, which concatenates all texts and chops them into 512-token blocks, then given a batch size of 128 and a block size of 512, 1 epoch = 100 iters and 3 epochs = 300 iters.
But the blog post says training took 3 hours on 8 A100s, which works out to 1 hour per 100 iters, i.e. 36 s/iter.
That training speed does not make sense to me, so my calculation must be wrong. Could you please provide more details on the fine-tuning procedure?
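
For reference, here is the back-of-the-envelope arithmetic behind that estimate. This is only a sketch; the 6.6M token count and the 3-hour / 8xA100 figure come from the comment and blog post above, not from the training code.

```python
# Rough check of the iteration count and speed quoted above (illustrative numbers only).
total_tokens = 6.6e6   # approx. Alpaca dataset size after LLaMA tokenization
block_size = 512       # tokens per block if texts were concatenated and chopped
batch_size = 128       # global batch size

iters_per_epoch = total_tokens / (block_size * batch_size)  # ~100
total_iters = 3 * iters_per_epoch                           # ~300 for 3 epochs
secs_per_iter = 3 * 3600 / total_iters                      # ~36 s/iter if training took 3 hours

print(f"{iters_per_epoch:.0f} iters/epoch, {total_iters:.0f} iters total, {secs_per_iter:.0f} s/iter")
```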

@rtaori
Contributor

rtaori commented Mar 14, 2023

Happy to answer!

  1. We have a global batch size of 128 (instantaneous batch size of 4 per device x 4 gradient accumulation steps x 8 GPUs). Each training example is truncated to a max token length of 512, then the local training batch is padded to the longest example in that batch.
  2. Cross-entropy loss is only applied to the output (response) tokens.
  3. Yes, for the demo the user prompt is treated as an example with an empty input field. Here is our exact prompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\r\n\r\n### Instruction:\r\n{instruction}\r\n\r\n### Response:"
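
A minimal sketch of how a single example could be assembled under this scheme, assuming a Hugging Face-style tokenizer. The prompt string is copied from the reply above; build_example and everything else here is illustrative, not the official Alpaca training code.

```python
# Illustrative only: builds one training example with loss applied to response tokens only.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\r\n\r\n"
    "### Instruction:\r\n{instruction}\r\n\r\n### Response:"
)
IGNORE_INDEX = -100  # positions labeled -100 are skipped by the cross-entropy loss

def build_example(tokenizer, instruction, response, max_len=512):
    prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    # assumes a Hugging Face-style tokenizer call returning {"input_ids": [...]}
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_len]                     # truncate to max length 512
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]  # loss only on response tokens
    return input_ids, labels
```

Per batch, the sequences would then be padded to the longest example in that batch, with the padded label positions also set to -100 so they do not contribute to the loss.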

@lxuechen
Collaborator

> My two cents on the fine-tuning process. The whole Alpaca dataset is about 6.6M tokens after tokenization with the LLaMA tokenizer. If the fine-tuning is done in a manner like the StreamDataset in OpenChatKit, which concatenates all texts and chops them into 512-token blocks, then given a batch size of 128 and a block size of 512, 1 epoch = 100 iters and 3 epochs = 300 iters. But the blog post says training took 3 hours on 8 A100s, which works out to 1 hour per 100 iters, i.e. 36 s/iter. That training speed does not make sense to me, so my calculation must be wrong. Could you please provide more details on the fine-tuning procedure?

Thanks for the comment!

Post-release we have retrained the model and optimized the training pipeline. So far, we have reduced the resource requirement by a factor of 2. We are working on further reducing the training cost.

@Sanster

Sanster commented Mar 15, 2023

> Happy to answer!
>
>   1. We have a global batch size of 128 (instantaneous batch size of 4 per device x 4 gradient accumulation steps x 8 GPUs). Each training example is truncated to a max token length of 512, then the local training batch is padded to the longest example in that batch.
>   2. Cross-entropy loss is only applied to the output (response) tokens.
>   3. Yes, for the demo the user prompt is treated as an example with an empty input field. Here is our exact prompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\r\n\r\n### Instruction:\r\n{instruction}\r\n\r\n### Response:"

Thank you for your explanation. I have a little confusion about the second answer: if a training example is padded, and the predicted response is longer than the training example, should we ignore the padded part when calculating the loss (by setting the pad tokens in the labels to -100)?

Response in training example:  123456 <pad><pad><pad>
Labels:                        [token_id, ..., -100, -100, -100]
Model prediction:              12
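
For what it's worth, PyTorch's cross-entropy loss already ignores positions labeled -100 by default, so pad positions set to -100 contribute nothing to the loss. A tiny self-contained check with made-up numbers (not code from this repo):

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, 6, vocab_size)                 # (batch, seq_len, vocab)
labels = torch.tensor([[3, 7, 2, -100, -100, -100]])   # last three positions are padding

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    labels.view(-1),
    ignore_index=-100,  # -100 is already the default ignore_index
)
print(loss)  # averaged over the three non-ignored positions only
```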

@abacaj

abacaj commented Mar 15, 2023

It would be helpful to know how you formatted the prompts with labels and targets for tokenization and loss calculation. The OPT-IML paper (which has a similar CLM fine-tuning objective) mentions:

> only include loss terms from the tokens in the target sequence (label-loss)

[image: excerpt from the OPT-IML paper]
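
Here is a sketch of a batch collator that applies this kind of label-loss masking, where only the target (response) tokens carry loss and prompt/pad positions are labeled -100. This is my own illustration in PyTorch, assuming examples built like the build_example sketch earlier in the thread; it is not code from this repo or from OPT-IML.

```python
import torch

IGNORE_INDEX = -100

def collate(batch, pad_token_id):
    """batch: list of (input_ids, labels) pairs whose labels already mask prompt tokens with -100."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, labels, attention_mask = [], [], []
    for ids, labs in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)        # right-pad inputs to the batch max
        labels.append(labs + [IGNORE_INDEX] * pad)          # pad positions never contribute loss
        attention_mask.append([1] * len(ids) + [0] * pad)   # mask out pad positions for attention
    return (torch.tensor(input_ids),
            torch.tensor(labels),
            torch.tensor(attention_mask))
```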
