
Questions on fine-tuning process #8

Closed
CheongWoong opened this issue Mar 14, 2023 · 5 comments

@CheongWoong

CheongWoong commented Mar 14, 2023

I have three questions regarding the fine-tuning process.

  1. How does the max length hyperparameter work? Does each training sample concatenate multiple examples until it reaches the max length, or does each training sample include only a single example that is padded to the max length?
  2. Is the cross-entropy loss applied to all tokens, including the input tokens (instruction + input), to just the output (response) tokens, or to a weighted sum?
  3. How is a user prompt processed at test time? Is it treated as an example with an empty input field?

Thank you in advance.

@RogerChern

My two cents on the fine-tuning process.
The whole Alpaca dataset is about 6.6M tokens after tokenization with the LLaMA tokenizer.
If the fine-tuning is done in a manner like the StreamDataset in OpenChatKit, which concatenates all texts and chops them into 512-token blocks, then given a batch size of 128 and a block size of 512, 1 epoch = 100 iters and 3 epochs = 300 iters.
But the blog post says training took 3 hours on 8 A100s, which works out to 1 hour per 100 iters, i.e. 36 s/iter.
That training speed does not make sense to me, so my calculation must be wrong. Could you please provide more details on the fine-tuning procedure?
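
For reference, here is the back-of-the-envelope arithmetic behind that estimate. This is only a sketch; the 6.6M token count and the 3-hour / 8xA100 figure come from the comment and blog post above, not from the training code.

```python
# Rough check of the iteration count and speed quoted above (illustrative numbers only).
total_tokens = 6.6e6   # approx. Alpaca dataset size after LLaMA tokenization
block_size = 512       # tokens per block if texts were concatenated and chopped
batch_size = 128       # global batch size

iters_per_epoch = total_tokens / (block_size * batch_size)  # ~100
total_iters = 3 * iters_per_epoch                           # ~300 for 3 epochs
secs_per_iter = 3 * 3600 / total_iters                      # ~36 s/iter if training took 3 hours

print(f"{iters_per_epoch:.0f} iters/epoch, {total_iters:.0f} iters total, {secs_per_iter:.0f} s/iter")
```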

@rtaori
Contributor

rtaori commented Mar 14, 2023

Happy to answer!

  1. We have a global batch size of 128 (instantaneous batch size of 4 per device x 4 gradient accumulation steps x 8 GPUs). Each training example is truncated to a max token length of 512, then the local training batch is padded to the longest example in that batch.
  2. Cross-entropy loss is only applied to the output (response) tokens.
  3. Yes, for the demo the user prompt is treated as an example with an empty input field. Here is our exact prompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\r\n\r\n### Instruction:\r\n{instruction}\r\n\r\n### Response:"
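
A minimal sketch of how a single example could be assembled under this scheme, assuming a Hugging Face-style tokenizer. The prompt string is copied from the reply above; build_example and everything else here is illustrative, not the official Alpaca training code.

```python
# Illustrative only: builds one training example with loss applied to response tokens only.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\r\n\r\n"
    "### Instruction:\r\n{instruction}\r\n\r\n### Response:"
)
IGNORE_INDEX = -100  # positions labeled -100 are skipped by the cross-entropy loss

def build_example(tokenizer, instruction, response, max_len=512):
    prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    # assumes a Hugging Face-style tokenizer call returning {"input_ids": [...]}
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_len]                     # truncate to max length 512
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]  # loss only on response tokens
    return input_ids, labels
```

Per batch, the sequences would then be padded to the longest example in that batch, with the padded label positions also set to -100 so they do not contribute to the loss.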

@lxuechen
Collaborator

> My two cents on the fine-tuning process. The whole Alpaca dataset is about 6.6M tokens after tokenization with the LLaMA tokenizer. If the fine-tuning is done in a manner like the StreamDataset in OpenChatKit, which concatenates all texts and chops them into 512-token blocks, then given a batch size of 128 and a block size of 512, 1 epoch = 100 iters and 3 epochs = 300 iters. But the blog post says training took 3 hours on 8 A100s, which works out to 1 hour per 100 iters, i.e. 36 s/iter. That training speed does not make sense to me, so my calculation must be wrong. Could you please provide more details on the fine-tuning procedure?

Thanks for the comment!

Post-release we have retrained the model and optimized the training pipeline. So far, we have reduced the resource requirement by a factor of 2. We are working on further reducing the training cost.

@Sanster

Sanster commented Mar 15, 2023

> Happy to answer!
>
>   1. We have a global batch size of 128 (instantaneous batch size of 4 per device x 4 gradient accumulation steps x 8 GPUs). Each training example is truncated to a max token length of 512, then the local training batch is padded to the longest example in that batch.
>   2. Cross-entropy loss is only applied to the output (response) tokens.
>   3. Yes, for the demo the user prompt is treated as an example with an empty input field. Here is our exact prompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\r\n\r\n### Instruction:\r\n{instruction}\r\n\r\n### Response:"

Thank you for your explanation. I have a little confusion about the second answer: if a training example is padded, and the predicted response is longer than the training example, should we ignore the padded part when calculating the loss (by setting the pad tokens in the labels to -100)?

Response in training example:  123456 <pad><pad><pad>
Labels:                        [token_id, ..., -100, -100, -100]
Model prediction:              12
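
For what it's worth, PyTorch's cross-entropy loss already ignores positions labeled -100 by default, so pad positions set to -100 contribute nothing to the loss. A tiny self-contained check with made-up numbers (not code from this repo):

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, 6, vocab_size)                 # (batch, seq_len, vocab)
labels = torch.tensor([[3, 7, 2, -100, -100, -100]])   # last three positions are padding

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    labels.view(-1),
    ignore_index=-100,  # -100 is already the default ignore_index
)
print(loss)  # averaged over the three non-ignored positions only
```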

@abacaj

abacaj commented Mar 15, 2023

It would be helpful to know how you formatted the prompts with labels and targets for tokenization and loss calculation. The OPT-IML paper (which has a similar CLM fine-tuning objective) mentions:

> only include loss terms from the tokens in the target sequence (label-loss)

[image: excerpt from the OPT-IML paper]
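
Here is a sketch of a batch collator that applies this kind of label-loss masking, where only the target (response) tokens carry loss and prompt/pad positions are labeled -100. This is my own illustration in PyTorch, assuming examples built like the build_example sketch earlier in the thread; it is not code from this repo or from OPT-IML.

```python
import torch

IGNORE_INDEX = -100

def collate(batch, pad_token_id):
    """batch: list of (input_ids, labels) pairs whose labels already mask prompt tokens with -100."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, labels, attention_mask = [], [], []
    for ids, labs in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)        # right-pad inputs to the batch max
        labels.append(labs + [IGNORE_INDEX] * pad)          # pad positions never contribute loss
        attention_mask.append([1] * len(ids) + [0] * pad)   # mask out pad positions for attention
    return (torch.tensor(input_ids),
            torch.tensor(labels),
            torch.tensor(attention_mask))
```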
