
How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner? #1464

Closed
fabrahman opened this issue Oct 8, 2019 · 14 comments

@fabrahman

Hi,

Can we further fine-tune a GPT-2 pretrained model in a sequence-to-sequence manner, where we want to minimize the negative log-likelihood −log p(y|x)?
In other words, our dataset has both source and target, and we want to generate the target given the source.
But I want to start from the GPT-2 weights and then tune them.

@fabrahman fabrahman changed the title from "How is it possible to further tune GPT-2 in a seq2seq manner?" to "How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner?" Oct 8, 2019
@thomwolf
Member

thomwolf commented Oct 9, 2019

Hi, this is on our mid-term roadmap (seq2seq models).

@dvaltchanov

@Hannabrahman In the original GPT-2 paper (section 3.7, Translation) the authors used the format "english sentence = french sentence" to produce translations. You can definitely fine-tune the model using this format with the existing scripts if you structure your seq2seq data this way.

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov and @thomwolf thanks for pointing that out to me.
Do you think that, for this, I need to pass another input to the forward method of the GPT LM head model (a list containing the length of each source sequence) so that I can zero out the loss calculated for the source tokens?
I mean, do I have to zero out the lm_logits associated with the source-sequence tokens so that they do not count in the loss calculation?

Or does it not matter if we include the source tokens' loss in the total loss?

@dvaltchanov

@Hannabrahman Based on my tests, it doesn't matter if you include them. Your total loss will be higher, but you're mainly interested in the validation loss on the translations anyway. As long as you use the "start of text" and "end of text" tokens to wrap your "sequence = sequence" text, the model seems to figure it out after a little fine-tuning.
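
For illustration, a minimal sketch of that wrapping (the helper name is made up; GPT-2 only ships with an "<|endoftext|>" token, so it is reused here on both ends):

# A rough sketch of the wrapping described above; the exact markers are up to you.
# GPT-2 only ships with "<|endoftext|>", so it is reused here as both wrapper tokens.
def format_example(source: str, target: str) -> str:
    return f"<|endoftext|>{source} = {target}<|endoftext|>"

print(format_example("I like coffee", "J'aime le café"))
# <|endoftext|>I like coffee = J'aime le café<|endoftext|>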

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov Thanks.
Just one question, since you have experimented with this.
I want to fine-tune GPT on a new dataset using the format you described and this script, which is for fine-tuning a pretrained model on a new dataset.

1- Should I add special tokens ([SOS], a separator token between source and target, and [EOS]) and train them like this:

# Add [SOS], [SEP] and [EOS] to the vocabulary (their embeddings need to be trained too!)
tokenizer.add_special_tokens({'bos_token': '[SOS]', 'sep_token': '[SEP]', 'eos_token': '[EOS]'})
model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

2- The instances in my dataset have different lengths (60-85 tokens). I have to either trim them to the same size (which is not really good for my use case) or pad them to the same length. However, I read somewhere in this repo that GPT and GPT-2 don't handle right padding. How did you solve this issue while fine-tuning GPT on your own use case and dataset?

Many thanks in advance.

@dvaltchanov

dvaltchanov commented Oct 9, 2019

@Hannabrahman Great questions:

  1. This is up to you. The model can learn a sequence of known tokens (e.g. "[", "E", "OS", "]") and use that as a prompt. I used such a sequence and found that it worked well enough, so I did not try adding extra tokens. There is already an "<|endoftext|>" token in the vocabulary which you can leverage.

  2. I created a custom data loader which concatenates the desired sample with randomly selected sequences from the data, up to the desired length. E.g., a training sample may be a concatenation of sample translations #1 and #32, which would look like this: "[SOS] something in English_#1 = something in French_#1 [EOS] [SOS] something in English_#32 = something in French_#32 [EOS] [SOS] .. etc"

This then gets tokenized and truncated to the max length, which allows the model to learn variable-length sequences.

You can accomplish the same effect by concatenating all of your text into a single string and sampling sections of it. However, if you do this the model will learn associations between neighbouring samples over multiple epochs, so I recommend having something that shuffles the order of concatenated samples each epoch.

During generation you prompt with "[SOS] something in English = " and stop generating when it produces an [EOS] token.
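
For illustration, a rough sketch of such a shuffling/packing dataset (class and argument names are made up, and this is not the actual loader described above; it assumes the samples are already formatted as "[SOS] source = target [EOS]" strings and that a GPT-2 tokenizer is passed in):

import random

import torch
from torch.utils.data import Dataset

class PackedTranslationDataset(Dataset):
    """Concatenates formatted "[SOS] source = target [EOS]" strings back to back,
    cuts the stream into fixed-size blocks, and reshuffles the sample order each
    epoch so the model does not learn associations between neighbouring samples."""

    def __init__(self, texts, tokenizer, block_size=128):
        self.texts = list(texts)
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.reshuffle()

    def reshuffle(self):
        # Call once per epoch to re-randomize which samples end up next to each other.
        random.shuffle(self.texts)
        ids = self.tokenizer.encode(" ".join(self.texts))
        self.blocks = [
            torch.tensor(ids[i:i + self.block_size])
            for i in range(0, len(ids) - self.block_size + 1, self.block_size)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        block = self.blocks[idx]
        return block, block  # for causal LM fine-tuning, labels are the input ids themselves

This drops the tail that does not fill a whole block; in practice you could pad it or carry it over to the next epoch.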

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov
Regarding 2, I didn't get it completely.
Where is the padding in your example batch? Also, did you mean that you concatenate instances back to back to create a single instance when #32 comes after #1, or is #32 another instance in the same batch? That being said, is the input [bs, max_seq_len]? (bs = 2 in this example)
Also, did you add a [PAD] token to the vocabulary? Because GPT and GPT-2 don't have a padding token. Or did you follow the same strategy as in question 1?

Do you have your custom data loader code somewhere so that I can take a look?

@dvaltchanov

@Hannabrahman See my edited response above. I hope my clarification helps.

@fabrahman
Author

@dvaltchanov Thanks. Basically you followed the same approach as here. They read all the input into one long string and then truncate it at max_len. However, it doesn't do any sampling or shuffling.
My data is stories, and each story is around 60-80 tokens. I read all the stories into one long string and truncate each section to 128 tokens. The problem is that sometimes the beginning of a story ends up in the previous sample's section and the rest goes into the next section.

@stale

stale bot commented Dec 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 9, 2019
@stale stale bot closed this as completed Dec 16, 2019
@sachinruk
Contributor

Hi, is there a seq2seq example of GPT2 now?

@thesofakillers

thesofakillers commented Nov 14, 2022

Hi, any updates?

@larrylawl

larrylawl commented Mar 17, 2023

Hi everyone,

Given that Alpaca (a decoder-only model like GPT) was trained in a seq2seq manner, I realised we can learn from their code (cheers to open source!).

Approach

The naive solution is to concatenate the source and target strings. However, the main issue is that loss is then also incurred on the next-word prediction of the source strings.

To circumvent this, Alpaca simply ignores the loss on the source strings. Concretely:

import copy
from typing import Dict, Sequence

import transformers

IGNORE_INDEX = -100  # label value that the loss function skips

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]  # concatenate source and target strings
    # _tokenize_fn is Alpaca's tokenization helper; it returns the token ids and the length of each string
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX  # the source string's loss is ignored with IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)

Note how the source string's loss is ignored with IGNORE_INDEX
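
For reference, IGNORE_INDEX is -100 in the Alpaca code, which is the default ignore_index of PyTorch's CrossEntropyLoss, so the Hugging Face LM head skips those positions when computing the loss. A minimal single-example sketch of the same idea with GPT-2 (the strings here are made up for illustration):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

source = "Translate to French: the cat sat on the mat\n"
target = "le chat s'est assis sur le tapis<|endoftext|>"

source_ids = tokenizer(source, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

input_ids = torch.cat([source_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : source_ids.shape[1]] = IGNORE_INDEX  # no loss on the source tokens

loss = model(input_ids=input_ids, labels=labels).loss  # loss is computed over the target only
loss.backward()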

Implications

Seq2Seq prompting.

In concatenating the source and target strings, it may not be obvious to the model how to differentiate the source from the target. I suspect that Alpaca/self-instruct circumvents this by making the differentiation clear via prompts:

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

Notice how ### Instruction: tells the model where the source string is, while ### Response: tells the model where the target string is.

Increased GPU memory usage. To my understanding, the input and labels will now both be the concatenated source and target strings. In contrast, for seq2seq models the input is only the source strings while the labels are only the target strings. Thus this neat trick incurs additional GPU memory.

Packing is more intuitive with a causal LM. Packing is the act of packing training examples together to avoid padding. With a causal LM, we can pack via

(source->target)[IGNORE_INDEX](source->target)[IGNORE_INDEX]...(source->target)[IGNORE_INDEX]

Notice how the target string immediately comes after its source. In contrast, packing for a seq2seq LM looks like

Input: (source)[IGNORE_INDEX](source)[IGNORE_INDEX]...(source)[IGNORE_INDEX]
Target: (target)[IGNORE_INDEX](target)[IGNORE_INDEX]...(target)[IGNORE_INDEX]

To me, it's not intuitive that the model can match the i-th target to the i-th source string.
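
For illustration, a tiny sketch of how a packed causal-LM example and its labels could look under the scheme above (the token ids are arbitrary; only the masking pattern matters):

# Illustrative only: tiny made-up token ids.
IGNORE_INDEX = -100
SEP = 50256  # e.g. GPT-2's <|endoftext|> id, used here as the separator

# Two (source, target) examples packed into one causal-LM training sequence.
src1, tgt1 = [11, 12, 13], [21, 22]
src2, tgt2 = [14, 15], [23, 24, 25]

input_ids = src1 + tgt1 + [SEP] + src2 + tgt2 + [SEP]
labels = (
    [IGNORE_INDEX] * len(src1) + tgt1 + [SEP]
    + [IGNORE_INDEX] * len(src2) + tgt2 + [SEP]
)
# Each target (plus its separator) is predicted right after its own source,
# so the pairing stays local within the packed sequence.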

Credits

Cheers to Alpaca, LLaMA, and open source for finally solving this engineering puzzle for me! Do LMK if any parts don't make sense to you - I'm still learning myself.

@seungjun-green

I created training examples by concatenating inputs and targets like this: 'Document:{document}\nSummary:{Summary}'
and trained a text summarization model with it. But the problem is that the model starts generating from Document, not from Summary. Is there any way to handle this problem?
