
How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner? #1464

Closed
fabrahman opened this issue Oct 8, 2019 · 14 comments

@fabrahman

Hi,

Can we further fine-tune a GPT-2 pretrained model in a sequence-to-sequence manner, where we want to minimize the negative log-likelihood −log p(y|x)?
In other words, our dataset has both source and target, and we want to generate the target given the source.
But I want to start from the GPT-2 weights and then tune them.

@fabrahman fabrahman changed the title from "How is it possible to further tune GPT-2 in a seq2seq manner?" to "How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner?" Oct 8, 2019
@thomwolf
Member

thomwolf commented Oct 9, 2019

Hi, this is on our mid-term roadmap (seq2seq models).

@dvaltchanov

@Hannabrahman In the original GPT-2 paper (section 3.7, Translation) the authors used the format "english sentence = french sentence" to produce translations. You can definitely fine-tune the model using this format with the existing scripts if you structure your seq2seq data this way.

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov and @thomwolf thanks for pointing that out to me.
Do you think that, for this, I need to pass another input to the forward method of the GPT LM head model (a list containing the length of each source sequence) so that I can zero out the loss calculated for the source tokens?
I mean, do I have to zero out the lm_logits associated with the source-sequence tokens so that they do not count in the loss calculation?

Or does it not matter if we include the source tokens' loss in the total loss?

@dvaltchanov

@Hannabrahman Based on my tests, it doesn't matter if you include them. Your total loss will be higher, but you're mainly interested in the validation loss on the translations anyway. As long as you use the "start of text" and "end of text" tokens to wrap your "sequence = sequence" text, the model seems to figure it out after a little fine-tuning.
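
For illustration, a minimal sketch of that wrapping (the helper name is made up; GPT-2 only ships with an "<|endoftext|>" token, so it is reused here on both ends):

# A rough sketch of the wrapping described above; the exact markers are up to you.
# GPT-2 only ships with "<|endoftext|>", so it is reused here as both wrapper tokens.
def format_example(source: str, target: str) -> str:
    return f"<|endoftext|>{source} = {target}<|endoftext|>"

print(format_example("I like coffee", "J'aime le café"))
# <|endoftext|>I like coffee = J'aime le café<|endoftext|>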

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov Thanks.
Just one question, since you have experimented with this.
I want to fine-tune GPT on a new dataset using the format you described and this script, which is for fine-tuning a pretrained model on a new dataset.

1- Should I add special tokens ([SOS], a separator token between source and target, and [EOS]) and train them like this:

# Add [SOS], [SEP] and [EOS] to the vocabulary (their embeddings need to be trained too!)
tokenizer.add_special_tokens({'bos_token': '[SOS]', 'sep_token': '[SEP]', 'eos_token': '[EOS]'})
model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

2- The instances in my dataset have different lengths (60-85 tokens). I have to either trim them to the same size (which is not really good for my use case) or pad them to the same length. However, I read somewhere in this repo that GPT and GPT-2 don't handle right padding. How did you solve this issue while fine-tuning GPT on your own use case and dataset?

Many thanks in advance.

@dvaltchanov

dvaltchanov commented Oct 9, 2019

@Hannabrahman Great questions:

  1. This is up to you. The model can learn a sequence of known tokens (e.g. "[", "E", "OS", "]") and use that as a prompt. I used such a sequence and found that it worked well enough, so I did not try adding extra tokens. There is already an "<|endoftext|>" token in the vocabulary which you can leverage.

  2. I created a custom data loader which concatenates the desired sample with randomly selected sequences from the data, up to the desired length. E.g., a training sample may be a concatenation of sample translations #1 and #32, which would look like this: "[SOS] something in English_#1 = something in French_#1 [EOS] [SOS] something in English_#32 = something in French_#32 [EOS] [SOS] .. etc"

This then gets tokenized and truncated to the max length, which allows the model to learn variable-length sequences.

You can accomplish the same effect by concatenating all of your text into a single string and sampling sections of it. However, if you do this the model will learn associations between neighbouring samples over multiple epochs, so I recommend having something that shuffles the order of concatenated samples each epoch.

During generation you prompt with "[SOS] something in English = " and stop generating when it produces an [EOS] token.
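
For illustration, a rough sketch of such a shuffling/packing dataset (class and argument names are made up, and this is not the actual loader described above; it assumes the samples are already formatted as "[SOS] source = target [EOS]" strings and that a GPT-2 tokenizer is passed in):

import random

import torch
from torch.utils.data import Dataset

class PackedTranslationDataset(Dataset):
    """Concatenates formatted "[SOS] source = target [EOS]" strings back to back,
    cuts the stream into fixed-size blocks, and reshuffles the sample order each
    epoch so the model does not learn associations between neighbouring samples."""

    def __init__(self, texts, tokenizer, block_size=128):
        self.texts = list(texts)
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.reshuffle()

    def reshuffle(self):
        # Call once per epoch to re-randomize which samples end up next to each other.
        random.shuffle(self.texts)
        ids = self.tokenizer.encode(" ".join(self.texts))
        self.blocks = [
            torch.tensor(ids[i:i + self.block_size])
            for i in range(0, len(ids) - self.block_size + 1, self.block_size)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        block = self.blocks[idx]
        return block, block  # for causal LM fine-tuning, labels are the input ids themselves

This drops the tail that does not fill a whole block; in practice you could pad it or carry it over to the next epoch.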

@fabrahman
Author

fabrahman commented Oct 9, 2019

@dvaltchanov
Regarding 2, I didn't get it completely.
Where is the padding in your example batch? Also, did you mean that you concatenate instances back to back to create a single instance when #32 comes after #1, or is #32 another instance in the same batch? That being said, is the input [bs, max_seq_len]? (bs = 2 in this example)
Also, did you add a [PAD] token to the vocabulary? Because GPT and GPT-2 don't have a padding token. Or did you follow the same strategy as in question 1?

Do you have your custom data loader code somewhere so that I can take a look?

@dvaltchanov

@Hannabrahman See my edited response above. I hope my clarification helps.

@fabrahman
Author

@dvaltchanov Thanks. Basically you followed the same approach as here. They read all the input into one long string and then truncate it at max_len. However, it doesn't do any sampling or shuffling.
My data is stories, and each story is around 60-80 tokens. I read all the stories into one long string and truncate each section to 128 tokens. The problem is that sometimes the beginning of a story ends up in the previous sample's section and the rest goes into the next section.

@stale

stale bot commented Dec 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 9, 2019
@stale stale bot closed this as completed Dec 16, 2019
@sachinruk
Contributor

Hi, is there a seq2seq example of GPT2 now?

@thesofakillers

thesofakillers commented Nov 14, 2022

Hi, any updates?

@larrylawl

larrylawl commented Mar 17, 2023

Hi everyone,

Given that Alpaca (a decoder-only model like GPT) was trained in a seq2seq manner, I realised we can learn from their code (cheers to open source!).

Approach

The naive solution is to concatenate the source and target strings. However, the main issue is that loss is then also incurred on the next-word prediction of the source strings.

To circumvent this, Alpaca simply ignores the loss on the source strings. Concretely:

import copy
from typing import Dict, Sequence

import transformers

IGNORE_INDEX = -100  # label value that the loss function skips

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]  # concatenate source and target strings
    # _tokenize_fn is Alpaca's tokenization helper; it returns the token ids and the length of each string
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX  # the source string's loss is ignored with IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)

Note how the source string's loss is ignored with IGNORE_INDEX
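
For reference, IGNORE_INDEX is -100 in the Alpaca code, which is the default ignore_index of PyTorch's CrossEntropyLoss, so the Hugging Face LM head skips those positions when computing the loss. A minimal single-example sketch of the same idea with GPT-2 (the strings here are made up for illustration):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

source = "Translate to French: the cat sat on the mat\n"
target = "le chat s'est assis sur le tapis<|endoftext|>"

source_ids = tokenizer(source, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

input_ids = torch.cat([source_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : source_ids.shape[1]] = IGNORE_INDEX  # no loss on the source tokens

loss = model(input_ids=input_ids, labels=labels).loss  # loss is computed over the target only
loss.backward()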

Implications

Seq2Seq prompting.

In concatenating the source and target strings, it may not be obvious to the model how to differentiate the source from the target. I suspect that Alpaca/self-instruct circumvents this by making the differentiation clear via prompts:

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

Notice how ### Instruction: tells the model where the source string is, while ### Response: tells the model where the target string is.

Increased GPU memory usage. To my understanding, the input and labels will now both be the concatenated source and target strings. In contrast, for seq2seq models the input is only the source strings while the labels are only the target strings. Thus this neat trick incurs additional GPU memory.

Packing is more intuitive with a causal LM. Packing is the act of packing training examples together to avoid padding. With a causal LM, we can pack via

(source->target)[IGNORE_INDEX](source->target)[IGNORE_INDEX]...(source->target)[IGNORE_INDEX]

Notice how the target string immediately comes after its source. In contrast, packing for a seq2seq LM looks like

Input: (source)[IGNORE_INDEX](source)[IGNORE_INDEX]...(source)[IGNORE_INDEX]
Target: (target)[IGNORE_INDEX](target)[IGNORE_INDEX]...(target)[IGNORE_INDEX]

To me, it's not intuitive that the model can match the i-th target to the i-th source string.
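
For illustration, a tiny sketch of how a packed causal-LM example and its labels could look under the scheme above (the token ids are arbitrary; only the masking pattern matters):

# Illustrative only: tiny made-up token ids.
IGNORE_INDEX = -100
SEP = 50256  # e.g. GPT-2's <|endoftext|> id, used here as the separator

# Two (source, target) examples packed into one causal-LM training sequence.
src1, tgt1 = [11, 12, 13], [21, 22]
src2, tgt2 = [14, 15], [23, 24, 25]

input_ids = src1 + tgt1 + [SEP] + src2 + tgt2 + [SEP]
labels = (
    [IGNORE_INDEX] * len(src1) + tgt1 + [SEP]
    + [IGNORE_INDEX] * len(src2) + tgt2 + [SEP]
)
# Each target (plus its separator) is predicted right after its own source,
# so the pairing stays local within the packed sequence.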

Credits

Cheers to Alpaca, LLaMA, and open source for finally solving this engineering puzzle for me! Do LMK if any parts don't make sense to you - I'm still learning myself.

@seungjun-green

I created training examples by concatenating inputs and targets like this: 'Document:{document}\nSummary:{Summary}'
and trained a text summarization model with it. But the problem is that the model starts generating from Document, not from Summary. Is there any way to handle this problem?
