Why is last token dropped in loglikelihood computation? Gives different result than when calculating loss. #942
Comments
Hi! I'll add a further note on this to the comment and the documentation, as this is a frequent question. The reason behind chopping off the last completion token is that autoregressive LLMs take in tokens up to position i and emit, at each position, the logits predicting the token at position i+1. When we're feeding in the concatenated context and continuation, what we want is the logits predicting each continuation token, and the logits that predict the final continuation token come from the position right before it, so that final token never needs to be passed as input.
Leaving this open until I update the documentation. If this doesn't make sense, I'm happy to clarify further!
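To make the alignment concrete, here is a minimal sketch (not the harness's code; it borrows the tiny model and strings from the snippet in the comment below) of scoring a continuation from the shifted logits:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustration only: any causal LM behaves the same way.
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")

ctx_ids = tokenizer("we are the", return_tensors="pt").input_ids
cont_ids = tokenizer(" koala bears of the world", return_tensors="pt").input_ids
full_ids = torch.cat((ctx_ids, cont_ids), dim=1)

with torch.no_grad():
    # Feed everything except the last token: the logits at position i
    # predict the token at position i + 1, so the final token is never
    # needed as an input, only as a prediction target.
    logits = model(full_ids[:, :-1]).logits

log_probs = F.log_softmax(logits, dim=-1)
# The last cont_len positions are the ones predicting the continuation tokens.
cont_len = cont_ids.size(1)
cont_log_probs = log_probs[:, -cont_len:, :]
# Log-probability of each actual continuation token, summed.
token_ll = cont_log_probs.gather(2, cont_ids.unsqueeze(-1)).squeeze(-1)
print(token_ll.sum().item())
```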
For multi-token continuations, do we only drop the last token? If the input is 0 1 2 3 and the continuation is 4 5 6, do we condition on 0 1 2 3 4 5?
Thank you very much, @haileyschoelkopf, for the swift reply! I don't know why I thought that calling cross-entropy loss on the logits would magically handle this for me; this shifting is of course also implemented in decoder model losses. For completeness, I have updated my little code snippet so that it gives the same result:
import torch
import torch.nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from lm_eval.models.huggingface import AutoCausalLM
lm_key = "sshleifer/tiny-gpt2"
context = "we are the"
cont = " koala bears of the world"
model = AutoModelForCausalLM.from_pretrained(lm_key)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(lm_key)
encodings = tokenizer(context, text_target=cont, return_tensors="pt")
input_ids = torch.cat((encodings.input_ids, encodings.labels), dim=1)
target_ids = input_ids.clone()
# Makes context ignored by loss function
target_ids[:, : encodings.input_ids.size(1)] = -100
with torch.no_grad():
    logits = model(input_ids).logits
# Move vocab dimension last as we do classification over these
logits = logits.permute(0, 2, 1)
# Task: Next-token-prediction => shift tokens
target_ids = target_ids[:, 1:]
logits = logits[:, :, :-1]
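# Targets set to -100 are skipped (CrossEntropyLoss's default ignore_index),
# so only the continuation tokens contribute to the sum below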
losses = torch.nn.CrossEntropyLoss(reduction="none")(logits, target_ids)
print(-losses.sum().item())
# Result: -65.07633972167969
lm_eval_model = AutoCausalLM(lm_key, device="cpu")
print(lm_eval_model.loglikelihood([(context, cont)])[0][0])
# Result: -65.07632446289062
# Same results - yay!
Glad this is helpful!!
@sasaadi yes, we would feed 0 1 2 3 4 5 to the model. So if the continuation is 4 5 6, the logits returned at the positions where 3, 4, and 5 were fed in are the ones predicting tokens 4, 5, and 6.
And the loglikelihood of the completion is the loglikelihood of producing all 3 completion tokens in turn, starting from the context 0 1 2 3.
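As a concrete check of that index alignment, a toy sketch using the made-up token IDs from the question above (no model or real vocabulary involved):

```python
context = [0, 1, 2, 3]
continuation = [4, 5, 6]
model_input = (context + continuation)[:-1]   # 0 1 2 3 4 5  <- last token dropped

# The logit produced after reading model_input[i] predicts token i + 1 of the
# full sequence, so the last len(continuation) logits predict 4, 5, 6.
full = context + continuation
for i in range(len(model_input) - len(continuation), len(model_input)):
    print(f"logit at position {i} (after reading token {model_input[i]}) "
          f"predicts token {full[i + 1]}")
# logit at position 3 (after reading token 3) predicts token 4
# logit at position 4 (after reading token 4) predicts token 5
# logit at position 5 (after reading token 5) predicts token 6
```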
Question
In lm-evaluation-harness/lm_eval/base.py, line 342 (commit 3ccea2b), and in the refactor at lm-evaluation-harness/lm_eval/models/huggingface.py, line 708 (commit 408115e), the last input token is dropped before the model call.
This is motivated by a diagram in the code comments next to that line, but I must admit that I do not understand why: does anyone have some pointers as to why dropping this token still yields correct probabilities? Surely the value of the last token matters for the overall likelihood.
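For reference, my simplified understanding of what that code path computes (a sketch with my own function and variable names, not the actual harness implementation; no batching, truncation, or caching):

```python
import torch
import torch.nn.functional as F

def loglikelihood_sketch(model, context_enc, continuation_enc):
    # Concatenate context and continuation, then drop the final token:
    # its logits would only predict a token *after* the continuation,
    # which is never scored.
    inp = torch.tensor([(context_enc + continuation_enc)[:-1]])
    with torch.no_grad():
        log_probs = F.log_softmax(model(inp).logits, dim=-1)
    # Keep only the positions whose logits predict the continuation tokens.
    cont_len = len(continuation_enc)
    log_probs = log_probs[:, -cont_len:, :]
    cont_toks = torch.tensor([continuation_enc])
    # Sum the log-probabilities of the actual continuation tokens.
    return log_probs.gather(2, cont_toks.unsqueeze(-1)).sum().item()
```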
Minimal Example
The computation below shows that I can reproduce the result of _loglikelihood_tokens only if I remove the [:-1]; otherwise there is a difference from the last token.
Similar issues
A similar question was asked in #337 where @jon-tow, who asked the question, closed with the message