C4 PPL Evaluation Script #18

Thanks for sharing the nice work. May I know what script you used for evaluating C4 perplexity? I noticed that the FP16 C4 PPL in your paper is different from the numbers reported in OmniQuant.

Comments
Thanks for the kind words and the question! Adding @radi-cho to the conversation, who can better answer this question.

Hi, thanks a lot for your reply! Just want to follow up on my issue. Looking forward to your reply!

One more gentle reminder @radi-cho (who's been busy with end-of-semester things lately).
Hi, the perplexity can be calculated as described here, https://huggingface.co/docs/transformers/en/perplexity, with sequence length 2048. The key with C4 is that different papers use different subsets of examples to calculate it. I found this to be the most commonly used:

```python
import random

import torch
from datasets import load_dataset

test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []
for _ in range(256):
    # Resample until we hit a validation example longer than 2048 tokens.
    while True:
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    # Take a random 2048-token window from that example.
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])
test_encodings = torch.hstack(test_encodings)
```

We loaded the subset as above for the FLUTE paper.
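For reference, a minimal setup that the snippet above assumes might look like the sketch below; the checkpoint name is only illustrative (the thread is about Llama-3-8B), not necessarily the exact loading code used for the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint; substitute the model you actually want to evaluate.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()
```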
Thank you very much for the kind reply. I think the subset sampling function you used is the same as the one in OmniQuant. I am using the following code, extracted from the OmniQuant codebase, to calculate the C4 perplexity:

```python
import random

import torch
import torch.nn as nn
import tqdm
from datasets import load_dataset

random.seed(0)
valenc = []
testenc = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

# Sample 256 random 2048-token chunks from the C4 validation shard.
for _ in range(256):
    while True:
        i = random.randint(0, len(testenc) - 1)
        tmp = tokenizer(testenc[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    valenc.append(tmp.input_ids[:, i:j])
testenc = torch.hstack(valenc)

# Per-chunk shifted cross-entropy, aggregated into perplexity.
nsamples = testenc.numel() // 2048
loss_fct = nn.CrossEntropyLoss()
nlls = []
with tqdm.tqdm(range(nsamples)) as progress:
    for i in progress:
        batch = testenc[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
        with torch.no_grad():
            lm_logits = model(batch, use_cache=False, output_hidden_states=False, output_attentions=False)[0]
        shift_logits = lm_logits[:, :-1, :].contiguous().float()
        shift_labels = testenc[:, (i * 2048) : ((i + 1) * 2048)][:, 1:].to(model.device)
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1),
        )
        neg_log_likelihood = loss.float() * 2048
        nlls.append(neg_log_likelihood.item())
        progress.set_description("Evaluating")

ppl = torch.exp(torch.tensor(nlls).sum() / (nsamples * 2048))
print(f'c4 perplexity: {ppl} \n')
```

The method to calculate perplexity follows the one in the HuggingFace link you shared. The above code gives me a C4 perplexity of 8.88 for Llama-3-8B, which is the same as the number reported in OmniQuant. However, the C4 perplexity reported in Table I of the FLUTE paper is 9.2. I am not sure whether my evaluation script differs from yours. I hope you could help me clarify, thanks a lot!
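For what it's worth, since both sampling loops seed `random` with 0 and draw indices in the same order, they should select identical chunks; a quick check (a sketch, assuming both snippets above were run in the same session with the same tokenizer):

```python
import torch

# `test_encodings` comes from your snippet and `testenc` from mine; with the same
# seed, tokenizer, and C4 shard, the two sampled subsets should be identical.
assert torch.equal(test_encodings, testenc)
```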
Ah, I see, you are talking about the unquantized results. While I cannot immediately confirm whether your script is consistent with ours, I remember that for unquantized models we relied on and cited https://arxiv.org/pdf/2404.14047 as a source for comparison.
Thanks for sharing the reference. I understand that the FP16 PPL could be directly sourced as 9.2 for Llama-3-8B. However, I think the 4-bit and 3-bit PPL reported in the FLUTE paper are also calculated with the method that gives the 9.2 FP16 PPL, right? The referenced paper also doesn't provide its C4 PPL evaluation script. I wonder if you could share the script you used to calculate the 4-bit and 3-bit C4 PPL for FLUTE, so that I can use it to check whether my quantization method achieves a better PPL. Looking forward to your reply!
Thank you for your question and for your patience, especially as the end of the semester/year became quite hectic. We referenced the C4 numbers from [1] because our own calculations were initially in line with their findings. After your question, we revisited the literature and examined more recent papers [2, 3]. It seems there are discrepancies in the reported C4 numbers for what appears to be the same model. We suspect these variations may be due to differences in the data subsets used by various implementations. Notably, [2] reported the 8.88 PPL number you mentioned.

With this in mind, we re-examined our (unquantized) baseline and found that our implementation seems to match [2] more closely. This led us to recognize that the C4 numbers we originally cited are inconsistent with both your calculations and our own. Below, we have attached the scripts we used to compute the PPL values. For reference, these scripts produced a PPL of 8.8844 for the unquantized model, which aligns closely with your results, and they should give consistent comparisons with our NF and custom AWQ results. We plan to update the borrowed C4 numbers in our paper to ensure consistency. Thank you again for drawing our attention to this and for helping us improve our work!

[1] https://arxiv.org/abs/2404.14047

```python
import random

import torch
from datasets import load_dataset

# Build the evaluation subset: 256 random 2048-token chunks from the C4 validation shard.
test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []
for _ in range(256):
    while True:
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])
test_encodings = torch.hstack(test_encodings)
```

```python
import torch
from tqdm import tqdm

# Sliding-window perplexity over the subset, following
# https://huggingface.co/docs/transformers/en/perplexity
# (stride == max_length, i.e. non-overlapping windows).
device = "cuda"
max_length = 2048
stride = max_length
seq_len = test_encodings.size(1)

with torch.no_grad():
    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        input_ids = test_encodings[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss

        nlls.append(neg_log_likelihood)
        print(torch.exp(torch.stack(nlls).mean()))

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

ppl = torch.exp(torch.stack(nlls).mean())
```
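As a sanity check, the loss from the `labels=` path above should agree, up to floating-point precision, with the manual shifted cross-entropy in your OmniQuant-style script. A minimal sketch, assuming `model` and one full 1 x 2048 chunk `batch` (already on the model's device) are available:

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    out = model(batch, labels=batch)  # HF shifts the labels internally
    logits = out.logits[:, :-1, :].contiguous().float()
    targets = batch[:, 1:].contiguous()
    manual = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# Both are the mean cross-entropy over the same 2047 predicted tokens,
# so exp() of either gives the same per-chunk perplexity.
print(out.loss.item(), manual.item())
```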