C4 PPL Evaluation Script #18

Open
yc2367 opened this issue Dec 10, 2024 · 8 comments

@yc2367 commented Dec 10, 2024:

Thanks for sharing the nice work. May I know what script you used for evaluating C4 perplexity? I noticed that the FP16 C4 PPL in your paper is different from the numbers reported in OmniQuant.

@HanGuo97 (Owner) commented:

Thanks for the kind words and the question!

Adding @radi-cho, who can better answer this question, to the conversation.

@yc2367 (Author) commented Dec 14, 2024:

Hi, thanks a lot for your reply! Just wanted to follow up on this issue. Looking forward to hearing from you!

@HanGuo97 (Owner) commented:

One more gentle reminder for @radi-cho (who has been busy with end-of-semester things lately).

@radi-cho (Contributor) commented:

Hi, the perplexity can be calculated as described here, https://huggingface.co/docs/transformers/en/perplexity, with a sequence length of 2048. The key with C4 is that different papers use different subsets of examples to calculate it. I found the following to be the most commonly used:

import random

import torch
from datasets import load_dataset

# `tokenizer` is assumed to be an already-loaded Hugging Face tokenizer for the model under evaluation.
test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []

# Sample 256 random 2048-token windows from the C4 validation shard.
for _ in range(256):
    while True:
        # Keep drawing documents until one tokenizes to more than 2048 tokens.
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    # Take a random 2048-token slice of the selected document.
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])

# Concatenate along the sequence dimension: shape (1, 256 * 2048).
test_encodings = torch.hstack(test_encodings)

We loaded the subset as above for the FLUTE paper.
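For completeness, here is a minimal sketch of how those windows can then be scored in the fixed-length setting from the HuggingFace guide; it assumes `model` is an already-loaded causal LM and is only illustrative, not necessarily the exact script we ran:

import torch

max_length = 2048
nlls = []
with torch.no_grad():
    # Score each non-overlapping 2048-token window.
    for begin in range(0, test_encodings.size(1), max_length):
        input_ids = test_encodings[:, begin:begin + max_length].to(model.device)
        # With labels == input_ids, HF causal LMs return the mean
        # next-token cross-entropy over the window.
        loss = model(input_ids, labels=input_ids).loss
        nlls.append(loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"C4 perplexity: {ppl.item():.4f}")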

@yc2367 (Author) commented Dec 15, 2024:

Thank you very much for the kind reply. I think the subset sampling function that you used is the same as the one in OmniQuant. I am using the following code extracted from the OmniQuant codebase to calculate the C4 perplexity:

import random

import torch
import torch.nn as nn
import tqdm
from datasets import load_dataset

# `model` and `tokenizer` are assumed to be already loaded.
random.seed(0)
valenc = []
testenc = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)
# Sample 256 random 2048-token windows (same sampling as OmniQuant).
for _ in range(256):
    while True:
        i = random.randint(0, len(testenc) - 1)
        tmp = tokenizer(testenc[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    valenc.append(tmp.input_ids[:, i:j])

testenc = torch.hstack(valenc)

# Evaluate perplexity over the 256 non-overlapping 2048-token windows.
nsamples = testenc.numel() // 2048
loss_fct = nn.CrossEntropyLoss()
nlls = []
with tqdm.tqdm(range(nsamples)) as progress:
    for i in progress:
        batch = testenc[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
        with torch.no_grad():
            lm_logits = model(batch, use_cache=False, output_hidden_states=False, output_attentions=False)[0]
        # Shift logits and labels by one position for next-token prediction.
        shift_logits = lm_logits[:, :-1, :].contiguous().float()
        shift_labels = testenc[:, (i * 2048) : ((i + 1) * 2048)][:, 1:].to(model.device)
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1),
        )
        # Total negative log-likelihood for this window.
        neg_log_likelihood = loss.float() * 2048
        nlls.append(neg_log_likelihood.item())
        progress.set_description("Evaluating")

ppl = torch.exp(torch.tensor(nlls).sum() / (nsamples * 2048))
print(f'c4 perplexity: {ppl} \n')

The method to calculate perplexity follows the one in your shared HuggingFace link. The above code gives me a C4 perplexity of 8.88 for Llama-3-8B, which is the same as the number reported in OmniQuant.
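In other words, with $N$ = `nsamples` windows of 2048 tokens each and $\bar{\ell}_i$ the mean cross-entropy of window $i$ (so the summed NLL of a window is taken as $2048\,\bar{\ell}_i$), the script above computes

$$\mathrm{PPL} = \exp\!\left(\frac{1}{2048\,N}\sum_{i=1}^{N} 2048\,\bar{\ell}_i\right) = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} \bar{\ell}_i\right),$$

i.e., the exponentiated average per-window cross-entropy, matching the fixed-length (stride = 2048) setup in the HuggingFace guide.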

However, the C4 perplexity reported in Table I of the FLUTE paper is 9.2. I am not sure whether my evaluation script differs from yours. Hope you can help me clarify this, thanks a lot!

@radi-cho (Contributor) commented:

Ah, I see, you are talking about the unquantized results. While I cannot immediately confirm if your script is consistent with ours, I remember that for unquantized models we relied on and cited https://arxiv.org/pdf/2404.14047 as a source for comparison.

@yc2367 (Author) commented Dec 16, 2024:

Thanks for sharing the reference. I understand that the FP16 PPL could be directly sourced as 9.2 for Llama-3-8B. However, I think the 4-bit and 3-bit PPL reported in the FLUTE paper were also calculated with the method that gives the 9.2 FP16 PPL, right?

Since the referenced paper also doesn't include a C4 PPL evaluation script, I wonder if you could share the script you used to calculate the 4-bit and 3-bit C4 PPL for FLUTE, so that I can use it to check whether my quantization method achieves a better PPL.

Looking forward to your reply!

@radi-cho (Contributor) commented:

Thank you for your question and for your patience, especially as the end of the semester/year became quite hectic.

We referenced the C4 numbers from [1] because our own calculations were initially in line with their findings. After your question, we revisited the literature and examined more recent papers [2, 3]. It seems like there are discrepancies in the reported C4 numbers for what appears to be the same model. We suspect these variations may be due to differences in the data subsets used by various implementations. Notably, [2] reported the 8.88 PPL number you mentioned.

With this in mind, we re-examined our (unquantized) baseline and found that our implementations seem to more closely match those in [2]. This led us to recognize that the C4 numbers we originally cited are inconsistent with both your calculations and our own.

Below, we have attached the script we used to compute the PPL values. For reference, it produced a PPL of 8.8844 for the unquantized model, which aligns closely with your results, and it should produce consistent comparisons with our NF and custom AWQ results.

We plan to update the borrowed C4 numbers in our paper to ensure consistency. Thank you again for drawing our attention to this and for helping us improve our work!

[1] https://arxiv.org/abs/2404.14047
[2] https://arxiv.org/abs/2407.11062
[3] https://arxiv.org/abs/2405.14917

import random

import torch
from datasets import load_dataset
from tqdm import tqdm

# `model` and `tokenizer` are assumed to be already loaded (model on CUDA).
test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []

# Sample 256 random 2048-token windows from the C4 validation shard.
for _ in range(256):
    while True:
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])

# Shape (1, 256 * 2048).
test_encodings = torch.hstack(test_encodings)

device = "cuda"
max_length = 2048
stride = max_length
seq_len = test_encodings.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # equals max_length here, since stride == max_length
    input_ids = test_encodings[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    # Mask tokens already scored in a previous window (a no-op with this stride).
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)
    # Print the running perplexity estimate.
    print(torch.exp(torch.stack(nlls).mean()))

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
