C4 PPL Evaluation Script #18

Open
yc2367 opened this issue Dec 10, 2024 · 8 comments

@yc2367 commented Dec 10, 2024:

Thanks for sharing the nice work. May I know what script you used for evaluating C4 perplexity? I noticed that the FP16 C4 PPL in your paper is different from the numbers reported in OmniQuant.

@HanGuo97 (Owner) commented:

Thanks for the kind words and the question!

Adding @radi-cho, who can better answer this question, to the conversation.

@yc2367 (Author) commented Dec 14, 2024:

Hi, thanks a lot for your reply! Just wanted to follow up on this issue. Looking forward to hearing from you!

@HanGuo97 (Owner) commented:

One more gentle reminder for @radi-cho (who has been busy with end-of-semester things lately).

@radi-cho (Contributor) commented:

Hi, the perplexity can be calculated as described here, https://huggingface.co/docs/transformers/en/perplexity, with a sequence length of 2048. The key with C4 is that different papers use different subsets of examples to calculate it. I found the following to be the most commonly used:

import random

import torch
from datasets import load_dataset

# `tokenizer` is assumed to be an already-loaded Hugging Face tokenizer for the model under evaluation.
test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []

# Sample 256 random 2048-token windows from the C4 validation shard.
for _ in range(256):
    while True:
        # Keep drawing documents until one tokenizes to more than 2048 tokens.
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    # Take a random 2048-token slice of the selected document.
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])

# Concatenate along the sequence dimension: shape (1, 256 * 2048).
test_encodings = torch.hstack(test_encodings)

We loaded the subset as above for the FLUTE paper.
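For completeness, here is a minimal sketch of how those windows can then be scored in the fixed-length setting from the HuggingFace guide; it assumes `model` is an already-loaded causal LM and is only illustrative, not necessarily the exact script we ran:

import torch

max_length = 2048
nlls = []
with torch.no_grad():
    # Score each non-overlapping 2048-token window.
    for begin in range(0, test_encodings.size(1), max_length):
        input_ids = test_encodings[:, begin:begin + max_length].to(model.device)
        # With labels == input_ids, HF causal LMs return the mean
        # next-token cross-entropy over the window.
        loss = model(input_ids, labels=input_ids).loss
        nlls.append(loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"C4 perplexity: {ppl.item():.4f}")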

@yc2367 (Author) commented Dec 15, 2024:

Thank you very much for the kind reply. I think the subset sampling function that you used is the same as the one in OmniQuant. I am using the following code extracted from the OmniQuant codebase to calculate the C4 perplexity:

import random

import torch
import torch.nn as nn
import tqdm
from datasets import load_dataset

# `model` and `tokenizer` are assumed to be already loaded.
random.seed(0)
valenc = []
testenc = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)
# Sample 256 random 2048-token windows (same sampling as OmniQuant).
for _ in range(256):
    while True:
        i = random.randint(0, len(testenc) - 1)
        tmp = tokenizer(testenc[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    valenc.append(tmp.input_ids[:, i:j])

testenc = torch.hstack(valenc)

# Evaluate perplexity over the 256 non-overlapping 2048-token windows.
nsamples = testenc.numel() // 2048
loss_fct = nn.CrossEntropyLoss()
nlls = []
with tqdm.tqdm(range(nsamples)) as progress:
    for i in progress:
        batch = testenc[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
        with torch.no_grad():
            lm_logits = model(batch, use_cache=False, output_hidden_states=False, output_attentions=False)[0]
        # Shift logits and labels by one position for next-token prediction.
        shift_logits = lm_logits[:, :-1, :].contiguous().float()
        shift_labels = testenc[:, (i * 2048) : ((i + 1) * 2048)][:, 1:].to(model.device)
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1),
        )
        # Total negative log-likelihood for this window.
        neg_log_likelihood = loss.float() * 2048
        nlls.append(neg_log_likelihood.item())
        progress.set_description("Evaluating")

ppl = torch.exp(torch.tensor(nlls).sum() / (nsamples * 2048))
print(f'c4 perplexity: {ppl} \n')

The method to calculate perplexity follows the one in your shared HuggingFace link. The above code gives me a C4 perplexity of 8.88 for Llama-3-8B, which is the same as the number reported in OmniQuant.
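In other words, with $N$ = `nsamples` windows of 2048 tokens each and $\bar{\ell}_i$ the mean cross-entropy of window $i$ (so the summed NLL of a window is taken as $2048\,\bar{\ell}_i$), the script above computes

$$\mathrm{PPL} = \exp\!\left(\frac{1}{2048\,N}\sum_{i=1}^{N} 2048\,\bar{\ell}_i\right) = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} \bar{\ell}_i\right),$$

i.e., the exponentiated average per-window cross-entropy, matching the fixed-length (stride = 2048) setup in the HuggingFace guide.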

However, the C4 perplexity reported in Table I of the FLUTE paper is 9.2. I am not sure whether my evaluation script differs from yours. Hope you can help me clarify this, thanks a lot!

@radi-cho (Contributor) commented:

Ah, I see, you are talking about the unquantized results. While I cannot immediately confirm if your script is consistent with ours, I remember that for unquantized models we relied on and cited https://arxiv.org/pdf/2404.14047 as a source for comparison.

@yc2367 (Author) commented Dec 16, 2024:

Thanks for sharing the reference. I understand that the FP16 PPL could be directly sourced as 9.2 for Llama-3-8B. However, I think the 4-bit and 3-bit PPL reported in the FLUTE paper were also calculated with the method that gives the 9.2 FP16 PPL, right?

Since the referenced paper also doesn't include a C4 PPL evaluation script, I wonder if you could share the script you used to calculate the 4-bit and 3-bit C4 PPL for FLUTE, so that I can use it to check whether my quantization method achieves a better PPL.

Looking forward to your reply!

@radi-cho (Contributor) commented:

Thank you for your question and for your patience, especially as the end of the semester/year became quite hectic.

We referenced the C4 numbers from [1] because our own calculations were initially in line with their findings. After your question, we revisited the literature and examined more recent papers [2, 3]. It seems like there are discrepancies in the reported C4 numbers for what appears to be the same model. We suspect these variations may be due to differences in the data subsets used by various implementations. Notably, [2] reported the 8.88 PPL number you mentioned.

With this in mind, we re-examined our (unquantized) baseline and found that our implementations seem to more closely match those in [2]. This led us to recognize that the C4 numbers we originally cited are inconsistent with both your calculations and our own.

Below, we have attached the script we used to compute the PPL values. For reference, it produced a PPL of 8.8844 for the unquantized model, which aligns closely with your results, and it should produce consistent comparisons with our NF and custom AWQ results.

We plan to update the borrowed C4 numbers in our paper to ensure consistency. Thank you again for drawing our attention to this and for helping us improve our work!

[1] https://arxiv.org/abs/2404.14047
[2] https://arxiv.org/abs/2407.11062
[3] https://arxiv.org/abs/2405.14917

import random

import torch
from datasets import load_dataset
from tqdm import tqdm

# `model` and `tokenizer` are assumed to be already loaded (model on CUDA).
test = load_dataset(
    'allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
)

random.seed(0)
test_encodings = []

# Sample 256 random 2048-token windows from the C4 validation shard.
for _ in range(256):
    while True:
        i = random.randint(0, len(test) - 1)
        tmp = tokenizer(test[i]['text'], return_tensors='pt')
        if tmp.input_ids.shape[1] > 2048:
            break
    i = random.randint(0, tmp.input_ids.shape[1] - 2048 - 1)
    j = i + 2048
    test_encodings.append(tmp.input_ids[:, i:j])

# Shape (1, 256 * 2048).
test_encodings = torch.hstack(test_encodings)

device = "cuda"
max_length = 2048
stride = max_length
seq_len = test_encodings.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # equals max_length here, since stride == max_length
    input_ids = test_encodings[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    # Mask tokens already scored in a previous window (a no-op with this stride).
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)
    # Print the running perplexity estimate.
    print(torch.exp(torch.stack(nlls).mean()))

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
