Problem with Evaluation #4

Open

duygunuryldz opened this issue Apr 8, 2024 · 2 comments

duygunuryldz commented Apr 8, 2024

Dear authors,
I have been trying to reproduce the results in your paper and noticed a problem with the evaluation of the Entity Inferences dataset.
To calculate the probability of each label for a sample, the probe sentence is forwarded only once, with the first label appended. The logits from that single pass are then used to calculate the probabilities of all the other labels as well. This causes a problem when the labels tokenize to different lengths.
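
To make the failure mode concrete, here is a toy sketch (not the repository's code; the probe sentence and labels are made up) showing that the logits produced from the probe plus the first label cannot be reused to score a label whose tokenized length differs:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

probe = "Her favorite city is"            # made-up probe sentence
labels = [" Paris", " New York City"]     # made-up candidate labels

seqs = [tok(probe + lab, return_tensors="pt") for lab in labels]
print([s["input_ids"].shape[1] for s in seqs])   # different lengths, e.g. [5, 7]

with torch.no_grad():
    logits = model(**seqs[0]).logits             # forwarded once, with the first label only
print(logits.shape)                              # (1, len(probe + first label), vocab_size)

# These logits are conditioned on " Paris" and their sequence length is tied to it,
# so gathering the token log-probs of " New York City" from them either reads
# misaligned positions (when everything is padded to one length) or fails on shape.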

In edit_func.py, for example, the logits of the raw model are computed at line 320:

pre_edit_logits = model_raw(**ex).logits

where ex is the probe sentence with the first label, created at line 312:

ex = {}
ex['input_ids'] = batch_pre["edit_inner"][0]['labels']['input_ids'][0].unsqueeze(0)
ex['attention_mask'] = batch_pre["edit_inner"][0]['labels']['attention_mask'][0].unsqueeze(0)
ex['labels'] = batch_pre["edit_inner"][0]['labels']['input_ids'][0].unsqueeze(0)
Then, at lines 347-369 of the same file, the probabilities for all labels of that sample are calculated:

    with torch.no_grad():
        n_probe_labels = batch_pre['edit_inner'][0]['labels']['input_ids'].size(0)
        pre_edit_dict = []
        for i in range(n_probe_labels):
            if dataset_name == 'ecbd':
                #code
            else:
                pre_label = batch_pre["edit_inner"][0]['labels']['input_ids'][i].unsqueeze(0)
            pre_edit_dict.append(get_log_probs(pre_edit_logits, pre_label, shift=True))
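
For context, I assume get_log_probs with shift=True simply gathers the log-probabilities of the label tokens from the shifted logits, roughly as in the sketch below (an assumption about its behaviour, not the actual implementation):

import torch

def get_log_probs_sketch(logits, labels, shift=True):
    # Assumed behaviour: under teacher forcing, the token at position t is
    # scored with the logits at position t - 1 (hence the shift).
    if shift:
        logits = logits[:, :-1, :]
        labels = labels[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)  # total log-probability of the label tokens

A helper like this has no way of knowing whether the logits were produced from the same sequence as the labels; since all labels of a sample sit in one (presumably padded) tensor, the shapes line up and the wrong scores are computed silently.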

I believe that, for each label, pre_edit_logits should be obtained by forwarding the probe sentence with that label, as follows:

with torch.no_grad():
    n_probe_labels = batch_pre['edit_inner'][0]['labels']['input_ids'].size(0)
    pre_edit_dict = []
    for i in range(n_probe_labels):
        ex = {}
        ex['input_ids'] = batch_pre["edit_inner"][0]['labels']['input_ids'][i].unsqueeze(0)
        ex['attention_mask'] = batch_pre["edit_inner"][0]['labels']['attention_mask'][i].unsqueeze(0)
        ex['labels'] = batch_pre["edit_inner"][0]['labels']['input_ids'][i].unsqueeze(0)

        pre_edit_logits = model_raw(**ex).logits
        pre_edit_dict.append(get_log_probs(pre_edit_logits, ex['labels'], shift=True))

This issue also exists in other functions in edit_func.py. You can verify the problem with the first sample in fake_person_implicit_attribute_dependent_adjective.json. With the version I provided above, I get 34.71% pre-edit accuracy for gpt2-xl on the Entity Inferences dataset.

I might also be misunderstanding some parts, so please correct me if I am wrong.
Thanks in advance.

joshinh commented Apr 24, 2024

Hi! I was trying out this code and also noticed this issue. But I think this is only an issue when we use edit_method = "prepend_def", right? When I try edit_method = "ft_per_ex" (i.e., fine-tuning one example at a time), I think this is handled correctly (line 191 in edit_func.py); I do get 34.71% pre accuracy in this case for gpt2-xl.

@EmilyGirl

Hello, do you have any code for generating text using the generate function?
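
For reference, a generic generation call with a Hugging Face causal LM such as gpt2-xl looks roughly like the sketch below (not code from this repository; the prompt is made up):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()

prompt = "The Eiffel Tower is located in"        # made-up prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)

print(tok.decode(out[0], skip_special_tokens=True))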
