
jennhu/metalinguistic-prompting: Materials for "Prompting is not a substitute for probability measurements in large language models" (EMNLP 2023) #684

Open · irthomasthomas opened this issue Mar 4, 2024 · 1 comment

Labels

  • Algorithms: Sorting, Learning or Classifying. All algorithms go here.
  • Code-Interpreter: OpenAI Code-Interpreter
  • dataset: public datasets and embeddings
  • llm-evaluation: Evaluating Large Language Models performance and behavior through human-written evaluation sets
  • New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
  • Papers: Research papers
  • Research: personal research notes for a topic

Comments

@irthomasthomas (Owner)

Title

jennhu/metalinguistic-prompting: Materials for "Prompting is not a substitute for probability measurements in large language models" (EMNLP 2023)

Description

"Prompting is not a substitute for probability measurements in large language models

This repository contains materials for the EMNLP 2023 paper "Prompting is not a substitute for probability measurements in large language models" (Hu & Levy, 2023). The preprint is available on arXiv.

If you find the code or data useful in your research, please use the following citation:

@inproceedings{hu_prompting_2023,
title = {Prompting is not a substitute for probability measurements in large language models},
author = {Hu, Jennifer and Levy, Roger},
year = {2023},
booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
url = {https://arxiv.org/abs/2305.13264}
}

Evaluation materials

Evaluation datasets can be found in the datasets folder. Please refer to the README in that folder for more details on how the stimuli were assembled and formatted.

Evaluation scripts

The scripts folder contains scripts for running the experiments. There are separate scripts for models accessed through Huggingface (*hf.sh) and the OpenAI API (*openai.sh).

For example, to evaluate flan-t5-small on the SyntaxGym dataset of Experiment 3b, run the following command from the root of this directory:

bash scripts/run_exp3b_hf.sh syntaxgym google/flan-t5-small flan-t5-small

Please note that to run the OpenAI models, you will need to save your OpenAI API key to a file named key.txt in the root of this directory. For security reasons, do not commit this file (it is ignored in .gitignore).
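For reference, here is a minimal sketch of how such a key file might be consumed. It assumes the legacy (pre-1.0) openai Python client and is not the repository's actual loading code:

# Minimal sketch (not the repository's actual code): read the API key from
# key.txt in the repository root and hand it to the legacy openai client.
from pathlib import Path

import openai  # assumes the pre-1.0 openai package interface

openai.api_key = Path("key.txt").read_text().strip()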

Results and analyses

The results from the paper can be accessed by extracting the results.zip file. This will create a folder called results, which is organized by experiment:

  • exp1_word-prediction
  • exp2_word-comparison
  • exp3a_sentence-judge
  • exp3b_sentence-comparison
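If you prefer to extract the archive programmatically rather than with unzip, a short Python sketch (generic, not a script from the repository) produces the same results folder:

# Generic sketch: extract results.zip in the repository root,
# creating the results/ folder listed above.
import zipfile

with zipfile.ZipFile("results.zip") as zf:
    zf.extractall(".")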

A few notes about the results:

  • Each result file is named in the following format: <eval_type>.json. For the experiments where option order matters, there is an additional _<option_order> suffix in the name.
  • Each result file is formatted as a JSON file, with two dictionaries:
    • meta contains meta information about the run (e.g., name of model, timestamp of run, path to data file)
    • results contains the results from the run, formatted as a list of dictionaries (one per stimulus item)

The results from the direct evaluation method are identical across Experiments 3a and 3b (see paper for details).
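As a rough illustration of this layout, the sketch below loads one result file and prints its two top-level dictionaries. The file path and <eval_type> value are hypothetical placeholders, not file names taken from the archive:

# Hypothetical example path; substitute a real <experiment>/<eval_type>.json
# from the extracted results/ folder.
import json

path = "results/exp3b_sentence-comparison/direct.json"

with open(path) as f:
    data = json.load(f)

print(sorted(data["meta"]))           # run metadata: model name, timestamp, data path, ...
print(len(data["results"]), "items")  # one dictionary per stimulus item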

The figures from our paper can be reproduced using the analysis.ipynb notebook.

URL

https://github.com/jennhu/metalinguistic-prompting

Suggested labels

{'label-name': 'EMNLP-2023', 'label-description': "Materials and information related to the EMNLP 2023 paper 'Prompting is not a substitute for probability measurements in large language models'", 'confidence': 56.95}

irthomasthomas commented Mar 4, 2024

Related content

#684 - Similarity score: 1.0

#238 - Similarity score: 0.91

#309 - Similarity score: 0.89

#136 - Similarity score: 0.88

#706 - Similarity score: 0.88

#715 - Similarity score: 0.87
