
CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

This repository contains code for our COLM 2024 paper "CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting".



Overview

Figure: overview of the CULTURE-GEN pipeline.

Data

Each directory in data/[gpt-4,llama2-13b,mistral-7b] contains the generations produced by prompting that model, together with the culture symbols extracted from those generations. Each directory contains the following files:

    # Raw generations from culture-conditioned generation task
    generations.json

    # Culture Symbols for each topic
    favorite_music.json
    exercise_routine.json
    music_instrument.json
    favorite_show_or_movie.json
    food.json
    picture.json
    statue.json
    clothing.json
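
To take a quick look at these files, a minimal Python sketch such as the following can be used. The internal key structure of each JSON file is not documented in this README, so the snippet only inspects the top-level layout; the model directory name is an example.

    import json
    from pathlib import Path

    # Example model directory; substitute gpt-4, llama2-13b, or mistral-7b.
    data_dir = Path("data/mistral-7b")

    # Raw culture-conditioned generations and one per-topic symbol file.
    generations = json.loads((data_dir / "generations.json").read_text())
    food_symbols = json.loads((data_dir / "food.json").read_text())

    # The exact schema is not specified here, so just inspect the top level.
    for name, obj in [("generations", generations), ("food", food_symbols)]:
        keys = list(obj)[:5] if isinstance(obj, dict) else f"{len(obj)} items"
        print(name, type(obj).__name__, keys)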

Prompting

To prompt a language model for topic-wise culture-conditioned generations, run the following script:

python new_culture_prompt.py \
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --prompt \
    --overwrite # flag to overwrite existing cache \
    --probably # add "probably" to prompt \
    --topic_list __

Extracting Culture Symbols

First, extract candidate symbols from the raw generations using a language model; the provided implementation uses gpt-4-turbo-preview:

python new_culture_prompt.py \
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --shorten \
    --probably # prompted with "probably" \
    --topic_list __

Next, calculate the symbol-culture joint probability, then choose culture symbols for each culture based on that probability:

python culture_symbols.py \
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --probably # prompted with "probably" \
    --topic_list __ \
    --extract # extract ngrams from candidate symbols \
    --probability # calculate symbol-culture joint probability \
    --choose # choose culture symbols for each culture
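
The exact scoring is implemented in culture_symbols.py; as a rough illustration of the idea, the joint probability of a symbol and a culture can be estimated from co-occurrence counts over the generations, and the highest-probability symbols are kept for each culture. The function below is a simplified sketch under that assumption, not the repository's implementation.

    from collections import Counter, defaultdict

    def choose_culture_symbols(symbols_by_culture, top_k=20):
        # symbols_by_culture: {culture: [candidate symbols extracted from its generations]}
        joint_counts = defaultdict(Counter)
        total = 0
        for culture, symbols in symbols_by_culture.items():
            for symbol in symbols:
                joint_counts[culture][symbol] += 1
                total += 1

        chosen = {}
        for culture, counter in joint_counts.items():
            # Joint probability estimate: count(symbol, culture) / total count.
            scores = {s: c / total for s, c in counter.items()}
            chosen[culture] = sorted(scores, key=scores.get, reverse=True)[:top_k]
        return chosen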

Figure: total number of culture symbols extracted for each LLM (*gpt-4: only candidate symbols).

Culture Frequency in Training Data

The frequency of cultures in the RedPajama dataset is stored in data/dataset_search/nationality_count_document.pkl, and the culture-topic co-occurrence frequency in RedPajama is stored in data/dataset_search/nationality_topic_count.pkl.
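
These are standard pickle files and can be loaded directly; the sketch below assumes they contain count mappings keyed by culture (and by culture and topic), which is not spelled out in this README.

    import pickle

    # Precomputed RedPajama frequency statistics shipped with the repository.
    with open("data/dataset_search/nationality_count_document.pkl", "rb") as f:
        culture_doc_counts = pickle.load(f)

    with open("data/dataset_search/nationality_topic_count.pkl", "rb") as f:
        culture_topic_counts = pickle.load(f)

    # Assumed structure: {culture: document count} and {culture: {topic: count}}.
    print(type(culture_doc_counts).__name__, type(culture_topic_counts).__name__)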

Figure: number of documents in which each culture is mentioned in RedPajama.

Evaluation

Markedness

Evaluate markedness in a model's generations:

python cultural_evaluations.py \ 
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --probably # prompted with "probably" \
    --topic_list __ \
    --eval markedness \
    --plot

Output will be stored in "{args.home_dir}/probable_data/categories_nationality_100_{args.model_name}_prob={args.probably}_markedness_evaluation.json" in the following format:

"{topic}": {
    "neighbor": {
        "{culture}": {
            "male": {
                "vocab_mark": int, # generations with vocabulary markers, eg. "traditional" or culture name
                "paren_mark": int, # generations with parenthesis markers
                "both_mark": int # generations with both markers
            },
            "female": {
                "vocab_mark": int,
                "paren_mark": int,
                "both_mark": int
            },
            # gender neutral
            "": {
                "vocab_mark": int,
                "paren_mark": int,
                "both_mark": int
            }
        },
        ...
    }
},
...
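
For reference, the nested format above can be aggregated with a few lines of Python. The concrete placeholder values for the path are illustrative, and the way the prob= flag is rendered in the filename is an assumption.

    import json

    # Illustrative placeholder values for the output path shown above.
    home_dir, model_name, probably = ".", "mistral-7b", True
    path = (f"{home_dir}/probable_data/categories_nationality_100_"
            f"{model_name}_prob={probably}_markedness_evaluation.json")

    with open(path) as f:
        markedness = json.load(f)

    # Sum vocabulary-marker counts per topic across roles, cultures, and genders.
    for topic, roles in markedness.items():
        vocab_total = sum(
            counts["vocab_mark"]
            for cultures in roles.values()     # e.g. the "neighbor" role
            for genders in cultures.values()   # per-culture entry
            for counts in genders.values()     # male / female / gender-neutral
        )
        print(topic, vocab_total)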

Figure: plot of vocabulary markedness on the "clothing" topic.

Diversity

Calculate diversity in a model's generations:

python cultural_evaluations.py \ 
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --probably # prompted with "probably" \
    --topic_list __ \
    --eval diversity

Output will be stored in "{args.home_dir}/probable_data/categories_nationality_100_{args.model_name}_prob={args.probably}_diversity_evaluation_count.json" in the following format:

"{topic}": {
    "": {
        "{culture}": {
            float ([0,1]), # simpson index of diversity
            int # number of unique culture symbols assigned to the culture
        },
        ...
    }
},
...
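
The diversity score above is described as a Simpson index of diversity. For reference, the textbook formula 1 - Σ p_i² over symbol counts looks like the sketch below; this is the standard definition and may differ in detail from the repository's implementation.

    from collections import Counter

    def simpson_diversity(symbols):
        """Simpson's index of diversity, 1 - sum(p_i^2), over a list of symbols."""
        counts = Counter(symbols)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return 1.0 - sum((n / total) ** 2 for n in counts.values())

    # Low diversity (one dominant symbol) vs. high diversity (evenly spread).
    print(simpson_diversity(["kimono"] * 9 + ["yukata"]))            # ~0.18
    print(simpson_diversity(["kimono", "yukata", "hakama", "obi"]))  # 0.75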

Calculate the correlation of diversity/markedness with culture appearance frequency in the training data:

python cultural_evaluations.py \ 
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --probably # prompted with "probably" \
    --topic_list __ \
    --eval correlation

Figure: Kendall-tau correlation between diversity (count) and culture-topic co-occurrence frequency in RedPajama.
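
The correlation itself can be reproduced on any two per-culture series with SciPy's kendalltau; the dictionaries below are made-up illustrative values, with the real numbers coming from the diversity evaluation output and the RedPajama counts above.

    from scipy.stats import kendalltau

    # Illustrative per-culture values; replace with the diversity evaluation
    # output and the counts from nationality_topic_count.pkl.
    diversity_count = {"Japan": 42, "Brazil": 31, "Kenya": 12, "Norway": 25}
    training_frequency = {"Japan": 90000, "Brazil": 45000, "Kenya": 8000, "Norway": 30000}

    cultures = sorted(diversity_count)
    tau, p_value = kendalltau(
        [diversity_count[c] for c in cultures],
        [training_frequency[c] for c in cultures],
    )
    print(f"Kendall tau = {tau:.3f}, p = {p_value:.3g}")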

Culture-agnostic Presence

Count, for each culture, the culture symbols that also appear in the model's culture-agnostic generations, and plot a boxplot showing the variance of overlapping culture symbols within each geographic region.

python cultural_evaluations.py \ 
    --home_dir __ # the directory to store searched data \
    --model_name __ # gpt-4, mistral-7b, llama2-13b \
    --num_samples __ # default=100 \
    --probably # prompted with "probably" \
    --topic_list __ \
    --eval culture_agnostic \
    --correlation \
    --plot

Figure: variance of overlapping culture symbols within each geographic region.
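
Conceptually, the overlap count behind the boxplot can be computed as a per-culture set intersection and then grouped by region; the sketch below illustrates that idea with assumed inputs and is not the script's implementation.

    def agnostic_overlap_by_region(culture_symbols, agnostic_symbols, region_of):
        # culture_symbols: {culture: set of its culture symbols}
        # agnostic_symbols: set of symbols seen in culture-agnostic generations
        # region_of: {culture: geographic region}
        overlaps = {}
        for culture, symbols in culture_symbols.items():
            count = len(symbols & agnostic_symbols)
            overlaps.setdefault(region_of[culture], []).append(count)
        return overlaps  # one list per region, ready for a boxplot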
