This repository contains the research work and implementation for investigating how Large Language Models (LLMs) understand and process word meanings in context. The project focuses on semantic analysis and polysemy, examining how well various LLMs can distinguish between different meanings of the same word across contexts.
The research explores three fundamental questions:
- Can LLMs grasp word meanings and sentence context with human-like understanding?
- How do LLMs perform when interpreting word meanings across different contexts?
- What are effective methods for evaluating LLM performance in this domain?
Our investigation centers on semantic analysis and polysemy, the linguistic phenomenon in which a word has multiple related meanings that share core semantic properties but differ in contextual usage. For example:
- The word "head" can mean:
  - A leader/person in charge: "She is the head of the department"
  - A physical body part: "He hit his head on the doorframe"
- The word "paper" can refer to:
  - An academic article: "The professor published a paper in Nature"
  - A physical material: "I need some paper to print these documents"
We utilized the Multilingual Word-in-Context (WiC) Dataset for our experiments. This dataset is specifically designed to evaluate LLMs' ability to disambiguate word meanings through binary classification:
- Label 1: Same meaning in both contexts
- Label 0: Different meanings
The dataset structure includes the following fields (an example record is sketched after this list):
- target_word: The word being analyzed in both examples
- PoS: Part-of-Speech tag (N for noun, V for verb)
- start-index_i and end-index_i: Character indices for target word location
- example_i: Context sentences
- label: Binary classification (0/1)
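For illustration, one row of the dataset could be represented as a Python dictionary along the lines below. The sentences, character indices, and exact field spellings are illustrative assumptions based on the field list above, not values copied from the dataset.

```python
# Hypothetical example of one WiC-style record (values are illustrative only).
sample = {
    "target_word": "head",
    "PoS": "N",                                # N = noun, V = verb
    "example_1": "She is the head of the department.",
    "start-index_1": 11, "end-index_1": 15,    # character span of "head" in example_1
    "example_2": "He hit his head on the doorframe.",
    "start-index_2": 11, "end-index_2": 15,    # character span of "head" in example_2
    "label": 0,                                # 0 = different meanings, 1 = same meaning
}

# Recover the target word from the character indices of the first sentence.
word = sample["example_1"][sample["start-index_1"]:sample["end-index_1"]]
assert word == sample["target_word"]
```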
We analyzed 100 samples, equally distributed between the two labels, with each of five different LLMs:
- GPT-4-turbo (OpenAI)
- GPT-4 (OpenAI)
- GPT-3.5 (OpenAI)
- Phi-3 (Microsoft)
- Llama 3.2 (Meta)
Access methods (a minimal query sketch follows this list):
- OpenAI API for GPT models
- Ollama for Phi-3 and Llama 3.2 (open-source models)
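A minimal sketch of how both families of models might be queried, assuming the `openai` and `ollama` Python packages are installed. The model identifiers shown ("gpt-4-turbo", "llama3.2") are assumptions rather than the exact strings used in the experiments.

```python
# Query a hosted model via the OpenAI API and a local model via Ollama.
from openai import OpenAI
import ollama

prompt = "..."  # one of the prompt templates plus the sample to classify

# OpenAI-hosted models (GPT-4-turbo, GPT-4, GPT-3.5)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
openai_reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Locally served open models (Phi-3, Llama 3.2) via Ollama
ollama_reply = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": prompt}],
)["message"]["content"]
```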
The models were evaluated using two distinct prompt templates (quoted verbatim below):
Template 0: You are a helpful assistant. My goal is to check if a target word in a sentence has the same intended sense in another sentence. A target word and two sentences will be passed. Check if the intended sense is the same or not. Label as (1) if the meaning is the same, viceversa (0) for different meaning. In the final response you always add a line where you write the correct Label you think it is, starting in the following way: Label
Template 1: You are a linguistics expert. Determine if a given 'target word' has a same meaning in two sentences. Respond with 1 if the meaning of the word is similar or 0 if it is too different. Include your explanation, then write in new line 'Label: 1 or 0'
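Both templates ask the model to end its answer with a line that starts with "Label". The sketch below shows one way the full prompt could be assembled and the predicted label extracted from the response; the exact formatting used to append the target word and sentences to the template is an assumption.

```python
import re

def build_prompt(template: str, target_word: str, example_1: str, example_2: str) -> str:
    # Append the target word and the two context sentences to the chosen template.
    return (
        f"{template}\n\n"
        f"Target word: {target_word}\n"
        f"Sentence 1: {example_1}\n"
        f"Sentence 2: {example_2}"
    )

def parse_label(response: str) -> int | None:
    # Find the last occurrence of "Label: 1" / "Label: 0" (also tolerates "Label 1").
    matches = re.findall(r"Label\s*:?\s*(0|1)", response)
    return int(matches[-1]) if matches else None
```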
We employed multiple evaluation metrics (a computation sketch follows this list):
- Accuracy
- Precision
- Recall
- F1 score
- Confusion Matrix
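Once the gold labels and the parsed predictions are collected, these metrics can be computed directly with scikit-learn. The sketch below assumes both are plain lists of 0/1 values; the label lists shown are illustrative only.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

# y_true: gold labels from the dataset; y_pred: labels parsed from model responses.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative values only
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```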
We conducted a detailed embedding analysis (sketched after this list) using:
- Principal Component Analysis (PCA) for dimensionality reduction
- Cosine similarity for comparing context embeddings
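A minimal sketch of the embedding analysis, assuming the context embeddings are available as NumPy arrays; random vectors stand in for real embeddings here, and the embedding model itself is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# emb_1, emb_2: embeddings of each sample's two context sentences (one row per sample).
# Random data stands in for real context embeddings in this sketch.
rng = np.random.default_rng(0)
emb_1 = rng.normal(size=(100, 768))
emb_2 = rng.normal(size=(100, 768))

# Cosine similarity between the two contexts of each sample.
similarities = np.diag(cosine_similarity(emb_1, emb_2))

# PCA to 2 dimensions for visualising all context embeddings together.
coords_2d = PCA(n_components=2).fit_transform(np.vstack([emb_1, emb_2]))
print(similarities[:5], coords_2d.shape)
```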
| Model | Template | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| GPT-4 Turbo | T0 | 0.66 | 0.83 | 0.40 | 0.54 |
| GPT-4 Turbo | T1 | 0.71 | 0.76 | 0.62 | 0.68 |
| GPT-4 | T0 | 0.65 | 0.78 | 0.42 | 0.55 |
| GPT-4 | T1 | 0.73 | 0.73 | 0.72 | 0.73 |
| GPT-3.5 | T0 | 0.57 | 0.77 | 0.20 | 0.32 |
| GPT-3.5 | T1 | 0.54 | 0.55 | 0.48 | 0.51 |
| Phi-3 | T0 | 0.54 | 0.67 | 0.16 | 0.26 |
| Phi-3 | T1 | 0.65 | 0.65 | 0.64 | 0.65 |
| Llama 3.2 | T0 | 0.52 | 0.57 | 0.16 | 0.25 |
| Llama 3.2 | T1 | 0.64 | 0.61 | 0.80 | 0.69 |
Note: T0 and T1 represent Template 0 and Template 1 respectively. The metrics demonstrate that GPT-4 Turbo achieves the highest precision (0.83) with Template 0, while Template 1 generally leads to more balanced performance across metrics.
Our analysis reveals several key insights:
- GPT-4 Turbo shows strong performance with both templates:
  - Template 0 achieves exceptional precision (0.83) but lower recall (0.40)
  - Template 1 delivers more balanced metrics with improved recall (0.62) and F1 score (0.68)
- Comparing across models:
  - GPT-4 maintains the most consistent performance across metrics, especially with Template 1
  - GPT-4 Turbo excels in precision but shows room for improvement in recall
  - The open-source models demonstrate competitive performance, particularly with Template 1
  - Llama 3.2 achieves the highest recall score (0.80) with Template 1
- Template patterns:
  - Template 1 generally produces more balanced metrics across all models
  - Template 0 tends to favor precision over recall