Update 01 November 2024: Implemented a separate bulk-chain project for handling massive amounts of prompts with CoT. This concept was used in these studies.
Update 06 September 2024: Mentioned related information about the project in the BU-research-blog.
Update 11 August 2024: Announced the talk on this framework at NLPSummit 2024, with a preliminary announcement and details in the X/Twitter post.
Update 23 June 2024: All metrics in Development mode have been evaluated under the `closest` mode, which decides the result class based on the first entry of the label.
Update 11 June 2024: Added an evaluation mode that counts the first label entry. See the `eval-mode` parameter key.
This repository assesses LLM reasoning capabilities in Targeted Sentiment Analysis on the RuSentNE dataset, proposed as part of the self-titled competition.
In particular, we use pre-trained LLMs for the following dataset splits:
- Development
- Final
We use [quick-cot] to experiment with the following reasoning approaches (an illustrative prompt sketch follows the list):
- Instruction Prompts
- Chain-of-Thoughts (THoR)
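For reference, a zero-shot instruction prompt for this task may look roughly as follows. The template below is an illustrative assumption for demonstration purposes only; the exact prompts used in the experiments are those configured in [quick-cot].

```python
# Illustrative zero-shot instruction prompt template (an assumption for demonstration,
# not the exact wording used in the experiments).
PROMPT_TEMPLATE = (
    'What is the attitude of the sentence "{sentence}" '
    'towards the target "{entity}"? '
    "Answer with one of: positive, negative, neutral."
)

print(PROMPT_TEMPLATE.format(
    sentence="The company was praised by financial analysts.",
    entity="company",
))
```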
All the sqlite results are stored in the `contents` table.
Option 1. You may use sqlitebrowser to access the results and export them into CSV.
Option 2. Use the sqlite2csv.py script implemented in this repository.
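Alternatively, a minimal sketch using only Python's standard library can dump the `contents` table into CSV; the database and output file names below are placeholders, not the repository's actual artifact names.

```python
import csv
import sqlite3

DB_PATH = "results.sqlite"   # placeholder: path to a downloaded answers database
OUT_PATH = "contents.csv"    # placeholder: output CSV file

with sqlite3.connect(DB_PATH) as conn:
    cursor = conn.execute("SELECT * FROM contents")
    header = [col[0] for col in cursor.description]
    with open(OUT_PATH, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # column names of the `contents` table
        writer.writerows(cursor)  # all stored rows
```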
This is an open-access dataset split (sentiment labels available) utilized for the development stage; it can be used by anyone for evaluation checks.
Dataset: valiation_data_labeled.csv
`*` denotes evaluation in first-entry mode (the result class is decided by the first label entry found in the answer).
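To illustrate what first-entry mode does, here is a minimal sketch of such a decision rule; the label names and the regex-based parsing are simplified assumptions rather than the benchmark's actual answer parser.

```python
import re

LABELS = ["positive", "negative", "neutral"]  # assumed label names, for illustration only


def first_entry_label(answer: str):
    """Pick the class whose label mention appears earliest in the LLM answer."""
    positions = {
        label: match.start()
        for label in LABELS
        if (match := re.search(label, answer, re.IGNORECASE)) is not None
    }
    if not positions:
        return None  # no label mentioned: counted towards the N/A % column
    return min(positions, key=positions.get)


print(first_entry_label("The attitude is negative rather than neutral."))  # -> negative
```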
Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
---|---|---|---|---|---|---|
GPT-3.5-0613 | 🇺🇸 | CoT THoR | 43.46 | 46.16 | 0.21 | answers |
GPT-3.5-1106 | 🇺🇸 | CoT THoR | 40.83 | 39.91 | 0.49 | answers |
mistral-7b | 🇺🇸 | CoT THoR | 42.34 | 51.43 | 0.04 | answers |
Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
---|---|---|---|---|---|---|
Proprietary | ||||||
GPT-4-turbo-2024-04-09 | 🇺🇸 | zero-shot | 50.79 | 61.19 | 0.0 | answers |
GPT-3.5-0613 | 🇺🇸 | zero-shot | 47.1 | 57.76 | 0.0 | answers |
GPT-3.5-1106 | 🇺🇸 | zero-shot | 45.79 | 52.55 | 0.0 | answers |
mistral-large-latest | 🇺🇸 | zero-shot | 44.48 | 57.24 | 0.0 | answers |
gpt-4o | 🇺🇸 | zero-shot | 42.84 | 56.19 | 0.0 | answers |
Open & Less 100B | ||||||
llama-3-70b-instruct | 🇺🇸 | zero-shot | 49.79 | 61.24 | 0.0 | answers |
mixtral-8x22b | 🇺🇸 | zero-shot | 46.09 | 58.24 | 0.0 | answers |
Phi-3-small-8k-instruct | 🇺🇸 | zero-shot | 46.87 | 57.02 | 0.07 | answers |
mixtral-8x7b | 🇺🇸 | zero-shot | 47.33 | 56.36 | 0.07 | answers |
llama-2-70b-chat | 🇺🇸 | zero-shot | 42.42 | 54.25 | 13.44 | answers |
Open & Less 10B | ||||||
Gemma-2-9b-it | 🇺🇸 | zero-shot | 45.57 | 55.06 | 0.0 | answers |
llama-3-8b-instruct | 🇺🇸 | zero-shot | 45.25 | 54.43 | 0.0 | answers |
Mistral-7B-Instruct-v0.3 | 🇺🇸 | zero-shot | 45.23 | 55.5 | 0.0 | answers |
Phi-3-mini-4k-instruct | 🇺🇸 | zero-shot | 44.62 | 54.71 | 0.0 | answers |
Qwen1.5-7B-Chat | 🇺🇸 | zero-shot | 44.39 | 55.55 | 0.04 | answers |
google_flan-t5-xl | 🇺🇸 | zero-shot | 43.73 | 53.72 | 0.0 | answers |
mistral-7b | 🇺🇸 | zero-shot | 43.11 | 53.64 | 0.11 | answers |
Qwen2-7B-Instruct | 🇺🇸 | zero-shot | 39.74 | 48.11 | 3.87 | answers |
Qwen2-1.5B-Instruct | 🇺🇸 | zero-shot | 33.88 | 48.59 | 0.0 | answers |
Qwen1.5-1.8B-Chat | 🇺🇸 | zero-shot | 33.65 | 47.28 | 0.04 | answers |
Open & Less 1B | ||||||
Flan-T5-large | 🇺🇸 | zero-shot | 36.72 | 24.51 | 0.0 | answers |
Qwen2-0.5B-Instruct | 🇺🇸 | zero-shot | 9.52 | 33.0 | 0.0 | answers |
Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
---|---|---|---|---|---|---|
Proprietary | ||||||
GPT-3.5-0613 | 🇷🇺 | zero-shot | 44.15 | 53.63 | 1.51 | answers |
gpt-4o | 🇷🇺 | zero-shot | 44.15 | 57.5 | 0.0 | answers |
GPT-4-turbo-2024-04-09 | 🇷🇺 | zero-shot | 42.21 | 56.36 | 0.0 | answers |
GPT-3.5-1106 | 🇷🇺 | zero-shot | 41.34 | 46.83 | 0.46 | answers |
mistral-large-latest | 🇷🇺 | zero-shot | 22.33 | 43.07 | 0.04 | answers |
Open & Less 100B | ||||||
llama-3-70b-instruct | 🇷🇺 | zero-shot | 45.89 | 58.73 | 0.0 | answers |
mixtral-8x22b | 🇷🇺 | zero-shot | 42.64 | 54.91 | 0.0 | answers |
mixtral-8x7b | 🇷🇺 | zero-shot | 41.11 | 53.75 | 0.18 | answers |
Phi-3-small-8k-instruct | 🇷🇺 | zero-shot | 40.65 | 49.64 | 0.14 | answers |
llama-2-70b-chat | 🇷🇺 | zero-shot | 29.51 | 27.27 | 1.65 | answers |
Open & Less 10B | ||||||
Gemma-2-9b-it | 🇷🇺 | zero-shot | 46.5 | 55.9 | 0.04 | answers |
Qwen2-7B-Instruct | 🇷🇺 | zero-shot | 42.16 | 51.13 | 0.25 | answers |
mistral-7b | 🇷🇺 | zero-shot | 42.14 | 47.57 | 0.18 | answers |
mistral-7B-Instruct-v0.3 | 🇷🇺 | zero-shot | 41.73 | 44.24 | 0.18 | answers |
llama-3-8b-instruct | 🇷🇺 | zero-shot | 40.55 | 47.81 | 0.35 | answers |
Qwen1.5-7B-Chat | 🇷🇺 | zero-shot | 34.1 | 45.05 | 0.25 | answers |
Phi-3-mini-4k-instruct | 🇷🇺 | zero-shot | 33.79 | 24.33 | 0.04 | answers |
Qwen2-1.5B-Instruct | 🇷🇺 | zero-shot | 20.5 | 33.57 | 0.35 | answers |
Qwen1.5-1.8B-Chat | 🇷🇺 | zero-shot | 11.74 | 8.05 | 0.42 | answers |
Open & Less 1B | ||||||
Qwen2-0.5B-Instruct | 🇷🇺 | zero-shot | 11.76 | 18.12 | 0.25 | answers |
This leaderboard and the obtained LLM answers are part of the experiments in the paper: Large Language Models in Targeted Sentiment Analysis in Russian.
Dataset: final_data.csv
Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
---|---|---|---|---|---|---|
GPT-4-1106-preview | 🇺🇸 | CoT THoR | 50.13 | 55.93 | - | answers |
GPT-3.5-0613 | 🇺🇸 | CoT THoR | 44.50 | 48.17 | - | answers |
GPT-3.5-1106 | 🇺🇸 | CoT THoR | 42.58 | 42.18 | - | answers |
GPT-4-1106-preview | 🇺🇸 | zero-shot (short) | 54.59 | 64.32 | - | answers |
GPT-3.5-0613 | 🇺🇸 | zero-shot (short) | 51.79 | 61.38 | - | answers |
GPT-3.5-1106 | 🇺🇸 | zero-shot (short) | 47.04 | 53.19 | - | answers |
Mistral-7B-instruct-v0.1 | 🇺🇸 | zero-shot | 49.46 | 58.51 | - | answers |
Mistral-7B-instruct-v0.2 | 🇺🇸 | zero-shot | 44.82 | 56.04 | - | answers |
DeciLM | 🇺🇸 | zero-shot | 43.85 | 53.65 | 1.44 | answers |
Microsoft-Phi-2 | 🇺🇸 | zero-shot | 40.95 | 42.77 | 3.13 | answers |
Gemma-7B-IT | 🇺🇸 | zero-shot | 40.96 | 44.63 | - | answers |
Gemma-2B-IT | 🇺🇸 | zero-shot | 31.75 | 45.96 | 2.62 | answers |
Flan-T5-xxl | 🇺🇸 | zero-shot | 36.46 | 42.63 | 1.90 | answers |
Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
---|---|---|---|---|---|---|
GPT-4-1106-preview | 🇷🇺 | zero-shot (short) | 48.04 | 60.55 | 0.0 | answers |
GPT-3.5-0613 | 🇷🇺 | zero-shot (short) | 45.85 | 57.36 | 0.0 | answers |
GPT-3.5-1106 | 🇷🇺 | zero-shot (short) | 35.07 | 48.53 | 0.0 | answers |
Mistral-7B-Instruct-v0.2 | 🇷🇺 | zero-shot | 42.60 | 48.05 | 0.0 | answers |
If you find the results and findings in the Final Results section valuable, feel free to cite the related work as follows:
@misc{rusnachenko2024large,
title={Large Language Models in Targeted Sentiment Analysis},
author={Nicolay Rusnachenko and Anton Golubev and Natalia Loukachevitch},
year={2024},
eprint={2404.12342},
archivePrefix={arXiv},
primaryClass={cs.CL}
}