
This repository highlights the reasoning capabilities of LLMs ✨ Mistral / LLaMA-3 / Phi-3 / Gemma / Flan-T5 / GPT-4o ✨ in Targeted Sentiment Analysis of Russian mass-media texts and their English translations 📊

RuSentNE-LLM-Benchmark

Update 01 November 2024: ⭐ Implemented bulk-chain, a separate project for handling massive amounts of prompts with CoT; this concept was used in these studies.

Update 06 September 2024: Related information about the project is mentioned in the BU-research-blog.

Update 11 August 2024: 🎤 Announcing the talk on this framework @ NLPSummit 2024, with a preliminary announcement and details in an X/Twitter post 🐦.

Update 23 June 2024: All metrics in Development mode have been evaluated in closest mode, which decides the resulting class based on the first label entry found in the response.

Update 11 June 2024: Added an evaluation mode that counts the first label entry; see the eval-mode parameter key. A sketch of this decision rule is shown below.
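
As a rough illustration of that rule (an assumption based on the description above, not the repository's actual code; the label strings and function name are hypothetical), the first-entry decision can be sketched as:

```python
# Sketch of the "first label entry" rule: pick the class whose label
# appears earliest in the LLM response. Label strings are hypothetical.
LABELS = ["positive", "negative", "neutral"]

def first_entry_label(response: str) -> str | None:
    text = response.lower()
    # Collect (position, label) pairs for every label present in the text.
    hits = [(text.find(label), label) for label in LABELS if label in text]
    # The earliest occurrence wins; None would fall under the "N/A %" column.
    return min(hits)[1] if hits else None

print(first_entry_label("The sentiment is negative, not positive."))  # negative
```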

This repository assesses the reasoning capabilities of LLMs in Targeted Sentiment Analysis on the RuSentNE dataset, proposed as part of the self-titled competition.

In particular, we apply pre-trained LLMs to the following dataset splits:

  1. 🔓 Development
  2. 🔒 Final
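
Each entry pairs a sentence with a target named entity, and the model has to assign one of three sentiment classes toward that entity, as the F1(P,N,0) metrics below reflect. A purely illustrative record (field names and values are hypothetical, not taken from the dataset files):

```python
# Hypothetical record shape for the targeted sentiment task; field names
# and values are illustrative, not taken from the actual CSV files.
sample = {
    "sentence": "Critics praised NovaCorp for its rapid recovery.",
    "entity": "NovaCorp",   # target named entity within the sentence
    "label": "positive",    # one of: positive / negative / neutral
}
```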

We use [quick-cot] to experiment with the following reasoning setups (a sketch of the CoT chain follows the list):

  • Instruction Prompts
  • Chain-of-Thought (THoR)

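As a rough sketch of how a THoR-style three-hop chain proceeds (the prompt wording and the `ask` callable are illustrative assumptions, not quick-cot's actual API), each hop feeds the previous answer into the next prompt:

```python
# Sketch of a THoR-style three-hop reasoning chain; prompts and the `ask`
# callable are illustrative assumptions, not quick-cot's actual API.
from typing import Callable

def thor_sentiment(ask: Callable[[str], str], sentence: str, entity: str) -> str:
    ctx = f'Given the sentence "{sentence}", '
    # Hop 1: identify which aspect of the target is discussed.
    aspect = ask(ctx + f"which specific aspect of {entity} is being discussed?")
    # Hop 2: infer the underlying opinion toward that aspect.
    opinion = ask(ctx + f"and knowing {aspect!r}, what is the underlying "
                        f"opinion toward {entity}?")
    # Hop 3: map the opinion onto one of the sentiment classes.
    return ask(ctx + f"and knowing {aspect!r} and {opinion!r}, is the sentiment "
                     f"toward {entity} positive, negative, or neutral?")
```
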
🔍 Accessing the results

All results are stored in the contents table of the corresponding sqlite files.

Option 1. Use sqlitebrowser to access the results and export them into CSV.

Option 2. Use the sqlite2csv.py script implemented in this repository.
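
For reference, a minimal stand-alone equivalent of such an export, using only the Python standard library, might look like the sketch below (file paths are placeholders; the script in this repository is the authoritative version):

```python
# Minimal sketch: dump the `contents` table of a results file into CSV.
# File paths are placeholders; see sqlite2csv.py for the actual script.
import csv
import sqlite3

def sqlite_to_csv(db_path: str, csv_path: str, table: str = "contents") -> None:
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(f"SELECT * FROM {table}")
        header = [col[0] for col in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)   # column names first
            writer.writerows(cursor)  # then all table rows

sqlite_to_csv("answers.sqlite", "answers.csv")
```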

🔓 Development Results


This is an open-access dataset split (sentiment labels available) utilized at the development stage; anyone can use it for evaluation checks.

Dataset: valiation_data_labeled.csv

* denotes evaluation in first-entry mode (seeking the first label entry).

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|-------|------|------|---------|-----------|-------|---------|
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 43.46 | 46.16 | 0.21 | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 40.83 | 39.91 | 0.49 | answers |
| mistral-7b | 🇺🇸 | CoT THoR | 42.34 | 51.43 | 0.04 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|-------|------|------|---------|-----------|-------|---------|
| **Proprietary** | | | | | | |
| GPT-4-turbo-2024-04-09 | 🇺🇸 | zero-shot | 50.79 | 61.19 | 0.0 | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot | 47.1 | 57.76 | 0.0 | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot | 45.79 | 52.55 | 0.0 | answers |
| mistral-large-latest | 🇺🇸 | zero-shot | 44.48 | 57.24 | 0.0 | answers |
| gpt-4o | 🇺🇸 | zero-shot | 42.84 | 56.19 | 0.0 | answers |
| **Open & less than 100B** | | | | | | |
| llama-3-70b-instruct | 🇺🇸 | zero-shot | 49.79 | 61.24 | 0.0 | answers |
| mixtral-8x22b | 🇺🇸 | zero-shot | 46.09 | 58.24 | 0.0 | answers |
| Phi-3-small-8k-instruct | 🇺🇸 | zero-shot | 46.87 | 57.02 | 0.07 | answers |
| mixtral-8x7b | 🇺🇸 | zero-shot | 47.33 | 56.36 | 0.07 | answers |
| llama-2-70b-chat | 🇺🇸 | zero-shot | 42.42 | 54.25 | 13.44 | answers |
| **Open & less than 10B** | | | | | | |
| Gemma-2-9b-it | 🇺🇸 | zero-shot | 45.57 | 55.06 | 0.0 | answers |
| llama-3-8b-instruct | 🇺🇸 | zero-shot | 45.25 | 54.43 | 0.0 | answers |
| Mistral-7B-Instruct-v0.3 | 🇺🇸 | zero-shot | 45.23 | 55.5 | 0.0 | answers |
| Phi-3-mini-4k-instruct | 🇺🇸 | zero-shot | 44.62 | 54.71 | 0.0 | answers |
| Qwen1.5-7B-Chat | 🇺🇸 | zero-shot | 44.39 | 55.55 | 0.04 | answers |
| google_flan-t5-xl | 🇺🇸 | zero-shot | 43.73 | 53.72 | 0.0 | answers |
| mistral-7b | 🇺🇸 | zero-shot | 43.11 | 53.64 | 0.11 | answers |
| Qwen2-7B-Instruct | 🇺🇸 | zero-shot | 39.74 | 48.11 | 3.87 | answers |
| Qwen2-1.5B-Instruct | 🇺🇸 | zero-shot | 33.88 | 48.59 | 0.0 | answers |
| Qwen1.5-1.8B-Chat | 🇺🇸 | zero-shot | 33.65 | 47.28 | 0.04 | answers |
| **Open & less than 1B** | | | | | | |
| Flan-T5-large | 🇺🇸 | zero-shot | 36.72 | 24.51 | 0.0 | answers |
| Qwen2-0.5B-Instruct | 🇺🇸 | zero-shot | 9.52 | 33.0 | 0.0 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|-------|------|------|---------|-----------|-------|---------|
| **Proprietary** | | | | | | |
| GPT-3.5-0613 | 🇷🇺 | zero-shot | 44.15 | 53.63 | 1.51 | answers |
| gpt-4o | 🇷🇺 | zero-shot | 44.15 | 57.5 | 0.0 | answers |
| GPT-4-turbo-2024-04-09 | 🇷🇺 | zero-shot | 42.21 | 56.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot | 41.34 | 46.83 | 0.46 | answers |
| mistral-large-latest | 🇷🇺 | zero-shot | 22.33 | 43.07 | 0.04 | answers |
| **Open & less than 100B** | | | | | | |
| llama-3-70b-instruct | 🇷🇺 | zero-shot | 45.89 | 58.73 | 0.0 | answers |
| mixtral-8x22b | 🇷🇺 | zero-shot | 42.64 | 54.91 | 0.0 | answers |
| mixtral-8x7b | 🇷🇺 | zero-shot | 41.11 | 53.75 | 0.18 | answers |
| Phi-3-small-8k-instruct | 🇷🇺 | zero-shot | 40.65 | 49.64 | 0.14 | answers |
| llama-2-70b-chat | 🇷🇺 | zero-shot | 29.51 | 27.27 | 1.65 | answers |
| **Open & less than 10B** | | | | | | |
| Gemma-2-9b-it | 🇷🇺 | zero-shot | 46.5 | 55.9 | 0.04 | answers |
| Qwen2-7B-Instruct | 🇷🇺 | zero-shot | 42.16 | 51.13 | 0.25 | answers |
| mistral-7b | 🇷🇺 | zero-shot | 42.14 | 47.57 | 0.18 | answers |
| mistral-7B-Instruct-v0.3 | 🇷🇺 | zero-shot | 41.73 | 44.24 | 0.18 | answers |
| llama-3-8b-instruct | 🇷🇺 | zero-shot | 40.55 | 47.81 | 0.35 | answers |
| Qwen1.5-7B-Chat | 🇷🇺 | zero-shot | 34.1 | 45.05 | 0.25 | answers |
| Phi-3-mini-4k-instruct | 🇷🇺 | zero-shot | 33.79 | 24.33 | 0.04 | answers |
| Qwen2-1.5B-Instruct | 🇷🇺 | zero-shot | 20.5 | 33.57 | 0.35 | answers |
| Qwen1.5-1.8B-Chat | 🇷🇺 | zero-shot | 11.74 | 8.05 | 0.42 | answers |
| **Open & less than 1B** | | | | | | |
| Qwen2-0.5B-Instruct | 🇷🇺 | zero-shot | 11.76 | 18.12 | 0.25 | answers |

🔒 Final Results


This leaderboard and the obtained LLM answers are part of the experiments in the paper Large Language Models in Targeted Sentiment Analysis in Russian.

Dataset: final_data.csv

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|-------|------|------|---------|-----------|-------|---------|
| GPT-4-1106-preview | 🇺🇸 | CoT THoR | 50.13 | 55.93 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 44.50 | 48.17 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 42.58 | 42.18 | - | answers |
| GPT-4-1106-preview | 🇺🇸 | zero-shot (short) | 54.59 | 64.32 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot (short) | 51.79 | 61.38 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot (short) | 47.04 | 53.19 | - | answers |
| Mistral-7B-instruct-v0.1 | 🇺🇸 | zero-shot | 49.46 | 58.51 | - | answers |
| Mistral-7B-instruct-v0.2 | 🇺🇸 | zero-shot | 44.82 | 56.04 | - | answers |
| DeciLM | 🇺🇸 | zero-shot | 43.85 | 53.65 | 1.44 | answers |
| Microsoft-Phi-2 | 🇺🇸 | zero-shot | 40.95 | 42.77 | 3.13 | answers |
| Gemma-7B-IT | 🇺🇸 | zero-shot | 40.96 | 44.63 | - | answers |
| Gemma-2B-IT | 🇺🇸 | zero-shot | 31.75 | 45.96 | 2.62 | answers |
| Flan-T5-xxl | 🇺🇸 | zero-shot | 36.46 | 42.63 | 1.90 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|-------|------|------|---------|-----------|-------|---------|
| GPT-4-1106-preview | 🇷🇺 | zero-shot (short) | 48.04 | 60.55 | 0.0 | answers |
| GPT-3.5-0613 | 🇷🇺 | zero-shot (short) | 45.85 | 57.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot (short) | 35.07 | 48.53 | 0.0 | answers |
| Mistral-7B-Instruct-v0.2 | 🇷🇺 | zero-shot | 42.60 | 48.05 | 0.0 | answers |

References

If you find the results and findings in the Final Results section valuable 💎, feel free to cite the related work as follows:

@misc{rusnachenko2024large,
      title={Large Language Models in Targeted Sentiment Analysis}, 
      author={Nicolay Rusnachenko and Anton Golubev and Natalia Loukachevitch},
      year={2024},
      eprint={2404.12342},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
