Awesome Semantic Textual Similarity (STS)

Awesome Semantic Textual Similarity: A Curated List of Semantic/Sentence Textual Similarity (STS) in Large Language Models and the NLP Field

This repository, Awesome Semantic Textual Similarity, collects resources and papers on Semantic/Sentence Textual Similarity (STS) in Large Language Models and NLP.

"If you can't measure it, you can't improve it." - British physicist William Thomson

You are welcome to share your papers, thoughts, and ideas by submitting an issue!

Model Evolution Overview

Presentations

Sentence Textual Similarity: Model Evolution Overview
Shuyue Jia, Dependable Computing Laboratory, Boston University
[Link]
Oct 2023

Benchmarks

Please check here and here to download all the benchmark datasets listed below.

STS

STS12:
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre
SemEval 2012, [Paper] [Download]
07 June 2012

STS13:
*SEM 2013 shared task: Semantic Textual Similarity
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo
*SEM 2013, [Paper] [Download]
13 June 2013

STS14:
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, Janyce Wiebe
SemEval 2014, [Paper] [Download]
23 Aug 2014

STS15:
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe
SemEval 2015, [Paper] [Download]
04 June 2015

STS16:
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce Wiebe
SemEval 2016, [Paper] [Download]
16 June 2016

STS Benchmark (STSb):
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, Lucia Specia
SemEval 2017, [Paper] [Download]
03 Aug 2017

SICK-Relatedness

A SICK Cure for the Evaluation of Compositional Distributional Semantic Models
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli
LREC 2014, [Paper] [Download]
26 May 2014

Papers

Baselines

GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher Manning
EMNLP 2014, [Paper] [GitHub]
25 Oct 2014

Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
NeurIPS 2015, [Paper] [GitHub]
22 Jun 2015

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes
EMNLP 2017, [Paper] [GitHub]
07 Sept 2017

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
NAACL-HLT 2019, [Paper] [GitHub]
24 May 2019

BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
ICLR 2020, [Paper] [GitHub]
24 Feb 2020

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur Parikh
ACL 2020, [Paper] [GitHub]
05 July 2020

Dense Passage Retrieval for Open-Domain Question Answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih
EMNLP 2020, [Paper] [GitHub]
16 Nov 2020

Universal Sentence Encoder
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
arXiv 2018, [Paper] [GitHub]
12 Apr 2018

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
EMNLP 2019, [Paper] [GitHub]
27 Aug 2019

Matrix-based Methods

Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement
Hua He, Jimmy Lin
NAACL 2016, [Paper]
12 June 2016

Text Matching as Image Recognition
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, Xueqi Cheng
AAAI 2016, [Paper] [GitHub]
20 Feb 2016

MultiGranCNN: An Architecture for General Matching of Text Chunks on Multiple Levels of Granularity
Wenpeng Yin, Hinrich Schütze
ACL-IJCNLP 2015, [Paper]
26 July 2015

Alignment-based Methods

Attention Mechanism

Simple and Effective Text Matching with Richer Alignment Features
Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji, Haiqing Chen
ACL 2019, [Paper] [GitHub]
01 Aug 2019

Semantic Sentence Matching with Densely-Connected Recurrent and Co-Attentive Information
Seonhoon Kim, Inho Kang, Nojun Kwak
AAAI 2019, [Paper] [GitHub (Unofficial)]
27 January 2019

Multiway Attention Networks for Modeling Sentence Pairs
Chuanqi Tan, Furu Wei, Wenhui Wang, Weifeng Lv, Ming Zhou
IJCAI 2018, [Paper] [GitHub]
13 July 2018

Natural Language Inference over Interaction Space
Yichen Gong, Heng Luo, Jian Zhang
ICLR 2018, [Paper] [GitHub]
13 Sep 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling
Gehui Shen, Yunlun Yang, Zhi-Hong Deng
EMNLP 2017, [Paper]
07 Sept 2017

Bidirectional Attention Flow for Machine Comprehension
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi
ICLR 2017, [Paper] [Webpage] [GitHub]
24 Apr 2017

A Structured Self-attentive Sentence Embedding
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio
ICLR 2017, [Paper] [GitHub]
09 Mar 2017

Sentence Similarity Learning by Lexical Decomposition and Composition
Zhiguo Wang, Haitao Mi, Abraham Ittycheriah
COLING 2016, [Paper] [GitHub]
11 Dec 2016

A Decomposable Attention Model for Natural Language Inference
Ankur Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
EMNLP 2016, [Paper] [GitHub]
01 Nov 2016

Reasoning about Entailment with Neural Attention
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Phil Blunsom
ICLR 2016, [Paper] [GitHub]
01 Mar 2016

Traditional Methods

DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition
Md Arafat Sultan, Steven Bethard, Tamara Sumner
SemEval 2015, [Paper]
04 June 2015

Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence
Md Arafat Sultan, Steven Bethard, Tamara Sumner
TACL 2014, [Paper]
01 May 2014

Word Distance-based Methods

Improving Word Mover’s Distance by Leveraging Self-attention Matrix
Hiroaki Yamagiwa, Sho Yokoi, Hidetoshi Shimodaira
EMNLP 2023 Findings, [Paper] [GitHub]
02 Nov 2023

Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning
Seonghyeon Lee, Dongha Lee, Seongbo Jang, Hwanjo Yu
ACL 2022, [Paper] [GitHub]
22 May 2022

Word Rotator’s Distance
Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, Kentaro Inui
EMNLP 2020, [Paper] [GitHub]
16 Nov 2020

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger
EMNLP 2019, [Paper] [GitHub]
03 Nov 2019

From Word Embeddings To Document Distances
Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger
ICML 2015, [Paper] [GitHub]
06 July 2015

Sentence Embedding-based Methods

Paragraph Vector-based Methods

Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline
Kawin Ethayarajh
RepL4NLP 2018, [Paper] [GitHub]
20 July 2018

An Efficient Framework for Learning Sentence Representations
Lajanugen Logeswaran, Honglak Lee
ICLR 2018, [Paper] [GitHub]
30 Apr 2018

Universal Sentence Encoder
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
arXiv 2018, [Paper] [GitHub]
12 Apr 2018

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes
EMNLP 2017, [Paper] [GitHub]
07 Sept 2017

A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Sanjeev Arora, Yingyu Liang, Tengyu Ma
ICLR 2017, [Paper] [GitHub]
06 Feb 2017

Learning Distributed Representations of Sentences from Unlabelled Data
Felix Hill, Kyunghyun Cho, Anna Korhonen
NAACL 2016, [Paper] [GitHub (Unofficial)]
12 Jun 2016

Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
NeurIPS 2015, [Paper] [GitHub]
22 Jun 2015

Distributed Representations of Sentences and Documents
Quoc V. Le, Tomas Mikolov
ICML 2014, [Paper]
21 June 2014

Pretraining-finetuning Paradigm

Whitening Sentence Representations for Better Semantics and Faster Retrieval
Jianlin Su, Jiarun Cao, Weijie Liu, Yangyiwen Ou
arXiv 2021, [Paper] [GitHub (TensorFlow)] [GitHub (PyTorch)]
29 Mar 2021

On the Sentence Embeddings from Pre-trained Language Models
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li
EMNLP 2020, [Paper] [GitHub]
02 Nov 2020

SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models
Bin Wang, C.-C. Jay Kuo
IEEE/ACM T-ASLP, [Paper] [GitHub]
29 July 2020

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
EMNLP 2019, [Paper] [GitHub]
27 Aug 2019

BERT-based Scores

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur Parikh
ACL 2020, [Paper] [GitHub]
05 July 2020

BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
ICLR 2020, [Paper] [GitHub]
24 Feb 2020

Contrastive Learning Framework

Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning
Seonghyeon Lee, Dongha Lee, Seongbo Jang, Hwanjo Yu
ACL 2022, [Paper] [GitHub]
22 May 2022

SimCSE: Simple Contrastive Learning of Sentence Embeddings
Tianyu Gao, Xingcheng Yao, Danqi Chen
EMNLP 2021, [Paper] [GitHub]
03 Jun 2021

Self-Guided Contrastive Learning for BERT Sentence Representations
Taeuk Kim, Kang Min Yoo, Sang-goo Lee
ACL 2021, [Paper] [GitHub]
03 Jun 2021

ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer
Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, Weiran Xu
ACL 2021, [Paper] [GitHub]
25 May 2021

Semantic Re-tuning with Contrastive Tension
Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, Magnus Sahlgren
ICLR 2021, [Paper] [GitHub]
03 May 2021

CLEAR: Contrastive Learning for Sentence Representation
Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, Hao Ma
arXiv 2020, [Paper]
31 Dec 2020

Distance Measurement

Evolution of Semantic Similarity - A Survey
Dhivya Chandrasekaran, Vijay Mago
ACM Computing Surveys 2021, [Paper]
18 February 2021

Distributional Measures of Semantic Distance: A Survey
Saif M. Mohammad, Graeme Hirst
arXiv 2012, [Paper]
08 Mar 2012

Evaluation Metrics

Pearson Correlation

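Pearson Linear Correlation Coefficient: measures the prediction accuracy, i.e., the strength of the linear relationship between the model’s predictions and the gold labels.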

$$r=\frac{ \sum\nolimits_{i=1}^n \left( s_i-\bar{s} \right) \left( q_i-\bar{q} \right) }{\sqrt{ \sum\nolimits_{i=1}^n \left( s_i-\bar{s} \right)^2 } \sqrt{ \sum\nolimits_{i=1}^n \left( q_i-\bar{q} \right)^2 }},$$

where $s_i$ and $q_i$ are the gold label and the model’s prediction for the $i$-th sentence pair, respectively, $\bar{s}$ and $\bar{q}$ are the mean values of $\textbf{s}$ and $\textbf{q}$, and $n$ is the number of sentence pairs.
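As an illustration of the Pearson formula, here is a minimal pure-Python sketch. The function name `pearson_correlation` and the sample scores are hypothetical and not part of any benchmark toolkit; in practice a library routine such as `scipy.stats.pearsonr` would usually be preferred.

```python
import math


def pearson_correlation(gold, pred):
    """Pearson r between gold labels s and predictions q (hypothetical helper)."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(gold)
    s_bar = sum(gold) / n  # mean of the gold labels
    q_bar = sum(pred) / n  # mean of the predictions
    # Numerator: sum of products of the deviations from the means.
    cov = sum((s - s_bar) * (q - q_bar) for s, q in zip(gold, pred))
    # Denominator terms: sums of squared deviations.
    var_s = sum((s - s_bar) ** 2 for s in gold)
    var_q = sum((q - q_bar) ** 2 for q in pred)
    if var_s == 0 or var_q == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (math.sqrt(var_s) * math.sqrt(var_q))


# Made-up gold scores and predictions for four sentence pairs (illustration only).
gold_scores = [0.0, 1.0, 2.0, 3.0]
predictions = [0.1, 0.3, 0.5, 0.7]  # an exact linear function of the gold scores
r = pearson_correlation(gold_scores, predictions)
print(round(r, 6))
```

Because the predictions are an exact linear function of the gold scores, $r$ equals 1 up to floating-point rounding, even though the two score scales differ.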

Spearman Rank Correlation

Spearman’s Rank-order Correlation Coefficient: measures the prediction monotonicity, i.e., how well the ranking of the model’s predictions agrees with the ranking of the gold labels.

$$\rho=1-\frac{6 \sum\nolimits_{i=1}^{n} d_i^2 }{n\left(n^2-1\right)},$$

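where $d_i$ is the difference between the rank of the $i$-th sentence pair in the model’s predictions and its rank in the gold labels, and $n$ is the number of sentence pairs. This closed form holds when there are no tied ranks.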
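The Spearman formula can be sketched in pure Python as follows. The helper names `ranks` and `spearman_correlation` and the sample scores are hypothetical, and the sketch assumes all values are distinct, since the closed form above does not handle ties; a library routine such as `scipy.stats.spearmanr` covers the general case.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes all values are distinct."""
    if len(set(values)) != len(values):
        raise ValueError("tied values: the closed-form formula does not apply")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_correlation(gold, pred):
    """Spearman rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) (hypothetical helper)."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(gold)
    # d_i is the rank difference of the i-th pair between gold and predictions.
    d_squared = sum((rg - rp) ** 2 for rg, rp in zip(ranks(gold), ranks(pred)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up gold scores and predictions for five sentence pairs (illustration only).
gold_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = [0.2, 0.1, 0.9, 0.5, 1.0]
rho = spearman_correlation(gold_scores, predictions)
print(round(rho, 6))
```

Here the prediction ranks are 2, 1, 4, 3, 5, so $\sum d_i^2 = 4$ and $\rho = 1 - 24/120 = 0.8$. Any strictly increasing transformation of the predictions leaves $\rho$ unchanged, which is why this metric captures monotonicity rather than linearity.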

Citation

If you find our list useful, please consider citing our repo and toolkit in your publications. We provide the BibTeX entries below.

@misc{JiaAwesomeSTS23,
      author = {Jia, Shuyue},
      title = {Awesome Semantic Textual Similarity},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/Awesome-Semantic-Textual-Similarity}},
}

@misc{JiaAwesomeLLM23,
      author = {Jia, Shuyue},
      title = {Awesome {LLM} Self-Consistency},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/Awesome-LLM-Self-Consistency}},
}

@misc{JiaPromptCraft23,
      author = {Jia, Shuyue},
      title = {{PromptCraft}: A Prompt Perturbation Toolkit},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/promptcraft}},
}
