AlignUnformEval

This is a Python tool for evaluating the alignment and uniformity of sentence embeddings, as done in the SimCSE paper.

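The SimCSE paper explains alignment and uniformity as follows:

Given a distribution of positive pairs p_pos, alignment calculates the expected distance between the embeddings of the paired instances (assuming representations are already normalized):

$$\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^+) \right\|^2$$

On the other hand, uniformity measures how well the embeddings are uniformly distributed:

$$\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{data}}} e^{-2 \left\| f(x) - f(y) \right\|^2}$$

where p_data denotes the data distribution.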
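
As a rough illustration of the two definitions above, here is a minimal NumPy sketch. It is not this library's implementation; the function names `l2_normalize`, `alignment`, and `uniformity` are hypothetical, and the sketch assumes embeddings arrive as 2-D arrays with one row per sentence.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (assumes no all-zero rows)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared distance between paired rows of x and y (positive pairs)."""
    x, y = l2_normalize(x), l2_normalize(y)
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))


def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean Gaussian potential over all distinct pairs of rows."""
    x = l2_normalize(x)
    # For unit vectors, ||a - b||^2 = 2 - 2 * (a . b)
    sq_dist = 2.0 - 2.0 * (x @ x.T)
    upper = np.triu_indices(x.shape[0], k=1)  # count each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))
```

Lower values are better for both quantities: alignment is 0 when every positive pair maps to the same point, and uniformity decreases as the embeddings spread out over the unit sphere.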

Install

by pip

pip install alignuniformeval

by source

pip install git+https://github.com/akiFQC/AlignUnformEval

Usage

You can easily evaluate alignment and uniformity with this library.
This is a minimal example that evaluates alignment and uniformity on STS Benchmark.

from alignunformeval import STSBEval

evaluator = STSBEval(sentence_encoder)
# sentence_encoder is a callable from List[str] to numpy.array. The output numpy.array must be [dimension_of_sentence_vector].
result = evaluator.eval_summary()
# result = {"alignment": value_of_alignment, "uniformity": value_of_uniformity}

STSBEval takes a callable whose input is a list of str and whose output is an n-dimensional numpy.array.
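
To make the expected callable concrete, here is a toy encoder sketch. Everything in it is an assumption made for illustration: `toy_sentence_encoder` is a hypothetical name, the hashed bag-of-words scheme is only a stand-in for a real model, and the output layout (one row per input sentence) is a guess at what the evaluator expects, so check the library source before relying on it.

```python
import hashlib
from typing import List

import numpy as np


def toy_sentence_encoder(sentences: List[str], dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in encoder: hashed bag-of-words, one row per sentence."""
    vectors = np.zeros((len(sentences), dim), dtype=np.float32)
    for i, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            # md5 gives a stable hash, so the output is reproducible across runs
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vectors[i, int(digest, 16) % dim] += 1.0
    return vectors
```

A callable like this could then be passed in place of `sentence_encoder` in the example above. In practice you would wrap a trained model's encode function instead.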

Dataset

STS Benchmark

This dataset (specifically, sts-dev.csv) was used in the SimCSE paper. In the paper, the similarity score threshold was set at 4.0; sentence pairs whose similarity score is higher than 4.0 are used to evaluate alignment. You can set a different threshold, as in the following example.

from alignunformeval import STSBEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = STSBEval(sentence_encoder, threshold=3.0) # set threshold at 3.0
result = evaluator.eval_summary()

Please see test/test_stsb.py if you want more details.
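
The effect of the threshold can be sketched as a simple filter over scored sentence pairs. This is an illustration only, not the library's code: `select_positive_pairs` is a hypothetical helper, the example pairs are made up, and the strict `>` comparison is an assumption based on the phrase "higher than" above.

```python
from typing import List, Tuple

ScoredPair = Tuple[str, str, float]  # (sentence1, sentence2, similarity score)


def select_positive_pairs(
    pairs: List[ScoredPair], threshold: float = 4.0
) -> List[Tuple[str, str]]:
    """Keep the sentence pairs whose score is strictly above the threshold."""
    return [(s1, s2) for s1, s2, score in pairs if score > threshold]


# Made-up pairs for illustration; real scores come from the dataset file.
example_pairs: List[ScoredPair] = [
    ("A man is playing a guitar.", "A man plays the guitar.", 4.8),
    ("A woman is cooking.", "A person prepares food.", 3.4),
    ("A dog runs in a field.", "The stock market fell.", 0.2),
]
```

With these example pairs, the default threshold keeps only the first pair, while `threshold=3.0` also keeps the second. Lowering the threshold gives more positive pairs for the alignment estimate, at the cost of including looser paraphrases.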

Tokyo Metropolitan University Paraphrase Corpus (TMUP)

Tokyo Metropolitan University Paraphrase Corpus (TMUP) is a Japanese paraphrase dataset.

from alignunformeval import TMUPEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = TMUPEval(sentence_encoder)
result = evaluator.eval_summary()

License

The license of this tool follows the license of each dataset. Please read the documentation of the datasets you use.
