KennethEnevoldsen committed Jul 31, 2023
2 parents 8615d3d + 15b8045 commit 21ffc06
Showing 2 changed files with 17 additions and 4 deletions.
8 changes: 4 additions & 4 deletions docs/index.md
@@ -6,13 +6,13 @@ hide:
 
 # Scandinavian Embedding Benchmark
 
-This is the documentation for the Scandinavian Embedding Benchmark. This benchmark is intended to evaluate the sentence/documents embeddings of large language models.
+This is the documentation for the Scandinavian Embedding Benchmark. This benchmark is intended to evaluate the sentence/document embeddings of large language models.
 
 Intended uses for this benchmark:
 
 - Evaluating document embeddings of Scandinavian language models
 - Evaluating document embeddings for multilingual models on Scandinavian languages
-- Allow ranking of competing Scandinavian and multilingual models using no more compute that what a consumer laptop can provide
+- Allow ranking of competing Scandinavian and multilingual models using no more compute than what a consumer laptop can provide
 
 
 === "All"
@@ -34,9 +34,9 @@ Intended uses for this benchmark:
 
 ## Comparison to other benchmarks
 
-If you use this benchmark for a relative ranking of language models you should also take a look at [ScandEval](https://scandeval.github.io), which as opposed the this benchmark fully fine-tunes the models. It also includes structured predictions tasks such as named entity recognition. Many of the tasks in this embeddings benchmark is also included in ScandEval. A notable difference between the ScandEval and this benchmark is that it does not include machine translated tasks.
+If you use this benchmark for a relative ranking of language models you should also look at [ScandEval](https://scandeval.github.io), which, as opposed to this benchmark, fully fine-tunes the models. It also includes structured prediction tasks such as named entity recognition. Many of the tasks in this embedding benchmark are also included in ScandEval. A notable difference between ScandEval and this benchmark is that this one does not include machine-translated tasks.
 
-The tasks within this benchmark is also included in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard, though the aggregations methods very slightly. The MTEB is primarily an English embedding benchmark, with a few multilingual tasks along with a few additional languages. As a part of this project the tasks was also added to the MTEB leaderboard.
+The tasks within this benchmark are also included in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard, though the aggregation methods vary slightly. MTEB is primarily an English embedding benchmark, with a few multilingual tasks and additional languages. As a part of this project, the tasks were also added to the MTEB leaderboard.



13 changes: 13 additions & 0 deletions src/seb/seb_models.py
@@ -41,6 +41,19 @@ def create_all_mini_lm_l6_v2() -> SebModel:
         meta=meta,
     )
 
+@models.register("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
+def create_multilingual_mini_lm_l12_v2() -> SebModel:
+    hf_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+    meta = ModelMeta(
+        name=hf_name.split("/")[-1],
+        huggingface_name=hf_name,
+        reference=f"https://huggingface.co/{hf_name}",
+        languages=[],
+    )
+    return SebModel(
+        loader=partial(get_sentence_transformer, model_name=hf_name),  # type: ignore
+        meta=meta,
+    )
 
 @models.register("KBLab/sentence-bert-swedish-cased")
 def create_sentence_swedish_cased() -> SebModel:
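For context, the `@models.register(...)` decorator used in the added function follows a common string-keyed factory-registry pattern: each decorated function is stored under a name and only called (loading the model) when that name is requested. The sketch below is a minimal, self-contained illustration of that pattern, not seb's actual implementation — the `Registry` class is a stand-in for seb's `models` object, and `ModelMeta`/`SebModel` are simplified versions with the loader replaced by a placeholder:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelMeta:
    """Simplified stand-in for seb's ModelMeta."""
    name: str
    huggingface_name: str
    reference: str
    languages: List[str] = field(default_factory=list)


@dataclass
class SebModel:
    """Simplified stand-in: pairs a lazy loader with its metadata."""
    loader: Callable
    meta: ModelMeta


class Registry:
    """Illustrative string-keyed factory registry."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], SebModel]] = {}

    def register(self, key: str) -> Callable:
        # Returns a decorator that stores the factory under `key`.
        def decorator(fn: Callable[[], SebModel]) -> Callable[[], SebModel]:
            self._factories[key] = fn
            return fn

        return decorator

    def get(self, key: str) -> SebModel:
        # The factory runs only when the model is requested.
        return self._factories[key]()


models = Registry()


@models.register("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
def create_multilingual_mini_lm_l12_v2() -> SebModel:
    hf_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    meta = ModelMeta(
        name=hf_name.split("/")[-1],
        huggingface_name=hf_name,
        reference=f"https://huggingface.co/{hf_name}",
    )
    # The real code wraps a sentence-transformers loader via functools.partial;
    # a placeholder keeps this sketch dependency-free.
    return SebModel(loader=lambda: None, meta=meta)


model = models.get("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print(model.meta.name)  # paraphrase-multilingual-MiniLM-L12-v2
```

One benefit of this design, visible in the diff, is that registering a model costs nothing at import time: `get_sentence_transformer` is wrapped in `functools.partial`, so weights are downloaded only when the benchmark actually runs that model.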
