
docs: Update affiliation #1247

Merged
merged 1 commit on Sep 27, 2024

Conversation

digantamisra98
Contributor

@digantamisra98 digantamisra98 commented Sep 27, 2024

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@isaac-chung isaac-chung changed the title Update affiliation docs: Update affiliation Sep 27, 2024
@isaac-chung isaac-chung enabled auto-merge (squash) September 27, 2024 16:21
@isaac-chung isaac-chung merged commit 45de3ec into embeddings-benchmark:main Sep 27, 2024
9 checks passed
KennethEnevoldsen added a commit that referenced this pull request Oct 27, 2024
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203)

* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201)

Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements

- Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements
- Reference: OpenAI's Embedding API documentation on input limits

Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>

* fix ruff formatting

* Added minor test fixes to ensure reproducibility across systems

* Ensure that tmp.json is not created within repo when running tests

* format

* fixes path issues

* Rerun CI

---------

Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
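The 2048-element cap above can be handled by chunking the input before each API call. A minimal sketch, assuming `embed_batch` is a stand-in for the actual OpenAI embeddings call (not mteb's exact implementation):

```python
def chunk(items, max_size=2048):
    """Split a list into consecutive chunks no longer than max_size."""
    return [items[i : i + max_size] for i in range(0, len(items), max_size)]

def embed_all(sentences, embed_batch):
    """Embed arbitrarily many sentences by calling embed_batch on
    chunks of at most 2048 elements, then concatenating the results.

    embed_batch is a hypothetical stand-in for the API call; it must
    return one vector per input sentence.
    """
    embeddings = []
    for batch in chunk(sentences):
        embeddings.extend(embed_batch(batch))
    return embeddings
```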

* fix: Ensure STS pearson and spearman do not use the p-value, only the correlation (#1207)

Fixes #1206
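The bug fixed here comes from correlation functions that return a `(statistic, pvalue)` pair, where it is easy to grab the wrong element. A sketch of the corrected pattern, with a small pure-Python Pearson implementation standing in for `scipy.stats.pearsonr` (the placeholder p-value and function names are assumptions):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient (no p-value)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def sts_score(gold, pred):
    # pearsonr/spearmanr-style functions return (statistic, pvalue);
    # the fix is to report only the statistic as the task score.
    result = (pearson(gold, pred), 0.0)  # (statistic, placeholder p-value)
    return result[0]
```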

* 1.14.16

Automatically generated by python-semantic-release

* fix: Normalize licenses including casing, uses of "-" etc.

* fix: Normalize licenses including casing, uses of "-" etc. (#1210)

* fix: Normalize licenses including casing, uses of "-" etc.

* fix tests
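License normalization of the kind described above can be sketched as a small string-canonicalization step. The alias table and canonical forms below are assumptions for illustration, not mteb's actual mapping:

```python
def normalize_license(raw):
    """Normalize free-form license strings to a lowercase,
    dash-separated form (casing, '-' vs ' ' vs '_', etc.)."""
    s = raw.strip().lower().replace("_", "-").replace(" ", "-")
    # Collapse repeated dashes left over from mixed separators.
    while "--" in s:
        s = s.replace("--", "-")
    # Hypothetical alias table for common variants.
    aliases = {"mit-license": "mit"}
    return aliases.get(s, s)
```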

* 1.14.17

Automatically generated by python-semantic-release

* fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208)

* Normalize benchmarks to only include tasks

- Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented
- implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks
- Added tests + updated docs

A few outstanding issues:

I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible, as MTEB(eng) requires the split to be specified. A solution is to allow `eval_splits` to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following:

`mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)`

I would also love the aggregation to be part of the benchmark (so that it is clear how it should be aggregated). This is especially relevant for MTEB(eng), as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution is to allow nested benchmarks.

* fix error in tests

* format

* Added corrections based on review

* added example and formatted
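The getter added here follows a common registry pattern: benchmarks are registered by name and fetched via a lookup function. A minimal sketch of that pattern (the `Benchmark` fields and registry helpers are simplified stand-ins for mteb's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    """Minimal stand-in for mteb's Benchmark: a name plus concrete task objects."""
    name: str
    tasks: list = field(default_factory=list)

_BENCHMARK_REGISTRY = {}

def register_benchmark(benchmark):
    _BENCHMARK_REGISTRY[benchmark.name] = benchmark
    return benchmark

def get_benchmark(name):
    """Fetch a benchmark by name, mirroring the mteb.get_benchmark getter."""
    if name not in _BENCHMARK_REGISTRY:
        raise KeyError(f"Unknown benchmark: {name!r}")
    return _BENCHMARK_REGISTRY[name]
```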

* 1.14.18

Automatically generated by python-semantic-release

* docs: Fix broken links in docs (#1212)

* Added fixes for broken links in adding_a_dataset and adding_a_model docs.

* Updated link name

* Mismatch of the category of AmazonPolarityClassification (#1220)

Fixes #1219

* Update tasks table

* fix: Ensure that results are returned even when hitting cache (#1215)

Fixes #1122

* 1.14.19

Automatically generated by python-semantic-release

* fix: Allow benchmark to specify eval_splits (#1217)

* fix: Allow benchmark to specify eval_splits

This PR allows benchmarks to specify their eval splits, so a benchmark can be fully specified within the benchmark object.

To do this it adds the following:
- added `eval_splits` to the Abstask object, which defaults to `metadata.eval_splits`
- use `task.eval_splits` unless overwritten in `mteb.MTEB.run`
- added an `eval_splits` arg to `mteb.get_tasks`, which filters the tasks based on splits
- updated documentation
  - renamed "Advanced Usage" to "Usage Documentation" to make it more accessible
- added tests where relevant

* Added correction based on feedback
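The split-based filtering described above can be sketched as follows. The task dicts and exact filter semantics are assumptions for illustration; mteb's real tasks are objects with metadata:

```python
def filter_tasks_by_splits(tasks, eval_splits=None):
    """Keep only tasks that cover every requested split, and record the
    requested splits on each surviving task (a sketch of the
    get_tasks(eval_splits=...) behaviour)."""
    if eval_splits is None:
        return list(tasks)
    selected = []
    for task in tasks:
        if set(eval_splits) <= set(task["eval_splits"]):
            # Narrow the task to the requested splits.
            selected.append(dict(task, eval_splits=list(eval_splits)))
    return selected
```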

* 1.14.20

Automatically generated by python-semantic-release

* Update points table

* Update points table

* docs: clarify adding a model (#1222)

* fix: Add RepLLaMA style models (#1223)

* init commit

* working and reproducing

* lint

* update hashes

* warning

* add pyproject

* Update points table

* 1.14.21

Automatically generated by python-semantic-release

* docs: Update points (#1228)

* Fix case

* Fix casing

* Fix case

* Fix case

* Create 971.jsonl

* Update contrib

* Add contributors

* Update points table

* docs: Add MTEB(code) dataset (#1237)

* docs: Add MTEB(code) dataset

* Fix linting

* Update points table

* Update of my affiliation (#1242)

Update points.md

* Add contributor (#1243)

* fix: @mrshu's name in `points.md` (#1246)

* Use the diacritic character to be in line with Slovak spelling.

Signed-off-by: mr.Shu <mr@shu.io>

* docs: Create benchmarks overview table (#1245)

* fix get_benchmarks method

* add create benchmark script

* make lint

* 1.14.22

Automatically generated by python-semantic-release

* docs: Update affiliation (#1247)

Update points.md

* Added author-information

* Add final author list

* Update points table

* docs: Added coordination point for Jimmy Lee (#1253)

docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan

* Update points table

* fix: Add multilingual Benchmark (#1252)

* fix: Add multilingual bench

* Update mteb/benchmarks/benchmarks.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* format

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* 1.14.23

Automatically generated by python-semantic-release

* docs: Small point changes & more contributors (#1254)

* Update points.md

* Fix format

* Fix attribution

* Update points table

* fix: Downsample large retrieval datasets (#1236)

* most tasks

* lint

* fix other issues

* refactor

* lint and docs

* add polish

* keep case sensitive mteb paths

* add potential points

* fix points

* fix test about metadata

* update tasks and stats

* lint
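Downsampling a retrieval corpus safely means keeping every judged-relevant document (so scores stay well-defined) and filling the rest with sampled negatives. A sketch of that strategy, under the assumption that this is roughly what the downsampling PR does; names and signatures are illustrative:

```python
import random

def downsample_corpus(corpus_ids, qrels, target_size, seed=42):
    """Downsample a retrieval corpus to about target_size documents.

    Keeps every document relevant to some query, then adds randomly
    sampled negatives up to the target size. Deterministic via seed.
    """
    relevant = {doc_id for docs in qrels.values() for doc_id in docs}
    negatives = [d for d in corpus_ids if d not in relevant]
    n_extra = max(0, target_size - len(relevant))
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(n_extra, len(negatives)))
    return sorted(relevant | set(sampled))
```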

* Update points table

* Update tasks table

* 1.14.24

Automatically generated by python-semantic-release

* fix: Get meta from CrossEncoder (#1255)

* remove indent after return

* handle cross encoders for model meta

* make lint

* update filename since we now have model name

* 1.14.25

Automatically generated by python-semantic-release

* fix: Add listing all available benchmarks CLI option (#1256)

* add benchmarks.md in README

* add cli option

* add benchmark cli test case

* correct typo

* 1.14.26

Automatically generated by python-semantic-release

* docs: Update affiliation (#1248)

* Update points.md

* Update points.md

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* docs: Update mteb(eng) calculation (#1258)

* Update mteb(eng) calculation

* Fixed citations

* Update MTEB(eng) + MTEB(multilingual)

* feat: leverage SentenceTransformers' query/passage specific prompts (#1221)

* feat: leverage SentenceTransformer models' query/passage specific prompts

* refactor: remove E5Wrapper

fix: wrong e5 revisions

* fix: default prompt_type to None

* fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub

* fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr

* feat: use Enum for `prompt_type`

* docs: specify how to use prompts with Sentence Transformers

* feat: readd arctic models due to metadata
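The prompt handling described above (look the prompt up by `prompt_name` when the model has a `prompts` mapping, otherwise fall back to the raw text) can be sketched in a few lines. The prompt strings are assumptions for illustration:

```python
def build_prompt(model_prompts, prompt_name, text):
    """Prefix text with a model's named prompt, if it has one.

    If the model exposes a prompts mapping (as SentenceTransformer models
    can), look the prompt up by name; otherwise use the raw text.
    """
    if model_prompts and prompt_name in model_prompts:
        return model_prompts[prompt_name] + text
    return text
```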

* 1.15.0

Automatically generated by python-semantic-release

* fix: Add Touche2020v3 and JMTEB (#1262)

* add datasets

* fix metrics

* add Touche2020v3

* fix metadata

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* upd name and suppress

* add benchmark class

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update tasks table

* 1.15.1

Automatically generated by python-semantic-release

* fix: Select benchmarks CLI option (#1261)

* add test case for a list of Benchmarks

* add selecting benchmarks CLI option

* typos

* use a separate attribute for benchmarks

* try fixing tests

* should accept string as well

* revert filename change

* use Benchmark and avoid circular import

* fix: derive `results_directory` path from `results_repo` name (#1275)

fix: don't hardcode repo name when downloading results

* 1.15.2

Automatically generated by python-semantic-release

* fix: sorting benchmark tasks by MTEB, then alphabetical (#1271)

* sorted

* fixed formatting

* efficiency changes

* fix test

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.3

Automatically generated by python-semantic-release

* ci: Removed 3.8 dependency (#1281)

Changes include:
- remove 3.8 from tests (added 3.11 and 3.12)
- changed other CI to 3.9
- updated lint rules to use 3.8

* Update points table

* fix: Allow Numpy >=2.0 (#1264)

Allow Numpy >=2.0

* 1.15.4

Automatically generated by python-semantic-release

* docs: points for paper writing (#1286)

* Create 1004.jsonl

* Create 1006.jsonl

* Update docs/mmteb/points/1004.jsonl

* Update docs/mmteb/points/1006.jsonl

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update points table

* Update points table

* Update points table

* docs: Fix a link in the README (#1289)

* Fix a link in the README

And fix some typos.

* Update README.md

* Update points table

* fix: Update benchmarks (#1288)

* make benchmark var name uppercase

* update touche to v3

* add MIRACLRetrievalHardNegatives to multilingual

* add mteb(indic)

* add eu benchmark

* 1.15.5

Automatically generated by python-semantic-release

* fix: Allow numpy<2.0.0 (#1291)

* 1.15.6

Automatically generated by python-semantic-release

* fix: Add metadata dict to QBQTC in C-MTEB (#1292)

* fix QBQTC in C-MTEB

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.7

Automatically generated by python-semantic-release

* fix: Remove non-existent eval split of CMNLI (#1294)

fix eval_splits of CMNLI

* 1.15.8

Automatically generated by python-semantic-release

* Leaderboard (#1235)

* Add leaderboard dev

* Renamed MTEBResults to TaskResult

* Moved model and model meta loading utilities into overview.py

* Added get_model_metas to retrieve filtered metadata for models

* Restructured results object and made it into a class instead of a dict

* Added utilities for filtering models on BenchmarkResults objects

* Added to_table utility function to BenchmarkResults

* Added serialization utilities to BenchmarkResults

* Attempted fixing tests

* Added get_model_metas to __init__

* Added get_benchmarks to __init__ and made it return all benchmarks by default

* Added get_benchmarks to __init__

* Made tasks hashable

* Added task filtering based on task objects on BenchmarkResults

* Added BenchmarkResults to __init__

* Added additional arguments to get_scores on two classes

* Made get_scores smarter on BenchmarkResult

* Added basic multilingual benchmark

* Modified benchmark to be able to easily access results

* Added useful properties and filtering functions to BenchmarkResults

* Added minimal functioning example

* Added smarter table, task-list updating and tried fixing dropdown scrolling

* Made restrict_results into a private function

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Removed old leaderboard scripts

* Hardcoded max and min model size

* Removed redundant utils file

* Ran linting

* added leaderboard dependencies as optional

* Fixed union type error on Python 3.9

* Removed references to Dict in task aggregation

* Fixed name errors in _restrict_task_results

* Fixed _restrict_task_results

* Made hf_subsets={'default'} when the task is monolingual in _restric_task_results

* Task dropdown now gets filtered based on the other criteria

* Ran linting again

* Introduced hotfix for reranking test

* Added BenchmarkResults to __all__ in __init__

* Fixed validate_and_filter_scores method, and replaced _restric_task_results with it

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* feat: Use prompts instead of encode_corpus and encode_queries (#1278)

* add prompt per task type

* fix prompt

* upd test

* lint

* fix test

* fix DeprecatedSummarizationEvaluator

* fix prompts

* add test

* lint

* logger info

* use task type only in model_encode

* lint

* update interface

* add prompt types to docs

* fix test

* mock tasks

* mock task registry

* remove last task_type

* fix tests

* lint

* fix test

* fix

* use wrapper and new prompts

* fix tests

* lint

* fix test

* remove conftest

* validate task to prompt_name

* override model prompts

* task to prompt name optional

* fix tests

* fix models

* remove task_to_prompt_name

* remove from mteb __init__

* update docs

* load existing model prompts if model_prompts is None

* fix

* lint

* change wrapper loader

* add wrapper class

* lint

* add wrapper file

* update logging

* upd logging

* refactor reranking

* lint

* remove prints
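The per-task-type prompt resolution introduced in this PR can be sketched as a most-specific-first lookup. The key scheme (task name, then task type, then prompt type) is an assumption based on the commit list above, not mteb's exact code:

```python
from enum import Enum

class PromptType(str, Enum):
    query = "query"
    passage = "passage"

def get_prompt(model_prompts, task_name, task_type, prompt_type=None):
    """Resolve the most specific prompt available for a task.

    Tries the task name first, then the task type, then the generic
    prompt type; returns None if nothing matches.
    """
    candidates = (
        task_name,
        task_type,
        prompt_type.value if prompt_type else None,
    )
    for key in candidates:
        if key is not None and key in model_prompts:
            return model_prompts[key]
    return None
```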

* 1.16.0

Automatically generated by python-semantic-release

* fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276)

* Add Retrieval SK Quad dataset for Slovak search evaluation

This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development.

* Add Retrieval SK Quad dataset for Slovak search evaluation 2

Added the requested changes on the SKQuadRetrieval.py file

* add task to init

* add missing task metadata

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks table

* 1.16.1

Automatically generated by python-semantic-release

* fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274)

* Add Slovak Hate Speech and Offensive Language
Dataset

This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.

* Add Slovak Hate Speech and Offensive Language Dataset
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* Did requested changes:
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* resolve linting issues by running `make lint`

* Update tasks table

* WIP: Leaderboard UI improvements (#1312)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* 1.16.2

Automatically generated by python-semantic-release

* fix: remove duplicate multilingual

* 1.16.3

Automatically generated by python-semantic-release

* fix: Re-upload dataset to hub to avoid using script upload (#1322)

* fix dataset upload

* add linting

* Update tasks table

* 1.16.4

Automatically generated by python-semantic-release

* fix: Add implementations of common reranker models (#1309)

* init

* revert

* revert

* add metadata

* lint

* add reqs

* change to float16

* benchmark lint fix

* 1.16.5

Automatically generated by python-semantic-release

* Add multilingual mFollowIR dataset (#1308)

* add mFollowIR

* paper name

* edit warning->info

* convert to parquet

* lint

* Update tasks table

* Cache the embeddings when requested (#1307)

* add caching

* update test to use close

* change from json to pkl

* fix for window

* cleanup on Windows again

* infer dimension

* move cachewrapper

* add wrapper

* fix

* updates

* fix tests

* fix lint

* lint

* add test
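The embedding cache added here can be sketched as a wrapper that memoizes by input text and only forwards unseen texts to the underlying encoder. Class and parameter names are assumptions; the pickle persistence mirrors the PR's switch from json to pkl for storing vectors:

```python
import pickle
from pathlib import Path

class CachedEncoder:
    """Wrap an encoder so repeated texts are embedded only once.

    In-memory dict keyed by text, optionally persisted with pickle.
    A sketch of the caching idea, not mteb's exact implementation.
    """

    def __init__(self, encode_fn, cache_path=None):
        self.encode_fn = encode_fn
        self.cache_path = Path(cache_path) if cache_path else None
        self.cache = {}
        if self.cache_path and self.cache_path.exists():
            self.cache = pickle.loads(self.cache_path.read_bytes())

    def encode(self, texts):
        # Only embed texts we have not seen before.
        missing = [t for t in texts if t not in self.cache]
        if missing:
            for text, vec in zip(missing, self.encode_fn(missing)):
                self.cache[text] = vec
        if self.cache_path:
            self.cache_path.write_bytes(pickle.dumps(self.cache))
        return [self.cache[t] for t in texts]
```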

* WIP: Leaderboard UI improvements (#1320)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* Removed faux benchmark

* Updated layout

* Changed table number format

* Table highlights highest values by making them bold

* Added rank to table, removed organization from model_name

* Added mean rank to table

* Ran linting

* feat: Update metadata for all models (#1316)

* Added model meta

* format

* fixed metadata

* Metadata update for voyage models

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Added corrections from review

* fix spelling error

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* resolved bugs from pytest --collect-only

* Avoid wrapping all models with the SentenceTransformerWrapper

* Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations

* fixed moved on correction from @Samoed

* conditionally set .predict method on SentenceTransformerWrapper

---------

Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>
Muennighoff added a commit that referenced this pull request Dec 11, 2024
* [MIEB] Adding DataComp CLIP models (#1283)

* adding data comp CLIP models

* update model and caltech101 results

* make lint

* [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix meta data

* fix validate points

* CV-Bench

* evaluator args comment

* fix

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [mieb] adding 10 tasks (#1290)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add vidore benchmark 10 tasks

* fix reference

* fix old metadata

* fix meta

* [mieb] Adding MOCOv3 models (#1293)

* add moco models first try

* add as a timm model

* add large model results

* make lint

* [mieb] Add more Any2AnyRetrieval datasets (#1285)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* remove GLDv2I2IRetrieval
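The `cluster_accuracy` metric added to `Image.ClusteringEvaluator` above scores a clustering by the best one-to-one relabeling of predicted clusters. A brute-force sketch; real implementations use the Hungarian algorithm instead of trying every permutation, which is only feasible for a handful of clusters, and this sketch assumes there are at least as many true labels as predicted clusters:

```python
from itertools import permutations

def cluster_accuracy(labels_true, labels_pred):
    """Best accuracy over all one-to-one relabelings of predicted clusters."""
    pred_ids = sorted(set(labels_pred))
    true_ids = sorted(set(labels_true))
    best = 0.0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for p, t in zip(labels_pred, labels_true))
        best = max(best, hits / len(labels_true))
    return best
```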

* [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* [mieb] Fix FORB dataset (#1306)

* correct format

* update results

* add more results

* add more results

* [mieb] run tasks fix (#1302)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix e5v&vista

* task type fix for running tasks

* fix wrong meta

* run mieb script

* script

* lint

* align

* [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* fix e5v&vista

* remove duplicate corpus entries from BLINKIT2TRetrieval dataset

* task type fix for running tasks

* update BLINKIT2T metadata

* fix wrong meta

* run mieb script

* split ROxford, RParis into easy, medium and hard

* make lint

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] run tasks small fix (#1310)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix e5v&vista

* task type fix for running tasks

* fix wrong meta

* run mieb script

* script

* lint

* align

* fix

* linting

* [mieb] Add VLM2vec (#1323)

* wip vlm2vec model

* making i2t classification work with Caltech101

* test vlm2vec on other task types

* move peft into class

* feat: Merge main into MIEB (#1329)

* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203)

* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201)

Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements

- Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements
- Reference: OpenAI's Embedding API documentation on input limits

Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>

* fix ruff formatting

* Added minor test fixes to ensure reproducility across systems

* Ensure that tmp.json is not created within repo when running tests

* format

* fixes path issues

* Rerun CI

---------

Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>

* fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207)

Fixes #1206

* 1.14.16

Automatically generated by python-semantic-release

* fix: Normalize licenses including casing, uses of "-" etc.

* fix: Normalize licenses including casing, uses of "-" etc. (#1210)

* fix: Normalize licenses including casing, uses of "-" etc.

* fix tests

* 1.14.17

Automatically generated by python-semantic-release

* fix: Normalize benchmarks no only include task objects and added getter for benchmarks (#1208)

* Normalize benchmarks to only include tasks

- Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented
- implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks
- Added tests + updated docs

A few outstanding issues:

I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) required the split to be specified. A solution it to allow "eval_splits) to be specified when initializing a task and then pass it on to the `load_data()`. This way we can write the following:

`mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)`

I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it average the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complimenting solution for this is to allow nested benchmarks.

* fix error in tests

* format

* Added corrections based on review

* added example and formatted

* 1.14.18

Automatically generated by python-semantic-release

* docs: Fix broken links in docs (#1212)

* Added fixes for broken links in adding_a_dataset and adding_a_model docs.

* Updated link name

* Mismatch of the category of AmazonPolarityClassification (#1220)

Fixes #1219

* Update tasks table

* fix: Ensure that results are returned even when hitting cache (#1215)

Fixes #1122

* 1.14.19

Automatically generated by python-semantic-release

* fix: Allow benchmark to specify eval_splits (#1217)

* fix: Allow benchmark to specify eval_splits

This PR allow for benchmarks to specify specific eval. splits. This allow us to fully specify a benchmark within the benchmark object.

To do this it add the following:
- added eval_splits to the Abstask object, which default to metadata.eval_splits
- use the task.eval_splits unless overwritten in mteb.MTEB.run
- added eval_splits arg to mteb.get_tasks, which filter the tasks based on splits
- updated documentation
  - renamed the "Advanced Usage" to "Usage Documentation" to make it more accicible
- added tests where relevant

* Added correction based on feedback

* 1.14.20

Automatically generated by python-semantic-release

* Update points table

* Update points table

* docs: clarify adding a model (#1222)

* fix: Add RepLLaMA style models (#1223)

* init commit

* working and reproducing

* lint

* update hashes

* warning

* add pyproject

* Update points table

* 1.14.21

Automatically generated by python-semantic-release

* docs: Update points (#1228)

* Fix case

* Fix casing

* Fix case

* Fix case

* Create 971.jsonl

* Update contrib

* Add contributors

* Update points table

* docs: Add MTEB(code) dataset (#1237)

* docs: Add MTEB(code) dataset

* Fix linting

* Update points table

* Update of my affiliation (#1242)

Update points.md

* Add contributor (#1243)

* fix: @mrshu's name in `points.md` (#1246)

* Use the diacritic character to be inline with Slovak spelling.

Signed-off-by: mr.Shu <mr@shu.io>

* docs: Create benchmarks overview table (#1245)

* fix get_benchmarks method

* add create benchmark script

* make lint

* 1.14.22

Automatically generated by python-semantic-release

* docs: Update affiliation (#1247)

Update points.md

* Added author-information

* Add final author list

* Update points table

* docs: Added coordination point for Jimmy Lee  (#1253)

docs: Added coordination point for Jimmy lee for his work on the coordination of Crystina and Nandan

* Update points table

* fix: Add multilingual Benchmark (#1252)

* fix: Add multilingual bench

* Update mteb/benchmarks/benchmarks.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* format

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* 1.14.23

Automatically generated by python-semantic-release

* docs: Small point changes & more contributors (#1254)

* Update points.md

* Fix format

* Fix attribution

* Update points table

* fix: Downsample large retrieval datasets (#1236)

* most tasks

* lint

* fix other issues

* refactor

* lint and docs

* add polish

* keep case sensitive mteb paths

* add potential points

* fix points

* fix test about metadata

* update tasks and stats

* lint

* Update points table

* Update tasks table
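The downsampling above shrinks large retrieval corpora without losing judged documents. A minimal sketch of that idea, assuming the standard corpus/qrels dict layout; note that the real MTEB "HardNegatives" variants keep curated hard negatives, whereas this sketch simply fills the budget with random unjudged documents:

```python
import random


def downsample_corpus(corpus, qrels, target_size, seed=42):
    """Shrink a retrieval corpus while keeping every judged document."""
    judged = {doc_id for docs in qrels.values() for doc_id in docs}
    keep = set(judged)
    # Fill the remaining budget with randomly sampled unjudged documents.
    pool = sorted(set(corpus) - judged)
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(keep))
    keep.update(rng.sample(pool, min(n_extra, len(pool))))
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in keep}


corpus = {f"d{i}": f"document text {i}" for i in range(100)}
qrels = {"q1": {"d3": 1}, "q2": {"d7": 1, "d9": 1}}
small = downsample_corpus(corpus, qrels, target_size=10)
print(len(small), "d3" in small)  # 10 True
```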

* 1.14.24

Automatically generated by python-semantic-release

* fix: Get meta from CrossEncoder (#1255)

* remove indent after return

* handle cross encoders for model meta

* make lint

* update filename since we now have model name

* 1.14.25

Automatically generated by python-semantic-release

* fix: Add listing all available benchmarks CLI option (#1256)

* add benchmarks.md in README

* add cli option

* add benchmark cli test case

* correct typo

* 1.14.26

Automatically generated by python-semantic-release

* docs: Update affiliation (#1248)

* Update points.md

* Update points.md

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* docs: Update mteb(eng) calculation (#1258)

* Update mteb(eng) calculation

* Fixed citations

* Update MTEB(eng) + MTEB(multilingual)

* feat: leverage SentenceTransformers' query/passage specific prompts (#1221)

* feat: leverage SentenceTransformer models' query/passage specific prompts

* refactor: remove E5Wrapper

fix: wrong e5 revisions

* fix: default prompt_type to None

* fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub

* fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr

* feat: use Enum for `prompt_type`

* docs: specify how to use prompts with Sentence Transformers

* feat: readd arctic models due to metadata
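The mechanism above selects a per-model instruction prompt by query/passage type, with the type modeled as an Enum. A simplified sketch of that selection logic; this is not the Sentence Transformers implementation, and the `apply_prompt` helper and prompt strings are assumptions for illustration:

```python
from enum import Enum
from typing import Optional


class PromptType(str, Enum):
    query = "query"
    passage = "passage"


def apply_prompt(texts, prompts, prompt_type: Optional[PromptType] = None):
    """Prepend the instruction prompt matching the prompt type, if one exists."""
    if prompt_type is None or prompt_type.value not in prompts:
        return list(texts)
    prefix = prompts[prompt_type.value]
    return [prefix + t for t in texts]


prompts = {"query": "query: ", "passage": "passage: "}
print(apply_prompt(["what is mteb?"], prompts, PromptType.query))
# ['query: what is mteb?']
```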

* 1.15.0

Automatically generated by python-semantic-release

* fix: Add Touche2020v3 and JMTEB (#1262)

* add datasets

* fix metrics

* add Touche2020v3

* fix metadata

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* update name and suppress

* add benchmark class

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update tasks table

* 1.15.1

Automatically generated by python-semantic-release

* fix: Select benchmarks CLI option (#1261)

* add test case for a list of Benchmarks

* add selecting benchmarks CLI option

* typos

* use a separate attribute for benchmarks

* try fixing tests

* should accept string as well

* revert filename change

* use Benchmark and avoid circular import

* fix: derive `results_directory` path from `results_repo` name (#1275)

fix: don't hardcode repo name when downloading results

* 1.15.2

Automatically generated by python-semantic-release

* fix: sorting benchmark tasks by MTEB, then alphabetical (#1271)

* sorted

* fixed formatting

* efficiency changes

* fix test

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.3

Automatically generated by python-semantic-release

* ci: Removed 3.8 dependency (#1281)

Changes include:
- remove 3.8 from tests (added 3.11 and 3.12)
- changed other CI to 3.9
- updated lint rules to use 3.8

* Update points table

* fix: Allow Numpy >=2.0 (#1264)

Allow Numpy >=2.0

* 1.15.4

Automatically generated by python-semantic-release

* docs: points for paper writing (#1286)

* Create 1004.jsonl

* Create 1006.jsonl

* Update docs/mmteb/points/1004.jsonl

* Update docs/mmteb/points/1006.jsonl

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update points table

* Update points table

* Update points table

* docs: Fix a link in the README (#1289)

* Fix a link in the README

And fix some typos.

* Update README.md

* Update points table

* fix: Update benchmarks (#1288)

* make benchmark var name uppercase

* update touche to v3

* add MIRACLRetrievalHardNegatives to multilingual

* add mteb(indic)

* add eu benchmark

* 1.15.5

Automatically generated by python-semantic-release

* fix: Allow numpy<2.0.0 (#1291)

* 1.15.6

Automatically generated by python-semantic-release

* fix: Add metadata dict to QBQTC in C-MTEB (#1292)

* fix QBQTC in C-MTEB

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.7

Automatically generated by python-semantic-release

* fix: Remove non-existent eval split of CMNLI (#1294)

fix eval_splits of CMNLI

* 1.15.8

Automatically generated by python-semantic-release

* Leaderboard (#1235)

* Add leaderboard dev

* Renamed MTEBResults to TaskResult

* Moved model and model meta loading utilities into overview.py

* Added get_model_metas to retrieve filtered metadata for models

* Restructured results object and made it into a class instead of a dict

* Added utilities for filtering models on BenchmarkResults objects

* Added to_table utility function to BenchmarkResults

* Added serialization utilities to BenchmarkResults

* Attempted fixing tests

* Added get_model_metas to __init__

* Added get_benchmarks to __init__ and made it return all benchmarks by default

* Added get_benchmarks to __init__

* Made tasks hashable

* Added task filtering based on task objects on BenchmarkResults

* Added BenchmarkResults to __init__

* Added additional arguments to get_scores on two classes

* Made get_scores smarter on BenchmarkResult

* Added basic multilingual benchmark

* Modified benchmark to be able to easily access results

* Added useful properties and filtering functions to BenchmarkResults

* Added minimal functioning example

* Added smarter table, task-list updating and tried fixing dropdown scrolling

* Made restrict_results into a private function

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Removed old leaderboard scripts

* Hardcoded max and min model size

* Removed redundant utils file

* Ran linting

* added leaderboard dependencies as optional

* Fixed union type error on Python 3.9

* Removed references to Dict in task aggregation

* Fixed name errors in _restrict_task_results

* Fixed _restrict_task_results

* Made hf_subsets={'default'} when the task is monolingual in _restric_task_results

* Task dropdown now gets filtered based on the other criteria

* Ran linting again

* Introduced hotfix for reranking test

* Added BenchmarkResults to __all__ in __init__

* Fixed validate_and_filter_scores method, and replaced _restric_task_results with it

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* feat: Use prompts instead of encode_corpus and encode_queries (#1278)

* add prompt per task type

* fix prompt

* upd test

* lint

* fix test

* fix DeprecatedSummarizationEvaluator

* fix prompts

* add test

* lint

* logger info

* use task type only in model_encode

* lint

* update interface

* add prompt types to docs

* fix test

* mock tasks

* mock task registry

* remove last task_type

* fix tests

* lint

* fix test

* fix

* use wrapper and new prompts

* fix tests

* lint

* fix test

* remove conftest

* validate task to prompt_name

* override model prompts

* task to prompt name optional

* fix tests

* fix models

* remove task_to_prompt_name

* remove from mteb __init__

* update docs

* load existing model prompts if model_prompts is None

* fix

* lint

* change wrapper loader

* add wrapper class

* lint

* add wrapper file

* update logging

* upd logging

* refactor reranking

* lint

* remove prints

* 1.16.0

Automatically generated by python-semantic-release

* fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276)

* Add Retrieval SK Quad dataset for Slovak search evaluation

This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development.

* Add Retrieval SK Quad dataset for Slovak search evaluation 2

Applied the requested changes to the SKQuadRetrieval.py file

* add task to init

* add missing task metadata

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks table

* 1.16.1

Automatically generated by python-semantic-release

* fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274)

* Add Slovak Hate Speech and Offensive Language
Dataset

This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.

* Add Slovak Hate Speech and Offensive Language Dataset
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* Did requested changes:
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* resolve linting issues by running `make lint`

* Update tasks table

* WIP: Leaderboard UI improvements (#1312)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* 1.16.2

Automatically generated by python-semantic-release

* fix: remove duplicate multilingual

* 1.16.3

Automatically generated by python-semantic-release

* fix: Re-upload dataset to hub to avoid using script upload (#1322)

* fix dataset upload

* add linting

* Update tasks table

* 1.16.4

Automatically generated by python-semantic-release

* fix: Add implementations of common reranker models (#1309)

* init

* revert

* revert

* add metadata

* lint

* add reqs

* change to float16

* benchmark lint fix

* 1.16.5

Automatically generated by python-semantic-release

* Add multilingual mFollowIR dataset (#1308)

* add mFollowIR

* paper name

* edit warning->info

* convert to parquet

* lint

* Update tasks table

* Cache the embeddings when requested (#1307)

* add caching

* update test to use close

* change from json to pkl

* fix for window

* cleanup on Windows again

* infer dimension

* move cachewrapper

* add wrapper

* fix

* updates

* fix tests

* fix lint

* lint

* add test
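The caching above avoids re-embedding identical inputs across runs by persisting embeddings to disk. A minimal sketch of such a wrapper, assuming the embeddings are picklable; the `CachedEncoder` class and `toy_encode` function are illustrative stand-ins, not the actual mteb cache wrapper:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class CachedEncoder:
    """Wrap an encode function so repeated inputs reuse embeddings cached on disk."""

    def __init__(self, encode_fn, cache_dir):
        self.encode_fn = encode_fn
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def encode(self, sentences):
        # Embed only the sentences that are not cached yet.
        missing = [i for i, s in enumerate(sentences) if not self._path(s).exists()]
        new_embeddings = self.encode_fn([sentences[i] for i in missing])
        for i, emb in zip(missing, new_embeddings):
            self._path(sentences[i]).write_bytes(pickle.dumps(emb))
        return [pickle.loads(self._path(s).read_bytes()) for s in sentences]


calls = []

def toy_encode(batch):
    calls.append(len(batch))
    return [[float(len(s))] for s in batch]

cache_dir = tempfile.mkdtemp()
encoder = CachedEncoder(toy_encode, cache_dir)
first = encoder.encode(["hi", "hello"])
second = encoder.encode(["hi", "hello"])  # served entirely from the cache
print(first == second, calls)  # True [2, 0]
```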

* WIP: Leaderboard UI improvements (#1320)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* Removed faux benchmark

* Updated layout

* Changed table number format

* Table highlights highest values by making them bold

* Added rank to table, removed organization from model_name

* Added mean rank to table

* Ran linting

* feat: Update metadata for all models (#1316)

* Added model meta

* format

* fixed metadata

* Metadata update for voyage models

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Added corrections from review

* fix spelling error

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* resolved bugs from pytest --collect-only

* Avoid wrapping all models with the SentenceTransformerWrapper

* Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations

* fixed moved on correction from @Samoed

* conditionally set .predict method on SentenceTransformerWrapper

---------

Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>

* [mieb] Add OpenCLIP models (#1335)

* add open clip models

* Update __init__.py

* lint

* fix model overview

* update jina clip

---------

Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>

* [mieb] new version with downsampled train split to 32 per class (#1327)

* new version with downsampled train split to 32 per class

* force load truncated image file

* make lint

* add open clip models

* Update __init__.py

* lint

* fix model overview

* fix ImageCLS undersample; run birdsnap

* make lint

* make lint

---------

Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>

* [mieb] Fix Jina CLIP (#1349)

fix jina clip v1

* fix: Add clevr license (#1356)

* Add BLINK as multi-choice tasks (#1348)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* fix e5v&vista

* remove duplicate corpus entries from BLINKIT2TRetrieval dataset

* task type fix for running tasks

* update BLINKIT2T metadata

* fix wrong meta

* run mieb script

* split ROxford, RParis into easy, medium and hard

* make lint

* add BLINK as multi choice tasks

* fix: license metadata in wrong format

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
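One of the commits above adds cluster accuracy (alongside ARI and NMI) to the image clustering evaluator. Cluster accuracy is the best accuracy over one-to-one mappings between predicted cluster ids and true labels; a minimal sketch follows. It brute-forces the mapping for clarity, whereas real evaluators typically use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations


def cluster_accuracy(labels_true, labels_pred):
    """Best accuracy over one-to-one mappings of predicted clusters to true labels.

    Brute force over label permutations for clarity; assumes the number of
    predicted clusters does not exceed the number of true classes.
    """
    clusters = sorted(set(labels_pred))
    classes = sorted(set(labels_true))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for p, t in zip(labels_pred, labels_true))
        best = max(best, hits)
    return best / len(labels_true)


# Predicted cluster ids are arbitrary: swapped labels still score perfectly.
print(cluster_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(cluster_accuracy([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.5
```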

* [mieb] add Eva CLIP models (#1369)

* add Eva CLIP models

* make lint

* [mieb] add siglip, cohere multimodal & some fixes for final run (#1357)

* fix dataset type error

* fix clustering metrics

* add siglip & cohere

* update mieb run script

* cohere-v import

* fix

* api key name

* [mieb] fixes for final run (#1374)

* e5_v device arg

* dataloader num_workers

* vista doc

* vista doc

* run mieb

* fix

* Update run_vista.md

* [mieb] Fix torch no grad (#1378)

Fix torch no grad

* [mieb] Fix vlm2vec (#1380)

* fix vlm2vec return dtype

* make lint

* [mieb] Remove null entries from corpus of ROxford, RParis (#1371)

* remove null examples from corpus of ROxford and RParis

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] fixes (#1390)

* Fix torch no grad

* simplify

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [MIEB] Remove non-existent method for blip (#1394)

remove non-existent method for blip

* [mieb] fix ALIGN; update Winoground revision id; update run script (#1391)

* fix align & winoground

* lint

* Convert task category to i2i for tasks that only calls image encode

* update categories should include img cls, clustering, and multi label clf

* no op

* no op

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [mieb] Fix open clip for cv bench count (#1397)

fix shape mismatch

* [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403)

* fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice

* update blink metadata

* add updated BLINK results

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] Fix EVA CLIP for CV Bench (#1414)

* unsqueeze after preprocess

* make lint

* [mieb] Add calculate probs for vlm2vec (#1418)

* add method

* make lint

* [mieb] Fix siglip bug & add retrieval datasets (#1424)

* fix siglip

* add edis&gld-v2 i2i

* results

* siglip updated results

* fix siglip non-dataloader tasks

* [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420)

* use moc-lr classifier

* set n_experiments=5

* run dinov2 and some laion models

* add dinov2-giant results

* [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429)

* mieb scripts

* lint

* [MIEB] Change Flickr30k to test split (#1449)

* merge upstream mieb

* change Flickr30k to test split

* change flickr to test split

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] Fix VLM2vec dtype (#1462)

* propagate dtype

* fix fuse embeddings using list of PIL images

* [mieb] run script for missing results (#1472)

* task type fix

* scripts

* [mieb] Fix Moco model on CIFAR10Clustering (#1487)

Fix Moco model on CIFAR10Clustering

* [mieb] Fix Flickr30k I2T and T2I (#1505)

* remake flickr30k it2 and t2i

* add openai clip vit-b32 b16 and jina-clip results

* make lint

* [MIEB] add missing siglip models  (#1533)

* add updates
* lint errors

* fix typo (#1535)

* add updates
* lint errors
* fix small typo

* [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544)

fix numbers

* Add Voyage's multimodal embedding (#1555)

* add voyage multimodal & ran 17 tasks

* lint

* typo

* clean

* [mieb] update script for final re-run (#1576)

* mieb final runs

* lint

* fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572)

* fix: no longer using same query text for all of BLINKIT2TMultiChoice

* fix: remove blink subtask

* fix: remove subtask from blink it2i

* fix: align BLINK retrieval to multi choice

* add ROxford and RParis I2I multi choice

* add retrieval metrics to multi choice evaluator

* fix: remove wrong negatives from revisiting multichoice datasets

* fix revisiting datasets

* add new results for revisiting multichoice

---------

Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: Jamie-Stirling <36764530+Jamie-Stirling@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
Co-authored-by: Saiteja Utpala <73220310+SaitejaUtpala@users.noreply.github.com>
Co-authored-by: Xin Zhang <izhx404@gmail.com>
isaac-chung added a commit that referenced this pull request Feb 4, 2025
* mieb ZeroshotClassification
* mieb docs
* mieb implementation demo
* model meta; abstask column names; linear probe clf
* fix: update naming as candidate_labels
* Update README.md
* Update README.md
* i2tretrieval
* test load data ignore i2tretrieval
* [MIEB] Add image clustering (#1088)
* make lint
* wip
* add TinyImageNet and run
* type hints
* add accuracy
* lint
* remove unused & fix typos
* T2I Retrieval
* Any2AnyRetrieval
* fix tests from merge
* [MIEB] Add image text pair classification and tests (#1099)
* add ImageTextPairClassification abstask and evaluator
* dataset transform into sequence of images for each sample
* fix processing logic; list of list images compatibility
* lint and docstrings
* make lint
* fix failing tests in TaskMetadata
* add tests for mieb
* skip gated repo
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [MIEB] Add image classification and zero shot classification tasks (#1101)
* fix task metadata
* use overrideable column names
* add CIFAR datasets
* add caltech101 dataset
* add FGVC aircraft dataset
* add food 101 dataset
* add OxfordPets dataset
* remove comments
* correct cifar100 path
* update cifar100 classification results
* cifar zero shot results
* add caltech101 zero shot
* matching CLIP paper implementation
* add aircraft and food zero shot
* add oxford pets zero shot
* [MIEB] Add CIFAR clustering (#1104)
add CIFAR clustering
* [MIEB] Add more image classification and zero shot classification datasets (#1103)
* update category to i2t
* add MNIST linear probe and zero shot
* add FER2013 linear probe and zero shot
* add stanford cars linear probe and zero shot
* add birdsnap linear probe and zero shot
* add eurosat linear probe and zero shot
* lint
* correct eurosat zero shot labels
* add abstask for image multilable and voc2007
* make lint
* [MIEB] Add more image classification and zero shot datasets (#1105)
* add STL10 linear probe and zero shot
* add RESISC45 linear probe and zeor shot
* add Describable textures linear probe and zero shot
* fix spacing lint
* add SUN397 linear probe and zero shot
* correct SUN397 zero shot captions
* add baai bge vista
* add e5-v
* linting
* memory issues for image linear probe & zeroshot
* kknn linear probe arguments
* del comments
* Add some classification and ZeroShot classification tasks (#1107)
* Add Country211 classification task
* Add imagenet1k classification task
* Add UCF101 classification task
* Add PatchCamelyon Classification task
* Add GTSRB classification task
* Add GSTRB Zero Shot Classification
* Add country211 zero shot classification
* Add results for classification tasks
* Add zero shot classification tasks
* Add PatchCamelyon tasks and results
* Add linting
* Add results and fix prompts for zero shot
* Add results
* Add results and linting
* fix dependency & clip mock test
* [MIEB] Add jina clip (#1120)
* add jina clip and mscoco i2t and t2i results
* make lint
* [MIEB] Update `mieb` with the `main` branch and some fixes (#1126)
* fix instruction retrival (#1072)
* fix instruction retrival
* fix test
* add points
* make nested results
* add test
* skip instruction test
* fix instruction passes
* fix unions
* move do_length_ablation
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update points table
* fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106)
found bug
* 1.12.85
Automatically generated by python-semantic-release
* fix: MultilingualSentimentClassification (#1109)
* Update points table
* fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks
* 1.12.86
Automatically generated by python-semantic-release
* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112)
* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended
* fix: fixed formatting for cli
* docs: improve searchability in the advanced usage documentation
* 1.12.87
Automatically generated by python-semantic-release
* docs: improve searchability in the advanced usage documentation (#1113)
* docs: improve searchability in the advanced usage documentation
* docs: update based on corrections
* fix: export type for `mteb create_meta` (#1114)
* fix export type
* fix dataset version too
* 1.12.88
Automatically generated by python-semantic-release
* fix: Simplify models implementations (#1085)
* Merge
* Adapt
* Simplify
* Check for rev again
* Rmv cmmnt
* Simplify
* simplify
* Rmv comment
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Use logging; change try except; add info
* Lint
* Rmv results
* Update rev
* format
* Simplify models; Allow instructions
* Jobs
* Fix merge
* Format
* Adapt models
* fix: ensure that e5 ignores the NQ
* format
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* 1.12.89
Automatically generated by python-semantic-release
* fix: nomic models using prefix correctly (#1125)
* fix: nomic models using prefix correctly
* chore: remove comment
* fix: handling in case not torch tensor
* Fix typo
---------
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* 1.12.90
Automatically generated by python-semantic-release
* refactor vista model wrapper to contain lib import
* python 38 type hints
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: anpalmak2003 <73543260+anpalmak2003@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Zach Nussbaum <zanussbaum@gmail.com>
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
* image memory issues for all retrieval Abstasks
* Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127)
* Add CLEVR and SciMMIR
* Update metadata
* remove useless comment
* Add linting
* fix typo and tests
* Add CLEVR count task
* add linting
* add fashion200k & fashionIQ; tests passed
* clip text max seq truncation
* add WebQA, NIGHTS, OVEN
* any2any retrieval chunk encoding
* add nomic vision model; any2any topk bug
* add cv recall
* add InfoSeek; VisualNews
* [MIEB] Add Stanford Cars i2i Retrieval (#1147)
* wip
* add results
* make lint
* change back the order
* [MIEB] Add CUB200 i2i retrieval (#1154)
* add cub200 and results
* add skip_first_result
* skipped self and rerun results
* consolidate i2t and t2i to any2any
* remove abstask and evaluators
* remove references from test
* add TU-Berlin sketch retrieval
* XM3600; XFlickr30kCO; multilingual
* wit multilingual retrieval t2i
* correct multilingual t2i meta
* meta
* add dinov2 model; 4 sizes
* cls evaluator channel bug fix
* add ALIGN model
* add FORBI2IRetrieval
* forb & tuberlin new revision
* disable tokenization parallelism
* add hateful meme retrieval i2tt2i
* add memotion retrieval t2ii2t
* add SciMMIR Retrieval i2tt2i
* ruff update
* Visual STS Abstask&evaluator
* add visual STS17
* add visual STS 12-16
* [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* [mieb] add 3 compositionality evaluation tasks (#1229)
* linting & update unavailable dataset path
* add aro visual relation&attribution; sugarcrepe
* correct reference
* add SOPI2IRetrieval dataset/task (#1232)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* change reference
* Image text pair cls (#1233)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix meta data
* fix validate points
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add RP2kI2IRetrieval and METI2IRetrieval (#1239)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* [MIEB] Adding DataComp CLIP models (#1283)
* adding data comp CLIP models
* update model and caltech101 results
* make lint
* [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix meta data
* fix validate points
* CV-Bench
* evaluator args comment
* fix
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [mieb] adding 10 tasks (#1290)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* add vidore benchmark 10 tasks
* fix reference
* fix old metadata
* fix meta
* [mieb] Adding MOCOv3 models (#1293)
* add moco models first try
* add as a timm model
* add large model results
* make lint
* [mieb] Add more Any2AnyRetrieval datasets (#1285)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* remove GLDv2I2IRetrieval
* [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* add AbsTaskAny2AnyMultiChoice
* make lint
* remove GLDv2I2IRetrieval
* exclude AbsTaskAny2AnyMultiChoice from test_load_data
* [mieb] Fix FORB dataset (#1306)
* correct format
* update results
* add more results
* add more results
* [mieb] run tasks fix (#1302)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix e5v&vista
* task type fix for running tasks
* fix wrong meta
* run mieb script
* script
* lint
* align
* [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* add AbsTaskAny2AnyMultiChoice
* make lint
* remove GLDv2I2IRetrieval
* exclude AbsTaskAny2AnyMultiChoice from test_load_data
* fix e5v&vista
* remove duplicate corpus entries from BLINKIT2TRetreival dataset
* task type fix for running tasks
* update BLINKIT2T metadata
* fix wrong meta
* run mieb script
* split ROxford, RParis into easy, medium and hard
* make lint
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] run tasks small fix (#1310)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix e5v&vista
* task type fix for running tasks
* fix wrong meta
* run mieb script
* script
* lint
* align
* fix
* linting
* [mieb] Add VLM2vec (#1323)
* wip vlm2vec model
* making i2t classification work with Caltech101
* test vlm2vec on other task types
* move peft into class
* feat: Merge main into MIEB (#1329)
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203)
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201)
Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements
- Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements
- Reference: OpenAI's Embedding API documentation on input limits
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
* fix ruff formatting
* Added minor test fixes to ensure reproducibility across systems
* Ensure that tmp.json is not created within repo when running tests
* format
* fixes path issues
* Rerun CI
---------
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
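The 2048-element input limit addressed above can be respected with a simple batching helper. The sketch below is illustrative only (function name and data are made up, not mteb's actual code); each chunk would be sent to the embeddings API separately and the results concatenated:

```python
def batched(sentences, max_batch=2048):
    """Yield successive slices of at most max_batch elements,
    matching the OpenAI embeddings API input-size limit."""
    for start in range(0, len(sentences), max_batch):
        yield sentences[start:start + max_batch]

# 5000 inputs are split into chunks of 2048, 2048 and 904.
chunks = list(batched([f"doc {i}" for i in range(5000)]))
```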
* fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207)
Fixes #1206
* 1.14.16
Automatically generated by python-semantic-release
* fix: Normalize licenses including casing, uses of "-" etc.
* fix: Normalize licenses including casing, uses of "-" etc. (#1210)
* fix: Normalize licenses including casing, uses of "-" etc.
* fix tests
* 1.14.17
Automatically generated by python-semantic-release
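As an illustration of the kind of normalization the license fix performs, a minimal sketch (the rules shown here are hypothetical, not mteb's exact mapping):

```python
def normalize_license(value):
    """Lowercase and unify separators so variants such as 'CC BY-SA 4.0',
    'cc_by_sa_4.0' and 'cc-by-sa-4.0' all compare equal (illustrative rules)."""
    return value.strip().lower().replace(" ", "-").replace("_", "-")

# All three spellings collapse to a single canonical form.
normalized = {normalize_license(v) for v in ("CC BY-SA 4.0", "cc_by_sa_4.0", "cc-by-sa-4.0")}
```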
* fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208)
* Normalize benchmarks to only include tasks
- Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented
- implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks
- Added tests + updated docs
A few outstanding issues:
I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible, as MTEB(eng) requires the split to be specified. A solution is to allow "eval_splits" to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following:
`mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)`
I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng), as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution for this is to allow nested benchmarks.
* fix error in tests
* format
* Added corrections based on review
* added example and formatted
* 1.14.18
Automatically generated by python-semantic-release
* docs: Fix broken links in docs (#1212)
* Added fixes for broken links in adding_a_dataset and adding_a_model docs.
* Updated link name
* Mismatch of the category of AmazonPolarityClassification (#1220)
Fixes #1219
* Update tasks table
* fix: Ensure that results are returned even when hitting cache (#1215)
Fixes #1122
* 1.14.19
Automatically generated by python-semantic-release
* fix: Allow benchmark to specify eval_splits (#1217)
* fix: Allow benchmark to specify eval_splits
This PR allows benchmarks to specify specific eval splits, which lets us fully specify a benchmark within the benchmark object.
To do this it adds the following:
- added eval_splits to the AbsTask object, which defaults to metadata.eval_splits
- use the task.eval_splits unless overwritten in mteb.MTEB.run
- added eval_splits arg to mteb.get_tasks, which filters the tasks based on splits
- updated documentation
  - renamed "Advanced Usage" to "Usage Documentation" to make it more accessible
- added tests where relevant
* Added correction based on feedback
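The split-based filtering added to `mteb.get_tasks` can be sketched roughly as follows (illustrative dataclass, not mteb's real `AbsTask`; only the filtering idea is shown):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    eval_splits: tuple  # in mteb this defaults to the task metadata's eval_splits


def filter_by_splits(tasks, eval_splits):
    """Keep only tasks that provide at least one of the requested splits."""
    wanted = set(eval_splits)
    return [t for t in tasks if wanted & set(t.eval_splits)]


tasks = [
    Task("NQ", ("test",)),
    Task("STS12", ("dev", "test")),
    Task("TrainOnlyTask", ("train",)),
]
selected = filter_by_splits(tasks, ["test"])  # drops the train-only task
```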
* 1.14.20
Automatically generated by python-semantic-release
* Update points table
* Update points table
* docs: clarify adding a model (#1222)
* fix: Add RepLLaMA style models (#1223)
* init commit
* working and reproducing
* lint
* update hashes
* warning
* add pyproject
* Update points table
* 1.14.21
Automatically generated by python-semantic-release
* docs: Update points (#1228)
* Fix case
* Fix casing
* Fix case
* Fix case
* Create 971.jsonl
* Update contrib
* Add contributors
* Update points table
* docs: Add MTEB(code) dataset (#1237)
* docs: Add MTEB(code) dataset
* Fix linting
* Update points table
* Update of my affiliation (#1242)
Update points.md
* Add contributor (#1243)
* fix: @mrshu's name in `points.md` (#1246)
* Use the diacritic character to be in line with Slovak spelling.
Signed-off-by: mr.Shu <mr@shu.io>
* docs: Create benchmarks overview table (#1245)
* fix get_benchmarks method
* add create benchmark script
* make lint
* 1.14.22
Automatically generated by python-semantic-release
* docs: Update affiliation (#1247)
Update points.md
* Added author-information
* Add final author list
* Update points table
* docs: Added coordination point for Jimmy Lee  (#1253)
docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan
* Update points table
* fix: Add multilingual Benchmark (#1252)
* fix: Add multilingual bench
* Update mteb/benchmarks/benchmarks.py
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* format
---------
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* 1.14.23
Automatically generated by python-semantic-release
* docs: Small point changes & more contributors (#1254)
* Update points.md
* Fix format
* Fix attribution
* Update points table
* fix: Downsample large retrieval datasets (#1236)
* most tasks
* lint
* fix other issues
* refactor
* lint and docs
* add polish
* keep case sensitive mteb paths
* add potential points
* fix points
* fix test about metadata
* update tasks and stats
* lint
* Update points table
* Update tasks table
* 1.14.24
Automatically generated by python-semantic-release
* fix: Get meta from CrossEncoder (#1255)
* remove indent after return
* handle cross encoders for model meta
* make lint
* update filename since we now have model name
* 1.14.25
Automatically generated by python-semantic-release
* fix: Add listing all available benchmarks CLI option (#1256)
* add benchmarks.md in README
* add cli option
* add benchmark cli test case
* correct typo
* 1.14.26
Automatically generated by python-semantic-release
* docs: Update affiliation (#1248)
* Update points.md
* Update points.md
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* docs: Update mteb(eng) calculation (#1258)
* Update mteb(eng) calculation
* Fixed citations
* Update MTEB(eng) + MTEB(multilingual)
* feat: leverage SentenceTransformers' query/passage specific prompts (#1221)
* feat: leverage SentenceTransformer models' query/passage specific prompts
* refactor: remove E5Wrapper
fix: wrong e5 revisions
* fix: default prompt_type to None
* fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub
* fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr
* feat: use Enum for `prompt_type`
* docs: specify how to use prompts with Sentence Transformers
* feat: readd arctic models due to metadata
* 1.15.0
Automatically generated by python-semantic-release
* fix: Add Touche2020v3 and JMTEB (#1262)
* add datasets
* fix metrics
* add Touche2020v3
* fix metadata
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* upd name and suppress
* add benchmark class
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update tasks table
* 1.15.1
Automatically generated by python-semantic-release
* fix: Select benchmarks CLI option (#1261)
* add test case for a list of Benchmarks
* add selecting benchmarks CLI option
* typos
* use a separate attribute for benchmarks
* try fixing tests
* should accept string as well
* revert filename change
* use Benchmark and avoid circular import
* fix: derive `results_directory` path from `results_repo` name (#1275)
fix: don't hardcode repo name when downloading results
* 1.15.2
Automatically generated by python-semantic-release
* fix: sorting benchmark tasks by MTEB, then alphabetical (#1271)
* sorted
* fixed formatting
* efficiency changes
* fix test
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* 1.15.3
Automatically generated by python-semantic-release
* ci: Removed 3.8 dependency (#1281)
Changes include:
- remove 3.8 from tests (added 3.11 and 3.12)
- changed other CI to 3.9
- updated lint rules to use 3.8
* Update points table
* fix: Allow Numpy >=2.0 (#1264)
Allow Numpy >=2.0
* 1.15.4
Automatically generated by python-semantic-release
* docs: points for paper writing (#1286)
* Create 1004.jsonl
* Create 1006.jsonl
* Update docs/mmteb/points/1004.jsonl
* Update docs/mmteb/points/1006.jsonl
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update points table
* Update points table
* Update points table
* docs: Fix a link in the README (#1289)
* Fix a link in the README
And fix some typos.
* Update README.md
* Update points table
* fix: Update benchmarks (#1288)
* make benchmark var name uppercase
* update touche to v3
* add MIRACLRetrievalHardNegatives to multilingual
* add mteb(indic)
* add eu benchmark
* 1.15.5
Automatically generated by python-semantic-release
* fix: Allow numpy<2.0.0 (#1291)
* 1.15.6
Automatically generated by python-semantic-release
* fix: Add metadata dict to QBQTC in C-MTEB (#1292)
* fix QBQTC in C-MTEB
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* 1.15.7
Automatically generated by python-semantic-release
* fix: Remove non-existent eval split of CMNLI (#1294)
fix eval_splits of CMNLI
* 1.15.8
Automatically generated by python-semantic-release
* Leaderboard (#1235)
* Add leaderboard dev
* Renamed MTEBResults to TaskResult
* Moved model and model meta loading utilities into overview.py
* Added get_model_metas to retrieve filtered metadata for models
* Restructured results object and made it into a class instead of a dict
* Added utilities for filtering models on BenchmarkResults objects
* Added to_table utility function to BenchmarkResults
* Added serialization utilities to BenchmarkResults
* Attempted fixing tests
* Added get_model_metas to __init__
* Added get_benchmarks to __init__ and made it return all benchmarks by default
* Added get_benchmarks to __init__
* Made tasks hashable
* Added task filtering based on task objects on BenchmarkResults
* Added BenchmarkResults to __init__
* Added additional arguments to get_scores on two classes
* Made get_scores smarter on BenchmarkResult
* Added basic multilingual benchmark
* Modified benchmark to be able to easily access results
* Added useful properties and filtering functions to BenchmarkResults
* Added minimal functioning example
* Added smarter table, task-list updating and tried fixing dropdown scrolling
* Made restrict_results into a private function
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Removed old leaderboard scripts
* Hardcoded max and min model size
* Removed redundant utils file
* Ran linting
* added leaderboard dependencies as optional
* Fixed union type error on Python 3.9
* Removed references to Dict in task aggregation
* Fixed name errors in _restrict_task_results
* Fixed _restrict_task_results
* Made hf_subsets={'default'} when the task is monolingual in _restric_task_results
* Task dropdown now gets filtered based on the other criteria
* Ran linting again
* Introduced hotfix for reranking test
* Added BenchmarkResults to __all__ in __init__
* Fixed validate_and_filter_scores method, and replaced _restric_task_results with it
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* feat: Use prompts instead of encode_corpus and encode_queries (#1278)
* add prompt per task type
* fix prompt
* upd test
* lint
* fix test
* fix DeprecatedSummarizationEvaluator
* fix prompts
* add test
* lint
* logger info
* use task type only in model_encode
* lint
* update interface
* add prompt types to docs
* fix test
* mock tasks
* mock task registry
* remove last task_type
* fix tests
* lint
* fix test
* fix
* use wrapper and new prompts
* fix tests
* lint
* fix test
* remove conftest
* validate task to prompt_name
* override model prompts
* task to prompt name optional
* fix tests
* fix models
* remove task_to_prompt_name
* remove from mteb __init__
* update docs
* load existing model prompts if model_prompts is None
* fix
* lint
* change wrapper loader
* add wrapper class
* lint
* add wrapper file
* update logging
* upd logging
* refactor reranking
* lint
* remove prints
* 1.16.0
Automatically generated by python-semantic-release
* fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276)
* Add Retrieval SK Quad dataset for Slovak search evaluation
This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development.
* Add Retrieval SK Quad dataset for Slovak search evaluation 2
Added the requested changes on the SKQuadRetrieval.py file
* add task to init
* add missing task metadata
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks table
* 1.16.1
Automatically generated by python-semantic-release
* fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274)
* Add Slovak Hate Speech and Offensive Language
Dataset
This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.
* Add Slovak Hate Speech and Offensive Language Dataset
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
* Did requested changes:
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
* resolve linting issues by running `make lint`
* Update tasks table
* WIP: Leaderboard UI improvements (#1312)
* Fixed typos in task_results
* Fixed typos in task_results
* Added Tailwind, reorganized layout and fixed scrolling
* Ran linting
* 1.16.2
Automatically generated by python-semantic-release
* fix: remove duplicate multilingual
* 1.16.3
Automatically generated by python-semantic-release
* fix: Re-upload dataset to hub to avoid using script upload (#1322)
* fix dataset upload
* add linting
* Update tasks table
* 1.16.4
Automatically generated by python-semantic-release
* fix: Add implementations of common reranker models (#1309)
* init
* revert
* revert
* add metadata
* lint
* add reqs
* change to float16
* benchmark lint fix
* 1.16.5
Automatically generated by python-semantic-release
* Add multilingual mFollowIR dataset (#1308)
* add mFollowIR
* paper name
* edit warning->info
* convert to parquet
* lint
* Update tasks table
* Cache the embeddings when requested (#1307)
* add caching
* update test to use close
* change from json to pkl
* fix for Windows
* cleanup on Windows again
* infer dimension
* move cachewrapper
* add wrapper
* fix
* updates
* fix tests
* fix lint
* lint
* add test
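The caching behaviour added in this PR (embeddings persisted to disk as pickle, served on repeat requests) can be sketched with the stdlib only. Class and file naming here are illustrative, not mteb's actual cache-wrapper API:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class CachedEncoder:
    """Illustrative embedding cache: pickle vectors keyed by a hash of the inputs."""

    def __init__(self, model, cache_dir):
        self.model = model
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def encode(self, sentences):
        key = hashlib.sha256("\x00".join(sentences).encode()).hexdigest()
        path = self.cache_dir / f"{key}.pkl"
        if path.exists():  # cache hit: skip the model entirely
            return pickle.loads(path.read_bytes())
        embeddings = self.model.encode(sentences)
        path.write_bytes(pickle.dumps(embeddings))
        return embeddings


class CountingModel:
    """Toy model that records how often encode is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences):
        self.calls += 1
        return [[float(len(s))] for s in sentences]


model = CountingModel()
encoder = CachedEncoder(model, tempfile.mkdtemp())
first = encoder.encode(["a", "bb"])
second = encoder.encode(["a", "bb"])  # served from the pickle cache
```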
* WIP: Leaderboard UI improvements (#1320)
* Fixed typos in task_results
* Fixed typos in task_results
* Added Tailwind, reorganized layout and fixed scrolling
* Ran linting
* Removed faux benchmark
* Updated layout
* Changed table number format
* Table highlights highest values by making them bold
* Added rank to table, removed organization from model_name
* Added mean rank to table
* Ran linting
* feat: Update metadata for all models (#1316)
* Added model meta
* format
* fixed metadata
* Metadata update for voyage models
* Update mteb/models/cohere_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/cohere_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Added corrections from review
* fix spelling error
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* resolved bugs from pytest --collect-only
* Avoid wrapping all models with the SentenceTransformerWrapper
* Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations
* fixed moved on correction from @Samoed
* conditionally set .predict method on SentenceTransformerWrapper
---------
Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>
* [mieb] Add OpenCLIP models (#1335)
* add open clip models
* Update __init__.py
* lint
* fix model overview
* update jina clip
---------
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
* [mieb] new version with downsampled train split to 32 per class (#1327)
* new version with downsampled train split to 32 per class
* force load truncated image file
* make lint
* add open clip models
* Update __init__.py
* lint
* fix model overview
* fix ImageCLS undersample; run birdsnap
* make lint
* make lint
---------
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
* [mieb] Fix Jina CLIP (#1349)
fix jina clip v1
* fix: Add clevr license (#1356)
* Add BLINK as multi-choice tasks (#1348)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* add AbsTaskAny2AnyMultiChoice
* make lint
* remove GLDv2I2IRetrieval
* exclude AbsTaskAny2AnyMultiChoice from test_load_data
* fix e5v&vista
* remove duplicate corpus entries from BLINKIT2TRetrieval dataset
* task type fix for running tasks
* update BLINKIT2T metadata
* fix wrong meta
* run mieb script
* split ROxford, RParis into easy, medium and hard
* make lint
* add BLINK as multi choice tasks
* fix: license metadata in wrong format
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
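The cluster accuracy metric added alongside ARI and NMI in the commits above can be sketched as follows. This is a stdlib-only illustration with a hypothetical function name; in practice the evaluator can take ARI/NMI from `sklearn.metrics` and compute accuracy with a Hungarian assignment rather than brute force:

```python
from itertools import permutations


def cluster_accuracy(labels_true: list[int], labels_pred: list[int]) -> float:
    """Fraction of samples correctly clustered under the best relabelling of
    predicted cluster ids. Brute-forces the permutation, so it is only suitable
    for small numbers of clusters (real evaluators use the Hungarian algorithm).
    Assumes there are no more predicted clusters than true classes."""
    true_ids = sorted(set(labels_true))
    pred_ids = sorted(set(labels_pred))
    best = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))  # predicted id -> candidate true label
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best = max(best, hits)
    return best / len(labels_true)


# A perfect clustering with permuted cluster ids still scores 1.0:
print(cluster_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```

Unlike plain accuracy, this score is invariant to how the clustering algorithm happens to number its clusters, which is why the relabelling step is needed.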
* [mieb] add Eva CLIP models (#1369)
* add Eva CLIP models
* make lint
* [mieb] add siglip, cohere multimodal & some fixes for final run (#1357)
* fix dataset type error
* fix clustering metrics
* add siglip & cohere
* update mieb run script
* cohere-v import
* fix
* api key name
* [mieb] fixes for final run (#1374)
* e5_v device arg
* dataloader num_workers
* vista doc
* vista doc
* run mieb
* fix
* Update run_vista.md
* [mieb] Fix torch no grad (#1378)
Fix torch no grad
* [mieb] Fix vlm2vec (#1380)
* fix vlm2vec return dtype
* make lint
* [mieb] Remove null entries from corpus of ROxford, RParis (#1371)
* remove null examples from corpus of ROxford and RParis
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] fixes (#1390)
* Fix torch no grad
* simplify
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [MIEB] Remove non-existent method for blip (#1394)
remove non-existent method for blip
* [mieb] fix ALIGN; update Winoground revision id; update run script (#1391)
* fix align & winoground
* lint
* Convert task category to i2i for tasks that only call image encode
* update categories should include img cls, clustering, and multi label clf
* no op
* no op
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [mieb] Fix open clip for cv bench count (#1397)
fix shape mismatch
* [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403)
* fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice
* update blink metadata
* add updated BLINK results
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] Fix EVA CLIP for CV Bench (#1414)
* unsqueeze after preprocess
* make lint
* [mieb] Add calculate probs for vlm2vec (#1418)
* add method
* make lint
* [mieb] Fix siglip bug & add retrieval datasets (#1424)
* fix siglip
* add edis&gld-v2 i2i
* results
* siglip updated results
* fix siglip non-dataloader tasks
* [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420)
* use moc-lr classifier
* set n_experiments=5
* run dinov2 and some laion models
* add dinov2-giant results
* [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429)
* mieb scripts
* lint
* [MIEB] Change Flickr30k to test split (#1449)
* merge upstream mieb
* change Flickr30k to test split
* change flickr to test split
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] Fix VLM2vec dtype (#1462)
* propagate dtype
* fix fuse embeddings using list of PIL images
* [mieb] run script for missing results (#1472)
* task type fix
* scripts
* [mieb] Fix Moco model on CIFAR10Clustering (#1487)
Fix Moco model on CIFAR10Clustering
* [mieb] Fix Flickr30k I2T and T2I (#1505)
* remake flickr30k it2 and t2i
* add openai clip vit-b32 b16 and jina-clip results
* make lint
* [MIEB] add missing siglip models  (#1533)
* add updates
* lint errors
* fix typo (#1535)
* add updates
* lint errors
* fix small typo
* [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544)
fix numbers
* Discussing a standard for ImageEncoders
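One way the ImageEncoder standard under discussion could be pinned down is as a structural `typing.Protocol`; the method names below are purely illustrative, not the agreed-upon API:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ImageEncoder(Protocol):
    """Hypothetical encoder contract: image and text embeddings behind one interface."""

    def get_image_embeddings(self, images: list[Any], batch_size: int = 32) -> Any: ...

    def get_text_embeddings(self, texts: list[str], batch_size: int = 32) -> Any: ...


class DummyEncoder:
    """Satisfies the protocol structurally; no inheritance required."""

    def get_image_embeddings(self, images, batch_size=32):
        return [[0.0] for _ in images]

    def get_text_embeddings(self, texts, batch_size=32):
        return [[0.0] for _ in texts]


print(isinstance(DummyEncoder(), ImageEncoder))  # True
```

A protocol keeps existing wrappers (CLIP, BLIP, SigLIP, ...) usable without a shared base class, while still letting the task runner check at runtime which modalities a model supports.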
* Add Voyage's multimodal embedding (#1555)
* add voyage multimodal & ran 17 tasks
* lint
* typo
* clean
* [mieb] update script for final re-run (#1576)
* mieb final runs
* lint
* fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572)
* fix: no longer using same query text for all of BLINKIT2TMultiChoice
* fix: remove blink subtask
* fix: remove subtask from blink it2i
* fix: align BLINK retrieval to multi choice
* add ROxford and RParis I2I multi choice
* add retrieval metrics to multi choice evaluator
* fix: remove wrong negatives from revisiting multichoice datasets
* fix revisiting datasets
* add new results for revisiting multichoice
* [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583)
* 1. Make `get_xxx_embeddings` follow `encode`.
2. `ImageDataset.transform` could be `None`.
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Fix arguments
* Try to fix tests
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* fix image encoder (#1596)
* format
* fixed tests
* lint
* [mieb] voyage-v: add exponential backoff and other error handling (#1610)
* add voyage multimodal & ran 17 tasks
* lint
* typo
* clean
* exponential backoff tmp
* downsize large images for voyage api call
* voyage error handling
* lint
* add more results
* make tenacity optional
* lint
* log
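The exponential backoff added for the Voyage API calls above (via the optional `tenacity` dependency) follows a standard retry pattern; here is a stdlib-only sketch with hypothetical names, not the wrapper's actual code:

```python
import random
import time


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 60.0):
    """Call fn(), retrying on any exception with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries out


calls = {"n": 0}


def flaky():
    """Simulates an API that rate-limits the first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


print(call_with_backoff(flaky, base_delay=0.01))  # "ok"
```

In a batch evaluation this matters because a single 429 response would otherwise abort hours of API-backed embedding work.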
* [MIEB] Fix `get_fused_embeddings` (#1612)
* Fix fused
* fix vlm2vec
* Fix lint
* [MIEB] Add new multimodal retrieval tasks (#1611)
* Add new tasks
* Fix score type
* [MIEB] Switch to ViDoRe BEIR version (#1607)
* Fix ViDoRe corpus
* fix lint
* ViDoRe beir version
* Extend MIEB test coverage (#1629)
* add one task from each image AbsTask to test grid
* add visual sts to test grid
* [mieb] Task filtering by modality supported by models (#1633)
* fix function signature for moco loader
* filter out tasks by model modalities
* correct conditions
* add model meta to relevant models
* use modalities instead and separate out constants
* [MIEB] Fix VISTA model (#1638)
Fix vista
* Warn (#1639)
* [mieb] model task modalities matching logic (#1640)
fixing task & model modalities matching logic
* [mieb] Use mock abstask classes (#1648)
* rename to downsampled_dataset_transform
* add mock tasks for mieb
* wip getting to 57%
* make lint
* update mock classes to improve coverage
* omit mock tasks from some tests
* [MIEB] Add code for GME models (#1635)
* Add GME
* Fix infoseek prompts
* Merge instructions
* fix: add version check e5-v in mieb (#1723)
* add version check for e5v model
* Update e5_v.py
* make lint
* fix: change comparison to greater-than (#1743)
change comparison to greater-than
* docs: Rework MIEB docs (#1802)
* combine mieb docs and move to main docs folder
* make flow more coherent
* tidy up
* skip AfriSentiLID for now #1785
* fix typo: exclude MIEB mock tests
* update vista doc
* Apply suggestions from code review
---------
Co-authored-by: Isaac Chung <isaac.chung@team.wrike.com>
* [mieb] Remove results-mieb folder (#1815)
remove results-mieb folder
* [mieb] fixing lrap computation for multi-label classification (#1834)
multi-label cls lrap computation fix
* [mieb] Merge from main (#1853)
* Update tasks table
* 1.19.0
Automatically generated by python-semantic-release
* fix: Add the_ugly_duckling.txt for speedtask to Python wheel (#1402)
Add the_ugly_duckling.txt for speedtask to Python wheel
* 1.19.1
Automatically generated by python-semantic-release
* fix: Added the necessary trust_remote_code (#1406)
* 1.19.2
Automatically generated by python-semantic-release
* docs: Update recommendation for pushing results (#1401)
fix: Update recommendation for pushing results
* docs: Fix a typo in README (#1430)
Fix typo in readme
* fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398)
Fixes #1389
* 1.19.3
Automatically generated by python-semantic-release
* fix: make samples_per_label a task attribute (#1419)
make samples_per_label a task attr
* fix: Add Korean AutoRAGRetrieval (#1388)
* feat: add AutoRAG Korean embedding retrieval benchmark
* fix: run --- 🧹 Running linters ---
ruff format . 			# running ruff formatting
716 files left unchanged
ruff check . --fix  	# running ruff linting
All checks passed!
* fix: add metadata for AutoRAGRetrieval
* change link for markers_bm
* add AutoRAGRetrieval to init.py and update metadata
* add precise metadata
* update metadata: description and license
* delete descriptive_stats in AutoRAGRetrieval.py and run calculate_matadata_metrics.py
* fix: Add missing benchmarks in benchmarks.py (#1431)
Fixes #1423
* Update tasks table
* 1.19.4
Automatically generated by python-semantic-release
* Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437)
* Added elementary speed/performance plot
* Refactored table formatting code
* Bumped Gradio version
* Added more general info to benchmark description markdown block
* Adjusted margin an range on plot
* Made hover information easier to read on plot
* Made range scaling dynamic in plot
* Moved citation next to benchmark description
* Made titles in benchmark info bold
* Leaderboard: Fixed code benchmarks (#1441)
* fixed code benchmarks
* fix: Made n_parameters formatting smarter and more robust
* fix: changed jina-embeddings-v3 number of parameters from 572K to 572M
* fix: Fixed use_instuctions typo in model overview
* fix: Fixed sentence-transformer compatibility switch
* Ran linting
* Added all languages, tasks, types and domains to options
* Removed resetting options when a new benchmark is selected
* All results now get displayed, but models that haven't been run on everything get nan values in the table
* fix: Count unique texts, data leaks in calculate metrics (#1438)
* add more stat
* add more stat
* update statistics
* fix: update task metadata to allow for null (#1448)
* Update tasks table
* 1.19.5
Automatically generated by python-semantic-release
* Fix: Made data parsing in the leaderboard figure more robust (#1450)
Bugfixes with data parsing in main figure
* Fixed task loading (#1451)
* Fixed task result loading from disk
* Fixed task result loading from disk
* fix: publish (#1452)
* 1.19.6
Automatically generated by python-semantic-release
* fix: Fix load external results with `None` mteb_version (#1453)
* fix
* lint
* 1.19.7
Automatically generated by python-semantic-release
* WIP: Polishing up leaderboard UI (#1461)
* fix: Removed column wrapping on the table, so that it remains readable
* Added disclaimer to figure
* fix: Added links to task info table, switched out license with metric
* fix: loading pre 1.11.0 (#1460)
* small fix
* fix: fix
* 1.19.8
Automatically generated by python-semantic-release
* fix: swap touche2020 to maintain compatibility (#1469)
swap touche2020 for parity
* 1.19.9
Automatically generated by python-semantic-release
* docs: Add sum per language for task counts (#1468)
* add sum per lang
* add sort by sum option
* make lint
* fix: pinned datasets to <3.0.0 (#1470)
* 1.19.10
Automatically generated by python-semantic-release
* feat: add CUREv1 retrieval dataset (#1459)
* feat: add CUREv1 dataset
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
* feat: add missing domains to medical tasks
* feat: modify benchmark tasks
* chore: benchmark naming
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
* Update tasks table
* 1.20.0
Automatically generated by python-semantic-release
* fix: check if `model` attr of model exists (#1499)
* check if model attr of model exists
* lint
* Fix retrieval evaluator
* 1.20.1
Automatically generated by python-semantic-release
* fix: Leaderboard demo data loading (#1507)
* Made get_scores error tolerant
* Added join_revisions, made get_scores failsafe
* Fetching metadata fixed for HF models
* Added failsafe metadata fetching to leaderboard code
* Added revision joining to leaderboard app
* fix
* Only show models that have metadata, when filter_models is called
* Ran linting
* 1.20.2
Automatically generated by python-semantic-release
* fix: leaderboard only shows models that have ModelMeta (#1508)
Filtering for models that have metadata
* 1.20.3
Automatically generated by python-semantic-release
* fix: align readme with current mteb (#1493)
* align readme with current mteb
* align with mieb branch
* fix test
* 1.20.4
Automatically generated by python-semantic-release
* docs: Add lang family mapping and map to task table (#1486)
* add lang family mapping and map to task table
* make lint
* add back some unclassified lang codes
* Update tasks table
* fix: Ensure that models match the names on embedding-benchmarks/results (#1519)
* 1.20.5
Automatically generated by python-semantic-release
* fix: Adding missing metadata on models and matching names up with the results repo (#1528)
* Added Voyage 3 models
* Added correct metadata to Cohere models and matched names with the results repo
* 1.20.6
Automatically generated by python-semantic-release
* feat: Evaluate missing splits (#1525)
* fix: evaluate missing splits (#1268)
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* got test_all_splits_evaluated passing
* tests passing
* address review comments
* make lint
* handle None cases for kg_co2_emissions
* use new results info
---------
Co-authored-by: Thivyanth <thivyanth2004@gmail.com>
* 1.21.0
Automatically generated by python-semantic-release
* fix: Correct typos superseeded -> superseded (#1532)
fix typo -> superseded
* 1.21.1
Automatically generated by python-semantic-release
* fix: Task load data error for SICK-BR-STS and XStance (#1534)
* fix task load data for two tasks
* correct dataset keys
* 1.21.2
Automatically generated by python-semantic-release
* fix: Proprietary models now get correctly shown in leaderboard (#1530)
* Fixed showing proprietary models in leaderboard
* Added links to all OpenAI models
* Fixed table formatting issues
* Bumped Gradio version
* 1.21.3
Automatically generated by python-semantic-release
* docs: Add Model Meta parameters and metadata (#1536)
* add multi_qa_MiniLM_L6_cos_v1 model meta
* add all_mpnet_base_v2
* add parameters to model meta
* make lint
* add extra params to meta
* fix: add more model meta (jina, e5) (#1537)
* add e5 model meta
* address review comments
* 1.21.4
Automatically generated by python-semantic-release
* Add cohere models (#1538)
* fix: bug cohere names
* format
* fix: add nomic models (#1543)
#1515
* fix: Added all-minilm-l12-v2 (#1542)
#1515
* fix: Added arctic models (#1541)
#1515
* fix: add sentence trimming to OpenAIWrapper (#1526)
* fix: add sentence trimming to OpenAIWrapper
* fix: import tiktoken library inside encode function
* fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name
* fix: pass tokenizer_name, max_tokens to loader
* fix: make tokenizer_name None for default
* fix: delete changes for ModelMeta
* fix: fix revision to 2 for OpenAI models
* fix: add docstring for OpenAIWrapper
* fix: lint
* feat: add openai optional dependency set
* fix: add sleep for too many requests
* fix: add lint
* fix: delete evaluate file
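The sentence-trimming fix described above (encode, slice to the model's token limit, decode back) is a general pattern. A sketch using a stand-in whitespace tokenizer; the real wrapper would pass tiktoken's `encode`/`decode` for the model's encoding instead:

```python
def truncate_to_max_tokens(text: str, max_tokens: int, encode, decode) -> str:
    """Encode text, keep at most max_tokens tokens, and decode back to a string."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within the limit; return unchanged
    return decode(tokens[:max_tokens])


# Stand-in tokenizer for illustration: whitespace-separated words.
# With tiktoken this would be enc.encode / enc.decode.
encode = str.split
decode = " ".join

print(truncate_to_max_tokens("one two three four five", 3, encode, decode))
# "one two three"
```

Truncating in token space rather than character space is what prevents the API's 400 errors: character counts only loosely bound token counts, so slicing the string directly can still exceed the model's limit.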
* 1.21.5
Automatically generated by python-semantic-release
* fix: Fixed metadata errors (#1547)
* 1.21.6
Automatically generated by python-semantic-release
* fix: remove curev1 from multlingual (#1552)
Seems like it was added here:
https://github.com/embeddings-benchmark/mteb/commit/1cc6c9e0fe62ca4e77708b641823fa1a121f048b
* 1.21.7
Automatically generated by python-semantic-release
* fix: Add Model2vec (#1546)
* Added Model2Vec wrapper
* Added Model2vec models
* Added model2vec models to registry
* Added model2vec as a dependency
* Ran linting
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Added adapted_from and superseeded_by to model2vec models.
* Added missing import
* Moved pyproject.toml to optional dependencies
* Fixed typos
* Added import error and changed model to model_name
* Added Numpy to frameworks
* Added Numpy to frameworks
* Corrected false info on model2vec models
* Replaced np.inf with maxint
* Update mteb/models/model2vec_models.py
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Added option to have infinite max tokens, added it to Model2vec
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Made result loading more permissive, changed eval splits for HotPotQA and DBPedia (#1554)
* Removed train and dev from eval splits on HotpotQA
* Removed dev from eval splits on DBPedia
* Made task_results validation more permissive
* Readded exception in get_score
* Ran linting
* 1.21.8
Automatically generated by python-semantic-release
* docs: Correction of SICK-R metadata (#1558)
* Correction of SICK-R metadata
* Correction of SICK-R metadata
---------
Co-authored-by: rposwiata <rposwiata@opi.org.pl>
* feat(google_models): fix issues and add support for `text-embedding-005` and `text-multilingual-embedding-002` (#1562)
* fix: google_models batching and prompt
* feat: add text-embedding-005 and text-multilingual-embedding-002
* chore: `make lint` errors
* fix: address PR comments
* 1.22.0
Automatically generated by python-semantic-release
* fix(bm25s): search implementation (#1566)
fix: bm25s implementation
* 1.22.1
Automatically generated by python-semantic-release
* docs: Fix dependency library name for bm25s (#1568)
* fix: bm25s implementation
* correct library name
---------
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
* fix: Add training dataset to model meta (#1561)
* fix: Add training dataset to model meta
Addresses #1556
* Added docs
* format
* feat: (cohere_models) cohere_task_type issue, batch requests and tqdm for visualization (#1564)
* feat: batch requests to cohere models
* fix: use correct task_type
* feat: use tqdm with openai
* fix: explicitly set `show_progress_bar` to False
* fix(publichealth-qa):  ignore rows with `None` values in `question` or `answer` (#1565)
* 1.23.0
Automatically generated by python-semantic-release
* fix: Added metadata for miscellaneous models (#1557)
* Added script for generating metadata, and metadata for the listed models
* Added misc models to overview
* Fixed misc metas
* Removed unnecessary imports
* Added logic to retrieve base model information
* Added base models to misc meta
* Added superseded_by to sentence-croissant models
* Added training datasets to mis models
* 1.23.1
Automatically generated by python-semantic-release
* fix: Added radar chart displaying capabilities on task types (#1570)
* Added radar chart displaying capabilities on task types
* Fixed table aggregation in leaderboard
* Spelled out why instructionretrieval is excluded
* 1.23.2
Automatically generated by python-semantic-release
* feat: add new arctic v2.0 models (#1574)
* feat: add new arctic v2.0 models
* chore: make lint
* 1.24.0
Automatically generated by python-semantic-release
* fix: Ad…
isaac-chung added a commit that referenced this pull request Feb 5, 2025
* mieb ZeroshotClassification

* mieb docs

* mieb implementation demo

* model meta; abstask column names; linear probe clf

* model meta; abstask column names; linear probe clf

* fix: update naming as candidate_labels

* Update README.md

* Update README.md

* i2tretrieval

* test load data ignore i2tretrieval

* [MIEB] Add image clustering (#1088)

* make lint
* wip
* add TinyImageNet and run
* type hints
* add accuracy
* lint

* remove unused & fix typos

* T2I Retrieval

* Any2AnyRetrieval

* fix tests from merge

* [MIEB] Add image text pair classification and tests (#1099)

* add ImageTextPairClassification abstask and evaluator

* dataset transform into sequence of images for each sample

* fix processing logic; list of list images compatibility

* lint and docstrings

* make lint

* fix failing tests in TaskMetadata

* add tests for mieb

* skip gated repo

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [MIEB] Add image classification and zero shot classification tasks (#1101)

* fix task metadata

* use overrideable column names

* add CIFAR datasets

* add caltech101 dataset

* add FGVC aircraft dataset

* add food 101 dataset

* add OxfordPets dataset

* remove comments

* correct cifar100 path

* update cifar100 classification results

* cifar zero shot results

* add caltech101 zero shot

* matching CLIP paper implementation

* add aircraft and food zero shot

* add oxford pets zero shot

* [MIEB] Add CIFAR clustering (#1104)

add CIFAR clustering

* [MIEB] Add more image classification and zero shot classification datasets (#1103)

* update category to i2t

* add MNIST linear probe and zero shot

* add FER2013 linear probe and zero shot

* add stanford cars linear probe and zero shot

* add birdsnap linear probe and zero shot

* add eurosat linear probe and zero shot

* lint

* correct eurosat zero shot labels

* add abstask for image multilable and voc2007

* make lint

* [MIEB] Add more image classification and zero shot datasets (#1105)

* add STL10 linear probe and zero shot

* add RESISC45 linear probe and zero shot

* add Describable textures linear probe and zero shot

* fix spacing lint

* add SUN397 linear probe and zero shot

* correct SUN397 zero shot captions

* add baai bge vista

* add e5-v

* linting

* memory issues for image linear probe & zeroshot

* knn linear probe arguments

* del comments

* Add some classification and ZeroShot classification tasks (#1107)

* Add Country211 classification task

* Add imagenet1k classification task

* Add UCF101 classification task

* Add PatchCamelyon Classification task

* Add GTSRB classification task

* Add GTSRB Zero Shot Classification

* Add country211 zero shot classification

* Add results for classification tasks

* Add zero shot classification tasks

* Add PatchCamelyon tasks and results

* Add linting

* Add results and fix prompts for zero shot

* Add results

* Add results and linting

* fix dependency & clip mock test

* [MIEB] Add jina clip (#1120)

* add jina clip and mscoco i2t and t2i results

* make lint

* [MIEB] Update `mieb` with the `main` branch and some fixes (#1126)

* fix instruction retrieval (#1072)

* fix instruction retrieval

* fix test

* add points

* make nested results

* add test

* skip instruction test

* fix instruction passes

* fix unions

* move do_length_ablation

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update points table

* fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106)

found bug

* 1.12.85

Automatically generated by python-semantic-release

* fix: MultilingualSentimentClassification (#1109)

* Update points table

* fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks

* 1.12.86

Automatically generated by python-semantic-release

* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112)

* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended

* fix: fixed formatting for cli

* docs: improve searchability in the advanced usage documentation

* 1.12.87

Automatically generated by python-semantic-release

* docs: improve searchability in the advanced usage documentation (#1113)

* docs: improve searchability in the advanced usage documentation

* docs: update based on corrections

* fix: export type for `mteb create_meta` (#1114)

* fix export type

* fix dataset version too

* 1.12.88

Automatically generated by python-semantic-release

* fix: Simplify models implementations (#1085)

* Merge

* Adapt

* Simplify

* Check for rev again

* Rmv cmmnt

* Simplify

* simplify

* Rmv comment

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Use logging; change try except; add info

* Lint

* Rmv results

* Update rev

* format

* Simplify models; Allow instructions

* Jobs

* Fix merge

* Format

* Adapt models

* fix: ensure that e5 ignores the NQ

* format

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* 1.12.89

Automatically generated by python-semantic-release

* fix: nomic models using prefix correctly (#1125)

* fix: nomic models using prefix correctly

* chore: remove comment

* fix: handling in case not torch tensor

* Fix typo

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* 1.12.90

Automatically generated by python-semantic-release

* refactor vista model wrapper to contain lib import

* python 38 type hints

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: anpalmak2003 <73543260+anpalmak2003@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Zach Nussbaum <zanussbaum@gmail.com>
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>

* image memory issues for all retrieval AbsTasks

* Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127)

* Add CLEVR and SciMMIR

* Update metadata

* remove useless comment

* Add linting

* fix typo and tests

* Add CLEVR count task

* add linting

* add fashion200k & fashionIQ test passed

* clip text max seq truncation

* add WebQA, NIGHTS, OVEN

* any2any retrieval chunk encoding

* add nomic vision model; any2any topk bug

* add cv recall

* add InfoSeek; VisualNews

* [MIEB] Add Stanford Cars i2i Retrieval (#1147)

* wip

* add results

* make lint

* change back the order

* [MIEB] Add CUB200 i2i retrieval (#1154)

* add cub200 and results

* add skip_first_result

* skipped self and rerun results
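When the query images are drawn from the corpus itself (as in these i2i retrieval tasks), the top-ranked hit is trivially the query image, so the `skip_first_result` flag drops it before scoring. A minimal sketch of that idea, with hypothetical scores (not the evaluator's actual code):

```python
# Sketch of skip_first_result: when queries come from the corpus itself,
# the best match is the query, so discard rank 1 before taking top-k.
def top_k_hits(scores, k, skip_first_result=False):
    """Return indices of the top-k corpus items for one query."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if skip_first_result:
        ranked = ranked[1:]  # drop the self-match
    return ranked[:k]

# Query 0 scored against a 4-item corpus; index 0 is the query itself.
scores = [1.0, 0.9, 0.2, 0.5]
print(top_k_hits(scores, k=2))                          # [0, 1]
print(top_k_hits(scores, k=2, skip_first_result=True))  # [1, 3]
```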

* consolidate i2t and t2i to any2any

* remove abstask and evaluators

* remove references from test

* add TU-Berlin sketch retrieval

* XM3600; XFlickr30kCO; multilingual

* wit multilingual retrieval t2i

* correct multilingual t2i meta

* meta

* add dinov2 model; 4 sizes

* cls evaluator channel bug fix

* add ALIGN model

* add FORBI2IRetrieval

* FORB & TU-Berlin new revision

* disable tokenization parallelism

* add hateful meme retrieval i2tt2i

* add memotion retrieval t2ii2t

* add SciMMIR Retrieval i2tt2i

* ruff update

* Visual STS Abstask&evaluator

* add visual STS17

* add visual STS 12-16

* [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
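The commit above adds cluster accuracy, ARI, and NMI to the image clustering evaluator. A sketch of how these three metrics are commonly computed (cluster accuracy matches predicted cluster ids to labels with a Hungarian assignment); an illustration only, not the evaluator's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def cluster_accuracy(labels_true, labels_pred):
    """Best-match accuracy: align predicted cluster ids to true labels
    via the Hungarian algorithm, then score as plain accuracy."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = max(labels_true.max(), labels_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1  # co-occurrence counts
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matches
    return cost[row, col].sum() / labels_true.size

true = [0, 0, 1, 1]
pred = [1, 1, 0, 0]  # same partition, permuted cluster ids
print(cluster_accuracy(true, pred))              # 1.0
print(adjusted_rand_score(true, pred))           # 1.0
print(normalized_mutual_info_score(true, pred))  # 1.0
```

All three are invariant to cluster-id permutation, which is why the permuted prediction scores perfectly.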

* add imagenet-10 clustering task

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* [mieb] add 3 compositionality evaluation tasks (#1229)

* linting & update unavailable dataset path

* add aro visual relation&attribution; sugarcrepe

* correct reference

* add SOPI2IRetrieval dataset/task (#1232)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* change reference

* Image text pair cls (#1233)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix meta data

* fix validate points

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Add RP2kI2IRetrieval and METI2IRetrieval (#1239)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* [MIEB] Adding DataComp CLIP models (#1283)

* adding data comp CLIP models

* update model and caltech101 results

* make lint

* [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix meta data

* fix validate points

* CV-Bench

* evaluator args comment

* fix

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [mieb] adding 10 tasks (#1290)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add vidore benchmark 10 tasks

* fix reference

* fix old metadata

* fix meta

* [mieb] Adding MOCOv3 models (#1293)

* add moco models first try

* add as a timm model

* add large model results

* make lint

* [mieb] Add more Any2AnyRetrieval datasets (#1285)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* remove GLDv2I2IRetrieval

* [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* [mieb] Fix FORB dataset (#1306)

* correct format

* update results

* add more results

* add more results

* [mieb] run tasks fix (#1302)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix e5v&vista

* task type fix for running tasks

* fix wrong meta

* run mieb script

* script

* lint

* align

* [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* fix e5v&vista

* remove duplicate corpus entries from BLINKIT2TRetrieval dataset

* task type fix for running tasks

* update BLINKIT2T metadata

* fix wrong meta

* run mieb script

* split ROxford, RParis into easy, medium and hard

* make lint

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] run tasks small fix (#1310)

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* fix e5v&vista

* task type fix for running tasks

* fix wrong meta

* run mieb script

* script

* lint

* align

* fix

* linting

* [mieb] Add VLM2vec (#1323)

* wip vlm2vec model

* making i2t classification work with Caltech101

* test vlm2vec on other task types

* move peft into class

* feat: Merge main into MIEB (#1329)

* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203)

* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201)

Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements

- Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements
- Reference: OpenAI's Embedding API documentation on input limits

Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
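The fix caps the number of inputs per embeddings request at OpenAI's documented 2048-element limit. A minimal sketch of the batching logic in pure Python, where `embed_batch` is a hypothetical stand-in for the actual API call:

```python
MAX_INPUTS = 2048  # OpenAI embeddings API limit on inputs per request

def embed_all(sentences, embed_batch):
    """Embed an arbitrarily long list by slicing it into <=2048-element
    chunks and concatenating the per-chunk results."""
    embeddings = []
    for start in range(0, len(sentences), MAX_INPUTS):
        embeddings.extend(embed_batch(sentences[start:start + MAX_INPUTS]))
    return embeddings

# Stand-in for the API: "embeds" each string as its length.
fake_api = lambda batch: [len(s) for s in batch]
out = embed_all(["a"] * 5000, fake_api)
print(len(out))  # 5000
```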

* fix ruff formatting

* Added minor test fixes to ensure reproducibility across systems

* Ensure that tmp.json is not created within repo when running tests

* format

* fixes path issues

* Rerun CI

---------

Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>

* fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207)

Fixes #1206

* 1.14.16

Automatically generated by python-semantic-release

* fix: Normalize licenses including casing, uses of "-" etc.

* fix: Normalize licenses including casing, uses of "-" etc. (#1210)

* fix: Normalize licenses including casing, uses of "-" etc.

* fix tests

* 1.14.17

Automatically generated by python-semantic-release

* fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208)

* Normalize benchmarks to only include tasks

- Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented
- implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks
- Added tests + updated docs

A few outstanding issues:

I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) requires the split to be specified. A solution is to allow `eval_splits` to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following:

`mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)`

I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution for this is to allow nested benchmarks.
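The normalization described above can be illustrated schematically: a benchmark holds task objects rather than task names, so a reference to an unimplemented task fails at definition time, and a getter fetches benchmarks by name. This is a pure-Python sketch of the pattern, not mteb's actual implementation; the registry contents are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str

@dataclass
class Benchmark:
    name: str
    tasks: tuple  # task objects, never bare strings

# Illustrative registry; real benchmarks contain many tasks.
BENCHMARK_REGISTRY = {
    "MTEB(eng)": Benchmark("MTEB(eng)", (Task("Banking77Classification"),)),
}

def get_benchmark(name):
    """Look up a benchmark by name, mirroring the idea of `mteb.get_benchmark`."""
    if name not in BENCHMARK_REGISTRY:
        raise KeyError(f"Unknown benchmark: {name}")
    return BENCHMARK_REGISTRY[name]

bench = get_benchmark("MTEB(eng)")
print([t.name for t in bench.tasks])  # ['Banking77Classification']
```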

* fix error in tests

* format

* Added corrections based on review

* added example and formatted

* 1.14.18

Automatically generated by python-semantic-release

* docs: Fix broken links in docs (#1212)

* Added fixes for broken links in adding_a_dataset and adding_a_model docs.

* Updated link name

* Mismatch of the category of AmazonPolarityClassification (#1220)

Fixes #1219

* Update tasks table

* fix: Ensure that results are returned even when hitting cache (#1215)

Fixes #1122

* 1.14.19

Automatically generated by python-semantic-release

* fix: Allow benchmark to specify eval_splits (#1217)

* fix: Allow benchmark to specify eval_splits

This PR allows benchmarks to specify specific evaluation splits, which lets us fully specify a benchmark within the benchmark object.

To do this it adds the following:
- added eval_splits to the AbsTask object, which defaults to metadata.eval_splits
- use the task.eval_splits unless overwritten in mteb.MTEB.run
- added eval_splits arg to mteb.get_tasks, which filters the tasks based on splits
- updated documentation
  - renamed the "Advanced Usage" to "Usage Documentation" to make it more accessible
- added tests where relevant
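The override order described in these bullets can be sketched in pure Python: a task defaults to its metadata's splits, and `get_tasks` may narrow them while filtering out tasks that lack the requested split. Names and data here are illustrative, not mteb's actual code:

```python
# Sketch of the eval_splits override: metadata supplies the default,
# an explicit argument narrows it.
class Task:
    def __init__(self, name, metadata_eval_splits, eval_splits=None):
        self.name = name
        self.metadata_eval_splits = metadata_eval_splits
        # default to the metadata splits unless explicitly overridden
        self.eval_splits = eval_splits or metadata_eval_splits

def get_tasks(tasks, eval_splits=None):
    """Keep only tasks offering every requested split, narrowed to it."""
    if eval_splits is None:
        return tasks
    return [
        Task(t.name, t.metadata_eval_splits, eval_splits)
        for t in tasks
        if set(eval_splits) <= set(t.metadata_eval_splits)
    ]

all_tasks = [Task("A", ["dev", "test"]), Task("B", ["dev"])]
selected = get_tasks(all_tasks, eval_splits=["test"])
print([(t.name, t.eval_splits) for t in selected])  # [('A', ['test'])]
```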

* Added correction based on feedback

* 1.14.20

Automatically generated by python-semantic-release

* Update points table

* Update points table

* docs: clarify adding a model (#1222)

* fix: Add RepLLaMA style models (#1223)

* init commit

* working and reproducing

* lint

* update hashes

* warning

* add pyproject

* Update points table

* 1.14.21

Automatically generated by python-semantic-release

* docs: Update points (#1228)

* Fix case

* Fix casing

* Fix case

* Fix case

* Create 971.jsonl

* Update contrib

* Add contributors

* Update points table

* docs: Add MTEB(code) dataset (#1237)

* docs: Add MTEB(code) dataset

* Fix linting

* Update points table

* Update of my affiliation (#1242)

Update points.md

* Add contributor (#1243)

* fix: @mrshu's name in `points.md` (#1246)

* Use the diacritic character to be in line with Slovak spelling.

Signed-off-by: mr.Shu <mr@shu.io>

* docs: Create benchmarks overview table (#1245)

* fix get_benchmarks method

* add create benchmark script

* make lint

* 1.14.22

Automatically generated by python-semantic-release

* docs: Update affiliation (#1247)

Update points.md

* Added author-information

* Add final author list

* Update points table

* docs: Added coordination point for Jimmy Lee  (#1253)

docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan

* Update points table

* fix: Add multilingual Benchmark (#1252)

* fix: Add multilingual bench

* Update mteb/benchmarks/benchmarks.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* format

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* 1.14.23

Automatically generated by python-semantic-release

* docs: Small point changes & more contributors (#1254)

* Update points.md

* Fix format

* Fix attribution

* Update points table

* fix: Downsample large retrieval datasets (#1236)

* most tasks

* lint

* fix other issues

* refactor

* lint and docs

* add polish

* keep case sensitive mteb paths

* add potential points

* fix points

* fix test about metadata

* update tasks and stats

* lint

* Update points table

* Update tasks table

* 1.14.24

Automatically generated by python-semantic-release

* fix: Get meta from CrossEncoder (#1255)

* remove indent after return

* handle cross encoders for model meta

* make lint

* update filename since we now have model name

* 1.14.25

Automatically generated by python-semantic-release

* fix: Add listing all available benchmarks CLI option (#1256)

* add benchmarks.md in README

* add cli option

* add benchmark cli test case

* correct typo

* 1.14.26

Automatically generated by python-semantic-release

* docs: Update affiliation (#1248)

* Update points.md

* Update points.md

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* docs: Update mteb(eng) calculation (#1258)

* Update mteb(eng) calculation

* Fixed citations

* Update MTEB(eng) + MTEB(multilingual)

* feat: leverage SentenceTransformers' query/passage specific prompts (#1221)

* feat: leverage SentenceTransformer models' query/passage specific prompts

* refactor: remove E5Wrapper

fix: wrong e5 revisions

* fix: default prompt_type to None

* fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub

* fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr

* feat: use Enum for `prompt_type`

* docs: specify how to use prompts with Sentence Transformers
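The query/passage-specific prompts mentioned above, with `prompt_type` as an Enum, can be sketched as follows. The prompt strings are made up for illustration; Sentence Transformers models carry their own `prompts` mapping:

```python
from enum import Enum

class PromptType(str, Enum):
    query = "query"
    passage = "passage"

# Illustrative prompt strings keyed by side of the retrieval pair.
PROMPTS = {
    PromptType.query: "query: ",
    PromptType.passage: "passage: ",
}

def apply_prompt(texts, prompt_type=None):
    """Prefix each text with the prompt for this side, if one is set."""
    prefix = PROMPTS.get(prompt_type, "")
    return [prefix + t for t in texts]

print(apply_prompt(["what is mteb?"], PromptType.query))
# ['query: what is mteb?']
```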

* feat: readd arctic models due to metadata

* 1.15.0

Automatically generated by python-semantic-release

* fix: Add Touche2020v3 and JMTEB (#1262)

* add datasets

* fix metrics

* add Touche2020v3

* fix metadata

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* upd name and supress

* add benchmark class

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update tasks table

* 1.15.1

Automatically generated by python-semantic-release

* fix: Select benchmarks CLI option (#1261)

* add test case for a list of Benchmarks

* add selecting benchmarks CLI option

* typos

* use a separate attribute for benchmarks

* try fixing tests

* should accept string as well

* revert filename change

* use Benchmark and avoid circular import

* fix: derive `results_directory` path from `results_repo` name (#1275)

fix: don't hardcode repo name when downloading results

* 1.15.2

Automatically generated by python-semantic-release

* fix: sorting benchmark tasks by MTEB, then alphabetical (#1271)

* sorted

* fixed formatting

* efficiency changes

* fix test

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.3

Automatically generated by python-semantic-release

* ci: Removed 3.8 dependency (#1281)

Changes include:
- remove 3.8 from tests (added 3.11 and 3.12)
- changed other CI to 3.9
- updated lint rules to use 3.8

* Update points table

* fix: Allow Numpy >=2.0 (#1264)

Allow Numpy >=2.0

* 1.15.4

Automatically generated by python-semantic-release

* docs: points for paper writing (#1286)

* Create 1004.jsonl

* Create 1006.jsonl

* Update docs/mmteb/points/1004.jsonl

* Update docs/mmteb/points/1006.jsonl

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Update points table

* Update points table

* Update points table

* docs: Fix a link in the README (#1289)

* Fix a link in the README

And fix some typos.

* Update README.md

* Update points table

* fix: Update benchmarks (#1288)

* make benchmark var name uppercase

* update touche to v3

* add MIRACLRetrievalHardNegatives to multilingual

* add mteb(indic)

* add eu benchmark

* 1.15.5

Automatically generated by python-semantic-release

* fix: Allow numpy<2.0.0 (#1291)

* 1.15.6

Automatically generated by python-semantic-release

* fix: Add metadata dict to QBQTC in C-MTEB (#1292)

* fix QBQTC in C-MTEB

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* 1.15.7

Automatically generated by python-semantic-release

* fix: Remove non-existent eval split of CMNLI (#1294)

fix eval_splits of CMNLI

* 1.15.8

Automatically generated by python-semantic-release

* Leaderboard (#1235)

* Add leaderboard dev

* Renamed MTEBResults to TaskResult

* Moved model and model meta loading utilities into overview.py

* Added get_model_metas to retrieve filtered metadata for models

* Restructured results object and made it into a class instead of a dict

* Added utilities for filtering models on BenchmarkResults objects

* Added to_table utility function to BenchmarkResults

* Added serialization utilities to BenchmarkResults

* Attempted fixing tests

* Added get_model_metas to __init__

* Added get_benchmarks to __init__ and made it return all benchmarks by default

* Added get_benchmarks to __init__

* Made tasks hashable

* Added task filtering based on task objects on BenchmarkResults

* Added BenchmarkResults to __init__

* Added additional arguments to get_scores on two classes

* Made get_scores smarter on BenchmarkResult

* Added basic multilingual benchmark

* Modified benchmark to be able to easily access results

* Added useful properties and filtering functions to BenchmarkResults

* Added minimal functioning example

* Added smarter table, task-list updating and tried fixing dropdown scrolling

* Made restrict_results into a private function

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Removed old leaderboard scripts

* Hardcoded max and min model size

* Removed redundant utils file

* Ran linting

* added leaderboard dependencies as optional

* Fixed union type error on Python 3.9

* Removed references to Dict in task aggregation

* Fixed name errors in _restrict_task_results

* Fixed _restrict_task_results

* Made hf_subsets={'default'} when the task is monolingual in _restrict_task_results

* Task dropdown now gets filtered based on the other criteria

* Ran linting again

* Introduced hotfix for reranking test

* Added BenchmarkResults to __all__ in __init__

* Fixed validate_and_filter_scores method, and replaced _restrict_task_results with it

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* feat: Use prompts instead of encode_corpus and encode_queries (#1278)

* add prompt per task type

* fix prompt

* upd test

* lint

* fix test

* fix DeprecatedSummarizationEvaluator

* fix prompts

* add test

* lint

* logger info

* use task type only in model_encode

* lint

* update interface

* add prompt types to docs

* fix test

* mock tasks

* mock task registry

* remove last task_type

* fix tests

* lint

* fix test

* fix

* use wrapper and new prompts

* fix tests

* lint

* fix test

* remove conftest

* validate task to prompt_name

* override model prompts

* task to prompt name optional

* fix tests

* fix models

* remove task_to_prompt_name

* remove from mteb __init__

* update docs

* load existing model prompts if model_prompts is None

* fix

* lint

* change wrapper loader

* add wrapper class

* lint

* add wrapper file

* update logging

* upd logging

* refactor reranking

* lint

* remove prints
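The prompt-per-task-type resolution these commits describe can be sketched as a simple precedence rule: a task-name-specific prompt wins over a task-type prompt, which wins over no prompt. The keys and prompt strings below are illustrative, not mteb's actual values:

```python
# Sketch of prompt resolution: most specific key wins.
def resolve_prompt(model_prompts, task_name, task_type):
    for key in (task_name, task_type):
        if key in model_prompts:
            return model_prompts[key]
    return None

prompts = {
    "Retrieval": "Represent the document: ",
    "NFCorpus": "Represent the medical document: ",
}
print(resolve_prompt(prompts, "NFCorpus", "Retrieval"))
# Represent the medical document: 
print(resolve_prompt(prompts, "SciFact", "Retrieval"))
# Represent the document: 
```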

* 1.16.0

Automatically generated by python-semantic-release

* fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276)

* Add Retrieval SK Quad dataset for Slovak search evaluation

This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development.

* Add Retrieval SK Quad dataset for Slovak search evaluation 2

Added the requested changes on the SKQuadRetrieval.py file

* add task to init

* add missing task metadata

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks table

* 1.16.1

Automatically generated by python-semantic-release

* fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274)

* Add Slovak Hate Speech and Offensive Language
Dataset

This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.

* Add Slovak Hate Speech and Offensive Language Dataset
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* Did requested changes:
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

* resolve linting issues by running `make lint`

* Update tasks table

* WIP: Leaderboard UI improvements (#1312)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* 1.16.2

Automatically generated by python-semantic-release

* fix: remove duplicate multilingual

* 1.16.3

Automatically generated by python-semantic-release

* fix: Re-upload dataset to hub to avoid using script upload (#1322)

* fix dataset upload

* add linting

* Update tasks table

* 1.16.4

Automatically generated by python-semantic-release

* fix: Add implementations of common reranker models (#1309)

* init

* revert

* revert

* add metadata

* lint

* add reqs

* change to float16

* benchmark lint fix

* 1.16.5

Automatically generated by python-semantic-release

* Add multilingual mFollowIR dataset (#1308)

* add mFollowIR

* paper name

* edit warning->info

* convert to parquet

* lint

* Update tasks table

* Cache the embeddings when requested (#1307)

* add caching

* update test to use close

* change from json to pkl

* fix for window

* cleanup on Windows again

* infer dimension

* move cachewrapper

* add wrapper

* fix

* updates

* fix tests

* fix lint

* lint

* add test
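The embedding cache these commits add (pickle-backed, per the json→pkl change) can be sketched as a wrapper that memoizes per-text vectors and only calls the underlying encoder for cache misses. File names and the toy encoder are hypothetical:

```python
import os
import pickle
import tempfile

class CachedEncoder:
    """Wrap an encode function; persist per-text embeddings in a pickle."""

    def __init__(self, encode_fn, cache_path):
        self.encode_fn = encode_fn
        self.cache_path = cache_path
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                self.cache = pickle.load(f)

    def encode(self, texts):
        missing = [t for t in texts if t not in self.cache]
        if missing:  # only embed what the cache doesn't already hold
            for t, vec in zip(missing, self.encode_fn(missing)):
                self.cache[t] = vec
            with open(self.cache_path, "wb") as f:
                pickle.dump(self.cache, f)
        return [self.cache[t] for t in texts]

path = os.path.join(tempfile.mkdtemp(), "emb.pkl")
calls = []  # record how many texts each underlying call embeds
toy = lambda ts: (calls.append(len(ts)) or [[len(t)] for t in ts])
enc = CachedEncoder(toy, path)
enc.encode(["a", "bb"])
enc.encode(["a", "bb", "ccc"])
print(calls)  # [2, 1] -- the second call only embeds the new text
```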

* WIP: Leaderboard UI improvements (#1320)

* Fixed typos in task_results

* Fixed typos in task_results

* Added Tailwind, reorganized layout and fixed scrolling

* Ran linting

* Removed faux benchmark

* Updated layout

* Changed table number format

* Table highlights highest values by making them bold

* Added rank to table, removed organization from model_name

* Added mean rank to table

* Ran linting

* feat: Update metadata for all models (#1316)

* Added model meta

* format

* fixed metadata

* Metadata update for voyage models

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/cohere_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Added corrections from review

* fix spelling error

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* resolved bugs from pytest --collect-only

* Avoid wrapping all models with the SentenceTransformerWrapper

* Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations

* fixed moved on correction from @Samoed

* conditionally set .predict method on SentenceTransformerWrapper

---------

Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>

* [mieb] Add OpenCLIP models (#1335)

* add open clip models

* Update __init__.py

* lint

* fix model overview

* update jina clip

---------

Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>

* [mieb] new version with downsampled train split to 32 per class (#1327)

* new version with downsampled train split to 32 per class

* force load truncated image file

* make lint

* add open clip models

* Update __init__.py

* lint

* fix model overview

* fix ImageCLS undersample; run birdsnap

* make lint

* make lint

---------

Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
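
The "downsampled train split to 32 per class" change above can be sketched as a simple stratified subsample. This is assumed logic for illustration, not the repo's exact transform; `downsample_per_class` is a hypothetical helper.

```python
# Hedged sketch: keep at most n_per_class examples per label, seeded for
# reproducibility. Classes with fewer examples are kept in full.
import random
from collections import defaultdict


def downsample_per_class(examples, labels, n_per_class=32, seed=42):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    keep = []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        keep.extend(idxs[:n_per_class])
    keep.sort()  # preserve original ordering of retained examples
    return [examples[i] for i in keep], [labels[i] for i in keep]


# demo: 40 examples of class "a" are capped at 32; the 10 of class "b" survive
examples = list(range(50))
labels = ["a"] * 40 + ["b"] * 10
xs, ys = downsample_per_class(examples, labels, n_per_class=32)
```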

* [mieb] Fix Jina CLIP (#1349)

fix jina clip v1

* fix: Add clevr license (#1356)

* Add BLINK as multi-choice tasks (#1348)

* wip: start adding BLIP models

* add other blip variants

* wip: add blip2_models.py

* make lint

* wip: implement blip2 wrapper

* feat: add blip2 models, still mismatched names

* fix: remove projections from image and text embeddings

* make lint

* wip: add coco BLIP2

* fix: BLIP2 better zero-shot classification without text_proj and vision_proj

* tidy blip2

* add imagenet-dog-15 dataset

* tidy and lint

* remove unused import

* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator

* add imagenet-10 clustering task

* add SOPI2IRetrieval

* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering

* add SOPI2IRetrieval results for clip 32

* add results for clip vit 32/SOPI2IRetrieval

* resolve conflict

* add RP2kI2IRetrieval dataset

* add RP2kI2IRetrieval results with clip-vit-base-patch32

* update image retrieval __init__.py

* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets

* add RP2kI2IRetrieval and METI2IRetrieval

* add METI2IRetrieval

* add SOP results

* make lint

* new revision for METI2IRetrieval

* make lint

* reset corpus chunk size

* remove wrong classification import

* add Flickr30k T2I and I2T

* add Flickr30k T2I retrieval

* reduced-size MET revision

* fix: add Flickr30k T2I

* make lint

* add two landmark datasets and results

* add Sketchy i2i retrieval

* add task metadata

* add BLINKIT2IRetrieval dataset

* add BLINKIT2TRetrieval

* add ImageCoDeT2IRetrieval

* make lint

* add vizwiz retrieval and results

* fix vizwiz duplicate texts

* add new vizwiz results

* add VQA2 results

* add GLD v2 I2T retrieval

* add gld v2 i2i retrieval

* make lint

* add AbsTaskAny2AnyMultiChoice

* make lint

* remove GLDv2I2IRetrieval

* exclude AbsTaskAny2AnyMultiChoice from test_load_data

* fix e5v&vista

* remove duplicate corpus entries from BLINKIT2TRetrieval dataset

* task type fix for running tasks

* update BLINKIT2T metadata

* fix wrong meta

* run mieb script

* split ROxford, RParis into easy, medium and hard

* make lint

* add BLINK as multi choice tasks

* fix: license metadata in wrong format

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] add Eva CLIP models (#1369)

* add Eva CLIP models

* make lint

* [mieb] add siglip, cohere multimodal & some fixes for final run (#1357)

* fix dataset type error

* fix clustering metrics

* add siglip & cohere

* update mieb run script

* cohere-v import

* fix

* api key name

* [mieb] fixes for final run (#1374)

* e5_v device arg

* dataloader num_workers

* vista doc

* vista doc

* run mieb

* fix

* Update run_vista.md

* [mieb] Fix torch no grad (#1378)

Fix torch no grad

* [mieb] Fix vlm2vec (#1380)

* fix vlm2vec return dtype

* make lint

* [mieb] Remove null entries from corpus of ROxford, RParis (#1371)

* remove null examples from corpus of ROxford and RParis

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] fixes (#1390)

* Fix torch no grad

* simplify

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [MIEB] Remove non-existent method for blip (#1394)

remove non-existent method for blip

* [mieb] fix ALIGN; update Winoground revision id; update run script (#1391)

* fix align & winoground

* lint

* Convert task category to i2i for tasks that only call image encode

* update categories should include img cls, clustering, and multi label clf

* no op

* no op

* make lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* [mieb] Fix open clip for cv bench count (#1397)

fix shape mismatch

* [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403)

* fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice

* update blink metadata

* add updated BLINK results

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] Fix EVA CLIP for CV Bench (#1414)

* unsqueeze after preprocess

* make lint

* [mieb] Add calculate probs for vlm2vec (#1418)

* add method

* make lint

* [mieb] Fix siglip bug & add retrieval datasets (#1424)

* fix siglip

* add edis&gld-v2 i2i

* results

* siglip updated results

* fix siglip non-dataloader tasks

* [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420)

* use moc-lr classifier

* set n_experiments=5

* run dinov2 and some laion models

* add dinov2-giant results

* [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429)

* mieb scripts

* lint

* [MIEB] Change Flickr30k to test split (#1449)

* merge upstream mieb

* change Flickr30k to test split

* change flickr to test split

---------

Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>

* [mieb] Fix VLM2vec dtype (#1462)

* propagate dtype

* fix fuse embeddings using list of PIL images

* [mieb] run script for missing results (#1472)

* task type fix

* scripts

* [mieb] Fix Moco model on CIFAR10Clustering (#1487)

Fix Moco model on CIFAR10Clustering

* [mieb] Fix Flickr30k I2T and T2I (#1505)

* remake flickr30k i2t and t2i

* add openai clip vit-b32 b16 and jina-clip results

* make lint

* [MIEB] add missing siglip models  (#1533)

* add updates
* lint errors

* fix typo (#1535)

* add updates
* lint errors
* fix small typo

* [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544)

fix numbers

* Discussing a standard for ImageEncoders

* Add Voyage's multimodal embedding (#1555)

* add voyage multimodal & ran 17 tasks

* lint

* typo

* clean

* [mieb] update script for final re-run (#1576)

* mieb final runs

* lint

* fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572)

* fix: no longer using same query text for all of BLINKIT2TMultiChoice

* fix: remove blink subtask

* fix: remove subtask from blink it2i

* fix: align BLINK retrieval to multi choice

* add ROxford and RParis I2I multi choice

* add retrieval metrics to multi choice evaluator

* fix: remove wrong negatives from revisiting multichoice datasets

* fix revisiting datasets

* add new results for revisiting multichoice

* [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583)

* 1. Make `get_xxx_embeddings` follow `encode`.
2. `ImageDataset.transform` could be `None`.

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Fix arguments

* Try to fix tests

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* fix image encoder (#1596)

* format

* fixed tests

* lint

* [mieb] voyage-v: add exponential backoff and other error handling (#1610)

* add voyage multimodal & ran 17 tasks

* lint

* typo

* clean

* exponential backoff tmp

* downsize large images for voyage api call

* voyage error handling

* lint

* add more results

* make tenacity optional

* lint

* log
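
The exponential backoff added for the voyage API calls above can be sketched as a generic retry helper. This is a hedged, self-contained illustration (the actual commits make tenacity an optional dependency); `with_backoff` and its parameters are hypothetical names, not the repo's API.

```python
# Hedged sketch: retry a flaky call with exponentially growing delays plus
# a little jitter, capped at max_delay, re-raising after the last attempt.
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1))


# demo: a function that fails twice before succeeding on the third attempt
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_backoff(flaky, base_delay=0.01)
```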

* [MIEB] Fix `get_fused_emebddings` (#1612)

* Fix fused

* fix vlm2vec

* Fix lint

* [MIEB] Add new multimodal retrieval tasks (#1611)

* Add new tasks
* Fix score type

* [MIEB] Switch to ViDoRe BEIR version (#1607)

* Fix ViDoRe corpus

* fix lint

* ViDoRe beir version

* Extend MIEB test coverage (#1629)

* add one task from each image AbsTask to test grid

* add visual sts to test grid

* [mieb] Task filtering by modality supported by models (#1633)

* fix function signature for moco loader

* filter out tasks by model modalities

* correct conditions

* add model meta to relevant models

* use modalities instead and separate out constants

* [MIEB] Fix VISTA model (#1638)

Fix vista

* Warn (#1639)

* [mieb] model task modalities matching logic (#1640)

fixing task & model modalities matching logic

* [mieb] Use mock abstask classes (#1648)

* rename to downsampled_dataset_transform

* add mock tasks for mieb

* wip getting to 57%

* make lint

* update mock classes to improve coverage

* omit mock tasks from some tests

* [MIEB] Add code for GME models (#1635)

* Add GME

* Fix infoseek prompts

* Merge instructions

* fix: add version check e5-v in mieb (#1723)

* add version check for e5v model

* Update e5_v.py

* make lint

* fix: change comparison to bigger than (#1743)

change comparison to bigger than

* docs: Rework MIEB docs (#1802)

* combine mieb docs and move to main docs folder

* make flow more coherent

* tidy up

* skip AfriSentiLID for now #1785

* fix typo: exclude MIEB mock tests

* update vista doc

* Apply suggestions from code review

---------

Co-authored-by: Isaac Chung <isaac.chung@team.wrike.com>

* [mieb] Remove results-mieb folder (#1815)

remove results-mieb folder

* [mieb] fixing lrap computation for multi-label classification (#1834)

multi-label cls lrap computation fix

* [mieb] Merge from main (#1853)

* Update tasks table
* 1.19.0
Automatically generated by python-semantic-release
* fix: Add the_ugly_duckling.txt for speedtask to Python wheel (#1402)
Add the_ugly_duckling.txt for speedtask to Python wheel
* 1.19.1
Automatically generated by python-semantic-release
* fix: Added the necessary trust_remote_code (#1406)
* 1.19.2
Automatically generated by python-semantic-release
* docs: Update recommendation for pushing results (#1401)
fix: Update recommendation for pushing results
* docs: Fix a typo in README (#1430)
Fix typo in readme
* fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398)
Fixes #1389
* 1.19.3
Automatically generated by python-semantic-release
* fix: make samples_per_label a task attribute (#1419)
make samples_per_label a task attr
* fix: Add Korean AutoRAGRetrieval (#1388)
* feat: add AutoRAG Korean embedding retrieval benchmark
* fix: run linters (`ruff format .`, `ruff check . --fix`)
* fix: add metadata for AutoRAGRetrieval
* change link for markers_bm
* add AutoRAGRetrieval to init.py and update metadata
* add precise metadata
* update metadata: description and license
* delete descriptive_stats in AutoRAGRetrieval.py and run calculate_metadata_metrics.py
* fix: Add missing benchmarks in benchmarks.py (#1431)
Fixes #1423
* Update tasks table
* 1.19.4
Automatically generated by python-semantic-release
* Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437)
* Added elementary speed/performance plot
* Refactored table formatting code
* Bumped Gradio version
* Added more general info to benchmark description markdown block
* Adjusted margin an range on plot
* Made hover information easier to read on plot
* Made range scaling dynamic in plot
* Moved citation next to benchmark description
* Made titles in benchmark info bold
* Leaderboard: Fixed code benchmarks (#1441)
* fixed code benchmarks
* fix: Made n_parameters formatting smarter and more robust
* fix: changed jina-embeddings-v3 number of parameters from 572K to 572M
* fix: Fixed use_instuctions typo in model overview
* fix: Fixed sentence-transformer compatibility switch
* Ran linting
* Added all languages, tasks, types and domains to options
* Removed resetting options when a new benchmark is selected
* All results now get displayed, but models that haven't been run on everything get nan values in the table
* fix: Count unique texts, data leaks in calculate metrics (#1438)
* add more stat
* add more stat
* update statistics
* fix: update task metadata to allow for null (#1448)
* Update tasks table
* 1.19.5
Automatically generated by python-semantic-release
* Fix: Made data parsing in the leaderboard figure more robust (#1450)
Bugfixes with data parsing in main figure
* Fixed task loading (#1451)
* Fixed task result loading from disk
* Fixed task result loading from disk
* fix: publish (#1452)
* 1.19.6
Automatically generated by python-semantic-release
* fix: Fix load external results with `None` mteb_version (#1453)
* fix
* lint
* 1.19.7
Automatically generated by python-semantic-release
* WIP: Polishing up leaderboard UI (#1461)
* fix: Removed column wrapping on the table, so that it remains readable
* Added disclaimer to figure
* fix: Added links to task info table, switched out license with metric
* fix: loading pre 1.11.0 (#1460)
* small fix
* fix: fix
* 1.19.8
Automatically generated by python-semantic-release
* fix: swap touche2020 to maintain compatibility (#1469)
swap touche2020 for parity
* 1.19.9
Automatically generated by python-semantic-release
* docs: Add sum per language for task counts (#1468)
* add sum per lang
* add sort by sum option
* make lint
* fix: pinned datasets to <3.0.0 (#1470)
* 1.19.10
Automatically generated by python-semantic-release
* feat: add CUREv1 retrieval dataset (#1459)
* feat: add CUREv1 dataset
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
* feat: add missing domains to medical tasks
* feat: modify benchmark tasks
* chore: benchmark naming
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
* Update tasks table
* 1.20.0
Automatically generated by python-semantic-release
* fix: check if `model` attr of model exists (#1499)
* check if model attr of model exists
* lint
* Fix retrieval evaluator
* 1.20.1
Automatically generated by python-semantic-release
* fix: Leaderboard demo data loading (#1507)
* Made get_scores error tolerant
* Added join_revisions, made get_scores failsafe
* Fixed metadata fetching for HF models
* Added failsafe metadata fetching to leaderboard code
* Added revision joining to leaderboard app
* fix
* Only show models that have metadata, when filter_models is called
* Ran linting
* 1.20.2
Automatically generated by python-semantic-release
* fix: leaderboard only shows models that have ModelMeta (#1508)
Filtering for models that have metadata
* 1.20.3
Automatically generated by python-semantic-release
* fix: align readme with current mteb (#1493)
* align readme with current mteb
* align with mieb branch
* fix test
* 1.20.4
Automatically generated by python-semantic-release
* docs: Add lang family mapping and map to task table (#1486)
* add lang family mapping and map to task table
* make lint
* add back some unclassified lang codes
* Update tasks table
* fix: Ensure that models match the names on embedding-benchmarks/results (#1519)
* 1.20.5
Automatically generated by python-semantic-release
* fix: Adding missing metadata on models and matching names up with the results repo (#1528)
* Added Voyage 3 models
* Added correct metadata to Cohere models and matched names with the results repo
* 1.20.6
Automatically generated by python-semantic-release
* feat: Evaluate missing splits (#1525)
* fix: evaluate missing splits (#1268)
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* got test_all_splits_evaluated passing
* tests passing
* address review comments
* make lint
* handle None cases for kg_co2_emissions
* use new results info
---------
Co-authored-by: Thivyanth <thivyanth2004@gmail.com>
* 1.21.0
Automatically generated by python-semantic-release
* fix: Correct typos superseeded -> superseded (#1532)
fix typo -> superseded
* 1.21.1
Automatically generated by python-semantic-release
* fix: Task load data error for SICK-BR-STS and XStance (#1534)
* fix task load data for two tasks
* correct dataset keys
* 1.21.2
Automatically generated by python-semantic-release
* fix: Proprietary models now get correctly shown in leaderboard (#1530)
* Fixed showing proprietary models in leaderboard
* Added links to all OpenAI models
* Fixed table formatting issues
* Bumped Gradio version
* 1.21.3
Automatically generated by python-semantic-release
* docs: Add Model Meta parameters and metadata (#1536)
* add multi_qa_MiniLM_L6_cos_v1 model meta
* add all_mpnet_base_v2
* add parameters to model meta
* make lint
* add extra params to meta
* fix: add more model meta (jina, e5) (#1537)
* add e5 model meta
* address review comments
* 1.21.4
Automatically generated by python-semantic-release
* Add cohere models (#1538)
* fix: bug cohere names
* format
* fix: add nomic models (#1543)
#1515
* fix: Added all-minilm-l12-v2 (#1542)
#1515
* fix: Added arctic models (#1541)
#1515
* fix: add sentence trimming to OpenAIWrapper (#1526)
* fix: add sentence trimming to OpenAIWrapper
* fix: import tiktoken library inside encode function
* fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name
* fix: pass tokenizer_name, max_tokens to loader
* fix: make tokenizer_name None for default
* fix: delete changes for ModelMeta
* fix: fix revision to 2 for OpenAI models
* fix: add docstring for OpenAIWrapper
* fix: lint
* feat: add openai optional dependency set
* fix: add sleep for too many requests
* fix: add lint
* fix: delete evaluate file
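
The sentence-trimming commits above cut inputs down to the model's token limit before calling the API. A minimal sketch of the idea, assuming a whitespace split as a stand-in tokenizer (the actual fix uses tiktoken); `trim_to_max_tokens` is a hypothetical helper, not the wrapper's real method:

```python
# Hedged sketch: truncate text to at most max_tokens tokens, where "token"
# here is approximated by whitespace splitting for illustration only.
def trim_to_max_tokens(text: str, max_tokens: int) -> str:
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # already within the limit; return unchanged
    return " ".join(tokens[:max_tokens])


# demo: a five-word input trimmed to three tokens
trimmed = trim_to_max_tokens("the quick brown fox jumps", 3)
```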
* 1.21.5
Automatically generated by python-semantic-release
* fix: Fixed metadata errors (#1547)
* 1.21.6
Automatically generated by python-semantic-release
* fix: remove curev1 from multlingual (#1552)
Seems like it was added here:
https://github.com/embeddings-benchmark/mteb/commit/1cc6c9e0fe62ca4e77708b641823fa1a121f048b
* 1.21.7
Automatically generated by python-semantic-release
* fix: Add Model2vec (#1546)
* Added Model2Vec wrapper
* Added Model2vec models
* Added model2vec models to registry
* Added model2vec as a dependency
* Ran linting
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Added adapted_from and superseeded_by to model2vec models.
* Added missing import
* Moved pyproject.toml to optional dependencies
* Fixed typos
* Added import error and changed model to model_name
* Added Numpy to frameworks
* Added Numpy to frameworks
* Corrected false info on model2vec models
* Replaced np.inf with maxint
* Update mteb/models/mode…