Add WebFAQ bitext mining tasks #2326
Merged
+4,761
−0
Hi all,
This pull request follows up on a previous one regarding the WebFAQ Retrieval Dataset and adds two Bitext Mining tasks. The source data remains the same (question-answer pairs extracted from FAQ pages), as described in the previous PR and in the preprint of the dataset paper.
Overall, the Bitext Mining dataset consists of 1001 language pairs, each with at least 100 aligned QAs. If that is too many, I could reduce the number of language pairs to 368 (all language pairs with at least 1000 samples) or 65 (all language pairs with at least 4000 samples).
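The thresholds above amount to filtering language pairs by their number of aligned samples. A minimal sketch of that selection, assuming a plain mapping from language pair to sample count (the counts, language codes, and the `select_pairs` helper are hypothetical and only illustrate the idea):

```python
from typing import Dict, List, Tuple

# Hypothetical counts of aligned QA pairs per language pair
# (illustrative values, not taken from the WebFAQ data).
pair_counts: Dict[Tuple[str, str], int] = {
    ("deu", "eng"): 12000,
    ("eng", "fra"): 4500,
    ("ita", "spa"): 1500,
    ("dan", "swe"): 250,
    ("isl", "nor"): 80,
}


def select_pairs(
    counts: Dict[Tuple[str, str], int], min_samples: int
) -> List[Tuple[str, str]]:
    """Return, in sorted order, the pairs with at least `min_samples` samples."""
    return sorted(pair for pair, n in counts.items() if n >= min_samples)


# Compare how many pairs survive each candidate threshold.
for threshold in (100, 1000, 4000):
    kept = select_pairs(pair_counts, threshold)
    print(f"min_samples={threshold}: {len(kept)} language pairs")
```

Raising the threshold trades language coverage for evaluation cost, since every retained pair becomes its own subset to evaluate.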
The two Bitext Mining tasks are `WebFAQBitextMiningQuestions` (category s2s) and `WebFAQBitextMiningQAs` (category p2p).
Example
Input data:
Task `WebFAQBitextMiningQuestions`:
Task `WebFAQBitextMiningQAs`:
Note that in this example the question is longer than the answer. In most cases it is the other way around: the answer is quite descriptive and therefore longer than the question.
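For readers unfamiliar with bitext mining, the following is a rough sketch of how such a task is commonly scored: each source text is matched to its most similar target text by cosine similarity, and the match counts as correct when it is the aligned translation. The embeddings below are made up, and the helper names are hypothetical; in the tasks described here the texts would be questions (s2s) or question-answer texts (p2p). This is not the mteb evaluator itself.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine(
    source_vecs: Sequence[Sequence[float]],
    target_vecs: Sequence[Sequence[float]],
) -> List[int]:
    """For each source vector, return the index of the most similar target."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(s, target_vecs[j]))
        for s in source_vecs
    ]


def accuracy(predicted: Sequence[int]) -> float:
    """Fraction of sources matched to their aligned target (source i <-> target i)."""
    return sum(int(p == i) for i, p in enumerate(predicted)) / len(predicted)


# Toy embeddings for three aligned text pairs (made-up values).
source = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.1, 1.0]]
target = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

predicted = mine(source, target)
print(predicted, accuracy(predicted))
```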
Code Quality
Format the code using `make lint` to maintain consistent style.
Documentation
Testing
Validate new tests with `make test-with-coverage`.
Run tests locally using `make test` or `make test-with-coverage` to ensure no existing functionality is broken.
Adding datasets checklist
Reason for dataset addition:
As far as I know, FAQ pages have not yet been used as a resource.
The dataset covers many different language pairs.
Because the bitext mining process is state of the art, the dataset yields high-quality translations (see preprint).
I have run the following models on the task (adding the results to the PR). These can be run using the `mteb -m {model_name} -t {task_name}` command.
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using `make test`.
Run the formatter to format the code using `make lint`.
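The subsampling item in the checklist refers to mteb's `self.stratified_subsampling()`. As a rough illustration of the general idea only, the sketch below draws a proportional share from each label group so that the label distribution is roughly preserved. It is not mteb's implementation; the `stratified_subsample` helper, the row layout, and the `label` field are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_subsample(
    rows: List[Dict], label_key: str, n_samples: int, seed: int = 42
) -> List[Dict]:
    """Return at most `n_samples` rows, sampled proportionally per label."""
    if len(rows) <= n_samples:
        return list(rows)  # Already small enough; nothing to do.

    rng = random.Random(seed)  # Fixed seed keeps the subsample reproducible.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    sampled: List[Dict] = []
    for label in sorted(groups):  # Sorted for a deterministic iteration order.
        members = groups[label]
        # Proportional share of the budget, with at least one row per label.
        k = max(1, round(len(members) * n_samples / len(rows)))
        sampled.extend(rng.sample(members, min(k, len(members))))

    # Rounding can overshoot slightly, so shuffle and trim to the budget.
    rng.shuffle(sampled)
    return sampled[:n_samples]


# Toy data: 3000 rows labelled "a" and 1000 rows labelled "b".
rows = [{"id": i, "label": "a" if i < 3000 else "b"} for i in range(4000)]
subset = stratified_subsample(rows, "label", n_samples=2048)
print(len(subset), sum(r["label"] == "a" for r in subset))
```

With the toy data, the 3:1 ratio between the two labels carries over to the subsample, which is the property that plain random truncation would not guarantee.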