
Feat/cross encoder trainer lambdaloss #4

Open · milistu wants to merge 16 commits into feat/cross_encoder_trainer

Conversation


milistu commented on Feb 26, 2025

LambdaLoss Implementation for Cross Encoder Trainer

This PR adds LambdaLoss functionality to the Cross Encoder Trainer feature.

Changes

  • Implemented LambdaLoss as a pairwise loss function
  • Added support for various weighting schemes for LambdaLoss
  • Created an example script demonstrating LambdaLoss usage

Implementation Details

LambdaLoss is a pairwise loss function that can be used for ranking problems. It's particularly useful for information retrieval tasks where the relative order of results is more important than absolute scores.

The implementation includes flexible weighting schemes that allow for different prioritization strategies when training the model.

Reference: "The LambdaLoss Framework for Ranking Metric Optimization" (CIKM 2018): https://marc.najork.org/papers/cikm2018.pdf
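
A minimal usage sketch (hedged: the loss import path, the default weighting scheme, and the query/docs/labels column layout shown here are assumptions and may differ from the final API):

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import LambdaLoss

# Toy listwise dataset: one list of candidate docs (with relevance labels) per query.
# The query/docs/labels column names are assumed from the other listwise losses.
train_dataset = Dataset.from_dict({
    "query": ["how do planes fly", "what is python"],
    "docs": [
        ["Lift is generated by airflow over the wings.", "Bananas are yellow."],
        ["Python is a programming language.", "Snakes are reptiles."],
    ],
    "labels": [[1, 0], [1, 0]],
})

model = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1)
loss = LambdaLoss(model)  # default weighting scheme; see the schemes added in this PR

trainer = CrossEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```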

milistu (Author) commented on Mar 8, 2025

Hi @tomaarsen,

I updated LambdaLoss with the changes you made in ListNetLoss.

I've trained the model using the same parameters you used for ListNetLoss, and I'm very pleased with the results. Notably, there's an improvement over ListNetLoss on evaluation, and training for just 1 epoch produced better results than models trained for 20 epochs with the old implementation of this loss.

I would kindly ask you to review these changes and let me know if you're satisfied with them. Additionally, I plan to remove the cache from the example training script once we finalize the implementation.

I'm also considering completely removing this argument:

reduction: Literal["sum", "mean"] = "sum"

When "sum" is set, the loss is in hundreds (approximately ~190), however, when using "mean" I get something more reasonable (~1). What do you think? I am not sure if this argument gives any value, but maybe I am missing something

Here's the model from my test: https://huggingface.co/Studeni/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-lambdaloss

Lastly, I added mini_batch_size as an argument in get_config_dict. In my opinion, it's good to include all the details - if you agree, we could add this to ListNetLoss as well.

milistu (Author) commented on Mar 10, 2025

Hi @tomaarsen,

I was testing an idea: I wanted to see if we could get better results by expanding the MS MARCO dataset. I used mine_hard_negatives to find 9 negative texts for every query-positive pair, based on similarity to the positive text. I concatenated this new dataset with the v1.1 train split and trained the model. I think it's a good example of how to use mine_hard_negatives for this use case. What do you think?

Here is the trained model: https://huggingface.co/Studeni/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-lambdaloss-hard-neg
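
A rough sketch of the mining step (the dataset, model, and parameter choices below are illustrative assumptions, not the exact script; my variant anchors on the positive text rather than the query):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Placeholder (anchor, positive) pair dataset; swap in the MS MARCO pairs.
pairs = load_dataset("sentence-transformers/natural-questions", split="train")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

hard_dataset = mine_hard_negatives(
    pairs,
    embedder,
    num_negatives=9,   # 9 negatives per query-positive pair, as described above
    range_min=10,      # skip the nearest hits, which are often unlabeled positives
    range_max=100,     # sample negatives from the top-100 retrieved candidates
    use_faiss=True,    # FAISS speeds up the nearest-neighbor search
)
```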

tomaarsen (Owner) commented

> I'm also considering completely removing this argument:
>
> reduction: Literal["sum", "mean"] = "sum"
>
> When "sum" is set, the loss is in the hundreds (~190), whereas "mean" gives something more reasonable (~1). What do you think? I'm not sure this argument adds any value, but maybe I'm missing something.

I think exclusively having mean is fine: otherwise the loss depends heavily on the batch size, and people get the (incorrect) impression that a smaller batch size makes for a better model (it's a smaller loss, after all), when that's not the case.
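
A toy illustration of that batch-size dependence, with made-up per-query loss values:

```python
import torch

# With "sum", the reported loss grows with the number of queries in the batch;
# with "mean", it stays on a comparable scale regardless of batch size.
per_query_loss = torch.rand(64)           # pretend per-query LambdaLoss values
print(per_query_loss[:8].sum().item())    # batch of 8  -> small "sum" loss
print(per_query_loss.sum().item())        # batch of 64 -> roughly 8x larger
print(per_query_loss[:8].mean().item())   # "mean" is comparable across both
print(per_query_loss.mean().item())
```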

> Lastly, I added mini_batch_size as an argument in get_config_dict. In my opinion, it's good to include all the details - if you agree, we could add this to ListNetLoss as well.

I agree, I think this is fine. mini_batch_size does not affect the model performance (only training speed and memory usage), but it's nice for reproducibility.
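
For example, get_config_dict could look roughly like this (a sketch only; the weighting_scheme, k, and sigma attribute names are assumptions about this PR's implementation):

```python
def get_config_dict(self) -> dict:
    # Sketch: record mini_batch_size alongside the other hyperparameters so a
    # saved config fully describes the training setup. Attribute names here
    # are assumed and may differ from the actual code in this PR.
    return {
        "weighting_scheme": self.weighting_scheme.__class__.__name__,
        "k": self.k,
        "sigma": self.sigma,
        "mini_batch_size": self.mini_batch_size,
    }
```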

> I was testing an idea: I wanted to see if we could get better results by expanding the MS MARCO dataset. I used mine_hard_negatives to find 9 negative texts for every query-positive pair, based on similarity to the positive text. I concatenated this new dataset with the v1.1 train split and trained the model. I think it's a good example of how to use mine_hard_negatives for this use case. What do you think?

I like it! Sounds good to me. I'll review this later today and try to get it merged as well!

- Tom Aarsen

tomaarsen (Owner) commented

I made some changes here and there, and trained a few models:

I think this is just about ready to go!

- Tom Aarsen
