
Matching losses with miners #169

Closed
goksinan opened this issue Aug 5, 2020 · 4 comments
Labels: Frequently Asked Questions, question (A general question about the library)

Comments


goksinan commented Aug 5, 2020

I have two questions regarding using losses with miners:

  • The documentation says "...the library automatically converts pairs to triplets and triplets to pairs, when necessary". From this I understand that we can mix and match miners and losses. Is that correct, or does each loss work best with a specific miner?
  • The loss function uses distances between samples, and those same distances are used to mine samples, so the two depend on each other. What is the relationship between the miner parameters (for instance, the cutoff of DistanceWeightedMiner or the margin of TripletMarginMiner) and the loss parameters (the margin of TripletMarginLoss)? Are there any best practices for choosing them?
@KevinMusgrave added the question label Aug 6, 2020

KevinMusgrave commented Aug 8, 2020

A primary goal of this library is a high degree of flexibility, to make it easy to try new ideas. So yes, you can pass the output of any tuple miner into any loss function, and it will do something with it. Specifically:

  • If you pass pairs into a triplet loss, then triplets will be formed by combining each positive pair and negative pair that share the same anchor.
  • If you pass triplets into a pair loss, then pairs will be formed by splitting each triplet into two pairs.
  • If you pass pairs or triplets into a classification loss, then each embedding's loss will be weighted by how frequently the embedding occurs in the pairs or triplets.
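
Here is a minimal sketch of that mix-and-match behavior. It assumes a recent version of pytorch-metric-learning, and the parameter values are only illustrative:

```python
import torch
from pytorch_metric_learning import losses, miners

embeddings = torch.randn(32, 128)       # stand-in for model(data)
labels = torch.randint(0, 5, (32,))     # class labels for the batch

# Pair miner feeding a triplet loss: mined pairs are converted to triplets internally.
pair_miner = miners.PairMarginMiner(pos_margin=0.2, neg_margin=0.8)
triplet_loss = losses.TripletMarginLoss(margin=0.1)
loss1 = triplet_loss(embeddings, labels, pair_miner(embeddings, labels))

# Triplet miner feeding a pair loss: each mined triplet is split into a positive pair and a negative pair.
triplet_miner = miners.TripletMarginMiner(margin=0.2, type_of_triplets="semihard")
pair_loss = losses.ContrastiveLoss(pos_margin=0, neg_margin=1)
loss2 = pair_loss(embeddings, labels, triplet_miner(embeddings, labels))
```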

There probably are certain combinations of loss+miner that work better than others, but I don't know which ones.

There also isn't a great way of selecting hyperparameters, and there are so many other factors like model, optimizer, batch size, and learning rate. So unfortunately I don't have a great answer for this.

Based on my experience, if I were trying to solve a new problem I would limit my focus to:

  • ContrastiveLoss
  • MultiSimilarityLoss
  • ArcFaceLoss or CosFaceLoss

And maybe also try combining these with the MultiSimilarityMiner.
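
For example (a rough sketch only, using library-default hyperparameters rather than tuned values):

```python
import torch
from pytorch_metric_learning import losses, miners

miner = miners.MultiSimilarityMiner(epsilon=0.1)
loss_func = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)
# Other starting points to swap in:
# loss_func = losses.ContrastiveLoss(pos_margin=0, neg_margin=1)
# loss_func = losses.ArcFaceLoss(num_classes=100, embedding_size=128)  # classification loss: its class weights need their own optimizer

# In the training loop:
embeddings = torch.randn(32, 128)        # stand-in for model(data)
labels = torch.randint(0, 100, (32,))
loss = loss_func(embeddings, labels, miner(embeddings, labels))
```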

Since you're interested in using miners, I suggest trying out the ThresholdReducer, which can be passed into any loss function using the reducer argument. It discards losses that fall outside a certain range.
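
For instance (the low/high values here are arbitrary examples, not recommendations):

```python
from pytorch_metric_learning import losses, reducers

# Per-element losses outside the (low, high) range are discarded before averaging.
reducer = reducers.ThresholdReducer(low=0.01, high=0.3)
loss_func = losses.TripletMarginLoss(margin=0.1, reducer=reducer)
```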

@KevinMusgrave added the Frequently Asked Questions label Aug 8, 2020

goksinan commented Aug 8, 2020

Thank you very much for the detailed answer. As a follow-up, I have one observation about loss functions that use cosine similarity as the distance metric, and I am curious to hear your opinion on it. The Euclidean distance depends on the embedding size and the values the embeddings contain, so the optimal margin parameter may change with the specific problem at hand and is not easily transferable from one domain to another. Applying L2 normalization right before the loss calculation might help, but I am skeptical of that, since doing so can alter the relative similarities among embeddings. Cosine similarity, however, gives a value between -1 and 1 (between 0 and 2 for cosine distance) regardless of what the embeddings contain. Do you think the optimized loss parameters reported in the literature (for losses that employ cosine similarity) can safely be transferred to other domains? I am just trying to figure out whether there is one less hyperparameter to worry about.

@KevinMusgrave

Yeah, I think that is one advantage of cosine similarity vs. unnormalized embeddings, i.e. you can usually expect to use a triplet margin between 0 and 0.2. I think hyperparameters optimized in one domain are a good starting place, but the true optimum is probably somewhat different.
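
As a rough sketch of what that looks like in this library (assuming a version where the distance is configurable via the `distance` argument; the margin value is just an example in that 0 to 0.2 range):

```python
from pytorch_metric_learning import losses, distances

loss_func = losses.TripletMarginLoss(
    margin=0.1,                              # margin on the cosine-similarity scale
    distance=distances.CosineSimilarity(),   # embeddings are L2-normalized internally
)
```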

@goksinan

I see. Thanks!
