Negative loss and extreme cases. #18
I did read your responses on OpenReview, and you've explained well in the paper that the global minimum is definitely negative. I guess it becomes a case-by-case decision wherein users must balance it with GDL. I also noticed that the Hausdorff distance could potentially contain an error.
The directed_hausdorff function, in your case, directly takes a 0/1 map. However, this function takes a list of …
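For reference, scipy's directed_hausdorff does expect arrays of point coordinates rather than 0/1 maps; a minimal sketch of the conversion (the two masks here are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Two hypothetical binary masks (0/1 maps) of the same shape.
a = np.zeros((8, 8)); a[2:4, 2:4] = 1
b = np.zeros((8, 8)); b[3:5, 3:5] = 1

# directed_hausdorff expects arrays of point COORDINATES, not 0/1 maps,
# so convert each mask to an (n_points, 2) coordinate array first:
d, _, _ = directed_hausdorff(np.argwhere(a), np.argwhere(b))
print(d)  # max over points of a of the distance to the nearest point of b
```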
Hey there,
I haven't bothered with that yet, as I consider that those trivial cases would be dealt with by the other loss. Notice also that if there are no positive pixels at all, the distance map stays at 0.
I would consider it if the images were much bigger, but so far I have not tried that. I don't think it would change much (but I could be wrong).
No. The negative values and/or gradients do not make it a gradient ascent.
You raise a very good point with the Hausdorff function. While the way I call the function is correct, it is not optimal for two reasons:
I might have missed something (currently traveling), so let me know if I was not clear or misunderstood you.
Yes, you are right about the negative surface loss values; they do not affect the gradient. Apologies. You've also rightly pointed out in the paper that the optimal value is negative. Glad to know you made the shift in the Hausdorff function. I'm still not sure how the … We've successfully integrated the surface loss in our application (I normalized it with the maximum distance in our case; we noticed that, due to the much larger scale of the surface loss, it dominated the convergence over the other loss functions). We will be citing your work.
Glad to hear that! And thanks for the feedback about normalizing, this is interesting.
Hi @HKervadec, I am confused about one aspect. Since the surface loss alone has no explicit target (the value is a weighted summation of probability outputs), the network doesn't know where to move.
Here, G is the gradient of that weight with respect to the input. For usual loss functions, G is derived from … My questions are:
Edit: I've answered question 1 myself. The default premise behind any gradient descent algorithm is to move the objective function (in this case, a weighted summation) towards a large negative value. I still fail to understand, though, why training solely with the boundary loss is difficult.
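One way to see the answer to question 1: the boundary loss is linear in the predicted probabilities, so its gradient with respect to them is just the fixed signed distance map itself. A toy sketch with a made-up 1-D "image" (not the repository's code):

```python
import numpy as np

# Hypothetical 1-D image of 6 pixels; the ground-truth object occupies pixels 2-3.
# phi is a signed distance map: negative inside the object, positive outside.
phi = np.array([2.0, 1.0, -1.0, -1.0, 1.0, 2.0])

def boundary_loss(probs, phi):
    # The boundary loss is just a weighted sum of the softmax probabilities.
    return (phi * probs).sum()

probs = np.full(6, 0.5)               # an uninformed initial prediction
# Because the loss is linear in probs, d(loss)/d(probs) is simply phi:
grad = phi
# One gradient-descent step (step size 0.1), clipped back to [0, 1]:
probs_new = np.clip(probs - 0.1 * grad, 0.0, 1.0)
# Probability rises where phi < 0 (inside the object) and falls where phi > 0.
```

So descent has a well-defined direction even without a per-pixel "target": it pushes mass toward the negatively-weighted (inside) region.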
Hejhej,
A weighted cross-entropy will simply apply a higher penalty depending on the distance, but it does not actually compute the distance between the two segmentations. That is the goal of our loss: we show that our final loss is equivalent to the distance computation of Equation (2). It is much more principled than just throwing weights around.
It is a problem of trivial solutions. You've noticed that the objective function is supposed to end up at a negative value. A perfect segmentation would have the following probabilities:
Now, for small objects (though it is true in a lot of cases), that negative value might not be that far off. This is why, for binary problems, we used the loss in conjunction with another loss. Another solution (which I've not investigated yet) would be to use a pre-trained network in place of a newly initialized one; the initial predictions might be decent enough to avoid falling into that pit. The problem I've just described doesn't really exist for multi-class problems, where the trivial solution for one class would be a terrible solution for another class, which means the network cannot easily get stuck in such trivial solutions.
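A small numerical sketch of that pit (the 32x32 image and 3x3 object are made up; the signed map is built with the same formula as in utils.py): the trivial all-background prediction scores 0, barely worse than the perfect segmentation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as edt

# Toy 32x32 binary ground truth with a small 3x3 object.
gt = np.zeros((32, 32), dtype=bool)
gt[14:17, 14:17] = True

# Signed distance map as in utils.py: positive outside the object, negative inside.
neg = edt(~gt) * ~gt
pos = (edt(gt) - 1) * gt
phi = neg - pos

loss_trivial = (phi * np.zeros_like(phi)).sum()  # predict all background
loss_perfect = (phi * gt).sum()                  # perfect segmentation
print(loss_trivial, loss_perfect)                # the gap is tiny
```

With such a small object the minimum is only slightly below the trivial 0, so a freshly initialized network can easily settle near "predict nothing".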
Can someone explain the role of the '-1' in (distance(posmask) - 1) * posmask? If all the distances in distance(posmask) are reduced by 1, I'm guessing it does not make a big difference, as the values are usually between 0 and 300 in my application. So what is the purpose of the -1?
Please see #8
Thank you
Hi, thank you very much for your explanation here. Regarding the problem with trivial solutions in binary segmentation: would it be possible to avoid it by doing a one-hot binary segmentation? In other words, we cast the binary segmentation into a two-class segmentation (foreground and background); would this help avoid trivial solutions, as you commented for the multi-class case? Thank you!
In that case no, as the two classes are a perfect negative of each other, so unfortunately the trivial minimum remains the same. (Notice that …) In the meantime, I have reconfirmed that the problem does not exist in a true multi-class setting:
Hi @RSKothari, how did you normalize the loss: per epoch or per batch? Can you kindly share your code? Does the max remain constant throughout?
I do not know how they did it exactly, but this might interest you: #14 (comment). That code gist normalizes on a per-scan (2D or 3D) basis, between …
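For anyone wanting a starting point, a per-scan normalization could look like this (my own guess at the idea, not the gist's actual code):

```python
import numpy as np

def normalize_dist(phi):
    # Rescale a signed distance map by its largest magnitude, per scan,
    # so the resulting weights lie in [-1, 1].
    m = np.abs(phi).max()
    return phi / m if m > 0 else phi

phi = np.array([-3.0, 0.0, 6.0])  # hypothetical per-scan distance values
scaled = normalize_dist(phi)      # scaled into [-1, 1]
```

Normalizing per scan keeps the surface loss on a comparable scale to the other loss it is combined with, regardless of image size.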
Hi, thanks for sharing your repository.
I went over the paper and I'm stuck at line 208 of utils.py:
res[c] = distance(negmask) * negmask - (distance(posmask) - 1) * posmask
I noted that if the ground truth is all 1s, it results in an irreversibly negative loss (the edt function returns distances from the matrix edges in the all-1s case). It might be a good idea to hardwire a loss of 0 in such events.
Do you think it would be wise to weight by the maximum distance possible for a particular image, that is, sqrt((h-1)**2 + (w-1)**2)? This ensures that the loss function will always be capped at 1.
Finally, does the surface loss function provide a safeguard against gradient ascent (i.e. a negative loss)? I noticed that it tends to go negative very quickly in the early epochs.