
Negative loss and extreme cases. #18

Closed
RSKothari opened this issue Sep 11, 2019 · 13 comments
RSKothari commented Sep 11, 2019

Hi, thanks for sharing your repository.

I went over the paper and I'm stuck at line 208 of utils.py

res[c] = distance(negmask) * negmask - (distance(posmask) - 1) * posmask

I noticed that if the ground truth is all 1s, it results in an irrecoverably negative loss (the edt function returns distances from the matrix edges in the all-1s case). It might be a good idea to hardwire the loss to 0 in such cases.

Do you think it would be wise to weight by the maximum distance possible for a particular image, i.e. sqrt((h-1)**2 + (w-1)**2)? This ensures the loss is always capped at 1.

Finally, does the surface loss function provide a safeguard against gradient ascent (i.e., a negative loss)? I noticed it tends to go negative very quickly in the early epochs.
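For reference, the distance-map line in question can be sketched like this (my reading of the utils.py line, not the authors' exact code, assuming scipy's distance_transform_edt as distance()):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def signed_dist_map(posmask: np.ndarray) -> np.ndarray:
    """Sketch of the utils.py line: positive outside the object, <= 0 inside."""
    negmask = ~posmask
    return distance(negmask) * negmask - (distance(posmask) - 1) * posmask

# 7x7 image with a 3x3 foreground square
posmask = np.zeros((7, 7), dtype=bool)
posmask[2:5, 2:5] = True
res = signed_dist_map(posmask)
# Interior foreground pixels get negative values (res[3, 3] == -1),
# boundary foreground pixels get 0 (thanks to the -1), background pixels
# stay positive, so the optimum of sum(res * p_foreground) is negative.
```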

@RSKothari RSKothari changed the title Purpose of -1 in one_hot2dist Negative loss and extreme cases. Sep 11, 2019
@RSKothari

I did read your responses on OpenReview, and you explained well in the paper that the global minimum is indeed negative. I guess it then becomes a case-by-case matter of balancing it against GDL.

I also noticed that the Hausdorff distance could potentially contain an error.

from scipy.spatial.distance import directed_hausdorff
import numpy as np

def numpy_haussdorf(pred: np.ndarray, target: np.ndarray) -> float:
    assert len(pred.shape) == 2
    assert pred.shape == target.shape

    return max(directed_hausdorff(pred, target)[0], directed_hausdorff(target, pred)[0])

The directed_hausdorff function, in your case, directly takes a 0/1 map. However, this function takes a list of x and y coordinates of pixel locations. Perhaps I misunderstood the function?
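To illustrate the point: SciPy's directed_hausdorff expects (n_points, n_dims) coordinate arrays, so a 0/1 map would first need to be converted to pixel coordinates, e.g. with np.argwhere. A small sketch (example masks are my own, not from the repo):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

pred = np.zeros((8, 8)); pred[0, 0] = 1      # single predicted pixel at (0, 0)
target = np.zeros((8, 8)); target[3, 4] = 1  # single target pixel at (3, 4)

# Coordinate-based call: the true symmetric Hausdorff distance is 5.0
p_pts, t_pts = np.argwhere(pred), np.argwhere(target)
hd = max(directed_hausdorff(p_pts, t_pts)[0],
         directed_hausdorff(t_pts, p_pts)[0])

# Passing the raw 0/1 maps instead treats each row as a point in 8-D
# space, which measures something else entirely.
hd_raw = max(directed_hausdorff(pred, target)[0],
             directed_hausdorff(target, pred)[0])
```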

@HKervadec

Hey there,

I noted that if the ground truth had all 1s, it would result in a irreversible negative loss (the edt function returns distances from matrix edges in the case of all 1s). Might be a good idea to hardwire a loss of 0s during such events.

I haven't bothered with that yet, as I consider that those trivial cases would be dealt with by the other loss. Notice also that if there are no positive pixels at all, the distance map stays at 0.

Do you think it would be wise to weight by the maximum distance possible for a particular image? That is, sqrt((h-1)**2 + (w-1)**2). This ensures that the loss function will always be capped to 1.

I would consider it if the images were much bigger, but so far I have not tried that. I don't think it would change much (but I could be wrong).

Finally, does the surface loss function provide a safe guard against gradient ascent (i.e negative loss)? I noticed that it tends to get negative very quickly in the early epochs.

No. The negative values and/or gradients do not make it a gradient ascent.

The directed_hausdorff function, in your case, directly takes a 0/1 map. However, this function takes a list of x and y coordinates of pixel locations. Perhaps I misunderstood the function?

You raise a very good point about the Hausdorff function. While the way I call the function is correct, it is not optimal, for two reasons:

  • the computed value when both the prediction and the ground truth are empty is 0 (which lowers the overall value when we average over the dataset);
  • I did not use the spatial resolution of the image when computing it (it varies a tiny bit between some images).

While this was not really an issue for the comparison we were making in the paper, I recently switched to the MedPy Hausdorff function. This is for the extension of the paper, which will include code for the 3D case, and which I haven't made public yet.

I might have missed something (currently traveling) so let me know if I was not clear or misunderstood you.

@RSKothari

Yes, you are right about the negative surface loss values: they do not affect the gradient. Apologies. You also rightly point out in the paper that the optimum value is negative.

Glad to know you switched Hausdorff functions. I'm still not sure how the directed_hausdorff call would be correct, but I would need to run detailed experiments comparing the results.

We've successfully integrated surface loss in our application (I normalized it by the maximum distance in our case; we noticed that, due to its much larger scale, the surface loss otherwise dominated convergence over the other loss functions).
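The normalization described above can be sketched as follows (my own reconstruction of the idea, not our exact code): divide the distance map by the largest possible distance in the image, so every per-pixel loss term stays within [-1, 1].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def normalized_dist_map(posmask: np.ndarray) -> np.ndarray:
    h, w = posmask.shape
    d_max = np.sqrt((h - 1) ** 2 + (w - 1) ** 2)  # image diagonal
    negmask = ~posmask
    res = distance(negmask) * negmask - (distance(posmask) - 1) * posmask
    return res / d_max  # per-pixel values now bounded by [-1, 1]

posmask = np.zeros((7, 7), dtype=bool)
posmask[2:5, 2:5] = True
res = normalized_dist_map(posmask)
```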

We will be citing your work.

@HKervadec

We've successfully integrated Surface loss in our application (I normalized it with the maximum distance in our case - we noticed that due to the much larger scale of surface loss, it dominated the convergence over other loss functions).

Glad to hear that! And thanks for the feedback about normalizing; this is interesting.

@RSKothari

RSKothari commented Feb 3, 2020

Hi @HKervadec, I am confused about one aspect. Since surface loss on its own has no target (the value is a weighted sum of probability outputs), the network doesn't know where to move.

W^1 = W^0 - a*G

Here G is the gradient of the loss with respect to that weight.

For usual loss functions, G is derived from G = d F(Y_gndtrth, Y_op) / d(W), where F is a function dependent on Y_gndtrth. However, surface loss has no Y_gndtrth, which results in empty gradients.

My questions are:

  1. How do you prove that when used in conjunction with other losses, boundary loss contributes to the gradient in any manner?

  2. What is the advantage of surface loss over weighted cross entropy wherein the weight is derived using, say, Hausdorff?

Edit: I've answered question 1 myself. The premise behind any gradient-descent algorithm is to move the objective function (in this case, a weighted sum) towards a large negative value. I still fail to understand, though, why training solely with boundary loss is difficult.

@RSKothari RSKothari reopened this Feb 3, 2020
@HKervadec

HKervadec commented Feb 10, 2020

Hejhej,

  1. What is the advantage of surface loss over weighted cross entropy wherein the weight is derived using, say, Hausdorff?

A weighted cross-entropy simply applies a higher penalty depending on the distance, but it does not actually compute the distance between the two segmentations. That is the goal of our loss, where we show that the final loss is equivalent to the distance computation of Equation (2). It is much more principled than just throwing weights around.

I fail to understand though, why do we find difficulty in training solely with boundary loss?

It is a problem of trivial solutions. You've noticed that the objective function is supposed to end up at a negative value. A perfect segmentation would have the following probabilities:

  • 0 for all background pixels (which have a distance value > 0)
  • 1 for all foreground pixels (which have a distance value < 0)

Meaning that the ideal loss is negative.

Now, for small objects (though it is true in a lot of cases), that negative value might not be that far from 0. Therefore, if you predict everything with a probability of 0, you end up with a loss of 0, which is a local minimum, and which might not be that far from the global minimum. The gradient of that prediction won't be very high either, which means that your network will remain "stuck" there.

This is why, for binary problems, we used the loss in conjunction with another loss. Another solution (which I have not investigated yet) would be to use a pre-trained network in place of a newly initialized one; the initial predictions might be decent enough to avoid falling into that pit.

The problem I've just described doesn't really exist for multi-class problems, where the trivial solutions for one class would be a terrible solution for another class, which means the network cannot easily get stuck with such trivial solutions.

@bluesky314

Can someone explain the role of the '-1' in (distance(posmask) - 1) * posmask? If all the distances in distance(posmask) are reduced by 1, I'm guessing it does not make a big difference, as the values are usually between 0 and 300 in my application. So what is the purpose of the -1?

@HKervadec

Please see #8

@bluesky314

Thank you

@WingsOfPanda
Hi,

Thank you very much for your explanation of the problem with trivial solutions in binary segmentation. Would it be possible to avoid the problem by doing a one-hot binary segmentation? In other words, we cast the binary segmentation into a two-class segmentation (foreground and background); would this help avoid trivial solutions, as you describe for the multi-class case?

Thank you!

@HKervadec

In that case no, as the two class maps are perfect negatives of each other, so unfortunately the trivial minimum remains the same. (Notice that the distance map of class 0 is -(the distance map of class 1).)
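This mirroring can be checked numerically. With the -1 offset from the utils.py line, the two one-hot maps are negatives of each other up to a constant shift of 1, so their sum is exactly 1 everywhere (a sketch under my own reading of the code, using scipy's EDT as distance()):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def dist_map(posmask: np.ndarray) -> np.ndarray:
    negmask = ~posmask
    return distance(negmask) * negmask - (distance(posmask) - 1) * posmask

fg = np.zeros((7, 7), dtype=bool)
fg[2:5, 2:5] = True
res_fg, res_bg = dist_map(fg), dist_map(~fg)   # class 1 and class 0 maps
# res_bg == 1 - res_fg at every pixel, so a trivial minimum for one
# class is still a trivial minimum for the other.
```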

In the meantime, I have reconfirmed that the problem does not exist in a true multi-class setting:
[screenshot from the PhD thesis PDF]

@creativesalam

We've successfully integrated Surface loss in our application (I normalized it with the maximum distance in our case - we noticed that due to the much larger scale of surface loss, it dominated the convergence over other loss functions).

Hi @RSKothari,

How did you normalize the loss: per epoch or per batch? Can you kindly share your code? Does the max remain constant throughout?

@HKervadec

I do not know how they did it exactly, but this might interest you: #14 (comment)

That gist normalizes on a per-scan (2D or 3D) basis, to [-1, 1], though the resulting values may fall strictly in between.
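I don't have the linked gist at hand, but a per-scan normalization to [-1, 1] of the kind described might look like this (a sketch under my own assumptions, not the actual gist): scale the positive and negative parts of the distance map by their own maxima, so each sign lands in its own unit range.

```python
import numpy as np

def norm_per_scan(dist_map: np.ndarray) -> np.ndarray:
    """Normalize one scan's distance map into [-1, 1], per sign."""
    res = dist_map.astype(float)   # astype returns a copy
    pos = res > 0
    neg = res < 0
    if pos.any():
        res[pos] = res[pos] / res[pos].max()      # positive part -> (0, 1]
    if neg.any():
        res[neg] = res[neg] / (-res[neg].min())   # negative part -> [-1, 0)
    return res

dm = np.array([[3.0, 1.0, 0.0],
               [0.0, -2.0, -4.0]])
out = norm_per_scan(dm)
```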
