
Negative loss and extreme cases. #18

Closed
RSKothari opened this issue Sep 11, 2019 · 13 comments
RSKothari commented Sep 11, 2019

Hi, thanks for sharing your repository.

I went over the paper and I'm stuck at line 208 of utils.py

res[c] = distance(negmask) * negmask - (distance(posmask) - 1) * posmask

I noticed that if the ground truth is all 1s, it results in an irrecoverably negative loss (the edt function returns distances from the matrix edges in the all-1s case). It might be a good idea to hardwire the loss to 0 in such cases.

Do you think it would be wise to weight by the maximum distance possible for a particular image, i.e. sqrt((h-1)**2 + (w-1)**2)? This ensures the loss is always capped at 1.

Finally, does the surface loss function provide a safeguard against gradient ascent (i.e., a negative loss)? I noticed it tends to go negative very quickly in the early epochs.
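For reference, the distance-map line in question can be sketched like this (my reading of the utils.py line, not the authors' exact code, assuming scipy's distance_transform_edt as distance()):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def signed_dist_map(posmask: np.ndarray) -> np.ndarray:
    """Sketch of the utils.py line: positive outside the object, <= 0 inside."""
    negmask = ~posmask
    return distance(negmask) * negmask - (distance(posmask) - 1) * posmask

# 7x7 image with a 3x3 foreground square
posmask = np.zeros((7, 7), dtype=bool)
posmask[2:5, 2:5] = True
res = signed_dist_map(posmask)
# Interior foreground pixels get negative values (res[3, 3] == -1),
# boundary foreground pixels get 0 (thanks to the -1), background pixels
# stay positive, so the optimum of sum(res * p_foreground) is negative.
```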

@RSKothari RSKothari changed the title Purpose of -1 in one_hot2dist Negative loss and extreme cases. Sep 11, 2019
@RSKothari

I did read your responses on OpenReview, and you explained well in the paper that the global minimum is indeed negative. I guess it then becomes a case-by-case matter of balancing it against GDL.

I also noticed that the Hausdorff distance could potentially contain an error.

from scipy.spatial.distance import directed_hausdorff
import numpy as np

def numpy_haussdorf(pred: np.ndarray, target: np.ndarray) -> float:
    assert len(pred.shape) == 2
    assert pred.shape == target.shape

    return max(directed_hausdorff(pred, target)[0], directed_hausdorff(target, pred)[0])

The directed_hausdorff function, in your case, directly takes a 0/1 map. However, this function takes a list of x and y coordinates of pixel locations. Perhaps I misunderstood the function?
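To illustrate the point: SciPy's directed_hausdorff expects (n_points, n_dims) coordinate arrays, so a 0/1 map would first need to be converted to pixel coordinates, e.g. with np.argwhere. A small sketch (example masks are my own, not from the repo):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

pred = np.zeros((8, 8)); pred[0, 0] = 1      # single predicted pixel at (0, 0)
target = np.zeros((8, 8)); target[3, 4] = 1  # single target pixel at (3, 4)

# Coordinate-based call: the true symmetric Hausdorff distance is 5.0
p_pts, t_pts = np.argwhere(pred), np.argwhere(target)
hd = max(directed_hausdorff(p_pts, t_pts)[0],
         directed_hausdorff(t_pts, p_pts)[0])

# Passing the raw 0/1 maps instead treats each row as a point in 8-D
# space, which measures something else entirely.
hd_raw = max(directed_hausdorff(pred, target)[0],
             directed_hausdorff(target, pred)[0])
```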

@HKervadec

Hey there,

I noted that if the ground truth had all 1s, it would result in a irreversible negative loss (the edt function returns distances from matrix edges in the case of all 1s). Might be a good idea to hardwire a loss of 0s during such events.

I haven't bothered with that yet, as I consider that those trivial cases would be dealt with by the other loss. Notice also that if there are no positive pixels at all, the distance map stays at 0.

Do you think it would be wise to weight by the maximum distance possible for a particular image? That is, sqrt((h-1)**2 + (w-1)**2). This ensures that the loss function will always be capped to 1.

I would consider it if the images were much bigger, but so far I have not tried that. I don't think it would change much (but I could be wrong).

Finally, does the surface loss function provide a safe guard against gradient ascent (i.e negative loss)? I noticed that it tends to get negative very quickly in the early epochs.

No. The negative values and/or gradients do not make it a gradient ascent.

The directed_hausdorff function, in your case, directly takes a 0/1 map. However, this function takes a list of x and y coordinates of pixel locations. Perhaps I misunderstood the function?

You raise a very good point about the Hausdorff function. While the way I call the function is correct, it is not optimal, for two reasons:

  • the computed value when both the prediction and the ground truth are empty is 0 (which lowers the overall value when we average over the dataset);
  • I did not use the spatial resolution of the image when computing it (it varies a tiny bit between some images).

While this was not really an issue for the comparison we were making in the paper, I recently switched to the MedPy Hausdorff function. This is for the extension of the paper, which will include code for the 3D case, and which I haven't made public yet.

I might have missed something (currently traveling) so let me know if I was not clear or misunderstood you.

@RSKothari

Yes, you are right about the negative surface loss values: they do not affect the gradient. Apologies. You also rightly point out in the paper that the optimum value is negative.

Glad to know you switched Hausdorff functions. I'm still not sure how the directed_hausdorff call would be correct, but I would need to run detailed experiments comparing the results.

We've successfully integrated surface loss in our application (I normalized it by the maximum distance in our case; we noticed that, due to its much larger scale, the surface loss otherwise dominated convergence over the other loss functions).
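The normalization described above can be sketched as follows (my own reconstruction of the idea, not our exact code): divide the distance map by the largest possible distance in the image, so every per-pixel loss term stays within [-1, 1].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def normalized_dist_map(posmask: np.ndarray) -> np.ndarray:
    h, w = posmask.shape
    d_max = np.sqrt((h - 1) ** 2 + (w - 1) ** 2)  # image diagonal
    negmask = ~posmask
    res = distance(negmask) * negmask - (distance(posmask) - 1) * posmask
    return res / d_max  # per-pixel values now bounded by [-1, 1]

posmask = np.zeros((7, 7), dtype=bool)
posmask[2:5, 2:5] = True
res = normalized_dist_map(posmask)
```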

We will be citing your work.

@HKervadec

We've successfully integrated Surface loss in our application (I normalized it with the maximum distance in our case - we noticed that due to the much larger scale of surface loss, it dominated the convergence over other loss functions).

Glad to hear that! And thanks for the feedback about normalizing; this is interesting.

@RSKothari

RSKothari commented Feb 3, 2020

Hi @HKervadec, I am confused about one aspect. Since surface loss on its own has no target (the value is a weighted sum of probability outputs), the network doesn't know where to move.

W^1 = W^0 - a*G

Here G is the gradient of the loss with respect to that weight.

For usual loss functions, G is derived from G = d F(Y_gndtrth, Y_op) / d(W), where F is a function dependent on Y_gndtrth. However, surface loss has no Y_gndtrth, which results in empty gradients.

My questions are:

  1. How do you prove that when used in conjunction with other losses, boundary loss contributes to the gradient in any manner?

  2. What is the advantage of surface loss over weighted cross entropy wherein the weight is derived using, say, Hausdorff?

Edit: I've answered question 1 myself. The premise behind any gradient-descent algorithm is to move the objective function (in this case, a weighted sum) towards a large negative value. I still fail to understand, though, why training solely with boundary loss is difficult.

@RSKothari RSKothari reopened this Feb 3, 2020
@HKervadec

HKervadec commented Feb 10, 2020

Hejhej,

  1. What is the advantage of surface loss over weighted cross entropy wherein the weight is derived using, say, Hausdorff?

A weighted cross-entropy simply applies a higher penalty depending on the distance, but it does not actually compute the distance between the two segmentations. That is the goal of our loss, where we show that the final loss is equivalent to the distance computation of Equation (2). It is much more principled than just throwing weights around.

I fail to understand though, why do we find difficulty in training solely with boundary loss?

It is a problem of trivial solutions. You've noticed that the objective function is supposed to end up at a negative value. A perfect segmentation would have the following probabilities:

  • 0 for all background pixels (which have a distance value > 0)
  • 1 for all foreground pixels (which have a distance value < 0)

Meaning that the ideal loss is negative.

Now, for small objects (though it is true in a lot of cases), that negative value might not be that far from 0. Therefore, if you predict everything with a probability of 0, you end up with a loss of 0, which is a local minimum, and which might not be that far from the global minimum. The gradient of that prediction won't be very high either, which means that your network will remain "stuck" there.

This is why, for binary problems, we used the loss in conjunction with another loss. Another solution (which I have not investigated yet) would be to use a pre-trained network in place of a newly initialized one; the initial predictions might be decent enough to avoid falling into that pit.

The problem I've just described doesn't really exist for multi-class problems, where the trivial solutions for one class would be a terrible solution for another class, which means the network cannot easily get stuck with such trivial solutions.

@bluesky314

Can someone explain the role of the '-1' in (distance(posmask) - 1) * posmask? If all the distances in distance(posmask) are reduced by 1, I'm guessing it does not make a big difference, as the values are usually between 0 and 300 in my application. So what is the purpose of the -1?

@HKervadec

Please see #8

@bluesky314

Thank you

@WingsOfPanda
Hi,

Thank you very much for your explanation of the problem with trivial solutions in binary segmentation. Would it be possible to avoid the problem by doing a one-hot binary segmentation? In other words, we cast the binary segmentation into a two-class segmentation (foreground and background); would this help avoid trivial solutions, as you describe for the multi-class case?

Thank you!

@HKervadec

In that case no, as the two class maps are perfect negatives of each other, so unfortunately the trivial minimum remains the same. (Notice that the distance map of class 0 is -(the distance map of class 1).)
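This mirroring can be checked numerically. With the -1 offset from the utils.py line, the two one-hot maps are negatives of each other up to a constant shift of 1, so their sum is exactly 1 everywhere (a sketch under my own reading of the code, using scipy's EDT as distance()):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as distance

def dist_map(posmask: np.ndarray) -> np.ndarray:
    negmask = ~posmask
    return distance(negmask) * negmask - (distance(posmask) - 1) * posmask

fg = np.zeros((7, 7), dtype=bool)
fg[2:5, 2:5] = True
res_fg, res_bg = dist_map(fg), dist_map(~fg)   # class 1 and class 0 maps
# res_bg == 1 - res_fg at every pixel, so a trivial minimum for one
# class is still a trivial minimum for the other.
```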

In the meantime, I have reconfirmed that the problem does not exist in a true multi-class setting:
[screenshot from the PhD thesis PDF]

@creativesalam

We've successfully integrated Surface loss in our application (I normalized it with the maximum distance in our case - we noticed that due to the much larger scale of surface loss, it dominated the convergence over other loss functions).

Hi @RSKothari,

How did you normalize the loss: per epoch or per batch? Can you kindly share your code? Does the max remain constant throughout?

@HKervadec

I do not know how they did it exactly, but this might interest you: #14 (comment)

That gist normalizes on a per-scan (2D or 3D) basis, to [-1, 1], though the resulting values may fall strictly in between.
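I don't have the linked gist at hand, but a per-scan normalization to [-1, 1] of the kind described might look like this (a sketch under my own assumptions, not the actual gist): scale the positive and negative parts of the distance map by their own maxima, so each sign lands in its own unit range.

```python
import numpy as np

def norm_per_scan(dist_map: np.ndarray) -> np.ndarray:
    """Normalize one scan's distance map into [-1, 1], per sign."""
    res = dist_map.astype(float)   # astype returns a copy
    pos = res > 0
    neg = res < 0
    if pos.any():
        res[pos] = res[pos] / res[pos].max()      # positive part -> (0, 1]
    if neg.any():
        res[neg] = res[neg] / (-res[neg].min())   # negative part -> [-1, 0)
    return res

dm = np.array([[3.0, 1.0, 0.0],
               [0.0, -2.0, -4.0]])
out = norm_per_scan(dm)
```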
