DICE coefficient loss function #99

mongoose54 · 2017-02-02T05:11:24Z

@FabianIsensee
I am trying to modify the categorical_crossentropy loss function to dice_coefficient loss function in the Lasagne Unet example. I found this implementation in Keras and I modified it for Theano like below:

def dice_coef(y_pred,y_true):
smooth = 1.0
y_true_f = T.flatten(y_true)
y_pred_f = T.flatten(T.argmax(y_pred, axis=1))
intersection = T.sum(y_true_f * y_pred_f)
return (2. * intersection + smooth) / (T.sum(y_true_f) + T.sum(y_pred_f) + smooth)

def dice_coef_loss(y_pred, y_true):
return dice_coef(y_pred, y_true)

I am not sure if there is problem with my implementation or Dice coefficient is not robust:. See output during training validation. In comparison when I use categorical crossentropy I get like AUC > 0.98. I was wondering if anyone has played with Dice on UNet.

Started Experiment
Epoch: 0
train accuracy: 0.129538 train loss: -0.146859272992
val accuracy: 0.209342 val loss: -0.282476756789 val AUC score: 0.776537015996
Epoch: 1
train accuracy: 0.418164 train loss: -0.110509629949
val accuracy: 0.820385 val loss: -0.00156800820105 val AUC score: 0.5
Epoch: 2
train accuracy: 0.375172 train loss: -0.129266330856
val accuracy: 0.790744 val loss: -0.00923636992406 val AUC score: 0.5
Epoch: 3
train accuracy: 0.581028 train loss: -0.0889976615506
val accuracy: 0.194278 val loss: -0.279695818208 val AUC score: 0.5

FabianIsensee · 2017-02-03T15:00:08Z

Hi Alex, I think the example in Keras is for a binary segmentation task. In that case you will not use categorical crossentropy and therefore never run into the argmax problem. The output of the network is simply the pseudo probability for a pixel being part of the foreground (some valve in medical images I believe). Since you use the argmax theano is not able to differentiate through your loss. Therefore it does not learn. Cheers, Fabian

…

On 02.02.2017 06:11, Alex wrote: @FabianIsensee <https://github.com/FabianIsensee> I am trying to modify the categorical_crossentropy loss function to dice_coefficient loss function in the Lasagne Unet example. I found this <https://github.com/jocicmarko/ultrasound-nerve-segmentation/blob/master/train.py#L21-L29> implementation in Keras and I modified it for Theano like below: def dice_coef(y_true, y_pred): smooth = 1.0 y_true_f = T.flatten(y_true) y_pred_f = T.flatten(T.argmax(y_pred,-1)) intersection = T.sum(y_true_f * y_pred_f) return (2. * intersection + smooth) / (T.sum(y_true_f) + T.sum(y_pred_f) + smooth) def dice_coef_loss(y_true, y_pred): return -dice_coef(y_true, y_pred) In the Lasagne UNet example the output of the flattened layer has shape (None, 2) where 2 is the number of classes. Hence there is argmax above. My problem is that the training loss and validation loss are very way larger (larger than 7000) compared to Keras implementation (e.g. smaller than 1.0). See below. Any thoughts? Suggestions? Started Experiment Epoch: 0 train accuracy: 0.895438 train loss: 21183.5203804 val accuracy: 0.827439 val loss: 7791.16928308 val AUC score: 0.0277577581283 Epoch: 1 train accuracy: 0.904572 train loss: 26913.4683549 val accuracy: 0.827439 val loss: 7791.16889587 val AUC score: 0.0278832780929 Epoch: 2 train accuracy: 0.902974 train loss: 26987.0089202 val accuracy: 0.827439 val loss: 7791.16909262 val AUC score: 0.493287245499 Epoch: 3 train accuracy: 0.901824 train loss: 23508.2736464 val accuracy: 0.827439 val loss: 7791.16787591 val AUC score: 0.0279265130406 Epoch: 4 train accuracy: 0.898711 train loss: 22958.4583263 val accuracy: 0.827439 val loss: 7791.16905814 val AUC score: 0.0279360519976 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#99>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKx1FMKOz0eHFhCppgAFHk7PQLsYfe1aks5rYWV9gaJpZM4L0uep>.

mongoose54 · 2017-02-04T00:34:56Z

Hi Fabian,

Your answer makes sense. My bad, argmax is not differentiable. (How does Theano even run though?Shouldn't it break? Or is it trying to approximate it?)

I will try with another solution to convert the target images into one-hot ones, so that no argmax is required. The reason I am insisting with dice coefficienct is that I think that it could be better than cross entropy for segmentation problems.

-Alex

FabianIsensee · 2017-02-07T14:19:58Z

Hi Alex, good Idea, I am doing something similar right now. Works quite well. You can look into the code of the theano argmax function. I think it returns constant 0, but i am not 100% sure. Cheers, Fabian

…

On 04.02.2017 01:34, Alex wrote: Hi Fabian, Your answer makes sense. My bad, argmax is not differentiable. (How does Theano even run though?Shouldn't it break? Or is it trying to approximate it?) I will try with another solution to convert the target images into one-hot ones, so that no argmax is required. The reason I am insisting with dice coefficienct is that I think that it could be better than cross entropy for segmentation problems. -Alex — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#99 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKx1FECQdYyALwnXKpDfw1tg1TouDo65ks5rY8exgaJpZM4L0uep>.

mongoose54 · 2017-02-21T03:27:00Z

@FabianIsensee I implemented the Dice score based on our discussion here. You can see the gist here: https://gist.github.com/mongoose54/71e174587fbec8c2fe970e8a1c14eff4 Although it is not complaining when using as a metric I am getting some weird numbers some times. Have you been able to implement it? Would it be possible to share?
-Alex

FabianIsensee · 2017-02-22T08:22:34Z

Hi Alex,

sure. It's not like it's a secret or anything ;-)

def soft_dice(y_pred, y_true):
    # y_pred is softmax output of shape (num_samples, num_classes)
    # y_true is one hot encoding of target (shape= (num_samples, num_classes))
    intersect = T.sum(y_pred * y_true, 0)
    denominator = T.sum(y_pred, 0) + T.sum(y_true, 0)
    dice_scores = T.constant(2) * intersect / (denominator + T.constant(1e-6))
    return dice_scores

Make sure that you input shapes are correct and that you use one hot encoding. My implementation will return a soft dice score for each class (output shape is (num_classes, )). I got some decent results with it (same as state of the art on BraTS 2015 train data). Note that this implementation ignores that there may be more than one sample in the batch. You should be able to modify my implementation so that it can deal with batches > 1 as well.

If you have any suggestions for improvements please let me know.

Cheers,
Fabian

IMG-PRCSNG · 2017-09-29T06:25:11Z

IIUC, Should we do a negative of weighted sum of the dice scores for computing the loss to back propagate for a multi class problem? @FabianIsensee

sajjo79 · 2017-11-27T20:25:23Z

Hi FabianIsensee,
Can you please explain the dice loss calculation for multi-class problem.
Best

f0k · 2017-11-28T18:58:06Z

Can you please explain the dice loss calculation for multi-class problem.

The way Fabian implemented it each class will be treated as a binary problem, and you will get a score per class in the end. For training, you can try using the negative sum of the scores as the total loss to minimize.

sajjo79 · 2017-11-28T19:11:20Z

@f0k,
Thanks for response.
I have implemented it in caffe and code is listed below: However when i generate predictions that are binary. An output is attached below:
`
class DiceLossLayer(caffe.Layer):

def forward(self, bottom, top):
	self.diff[...] = bottom[1].data
	top[0].data[...] = 1 - self.dice_coef_multi_class(bottom[0], bottom[1])

def backward(self, top, propagate_down,  bottom):
	if propagate_down[1]:
	 	raise Exception("label not diff")
	elif propagate_down[0]:
		a=(-2. * self.diff + self.dice) / self.sum
		bottom[0].diff[...] = a
	else:
	 	raise Exception("no diff")
	# =============================

def dice_coef_multi_class(self, y_pred, y_true):
	n_classes = 5
	smooth=np.float32(1e-7)
	y_true=y_true.data
	y_pred=y_pred.data
	y_pred = np.argmax(y_pred, 1)
	y_pred = np.expand_dims(y_pred,1)

	y_pred=np.ndarray.flatten(y_pred)
	y_true = np.ndarray.flatten(y_true)

	dice = np.zeros(n_classes)
	self.sum = np.zeros([n_classes])
	for i in range(n_classes):
		y_true_i = np.equal(y_true, i)
		y_pred_i = np.equal(y_pred, i)
		self.sum[i] = np.sum(y_true_i) + np.sum(y_pred_i) + smooth
		dice[i] = (2. * np.sum(y_true_i * y_pred_i) + smooth) / self.sum[i]
	self.sum=np.sum(self.sum)
	self.dice=np.sum(dice)
	return self.dice

`

f0k · 2017-11-28T19:25:26Z

Your implementation looks complicated; can't you directly translate Fabian's version to numpy?

sajjo79 · 2017-11-28T19:37:22Z

@f0k
Please look at only three methods which are def forward(...), def backward(...). these methods are required for caffe.

FabianIsensee · 2017-11-29T07:13:07Z

Hi, sorry for being late to the party. I would strongly suggest to directly translate my implementation to numpy because that is much easier to read. I have it somewhere as well (will try to find it).
One thing that I absolutely don't like about caffe is that you manually need to implement the gradient which is where it is very easy to make mistakes (I did not check your backward implementation).
The implementation of dice_coef_multi_class looks fine to me.
For how long did you train the network? It very often happens that the network will start out like in your figure and get better over time.

edit: I saw a mistake in dice_coef_multi_class (and thus probably the gradient as well). You want a loss function, so something that gets lower the better the network is. Therefore, like Jan mentioned previously, return -self.dice!

FabianIsensee · 2017-11-29T07:24:47Z

Could not find the numpy implementation, made a new one instead. This one will correctly work with batches (which the previous one did not) and is compatible with nd tensors (2d, 3d, etc segmentations):

def soft_dice_numpy(y_pred, y_true, eps=1e-7):
    '''
    c is number of classes
    :param y_pred: b x c x X x Y( x Z...) network output, must sum to 1 over c channel (such as after softmax)
    :param y_true: b x c x X x Y( x Z...) one hot encoding of ground truth
    :param eps: 
    :return: 
    '''
    
    axes = tuple(range(2, len(y_pred.shape)))
    intersect = np.sum(y_pred * y_true, axes)
    denom = np.sum(y_pred + y_true, axes)
    return - (2. *intersect / (denom + eps)).mean()

25b3nk · 2018-03-24T11:23:18Z

@mongoose54 Should the output of the dice_coef_loss be negated ?
def dice_coef_loss(y_pred, y_true):
return - dice_coef(y_pred, y_true)

FabianIsensee · 2018-03-25T19:01:01Z

Yes. It's a loss function and lasagne minimizes the loss. In order to maximize the dice, you need to minimize the negative dice loss

jeremyjordan · 2018-05-25T13:41:28Z

According to this dissertation (page 72) in which the author discusses their paper, V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation, the values in the denominator should be squared.

I've adapted @FabianIsensee 's solution in this Gist and welcome critique/discussion of the changes.

FabianIsensee · 2018-05-25T13:45:50Z

Hi Jeremy,
whether you square any of these values is a design choice. Basically it depends on how you approximate the intersection / cardinality of the segmentations. Since the dice loss only approximates the dice, you can choose whatever approximation you want. Both are correct.
The first paper that used the dice loss (that I am aware of) is this one: https://arxiv.org/pdf/1608.04117.pdf
They use the same approximation as I do. I very much prefer this formulation and I am using it for all my research.
Regards,
Fabian

jeremyjordan · 2018-05-25T13:49:57Z

Oh I see, thanks! :)

jizhang02 · 2019-03-04T17:26:24Z

Hello,@FabianIsensee
When I use dice as loss function, the predicted image was all zeros!
Could you help me to analyze it?

FabianIsensee · 2019-03-05T07:20:23Z

Hi,
you are not exactly giving a lot of detail. My guess is that your learning rate etc is not optimal OR you should consider optimizing the background as well (also compute the dice loss for the background task)

jizhang02 · 2019-03-06T14:52:18Z

Hi,
you are not exactly giving a lot of detail. My guess is that your learning rate etc is not optimal OR you should consider optimizing the background as well (also compute the dice loss for the background task)

thank you for reply!
This is my dice loss function. Under implemention of U-Net.
def dice_coef(y_true, y_pred):
smooth = 1
y_true_f = K.flatten(y_true)
y_pred_f = K.flatten(y_pred)
intersection = K.sum(y_true_f * y_pred_f)
return (2. * intersection +smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) +smooth)

def dice_coef_loss(y_true, y_pred):
print("dice loss")
return 1-dice_coef(y_true, y_pred)
...
model.compile(optimizer = Adam(lr = 1e-4), loss = dice_loss, metrics = ['accuracy'])
So ,I was so struggling where is going wrong.

liuyipei · 2019-12-18T01:36:36Z

@FabianIsensee
@jeremyjordan

whether you square any of these values is a design choice. Basically it depends on how you approximate the intersection / cardinality of the segmentations. Since the dice loss only approximates the dice, you can choose whatever approximation you want. Both are correct.

It is really unfair to call it a design choice. The squared formulation is better because it has an obvious mathematical geometric meaning. Consider the cosine law from high school:
https://www.mathsisfun.com/algebra/trig-cosine-law.html

Immediate from the cosine law, the squared formulation is the cosine of the prediction and the target, viewed as vectors. The non-squared version is merely "not wrong" per-se, but it takes some mental gymnastics to make it into something meaningful mathematically.

FabianIsensee · 2019-12-18T07:20:57Z

Hi @liuyipei , I must admit I am probably not as good of a mathematician as I would like to be - there may certainly be theoretical advantages of squaring vs not squaring. I am a man of results though and at least in my experiments squaring does not perform as well. That could be to a number of reasons (hyperparameters tuned for non-squaring), so I am not saying that this observation is going to be true for everyone. But since my results are performing really well on several segmentation leaderboards I am confident that not squaring is a non-issue in practice :-)
Best,
Fabian

f0k mentioned this issue Apr 5, 2017

Jaccard Similarity Coefficient (objective accuracy) Lasagne/Lasagne#818

Open

xychenunc mentioned this issue Mar 20, 2018

Generalized dice loss for multi-class segmentation keras-team/keras#9395

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DICE coefficient loss function #99

DICE coefficient loss function #99

mongoose54 commented Feb 2, 2017 •

edited

Loading

FabianIsensee commented Feb 3, 2017 via email

mongoose54 commented Feb 4, 2017

FabianIsensee commented Feb 7, 2017 via email

mongoose54 commented Feb 21, 2017

FabianIsensee commented Feb 22, 2017 •

edited

Loading

IMG-PRCSNG commented Sep 29, 2017 •

edited

Loading

sajjo79 commented Nov 27, 2017

f0k commented Nov 28, 2017

sajjo79 commented Nov 28, 2017 •

edited

Loading

f0k commented Nov 28, 2017

sajjo79 commented Nov 28, 2017

FabianIsensee commented Nov 29, 2017 •

edited

Loading

FabianIsensee commented Nov 29, 2017

25b3nk commented Mar 24, 2018

FabianIsensee commented Mar 25, 2018

jeremyjordan commented May 25, 2018

FabianIsensee commented May 25, 2018

jeremyjordan commented May 25, 2018

jizhang02 commented Mar 4, 2019

FabianIsensee commented Mar 5, 2019

jizhang02 commented Mar 6, 2019

liuyipei commented Dec 18, 2019 •

edited

Loading

FabianIsensee commented Dec 18, 2019

DICE coefficient loss function #99

DICE coefficient loss function #99

Comments

mongoose54 commented Feb 2, 2017 • edited Loading

FabianIsensee commented Feb 3, 2017 via email

mongoose54 commented Feb 4, 2017

FabianIsensee commented Feb 7, 2017 via email

mongoose54 commented Feb 21, 2017

FabianIsensee commented Feb 22, 2017 • edited Loading

IMG-PRCSNG commented Sep 29, 2017 • edited Loading

sajjo79 commented Nov 27, 2017

f0k commented Nov 28, 2017

sajjo79 commented Nov 28, 2017 • edited Loading

f0k commented Nov 28, 2017

sajjo79 commented Nov 28, 2017

FabianIsensee commented Nov 29, 2017 • edited Loading

FabianIsensee commented Nov 29, 2017

25b3nk commented Mar 24, 2018

FabianIsensee commented Mar 25, 2018

jeremyjordan commented May 25, 2018

FabianIsensee commented May 25, 2018

jeremyjordan commented May 25, 2018

jizhang02 commented Mar 4, 2019

FabianIsensee commented Mar 5, 2019

jizhang02 commented Mar 6, 2019

liuyipei commented Dec 18, 2019 • edited Loading

FabianIsensee commented Dec 18, 2019

mongoose54 commented Feb 2, 2017 •

edited

Loading

FabianIsensee commented Feb 22, 2017 •

edited

Loading

IMG-PRCSNG commented Sep 29, 2017 •

edited

Loading

sajjo79 commented Nov 28, 2017 •

edited

Loading

FabianIsensee commented Nov 29, 2017 •

edited

Loading

liuyipei commented Dec 18, 2019 •

edited

Loading