
negative loss #50

Closed

huangnengCSU opened this issue Sep 24, 2019 · 12 comments

Comments
@huangnengCSU commented Sep 24, 2019

Hi,
I am using the GPU version of RNNTLoss, but the loss is negative.
The shapes of the logits and targets are as follows:
targets: [batch_size, U]
logits: [batch_size, T, U+1, hidden]
RNNTLoss(logits, targets, input_len, target_len)
[Screenshot 2019-09-24 9:04 PM: negative loss when running on GPU]
However, when I run on the CPU, the loss is normal.
[Screenshot 2019-09-24 9:22 PM: normal loss when running on CPU]
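For reference, a minimal sketch of that call pattern, assuming the warprnnt_pytorch binding from this repo; the sizes, dtypes, and device placement below are illustrative, not taken from the report above:

import torch
from warprnnt_pytorch import RNNTLoss

B, T, U, V = 2, 4, 3, 5                                            # batch, time, target length, vocab (illustrative)
logits = torch.randn(B, T, U + 1, V).cuda()                        # raw joint-network outputs, no log_softmax applied
targets = torch.randint(1, V, (B, U), dtype=torch.int32).cuda()    # padded label sequences
input_len = torch.tensor([T] * B, dtype=torch.int32).cuda()
target_len = torch.tensor([U] * B, dtype=torch.int32).cuda()       # whether these must live on the GPU depends on the binding version

crit = RNNTLoss()
loss = crit(logits, targets, input_len, target_len)                # a correct run should give a non-negative value
print(loss.item())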

@HawkAaron (Owner)

The GPU version applies log_softmax internally. Did you apply log_softmax to the activations?

@huangnengCSU (Author)

@HawkAaron
There is no activation function applied before the RNNT loss.

@HawkAaron (Owner)

Did you follow the same procedure as below?

GPU: logits -> RNNTLoss
CPU: logits -> log_softmax -> RNNTLoss
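A sketch of those two call paths, assuming the warprnnt_pytorch binding; logits, targets, and the length tensors are placeholders:

import torch.nn.functional as F
from warprnnt_pytorch import RNNTLoss

crit = RNNTLoss()

# GPU: the CUDA kernel applies log_softmax internally, so feed raw logits.
loss_gpu = crit(logits.cuda(), targets, input_len, target_len)

# CPU: normalize the activations first, then call the loss.
log_probs = F.log_softmax(logits.cpu(), dim=-1)
loss_cpu = crit(log_probs, targets, input_len, target_len)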

@huangnengCSU (Author) commented Sep 25, 2019

@HawkAaron
My code is as follows:

crit = RNNTLoss()
logits = joint(enc_state, dec_state)
loss = crit(logits, targets, input_lengths, target_lengths)

When running on the CPU, there is a log_softmax in your implementation before RNNTLoss, so I don't need to apply an activation to the logits.
[Screenshot 2019-09-25 10:46 AM: log_softmax applied in the CPU wrapper]

When running on the GPU, the log_softmax is inside RNNTLoss, so I also don't need to apply an activation to the logits.

BTW, if I run the network on the GPU but move the logits back to the CPU to compute the RNNTLoss, the loss is normal again.
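The workaround in the last sentence, continuing the snippet above (a sketch; crit, joint, and the other tensors are the same placeholders):

# Network runs on the GPU; only the loss is computed on the CPU.
logits = joint(enc_state, dec_state)                     # CUDA tensor
loss = crit(logits.cpu(), targets.cpu(),
            input_lengths.cpu(), target_lengths.cpu())   # .cpu() keeps the autograd graph intact
loss.backward()                                          # gradients flow back to the GPU tensors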

@HawkAaron (Owner)

Could you update the code and rebuild again? Everything is OK in my environment.
Please provide more information (Python, PyTorch, CUDA, gcc versions) if the problem happens again.

@huangnengCSU (Author) commented Sep 26, 2019

I have updated to the latest code, but the problem persists, so for now I compute the RNNTLoss on the CPU. Maybe I will find the cause later when I debug my code.

@kate-chuikova commented Oct 4, 2019

I have the same problem with tensorflow 1.15.0.rc2 (CUDA 10, gcc 7.4.0); all tests (gpu_test, cpu_tests, tensorflow tests) pass.

Previously I used tf 1.10 and everything was OK; then I upgraded tf, updated and rebuilt RNNTLoss, and got a negative loss.

I install RNNTLoss with the following steps in Docker:

ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
RUN cd /rnnt/rnnt-loss && mkdir /rnnt/rnnt-loss/build
RUN cd /rnnt/rnnt-loss/build && cmake -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME .. && \
    make
RUN ldconfig /usr/local/cuda/lib64/stubs && \
    cd /rnnt/rnnt-loss/tensorflow_binding && CUDA=$CUDA_HOME python3 setup.py install
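If the build and install succeed, a short sanity check along these lines (assuming the binding is importable as warprnnt_tensorflow and exposes an rnnt_loss op, as in this repo's tensorflow_binding; shapes and values are illustrative) should run and print a finite cost:

import numpy as np
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss

B, T, U, V = 1, 3, 2, 4                                   # batch, time, target length, vocab
acts = tf.constant(np.random.rand(B, T, U + 1, V).astype(np.float32))
labels = tf.constant([[1, 2]], dtype=tf.int32)            # label indices avoid the blank index (0 by default)
input_lengths = tf.constant([T], dtype=tf.int32)
label_lengths = tf.constant([U], dtype=tf.int32)

costs = rnnt_loss(acts, labels, input_lengths, label_lengths)
with tf.Session() as sess:                                # TF 1.x style, matching the versions in this thread
    print(sess.run(costs))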

@kate-chuikova commented Oct 7, 2019

UPD:
I tried to rerun the experiment with the updated version of RNNTLoss and tf 1.10 and still got a negative loss.
My previous RNNTLoss version was at commit 6e33845, and everything was OK with tf 1.10 (but that version doesn't work with tf 1.15).

Also, I got the following error with tf 1.10:
src/warprnnt_op.cc:8:52: fatal error: tensorflow/core/framework/bounds_check.h: No such file or directory
This file was renamed for tf >= 1.14 support and no longer exists for tf 1.10.

UPD2:
I found a bug in my code: the problem was a wrong blank_index. Everything is OK now.

@mricepops

Just a quick question: does the blank label have to be 0? And does this differ between the GPU and CPU versions?

@BuaaAlban

(quoting @kate-chuikova's update above about the wrong blank_index)

I also have a question: is it reasonable to get a negative loss? Maybe it is because the probability is the sum over all alignments, which means the summed probability could be larger than 1 and the loss could be negative?
BTW, what do you mean by "the problem was in wrong blank_index"?

@HawkAaron (Owner)

The recent pull request #64 may fix this issue.

@HawkAaron (Owner)

@Sudharsansai The blank label can be any class inside your vocab (with blank included).
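In other words, the blank index passed to the loss just has to match the output slot your joint network reserves for blank, and the target sequences must not contain that index. A sketch with the PyTorch binding (the keyword name for the blank index may differ between versions of the binding):

import torch
from warprnnt_pytorch import RNNTLoss

B, U, V = 2, 3, 5                                          # illustrative sizes
crit = RNNTLoss(blank=0)                                   # here class 0 is reserved for blank
targets = torch.randint(1, V, (B, U), dtype=torch.int32)   # real labels avoid the blank index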
