
negative loss #50

Closed

huangnengCSU opened this issue Sep 24, 2019 · 12 comments

Comments
@huangnengCSU commented Sep 24, 2019

Hi,
I am using the GPU version of RNNTLoss, but the loss is negative.
The shapes of the logits and targets are as follows:
targets: [batch_size, U]
logits: [batch_size, T, U+1, hidden]
RNNTLoss(logits, targets, input_len, target_len)
[Screenshot 2019-09-24 9:04 PM: negative loss when running on GPU]
However, when I run on the CPU, the loss is normal.
[Screenshot 2019-09-24 9:22 PM: normal loss when running on CPU]
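For reference, a minimal sketch of that call pattern, assuming the warprnnt_pytorch binding from this repo; the sizes, dtypes, and device placement below are illustrative, not taken from the report above:

import torch
from warprnnt_pytorch import RNNTLoss

B, T, U, V = 2, 4, 3, 5                                            # batch, time, target length, vocab (illustrative)
logits = torch.randn(B, T, U + 1, V).cuda()                        # raw joint-network outputs, no log_softmax applied
targets = torch.randint(1, V, (B, U), dtype=torch.int32).cuda()    # padded label sequences
input_len = torch.tensor([T] * B, dtype=torch.int32).cuda()
target_len = torch.tensor([U] * B, dtype=torch.int32).cuda()       # whether these must live on the GPU depends on the binding version

crit = RNNTLoss()
loss = crit(logits, targets, input_len, target_len)                # a correct run should give a non-negative value
print(loss.item())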

@HawkAaron (Owner)

The GPU version applies log_softmax internally. Did you apply log_softmax to the activations?

@huangnengCSU (Author)

@HawkAaron
There is no activation function applied before the RNNT loss.

@HawkAaron (Owner)

Did you follow the same procedure as below?

GPU: logits -> RNNTLoss
CPU: logits -> log_softmax -> RNNTLoss
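A sketch of those two call paths, assuming the warprnnt_pytorch binding; logits, targets, and the length tensors are placeholders:

import torch.nn.functional as F
from warprnnt_pytorch import RNNTLoss

crit = RNNTLoss()

# GPU: the CUDA kernel applies log_softmax internally, so feed raw logits.
loss_gpu = crit(logits.cuda(), targets, input_len, target_len)

# CPU: normalize the activations first, then call the loss.
log_probs = F.log_softmax(logits.cpu(), dim=-1)
loss_cpu = crit(log_probs, targets, input_len, target_len)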

@huangnengCSU (Author) commented Sep 25, 2019

@HawkAaron
My code is as follows:

crit = RNNTLoss()
logits = joint(enc_state, dec_state)
loss = crit(logits, targets, input_lengths, target_lengths)

When running on the CPU, there is a log_softmax in your implementation before RNNTLoss, so I don't need to apply an activation to the logits.
[Screenshot 2019-09-25 10:46 AM: log_softmax applied in the CPU wrapper]

When running on the GPU, the log_softmax is inside RNNTLoss, so I also don't need to apply an activation to the logits.

BTW, if I run the network on the GPU but move the logits back to the CPU to compute the RNNTLoss, the loss is normal again.
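The workaround in the last sentence, continuing the snippet above (a sketch; crit, joint, and the other tensors are the same placeholders):

# Network runs on the GPU; only the loss is computed on the CPU.
logits = joint(enc_state, dec_state)                     # CUDA tensor
loss = crit(logits.cpu(), targets.cpu(),
            input_lengths.cpu(), target_lengths.cpu())   # .cpu() keeps the autograd graph intact
loss.backward()                                          # gradients flow back to the GPU tensors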

@HawkAaron (Owner)

Could you update the code and rebuild again? Everything is OK in my environment.
Please provide more information (Python, PyTorch, CUDA, gcc versions) if the problem happens again.

@huangnengCSU (Author) commented Sep 26, 2019

I have updated to the latest code, but the problem persists, so for now I compute the RNNTLoss on the CPU. Maybe I will find the cause later when I debug my code.

@kate-chuikova commented Oct 4, 2019

I have the same problem with tensorflow 1.15.0.rc2 (CUDA 10, gcc 7.4.0); all tests (gpu_test, cpu_tests, tensorflow tests) pass.

Previously I used tf 1.10 and everything was OK; then I upgraded tf, updated and rebuilt RNNTLoss, and got a negative loss.

I install RNNTLoss with the following steps in Docker:

ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
RUN cd /rnnt/rnnt-loss && mkdir /rnnt/rnnt-loss/build
RUN cd /rnnt/rnnt-loss/build && cmake -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME .. && \
    make
RUN ldconfig /usr/local/cuda/lib64/stubs && \
    cd /rnnt/rnnt-loss/tensorflow_binding && CUDA=$CUDA_HOME python3 setup.py install
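If the build and install succeed, a short sanity check along these lines (assuming the binding is importable as warprnnt_tensorflow and exposes an rnnt_loss op, as in this repo's tensorflow_binding; shapes and values are illustrative) should run and print a finite cost:

import numpy as np
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss

B, T, U, V = 1, 3, 2, 4                                   # batch, time, target length, vocab
acts = tf.constant(np.random.rand(B, T, U + 1, V).astype(np.float32))
labels = tf.constant([[1, 2]], dtype=tf.int32)            # label indices avoid the blank index (0 by default)
input_lengths = tf.constant([T], dtype=tf.int32)
label_lengths = tf.constant([U], dtype=tf.int32)

costs = rnnt_loss(acts, labels, input_lengths, label_lengths)
with tf.Session() as sess:                                # TF 1.x style, matching the versions in this thread
    print(sess.run(costs))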

@kate-chuikova commented Oct 7, 2019

UPD:
I tried to rerun the experiment with the updated version of RNNTLoss and tf 1.10 and still got a negative loss.
My previous RNNTLoss version was at commit 6e33845, and everything was OK with tf 1.10 (but that version doesn't work with tf 1.15).

Also, I got the following error with tf 1.10:
src/warprnnt_op.cc:8:52: fatal error: tensorflow/core/framework/bounds_check.h: No such file or directory
This file was renamed for tf >= 1.14 support and no longer exists for tf 1.10.

UPD2:
I found a bug in my code: the problem was a wrong blank_index. Everything is OK now.

@mricepops

Just a quick question: does the blank label have to be 0? And does this differ between the GPU and CPU versions?

@BuaaAlban

(quoting @kate-chuikova's update above about the wrong blank_index)

I also have a question: is it reasonable to get a negative loss? Maybe it is because the probability is the sum over all alignments, which means the summed probability could be larger than 1 and the loss could be negative?
BTW, what do you mean by "the problem was in wrong blank_index"?

@HawkAaron (Owner)

The recent pull request #64 may fix this issue.

@HawkAaron (Owner)

@Sudharsansai The blank label can be any class inside your vocab (with blank included).
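In other words, the blank index passed to the loss just has to match the output slot your joint network reserves for blank, and the target sequences must not contain that index. A sketch with the PyTorch binding (the keyword name for the blank index may differ between versions of the binding):

import torch
from warprnnt_pytorch import RNNTLoss

B, U, V = 2, 3, 5                                          # illustrative sizes
crit = RNNTLoss(blank=0)                                   # here class 0 is reserved for blank
targets = torch.randint(1, V, (B, U), dtype=torch.int32)   # real labels avoid the blank index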
