Negative loss in test with multiple batches on GPU #86

Open
gandroz opened this issue Jan 28, 2021 · 0 comments

gandroz commented Jan 28, 2021

I got this test error:

FAIL: test_multiple_batches_gpu (__main__.WarpRNNTTest)
WarpRNNTTest.test_multiple_batches_gpu
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 96, in test_multiple_batches_gpu
    self._test_multiple_batches(use_gpu=True)
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 89, in _test_multiple_batches
    self._run_rnnt(acts, labels, input_lengths, label_lengths, expected_costs, expected_grads, 0, use_gpu)
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 31, in _run_rnnt
    self.assertAllClose(tf_costs, expected_costs, atol=1e-6)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 1236, in decorated
    return f(*args, **kwds)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2711, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2665, in _assertAllCloseRecursive
    self._assertArrayLikeAllClose(
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2605, in _assertArrayLikeAllClose
    np.testing.assert_allclose(
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06
Mismatched value: a is different from b. 
not close where = (array([0, 1]),)
not close lhs = [-5.3799906 -5.5812006]
not close rhs = [4.28065 3.93844]
not close dif = [9.660641 9.519641]
not close tol = [5.28065e-06 4.93844e-06]
dtype = float32, shape = (2,)
Mismatched elements: 2 / 2 (100%)
Max absolute difference: 9.660641
Max relative difference: 2.4171095
 x: array([-5.379991, -5.581201], dtype=float32)
 y: array([4.28065, 3.93844], dtype=float32)

This occurs with TensorFlow 2.4.0 and CUDA 11.0. I did not have any issue with tf >=2.3 on CUDA 10.1.
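
For reference, the figures in the assertion output above can be reproduced by hand. The sketch below is plain Python with no TensorFlow dependency. It assumes the comparison rule documented for `numpy.testing.assert_allclose`, `|actual - expected| <= atol + rtol * |expected|`, and copies the values from the output. The helper name `allowed_tolerance` is made up for illustration. It also flags the negative costs, which look wrong independently of the expected values if the cost is taken to be a negative log-likelihood, since that cannot be below zero.

```python
# Values copied from the assertion output above.
actual = [-5.3799906, -5.5812006]   # "not close lhs": costs returned on the GPU
expected = [4.28065, 3.93844]       # "not close rhs": expected costs in the test
rtol = 1e-6
atol = 1e-6


def allowed_tolerance(expected_value, rtol, atol):
    """Per-element bound, assuming the assert_allclose rule atol + rtol * |expected|."""
    return atol + rtol * abs(expected_value)


tolerances = [allowed_tolerance(e, rtol, atol) for e in expected]
differences = [abs(a - e) for a, e in zip(actual, expected)]
within_tolerance = [d <= t for d, t in zip(differences, tolerances)]

# Assumption: the cost is -log(p) with 0 < p <= 1, so it should never be negative.
negative_costs = [a < 0 for a in actual]

for a, e, d, t, ok, neg in zip(
    actual, expected, differences, tolerances, within_tolerance, negative_costs
):
    print(f"actual={a} expected={e} diff={d:.6f} tol={t:.6e} ok={ok} negative={neg}")
```

Both elements are off by more than 9, against a tolerance of about 5e-06, so this is not a rounding issue.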

I adjusted the test to run in non-eager mode as follows:

# import tensorflow as tf
import tensorflow.compat.v1 as tf
import numpy as np
from warprnnt_tensorflow import rnnt_loss
from tensorflow.python.client import device_lib

tf.compat.v1.disable_eager_execution()
[...]

The test passes on CPU but fails on GPU (GTX 1080 Ti).

I also added this to the CMakeLists.txt:

IF (CUDA_VERSION GREATER 10.1)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_80,code=sm_80")
ENDIF()
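
A side note on the flag above: the `-gencode` string is derived from a compute capability, and `compute_80`/`sm_80` corresponds to capability 8.0, while the GTX 1080 Ti reports 6.1. The sketch below only shows how the flag string maps to a capability. `gencode_flag` is a hypothetical helper, not part of warp-transducer or CUDA. I have not checked whether the existing CMakeLists.txt already covers `sm_61`, so this is not a claim about the cause of the failure.

```python
def gencode_flag(compute_capability):
    """Build an nvcc -gencode flag from a (major, minor) compute capability.

    Hypothetical helper for illustration only.
    """
    major, minor = compute_capability
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


# Capability targeted by the flag added to CMakeLists.txt.
ADDED_FLAG_CAPABILITY = (8, 0)
# Compute capability of the GTX 1080 Ti.
GTX_1080_TI_CAPABILITY = (6, 1)

print(gencode_flag(ADDED_FLAG_CAPABILITY))
print(gencode_flag(GTX_1080_TI_CAPABILITY))
```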