Negative loss in test with multiple batches on GPU #86

gandroz · 2021-01-28T17:10:22Z

I got this test failure:

FAIL: test_multiple_batches_gpu (__main__.WarpRNNTTest)
WarpRNNTTest.test_multiple_batches_gpu
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 96, in test_multiple_batches_gpu
    self._test_multiple_batches(use_gpu=True)
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 89, in _test_multiple_batches
    self._run_rnnt(acts, labels, input_lengths, label_lengths, expected_costs, expected_grads, 0, use_gpu)
  File "/home/guillaume/src/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 31, in _run_rnnt
    self.assertAllClose(tf_costs, expected_costs, atol=1e-6)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 1236, in decorated
    return f(*args, **kwds)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2711, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2665, in _assertAllCloseRecursive
    self._assertArrayLikeAllClose(
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2605, in _assertArrayLikeAllClose
    np.testing.assert_allclose(
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/guillaume/miniconda3/envs/stt/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06
Mismatched value: a is different from b. 
not close where = (array([0, 1]),)
not close lhs = [-5.3799906 -5.5812006]
not close rhs = [4.28065 3.93844]
not close dif = [9.660641 9.519641]
not close tol = [5.28065e-06 4.93844e-06]
dtype = float32, shape = (2,)
Mismatched elements: 2 / 2 (100%)
Max absolute difference: 9.660641
Max relative difference: 2.4171095
 x: array([-5.379991, -5.581201], dtype=float32)
 y: array([4.28065, 3.93844], dtype=float32)

This happens with TensorFlow 2.4.0 and CUDA 11.0. I did not have any issue with tf >=2.3 on CUDA 10.1.
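
As a sanity check on the numbers in the assertion message, here is a minimal sketch in plain Python (no TensorFlow needed). It recomputes the tolerance `atol + rtol * |expected|` that `assert_allclose` applies, and it assumes the RNN-T cost is a negative log-likelihood, so a cost `c` corresponds to a probability `exp(-c)`. The variable names are only illustrative.

```python
import math

# Values copied from the assertion message above.
actual = [-5.3799906, -5.5812006]   # costs returned on GPU (lhs)
expected = [4.28065, 3.93844]       # reference costs in the test (rhs)
rtol = atol = 1e-6

# assert_allclose accepts |actual - expected| <= atol + rtol * |expected|.
tolerances = [atol + rtol * abs(e) for e in expected]
differences = [abs(a - e) for a, e in zip(actual, expected)]

# Assumption: cost = -log(probability), so probability = exp(-cost).
implied_probabilities = [math.exp(-a) for a in actual]

for a, d, t, p in zip(actual, differences, tolerances, implied_probabilities):
    print(f"cost={a:.6f} diff={d:.6f} tol={t:.3e} implied_prob={p:.1f}")
```

Under that assumption, a negative cost implies a probability greater than 1, so the GPU values look invalid rather than merely imprecise.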

I adjusted the test to run in non-eager mode as follows:

# import tensorflow as tf
import tensorflow.compat.v1 as tf
import numpy as np
from warprnnt_tensorflow import rnnt_loss
from tensorflow.python.client import device_lib

tf.compat.v1.disable_eager_execution()
[...]

The test passes on CPU but not on GPU (GTX 1080 Ti).

I also added this to CMakeLists.txt:

IF (CUDA_VERSION GREATER 10.1)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_80,code=sm_80")
ENDIF()
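
For context, the flag above targets compute capability 8.0, whereas the GTX 1080 Ti is a compute capability 6.1 (Pascal) device, so this addition may not change what runs on that card. The sketch below shows how the two flags differ. `gencode_flag` is a hypothetical helper written for illustration, not part of warp-transducer or CMake. Whether the build already emits an `sm_61` target is not shown here.

```python
def gencode_flag(compute_capability):
    """Build an nvcc -gencode flag for a (major, minor) compute capability.

    Hypothetical helper for illustration only.
    """
    major, minor = compute_capability
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


# Flag added in the CMakeLists.txt snippet above (compute capability 8.0).
print(gencode_flag((8, 0)))

# Flag matching a GTX 1080 Ti (compute capability 6.1).
print(gencode_flag((6, 1)))
```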
