
RNNT loss in pure TF #95

Merged: 20 commits merged into TensorSpeech:main on Jan 4, 2021

Conversation

@monatis (Contributor) commented Jan 3, 2021

Hi @usimarit
I'm trying to add support for TPU training in this branch. Changes I've made so far:

  • Introduce a pure TensorFlow implementation of the RNNT loss in tensorflow_asr/losses/rnnt_losses.py, used as a fallback when warprnnt_tensorflow cannot be imported.
  • Refactor utils.utils.preprocess_paths to properly handle GCS paths starting with gs://, since TPUs read from and write to GCS buckets (see the sketch after this list).
  • Uncomment utils.setup_tpu and return an instance of TPUStrategy from there.
  • Modify examples/conformer/train_ga_subword_conformer.py to accept arguments regarding TPU training and set up TPUStrategy if --tpu flag is set.
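
Roughly, the GCS path handling and the TPU setup look like this (a simplified sketch of the second and third changes, not the exact code in the branch; the argument names are mine):

```python
import os
import tensorflow as tf

def preprocess_paths(paths):
    """Expand local paths, but pass GCS paths (gs://...) through unchanged."""
    if isinstance(paths, str):
        if paths.startswith("gs://"):
            return paths  # GCS object names must not be run through os.path expansion
        return os.path.abspath(os.path.expanduser(paths))
    return [preprocess_paths(path) for path in paths]

def setup_tpu(tpu_address=None):
    """Connect to the TPU cluster, initialize it, and return a TPUStrategy."""
    # tpu_address may be e.g. 'grpc://10.0.0.1:8470'; None lets TF try to auto-detect.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    print("All TPUs: ", tf.config.list_logical_devices("TPU"))
    return tf.distribute.TPUStrategy(resolver)
```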

So far so good, but when I run the script, it creates the training TFRecord files and uploads them to GCS, and then throws an error saying "Ignoring an error encountered when deleting remote tensors handles". Even though it says it ignores this error, execution halts and it never goes on to create the eval TFRecord files.

Full output is as follows. Do you have any idea what this error means?

python examples/conformer/train_ga_subword_conformer.py --config /content/config.yml --tfrecords --subwords /content/italian.subwords --subwords_corpus /content/mls/mls_italian/train/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_italian/dev/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_italian/test/transcripts_tfasr.tsv --cache --tpu --tbs 2 --ebs 2 --acs 8

2021-01-03 16:20:35.974796: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-03 16:20:35.974846: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-03 16:20:38.145419: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-03 16:20:38.166202: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-03 16:20:38.166255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7c4318eba3d4): /proc/driver/nvidia/version does not exist
2021-01-03 16:20:38.167183: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-03 16:20:38.184788: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.12.113.122:8470}
2021-01-03 16:20:38.184845: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:37602}
2021-01-03 16:20:38.205508: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.12.113.122:8470}
2021-01-03 16:20:38.205585: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:37602}
2021-01-03 16:20:38.206265: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:37602
All TPUs:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Loading subwords ...
(long model definition)
Creating train.tfrecord ...
Reading /content/mls/mls_italian/train/transcripts_tfasr.tsv ...


Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_13.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_16.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_1.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_10.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_6.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_4.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_14.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_3.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_12.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_5.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_15.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_2.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_9.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_8.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_7.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_11.tfrecord
2021-01-03 16:26:39.958150: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 19532, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1609691199.954800734","description":"Error received from peer ipv4:10.12.113.122:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 19532, Output num: 0","grpc_status":3}

@nglehuy (Collaborator) commented Jan 3, 2021

@monatis Thanks a lot for your contribution; I did try to use that rnnt package myself but had no success 😄
I don't know what that error means, but try using scripts/create_tfrecords.py to see whether the error really comes from creating the tfrecords or from distributed training (the training script only creates tfrecords after initializing everything, so it might be a connection issue when using the TPU; note the ipv4 address in the error message).

@monatis (Contributor, Author) commented Jan 3, 2021

@usimarit Thanks! I'll try to debug with scripts/create_tfrecords.py as you said. I'll also try with Cloud TPUs; this was on Colab, and it's possible the issue is related to different TPU versions. I'll mark the PR as ready for review if I can get it running :)

@monatis (Contributor, Author) commented Jan 4, 2021

Hi @usimarit
I managed to fix the pure-TF RNNT loss implementation, and it runs smoothly on GPU now. Then I modified ASRTFRecordDataset to use the tf.io.gfile API so that it plays nicely with GCS paths, and it now succeeds in creating all the TFRecord files. However, I'm now getting another error during the forward pass on TPU, which I'm investigating. A rough sketch of the gfile change and the full output follow:
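
The gfile-based shard writing looks roughly like this (a sketch only; the function and argument names are made up, not the exact ASRTFRecordDataset code):

```python
import os
import tensorflow as tf

def write_tfrecord_shard(shard_path, serialized_examples):
    """Write serialized tf.train.Example protos to a local or gs:// shard."""
    # tf.io.gfile and TFRecordWriter both understand gs:// URIs directly.
    shard_dir = os.path.dirname(shard_path)
    if not tf.io.gfile.exists(shard_dir):
        tf.io.gfile.makedirs(shard_dir)
    with tf.io.TFRecordWriter(shard_path) as writer:
        for example in serialized_examples:
            writer.write(example)
    print(f"Created {shard_path}")
```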

!python examples/conformer/train_ga_subword_conformer.py --config /content/config.yml --tfrecords --subwords /content/polish.subwords --subwords_corpus /content/mls/mls_polish/train/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/dev/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/test/transcripts_tfasr.tsv --cache --tpu --tbs 2 --ebs 2 --acs 8
2021-01-04 10:57:47.155088: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-01-04 10:57:49.110004: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.110913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-04 10:57:49.118857: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-04 10:57:49.118894: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (470dc3006e93): /proc/driver/nvidia/version does not exist
2021-01-04 10:57:49.120990: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.129478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.129511: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.146715: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.146760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.147117: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:33578
All TPUs:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Loading subwords ...
(long model definition)
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] |                    | 0/? [00:00<?, ?batch/s]Traceback (most recent call last):
  File "examples/conformer/train_ga_subword_conformer.py", line 153, in <module>
    train_bs=args.tbs, eval_bs=args.ebs, train_acs=args.acs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 312, in fit
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 192, in run
    self._train_epoch()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 213, in _train_epoch
    raise e
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 207, in _train_epoch
    self._train_function(train_iterator)  # Run train step
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
    return wrapped_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:98 _train_function  *
        self.strategy.run(self._train_step, args=(batch,))
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:112 _train_step  *
        logits = self.model([features, input_length, prediction, prediction_length], training=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:284 call  *
        pred = self.predict_net([prediction, prediction_length], training=training, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:101 call  *
        outputs = rnn["rnn"](outputs, training=training, mask=mask)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent.py:660 __call__  **
        return super(RNN, self).__call__(inputs, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:1012 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1270 call
        runtime) = lstm_with_backend_selection(**normal_lstm_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1655 lstm_with_backend_selection
        last_output, outputs, new_h, new_c, runtime = defun_standard_lstm(**params)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2941 __call__
        filtered_flat_args) = self._maybe_define_function(args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3361 _maybe_define_function
        graph_function = self._create_graph_function(args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3206 _create_graph_function
        capture_by_value=self._capture_by_value),
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py:990 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1402 standard_lstm
        zero_output_for_mask=zero_output_for_mask)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:4364 rnn
        max_iterations = math_ops.reduce_max(input_length)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:2746 reduce_max
        _ReductionDims(input_tensor, axis))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:1907 _ReductionDims
        return range(0, array_ops.rank(x))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:837 rank
        return rank_internal(input, name, optimize=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:857 rank_internal
        input = ops.convert_to_tensor(input)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/profiler/trace.py:163 wrapped
        return func(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:1540 convert_to_tensor
        ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:339 _constant_tensor_conversion_function
        return constant(v, dtype=dtype, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:265 constant
        allow_broadcast=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:283 _constant_impl
        allow_broadcast=allow_broadcast))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
        raise ValueError("None values not supported.")

    ValueError: None values not supported.

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis It seems like a problem with LSTM on TPU.

@monatis (Contributor, Author) commented Jan 4, 2021

@usimarit Yeah, I think so. But this Colab successfully uses an LSTM with the regular Keras .fit() method, so I think the problem is related to the custom training loop.
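
For reference, the pattern that works in that Colab is roughly the standard Keras path (a simplified sketch with made-up shapes, not the actual notebook):

```python
import numpy as np
import tensorflow as tf

# Standard TPU initialization, then let Keras handle distribution via .fit().
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(50, 80)),  # fixed-length dummy sequences
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data just to exercise the .fit() path.
x = np.random.rand(64, 50, 80).astype("float32")
y = np.random.randint(0, 10, size=(64,)).astype("int32")
model.fit(x, y, batch_size=8, epochs=1)
```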

@monatis (Contributor, Author) commented Jan 4, 2021

My understanding is that TPUs require RNNs to be unrolled, and thus a fixed number of time steps. So I'm postponing this PR until I figure out how to run RNNs on TPU. However, the RNNT loss in pure TF runs smoothly on GPU, and I can prepare a clean PR for that if you want to merge it. It could be useful at least for serverless cloud GPUs like AI Platform.
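
What I mean by unrolling is roughly this (a simplified sketch; with unroll=True Keras expands the recurrence into a fixed number of cell copies at graph build time, so the time dimension cannot be None):

```python
import tensorflow as tf

MAX_TIME_STEPS = 50  # must be known statically when unroll=True
NUM_FEATURES = 80

inputs = tf.keras.layers.Input(shape=(MAX_TIME_STEPS, NUM_FEATURES))
outputs = tf.keras.layers.LSTM(320, return_sequences=True, unroll=True)(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```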

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis Yes, until we figure out how to run RNNs on TPU, this pull request should change to the RNNT loss in pure TF. Thank you for such great work 😄

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis Since, as you said, it requires a fixed number of time steps, we can find the maximum number of time steps in the whole dataset, configure the dataset map function to zero-pad shorter features to that maximum size, and build the model with that maximum number of time steps.
If you try this approach, you should create a separate ASRDataset class with a function that finds the maximum number of time steps in the dataset and a parse function updated to pad the features, and create separate training scripts (for example train_tpu_subword_conformer.py) instead of reusing train_subword_conformer.py.
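
Something like this, just as a rough sketch (the names and the helper are made up, not the eventual implementation):

```python
import tensorflow as tf

MAX_TIME_STEPS = 1870  # hypothetical global maximum, found by scanning the dataset once
NUM_FEATURES = 80

def find_max_time_steps(dataset):
    """Scan an eager tf.data.Dataset of (features, label) pairs and return the longest length."""
    max_len = 0
    for features, _ in dataset:
        max_len = max(max_len, int(tf.shape(features)[0]))
    return max_len

def pad_to_max(features, label):
    """Zero-pad the time axis so every example has exactly MAX_TIME_STEPS frames."""
    pad_amount = MAX_TIME_STEPS - tf.shape(features)[0]
    features = tf.pad(features, [[0, pad_amount], [0, 0]])
    features.set_shape([MAX_TIME_STEPS, NUM_FEATURES])
    return features, label

# dataset = dataset.map(pad_to_max, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```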

@monatis changed the title from "Add support for TPU training" to "RNNT loss in pure TF" on Jan 4, 2021
@monatis marked this pull request as ready for review on January 4, 2021 at 16:36
@monatis (Contributor, Author) commented Jan 4, 2021

@usimarit Totally makes sense. I'll try a separate dataset class that pads to the global maximum length as you said.
The PR is now ready for the RNNT loss in pure TF. You can test it on this Colab, and you can also link to it anywhere if necessary.
Thanks 😊

@nglehuy self-requested a review on January 4, 2021 at 16:42
@nglehuy merged commit 1724423 into TensorSpeech:main on Jan 4, 2021