
RNNT loss in pure TF #95

Merged: 20 commits merged into TensorSpeech:main on Jan 4, 2021

Conversation

@monatis (Contributor) commented Jan 3, 2021

Hi @usimarit
I'm trying to add support for TPU training in this branch. Changes I've made so far:

  • Introduce a pure TensorFlow implementation of the RNNT loss in tensorflow_asr/losses/rnnt_losses.py, used as a fallback when warprnnt_tensorflow cannot be imported.
  • Refactor utils.utils.preprocess_paths to properly handle GCS paths starting with gs://, since TPUs read from and write to GCS buckets (see the sketch after this list).
  • Uncomment utils.setup_tpu and return an instance of TPUStrategy from there.
  • Modify examples/conformer/train_ga_subword_conformer.py to accept arguments regarding TPU training and set up TPUStrategy if --tpu flag is set.
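
Roughly, the GCS path handling and the TPU setup look like this (a simplified sketch of the second and third changes, not the exact code in the branch; the argument names are mine):

```python
import os
import tensorflow as tf

def preprocess_paths(paths):
    """Expand local paths, but pass GCS paths (gs://...) through unchanged."""
    if isinstance(paths, str):
        if paths.startswith("gs://"):
            return paths  # GCS object names must not be run through os.path expansion
        return os.path.abspath(os.path.expanduser(paths))
    return [preprocess_paths(path) for path in paths]

def setup_tpu(tpu_address=None):
    """Connect to the TPU cluster, initialize it, and return a TPUStrategy."""
    # tpu_address may be e.g. 'grpc://10.0.0.1:8470'; None lets TF try to auto-detect.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    print("All TPUs: ", tf.config.list_logical_devices("TPU"))
    return tf.distribute.TPUStrategy(resolver)
```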

So far so good, but when I run the script, it creates the training TFRecord files and uploads them to GCS, and then throws an error saying "Ignoring an error encountered when deleting remote tensors handles". Even though it says it ignores this error, execution halts and it never goes on to create the eval TFRecord files.

Full output is as follows. Do you have any idea what this error means?

python examples/conformer/train_ga_subword_conformer.py --config /content/config.yml --tfrecords --subwords /content/italian.subwords --subwords_corpus /content/mls/mls_italian/train/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_italian/dev/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_italian/test/transcripts_tfasr.tsv --cache --tpu --tbs 2 --ebs 2 --acs 8

2021-01-03 16:20:35.974796: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-03 16:20:35.974846: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-03 16:20:38.145419: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-03 16:20:38.166202: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-03 16:20:38.166255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7c4318eba3d4): /proc/driver/nvidia/version does not exist
2021-01-03 16:20:38.167183: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-03 16:20:38.184788: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.12.113.122:8470}
2021-01-03 16:20:38.184845: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:37602}
2021-01-03 16:20:38.205508: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.12.113.122:8470}
2021-01-03 16:20:38.205585: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:37602}
2021-01-03 16:20:38.206265: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:37602
All TPUs:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Loading subwords ...
(long model definition)
Creating train.tfrecord ...
Reading /content/mls/mls_italian/train/transcripts_tfasr.tsv ...


Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_13.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_16.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_1.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_10.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_6.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_4.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_14.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_3.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_12.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_5.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_15.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_2.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_9.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_8.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_7.tfrecord

Created gs://ailabscomtrtpu/mls/mls_italian/tfrecords/train_11.tfrecord
2021-01-03 16:26:39.958150: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 19532, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1609691199.954800734","description":"Error received from peer ipv4:10.12.113.122:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 19532, Output num: 0","grpc_status":3}

@nglehuy (Collaborator) commented Jan 3, 2021

@monatis Thanks a lot for your contribution; I did try to use that rnnt package myself but had no success 😄
I don't know what that error means, but try using scripts/create_tfrecords.py to see whether the error really comes from creating the tfrecords or from distributed training (the training script only creates tfrecords after initializing everything, so it might be a connection issue when using the TPU; note the ipv4 address in the error message).

@monatis (Contributor, Author) commented Jan 3, 2021

@usimarit Thanks! I'll try to debug with scripts/create_tfrecords.py as you said. I'll also try with Cloud TPUs; this was on Colab, and it's possible the issue is related to different TPU versions. I'll mark the PR as ready for review if I can get it running :)

@monatis (Contributor, Author) commented Jan 4, 2021

Hi @usimarit
I managed to fix the pure-TF RNNT loss implementation, and it runs smoothly on GPU now. Then I modified ASRTFRecordDataset to use the tf.io.gfile API so that it plays nicely with GCS paths, and it now succeeds in creating all the TFRecord files. However, I'm now getting another error during the forward pass on TPU, which I'm investigating. A rough sketch of the gfile change and the full output follow:
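
The gfile-based shard writing looks roughly like this (a sketch only; the function and argument names are made up, not the exact ASRTFRecordDataset code):

```python
import os
import tensorflow as tf

def write_tfrecord_shard(shard_path, serialized_examples):
    """Write serialized tf.train.Example protos to a local or gs:// shard."""
    # tf.io.gfile and TFRecordWriter both understand gs:// URIs directly.
    shard_dir = os.path.dirname(shard_path)
    if not tf.io.gfile.exists(shard_dir):
        tf.io.gfile.makedirs(shard_dir)
    with tf.io.TFRecordWriter(shard_path) as writer:
        for example in serialized_examples:
            writer.write(example)
    print(f"Created {shard_path}")
```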

!python examples/conformer/train_ga_subword_conformer.py --config /content/config.yml --tfrecords --subwords /content/polish.subwords --subwords_corpus /content/mls/mls_polish/train/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/dev/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/test/transcripts_tfasr.tsv --cache --tpu --tbs 2 --ebs 2 --acs 8
2021-01-04 10:57:47.155088: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-01-04 10:57:49.110004: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.110913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-04 10:57:49.118857: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-04 10:57:49.118894: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (470dc3006e93): /proc/driver/nvidia/version does not exist
2021-01-04 10:57:49.120990: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.129478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.129511: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.146715: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.146760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.147117: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:33578
All TPUs:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Loading subwords ...
(long model definition)
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] |                    | 0/? [00:00<?, ?batch/s]Traceback (most recent call last):
  File "examples/conformer/train_ga_subword_conformer.py", line 153, in <module>
    train_bs=args.tbs, eval_bs=args.ebs, train_acs=args.acs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 312, in fit
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 192, in run
    self._train_epoch()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 213, in _train_epoch
    raise e
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 207, in _train_epoch
    self._train_function(train_iterator)  # Run train step
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
    return wrapped_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:98 _train_function  *
        self.strategy.run(self._train_step, args=(batch,))
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:112 _train_step  *
        logits = self.model([features, input_length, prediction, prediction_length], training=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:284 call  *
        pred = self.predict_net([prediction, prediction_length], training=training, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:101 call  *
        outputs = rnn["rnn"](outputs, training=training, mask=mask)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent.py:660 __call__  **
        return super(RNN, self).__call__(inputs, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:1012 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1270 call
        runtime) = lstm_with_backend_selection(**normal_lstm_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1655 lstm_with_backend_selection
        last_output, outputs, new_h, new_c, runtime = defun_standard_lstm(**params)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2941 __call__
        filtered_flat_args) = self._maybe_define_function(args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3361 _maybe_define_function
        graph_function = self._create_graph_function(args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3206 _create_graph_function
        capture_by_value=self._capture_by_value),
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py:990 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1402 standard_lstm
        zero_output_for_mask=zero_output_for_mask)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:4364 rnn
        max_iterations = math_ops.reduce_max(input_length)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:2746 reduce_max
        _ReductionDims(input_tensor, axis))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:1907 _ReductionDims
        return range(0, array_ops.rank(x))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:837 rank
        return rank_internal(input, name, optimize=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:857 rank_internal
        input = ops.convert_to_tensor(input)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/profiler/trace.py:163 wrapped
        return func(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:1540 convert_to_tensor
        ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:339 _constant_tensor_conversion_function
        return constant(v, dtype=dtype, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:265 constant
        allow_broadcast=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:283 _constant_impl
        allow_broadcast=allow_broadcast))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
        raise ValueError("None values not supported.")

    ValueError: None values not supported.

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis It seems like a problem with LSTM on TPU.

@monatis (Contributor, Author) commented Jan 4, 2021

@usimarit Yeah, I think so. But this Colab successfully uses an LSTM with the regular Keras .fit() method, so I think the problem is related to the custom training loop.
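
For reference, the pattern that works in that Colab is roughly the standard Keras path (a simplified sketch with made-up shapes, not the actual notebook):

```python
import numpy as np
import tensorflow as tf

# Standard TPU initialization, then let Keras handle distribution via .fit().
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(50, 80)),  # fixed-length dummy sequences
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data just to exercise the .fit() path.
x = np.random.rand(64, 50, 80).astype("float32")
y = np.random.randint(0, 10, size=(64,)).astype("int32")
model.fit(x, y, batch_size=8, epochs=1)
```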

@monatis (Contributor, Author) commented Jan 4, 2021

My understanding is that TPUs require RNNs to be unrolled, and thus a fixed number of time steps. So I'm postponing this PR until I figure out how to run RNNs on TPU. However, the RNNT loss in pure TF runs smoothly on GPU, and I can prepare a clean PR for that if you want to merge it. It could be useful at least for serverless cloud GPUs like AI Platform.
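
What I mean by unrolling is roughly this (a simplified sketch; with unroll=True Keras expands the recurrence into a fixed number of cell copies at graph build time, so the time dimension cannot be None):

```python
import tensorflow as tf

MAX_TIME_STEPS = 50  # must be known statically when unroll=True
NUM_FEATURES = 80

inputs = tf.keras.layers.Input(shape=(MAX_TIME_STEPS, NUM_FEATURES))
outputs = tf.keras.layers.LSTM(320, return_sequences=True, unroll=True)(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```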

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis Yes, until we figure out how to run RNNs on TPU, this pull request should change to the RNNT loss in pure TF. Thank you for such great work 😄

@nglehuy (Collaborator) commented Jan 4, 2021

@monatis Since, as you said, it requires a fixed number of time steps, we can find the maximum number of time steps in the whole dataset, configure the dataset map function to zero-pad shorter features to that maximum size, and build the model with that maximum number of time steps.
If you try this approach, you should create a separate ASRDataset class with a function that finds the maximum number of time steps in the dataset and a parse function updated to pad the features, and create separate training scripts (for example train_tpu_subword_conformer.py) instead of reusing train_subword_conformer.py.
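
Something like this, just as a rough sketch (the names and the helper are made up, not the eventual implementation):

```python
import tensorflow as tf

MAX_TIME_STEPS = 1870  # hypothetical global maximum, found by scanning the dataset once
NUM_FEATURES = 80

def find_max_time_steps(dataset):
    """Scan an eager tf.data.Dataset of (features, label) pairs and return the longest length."""
    max_len = 0
    for features, _ in dataset:
        max_len = max(max_len, int(tf.shape(features)[0]))
    return max_len

def pad_to_max(features, label):
    """Zero-pad the time axis so every example has exactly MAX_TIME_STEPS frames."""
    pad_amount = MAX_TIME_STEPS - tf.shape(features)[0]
    features = tf.pad(features, [[0, pad_amount], [0, 0]])
    features.set_shape([MAX_TIME_STEPS, NUM_FEATURES])
    return features, label

# dataset = dataset.map(pad_to_max, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```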

@monatis changed the title from "Add support for TPU training" to "RNNT loss in pure TF" on Jan 4, 2021
@monatis marked this pull request as ready for review on January 4, 2021 at 16:36
@monatis (Contributor, Author) commented Jan 4, 2021

@usimarit Totally makes sense. I'll try a separate dataset class that pads to the global maximum length as you said.
The PR is now ready for the RNNT loss in pure TF. You can test it on this Colab, and you can also link to it anywhere if necessary.
Thanks 😊

@nglehuy self-requested a review on January 4, 2021 at 16:42
@nglehuy merged commit 1724423 into TensorSpeech:main on Jan 4, 2021