RNNT loss in pure TF #95
Conversation
@monatis Thanks a lot for your contribution. I did try to use that rnnt package but had no success 😄
@usimarit Thanks! I'll try to debug with
Hi @usimarit
!python examples/conformer/train_ga_subword_conformer.py --config /content/config.yml --tfrecords --subwords /content/polish.subwords --subwords_corpus /content/mls/mls_polish/train/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/dev/transcripts_tfasr.tsv --subwords_corpus /content/mls/mls_polish/test/transcripts_tfasr.tsv --cache --tpu --tbs 2 --ebs 2 --acs 8
2021-01-04 10:57:47.155088: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-01-04 10:57:49.110004: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.110913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-04 10:57:49.118857: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-04 10:57:49.118894: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (470dc3006e93): /proc/driver/nvidia/version does not exist
2021-01-04 10:57:49.120990: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-04 10:57:49.129478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.129511: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.146715: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.141.226:8470}
2021-01-04 10:57:49.146760: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33578}
2021-01-04 10:57:49.147117: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:33578
All TPUs: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Loading subwords ...
(long model definition)
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] | | 0/? [00:00<?, ?batch/s]
Traceback (most recent call last):
File "examples/conformer/train_ga_subword_conformer.py", line 153, in <module>
train_bs=args.tbs, eval_bs=args.ebs, train_acs=args.acs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 312, in fit
self.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 192, in run
self._train_epoch()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 213, in _train_epoch
raise e
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 207, in _train_epoch
self._train_function(train_iterator) # Run train step
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
return wrapped_fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:98 _train_function *
self.strategy.run(self._train_step, args=(batch,))
/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/transducer_runners.py:112 _train_step *
logits = self.model([features, input_length, prediction, prediction_length], training=True)
/usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:284 call *
pred = self.predict_net([prediction, prediction_length], training=training, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow_asr/models/transducer.py:101 call *
outputs = rnn["rnn"](outputs, training=training, mask=mask)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent.py:660 __call__ **
return super(RNN, self).__call__(inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:1012 __call__
outputs = call_fn(inputs, *args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1270 call
runtime) = lstm_with_backend_selection(**normal_lstm_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1655 lstm_with_backend_selection
last_output, outputs, new_h, new_c, runtime = defun_standard_lstm(**params)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2941 __call__
filtered_flat_args) = self._maybe_define_function(args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3361 _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:3206 _create_graph_function
capture_by_value=self._capture_by_value),
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py:990 func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/recurrent_v2.py:1402 standard_lstm
zero_output_for_mask=zero_output_for_mask)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:4364 rnn
max_iterations = math_ops.reduce_max(input_length)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:2746 reduce_max
_ReductionDims(input_tensor, axis))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:1907 _ReductionDims
return range(0, array_ops.rank(x))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:837 rank
return rank_internal(input, name, optimize=True)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py:857 rank_internal
input = ops.convert_to_tensor(input)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/profiler/trace.py:163 wrapped
return func(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:1540 convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:339 _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:265 constant
allow_broadcast=True)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:283 _constant_impl
allow_broadcast=allow_broadcast))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
@monatis It seems like a problem with LSTM on TPU.
@usimarit Yeah, I think so. But this colab successfully uses LSTM with regular Keras.
My understanding is that TPU requires RNNs with unrolling, and thus a fixed number of time steps. So I'm postponing this PR until I figure out how to run RNNs on TPU. However, RNNT loss in pure TF runs smoothly on GPU, and I can prepare a clean PR for that one if you want to merge it. It can be useful for serverless cloud GPUs like AI Platform at least.
@monatis Yes, until we figure out how to run RNNs on TPU, this pull request should change to RNNT loss in pure TF. Thank you for such great work 😄
@monatis Since, as you said, it requires a fixed number of time steps, we can find the maximum number of time steps in the whole dataset, configure the dataset map function to zero-pad shorter features to that maximum, and build the model with that maximum number of time steps.
@usimarit Totally makes sense. I'll try with a separate dataset that pads to the global maximum length, as you said.
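For illustration, a minimal sketch of that padding step as a `tf.data` map function, assuming rank-3 features of shape `[T, num_feature_bins, 1]` and a hypothetical `MAX_TIME_STEPS` constant; this is not the repository's actual dataset code:

```python
import tensorflow as tf

# Hypothetical: the longest feature sequence, found by scanning the dataset once.
MAX_TIME_STEPS = 1650

def pad_to_max(features, input_length, prediction, prediction_length):
    # Zero-pad the time axis so every example has the same static length,
    # giving the TPU-compiled RNN a fixed number of time steps.
    pad_len = MAX_TIME_STEPS - tf.shape(features)[0]
    features = tf.pad(features, [[0, pad_len], [0, 0], [0, 0]])
    # Pin the static time dimension; the feature dims stay dynamic here.
    features.set_shape([MAX_TIME_STEPS, None, None])
    return features, input_length, prediction, prediction_length

# dataset = dataset.map(pad_to_max, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```

Note that `input_length` is passed through unchanged, so the loss can still mask the padded frames.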
Hi @usimarit
I'm trying to add support for TPU training in this branch. Changes I've made so far:

- Implemented RNNT loss in pure TF, used as a fallback when `warprnnt_tensorflow` cannot be imported, in `tensorflow_asr/losses/rnnt_losses.py` (see the first sketch after this comment).
- Updated `utils.utils.preprocess_paths` to properly handle GCS paths starting with `gs://`, since TPUs can read from and write to GCS buckets (second sketch below).
- Added `utils.setup_tpu` to initialize the TPU and return an instance of `TPUStrategy` from there (third sketch below).
- Updated `examples/conformer/train_ga_subword_conformer.py` to accept arguments regarding TPU training and set up `TPUStrategy` if the `--tpu` flag is set.

So far so good, but when I run the script, it creates the training tfrecord files and uploads them to GCS, then throws an error saying `Ignoring an error encountered when deleting remote tensors handles`. Even though it claims to ignore this error, execution halts and it does not continue to create the eval tfrecord files. Full output is as follows. Do you have any idea what this error means?
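The fallback in the first item would presumably follow the usual optional-import pattern; a rough sketch, where `rnnt_loss_tf` stands in for the pure-TF implementation and the exact names and argument order in `rnnt_losses.py` may differ:

```python
try:
    from warprnnt_tensorflow import rnnt_loss as warp_rnnt_loss
    USE_WARPRNNT = True
except ImportError:
    print("Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow")
    USE_WARPRNNT = False

def rnnt_loss(logits, labels, label_length, logit_length, blank=0):
    # Prefer the compiled warp-rnnt kernel when it is installed; otherwise
    # fall back to the pure-TF implementation, which also runs on GPU.
    if USE_WARPRNNT:
        return warp_rnnt_loss(logits, labels, logit_length, label_length, blank)
    return rnnt_loss_tf(logits, labels, label_length, logit_length)  # hypothetical pure-TF version
```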
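The `gs://` handling in the second item presumably just exempts GCS URIs from local-path normalization, since `os.path.abspath` would mangle the scheme; a hedged sketch:

```python
import os

def preprocess_paths(paths):
    # Pass GCS URIs through untouched; expand and absolutize local paths.
    if isinstance(paths, list):
        return [preprocess_paths(p) for p in paths]
    if paths.startswith("gs://"):
        return paths
    return os.path.abspath(os.path.expanduser(paths))
```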
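And the third item's helper, in the shape TPU initialization typically takes in TF 2.x (the `tpu_address` parameter is illustrative, and newer releases expose `tf.distribute.TPUStrategy` directly):

```python
import tensorflow as tf

def setup_tpu(tpu_address=None):
    # Resolve the TPU cluster, connect, and initialize the TPU system;
    # the returned strategy runs train steps on all cores via strategy.run(...).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    print("All TPUs:", tf.config.list_logical_devices("TPU"))
    return tf.distribute.experimental.TPUStrategy(resolver)
```

This matches the "All TPUs: [...]" line in the log above.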