
Triangular2/Exponential cyclical learning rates do not work when logging with Tensorboard #2593

Open
varun-parthasarathy opened this issue Oct 31, 2021 · 4 comments


varun-parthasarathy commented Oct 31, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 8/Windows 11
  • TensorFlow version and how it was installed (source or binary): 2.6.0 binary
  • TensorFlow-Addons version and how it was installed (source or binary): 0.14.0 binary
  • Python version: 3.8.6 (Linux)/3.9.7 (Windows)
  • Is GPU used? (yes/no): yes

Describe the bug

When training a model with the Triangular2 cyclical learning rate policy (scale_mode='iterations' and step_size equal to the number of steps per epoch), including a TensorBoard callback causes training to stop after one epoch with the following error -

Traceback (most recent call last):
  File "F:\error_example.py", line 49, in <module>
    model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])
  File "C:\Users\varun\mlenv\lib\site-packages\keras\engine\training.py", line 1230, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 413, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2444, in on_epoch_end
    self._log_epoch_metrics(epoch, logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2492, in _log_epoch_metrics
    train_logs = self._collect_learning_rate(train_logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2471, in _collect_learning_rate
    logs['learning_rate'] = lr_schedule(self.model.optimizer.iterations)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 102, in __call__
    ) * tf.maximum(tf.cast(0, dtype), (1 - x)) * self.scale_fn(mode_step)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 238, in <lambda>
    scale_fn=lambda x: 1 / (2.0 ** (x - 1)),
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1399, in r_binary_op_wrapper
    y, x = maybe_promote_tensors(y, x)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1335, in maybe_promote_tensors
    ops.convert_to_tensor(tensor, dtype, name="x"))
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\profiler\trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
TypeError: Cannot convert 2.0 to EagerTensor of dtype int64

The error occurs regardless of whether mixed precision is used, and I was able to reproduce it on both Linux and Windows. Using the nightly build of tensorflow-addons did not help, nor did defining my own lambda as the scale_fn with the base CyclicalLearningRate class.

The error is also confirmed to occur with the ExponentialCyclicalLearningRate policy; only TriangularCyclicalLearningRate runs without errors of any sort.

The error only occurs when logging training with the TensorBoard callback; if the callback is not included, training proceeds without any issues. The problem occurs regardless of the write_graph setting.
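
From the traceback, the failure appears to come from the Triangular2 scale_fn (lambda x: 1 / (2.0 ** (x - 1))) receiving the optimizer's int64 iterations tensor when the TensorBoard callback queries the schedule. A minimal sketch of the suspected dtype mismatch (my own illustration, not code from tensorflow-addons) -

import tensorflow as tf

# The TensorBoard callback calls the schedule with optimizer.iterations,
# an int64 variable, so scale_fn receives an int64 tensor.
step = tf.constant(400, dtype=tf.int64)

# The float literal 2.0 cannot be promoted to int64, which raises:
# TypeError: Cannot convert 2.0 to EagerTensor of dtype int64
scale = 1 / (2.0 ** (step - 1))

This would also explain why TriangularCyclicalLearningRate is unaffected: its scale_fn (lambda x: 1.0) never combines a float literal with the step.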

Code to reproduce the issue
The following code reproduces the error -

import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow.keras import mixed_precision

#policy = mixed_precision.Policy('mixed_float16')
#mixed_precision.set_global_policy(policy)

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
def normalize_img(image, label):
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

lr = tfa.optimizers.Triangular2CyclicalLearningRate(initial_learning_rate=0.001,
                                                    maximal_learning_rate=0.1,
                                                    step_size=200,
                                                    scale_mode='iterations')
log_dir = './logs/log_now'
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq=100,
                                                      write_graph=False)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, dtype=tf.float32)
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(lr),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])

This is a very strange bug; it would be great if there were a workaround of some kind that does not prevent generation of logs.
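
One possible workaround sketch, assuming the root cause is the uncast int64 step reaching scale_fn: build the schedule from the base tfa.optimizers.CyclicalLearningRate class with a scale_fn that casts its argument before exponentiating. (I tried a custom lambda earlier without success, so this may only help if the cast is what was missing.)

import tensorflow as tf
import tensorflow_addons as tfa

# Same policy as Triangular2, but casting inside scale_fn keeps the
# float literal 2.0 from being promoted to the step's int64 dtype.
lr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=0.001,
    maximal_learning_rate=0.1,
    step_size=200,
    scale_fn=lambda x: 1.0 / (2.0 ** (tf.cast(x, tf.float32) - 1.0)),
    scale_mode='iterations',
)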

@varun-parthasarathy varun-parthasarathy changed the title Triangular2 cyclical learning rate does not work when logging with Tensorboard Triangular2 and Exponential cyclical learning rates do not work when logging with Tensorboard Oct 31, 2021
@varun-parthasarathy varun-parthasarathy changed the title Triangular2 and Exponential cyclical learning rates do not work when logging with Tensorboard Triangular2/Exponential cyclical learning rates do not work when logging with Tensorboard Oct 31, 2021
@vulkomilev

Okay, I can reproduce the bug. I will look at it.

@vulkomilev

Okay, I have fixed it. I just need to merge the solution.

@varun-parthasarathy
Author

@vulkomilev can I ask why this was occurring, and how you fixed it?

@romanovzky

I have this problem as well with ExponentialCyclicalLearningRate; will there be a fix? This is incredibly off-putting when running several experiments to try out optimiser and learning rate details...
