Unable to use "mixed_float16" in Object detect API #11215

tq3940 · 2024-06-02T07:41:09Z

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[√] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[√] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[√] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

I'm trying to use "mixed_float16" to speed up training on an RTX 4090. Following the official mixed precision guide, I added mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in my train_tf2.py. However, TensorFlow reported the following error:

        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.
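The error above can be reproduced in isolation. The following is a minimal sketch, not code from the Object Detection API: the tensors `x` and `y` are made-up stand-ins for a float32 term and a float16 term, and it assumes (from the error message only) that one operand of the multiplication stayed in float32 while the other was produced in float16 under the mixed precision policy. TensorFlow does not promote dtypes implicitly by default, so the multiplication fails until one side is cast.

```python
import tensorflow as tf

# Hypothetical stand-ins: x plays the float32 operand, y the float16 operand.
x = tf.constant([0.25, 0.5], dtype=tf.float32)
y = tf.constant([0.5, 0.5], dtype=tf.float16)

try:
    _ = x * y
    mismatch_raised = False
except (TypeError, tf.errors.InvalidArgumentError):
    # Graph mode raises TypeError; eager mode raises InvalidArgumentError.
    mismatch_raised = True

# Casting one operand to the other's dtype removes the mismatch.
z = x * tf.cast(y, x.dtype)
print(mismatch_raised, z.dtype)
```

Casting to float32 rather than float16 is a deliberate choice here: loss terms involving `log` and `pow` are usually kept in float32 for numerical stability.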

I also tried adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16'), which I adapted from the tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_bfloat16') call found in model_lib_v2.py,

or setting the environment variable with os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1', as suggested in this answer.

But all of my attempts failed with the error above. How can I solve this issue?

3. Steps to reproduce

Add mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in train_tf2.py.
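For reference, the policy call on its own behaves as sketched below. This is a standalone illustration outside the Object Detection API, using the public `tf.keras.mixed_precision` functions; the `Dense` layer and its input shape are arbitrary choices made only to show the resulting dtypes.

```python
import tensorflow as tf

# Set the global policy before any Keras layers or models are constructed.
tf.keras.mixed_precision.set_global_policy('mixed_float16')
policy = tf.keras.mixed_precision.global_policy()

# Layers built after this point compute in float16 but keep float32 variables.
layer = tf.keras.layers.Dense(4)
outputs = layer(tf.zeros([1, 8]))

compute_dtype = policy.compute_dtype    # 'float16'
variable_dtype = policy.variable_dtype  # 'float32'
output_dtype = outputs.dtype            # float16

# Restore the default policy so later code is unaffected.
tf.keras.mixed_precision.set_global_policy('float32')
```

Because layer outputs become float16 while labels and other inputs typically remain float32, any loss code that mixes the two without casting can hit the dtype mismatch reported in this issue.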

4. Expected behavior

The model can be trained in "mixed precision" mode.

5. Additional context

None

6. System information

OS Platform and Distribution : Linux Ubuntu 22.04
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 2.13.1
Python version: 3.8.10
CUDA/cuDNN version: 12.2(cuda) / 8.6.0.163(cudnn)
GPU model and memory: NVIDIA GeForce RTX 4090 24G


google-ml-butler · 2024-06-02T07:41:33Z

Are you satisfied with the resolution of your issue?
Yes
No

tq3940 · 2024-06-02T08:35:42Z

I am training the pre-trained model: centernet_hg104_512x512_coco17_tpu-8

tq3940 · 2024-06-03T12:54:05Z

I repeated my first attempt, this time adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16') inside train_loop() in model_lib_v2.py, and got the following log, which suggests mixed precision was enabled successfully:

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9
I0603 20:37:33.016378 140087643665600 device_compatibility_check.py:130] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9

HOWEVER! The same error appeared again!!

Traceback (most recent call last):
  File "scripts/model_main_tf2.py", line 133, in <module>
    tf.compat.v1.app.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "scripts/model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 609, in train_loop
    load_fine_tune_checkpoint(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    strategy.run(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3250, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 696, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 84, in call_for_each_replica
    return wrapped(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_file7sbf228o.py", line 14, in tf__wrapped_fn
    retval_ = ag__.converted_call(ag__.ld(call_for_each_replica), (ag__.ld(strategy), ag__.ld(fn).python_function, ag__.ld(args), ag__.ld(kwargs)), None, fscope)
  File "/tmp/__autograph_generated_file3vmq98ob.py", line 18, in tf___dummy_computation_fn
    retval_ = ag__.converted_call(ag__.ld(_compute_losses_and_predictions_dicts), (ag__.ld(model), ag__.ld(features), ag__.ld(labels)), dict(training_step=0), fscope)
  File "/tmp/__autograph_generated_filen7syj8px.py", line 16, in tf___compute_losses_and_predictions_dicts
    losses_dict = ag__.converted_call(ag__.ld(model).loss, (ag__.ld(prediction_dict), ag__.ld(features)[ag__.ld(fields).InputDataFields.true_image_shape]), None, fscope)
  File "/tmp/__autograph_generated_filekxlnseiw.py", line 16, in tf__loss
    object_center_loss = ag__.converted_call(ag__.ld(self)._compute_object_center_loss, (), dict(object_center_predictions=ag__.ld(prediction_dict)[ag__.ld(OBJECT_CENTER)], input_height=ag__.ld(input_height), input_width=ag__.ld(input_width), per_pixel_weights=ag__.ld(valid_anchor_weights), maximum_normalized_coordinate=ag__.ld(maximum_normalized_coordinate)), fscope)
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 77, in tf___compute_object_center_loss
    ag__.for_stmt(ag__.ld(object_center_predictions), None, loop_body, get_state_2, set_state_2, ('loss',), {'iterate_names': 'pred'})
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 75, in loop_body
    loss += ag__.converted_call(object_center_loss, (pred, flattened_heatmap_targets), dict(weights=per_pixel_weights), fscope)
  File "/tmp/__autograph_generated_filelc2if2ji.py", line 69, in tf____call__
    retval_ = ag__.converted_call(ag__.ld(self)._compute_loss, (ag__.ld(prediction_tensor), ag__.ld(target_tensor)), dict(**ag__.ld(params)), fscope)
  File "/tmp/__autograph_generated_file3o0kthr0.py", line 15, in tf___compute_loss
    negative_loss = ((ag__.converted_call(ag__.ld(tf).math.pow, ((1 - ag__.ld(target_tensor)), ag__.ld(self)._beta), None, fscope) * ag__.converted_call(ag__.ld(tf).math.pow, (ag__.ld(prediction_tensor), ag__.ld(self)._alpha), None, fscope)) * ag__.converted_call(ag__.ld(tf).math.log, ((1 - ag__.ld(prediction_tensor)),), None, fscope))
TypeError: in user code:

    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 171, in _dummy_computation_fn  *
        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.

WHY??? 😭😭😭 Who can help me?? I need your help!
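One possible workaround, offered only as an untested sketch, is to cast the loss inputs to float32 before evaluating the expression named in the traceback. The helper `negative_loss_fp32` below is hypothetical and is not part of `object_detection/core/losses.py`; it reproduces only the `negative_loss` expression visible in the traceback, and the default values for `alpha` and `beta` are assumptions chosen for illustration. Whether applying the same cast inside `_compute_loss` is enough for end-to-end training is not confirmed here, since other loss terms may hit similar mismatches.

```python
import tensorflow as tf


def negative_loss_fp32(prediction_tensor, target_tensor, alpha=2.0, beta=4.0):
    """Hypothetical helper: the traceback's negative_loss term in float32."""
    # Cast both inputs so every operand of the multiplications is float32,
    # regardless of the dtype the model produced under mixed precision.
    prediction_tensor = tf.cast(prediction_tensor, tf.float32)
    target_tensor = tf.cast(target_tensor, tf.float32)
    return (tf.math.pow(1.0 - target_tensor, beta)
            * tf.math.pow(prediction_tensor, alpha)
            * tf.math.log(1.0 - prediction_tensor))


# float16 predictions (as under mixed_float16) with float32 targets.
pred = tf.constant([0.5, 0.25], dtype=tf.float16)
target = tf.constant([0.0, 0.5], dtype=tf.float32)
out = negative_loss_fp32(pred, target)
print(out.dtype)
```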

tq3940 added the models:research and type:bug labels Jun 2, 2024

google-ml-butler bot assigned laxmareddyp Jun 2, 2024

tq3940 closed this as not planned Jun 2, 2024

tq3940 reopened this Jun 2, 2024
