Unable to use "mixed_float16" in Object detect API #11215

Open
tq3940 opened this issue Jun 2, 2024 · 3 comments

tq3940 commented Jun 2, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [√] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [√] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [√] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

I'm trying to use "mixed_float16" to speed up training on an RTX 4090. Following the official mixed precision guide, I added mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in my train_tf2.py. However, TensorFlow reported the following error:

        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.
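
For reference, the same kind of dtype mismatch can be reproduced outside the Object Detection API. This is only a minimal sketch: the function name `multiply_mixed_dtypes` and the tensor values are made up for illustration and do not come from the repository. It multiplies a float32 tensor by a float16 tensor, which TensorFlow rejects by default, and then shows that an explicit `tf.cast` makes the multiplication valid.

```python
import tensorflow as tf


def multiply_mixed_dtypes():
    """Multiply float32 by float16 and return the exception raised, if any."""
    x = tf.constant([0.5, 0.25], dtype=tf.float32)
    y = tf.constant([0.5, 0.25], dtype=tf.float16)
    try:
        x * y  # TensorFlow does not promote dtypes automatically by default.
    except (TypeError, ValueError, tf.errors.InvalidArgumentError) as err:
        # The exception class depends on eager vs. graph (tf.function) mode.
        return err
    return None


err = multiply_mixed_dtypes()
print(type(err).__name__, err)

# Casting one operand to the dtype of the other avoids the mismatch.
x = tf.constant([0.5, 0.25], dtype=tf.float32)
y = tf.constant([0.5, 0.25], dtype=tf.float16)
product = x * tf.cast(y, x.dtype)
print(product.dtype)
```

In the traceback above, 'x' would be the float32 term built from target_tensor and 'y' a float16 term, presumably derived from the model's predictions under the mixed_float16 policy.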

I also tried adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16'), which I adapted from the tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_bfloat16') call found in model_lib_v2.py,

or setting an environment variable with os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1', which was suggested in this answer.

But all of my attempts failed with the error above. I want to know how to solve this issue.
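
One possible workaround, offered as a sketch rather than a confirmed fix, is to cast the predictions to the dtype of the targets before doing the loss math, so that every term in the product is float32. The function below is a standalone illustration modeled on the expression visible in the traceback. The name `penalty_reduced_focal_loss`, its parameters, the default values, and the sigmoid/clipping step are assumptions for the example; it is not the implementation in object_detection/core/losses.py.

```python
import tensorflow as tf


def penalty_reduced_focal_loss(prediction_logits, target_tensor,
                               alpha=2.0, beta=4.0, clip_value=1e-4):
    """Hypothetical focal-style loss that tolerates float16 predictions."""
    # Key step: bring the (possibly float16) predictions to the targets' dtype.
    prediction_logits = tf.cast(prediction_logits, target_tensor.dtype)

    # Clip the probabilities so the logarithms below stay finite.
    prediction = tf.clip_by_value(
        tf.sigmoid(prediction_logits), clip_value, 1.0 - clip_value)

    is_present = tf.math.equal(target_tensor, 1.0)
    positive_loss = tf.math.pow(1.0 - prediction, alpha) * tf.math.log(prediction)
    # Same structure as the failing line: all three factors are now float32.
    negative_loss = (tf.math.pow(1.0 - target_tensor, beta)
                     * tf.math.pow(prediction, alpha)
                     * tf.math.log(1.0 - prediction))
    return -tf.where(is_present, positive_loss, negative_loss)


# float16 predictions, as a model under mixed_float16 might produce.
logits = tf.constant([[0.0, 2.0], [-1.0, 3.0]], dtype=tf.float16)
targets = tf.constant([[1.0, 0.0], [0.3, 0.0]], dtype=tf.float32)
loss = penalty_reduced_focal_loss(logits, targets)
print(loss.dtype, loss.shape)
```

Computing the loss in float32 is also what the mixed precision guide generally recommends for numerical stability. Whether a cast like this is enough for the whole CenterNet loss pipeline is not verified here.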

3. Steps to reproduce

Add mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in train_tf2.py, then start training.

4. Expected behavior

The model can be trained in "mixed precision" mode.

5. Additional context

None

6. System information

  • OS Platform and Distribution : Linux Ubuntu 22.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 2.13.1
  • Python version: 3.8.10
  • CUDA/cuDNN version: 12.2(cuda) / 8.6.0.163(cudnn)
  • GPU model and memory: NVIDIA GeForce RTX 4090 24G
tq3940 added the models:research and type:bug labels on Jun 2, 2024
tq3940 closed this as not planned on Jun 2, 2024
tq3940 reopened this on Jun 2, 2024

tq3940 commented Jun 2, 2024

I am training the pre-trained model: centernet_hg104_512x512_coco17_tpu-8

tq3940 commented Jun 3, 2024

I repeated my earlier attempt, this time adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16') inside train_loop() (model_lib_v2.py), and got the following log output, which seems to show that mixed precision was enabled successfully:

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9
I0603 20:37:33.016378 140087643665600 device_compatibility_check.py:130] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9

HOWEVER! The same error appeared again!!

Traceback (most recent call last):
  File "scripts/model_main_tf2.py", line 133, in <module>
    tf.compat.v1.app.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "scripts/model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 609, in train_loop
    load_fine_tune_checkpoint(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    strategy.run(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3250, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 696, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 84, in call_for_each_replica
    return wrapped(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_file7sbf228o.py", line 14, in tf__wrapped_fn
    retval_ = ag__.converted_call(ag__.ld(call_for_each_replica), (ag__.ld(strategy), ag__.ld(fn).python_function, ag__.ld(args), ag__.ld(kwargs)), None, fscope)
  File "/tmp/__autograph_generated_file3vmq98ob.py", line 18, in tf___dummy_computation_fn
    retval_ = ag__.converted_call(ag__.ld(_compute_losses_and_predictions_dicts), (ag__.ld(model), ag__.ld(features), ag__.ld(labels)), dict(training_step=0), fscope)
  File "/tmp/__autograph_generated_filen7syj8px.py", line 16, in tf___compute_losses_and_predictions_dicts
    losses_dict = ag__.converted_call(ag__.ld(model).loss, (ag__.ld(prediction_dict), ag__.ld(features)[ag__.ld(fields).InputDataFields.true_image_shape]), None, fscope)
  File "/tmp/__autograph_generated_filekxlnseiw.py", line 16, in tf__loss
    object_center_loss = ag__.converted_call(ag__.ld(self)._compute_object_center_loss, (), dict(object_center_predictions=ag__.ld(prediction_dict)[ag__.ld(OBJECT_CENTER)], input_height=ag__.ld(input_height), input_width=ag__.ld(input_width), per_pixel_weights=ag__.ld(valid_anchor_weights), maximum_normalized_coordinate=ag__.ld(maximum_normalized_coordinate)), fscope)
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 77, in tf___compute_object_center_loss
    ag__.for_stmt(ag__.ld(object_center_predictions), None, loop_body, get_state_2, set_state_2, ('loss',), {'iterate_names': 'pred'})
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 75, in loop_body
    loss += ag__.converted_call(object_center_loss, (pred, flattened_heatmap_targets), dict(weights=per_pixel_weights), fscope)
  File "/tmp/__autograph_generated_filelc2if2ji.py", line 69, in tf____call__
    retval_ = ag__.converted_call(ag__.ld(self)._compute_loss, (ag__.ld(prediction_tensor), ag__.ld(target_tensor)), dict(**ag__.ld(params)), fscope)
  File "/tmp/__autograph_generated_file3o0kthr0.py", line 15, in tf___compute_loss
    negative_loss = ((ag__.converted_call(ag__.ld(tf).math.pow, ((1 - ag__.ld(target_tensor)), ag__.ld(self)._beta), None, fscope) * ag__.converted_call(ag__.ld(tf).math.pow, (ag__.ld(prediction_tensor), ag__.ld(self)._alpha), None, fscope)) * ag__.converted_call(ag__.ld(tf).math.log, ((1 - ag__.ld(prediction_tensor)),), None, fscope))
TypeError: in user code:

    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 171, in _dummy_computation_fn  *
        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.

WHY??? 😭😭😭 Who can help me?? I need your help!
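
A likely explanation, based on how Keras mixed precision works in general and not confirmed for this particular model, is that the compatibility check only says the GPU supports float16. Once the policy is active, layers compute and return float16 while their variables (and any float32 ground-truth tensors) keep their dtype, so loss code that mixes the two without a cast can fail. The sketch below shows the layer-side behaviour on a plain Dense layer; the helper name `dense_dtypes_under_mixed_float16` is made up for illustration.

```python
import tensorflow as tf


def dense_dtypes_under_mixed_float16():
    """Return (output dtype, kernel dtype) of a Dense layer under mixed_float16."""
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
    try:
        layer = tf.keras.layers.Dense(4)
        outputs = layer(tf.zeros([1, 3], dtype=tf.float32))
        # tf.as_dtype normalizes the kernel dtype across Keras versions.
        return outputs.dtype, tf.as_dtype(layer.kernel.dtype)
    finally:
        # Restore the usual default so later code is unaffected
        # (assumes the policy was float32 before this call).
        tf.keras.mixed_precision.set_global_policy("float32")


output_dtype, kernel_dtype = dense_dtypes_under_mixed_float16()
print("layer output dtype:", output_dtype)
print("layer kernel dtype:", kernel_dtype)
```

If the model's heatmap predictions come out as float16 in the same way, they would meet the float32 target_tensor inside _compute_loss, which matches the 'Mul' dtype error in the traceback.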
