Owl v2 speedup #759
Conversation
assert abs(223 - response.predictions[0].x) < 1.5
...
if __name__ == "__main__":
I suggest moving this somewhere else - I maintain a stack of useful scripts in the development directory.
You gave me this comment and approved - should I interpret that as "move and re-review", or "merge and keep this in mind for the future"? In other words, what should my actionable item be here?
Thank you for the feedback! :)
3d93256
Description
This change introduces some simple approaches to speeding up inference of an OWLv2 model over the original Huggingface implementation. On an L4 GPU, the original implementation took ~460ms per iteration of the new test_owlv2.py file; the new one takes ~36ms. I wrote a small latency test script at the bottom of test_owlv2.py; I'm not sure whether there's a better or more standard way of implementing that functionality, or whether I should just take it out.
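For reference, a minimal sketch of the kind of latency check I mean, with illustrative names (the `infer`-style method is an assumption, not the actual test_owlv2.py code):

```python
# Minimal latency-check sketch; `model.infer(...)` is an illustrative assumption,
# not the real API used in test_owlv2.py.
import time

import numpy as np


def measure_latency(model, image: np.ndarray, warmup: int = 5, iters: int = 20) -> float:
    for _ in range(warmup):  # warm-up lets torch.compile / CUDA initialization settle
        model.infer(image)
    start = time.perf_counter()
    for _ in range(iters):
        model.infer(image)
    return (time.perf_counter() - start) / iters * 1000.0  # mean latency in ms
```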
First, we replace the Huggingface preprocessing pipeline with one that makes full use of a GPU if available. This reduced preprocessing time from ~200ms to <10ms. I borrowed this implementation from my own open source repo here.
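As a rough illustration of the idea (not the borrowed implementation itself; the normalization constants and 960px target size are assumptions, and the model's square-padding step is omitted), the preprocessing becomes a handful of torch ops that run on the GPU:

```python
# Illustrative GPU-side preprocessing sketch; the real code comes from the linked repo.
# Normalization constants and the 960px target size are assumptions, and padding to a
# square is omitted for brevity.
import numpy as np
import torch
import torch.nn.functional as F

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MEAN = torch.tensor([0.485, 0.456, 0.406], device=DEVICE).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device=DEVICE).view(1, 3, 1, 1)


def preprocess(image: np.ndarray, size: int = 960) -> torch.Tensor:
    # HWC uint8 on the CPU -> NCHW float on the GPU, then resize + normalize there.
    x = torch.from_numpy(image).to(DEVICE).permute(2, 0, 1).unsqueeze(0).float() / 255.0
    x = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    return (x - MEAN) / STD
```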
Second, we run the image backbone in mixed precision using PyTorch's autocast. We use float16 instead of bfloat16 because it is compatible with older GPUs such as the T4. We disable mixed precision on CPU, where running in float16 can lead to unexpected behavior such as silently hanging. We don't run the rest of the model in mixed precision because the built-in NMS kernel does not automatically support mixed precision, and working around that would require a refactor with little to no additional speedup.
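A minimal sketch of the device-conditional autocast (the backbone is passed in generically here; the exact attribute path in the real code may differ):

```python
# Sketch of running only the image backbone in mixed precision; the rest of the model
# and post-processing (including NMS) stay in full precision.
import torch


def run_backbone(vision_model: torch.nn.Module, pixel_values: torch.Tensor):
    if pixel_values.is_cuda:
        # float16 rather than bfloat16 so older GPUs such as the T4 are supported.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            return vision_model(pixel_values)
    # On CPU, float16 autocast can silently hang, so keep full precision there.
    return vision_model(pixel_values)
```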
Third, we compile the model using torch's built-in torch.compile method. I limited the scope of the compilation to the vision backbone's forward pass, following my open source repo, since this reduces potential issues with the compiler interacting with Python objects. Note that the Huggingface OWLv2 implementation uses a manual self-attention kernel, which is very slow compared to optimized kernels such as flash attention. Originally I manually replaced the attention mechanism with flash_attn (again following my open source code release), but that package has challenging hardware dependency issues. I then found that torch 2.4's torch.compile achieved a similar runtime to the flash attention implementation. That, together with the associated reduction in VRAM usage, led me to conclude that the compiler was likely optimizing out the manual attention implementation and replacing it with something more effective. Since torch.compile is very general and doesn't have awkward hardware dependency issues, I opted to use it instead of manually plugging in flash attention.
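A rough sketch of the scoped compilation (the `owlv2.vision_model` attribute path is my reading of the Huggingface module layout and should be treated as an assumption):

```python
# Sketch of limiting torch.compile to the vision backbone; torch >= 2.4 is assumed, and
# the `owlv2.vision_model` attribute path is an assumption about the HF Owlv2 layout.
import torch
from transformers import Owlv2ForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
model = model.eval().to(device)

# Compile only the backbone so the compiler never has to trace the Python-heavy
# post-processing (box decoding, NMS, etc.).
model.owlv2.vision_model = torch.compile(model.owlv2.vision_model)
```

Note that the first call after compilation is slow because of the compile step itself, which is one reason any timing should include warm-up iterations.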
Combined, the second and third improvements brought the model time on an L4 GPU from ~200ms to ~20ms. Overall, I reduced the latency from ~460ms to ~36ms.
On a T4 GPU, the improvements reduce the latency from ~680ms to ~170ms. Additionally, memory usage is higher on a T4 GPU than on an L4 GPU. I suspect this is because torch.compile is not introducing flash attention into the pipeline the way it does on the newer L4 GPU. The most popular flash attention implementation, flash_attn, does not support T4 GPUs, which may be the source of the problem. This could be addressed by building hardware-conditional optimizations manually, e.g. changing which version of flash_attn is installed based on the available hardware, but I'm leaving that to a future version since there don't yet seem to be established best practices for hardware-conditional code in this codebase.
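If hardware-conditional logic is ever added, one hypothetical way to gate it is on CUDA compute capability (a sketch only; the SM 8.0 cutoff is my assumption about flash_attn's requirements, and nothing like this exists in this PR):

```python
# Hypothetical hardware gate; the >= 8.0 (Ampere) cutoff for flash_attn is an assumption.
import torch


def supports_flash_attn() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # T4 is SM 7.5 and would be excluded; L4 is SM 8.9 and would pass.
    return major >= 8
```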
Finally, running the larger version of the model takes ~140ms on an L4 and ~960ms on a T4 GPU. That means the bigger model with these optimizations is actually faster than the existing version on an L4, but meaningfully slower on a T4. This is likely due to the conjectured lack of flash attention, which would slow down the larger model more than the base model since the larger model uses a larger image size and therefore processes more tokens via attention.
I also changed some of the type signatures from Image.Image to np.ndarray as it looks like they were just receiving numpy arrays anyway. Let me know if that is not correct!
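For illustration only (the function name here is hypothetical, not one actually touched in the diff), the signature change looks like this:

```python
# Hypothetical illustration of the signature change; the actual function names differ.
import numpy as np


def embed_image(image: np.ndarray) -> np.ndarray:  # previously typed as Image.Image
    ...
```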
This is my first pull request with Roboflow! Please let me know in what ways I am deviating from expected behavior :)
Type of change
How has this change been tested? Please provide a test case or example of how you tested the change.
I updated the tolerance of the existing OWLv2 integration test and ran it. @probicheaux and I also tested it briefly in his OWLv2 branch. These changes shouldn't meaningfully change the logic of the model, so I would expect no major behavior changes unless I made a mistake in the preprocessing function.
Let me know if I should build out more thorough testing tools, and what the best practices around them are in inference! At some point I would love to introduce more thorough testing anyway, since I have a lot of ideas for further speedups that DO meaningfully change the behavior of the model.
Any specific deployment considerations
It's possible that different versions of torch could introduce issues, although I did do some testing on that front. Additionally, different hardware targets may behave differently, but torch SHOULD abstract most breaking issues away.
Docs
No updates to docs.