WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

wingman-jr-addon · 2021-07-19T05:30:37Z

Please make sure that this is a bug. As per our
GitHub Policy,
we only address code/doc bugs, performance issues, feature requests and
build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): Yes. Production code at Check 3.4.0 with WEBGL_PACK_DEPTHWISECONV=false - good wingman-jr-addon/wingman_jr#135, minimal reproduction at Minimal reproduction wingman-jr-addon/wingman_jr#136
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win 10 Home 21H1
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A, but laptop specs:

ideapad FLEX5-1570
Processor	Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz   2.90 GHz
Installed RAM	16.0 GB (15.9 GB usable)
System type	64-bit operating system, x64-based processor
Pen and touch	Pen and touch support with 10 touch points

TensorFlow.js installed from (npm or script link):
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@3.4.0/dist/tf.min.js
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm@3.4.0/dist/tf-backend-wasm.js
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm@3.4.0/dist/tfjs-backend-wasm.wasm
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm@3.4.0/dist/tfjs-backend-wasm-simd.wasm
https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm@3.4.0/dist/tfjs-backend-wasm-threaded-simd.wasm
TensorFlow.js version (use command below): 3.4.0
Browser version: Firefox 90.0 64 bit
Tensorflow.js Converter Version: Unknown, but probably 2.7.0

Current behavior - Upgrading from 3.3.0 to 3.4.0 experienced major performance drop on load+first inference time. 3.3.0 sees times of about 8.8s, 3.4.0 sees times about 14.4s. It pains me to report a bug related to WEBGL_PACK as so much work has gone into this feature, but ... It appears that setting WEBGL_PACK_DEPTHWISECONV=false on 3.4.0 returns to performance found in 3.3.0. Regression with default flags has been found to exist in at least 3.6.0 and 3.8.0 as well. (This was found on a bisection to upgrade from 2.7.0 to 3.8.0 to get the new shader compilation performance improvements started in #5205 )

Expected behavior - 3.4.0 with the flag default WEBGL_PACK_DEPTHWISECONV=true offers similar or better performance to 3.3.0.

Minimal reproduction: wingman-jr-addon/wingman_jr#136
Note this is a Firefox plugin, but TF.js is loaded via a content tab rather in the background context so it should be acting quite similarly to a normal browsing context.

Attached is output from Firefox's about:support, which includes more detailed graphics issues that may be relevant to the matter at hand.
FF90_about_support.txt

The text was updated successfully, but these errors were encountered:

jinjingforever · 2021-07-19T19:15:03Z

Hi Ahmed,

I think you probably have more knowledge about the webgl backend so assigning this to you:) Please help take a look when you have a chance. Really appreciate it!

wingman-jr-addon · 2021-07-20T00:26:23Z

(Also, I see that I neglected to link off to the PR that led me to this issue - #4909 - I don't know that the work done in the bulk of the PR is by any means the cause, but the change of the flag's default option is what caused the regression.)

pyu10055 · 2021-07-22T17:05:45Z

@wingman-jr-addon If the initialization is larger than before, there are possibly two causes:

There are configuration of the depthwise conv2d ops that require more packed shaders to be compiled comparing to unpacked depthwise kernel. But I do see any setup could cause that.
The packed shader takes much longer to compile compares to unpacked version.

Item 2 might be browser specific, can you help to verify if this behavior occurs on firefox and chrome web page? Thanks

wingman-jr-addon · 2021-07-23T01:52:51Z

Thanks for idea to try @pyu10055 . I changed the plugin to run in a web page and then ran across Firefox 90 and Chrome 92.
In general, Chrome had significantly better performance, but there was still a performance gap between the flag being on and off for reload times. I would run the test by opening the browser, loading the web page, recording the time, and then hitting refresh and making note of times after that. The first load times tended to be much higher than subsequent times in Chrome. Also, I was only running one browser at a time during testing in case of some sort of GPU contention.

Here are the raw times in seconds:
FF 90, defaults: 14.5, 13.5, 14.2
FF 90, flag=false: 9.6, 8.9, 9.1, 8.5
Chrome 92, defaults: 13.9, 7.0, 6.8, 6.9, 7.0
Chrome 92, flag=false: 14.9, 5.2, 4.8

I reloaded the last Chrome test and tried it several more times, seeing reload times of 4.5-5.5 seconds.

So, I guess I'm not sure what to make of the results for Chrome - I'm not sure if I should trust the reload results or if I should run a bunch of "first run" tests. For Firefox, the results are slower across the board but consistent on reload.

Let me know what you think. Thanks!

pyu10055 · 2021-07-23T03:47:51Z

My guess is that Chrome have better caching on the shader across page reloads.
One thing you could try, not necessarily relates to depthwise conv2d, use uniform variables for unary and binary ops.

tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);

We have observe significant reduction of initial loading time.

wingman-jr-addon · 2021-07-23T22:23:35Z

Thanks for the tip, I am seeing about a 0.5-1.0s reduction in load time on FF 90!

But getting back to the matter at hand - any clue why we might be seeing such a performance gap on WEBGL_PACK_DEPTHWISECONV=true though?

It could be that it just depends on hardware and mine is not the primary target, I'd just like to make sure there isn't some other issue hanging out.

qjia7 · 2021-07-27T02:19:59Z

@wingman-jr-addon @pyu10055 My guess WEBGL_PACK_DEPTHWISECONV=true brings more shaders to compile. You can compare the binaryCaches's size in backend_webgl.ts between WEBGL_PACK_DEPTHWISECONV=true and WEBGL_PACK_DEPTHWISECONV=false to see how many shaders are added (they probably are encode/decode matrix shaders. If that's the case, I think this PR #5297 may resolve your issue by using uniforms to encode/decode matrix programs). And, in another side, we should check why so many encode/decode shaders are introduced. Can them be avoided/reduced? In my opinion, if we go to the packed path, it will be best if we only pack data at the beginning and unpack the data at the last and try to avoid encode/decode the data in the middle. Maybe we can check if there is any chance to optimize it.

vladmandic · 2021-07-29T19:21:53Z

Cross-posting from #5205

I've tested this on my notebook with 3 different models of medium-high complexity

Model	DataSet	WEBGL_PACK_DEPTHWISECONV	WEBGL_USE_SHAPES_UNIFORMS	Warmup	Execution	Note
Inception-v4	ImageNet	True	False	11.2sec	42ms	Default
Inception-v4	ImageNet	False	False	10.8sec	45ms
Inception-v4	ImageNet	False	True	10.8sec	45ms
Inception-v4	ImageNet	True	True	11.2sec	42ms
SSD/MobileNet-v2	OpenImages	True	False	14.7	2.1sec	Default
SSD/MobileNet-v2	OpenImages	False	False	13.3sec	2.2sec
SSD/MobileNet-v2	OpenImages	False	True	12.7sec	2.1sec
SSD/MobileNet-v2	OpenImages	True	True	13.6sec	2.1sec
EfficientDet-D4	CoCo	True	False	23.1sec	12.9sec	Default
EfficientDet-D4	CoCo	False	False	16.1sec	14.5sec
EfficientDet-D4	CoCo	False	True	15.9sec	14.0sec
EfficientDet-D4	CoCo	True	True	21.1sec	13.00sec

All-in-all:

WEBGL_USE_SHAPES_UNIFORMS helps to significantly reduce warmup with NO negative impact on subsequent inference
WEBGL_PACK_DEPTHWISECONV increases warmup too much even if subsequent inference is slightly faster

As it is, I'll be setting WEBGL_USE_SHAPES_UNIFORMS=True and WEBGL_PACK_DEPTHWISECONV=False on my projects as even with uniforms enabled (which does help), it's still too slow on warmup

Note: Chrome does extensive shader caching between sessions, so simple page reload is not sufficient and full browser restart is needed between tests

wingman-jr-addon · 2021-08-12T04:58:28Z

@rthadur Would you agree the awaiting response label can probably be removed from this issue now?

vladmandic · 2021-09-15T22:54:37Z

@ahmedsabie @qjia7 @rthadur @pyu10055 Any updates on this? As you can see, WEBGL_PACK_DEPTHWISECONV=True (which is default value) has a massive negative performance impact - and it's gotten far worse in newer versions of TFJS.

This is a major regression and it has very little updates.

And yes, using WEBGL_USE_SHAPES_UNIFORMS is much better, but - a) it's not a solution, it's an alternative, b) it's not widely implemented, c) almost nobody knows about it.

vladmandic · 2021-10-06T22:26:22Z

see #5689 for fully reproducible code and additional performance notes.

pyu10055 · 2021-10-08T21:28:00Z

@vladmandic @wingman-jr-addon I think this could be caused by the packed depthwise conv2d shader could be much larger in size than unpacked depthwise conv2d version. This could be related to the filter size, since the packed version expand the loop of the filter width into code. Can you share what is the filter size for depthwise conv2d in your model?

And the other question is, we have a way to make the initial warm non UI blocking, basically by yielding the JS thread and removing all GL block calls (parallel shader compilation). But the overall warm time might still be similar. Will this behavior be helpful for your use cases?

vladmandic · 2021-10-08T21:51:49Z

Can you share what is the filter size for depthwise conv2d in your model?

I have seen the same behavior in almost every off-the-shelf model
Just pick any from TFHub - I've provided performance data for Inception-v4, EfficientDet-D4, EfficientNet-B5, MobileNet-v2
And only model that doesn't have the problem is ancient MobileNet-v2
(and since I have an automated test for this, I can reproduce using any given model)

What is the intended benefit of the packed conv2d shader? I don't see much benefit of it:

few percent faster inference (never more than 2-5%, hardly worth it)
few hundred percent slowdown in warmup (average 50-400%, massive)

And the other question is, we have a way to make the initial warm non UI blocking, basically by yielding the JS thread and removing all GL block calls (parallel shader compilation). But the overall warm time might still be similar. Will this behavior be helpful for your use cases?

Yes and No :)

Yes for overall usage as any reduction of blocking calls is very welcome.
No as I already use web workers in most cases anyhow,
so my UI is not really blocked - but since app requires models to work, not much use of app until warmup has finished

pyu10055 · 2021-10-08T22:21:14Z

We have seen significant performance gain on mobile device with the packed depthwise conv2d shader, especially for android devices.
Thank you for the insights, we will see if we can avoid the slow start up time it introduces.
As for UI blocking, even with web worker, the UI can be blocked, since chrome has a single GL command queue, any blocking GL call will prevent new GL commands to be flushed to GPU.

vladmandic · 2021-10-08T22:47:42Z

Thanks @pyu10055,

Perhaps as a start, you could do conditional WEBGL_PACK_DEPTHWISECONV = isMobile() ?
It doesn't solve a problem, just makes less people immediately affected

I just tried on Android. Yes, inference performance difference on Android is visible (unlike on desktop)
IMO - still, negative impact of slow warmup causing users to simply give up and close the app outweighs performance benefits of actual inference

Re: UI Blocking - true, if there is any GL usage elsewhere. Anyhow, if you can make it non-blocking, that is very much welcome

And when will WEBGL_USE_SHAPES_UNIFORMS become a default?

google-ml-butler · 2021-10-12T20:01:47Z

Are you satisfied with the resolution of your issue?
Yes
No

vladmandic · 2021-10-21T12:41:59Z

closing the loop after testing using todays code in main branch:

warmup is now about 2x faster
no material difference regardless if WEBGL_PACK_DEPTHWISECONV is enabled or disabeld so that issue is resolved
do note that enabling WEBGL_USE_SHAPES_UNIFORMS performs much better (2x faster warmup) regardless of packing (actually packing improvements make it even faster)!

webgl default

warmup initial 53sec in 3.9.0 -> 32sec in main branch
warmup cached 15sec -> 12sec

webgl with uniforms enabled

warmup initial 23sec -> 18sec
warmup cached 14sec > 5sec

@pyu10055 please consider enabling uniforms as default

wingman-jr-addon added the type:bug Something isn't working label Jul 19, 2021

rthadur assigned jinjingforever Jul 19, 2021

rthadur added the comp:webgpu label Jul 19, 2021

jinjingforever assigned ahmedsabie and unassigned jinjingforever Jul 19, 2021

rthadur added the stat:awaiting response label Jul 22, 2021

rthadur removed the stat:awaiting response label Aug 12, 2021

wingman-jr-addon mentioned this issue Sep 3, 2021

Challenging Service Worker Use Case - Machine Learning Models w3c/webextensions#73

Open

rthadur assigned pyu10055 and unassigned ahmedsabie Sep 23, 2021

pyu10055 mentioned this issue Oct 11, 2021

[webgl] use shader loop instead of expanding with js to reduce shader size for packed Depthwise #5714

Merged

pyu10055 closed this as completed in #5714 Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

wingman-jr-addon commented Jul 19, 2021

jinjingforever commented Jul 19, 2021

wingman-jr-addon commented Jul 20, 2021

pyu10055 commented Jul 22, 2021

wingman-jr-addon commented Jul 23, 2021

pyu10055 commented Jul 23, 2021 •

edited

Loading

wingman-jr-addon commented Jul 23, 2021

qjia7 commented Jul 27, 2021

vladmandic commented Jul 29, 2021 •

edited

Loading

wingman-jr-addon commented Aug 12, 2021

vladmandic commented Sep 15, 2021

vladmandic commented Oct 6, 2021

pyu10055 commented Oct 8, 2021 •

edited

Loading

vladmandic commented Oct 8, 2021 •

edited

Loading

pyu10055 commented Oct 8, 2021

vladmandic commented Oct 8, 2021

google-ml-butler bot commented Oct 12, 2021

vladmandic commented Oct 21, 2021

WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

Comments

wingman-jr-addon commented Jul 19, 2021

jinjingforever commented Jul 19, 2021

wingman-jr-addon commented Jul 20, 2021

pyu10055 commented Jul 22, 2021

wingman-jr-addon commented Jul 23, 2021

pyu10055 commented Jul 23, 2021 • edited Loading

wingman-jr-addon commented Jul 23, 2021

qjia7 commented Jul 27, 2021

vladmandic commented Jul 29, 2021 • edited Loading

wingman-jr-addon commented Aug 12, 2021

vladmandic commented Sep 15, 2021

vladmandic commented Oct 6, 2021

pyu10055 commented Oct 8, 2021 • edited Loading

vladmandic commented Oct 8, 2021 • edited Loading

pyu10055 commented Oct 8, 2021

vladmandic commented Oct 8, 2021

google-ml-butler bot commented Oct 12, 2021

vladmandic commented Oct 21, 2021

pyu10055 commented Jul 23, 2021 •

edited

Loading

vladmandic commented Jul 29, 2021 •

edited

Loading

pyu10055 commented Oct 8, 2021 •

edited

Loading

vladmandic commented Oct 8, 2021 •

edited

Loading