[perf] improve shader compilation for WebGL with KHR_parallel_shader_compile extension #5205
cc @qjia7
@pyu10055 Will someone be assigned to this issue, or do you need any help from us?
@qjia7 If you have bandwidth, we would love to have your help with the initial investigation. As of today, our shader compilations are performed at per-op execution time. It would be interesting to see how the extension would fit into this scenario.
There are several things we can try:
We'd love to try 1 and 2, but 3 needs your help since it will change the upper framework. What do you think?
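For context, a minimal sketch of the usual KHR_parallel_shader_compile pattern (my own illustration, not TFJS code; `splitReady` and `pollLinkStatus` are hypothetical names): issue all `compileShader`/`linkProgram` calls up front without querying link status, then poll the non-blocking `COMPLETION_STATUS_KHR` so the driver can compile programs in parallel.

```javascript
// Sketch only: names here are illustrative, not TFJS APIs. The key idea is to
// avoid calling getProgramParameter(p, LINK_STATUS) right after linkProgram
// (which blocks until compilation finishes); instead, poll the non-blocking
// COMPLETION_STATUS_KHR query each frame.

// Pure helper: partition programs by a readiness predicate.
function splitReady(programs, isReady) {
  const ready = [];
  const pending = [];
  for (const p of programs) (isReady(p) ? ready : pending).push(p);
  return { ready, pending };
}

// Poll until all programs have finished linking, invoking onReady per program.
function pollLinkStatus(gl, ext, programs, onReady) {
  const { ready, pending } = splitReady(
      programs, p => gl.getProgramParameter(p, ext.COMPLETION_STATUS_KHR));
  ready.forEach(onReady);
  if (pending.length > 0) {
    requestAnimationFrame(() => pollLinkStatus(gl, ext, pending, onReady));
  }
}
```

Issuing all compiles for a model's ops before the first poll is what lets the browser compile them concurrently, which is what the options above are after.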
just my $0.02... first - I LOVE this proposal! This is probably the biggest issue with WebGL nowadays, as slow app startup turns users away. (1) doesn't do much due to how tfjs shader compilation is structured. Enumerating the ops used by a model:

```ts
const model: GraphModel = await tf.loadGraphModel('test/model.json');
const ops: Record<string, Array<string>> = {};
for (const op of Object.values(model.executor.graph.nodes) as Array<{category: string, op: string}>) {
  if (!ops[op.category]) ops[op.category] = [];
  if (!ops[op.category].includes(op.op)) ops[op.category].push(op.op);
}
console.log('ops used by model:', ops);
```

output:

```
ops used by model: {
  graph: [ 'Const', 'Placeholder', 'Identity' ],
  convolution: [ '_FusedConv2D', 'FusedDepthwiseConv2dNative', 'DepthwiseConv2dNative', 'Conv2D', 'MaxPool' ],
  arithmetic: [ 'Mul', 'Add', 'FloorDiv', 'FloorMod', 'Sub' ],
  basic_math: [ 'Relu6', 'Relu', 'Sigmoid' ],
  reduction: [ 'Mean' ],
  image: [ 'ResizeBilinear' ],
  slice_join: [ 'ConcatV2', 'GatherV2', 'StridedSlice' ],
  transformation: [ 'Reshape', 'Cast', 'ExpandDims' ],
  logical: [ 'Equal' ],
  evaluation: [ 'TopKV2' ]
}
```
@qjia7 I agree with @vladmandic that options 2 and 3 look crucial to gaining performance from parallel compilation. Similar to the warm-up run, the graph model could have a compilation step, and the engine should have a compile API, in contrast to the current execution API, to avoid any texture upload.
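Until a dedicated compile-only API exists, the warm-up-run idea mentioned above can be sketched roughly like this (hedged: `dummyShape` and `warmUp` are illustrative names, not TFJS APIs; assumes `tf` from @tensorflow/tfjs and a loaded GraphModel are in scope):

```javascript
// Illustrative sketch, not a TFJS API. A single inference on dummy input
// forces shader compilation and texture allocation before real inputs arrive.

// Pure helper: turn an input signature into a concrete dummy shape,
// replacing dynamic dimensions (null or -1) with 1.
function dummyShape(signatureShape) {
  return signatureShape.map(d => (d == null || d < 0 ? 1 : d));
}

// Assumes `tf` (@tensorflow/tfjs) is loaded and `model` is a GraphModel.
async function warmUp(model) {
  const shape = dummyShape(model.inputs[0].shape);
  const input = tf.zeros(shape);
  const output = model.predict(input); // first run compiles all shaders
  await output.data();                 // wait for execution to finish
  input.dispose();
  output.dispose();
}
```

Note that this still pays for texture upload during the warm-up, which is why the compile-only engine API suggested above would be preferable.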
(Non-technical comment: I write a browser plugin that basically blocks browser functionality for 10 seconds during model loading, so I'm quite happy to hear about performance improvement ideas here and plan to watch the progress eagerly! Thanks!)
Thanks for your input. I will take a look at step 2.
PERF Fix tensorflow#5205 This PR adds shapes uniforms support and enables it for unary/binary ops.
FEATURE webgl: Add shapes uniforms to reduce shader compilation time (PERF, Fix #5205). This PR adds shapes uniforms support and enables it for unary/binary ops.
- fix the bot failure
- Add annotation for the key composition
- address comments
- Disable shapes uniforms by default and enable it in integration test
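To illustrate why shapes uniforms reduce compilation time (my own simplified sketch, not the PR's actual key composition): when shapes are baked into the shader source, every new input shape produces a new cache key and a fresh compile; when shapes are passed as uniforms, one compiled program can serve all shapes with the same rank pattern.

```javascript
// Simplified illustration of shader-cache keying; not the PR's real code.

// Baking shapes into source: the key includes concrete dimensions,
// so every distinct shape compiles a new program.
function keyWithBakedShapes(opName, shapes) {
  return `${opName}|${shapes.map(s => s.join('x')).join(';')}`;
}

// Shapes as uniforms: the key only includes ranks, so one compiled
// program is reused across all shapes of the same rank.
function keyWithShapeUniforms(opName, shapes) {
  return `${opName}|ranks:${shapes.map(s => s.length).join(',')}`;
}
```

So resizing an input from 224x224 to 256x256 triggers a recompile under the first scheme but hits the program cache under the second.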
@qjia7 Thanks for your hard work! I was so excited to give this a try as I saw TF.js 3.8.0 was released! My plugin is still back on 2.7.0, so I did a quick upgrade. What kind of performance numbers were others here seeing from this PR? (also @vladmandic @pyu10055)
@wingman-jr-addon This issue has not been finished; it may have been closed by accident. Currently, using shapes uniforms is disabled by default. You need to set
Thank you for the detailed explanation @qjia7 - if it's hidden behind a flag, I'm guessing that this regression has nothing to do with your recent work. Based on that, let me do some bisecting on versions and see if I can narrow the cause down a bit further and then provide a minimal reproduction either here or in an appropriate issue.
@qjia7 Through bisection I've narrowed it down to a change that occurred between 3.3.0 and 3.4.0. I'll do some more looking, but that means it is definitely not related to this functionality.
I've tested this on my notebook with 3 different models of medium-high complexity.
All-in-all:
As it is, I'll be setting
Note: Chrome does extensive shader caching between sessions, so a simple page reload is not sufficient; a full browser restart is needed between tests.
Thank you @vladmandic for your much more thorough analysis. I'm sure that took quite some time. I'll be watching over on the issue where you cross-posted as we look at this issue specifically.
Bug #5205 Co-authored-by: Na Li <linazhao@google.com>
See #5689 for fully reproducible code and additional performance notes.
The related PR has been merged, so I'm closing this issue. Thank you!
Please make sure that this is a bug. As per our
GitHub Policy,
we only address code/doc bugs, performance issues, feature requests and
build/installation issues on GitHub.
System information
Describe the current behavior
The initial inference on the current TFJS WebGL backend is much slower, which is caused by shader compilation and texture allocation.
Describe the expected behavior
With the KHR_parallel_shader_compile extension, there is a chance to speed up shader compilation and reduce the initial inference time.
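Since KHR_parallel_shader_compile is an optional WebGL extension, a backend would first need to feature-detect it and fall back to synchronous compilation when absent (hedged sketch; `supportsParallelCompile` is an illustrative name, not a TFJS API):

```javascript
// Illustrative feature detection; returns false when the extension
// (or the context itself) is unavailable, so callers can fall back
// to blocking on LINK_STATUS as today.
function supportsParallelCompile(gl) {
  return gl != null &&
         typeof gl.getExtension === 'function' &&
         gl.getExtension('KHR_parallel_shader_compile') != null;
}
```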
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/CodePen/any notebook.
Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.