webgpu: Support depthwise conv2d with nchw format #6084
Conversation
@qjia7 Hi Jiajia, from my perspective, computing through NCHW may have better locality for I/O, which has potential performance gains. I am curious why you did not merge this?
Yes, you are right. This PR does have a perf gain for depthwise conv2d. But at that time, I hadn't figured out how to optimize conv2d using NCHW. So this PR introduces several extra transposes before and after depthwise conv2d, which is not good for the whole model. But now we have a clear target to use NCHW for the whole model. I will revisit this PR soon after I finish some code cleanup/refactoring. Thanks.
Got it, and thank you for explaining this! 👍🏻
Force-pushed from 0d02f71 to 7b2de64
Thank you Jiajia! The implementation of NCHW is very impressive!
Reviewable status: 0 of 1 approvals obtained (waiting on @lina128, @qjia7, and @xhcao)
tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts
line 59 at r11 (raw file):
convInfo.strideHeight === 1 && convInfo.strideWidth === 1 && convInfo.dilationWidth === 1 && convInfo.dilationHeight === 1 && convInfo.inChannels === convInfo.outChannels) {
Since the `workGroupSize` is fixed as `[8, 8, 1]`, if we get into this branch with filterSize > 64, I think the assert of `filterSize <= workGroupSize` will be triggered?
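To make the arithmetic behind this concern concrete, here is a small CPU-side sketch (the helper name is hypothetical, not part of the tfjs codebase): an `[8, 8, 1]` work group has 64 workers, so a filter with more than 64 elements cannot be loaded one-element-per-worker.

```typescript
// Hypothetical helper illustrating the reviewer's concern: a work group of
// [8, 8, 1] has 8 * 8 * 1 = 64 workers, so a filter larger than 64
// elements cannot be loaded with one element per worker.
function exceedsOneLoadPerWorker(
    filterHeight: number, filterWidth: number,
    workGroupSize: [number, number, number]): boolean {
  const filterSize = filterHeight * filterWidth;
  const workers = workGroupSize[0] * workGroupSize[1] * workGroupSize[2];
  return filterSize > workers;
}

// A 9x9 filter has 81 elements, which exceeds the 64 workers of an
// [8, 8, 1] work group, so the assert would fire.
const wouldAssert = exceedsOneLoadPerWorker(9, 9, [8, 8, 1]);
```

For example, a 5x5 filter (25 elements) still fits under one-load-per-worker, while 9x9 (81 elements) does not.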
tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts
line 145 at r11 (raw file):
// Load one tile of X into local memory.
mm_Asub[localRow][localCol] = readX(batch, d1, inputRowStart, inputColStart);
I think your logic here is right, but could you help me understand it?
It looks like all the workers will read the same data, because the parameters in `readX(batch, d1, inputRowStart, inputColStart)` are constant for all workers, independent of the workers' positions?
Also, why do we read 4 times for `mm_Asub`, and why don't we just do something like:
let wRow = localIndex / ${this.filterWidth};
let wCol = localIndex % ${this.filterWidth};
mm_Asub[wRow][wCol] = getW(batch, d1, inputRowStart + wRow, inputColStart + wCol);
Reviewable status: 0 of 1 approvals obtained (waiting on @lina128, @qjia7, and @xhcao)
tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts
line 111 at r11 (raw file):
uniforms.dilation[1]; // Convolve x(?, ?, d1) with w(:, :, d1, q) to get y(yR, yC, d2).
Could you update this comment for NCHW?
Reviewable status: 0 of 1 approvals obtained (waiting on @lina128, @Linchenn, and @xhcao)
tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts
line 111 at r11 (raw file):
Previously, Linchenn wrote…
> Could you update this comment for NCHW?

Done
tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts
line 59 at r11 (raw file):
Previously, Linchenn wrote…
> Since the `workGroupSize` is fixed as `[8, 8, 1]`, if we get into this branch with filterSize > 64, I think the assert of `filterSize <= workGroupSize` will be triggered?
Yes, you are right. But I have removed this limitation in the latest commit by using a `for` loop to allow some workers to load more data to fill in `mm_Bsub`.
I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse in a work group.
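The strided loading scheme described here can be sketched on the CPU for illustration (the helper is hypothetical, not the actual shader code): worker `localIndex` loads weight elements `localIndex`, `localIndex + workGroupSize`, `localIndex + 2 * workGroupSize`, and so on, so all filter elements are covered even when filterSize exceeds the worker count.

```typescript
// CPU-side sketch of the strided loading scheme: each worker starts at its
// own localIndex and steps by the work group size, so some workers load
// more than one element when filterSize > workGroupSize.
function elementsLoadedBy(localIndex: number, filterSize: number,
                          workGroupSize: number): number[] {
  const loaded: number[] = [];
  for (let wIndex = localIndex; wIndex < filterSize;
       wIndex += workGroupSize) {
    loaded.push(wIndex);
  }
  return loaded;
}

// With 256 workers ([16, 16, 1]) and a 17x17 filter (289 elements),
// workers 0..32 load two elements each and the rest load one,
// covering every filter element exactly once.
```

The union of all workers' loads covers indices 0..filterSize-1 with no duplicates, which is exactly what filling `mm_Bsub` requires.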
tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts
line 145 at r11 (raw file):
> It looks like all the workers will read the same data

In fact, they read different data. `inputRowStart` and `inputColStart` have a 1:1 mapping to `localRow` and `localCol`, and each worker in a work group has a unique `localRow` and `localCol`. All workers in a work group together are expected to load a tile of input `x`. If the tile size of `mm_Asub` were the same as the work group size, each worker would only need to load one element. However, the current tile size `[16][16]` is 4 times the original work group size `[8, 8]`, which means each worker needs to load 4 elements. That's why you see us read 4 times for `mm_Asub`.
As for why we need a larger tile size: we need to make sure each output element can get all the input data it needs, so the right tile size is [workGroupSizeX + filterWidth - 1][workGroupSizeY + filterHeight - 1]. In the first version, I just set it to [16][16] to avoid the boundary checking, which may waste some shared memory. In the latest commit, I changed it to the exact size needed.
The purpose of using shared memory is to reuse data and reduce the number of global memory accesses, since accessing shared memory is much faster than accessing global memory.
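The tile-size arithmetic above can be captured in two small helpers (hypothetical names, for illustration only; stride 1 and dilation 1 are assumed, as in the branch this program handles):

```typescript
// Tile extent needed per dimension so that every output computed by the
// work group has all of its inputs resident in shared memory
// (stride 1, dilation 1 assumed).
function tileSize(workGroupDim: number, filterDim: number): number {
  return workGroupDim + filterDim - 1;
}

// Number of loads each worker must perform to fill the tile.
function loadsPerWorker(tileH: number, tileW: number,
                        workGroupX: number, workGroupY: number): number {
  return Math.ceil((tileH * tileW) / (workGroupX * workGroupY));
}
```

For the first version's fixed `[16][16]` tile and an `[8, 8]` work group, 256 / 64 = 4 loads per worker, matching the "read 4 times" in the shader; the exact size for an `[8, 8]` group with a 3x3 filter would be only `[10][10]`.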
Cool! I guess this will have an obvious performance gain for large filters. Thank you so much, Jiajia, for the detailed explanation!
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @lina128, @Linchenn, @qjia7, and @xhcao)
tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts
line 59 at r11 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
> Yes, you are right. But I have removed this limitation in the latest commit by using a `for` loop to allow some workers to load more data to fill in `mm_Bsub`.
> I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse in a work group.
Great!
@Linchenn I pushed another commit to limit DepthwiseConv2DNCHWSharedProgram only to the nchw format. Although
Great catch, and thank you!
LGTM with a question, thank you.
${
    filterSize < workGroupSize ?
        `if (wIndex < ${filterSize})` :
        `for(; wIndex < ${filterSize}; wIndex = wIndex + ${workGroupSize})`}
Is the `if` statement needed here? Isn't it already a subset of the `for`?
It may be slightly better for perf to use `if` instead of `for`, since it doesn't need to do `wIndex = wIndex + ${workGroupSize}` and another `wIndex < ${filterSize}` check when filterSize is smaller than workGroupSize.
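The trade-off can be sketched with a hypothetical code-generation helper (not the actual tfjs function): when filterSize < workGroupSize, each worker's loop body runs at most once, so emitting a plain `if` guard skips the loop increment and the repeated bounds check.

```typescript
// Hypothetical shader-string generator mirroring the guard choice: emit a
// plain `if` when one pass per worker suffices, otherwise a strided `for`.
function weightLoadGuard(filterSize: number, workGroupSize: number): string {
  return filterSize < workGroupSize ?
      `if (wIndex < ${filterSize})` :
      `for(; wIndex < ${filterSize}; wIndex = wIndex + ${workGroupSize})`;
}
```

For a 3x3 filter (9 elements) with 64 workers this emits `if (wIndex < 9)`; for a 17x17 filter (289 elements) with 256 workers it emits the strided `for`, since each worker may need multiple passes.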
Got it, thank you.