[Question]: The number of convolution multiplications decreases but the communication cost increases in SPU #678
Comments
You can see that the number of truncations increases a lot. That is expected.
Hi @warpoons. Interesting idea. As pointed out by @fionser, the problem is due to the increased amount of truncations. In the Winograd algorithm, the matmul is separated into several parts (currently, each part incurs additional truncations), which I believe is not friendly to SPU. In my opinion, to maximize the performance of Winograd, you may need to add a backend op for Winograd and implement the algorithm in C++.
Hi @llCurious @fionser! Thanks for your responses! As pointed out by @fionser, when doing matmul the number of truncations should be quadratic in the matrix size. In Winograd, the input feature map is separated into several overlapping parts (called tiles), and element-wise matmuls are performed on each tile separately. Reasonably, there will be more EWMMs across all the tiles than in the standard conv. Is this understanding correct? I have another question: is there a method to estimate the theoretical communication cost of standard conv and Winograd conv (considering only the EWMMs in Winograd, with the transformation of weights done offline) in SPU? Thanks!
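On the estimation question: one crude back-of-envelope (a hypothetical model, not SPU's actual cost accounting) is to count the masked ring elements opened by Beaver-style multiplication. Under that model, with the weight transforms done offline, GEMM-based Winograd can in principle open fewer elements than an im2col GEMM; truncation and offline triple generation are ignored here, and the helper name is made up for illustration:

```python
FIELD_BYTES = 8  # FM64: one ring element is 8 bytes

def matmul_open_bytes(m, k, n, parties=2):
    # Hypothetical estimator: with a Beaver-style matrix triple, one
    # (m, k) @ (k, n) matmul opens the masked operands X - A and Y - B,
    # i.e. m*k + k*n ring elements sent by each party (online phase only;
    # truncations and offline triple generation are not counted).
    opened = m * k + k * n
    return opened * FIELD_BYTES * parties

# Standard conv as a single im2col GEMM: (32*32, 3*3*3) @ (3*3*3, 64)
std_bytes = matmul_open_bytes(32 * 32, 27, 64)    # 470,016 bytes

# GEMM-based Winograd F(2x2, 3x3): 16 GEMMs of (256, 3) @ (3, 64);
# the B/G transforms are linear maps by public matrices, hence comm-free
wino_bytes = 16 * matmul_open_bytes(256, 3, 64)   # 245,760 bytes

print(std_bytes, wino_bytes)
```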
Hi @llCurious @fionser! This week I have further tested Winograd convolution for reducing multiplications in SPU. As I previously described in this issue, Winograd converts a standard conv into EWMMs with fewer multiplications, at the cost of low parallelism in the EWMM. There is another way to convert Winograd's EWMM into general matrix multiplication (GEMM) by transposing the Winograd weights and inputs. As suggested in the NeurIPS 2023 paper CoPriv: Network/protocol co-optimization for communication-efficient private inference, the communication increases after applying Winograd's multiplication reduction alone, so to reach the expected comm improvement we should consider the EWMM->GEMM conversion. Hence, I have further tested the GEMM-based Winograd to check whether the expected 2.25x comm reduction appears, but the answer is NO. The profiling setting was ("SEMI2K", "FM64").
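For reference, a minimal NumPy sketch of the EWMM->GEMM conversion described above, using the first-layer shapes (32x32 map, 3->64 channels, F(2x2, 3x3), hence 256 tiles and 16 transform points); the shapes here are illustrative, not the original test code:

```python
import numpy as np

T = 16 * 16            # 256 tiles, each producing a 2x2 output patch
Ci, Co = 3, 64         # input/output channels of the first layer

# Winograd-domain tensors (after the B/G transforms):
V = np.random.randn(16, T, Ci)    # transformed input tiles, one slice per transform point
U = np.random.randn(16, Ci, Co)   # transformed filters (computable offline)

# EWMM formulation: multiply-accumulate over Ci at every transform point/tile
M_ewmm = np.einsum('ptc,pck->ptk', V, U)

# GEMM formulation: 16 independent (T x Ci) @ (Ci x Co) matmuls
M_gemm = np.stack([V[p] @ U[p] for p in range(16)])

assert np.allclose(M_ewmm, M_gemm)
```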
We observe that the comm is reduced compared to the EWMM-based Winograd but is still far from the expected improvement. Another issue is that using jnp.integer still shows trunc_a and its comm in the profiling, and I cannot figure out the reason behind it. Also, jnp.dtype = jnp.float32 yields f_tensordot with comm, while jnp.dtype = jnp.integer yields i_tensordot without comm. To make it clear, here is my test script:
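(The original script is not preserved in this thread; the following is a hedged reconstruction, assuming the spu.utils.simulation testing helpers — Simulator, sim_jax — and the RuntimeConfig profiling flag. Exact module, flag, and enum names may differ across SPU versions.)

```python
import jax
import jax.numpy as jnp
import spu.spu_pb2 as spu_pb2
import spu.utils.simulation as spsim

def conv(x):
    # ★★★ all-ones weights defined inside the model, since the specific
    # values should not change the communication profile
    w = jnp.ones((64, 3, 3, 3), dtype=jnp.float32)  # OIHW layout
    return jax.lax.conv(x, w, window_strides=(1, 1), padding="SAME")

config = spu_pb2.RuntimeConfig(
    protocol=spu_pb2.ProtocolKind.SEMI2K,
    field=spu_pb2.FieldType.FM64,
)
config.enable_pphlo_profile = True  # emit per-op stats (e.g. trunc_a, comm)
sim = spsim.Simulator(2, config)

x = jnp.ones((1, 3, 32, 32), dtype=jnp.float32)  # NCHW CIFAR-10 input
y = spsim.sim_jax(sim, conv)(x)
print(y.shape)  # (1, 64, 32, 32)
```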
Note that the line marked with ★★★ is used to initialize all-ones weights inside the model definition, since I think the specific parameter values will not significantly affect the comm results. Sorry for taking your time. Thanks!
Issue Type
Performance
Modules Involved
SPU runtime
Have you reproduced the bug with SPU HEAD?
Yes
Have you searched existing issues?
Yes
SPU Version
spu 0.9.0.dev20240311
OS Platform and Distribution
Ubuntu 18.04.6 LTS (on WSL)
Python Version
3.10
Compiler Version
GCC 11.3.0
Current Behavior?
Not a bug, just a question: I have tested the comm. cost of evaluating the first conv layer (in isolation)
Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
of ResNet18 on CIFAR10. It cost 759,296 bytes of comm. and 0.015497988 s latency. Since conv is multiplication-intensive, one way to reduce the comm./latency cost is to reduce the number of multiplications using the Winograd algorithm. Winograd uses pre-defined matrices to transform the weight and input into their Winograd-domain counterparts and performs element-wise matrix multiplication (EWMM) between the transformed weight and input. The output of the EWMM, after an additional transformation, is equivalent to that of the standard conv. On average, the number of multiplications can be reduced by 2.25 times using Winograd without any accuracy loss.
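For concreteness, here is a minimal NumPy sketch of one F(2x2, 3x3) Winograd tile with the standard Lavin & Gray transform matrices: the EWMM stage uses 16 multiplications to produce a 2x2 output patch that a direct 3x3 conv would compute with 36 (hence the 36/16 = 2.25x reduction):

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray, 2016)
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.random.randn(4, 4)   # one 4x4 input tile
g = np.random.randn(3, 3)   # one 3x3 filter

U = G @ g @ G.T             # transformed filter (computable offline)
V = BT @ d @ BT.T           # transformed input tile
M = U * V                   # EWMM: 16 multiplications (16 fixed-point truncations)
Y = AT @ M @ AT.T           # 2x2 output of the valid 3x3 conv on this tile

# reference: direct 3x3 cross-correlation on the same tile
ref = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```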
I have tested the comm./latency cost of the standard and Winograd conv. But curiously, the cost of the Winograd conv is significantly higher: 6,291,456 bytes of comm. and 0.0487127 s latency.
Theoretically, for the first layer of ResNet18 on CIFAR10, the standard conv has 1,769,472 multiplications and the Winograd conv has 786,432 multiplications (a 2.25x reduction), yet the comm. increases by 8.2859 times.
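The multiplication counts can be checked with a few lines of arithmetic (the 8.2859x figure is 6,291,456 / 759,296):

```python
H = W = 32; Ci, Co, K = 3, 64, 3

std = H * W * Co * Ci * K * K   # 1,769,472 mults for the direct 3x3 conv
tiles = (H // 2) * (W // 2)     # 256 tiles, each yielding a 2x2 output
wino = tiles * 16 * Ci * Co     # 786,432 mults for F(2x2, 3x3)

print(std, wino, std / wino)    # 1769472 786432 2.25
```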
May I ask if you understand the underlying reasons, or if there are some potential convolution-specific optimizations that I am not aware of?
Thanks a lot.
Standalone code to reproduce the issue
Relevant log output