decomp: add prefetch for matched seq on aarch64 #3164

Merged (1 commit) on Jul 29, 2022

@JunHe77 JunHe77 commented Jun 18, 2022

`match` is used for the sequence copy that follows. It is
only updated when extDict is needed, which is a
low-probability case, so it can be prefetched to
reduce cache misses.
Benchmarks on various Arm platforms showed an
uplift of 1% to 14% with gcc-11/clang-14.

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: If201af4799d2455d74c79f8387404439d7f684ae
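
To illustrate the idea in the description, here is a minimal sketch of prefetching the match address before the copy that reads it. It is not the code from this PR: `SKETCH_PREFETCH`, `sketch_seq`, and `sketch_exec_sequence` are hypothetical names, and the sketch assumes the match always lies inside the current output buffer (no extDict). `__builtin_prefetch` is the GCC/Clang builtin; on other targets the macro expands to a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macro: read prefetch (rw = 0) with high temporal locality (3),
 * enabled only on aarch64 with GCC or Clang; a no-op everywhere else. */
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#  define SKETCH_PREFETCH(ptr) __builtin_prefetch((ptr), 0, 3)
#else
#  define SKETCH_PREFETCH(ptr) ((void)(ptr))
#endif

/* Hypothetical, simplified sequence: literals first, then a back-reference. */
typedef struct {
    size_t lit_length;    /* number of literal bytes to copy */
    size_t match_length;  /* number of bytes to copy from earlier output */
    size_t offset;        /* distance back from the end of the literals */
} sketch_seq;

/* Executes one sequence at `op` and returns the new output position.
 * Assumes the match is fully inside the output buffer (offset > 0 and
 * offset <= bytes produced so far plus lit_length). */
static unsigned char *sketch_exec_sequence(unsigned char *op,
                                           const unsigned char **lit_ptr,
                                           sketch_seq seq)
{
    unsigned char *const lit_end = op + seq.lit_length;
    const unsigned char *match = lit_end - seq.offset;
    size_t i;

    assert(seq.offset > 0);

    /* The match address is known before the literal copy starts, so the
     * hint is issued early and the literal copy overlaps the memory fetch. */
    SKETCH_PREFETCH(match);

    memcpy(op, *lit_ptr, seq.lit_length);
    *lit_ptr += seq.lit_length;

    /* Byte-wise copy so that overlapping matches (offset < match_length)
     * repeat the pattern instead of reading stale data. */
    for (i = 0; i < seq.match_length; i++) {
        lit_end[i] = match[i];
    }
    return lit_end + seq.match_length;
}
```

The prefetch is only a hint and never changes the output, so the same function produces identical bytes on every platform; only the timing differs.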

JunHe77 commented Jun 18, 2022

Benchmark changes on Arm N1/A72/A57 with gcc-11:

| file | N1/L2 | N1/L8 | N1/L15 | A72/L2 | A72/L8 | A72/L15 | A57/L2 | A57/L8 | A57/L15 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| dickens | 3.144% | 3.869% | 3.967% | 4.460% | 4.389% | 11.991% | 5.341% | 8.165% | 12.130% |
| mr | 1.192% | 3.658% | 3.606% | 2.977% | 6.147% | 13.810% | 1.826% | 8.634% | 12.909% |
| nci | 0.737% | 1.070% | 1.811% | -0.812% | 1.065% | 3.879% | 0.164% | 0.240% | 2.399% |
| ooffice | 0.134% | 2.445% | 2.389% | -0.726% | 0.080% | 4.424% | -0.306% | 3.859% | 5.377% |
| osdb | 1.251% | 2.669% | 3.856% | -0.125% | 1.912% | 6.158% | 0.443% | 3.182% | 5.333% |
| reymont | 2.639% | 2.904% | 3.641% | 4.097% | 3.379% | 7.885% | 3.649% | 5.660% | 9.412% |
| samba | 1.129% | 1.291% | 1.393% | 0.589% | 1.950% | 2.650% | 0.758% | 1.750% | 1.401% |
| sao | 0.728% | 2.475% | 2.633% | -0.059% | 2.364% | 6.211% | 0.633% | 4.288% | 6.761% |
| webster | 2.138% | 2.952% | 2.483% | 2.587% | 5.809% | 8.609% | 3.530% | 5.328% | 7.289% |
| xml | 0.604% | 1.010% | 0.934% | -0.099% | -0.110% | 0.394% | 0.497% | 0.607% | 1.045% |
| x-ray | 0.102% | 4.541% | 4.543% | -0.159% | 6.198% | 12.980% | 0.102% | 11.965% | 14.064% |

JunHe77 commented Jun 18, 2022

Benchmark changes on Arm N1/A72/A57 with clang-14:

| file | N1/L2 | N1/L8 | N1/L15 | A72/L2 | A72/L8 | A72/L15 | A57/L2 | A57/L8 | A57/L15 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| dickens | 2.755% | 2.502% | 2.235% | 5.207% | 6.836% | 7.938% | 5.987% | 6.615% | 6.607% |
| mozilla | 0.741% | 1.053% | 1.032% | 1.562% | 2.610% | 2.648% | 1.066% | 2.305% | 2.180% |
| mr | 2.270% | 2.156% | 2.110% | 3.870% | 4.890% | 7.011% | 3.409% | 6.377% | 6.168% |
| nci | 0.841% | 1.464% | 2.739% | 1.498% | 2.558% | 3.520% | 1.122% | 1.473% | 2.337% |
| ooffice | 1.103% | 1.408% | 1.704% | 1.451% | 3.761% | 3.848% | 1.181% | 3.578% | 3.808% |
| osdb | 1.970% | 2.545% | 2.973% | 4.665% | 4.347% | 5.006% | 2.029% | 3.241% | 2.862% |
| reymont | 2.379% | 2.250% | 2.490% | 3.589% | 5.267% | 5.028% | 4.517% | 4.581% | 4.335% |
| samba | 1.738% | 1.628% | 1.501% | 4.137% | 3.137% | 4.069% | 2.723% | 2.799% | 3.032% |
| sao | 0.445% | 1.371% | 1.314% | 3.716% | 4.120% | 5.453% | 1.299% | 4.017% | 4.459% |
| webster | 2.782% | 2.242% | 1.893% | 5.957% | 5.264% | 6.333% | 4.216% | 4.408% | 3.738% |
| xml | 1.512% | 1.992% | 1.557% | 2.649% | 1.462% | 2.174% | 2.044% | 2.395% | 2.080% |
| x-ray | 0.363% | 2.357% | 2.477% | 0.705% | 7.155% | 6.502% | 0.721% | 7.207% | 6.986% |

JunHe77 commented Jun 18, 2022

This change may cause a regression on x86. My results on a Xeon 5218:

| file | gcc-11/L2 | gcc-11/L8 | gcc-11/L15 | clang-14/L2 | clang-14/L8 | clang-14/L15 |
|---|---:|---:|---:|---:|---:|---:|
| dickens | -2.672% | -3.085% | -3.879% | -0.514% | -1.577% | -3.390% |
| mozilla | 0.813% | 0.102% | -0.066% | -0.349% | -0.721% | -1.044% |
| mr | -1.689% | -3.238% | -5.032% | -0.313% | -1.618% | -3.501% |
| nci | 1.174% | 0.675% | 1.035% | -0.130% | -0.362% | -0.441% |
| ooffice | -0.301% | -1.163% | -1.887% | -0.427% | -1.106% | -1.597% |
| osdb | 0.180% | -1.443% | 0.655% | -0.138% | -1.137% | 1.217% |
| reymont | -1.821% | -0.914% | -0.781% | -0.493% | -1.264% | -1.666% |
| samba | -0.278% | -0.388% | -0.472% | -0.195% | -0.410% | -0.463% |
| sao | -0.037% | -2.148% | -2.747% | -0.184% | -1.062% | -1.971% |
| webster | -1.548% | -1.107% | -0.989% | -0.576% | -1.172% | -1.560% |
| xml | 0.055% | -0.056% | 0.026% | -0.036% | -0.311% | -0.431% |
| x-ray | -0.155% | -3.566% | -4.536% | -0.028% | -2.175% | -3.596% |

@Cyan4973

I believe this is a rather good policy.
It's surprising that it doesn't work for x64 though (I assume you meant x64 when you stated "x86").

There are a few quirks, such as overlapped prefetch orders when decoding in long mode, or useless prefetching when extDict is present (which might not be captured by benchmark). But these are refinements that can be worked on later on.
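
One of the refinements mentioned here, skipping the prefetch when the match would come from extDict, could be approached roughly as in the sketch below. This is only an assumption about how such a check might look, not code from zstd: `SKETCH_PREFETCH`, `sketch_match_in_prefix`, and `sketch_maybe_prefetch` are hypothetical names, and `prefix_start` stands for the start of the contiguous output region. Whether the extra branch pays for itself would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: read prefetch on aarch64 with GCC or Clang,
 * a no-op on every other target. */
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#  define SKETCH_PREFETCH(ptr) __builtin_prefetch((ptr), 0, 3)
#else
#  define SKETCH_PREFETCH(ptr) ((void)(ptr))
#endif

/* Returns 1 when a match `offset` bytes behind `lit_end` starts inside the
 * contiguous region beginning at `prefix_start`, 0 when it would start
 * before it (the case handled through extDict). Comparing distances avoids
 * forming a pointer that lies before the buffer. */
static int sketch_match_in_prefix(const unsigned char *prefix_start,
                                  const unsigned char *lit_end,
                                  size_t offset)
{
    assert(lit_end >= prefix_start);
    return offset <= (size_t)(lit_end - prefix_start);
}

/* Issues the prefetch hint only for in-prefix matches.
 * Returns 1 if a hint was issued (or would be, on non-aarch64 targets). */
static int sketch_maybe_prefetch(const unsigned char *prefix_start,
                                 const unsigned char *lit_end,
                                 size_t offset)
{
    if (sketch_match_in_prefix(prefix_start, lit_end, offset)) {
        SKETCH_PREFETCH(lit_end - offset);
        return 1;
    }
    return 0; /* match lives in extDict; this address would be useless */
}
```

With 4 bytes of output already in the prefix, an offset of 4 still points at the first byte of the prefix, while an offset of 5 falls outside it and is skipped.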

@terrelln terrelln merged commit 558cf20 into facebook:dev Jul 29, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
@JunHe77 JunHe77 deleted the prefetch branch March 12, 2023 07:19