Benchmarks #95

Draft
wants to merge 2 commits into
base: master

htot (Contributor) commented Jun 22, 2022

@aklomp @mayeut
Again a draft. Please ignore the Benchmarks patch; I was too far along to drop it and rebase against HEAD.

The interesting one is codec: add ssse3_atom.

My experience with CRC32C on Silvermont Atom (SLM) processors is that in 64-bit mode certain combinations of instructions incur a penalty (see the Intel manuals), which in some cases makes running in 64-bit mode a net loss. On later Atoms (Goldmont, Airmont) this penalty likely does not occur, but I don't have the hardware to test. Running base64 on SLM shows strange performance regressions, while a Core i7 shows improvement.

So I revived the best ssse3 codec as ssse3_atom and tested it on Intel Edison (dual-core, 500 MHz) in 64-bit and 32-bit mode (because that is easy to do) and on an Intel NUC with a Bay Trail Atom in 64-bit mode (to show the relevance for a mainstream CPU).

Min speed (MB/sec), by direction:

| Processor | decode plain | decode SSSE3 | decode SSSE3_ATOM | encode plain | encode SSSE3 | encode SSSE3_ATOM |
|---|---|---|---|---|---|---|
| Atom E3815 @ 1.46GHz (64b) | 326 | 449 | 565 | 441 | 569 | 556 |
| Edison @ 500MHz (32b) | 40 | 102 | 103 | 67 | 111 | 111 |
| Edison @ 500MHz (64b) | 119 | 164 | 206 | 162 | 209 | 204 |
| i7-10700 CPU @ 2.90GHz | 3997 | 9356 | 4685 | 4387 | 8823 | 7593 |

Improvement by going back to the revived codec in bold, degradation in italic.
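
The relative differences between the SSSE3 and SSSE3_ATOM columns can be worked out with a one-line formula. The helper below is only an illustrative sketch; `percent_change` is a made-up name and not part of the library or of this PR.

```c
#include <assert.h>

/*
 * Relative change of `candidate` over `baseline`, in percent.
 * Positive means the candidate is faster, negative means slower.
 * (Hypothetical helper, for checking the benchmark table by hand.)
 */
static double percent_change(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

For example, `percent_change(449, 565)` (E3815 decode, SSSE3 vs. SSSE3_ATOM) comes out just under 26, and `percent_change(569, 556)` (E3815 encode) at roughly -2.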

We see that on the i7 the latest version is indeed the fastest, and on SLM in 32-bit mode there is no difference. But on SLM in 64-bit mode, SSSE3_ATOM decodes roughly 25% faster (565 vs. 449 MB/s on the E3815, 206 vs. 164 MB/s on Edison), while encode is about 2% slower.
Now, having a fast algorithm has a much more noticeable effect on a slow Atom than on a fast i7... So what do you guys think, should we add a specialized SSSE3 codec for SLM?
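
One way such a specialized codec could be selected at runtime is by CPU model. The sketch below is an assumption for discussion, not what this PR implements: it decodes the display family and model from the EAX value of CPUID leaf 1 (as described in the Intel SDM) and checks it against Silvermont model numbers. The model list (0x37 Bay Trail, 0x4A Merrifield/Edison, 0x4D Avoton) is taken from memory of the Linux kernel's Intel family table and should be verified; `prefer_ssse3_atom` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Display family/model decoded from CPUID leaf 1 EAX. */
struct cpu_sig {
    unsigned family;
    unsigned model;
};

static struct cpu_sig decode_signature(uint32_t eax)
{
    struct cpu_sig sig;
    unsigned base_family = (eax >> 8) & 0xF;

    sig.family = base_family;
    sig.model = (eax >> 4) & 0xF;

    /* The extended model bits only count for base family 6 or 15. */
    if (base_family == 0x6 || base_family == 0xF)
        sig.model |= ((eax >> 16) & 0xF) << 4;

    /* The extended family is added only when the base family is 15. */
    if (base_family == 0xF)
        sig.family += (eax >> 20) & 0xFF;

    return sig;
}

/*
 * Hypothetical policy: prefer the revived codec on Silvermont cores.
 * ASSUMPTION: the model numbers below identify Silvermont; check them
 * against an authoritative list before relying on this.
 */
static bool prefer_ssse3_atom(uint32_t eax)
{
    struct cpu_sig sig = decode_signature(eax);

    if (sig.family != 6)
        return false;

    switch (sig.model) {
    case 0x37: /* Bay Trail */
    case 0x4A: /* Merrifield (Edison) */
    case 0x4D: /* Avoton */
        return true;
    default:
        return false;
    }
}
```

The functions take the raw EAX value so they stay testable without the hardware; on GCC or Clang that value could be obtained with `__get_cpuid(1, ...)` from `<cpuid.h>`. Since the penalty appears only in 64-bit mode, a real selector would presumably also be compiled in only for x86_64 builds.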

htot added 2 commits June 22, 2022 21:36
Signed-off-by: Ferry Toth <ftoth@exalondelft.nl>
Benchmarking on Intel Edison (a Silvermont Atom CPU) in x86_64 mode, starting
from v0.3.0, shows that SSSE3 performance had various ups and downs. Substantial
changes since v0.3.0 were (decode and encode speed in MB/s):
HASH     SSSE3 decode  SSSE3 encode
e12e3cd  165           210
3f3f31c  206           150
67ee3fd  205           205
0a69845  145           205
a5b6739  145           218
6310c1f  157           218
9a0d1b2  158           210
5874921  165           210
Best performance was from 67ee3fd until decode performance regressed
from 205 to 145 MB/s with commit 0a69845. The commit before that
(b6417f3) had the best decode performance with relatively good encode.
Core (i7) processors do not show such large performance changes.
This patch adds the ssse3 codec from b6417f3 as ssse3_atom.

Signed-off-by: Ferry Toth <ftoth@exalondelft.nl>
htot (Contributor Author) commented Jun 23, 2022

@aqrit?

htot marked this pull request as draft June 23, 2022 22:42

aqrit commented Jun 23, 2022

For dec_loop, #46 is probably faster, though it does trade readability for speed.

dec_reshuffle without _mm_madd_epi16 could look like this:

// Pack 16 6-bit values into 12 bytes
// (wasm doesn't have pmaddubsw (but does have pmaddw))
const v128_t shuf = wasm_i8x16_const(2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1);
v = wasm_v128_or(wasm_i16x8_shr_u(v, 6), wasm_i16x8_shl(v, 8));   // 00cccccc|dddddd00|00aaaaaa|bbbbbb00
v = wasm_v128_or(wasm_i32x4_shr_u(v, 18), wasm_i32x4_shl(v, 10)); // dddd0000|aaaaaabb|bbbbcccc|ccdddddd
v = wasm_i8x16_swizzle(v, shuf);                                  //       ..|ccdddddd|bbbbcccc|aaaaaabb

I don't know if it has better latency, but it does have fewer instructions and constants (edit: in comparison to dec_reshuffle in this PR).
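
To make the bit movement in the snippet above easier to follow, here is a scalar model of what happens inside one 32-bit lane. It is a sketch for illustration only: `pack4` and `shift_or_16` are hypothetical names, and the final byte selection stands in for the swizzle, which in the vector version also gathers the 3-byte groups from all four lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1 on one 16-bit lane: (x >> 6) | (x << 8), truncated to 16 bits. */
static uint16_t shift_or_16(uint16_t x)
{
    return (uint16_t)((x >> 6) | (x << 8));
}

/*
 * Pack four 6-bit values (a, b, c, d, one per input byte, in memory
 * order) into the three output bytes aaaaaabb, bbbbcccc, ccdddddd.
 */
static void pack4(const uint8_t in[4], uint8_t out[3])
{
    for (int i = 0; i < 4; i++)
        assert(in[i] < 64); /* inputs must already be 6-bit values */

    /* Little-endian 16-bit lanes: {a, b} and {c, d}. */
    uint16_t lo = (uint16_t)(in[0] | (in[1] << 8));
    uint16_t hi = (uint16_t)(in[2] | (in[3] << 8));

    /* After step 1: lo = 00aaaaaa|bbbbbb00, hi = 00cccccc|dddddd00. */
    lo = shift_or_16(lo);
    hi = shift_or_16(hi);

    /* Step 2 on the 32-bit lane: dddd0000|aaaaaabb|bbbbcccc|ccdddddd. */
    uint32_t v = (uint32_t)lo | ((uint32_t)hi << 16);
    v = (v >> 18) | (v << 10);

    /* Step 3 (the swizzle): emit bytes 2, 1, 0 and drop byte 3. */
    out[0] = (uint8_t)(v >> 16);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)v;
}
```

Feeding it the 6-bit values for "TWFu" (19, 22, 5, 46) yields the bytes 0x4D, 0x61, 0x6E, i.e. "Man".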

htot (Contributor Author) commented Jun 24, 2022

Yeah, this draft PR just revives an older version of the codec that showed better performance than the current one (on SLM). I didn't try to create my own improvement. PR #46 is a bit older; did you benchmark it on Atom at the time?

htot (Contributor Author) commented Jun 24, 2022

@aqrit would you rebase #46 on master? I'd like to run benchmarks on Edison/Atom.
