Skip to content

Conversation

indutny
Copy link
Member

@indutny indutny commented Apr 17, 2025

This makes wasm version of llhttp around 5-6% faster on real-world payloads. The produced code looks roughly like this:

        /* Load input */
        input = wasm_v128_load(p);
        /* Find first character that does not match `ranges` */
        single = wasm_i8x16_eq(input, wasm_u8x16_const_splat(0x9));
        mask = single;
        single = wasm_v128_and(
          wasm_i8x16_ge(input, wasm_u8x16_const_splat(' ')),
          wasm_i8x16_le(input, wasm_u8x16_const_splat('~'))
        );
        mask = wasm_v128_or(mask, single);
        single = wasm_v128_or(
          wasm_i8x16_ge(input, wasm_u8x16_const_splat(0x80)),
          wasm_i8x16_le(input, wasm_u8x16_const_splat(0xff))
        );
        mask = wasm_v128_or(mask, single);
        match_len = __builtin_ctz(
          ~wasm_i8x16_bitmask(mask)
        );

It is conceptually similar to SSE vectorization that we already support except that we can't multiple comparisons at once and have to check ranges individually.

See previous attempt #72
Benchmark: https://gist.github.com/indutny/d18d56fe3c7254d888a79eca98f39145

cc @nodejs/llhttp @mcollina

@indutny indutny force-pushed the feature/wasm-simd-2 branch from 2cf2e2a to 7cc6104 Compare April 17, 2025 18:50
out.push(`if (${ctx.endPosArg()} - ${ctx.posArg()} >= 16) {`);
out.push(' __m128i ranges;');
out.push(' __m128i input;');
out.push(' int avail;');
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

"mocha": "^9.2.2",
"ts-node": "^9.0.0",
"typescript": "^4.0.3"
"typescript": "^5.0.3"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package wasn't compiling without typescript update.

const hex: string[] = [];
for (let j = i; j < limit; j++) {
const value = buffer[j];
assert(value !== undefined);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With typescript version change - new errors popped up.

@indutny indutny force-pushed the feature/wasm-simd-2 branch 2 times, most recently from 190bc50 to 9e9e699 Compare April 17, 2025 21:43
Copy link
Contributor

@ShogunPanda ShogunPanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
That's just amazing!

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, this is amazing!!!!

@mcollina
Copy link
Member

@indutny go ahead and release! :D

@indutny
Copy link
Member Author

indutny commented Apr 18, 2025

Let me take a deeper look at assembly before merging, but if I don't come back let's merge in two days!

@indutny
Copy link
Member Author

indutny commented Apr 18, 2025

For posterity SIMD assembly on ARM64:

cmp w10, #0x10 ; (16)
b.ge #+0x20 ; (addr 0xa8cde2a2f8c)
ldr q4, [x4, x6]
add v5.16b, v4.16b, v0.16b
cmhi v5.16b, v1.16b, v5.16b
cmeq v6.16b, v4.16b, v2.16b
mvn v6.16b, v6.16b
and v5.16b, v6.16b, v5.16b
cmgt v4.16b, v4.16b, v3.16b
and v4.16b, v4.16b, v5.16b
movz x16, #0x4080
movk x16, #0x1020, lsl #16
movk x16, #0x408, lsl #32
movk x16, #0x102, lsl #48
dup v31.2d, x16
ushr v30.16b, v4.16b, #7
pmull2 v4.1q, v31.2d, v30.2d
pmull v30.1q, v31.1d, v30.1d
trn2 v30.8b, v30.8b, v4.8b
umov w6, v30.h[3]
orr w6, w6, #0x10000
rbit w6, w6
clz w6, w6
add w8, w6, w8
cmp w6, #0x10 ; (16)
b.ne #+0xa0 ; (addr 0xa8cde2a3088)

llvm-mca -iterations=1

Iterations:        1
Instructions:      26
Total Cycles:      35
Total uOps:        29

Dispatch Width:    6
uOps Per Cycle:    0.83
IPC:               0.74
Block RThroughput: 4.8

No SIMD

ldrb w6, [x4, x8]
add w6, w6, w5
ldrb w6, [x4, x6]
cmp w6, #0x1 ; (1)
b.ne #+0x7c ; (addr 0x31dfce0273a8)
add w8, w8, #0x1 ; (1)
b #+0x68 ; (addr 0x31dfce02739c)

llvm-mca -iterations=16

Iterations:        16
Instructions:      112
Total Cycles:      48
Total uOps:        144

Dispatch Width:    6
uOps Per Cycle:    3.00
IPC:               2.33
Block RThroughput: 2.0

This makes wasm version of llhttp around 5-6% faster on real-world
payloads. The produced code looks roughly like this:

    /* Load input */
    input = wasm_v128_load(p);
    /* Find first character that does not match `ranges` */
    single = wasm_i8x16_eq(input, wasm_u8x16_const_splat(0x9));
    mask = single;
    single = wasm_v128_and(
      wasm_i8x16_ge(input, wasm_u8x16_const_splat(' ')),
      wasm_i8x16_le(input, wasm_u8x16_const_splat('~'))
    );
    mask = wasm_v128_or(mask, single);
    single = wasm_v128_or(
      wasm_i8x16_ge(input, wasm_u8x16_const_splat(0x80)),
      wasm_i8x16_le(input, wasm_u8x16_const_splat(0xff))
    );
    mask = wasm_v128_or(mask, single);
    match_len = __builtin_ctz(
      ~wasm_i8x16_bitmask(mask)
    );

It is conceptually similar to SSE vectorization that we already support
except that we can't multiple comparisons at once and have to check
ranges individually.
@indutny indutny force-pushed the feature/wasm-simd-2 branch from 9e9e699 to 8f30ff9 Compare April 18, 2025 19:11
@indutny
Copy link
Member Author

indutny commented Apr 18, 2025

Found slightly more optimal code path.

@indutny
Copy link
Member Author

indutny commented Apr 18, 2025

I'm suddenly feeling a bit apprehensive about merging this since I currently have no way of running llhttp test suite on wasm file. Might need to look into it first.

@indutny
Copy link
Member Author

indutny commented Apr 19, 2025

Okay, with nodejs/llparse-test-fixture#22 I managed to test this PR and confirm that it passes all existing llhttp tests. Merging.

@indutny indutny merged commit 4d89e77 into nodejs:main Apr 19, 2025
@indutny indutny deleted the feature/wasm-simd-2 branch April 19, 2025 17:07
@indutny
Copy link
Member Author

indutny commented Apr 19, 2025

@mcollina llparse@7.2.0 was released. I believe llhttp has ^7.1.1 dependency spec so it should pick it up without changes. Please expect no performance difference on Mac CPUs (for reasons yet unknown), there should be a perf boost on x86_64, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants