JSON Protocol String & Base64 optimization for ARM #632


@lukalt lukalt commented Dec 19, 2024

This PR optimizes string encoding/decoding and Base64 encoding/decoding in JSONProtocol and SimpleJSONProtocol.

The general performance problem (on both x86 and ARM) was that the compiler generates only single-byte load and store instructions for folly IOBuf writes and reads, which limits the achievable throughput. Our optimizations target large inputs. Evaluation was performed with ProtocolBench from this project.

String encoding:
Vectorizing this loop is difficult because each character produces one or two output characters, depending on whether it needs to be escaped.
We optimized for the common case where characters that need escaping occur infrequently. We unroll the loop by a factor of 16 and use ARM NEON intrinsics to load 16 consecutive chars in each iteration. We then check with vector comparisons whether any character requires escaping. If no escaping is required, we immediately copy the 16-byte chunk to the output. If at least one character in an 8-byte chunk requires escaping, we escape every char in that chunk char by char.

  • Achieved speedup: 14.8x for JSONProtocol_write_BigString
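The fast path above can be sketched in portable scalar C++ (this is not the fbthrift code; `encodeJsonString`, `needsEscape`, and `escapeChar` are illustrative names, and the NEON version replaces the inner 16-byte scan with a vector load plus vector comparisons):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// JSON requires escaping control characters, '"' and '\'.
static inline bool needsEscape(uint8_t c) {
  return c < 0x20 || c == '"' || c == '\\';
}

// Minimal escaper for illustration only.
static void escapeChar(uint8_t c, std::string& out) {
  switch (c) {
    case '"':  out += "\\\""; break;
    case '\\': out += "\\\\"; break;
    case '\n': out += "\\n";  break;
    case '\t': out += "\\t";  break;
    case '\r': out += "\\r";  break;
    default: {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "\\u%04x", c);
      out += buf;
    }
  }
}

std::string encodeJsonString(const uint8_t* in, size_t len) {
  std::string out;
  size_t i = 0;
  // Fast path: scan 16-byte chunks; if no byte needs escaping,
  // copy the whole chunk at once.
  while (i + 16 <= len) {
    bool clean = true;
    for (size_t j = 0; j < 16; ++j) {
      if (needsEscape(in[i + j])) { clean = false; break; }
    }
    if (clean) {
      out.append(reinterpret_cast<const char*>(in + i), 16);
    } else {
      // Slow path: escape char by char within the chunk.
      for (size_t j = 0; j < 16; ++j) {
        uint8_t c = in[i + j];
        if (needsEscape(c)) escapeChar(c, out);
        else out.push_back(static_cast<char>(c));
      }
    }
    i += 16;
  }
  // Scalar tail for the remaining bytes.
  for (; i < len; ++i) {
    uint8_t c = in[i];
    if (needsEscape(c)) escapeChar(c, out);
    else out.push_back(static_cast<char>(c));
  }
  return out;
}
```

The clean-chunk check is what makes the approach pay off: for typical strings the slow path is almost never taken, so nearly the entire input is moved in 16-byte copies.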

String decoding:

  • Unroll the loop by a factor of 16 and load 16 bytes at once with ARM NEON intrinsics. Check the vector for terminating or escaped characters; if none are present, immediately forward the vector to the output buffer. This optimization exploits the fact that escaped characters occur rarely.
  • In the baseline version, an occurrence of a UTF-8 character caused the whole parse to abort: the input was copied to a buffer, and that buffer was passed into folly’s JSON string decoding function. This incurred a massive overhead, although it was expected to occur very rarely. We implemented decoding of all these escape sequences directly in the function, following the specification: https://datatracker.ietf.org/doc/html/rfc7159#section-7.
  • Achieved speedup: 5.6x for JSONProtocol_read_BigString
  • This currently incurs a ~25% performance penalty for small strings, as the length of the string to read is unknown. We may be able to optimize this further.
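The inline escape-sequence decoding mentioned above can be sketched as follows (not the fbthrift code; `decodeUnicodeEscape` and `hexVal` are illustrative names). It converts a JSON `\uXXXX` escape directly to UTF-8 as RFC 7159 section 7 specifies; surrogate pairs (`\uD800`–`\uDFFF`) would need to be combined into one code point and are omitted here for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

static int hexVal(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  throw std::runtime_error("invalid hex digit");
}

// `p` points at the four hex digits after "\u"; appends UTF-8 to `out`.
void decodeUnicodeEscape(const char* p, std::string& out) {
  uint32_t cp = 0;
  for (int i = 0; i < 4; ++i) cp = (cp << 4) | hexVal(p[i]);
  if (cp < 0x80) {
    // 1-byte sequence: ASCII.
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2-byte sequence: 110xxxxx 10xxxxxx.
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}
```

Handling this in place avoids the buffer copy and the round trip through folly's generic JSON string decoder.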

Base64 encoding:

  • No vectorization implemented yet; we will also follow the approach described in the paper (see Base64 decoding).
  • Unrolled the loop and inlined Base64 encoding, processing 6 input bytes per iteration.
  • Inlined remainder handling
  • 6.6x faster for JSONProtocol_write_BigBinary
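A scalar sketch of the unrolled encoder (illustrative names, not the fbthrift code): each main-loop iteration consumes 6 input bytes and emits 8 output characters (two 3-byte-to-4-char Base64 groups), with the remainder handled inline:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes one 3-byte group into 4 Base64 characters.
static inline void encodeGroup(const uint8_t* p, char* o) {
  uint32_t v = (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
  o[0] = kB64[(v >> 18) & 63];
  o[1] = kB64[(v >> 12) & 63];
  o[2] = kB64[(v >> 6) & 63];
  o[3] = kB64[v & 63];
}

std::string base64Encode(const uint8_t* in, size_t len) {
  std::string out;
  out.reserve(((len + 2) / 3) * 4);
  size_t i = 0;
  char buf[8];
  // Unrolled main loop: 6 bytes in, 8 chars out per iteration.
  while (i + 6 <= len) {
    encodeGroup(in + i, buf);
    encodeGroup(in + i + 3, buf + 4);
    out.append(buf, 8);
    i += 6;
  }
  // Inlined remainder: any leftover full 3-byte group, then padding.
  while (i + 3 <= len) {
    encodeGroup(in + i, buf);
    out.append(buf, 4);
    i += 3;
  }
  size_t rem = len - i;
  if (rem == 1) {
    uint32_t v = uint32_t(in[i]) << 16;
    out.push_back(kB64[(v >> 18) & 63]);
    out.push_back(kB64[(v >> 12) & 63]);
    out += "==";
  } else if (rem == 2) {
    uint32_t v = (uint32_t(in[i]) << 16) | (uint32_t(in[i + 1]) << 8);
    out.push_back(kB64[(v >> 18) & 63]);
    out.push_back(kB64[(v >> 12) & 63]);
    out.push_back(kB64[(v >> 6) & 63]);
    out.push_back('=');
  }
  return out;
}
```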

Base64 decoding:

  • Base64 was previously read into a dynamic string and then copied into an IOBuf; we now write directly to the IOBuf.
  • Vectorize parsing by unrolling the loop to process 16x4 characters per iteration. We followed the algorithm described in https://arxiv.org/pdf/1910.05109 for this and used parts of their reference implementation.
  • Unroll and manually inline remainder
  • Speedup is 11x for JSONProtocol_read_BigBinary and JSONProtocol_read_SmallBinary
  • This part of the code is licensed under the BSD-2-Clause license
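For reference, the scalar decode that the vectorized version accelerates looks roughly like this (illustrative names, not the fbthrift code; the PR's implementation processes 16x4 characters per iteration following the cited paper, and writes straight into the IOBuf rather than a temporary buffer):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Maps a Base64 character to its 6-bit value, or -1 for '=' / invalid.
static int8_t b64Val(uint8_t c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// Accumulates 6 bits per input character and emits a byte whenever
// 8 or more bits are buffered; stops at padding.
std::vector<uint8_t> base64Decode(const std::string& in) {
  std::vector<uint8_t> out;
  uint32_t acc = 0;
  int bits = 0;
  for (uint8_t c : in) {
    int8_t v = b64Val(c);
    if (v < 0) break; // '=' padding or end of data
    acc = (acc << 6) | uint32_t(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(uint8_t((acc >> bits) & 0xFF));
    }
  }
  return out;
}
```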

Performance data was collected with fbthrift and folly compiled with the flag "-mcpu=neoverse-v2+crypto+sve2-sm4+sve2-aes+sve2-sha3".

This PR is contributed by NVIDIA.
