JSON Protocol String & Base64 optimization for ARM #632


@lukalt lukalt commented Dec 19, 2024

This PR optimizes string encoding/decoding and Base64 encoding/decoding in JSONProtocol and SimpleJSONProtocol.

The general performance problem (on both x86 and ARM) was that the compiler generates only single-byte load and store instructions for folly IOBuf writes and reads, which limits the achievable throughput. Our optimizations target large inputs. Evaluation was performed with ProtocolBench from this project.

String encoding:
Vectorizing this loop is difficult because each character produces one or two output characters, depending on whether it needs to be escaped.
We optimized for the common case where characters that need escaping occur infrequently. We unroll the loop by a factor of 16 and use ARM NEON intrinsics to load 16 consecutive chars in each iteration. We then check with vector comparisons whether any character requires escaping. If no escaping is required, we immediately copy the 16-byte chunk to the output. If at least one character in an 8-byte chunk requires escaping, we escape every char in that chunk char by char.

  • Achieved speedup: 14.8x for JSONProtocol_write_BigString
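The fast path above can be sketched in portable scalar C++ (this is not the fbthrift code; `encodeJsonString`, `needsEscape`, and `escapeChar` are illustrative names, and the NEON version replaces the inner 16-byte scan with a vector load plus vector comparisons):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// JSON requires escaping control characters, '"' and '\'.
static inline bool needsEscape(uint8_t c) {
  return c < 0x20 || c == '"' || c == '\\';
}

// Minimal escaper for illustration only.
static void escapeChar(uint8_t c, std::string& out) {
  switch (c) {
    case '"':  out += "\\\""; break;
    case '\\': out += "\\\\"; break;
    case '\n': out += "\\n";  break;
    case '\t': out += "\\t";  break;
    case '\r': out += "\\r";  break;
    default: {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "\\u%04x", c);
      out += buf;
    }
  }
}

std::string encodeJsonString(const uint8_t* in, size_t len) {
  std::string out;
  size_t i = 0;
  // Fast path: scan 16-byte chunks; if no byte needs escaping,
  // copy the whole chunk at once.
  while (i + 16 <= len) {
    bool clean = true;
    for (size_t j = 0; j < 16; ++j) {
      if (needsEscape(in[i + j])) { clean = false; break; }
    }
    if (clean) {
      out.append(reinterpret_cast<const char*>(in + i), 16);
    } else {
      // Slow path: escape char by char within the chunk.
      for (size_t j = 0; j < 16; ++j) {
        uint8_t c = in[i + j];
        if (needsEscape(c)) escapeChar(c, out);
        else out.push_back(static_cast<char>(c));
      }
    }
    i += 16;
  }
  // Scalar tail for the remaining bytes.
  for (; i < len; ++i) {
    uint8_t c = in[i];
    if (needsEscape(c)) escapeChar(c, out);
    else out.push_back(static_cast<char>(c));
  }
  return out;
}
```

The clean-chunk check is what makes the approach pay off: for typical strings the slow path is almost never taken, so nearly the entire input is moved in 16-byte copies.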

String decoding:

  • Unroll the loop by a factor of 16 and load 16 bytes at once with ARM NEON intrinsics. Check the vector for terminating or escaped characters; if none are present, immediately forward the vector to the output buffer. This optimization exploits the fact that escaped characters occur rarely.
  • In the baseline version, an occurrence of a UTF-8 character caused the whole parse to abort: the input was copied to a buffer, and that buffer was passed into folly’s JSON string decoding function. This incurred a massive overhead, although it was expected to occur very rarely. We implemented decoding of all these escape sequences directly in the function, following the specification: https://datatracker.ietf.org/doc/html/rfc7159#section-7.
  • Achieved speedup: 5.6x for JSONProtocol_read_BigString
  • This currently incurs a ~25% performance penalty for small strings, as the length of the string to read is unknown. We may be able to optimize this further.
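The inline escape-sequence decoding mentioned above can be sketched as follows (not the fbthrift code; `decodeUnicodeEscape` and `hexVal` are illustrative names). It converts a JSON `\uXXXX` escape directly to UTF-8 as RFC 7159 section 7 specifies; surrogate pairs (`\uD800`–`\uDFFF`) would need to be combined into one code point and are omitted here for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

static int hexVal(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  throw std::runtime_error("invalid hex digit");
}

// `p` points at the four hex digits after "\u"; appends UTF-8 to `out`.
void decodeUnicodeEscape(const char* p, std::string& out) {
  uint32_t cp = 0;
  for (int i = 0; i < 4; ++i) cp = (cp << 4) | hexVal(p[i]);
  if (cp < 0x80) {
    // 1-byte sequence: ASCII.
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2-byte sequence: 110xxxxx 10xxxxxx.
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}
```

Handling this in place avoids the buffer copy and the round trip through folly's generic JSON string decoder.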

Base64 encoding:

  • No vectorization implemented yet; we will also follow the approach described in the paper (see Base64 decoding).
  • Unrolled the loop and inlined Base64 encoding, processing 6 input bytes per iteration.
  • Inlined remainder handling
  • 6.6x faster for JSONProtocol_write_BigBinary
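A scalar sketch of the unrolled encoder (illustrative names, not the fbthrift code): each main-loop iteration consumes 6 input bytes and emits 8 output characters (two 3-byte-to-4-char Base64 groups), with the remainder handled inline:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes one 3-byte group into 4 Base64 characters.
static inline void encodeGroup(const uint8_t* p, char* o) {
  uint32_t v = (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
  o[0] = kB64[(v >> 18) & 63];
  o[1] = kB64[(v >> 12) & 63];
  o[2] = kB64[(v >> 6) & 63];
  o[3] = kB64[v & 63];
}

std::string base64Encode(const uint8_t* in, size_t len) {
  std::string out;
  out.reserve(((len + 2) / 3) * 4);
  size_t i = 0;
  char buf[8];
  // Unrolled main loop: 6 bytes in, 8 chars out per iteration.
  while (i + 6 <= len) {
    encodeGroup(in + i, buf);
    encodeGroup(in + i + 3, buf + 4);
    out.append(buf, 8);
    i += 6;
  }
  // Inlined remainder: any leftover full 3-byte group, then padding.
  while (i + 3 <= len) {
    encodeGroup(in + i, buf);
    out.append(buf, 4);
    i += 3;
  }
  size_t rem = len - i;
  if (rem == 1) {
    uint32_t v = uint32_t(in[i]) << 16;
    out.push_back(kB64[(v >> 18) & 63]);
    out.push_back(kB64[(v >> 12) & 63]);
    out += "==";
  } else if (rem == 2) {
    uint32_t v = (uint32_t(in[i]) << 16) | (uint32_t(in[i + 1]) << 8);
    out.push_back(kB64[(v >> 18) & 63]);
    out.push_back(kB64[(v >> 12) & 63]);
    out.push_back(kB64[(v >> 6) & 63]);
    out.push_back('=');
  }
  return out;
}
```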

Base64 decoding:

  • Base64 was previously read into a dynamic string and then copied into an IOBuf; we now write directly to the IOBuf.
  • Vectorize parsing by unrolling the loop to process 16x4 characters per iteration. We followed the algorithm described in https://arxiv.org/pdf/1910.05109 for this and used parts of their reference implementation.
  • Unroll and manually inline remainder
  • Speedup is 11x for JSONProtocol_read_BigBinary and JSONProtocol_read_SmallBinary
  • This part of the code is licensed under the BSD-2-Clause license
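For reference, the scalar decode that the vectorized version accelerates looks roughly like this (illustrative names, not the fbthrift code; the PR's implementation processes 16x4 characters per iteration following the cited paper, and writes straight into the IOBuf rather than a temporary buffer):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Maps a Base64 character to its 6-bit value, or -1 for '=' / invalid.
static int8_t b64Val(uint8_t c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// Accumulates 6 bits per input character and emits a byte whenever
// 8 or more bits are buffered; stops at padding.
std::vector<uint8_t> base64Decode(const std::string& in) {
  std::vector<uint8_t> out;
  uint32_t acc = 0;
  int bits = 0;
  for (uint8_t c : in) {
    int8_t v = b64Val(c);
    if (v < 0) break; // '=' padding or end of data
    acc = (acc << 6) | uint32_t(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(uint8_t((acc >> bits) & 0xFF));
    }
  }
  return out;
}
```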

Performance data was collected with fbthrift and folly compiled with the flag "-mcpu=neoverse-v2+crypto+sve2-sm4+sve2-aes+sve2-sha3".

This PR is contributed by NVIDIA.
