gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025
orjson's implementation is still faster.
Comparing to DuckDB's decoder.
When benchmarking short ASCII strings, performance is unstable because unicode_dealloc is slower than decoding. The speed varies depending on where the object is allocated.
This reverts commit c47d574.
This is the tree I used for microbenchmarks.
orjson's benchmark_load result:
Comparing 0003, 0004, and 0005 on the twitter.json benchmark, this PR improves PyString_FromStringAndSize from 19% slower to 12% slower.
DuckDB introduced an optimization for its UTF-8 decoder. It is up to 40% faster for the short non-ASCII case, but it is 4x slower for the long ASCII case.

Python has optimized code to decode ASCII, so decoding UTF-8 that contains a long ASCII part is faster than UTF8Proc::UTF8ToCodepoint. I am also optimizing the short non-ASCII case handling in CPython.

ref: python/cpython#126025 (comment)

## Background

* Using the PEP 393 based API in 3rd party code depends heavily on current CPython internals, which makes it difficult to evolve those internals (e.g. to use UTF-8 as the internal representation of Unicode).
* Using PEP 393 slows down Python implementations other than CPython that use UTF-8 string representations, e.g. PyPy.
* PyUnicode_FromStringAndSize is part of the Stable ABI. Moving from a non-Stable ABI to the Stable ABI makes it possible to build Python modules that work with several Python versions.
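The claim that long ASCII runs decode quickly rests on checking many bytes per step. The following is a minimal sketch of that idea, not CPython's or DuckDB's actual code: the function name `ascii_prefix_len` is hypothetical, and the word size is taken from `sizeof(size_t)`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Returns the length of the leading all-ASCII run in s[0..n). */
static size_t
ascii_prefix_len(const unsigned char *s, size_t n)
{
    /* 0x80 repeated in every byte of a size_t. */
    const size_t high_bits = (size_t)-1 / 0xFF * 0x80;
    size_t i = 0;

    /* Fast path: test sizeof(size_t) bytes per iteration. */
    while (n - i >= sizeof(size_t)) {
        size_t word;
        memcpy(&word, s + i, sizeof(word));  /* safe for any alignment */
        if (word & high_bits) {
            break;  /* a non-ASCII byte is somewhere in this word */
        }
        i += sizeof(size_t);
    }

    /* Slow path: find the exact position byte by byte. */
    while (i < n && s[i] < 0x80) {
        i++;
    }
    return i;
}
```

For a long ASCII input the loop touches each word once, which is why a per-code-point decoder tends to lose on that workload, while for a short non-ASCII input the fast path exits almost immediately and the per-call overhead dominates.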
static size_t
load_unaligned(const unsigned char *p, size_t size)
{
    assert(size <= SIZEOF_SIZE_T);
This assertion seems to make the build bots fail: https://buildbot.python.org/#/builders/509/builds/7864.
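The quoted `load_unaligned` fragment is cut off after its assertion. As a rough sketch of what a helper with that signature might do, assuming it zero-pads a short tail into one machine word, the version below uses `memcpy` and `sizeof(size_t)` in place of CPython's `SIZEOF_SIZE_T` macro. The names `load_unaligned_sketch` and `is_short_ascii` are hypothetical and this is not the PR's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the helper quoted above; not the PR's code.
   Copies `size` bytes (at most sizeof(size_t)) into a zeroed word. */
static size_t
load_unaligned_sketch(const unsigned char *p, size_t size)
{
    assert(size <= sizeof(size_t));
    size_t word = 0;
    memcpy(&word, p, size);  /* memcpy is safe for any alignment */
    return word;
}

/* Returns 1 if none of the `size` bytes has its high bit set. */
static int
is_short_ascii(const unsigned char *p, size_t size)
{
    /* 0x80 repeated in every byte of a size_t. */
    const size_t high_bits = (size_t)-1 / 0xFF * 0x80;
    return (load_unaligned_sketch(p, size) & high_bits) == 0;
}
```

Because the unused bytes of the word stay zero, the high-bit test gives the same answer on little-endian and big-endian machines. The assertion only holds if every caller passes at most one word of data.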
This optimization works only for the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences.
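To illustrate why the error handler matters, here is a simplified sketch that counts how many code points a decoder would emit under three policies. Everything in it is an assumption made for illustration: the enum, the function names, the one-byte-per-replacement rule, and the validation, which checks only lead and continuation byte shapes and does not reject overlong forms or surrogates. Real codecs differ in these details.

```c
#include <stddef.h>

/* Hypothetical policies mirroring "strict", "replace" and "ignore". */
enum err_policy { ERR_STRICT, ERR_REPLACE, ERR_IGNORE };

/* Byte length of the sequence starting at s[0], or 0 if it is invalid.
   Simplified: checks byte shapes only. Requires n >= 1. */
static size_t
seq_len(const unsigned char *s, size_t n)
{
    size_t need;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) need = 2;
    else if ((s[0] & 0xF0) == 0xE0) need = 3;
    else if ((s[0] & 0xF8) == 0xF0) need = 4;
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (need > n) return 0;  /* truncated sequence */
    for (size_t k = 1; k < need; k++) {
        if ((s[k] & 0xC0) != 0x80) return 0;
    }
    return need;
}

/* Number of code points emitted, or -1 on error under ERR_STRICT. */
static long
decoded_length(const unsigned char *s, size_t n, enum err_policy policy)
{
    long count = 0;
    size_t i = 0;
    while (i < n) {
        size_t len = seq_len(s + i, n - i);
        if (len > 0) {
            count++;
            i += len;
            continue;
        }
        if (policy == ERR_STRICT) return -1;  /* decoding aborts */
        if (policy == ERR_REPLACE) count++;   /* one U+FFFD is emitted */
        i++;                                  /* skip the offending byte */
    }
    return count;
}
```

With the three bytes `a`, `0xFF`, `z`, this sketch fails under the strict policy, emits three code points under replace, and two under ignore. Only under strict is the output fully determined by the valid input alone, so a fast path does not have to track removed or substituted bytes.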
Benchmark
code
Result (with --enable-optimizations --with-lto):