gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

Merged: 18 commits merged into python:main on Nov 29, 2024

Conversation

@methane (Member) commented Oct 27, 2024

  • Test whether the input UTF-8 is ASCII before allocating the ASCII buffer.
  • If the error handler is strict:
    • If the input is not ASCII, estimate the character kind from the first non-ASCII code unit.
    • Count the number of codepoints before allocating the first string buffer.

This optimization works only for the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences; the sketch below illustrates the idea.
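Below is a minimal standalone sketch of the two-pass idea, not the PR's actual unicodeobject.c code; the helper names (first_nonascii, estimate_kind, count_codepoints) and the exact kind thresholds are illustrative. The key point: for valid UTF-8, the codepoint count equals the number of non-continuation bytes, so the target string can be allocated once at the right width — which is why the shortcut is limited to the strict error handler.

#include <stddef.h>
#include <stdio.h>

/* Position of the first byte >= 0x80, or len if the input is pure ASCII. */
static size_t
first_nonascii(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len && s[i] < 0x80) {
        i++;
    }
    return i;
}

/* Guess the PEP 393 kind (1/2/4 bytes per character) from the first
   non-ASCII lead byte; assumes valid UTF-8.  Leads 0xC2..0xC3 encode
   U+0080..U+00FF (Latin-1); other 2-byte and all 3-byte leads stay
   within the BMP (UCS-2); 0xF0 and above need UCS-4. */
static int
estimate_kind(unsigned char lead)
{
    if (lead < 0xC4) {
        return 1;
    }
    if (lead < 0xF0) {
        return 2;
    }
    return 4;
}

/* For valid UTF-8, every byte except continuations (0b10xxxxxx) starts a
   codepoint, so counting non-continuation bytes gives the exact length. */
static size_t
count_codepoints(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (s[i] & 0xC0) != 0x80;
    }
    return n;
}

int main(void)
{
    /* "héllo" as explicit UTF-8 bytes: 0xC3 0xA9 encodes U+00E9. */
    const unsigned char utf8[] = {'h', 0xC3, 0xA9, 'l', 'l', 'o'};
    size_t len = sizeof(utf8);
    size_t ascii_prefix = first_nonascii(utf8, len);
    if (ascii_prefix == len) {
        printf("pure ASCII: a 1-byte-per-char buffer can be allocated directly\n");
    }
    else {
        printf("first non-ASCII at %zu: kind=%d, codepoints=%zu\n",
               ascii_prefix, estimate_kind(utf8[ascii_prefix]),
               count_codepoints(utf8, len));
    }
    return 0;
}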

Benchmark

Code:
import pyperf
import _testlimitedcapi  # CPython test helper; unicode_decodeutf8() calls PyUnicode_DecodeUTF8()

ascii10 = "hellohello".encode()
latin1_10 = "hello\u00e0\u00e1\u00e2\u00e3\u00e4".encode()
ucs2_10 = "こんにちはこんにちは".encode()
ucs4_10 = ("こんにちは" + "".join([chr(i) for i in range(0x1F0A0, 0x1F0A0+5)])).encode()

runner = pyperf.Runner()

def add_funcs(name, arg):
    assert len(arg.decode()) == 10
    runner.bench_func(f"{name}   10", _testlimitedcapi.unicode_decodeutf8, arg)
    runner.bench_func(f"{name}  100", _testlimitedcapi.unicode_decodeutf8, arg*10)
    runner.bench_func(f"{name} 1000", _testlimitedcapi.unicode_decodeutf8, arg*100)

for i in [0, 1, 2, 5, 8]:
    runner.bench_func(f"ASCII    {i}", _testlimitedcapi.unicode_decodeutf8, ascii10[:i])

add_funcs("ASCII", ascii10)
add_funcs("latin1", latin1_10)
add_funcs("ucs2", ucs2_10)
add_funcs("ucs4", ucs4_10)

Result (with --enable-optimizations --with-lto):

Benchmark          main-opt    patched-5o
-----------------------------------------------------
ASCII    0          87.1 ns     89.8 ns: 1.03x slower
ASCII    1          88.5 ns     89.8 ns: 1.01x slower
ASCII    2           100 ns      103 ns: 1.02x slower
ASCII    5           104 ns      103 ns: 1.01x faster
ASCII    8         100.0 ns      105 ns: 1.05x slower
ASCII   10           101 ns      104 ns: 1.02x slower
ASCII  100           110 ns      110 ns: 1.01x faster
ASCII 1000           239 ns      245 ns: 1.03x slower
latin1   10          220 ns      170 ns: 1.29x faster
latin1  100          385 ns      320 ns: 1.21x faster
latin1 1000         2.13 us     1.92 us: 1.11x faster
ucs2     10          217 ns      178 ns: 1.22x faster
ucs2    100          615 ns      473 ns: 1.30x faster
ucs2   1000         3.15 us     3.21 us: 1.02x slower
ucs4     10          268 ns      241 ns: 1.11x faster
ucs4    100          725 ns      581 ns: 1.25x faster
ucs4   1000         3.79 us     3.85 us: 1.02x slower
Geometric mean     (ref)        1.07x faster

@methane linked an issue on Oct 27, 2024 that may be closed by this pull request: gh-126024, "Improve UTF-8 decode speed".
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 5344340 to 9b47c2b on October 27, 2024 01:41.
@methane (Member, Author) commented Oct 27, 2024

orjson.loads() 1000 times with twitter.json from the orjson benchmark suite:

  • original orjson: 2.028s
  • orjson patched to use PyUnicode_FromStringAndSize(): 2.397s
  • patched orjson + this branch: 2.142s

orjson's own decoder is still faster, but this PR reduces the temptation to ship a custom UTF-8 decoder and to use the PEP 393 API.

@methane (Member, Author) commented Oct 27, 2024

Comparison with DuckDB's decoder.

  • code
  • Decoding short UTF-8 (10 codepoints, 30 bytes)
    • with duckdb: 82 ns
    • with main branch: 115 ns
    • with this branch: 92 ns
  • Decoding long ASCII (1000 bytes)
    • with duckdb: 605 ns
    • with main branch: 157 ns
    • with this branch: 151 ns

When benchmarking short ASCII, performance is unstable because unicode_dealloc is slower than the decoding itself; speed varies depending on where the object is allocated.

@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 800452a to b0ce85c on October 29, 2024 04:31.
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from b0ce85c to 37715b6 on October 29, 2024 04:33.
This reverts commit c47d574.
@methane (Member, Author) commented Oct 29, 2024

This is the tree where I played with the microbenchmarks:
https://github.com/methane/notes/tree/master/c/first_nonascii

@methane (Member, Author) commented Oct 29, 2024

orjson's benchmark_load result:

0001: Python 3.13 (python-build-standalone) + orjson (customized to use PyUnicode_FromStringAndSize).
0002: Python 3.13 + orjson (original)
0003: Python 3.14 (this PR) + orjson (original)
0004: Python 3.14 (this PR) + orjson (customized)
0005: Python 3.14 (main) + orjson (customized)

---------------------- benchmark 'canada.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min               Mean                 OPS
---------------------------------------------------------------------------------------------
loads[orjson-canada.json] (0001)     12.6357 (1.37)     12.9228 (1.31)      77.3825 (0.77)
loads[orjson-canada.json] (0002)     12.6432 (1.37)     13.1927 (1.33)      75.7996 (0.75)
loads[orjson-canada.json] (0003)      9.2428 (1.00)      9.9514 (1.00)     100.4887 (1.00)
loads[orjson-canada.json] (0004)      9.2235 (1.0)       9.9023 (1.0)      100.9865 (1.0)
loads[orjson-canada.json] (0005)      9.2686 (1.00)      9.9312 (1.00)     100.6926 (1.00)
---------------------------------------------------------------------------------------------

--------------------- benchmark 'citm_catalog.json deserialization': 5 tests --------------------
Name (time in ms)                             Min              Mean                 OPS
-------------------------------------------------------------------------------------------------
loads[orjson-citm_catalog.json] (0001)     4.3377 (1.18)     4.3424 (1.18)     230.2881 (0.85)
loads[orjson-citm_catalog.json] (0002)     4.2391 (1.16)     4.2407 (1.15)     235.8106 (0.87)
loads[orjson-citm_catalog.json] (0003)     3.6644 (1.0)      3.6756 (1.0)      272.0645 (1.0)
loads[orjson-citm_catalog.json] (0004)     3.7146 (1.01)     3.7166 (1.01)     269.0665 (0.99)
loads[orjson-citm_catalog.json] (0005)     3.7173 (1.01)     3.7198 (1.01)     268.8317 (0.99)
-------------------------------------------------------------------------------------------------

------------------------- benchmark 'github.json deserialization': 5 tests ------------------------
Name (time in us)                         Min                Mean            OPS (Kops/s)
---------------------------------------------------------------------------------------------------
loads[orjson-github.json] (0001)     133.5345 (1.12)     133.6741 (1.12)           7.4809 (0.89)
loads[orjson-github.json] (0002)     133.9510 (1.12)     134.0037 (1.12)           7.4625 (0.89)
loads[orjson-github.json] (0003)     119.2316 (1.0)      119.3161 (1.0)            8.3811 (1.0)
loads[orjson-github.json] (0004)     134.1154 (1.12)     134.3008 (1.13)           7.4460 (0.89)
loads[orjson-github.json] (0005)     134.7232 (1.13)     135.2986 (1.13)           7.3911 (0.88)
---------------------------------------------------------------------------------------------------

-------------------- benchmark 'twitter.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min              Mean                 OPS
--------------------------------------------------------------------------------------------
loads[orjson-twitter.json] (0001)     2.0658 (1.26)     2.0670 (1.26)     483.7979 (0.79)
loads[orjson-twitter.json] (0002)     1.7103 (1.04)     1.7108 (1.04)     584.5319 (0.96)
loads[orjson-twitter.json] (0003)     1.6404 (1.0)      1.6411 (1.0)      609.3393 (1.0)
loads[orjson-twitter.json] (0004)     1.8574 (1.13)     1.8588 (1.13)     537.9842 (0.88)
loads[orjson-twitter.json] (0005)     2.0286 (1.24)     2.0289 (1.24)     492.8792 (0.81)
--------------------------------------------------------------------------------------------

Looking at 0003 vs 0004 vs 0005 on the twitter.json benchmark, this PR narrows the penalty of using PyUnicode_FromStringAndSize (instead of orjson's own decoder) from 19% slower to 12% slower.

@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 72ed21d to 96c7b19 on November 19, 2024 08:43.
Mytherin added a commit to duckdb/duckdb that referenced this pull request on Nov 19, 2024:
DuckDB introduced an optimization for its UTF-8 decoder. It is up to 40% faster for the short non-ASCII case, but 4x slower for the long ASCII case.

Python has optimized code for decoding ASCII, so decoding UTF-8 that contains a long ASCII part is faster than UTF8Proc::UTF8ToCodepoint. And I am optimizing the short non-ASCII case in CPython.

ref:
python/cpython#126025 (comment)

## Background

* Using a PEP 393-based API that depends heavily on current CPython internals in third-party code makes it difficult to evolve those internals (e.g., to use UTF-8 as the internal representation of Unicode).
* Using PEP 393 slows down Python implementations other than CPython that use a UTF-8 string representation, e.g. PyPy.
* PyUnicode_FromStringAndSize is part of the Stable ABI. Moving from a non-Stable ABI to the Stable ABI makes it possible to build Python modules that work with several Python versions. A sketch contrasting the two approaches follows.
@methane merged commit 322b486 into python:main on Nov 29, 2024 (39 checks passed).
@methane deleted the opt_decode_utf8_nonascii_numchars branch on November 29, 2024 10:48.
A contributor commented on this hunk in Objects/unicodeobject.c:

static size_t
load_unaligned(const unsigned char *p, size_t size)
{
    assert(size <= SIZEOF_SIZE_T);

"This assertion seems to make the build bots fail: https://buildbot.python.org/#/builders/509/builds/7864."
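For context, a helper like this is commonly implemented as a memcpy into a zeroed word, which is a well-defined way to perform a partial, possibly unaligned load; a sketch under that assumption (not necessarily the PR's exact code):

#include <assert.h>
#include <string.h>

/* Copy `size` bytes (at most sizeof(size_t)) from a possibly unaligned
   pointer into the low-order bytes of a zeroed word; memcpy keeps the
   access well-defined and compilers lower it to a plain load. */
static size_t
load_unaligned(const unsigned char *p, size_t size)
{
    size_t u = 0;
    assert(size <= sizeof(size_t));
    memcpy(&u, p, size);
    return u;
}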
