gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

methane · 2024-10-27T01:40:01Z

Test whether the UTF-8 input is ASCII before allocating the ASCII buffer.
If the error handler is strict:
- If the input is not ASCII, estimate the string kind from the first non-ASCII code unit.
- Count the number of code points before allocating the first string buffer.

This optimization works only with the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences.
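As a rough illustration of the two steps above, here is a pure-Python sketch. It is not the C code in Objects/unicodeobject.c; the helper names (`first_nonascii`, `estimate_maxchar`, `count_codepoints`, `plan_buffer`) and the lead-byte thresholds are assumptions chosen for illustration, and the input is assumed to be valid UTF-8.

```python
def first_nonascii(data: bytes) -> int:
    """Return the index of the first byte >= 0x80, or len(data) if all ASCII."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return i
    return len(data)


def estimate_maxchar(lead: int) -> int:
    """Guess the widest code point from a UTF-8 lead byte (illustrative thresholds)."""
    if lead < 0xC4:       # lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF
        return 0xFF       # Latin-1 (1 byte per character)
    if lead < 0xF0:       # remaining 2-byte and all 3-byte sequences
        return 0xFFFF     # UCS2 (2 bytes per character)
    return 0x10FFFF       # 4-byte sequences need UCS4


def count_codepoints(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def plan_buffer(data: bytes) -> tuple[int, int]:
    """Return (estimated max char, number of code points) for the first allocation."""
    pos = first_nonascii(data)
    if pos == len(data):
        return 0x7F, len(data)  # pure ASCII: one byte per character
    return estimate_maxchar(data[pos]), pos + count_codepoints(data[pos:])


print(plan_buffer("hello\u00e0\u00e1".encode()))
print(plan_buffer("こんにちは".encode()))
```

The estimate only looks at the first non-ASCII sequence, so a real decoder would still have to widen the buffer if a wider character appears later in the input.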

Benchmark

code

import pyperf
import _testlimitedcapi

ascii10 = "hellohello".encode()
latin1_10 = "hello\u00e0\u00e1\u00e2\u00e3\u00e4".encode()
ucs2_10 = "こんにちはこんにちは".encode()
ucs4_10 = ("こんにちは" + "".join([chr(i) for i in range(0x1F0A0, 0x1F0A0+5)])).encode()

runner = pyperf.Runner()

def add_funcs(name, arg):
    assert len(arg.decode()) == 10
    runner.bench_func(f"{name}   10", _testlimitedcapi.unicode_decodeutf8, arg)
    runner.bench_func(f"{name}  100", _testlimitedcapi.unicode_decodeutf8, arg*10)
    runner.bench_func(f"{name} 1000", _testlimitedcapi.unicode_decodeutf8, arg*100)

for i in [0, 1, 2, 5, 8]:
    runner.bench_func(f"ASCII    {i}", _testlimitedcapi.unicode_decodeutf8, ascii10[:i])

add_funcs("ASCII", ascii10)
add_funcs("latin1", latin1_10)
add_funcs("ucs2", ucs2_10)
add_funcs("ucs4", ucs4_10)

Result (with --enable-optimizations --with-lto):

Benchmark	main-opt	patched-5o
ASCII 0	87.1 ns	89.8 ns: 1.03x slower
ASCII 1	88.5 ns	89.8 ns: 1.01x slower
ASCII 2	100 ns	103 ns: 1.02x slower
ASCII 5	104 ns	103 ns: 1.01x faster
ASCII 8	100.0 ns	105 ns: 1.05x slower
ASCII 10	101 ns	104 ns: 1.02x slower
ASCII 100	110 ns	110 ns: 1.01x faster
ASCII 1000	239 ns	245 ns: 1.03x slower
latin1 10	220 ns	170 ns: 1.29x faster
latin1 100	385 ns	320 ns: 1.21x faster
latin1 1000	2.13 us	1.92 us: 1.11x faster
ucs2 10	217 ns	178 ns: 1.22x faster
ucs2 100	615 ns	473 ns: 1.30x faster
ucs2 1000	3.15 us	3.21 us: 1.02x slower
ucs4 10	268 ns	241 ns: 1.11x faster
ucs4 100	725 ns	581 ns: 1.25x faster
ucs4 1000	3.79 us	3.85 us: 1.02x slower
Geometric mean	(ref)	1.07x faster

Issue: Improve UTF-8 decode speed #126024

methane · 2024-10-27T12:21:29Z

orjson.loads() 1000 times with twitter.json in orjson benchmark suite:

with original orjson: 2.028s
patched orjson that uses PyUnicode_FromStringAndSize(): 2.397s
patched orjson + this branch: 2.142s

orjson's implementation is still faster.
But this PR reduces the temptation to maintain a custom UTF-8 decoder and to use the PEP 393 API.

methane · 2024-10-27T12:46:31Z

Comparing to DuckDB's decoder.

code
Decoding short UTF-8 (10 codepoints & 30 bytes)
- with duckdb: 82ns
- with main branch: 115ns
- with this branch: 92ns
Decoding long ASCII (1000 bytes)
- with duckdb: 605ns
- with main branch: 157ns
- with this branch: 151ns

When benchmarking short ASCII, performance is unstable because unicode_dealloc is slower than decoding. Speed varies depending on where the object is allocated.

methane · 2024-10-29T07:13:45Z

This is the tree where I experimented with microbenchmarks:
https://github.com/methane/notes/tree/master/c/first_nonascii

methane · 2024-10-29T09:24:29Z

orjson's benchmark_load result:

0001: Python 3.13 (python-build-standalone) + orjson (customized to use PyUnicode_FromStringAndSize).
0002: Python 3.13 + orjson (original)
0003: Python 3.14 (this PR) + orjson (original)
0004: Python 3.14 (this PR) + orjson (customized)
0005: Python 3.14 (main) + orjson (customized)

---------------------- benchmark 'canada.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min               Mean                 OPS
---------------------------------------------------------------------------------------------
loads[orjson-canada.json] (0001)     12.6357 (1.37)     12.9228 (1.31)      77.3825 (0.77)
loads[orjson-canada.json] (0002)     12.6432 (1.37)     13.1927 (1.33)      75.7996 (0.75)
loads[orjson-canada.json] (0003)      9.2428 (1.00)      9.9514 (1.00)     100.4887 (1.00)
loads[orjson-canada.json] (0004)      9.2235 (1.0)       9.9023 (1.0)      100.9865 (1.0)
loads[orjson-canada.json] (0005)      9.2686 (1.00)      9.9312 (1.00)     100.6926 (1.00)
---------------------------------------------------------------------------------------------

--------------------- benchmark 'citm_catalog.json deserialization': 5 tests --------------------
Name (time in ms)                             Min              Mean                 OPS
-------------------------------------------------------------------------------------------------
loads[orjson-citm_catalog.json] (0001)     4.3377 (1.18)     4.3424 (1.18)     230.2881 (0.85)
loads[orjson-citm_catalog.json] (0002)     4.2391 (1.16)     4.2407 (1.15)     235.8106 (0.87)
loads[orjson-citm_catalog.json] (0003)     3.6644 (1.0)      3.6756 (1.0)      272.0645 (1.0)
loads[orjson-citm_catalog.json] (0004)     3.7146 (1.01)     3.7166 (1.01)     269.0665 (0.99)
loads[orjson-citm_catalog.json] (0005)     3.7173 (1.01)     3.7198 (1.01)     268.8317 (0.99)
-------------------------------------------------------------------------------------------------

------------------------- benchmark 'github.json deserialization': 5 tests ------------------------
Name (time in us)                         Min                Mean            OPS (Kops/s)
---------------------------------------------------------------------------------------------------
loads[orjson-github.json] (0001)     133.5345 (1.12)     133.6741 (1.12)           7.4809 (0.89)
loads[orjson-github.json] (0002)     133.9510 (1.12)     134.0037 (1.12)           7.4625 (0.89)
loads[orjson-github.json] (0003)     119.2316 (1.0)      119.3161 (1.0)            8.3811 (1.0)
loads[orjson-github.json] (0004)     134.1154 (1.12)     134.3008 (1.13)           7.4460 (0.89)
loads[orjson-github.json] (0005)     134.7232 (1.13)     135.2986 (1.13)           7.3911 (0.88)
---------------------------------------------------------------------------------------------------

-------------------- benchmark 'twitter.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min              Mean                 OPS
--------------------------------------------------------------------------------------------
loads[orjson-twitter.json] (0001)     2.0658 (1.26)     2.0670 (1.26)     483.7979 (0.79)
loads[orjson-twitter.json] (0002)     1.7103 (1.04)     1.7108 (1.04)     584.5319 (0.96)
loads[orjson-twitter.json] (0003)     1.6404 (1.0)      1.6411 (1.0)      609.3393 (1.0)
loads[orjson-twitter.json] (0004)     1.8574 (1.13)     1.8588 (1.13)     537.9842 (0.88)
loads[orjson-twitter.json] (0005)     2.0286 (1.24)     2.0289 (1.24)     492.8792 (0.81)
--------------------------------------------------------------------------------------------

Comparing 0003, 0004, and 0005 on the twitter.json benchmark, this PR narrows the gap for PyUnicode_FromStringAndSize from 19% slower to 12% slower (measured by OPS).

DuckDB introduced an optimization for its UTF-8 decoder. It is up to 40% faster for the short non-ASCII case, but it is 4x slower for the long ASCII case.

Python has optimized code to decode ASCII, so decoding UTF-8 that contains long ASCII parts is faster than UTF8Proc::UTF8ToCodepoint. And I am optimizing the short non-ASCII case in CPython. ref: python/cpython#126025 (comment)

## Background

* Using the PEP 393 based API, which depends heavily on current CPython internals, in third-party code makes it difficult to evolve Python internals (e.g. using UTF-8 as the internal representation of Unicode).
* Using PEP 393 slows down Python implementations other than CPython that use UTF-8 string representations, e.g. PyPy.
* PyUnicode_FromStringAndSize is part of the Stable ABI. Moving from a non-Stable ABI to the Stable ABI makes it possible to build Python modules that work with several Python versions.
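To show why a long ASCII run can be scanned cheaply, here is a Python sketch of a word-at-a-time ASCII check. It only illustrates the general technique of testing the high bit of several bytes at once. It is not CPython's C implementation, and the 8-byte word size and the constant name `HIGH_BITS` are assumptions.

```python
HIGH_BITS = 0x8080808080808080  # bit 7 of each byte in an 8-byte word


def find_first_nonascii(data: bytes) -> int:
    """Return the index of the first byte >= 0x80, or len(data) if all ASCII."""
    n = len(data)
    i = 0
    # Fast path: test 8 bytes per iteration.
    while i + 8 <= n:
        # Little-endian load, so data[i] lands in the lowest 8 bits.
        word = int.from_bytes(data[i:i + 8], "little")
        hits = word & HIGH_BITS
        if hits:
            lowest = hits & -hits  # isolate the lowest set high bit
            return i + (lowest.bit_length() - 1) // 8
        i += 8
    # Tail: fewer than 8 bytes left, check them one at a time.
    while i < n:
        if data[i] >= 0x80:
            return i
        i += 1
    return n


print(find_first_nonascii(b"hellohello"))
print(find_first_nonascii("hello\u00e0".encode()))
```

In C the same idea works on machine words, where alignment and unaligned loads become the main design concern.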

add find_first_nonascii

5a71387

bedevere-app bot added the awaiting core review label Oct 27, 2024

bedevere-app bot mentioned this pull request Oct 27, 2024

Improve UTF-8 decode speed #126024

Closed

methane linked an issue Oct 27, 2024 that may be closed by this pull request

Improve UTF-8 decode speed #126024

Closed

utf8_count_codepoints

9b47c2b

methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 5344340 to 9b47c2b Compare October 27, 2024 01:41

rruuaanng reviewed Oct 27, 2024

methane added 7 commits October 27, 2024 03:37

fixup

b65bbb2

fixup

b759ca6

fix warning

096b1fd

add comment

ea97629

add news

73c381e

ascii_new

c47d574

optimize find_first_nonascii

08ce01c

picnixz reviewed Oct 27, 2024

methane added 2 commits October 27, 2024 11:12

cosmetic changes

7d5f4d2

update news

8e58bf2

methane commented Oct 28, 2024

methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 800452a to b0ce85c Compare October 29, 2024 04:31

optimize unaligned memory load

37715b6

methane force-pushed the opt_decode_utf8_nonascii_numchars branch from b0ce85c to 37715b6 Compare October 29, 2024 04:33

Revert "ascii_new"

c3a22b6

This reverts commit c47d574.

methane added 4 commits November 1, 2024 18:37

fix warning

e3adab4

micro optimization for x86

f563b42

Merge branch 'main' into opt_decode_utf8_nonascii_numchars

7306504

fix implicit-fallthrough warning

092c189

add some comments

96c7b19

methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 72ed21d to 96c7b19 Compare November 19, 2024 08:43

methane mentioned this pull request Nov 19, 2024

python: use PyUnicode_FromStringAndSize() duckdb/duckdb#14895

Merged

methane merged commit 322b486 into python:main Nov 29, 2024
39 checks passed

methane deleted the opt_decode_utf8_nonascii_numchars branch November 29, 2024 10:48

bedevere-app bot removed the awaiting core review label Nov 29, 2024

ayappanec mentioned this pull request Nov 29, 2024

AIX build broken with Illegal instruction #127417

Closed

picnixz reviewed Nov 29, 2024

picnixz mentioned this pull request Dec 3, 2024

UBSan: misaligned memory loads in Objects/dictobject.c #127563

Closed

encukou mentioned this pull request Dec 9, 2024

gh-126024: Use only memcpy for unaligned loads in find_first_nonascii #127769

Closed

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025

pythongh-126024: optimize UTF-8 decoder for short non-ASCII string (p…

ea2aa0a

…ython#126025)

ebonnal pushed a commit to ebonnal/cpython that referenced this pull request Jan 12, 2025

pythongh-126024: optimize UTF-8 decoder for short non-ASCII string (p…

19b9628

…ython#126025)
