gh-93033: Use wmemchr in find_char and replace_1char_inplace #93034

goldsteinn · 2022-05-20T23:29:07Z

This was brought up a bit in #69009 but the larger issue is mostly
different.

Performance is generally comparable in the "good" case, where memchr
doesn't return any collisions (false matches on the lower byte), but
clearly faster with collisions.
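To make the collision case concrete, here is a minimal sketch of the two strategies. It is not the stringlib code: `find_via_memchr` and `find_via_wmemchr` are hypothetical names, and `uint32_t` stands in for a 4-byte character type. The byte-wise version has to verify every memchr hit, so a buffer full of characters that share the needle's low byte forces one extra call per false match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: scan for the low byte of ch with memchr, then verify
   the whole character. Every byte hit that is not a real match counts as a
   collision and costs another memchr call. */
static ptrdiff_t
find_via_memchr(const uint32_t *s, size_t n, uint32_t ch, size_t *collisions)
{
    const unsigned char *base = (const unsigned char *)s;
    const unsigned char *p = base;
    const unsigned char *end = base + n * sizeof(uint32_t);
    unsigned char low = (unsigned char)(ch & 0xff);

    assert(collisions != NULL);
    *collisions = 0;
    while (p < end) {
        p = memchr(p, low, (size_t)(end - p));
        if (p == NULL) {
            return -1;
        }
        /* Map the byte hit back to the character that contains it. */
        size_t idx = (size_t)(p - base) / sizeof(uint32_t);
        if (s[idx] == ch) {
            return (ptrdiff_t)idx;
        }
        (*collisions)++;
        p = base + (idx + 1) * sizeof(uint32_t);  /* resume at the next character */
    }
    return -1;
}

/* Hypothetical helper: compare whole characters with wmemchr when wchar_t
   has the same width, otherwise fall back to a plain loop. The pointer cast
   is exactly the aliasing/alignment question raised in this description. */
static ptrdiff_t
find_via_wmemchr(const uint32_t *s, size_t n, uint32_t ch)
{
    if (sizeof(wchar_t) == sizeof(uint32_t)) {
        const wchar_t *w = (const wchar_t *)s;
        const wchar_t *hit = wmemchr(w, (wchar_t)ch, n);
        return hit ? (ptrdiff_t)(hit - w) : -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (s[i] == ch) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

In the benchmarks below, searching for "\U00018208" is the high-collision case because its low byte 0x08 also appears in "\U00010208", which recurs throughout the haystack.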

Some notes on correctness:

wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along
with just about all the other wide-char string functions) can and
often does (x86_64 for example) assume that the input is aligned
relative to the sizeof(wchar_t). If this is not the case for
Py_UCS{2|4} then this patch is broken.
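One way to state the size and alignment precondition in code is sketched below. This is an illustration only: `demo_ucs2`, `demo_ucs4`, and `demo_wchar_compatible_width` are hypothetical names, and the typedefs are assumed stand-ins for the 2- and 4-byte character types, not the real definitions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <wchar.h>

/* Assumed stand-ins for the 2- and 4-byte character types. */
typedef uint16_t demo_ucs2;
typedef uint32_t demo_ucs4;

static_assert(sizeof(demo_ucs4) == 4, "demo_ucs4 must be 4 bytes wide");

/* Returns the width in bytes of the demo type that wchar_t could stand in
   for (same size, alignment requirement no stricter), or 0 if neither
   type qualifies on this platform. */
static int
demo_wchar_compatible_width(void)
{
    if (sizeof(wchar_t) == sizeof(demo_ucs4)
        && alignof(wchar_t) <= alignof(demo_ucs4)) {
        return 4;
    }
    if (sizeof(wchar_t) == sizeof(demo_ucs2)
        && alignof(wchar_t) <= alignof(demo_ucs2)) {
        return 2;
    }
    return 0;
}
```

On the benchmark machine described below (sizeof(wchar_t) == 4) one would expect this to report 4; on a platform with a 2-byte wchar_t it would report 2.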

Also I think the way I implemented #define STRINGLIB_FAST_MEMCHR for
ucs{2|4}lib breaks strict-aliasing. If this is an issue but otherwise
the patch is fine, any suggestions for how to fix it?

Test results:

$> ./python -m test -j4
...
== Tests result: SUCCESS ==

406 tests OK.

30 tests skipped:
    test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll
    test_idle test_ioctl test_kqueue test_launcher test_msilib
    test_nis test_ossaudiodev test_readline test_smtpnet
    test_socketserver test_sqlite3 test_startfile test_tcl test_tix
    test_tk test_ttk_guionly test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64

Benchmarked on:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

sizeof(wchar_t) == 4

GLIBC 2.35

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision
No wmemchr  : 1000 loops, best of 100: 127 nsec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision
No wmemchr  : 1000 loops, best of 100: 1.29 usec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision
No wmemchr  : 1000 loops, best of 100: 136 nsec per loop
With wmemchr: 1000 loops, best of 100: 130 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision
No wmemchr  : 1000 loops, best of 100: 1.35 usec per loop
With wmemchr: 1000 loops, best of 100: 131 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision
No wmemchr  : 1000 loops, best of 100: 50.2 nsec per loop
With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision
No wmemchr  : 1000 loops, best of 100: 69.1 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision
No wmemchr  : 1000 loops, best of 100: 53.6 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision
No wmemchr  : 1000 loops, best of 100: 69 nsec per loop
With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop

ghost · 2022-05-20T23:29:09Z

All commit authors signed the Contributor License Agreement.

bedevere-bot · 2022-05-20T23:29:10Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.


bedevere-bot · 2022-05-21T20:18:35Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.


serhiy-storchaka · 2022-05-22T17:38:08Z

Could you please repeat microbenchmarks from #69009?

goldsteinn · 2022-05-22T17:59:27Z

> Could you please repeat microbenchmarks from #69009?

On my machine at least, none of those go to ucs4lib_find, so they are unaffected by the patch. That's in fact why I used the benchmarks in the commit message.

Misc/NEWS.d/next/Library/2022-05-21-20-25-04.gh-issue-93033.QuMGyh.rst

serhiy-storchaka · 2022-05-22T18:06:00Z

Can you test on Windows?

goldsteinn · 2022-05-22T18:07:32Z

> Can you test on Windows?

Unfortunately I do not have access to a Windows machine.

goldsteinn · 2022-05-22T18:13:02Z

I can run tests with different collision rates / lengths if you want, just let me know.

sweeneyde · 2022-05-23T03:38:39Z

I think it may be a good idea to test this with buildbots to make sure that there are no new warnings about strict aliasing violations.

C11, section 6.5, paragraph 7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

a type compatible with the effective type of the object,

a qualified version of a type compatible with the effective type of the object,

a type that is the signed or unsigned type corresponding to the effective type of the object,

a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

a character type

Since wchar_t and uint16_t or uint32_t aren't guaranteed to satisfy any of these, I think this is technically outside the C spec.

I'm not sure how much we care about that though: wmemchr doesn't modify the buffer, so I doubt any reasonable compiler would do the sorts of re-ordering optimizations that strict-aliasing rules enable.
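For contrast, here is a small sketch of accesses that stay inside the rule quoted above. The helper names (`demo_first_byte`, `demo_load_u32`, `demo_find_u32`) are hypothetical and not part of the patch; the point is only that character-type access, memcpy, and access through the element's own type are all well defined, whereas reading a uint32_t object through a wchar_t pointer is not guaranteed to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Allowed by the "character type" clause: any object may be inspected
   one byte at a time. */
static unsigned char
demo_first_byte(const uint32_t *p)
{
    assert(p != NULL);
    return *(const unsigned char *)p;
}

/* Also well defined: copy the bytes into an object of the wanted type
   with memcpy instead of dereferencing a cast pointer. */
static uint32_t
demo_load_u32(const void *src)
{
    uint32_t v;
    assert(src != NULL);
    memcpy(&v, src, sizeof v);
    return v;
}

/* Alias-free search: every access uses the element's own type. */
static ptrdiff_t
demo_find_u32(const uint32_t *s, size_t n, uint32_t ch)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] == ch) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

The plain loop gives up whatever vectorized wmemchr the C library provides, which is the trade-off this PR is weighing.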

bedevere-bot · 2022-05-23T03:39:59Z

🤖 New build scheduled with the buildbot fleet by @sweeneyde for commit 2bbb3ce 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

goldsteinn · 2022-05-23T15:14:08Z

> I think it may be a good idea to test this with buildbots to make sure that there are no new warnings about strict aliasing violations.
>
> C11, section 6.5, paragraph 7:
>
> An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
>
> a type compatible with the effective type of the object,
>
> a qualified version of a type compatible with the effective type of the object,
>
> a type that is the signed or unsigned type corresponding to the effective type of the object,
>
> a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>
> an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>
> a character type
>
> Since wchar_t and uint16_t or uint32_t aren't guaranteed to satisfy any of these, I think this is technically outside the C spec.

Do you know if they are guaranteed to satisfy the alignment requirements?

> I'm not sure how much we care about that though: wmemchr doesn't modify the buffer, so I doubt any reasonable compiler would do the sorts of re-ordering optimizations that strict-aliasing rules enable.

Didn't see any new warnings regarding aliasing.

The two failures look unrelated (timeout and failure to setup env) but a maintainer should obviously verify.

sweeneyde · 2022-05-23T15:30:43Z

> Do you know if they are guaranteed to satisfy the alignment requirements?

Yes, alignment should be fine. PyUnicode_New() uses PyObject_Malloc() which uses these constants:

cpython/Objects/obmalloc.c, lines 878 to 884 in b2694ab:

    #if SIZEOF_VOID_P > 4
    #define ALIGNMENT              16               /* must be 2^N */
    #define ALIGNMENT_SHIFT         4
    #else
    #define ALIGNMENT               8               /* must be 2^N */
    #define ALIGNMENT_SHIFT         3
    #endif

methane · 2022-05-24T01:33:51Z

> Yes, alignment should be fine. PyUnicode_New() uses PyObject_Malloc() which uses these constants:

That is unrelated to this change, because the find functions support searching starting from the middle of a string (a substring).
So the alignment guarantee is only 2 bytes for Py_UCS2* and 4 bytes for Py_UCS4*.
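A short sketch of why the allocator's alignment does not carry over to substrings. The names `demo_buf` and `demo_address_alignment` are hypothetical, and the 16-byte-aligned buffer is an assumption standing in for whatever the allocator returns.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* A 16-byte-aligned buffer of 2-byte characters, as an allocator might return. */
static alignas(16) uint16_t demo_buf[8];

/* Largest power of two, capped at 16, that divides the address of p. */
static size_t
demo_address_alignment(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    size_t align = 1;

    assert(p != NULL);
    while (align < 16 && (a % (align * 2)) == 0) {
        align *= 2;
    }
    return align;
}
```

With this helper, `demo_address_alignment(demo_buf)` reports 16, but `demo_address_alignment(demo_buf + 1)`, the start of a search that skips the first character, reports only 2. That is still sufficient for a routine that requires alignment to the element size, which is the only guarantee a substring pointer has.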

goldsteinn · 2022-05-24T03:59:14Z

Thanks for the review all :)

bedevere-bot added the awaiting review label May 20, 2022

goldsteinn mentioned this pull request May 21, 2022

Use wmemchr in stringlib if sizeof(STRINGLIB_CHAR) == sizeof(wchar_t) #93033

Closed

corona10 requested a review from methane May 21, 2022 15:34

sweeneyde reviewed May 21, 2022

Objects/stringlib/asciilib.h

goldsteinn force-pushed the use-wmemchr-in-stringlib branch from 2b7fc75 to 80bfc80 Compare May 21, 2022 20:18

blurb-it bot and others added 3 commits May 21, 2022 20:25

📜🤖 Added by blurb_it.

cd985e5

Add missing STRINGLIB_FAST_MEMCHR in bytearrayobject.c and bytes_methods.c

56f6b42

Merge remote-tracking branch 'refs/remotes/origin/use-wmemchr-in-stringlib' into use-wmemchr-in-stringlib

928d5df

serhiy-storchaka self-requested a review May 22, 2022 17:44

goldsteinn closed this May 22, 2022

goldsteinn reopened this May 22, 2022

serhiy-storchaka reviewed May 22, 2022

Misc/NEWS.d/next/Library/2022-05-21-20-25-04.gh-issue-93033.QuMGyh.rst (outdated)

📜🤖 Added by blurb_it.

8219d71

methane approved these changes May 23, 2022

bedevere-bot added awaiting merge and removed awaiting review labels May 23, 2022

Delete 2022-05-21-20-25-04.gh-issue-93033.QuMGyh.rst

2bbb3ce

sweeneyde added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 23, 2022

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 23, 2022

sweeneyde approved these changes May 23, 2022

methane merged commit 7108bdf into python:main May 24, 2022

bedevere-bot removed the awaiting merge label May 24, 2022

goldsteinn deleted the use-wmemchr-in-stringlib branch May 24, 2022 03:59
