-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-93033: Use wmemchr in find_char and replace_1char_inplace #93034
Conversation
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
This was brought up a bit in python#69009 but the larger issue is mostly different. Generally comparable perf for the "good" case where memchr doesn't return any collisions (false matches on lower byte) but clearly faster with collisions. Some notes on correctness: wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along with just about all the other wide-char string functions) can and often does (x86_64 for example) assume that the input is aligned relative to the sizeof(wchar_t). If this is not the case for Py_UCS{2|4} then this patch is broken. Also I think the way I implemented `#define STRINGLIB_FAST_MEMCHR` for ucs{2|4}lib break strict-aliasing. If this is an issue but otherwise the patch is fine, any suggestions for how to fix it? Test results: ``` $> ./python -m test -j4 ... == Tests result: SUCCESS == 406 tests OK. 30 tests skipped: test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_idle test_ioctl test_kqueue test_launcher test_msilib test_nis test_ossaudiodev test_readline test_smtpnet test_socketserver test_sqlite3 test_startfile test_tcl test_tix test_tk test_ttk_guionly test_ttk_textonly test_turtle test_urllib2net test_urllibnet test_winconsoleio test_winreg test_winsound test_xmlrpc_net test_zipfile64 ``` Benchmarked on: model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz sizeof(wchar_t) == 4 GLIBC 2.35 ``` ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision No wmemchr : 1000 loops, best of 100: 127 nsec per loop With wmemchr: 1000 loops, best of 100: 123 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision No wmemchr : 1000 loops, best of 100: 1.29 usec per loop With wmemchr: 1000 loops, best of 100: 123 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision No wmemchr : 1000 loops, best of 100: 136 nsec per loop With wmemchr: 1000 loops, best of 100: 130 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision No wmemchr : 1000 loops, best of 100: 1.35 usec per loop With wmemchr: 1000 loops, best of 100: 131 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision No wmemchr : 1000 loops, best of 100: 50.2 nsec per loop With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision No wmemchr : 1000 loops, best of 100: 69.1 nsec per loop With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision No wmemchr : 1000 loops, best of 100: 53.6 nsec per loop With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision No wmemchr : 1000 loops, best of 100: 69 nsec per loop With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop ```
2b7fc75
to
80bfc80
Compare
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
…nglib' into use-wmemchr-in-stringlib
Could you please repeat microbenchmarks from #69009? |
On my machine at least all of those don't go to |
Misc/NEWS.d/next/Library/2022-05-21-20-25-04.gh-issue-93033.QuMGyh.rst
Outdated
Show resolved
Hide resolved
Can you test on Windows? |
Unfortunately I do not have access to a windows machine. |
I can do tests with different collision rates / lengths if you want just LMK. |
I think it may be a good idea to test this with buildbots to make sure that there are no new warnings about strict aliasing violations. C11, section 6.5, paragraph 7:
Since I'm not sure how much we care about that though: |
🤖 New build scheduled with the buildbot fleet by @sweeneyde for commit 2bbb3ce 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again. |
Do you know if they are guaranteed to satisfy the alignment requirements?
Didn't see any new warnings regarding aliasing. The two failures look unrelated (timeout and failure to setup env) but a maintainer should obviously verify. |
Yes, alignment should be fine. Lines 878 to 884 in b2694ab
|
It is unrelating to this change because find functions support searching from substring. |
Thanks for the review all :) |
This was brought up a bit in #69009 but the larger issue is mostly
different.
Generally comparable perf for the "good" case where memchr doesn't
return any collisions (false matches on lower byte) but clearly faster
with collisions.
Some notes on correctness:
wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along
with just about all the other wide-char string functions) can and
often does (x86_64 for example) assume that the input is aligned
relative to the sizeof(wchar_t). If this is not the case for
Py_UCS{2|4} then this patch is broken.
Also I think the way I implemented
#define STRINGLIB_FAST_MEMCHR
forucs{2|4}lib break strict-aliasing. If this is an issue but otherwise
the patch is fine, any suggestions for how to fix it?
Test results:
Benchmarked on:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
sizeof(wchar_t) == 4
GLIBC 2.35