The optimization of string search can cause pessimization #69009

serhiy-storchaka · 2015-08-07T06:11:26Z

BPO	24821
Nosy	@pitrou, @vstinner, @ned-deily, @methane, @serhiy-storchaka, @animalize, @zhangyangyu, @mlouielu
PRs	bpo-24821: Fixed the slowing down to 25 times in the searching of some #505
Files	find_char.patch find_char_false_positives.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2017-03-30.06:21:11.700>
created_at = <Date 2015-08-07.06:11:25.937>
labels = ['interpreter-core', '3.7', 'performance']
title = 'The optimization of string search can cause pessimization'
updated_at = <Date 2017-03-30.06:21:11.699>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2017-03-30.06:21:11.699>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2017-03-30.06:21:11.700>
closer = 'serhiy.storchaka'
components = ['Interpreter Core']
creation = <Date 2015-08-07.06:11:25.937>
creator = 'serhiy.storchaka'
dependencies = []
files = ['41035', '41040']
hgrepos = []
issue_num = 24821
keywords = ['patch']
message_count = 17.0
messages = ['248179', '250410', '250419', '254627', '254653', '254659', '254661', '269020', '274614', '276001', '276005', '290772', '290774', '290775', '290779', '290781', '290819']
nosy_count = 10.0
nosy_names = ['pitrou', 'vstinner', 'ned.deily', 'methane', 'python-dev', 'serhiy.storchaka', 'malin', 'azsorkin', 'xiang.zhang', 'louielu']
pr_nums = ['505']
priority = 'low'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue24821'
versions = ['Python 3.7']

serhiy-storchaka · 2015-08-07T06:11:25Z

Search in strings is highly optimized for common case. However for some input data the search in non-ascii string becomes unexpectedly slow. Compare:

$ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"є" in s'
100000 loops, best of 3: 11.7 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"Є" in s'
1000 loops, best of 3: 769 usec per loop

It's because the lowest byte of the code of Ukrainian capital letter Є (U+0404) matches the highest byte of codes of most Cyrillic letters (U+04xx). There are similar issues with some other scripts.

I think we should use more robust optimization.

vstinner · 2015-09-10T19:20:16Z

I think we should use more robust optimization.

What do you propose? I don't understand the purpose of the issue. Do you want to remove the optimization?

serhiy-storchaka · 2015-09-10T20:39:40Z

I was going to provide another optimization (I had an idea), but it is not so easy as looked to me at first glance. This issue exists rather as a reminder to me. I should make a second attempt.

serhiy-storchaka · 2015-11-13T22:14:44Z

Proposed patch makes the degenerate case less hard while preserves the optimization for common case.

$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
1000 loops, best of 3: 330 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
1000 loops, best of 3: 325 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
100 loops, best of 3: 7.81 msec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
100 loops, best of 3: 8.5 msec per loop

$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
1000 loops, best of 3: 317 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
1000 loops, best of 3: 327 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
1000 loops, best of 3: 1.1 msec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
1000 loops, best of 3: 964 usec per loop

The slowdown is decreased from 25 times to 3 times.

The idea is that after memchr found false positive, make a tens iterations of simple loop before calling memchr again. This splits the cost of the memchr call with a tens of characters.

The patch also makes a little refactoring. STRINGLIB(fastsearch_memchr_1char) now is renamed and split on two functions STRINGLIB(find_char) and STRINGLIB(rfind_char) with simpler interface. All preconditional checks are moved into these functions. These functions now are directly used in other files.

vstinner · 2015-11-14T12:51:44Z

The patch also makes a little refactoring. STRINGLIB(fastsearch_memchr_1char) now is renamed and split on two functions STRINGLIB(find_char) and STRINGLIB(rfind_char) with simpler interface.

Could you please push this part of the change? It looks good to me.

--

The C library provides a wmemchr() function which can be used to search for a wchar_t character inside a wchar_t* string.

for 16-bit wchar_t (ex: Windows, AIX), it can be used for Python UCS2 strings
for 32-bit wchar_t (ex: Linux, Mac OS X), it can be used for Python UCS4 strings

I don't know if the type is signed or not. If the type is signed, we have to ensure that wmemchr() works as expected.

--

I didn't review carefully your new heuristic to search for a character.

Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

python-dev · 2015-11-14T13:48:56Z

New changeset 1412be96faf0 by Serhiy Storchaka in branch 'default':
Issue bpo-24821: Refactor STRINGLIB(fastsearch_memchr_1char) and split it on
https://hg.python.org/cpython/rev/1412be96faf0

serhiy-storchaka · 2015-11-14T14:10:59Z

Could you please push this part of the change? It looks good to me.

Done. The patch that make an optimization now looks much simpler.

The C library provides a wmemchr() function which can be used to search for a wchar_t character inside a wchar_t* string.

Yes, I know. But this is C99 function, hence we should add configure test for it, and I don't know if it exists on Windows. I did not test its performance.

Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

A simple C loop is 25-50% slower. My patch tries to use memchr(), but if it founds false positive, runs a simple C loop for limited length. This makes a performance only insignificantly worse in the case of few false positives, but makes it much better in cases of often or grouped together false positives. The common case when there are no false positives is not affected.

MEMCHR_CUT_OFF is heuristic parameter. It is equal to the number of characters for which memchr()/memrchr() and a simple loop work the same time. On my machine it is 10-15 for UCS1 and 20-50 for UCS2 and UCS4 (depending on direction and character width).

serhiy-storchaka · 2016-06-21T20:14:58Z

Ping.

serhiy-storchaka · 2016-09-06T19:51:13Z

Could you please make a review Victor?

serhiy-storchaka · 2016-09-12T08:46:57Z

I would commit the patch before beta 1, but:

The issue can be considered not as a new feature, but as a fix of performance bug (just we don't will to backport the fix to 3.5). I think the patch can be committed at the beta stage.
I would want to tune magic constants. I experimented on my two computers, and thing they are good enough, but would be nice to test them on other computers. I need to write automatic benchmark for running on foreign computers. This will take a time.

ned-deily · 2016-09-12T09:01:47Z

If it has waited this long and is truly low priority, I think it can wait a little longer until 3.7.

serhiy-storchaka · 2017-03-29T07:11:23Z

Ukrainian "Є" is not the only character suffered from this issue. I suppose much CJK characters are suffered too. The problem is occurred when the lower byte of searched character matches the upper byte of most characters in the text. For example, searching "乎" (U+4e4e) in the sequence of characters U+4eXX (but not containing U+4e4e and U+4e4f):

$ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乎")'

Unpatched: Median +- std dev: 761 us +- 108 us
Patched: Median +- std dev: 117 us +- 9 us

For comparison, searching "乏" (U+4e4f) in the same sequence:

$ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乏")'

Unpatched: Median +- std dev: 12.8 us +- 1.4 us
Patched: Median +- std dev: 12.6 us +- 1.7 us

Sorry, I don't know Chinese or Japanese and can't provide more realistic examples.

mlouielu · 2017-03-29T07:52:03Z

I can now only test on Python3.6, providing much meaningful sentence,
still trying to use perf on cpython master branch.

---------------------------------------------------

$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乎")'

Median +- std dev: 228 ns +- 7 ns

$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乏")'

Median +- std dev: 143 ns +- 3 ns

---------------------------------------------------

$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其哀哉"*1000' -- 's.find("乎")'  
                                                                       ^^ (missing "乎")
Median +- std dev: 100 us +- 3 us

$ python -m perf timeit -s 's="一件乒乓事事亏, 不串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 's.find("乏")'
                                             ^^ (missing "乏")
Median +- std dev: 1.67 us +- 0.05 us

zhangyangyu · 2017-03-29T08:19:53Z

I can't give a "realistic" example. A more meaningful example may be:

'。', '\u3002', the Chinese period which used in almost every paragraph,
'地', '\u5730', which is a common used word.

./python3 -m perf timeit -s 's = "你好，我叫李雷。"*1000' 's.find("地")'
Mean +- std dev: 6.65 us +- 0.27 us
Mean +- std dev: 4.08 us +- 0.22 us

It works! :-)

vstinner · 2017-03-29T12:22:45Z

Would it completely kill performances to remove the optimization?

serhiy-storchaka · 2017-03-29T13:32:52Z

Thank you for testing Louie. Thank you for your review and testing Xiang.

Would it completely kill performances to remove the optimization?

Yes. The optimization is not used if the lowest byte is 0. You can try to search "\0" for getting the result without the optimization. On my netbook the optimization speeds up searching by 5.5 times.

Search with optimization:
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Ж")'
Median +- std dev: 296 us +- 33 us

Search without optimization:
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("\0")'
Median +- std dev: 1.65 ms +- 0.10 ms

Search an unlucky character (unpatched):
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
Median +- std dev: 14.7 ms +- 1.8 ms

Search an unlucky character (patched):
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
Median +- std dev: 2.17 ms +- 0.24 ms

serhiy-storchaka · 2017-03-30T06:11:13Z

New changeset 0a58f72 by Serhiy Storchaka in branch 'master':
bpo-24821: Fixed the slowing down to 25 times in the searching of some (#505)
0a58f72

This was brought up a bit in python#69009 but the larger issue is mostly different. Generally comparable perf for the "good" case where memchr doesn't return any collisions (false matches on lower byte) but clearly faster with collisions. Some notes on correctness: wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along with just about all the other wide-char string functions) can and often does (x86_64 for example) assume that the input is aligned relative to the sizeof(wchar_t). If this is not the case for Py_UCS{2|4} then this patch is broken. Also I think the way I implemented `#define STRINGLIB_FAST_MEMCHR` for ucs{2|4}lib break strict-aliasing. If this is an issue but otherwise the patch is fine, any suggestions for how to fix it? Test results: ``` $> ./python -m test -j4 ... == Tests result: SUCCESS == 406 tests OK. 30 tests skipped: test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_idle test_ioctl test_kqueue test_launcher test_msilib test_nis test_ossaudiodev test_readline test_smtpnet test_socketserver test_sqlite3 test_startfile test_tcl test_tix test_tk test_ttk_guionly test_ttk_textonly test_turtle test_urllib2net test_urllibnet test_winconsoleio test_winreg test_winsound test_xmlrpc_net test_zipfile64 ``` Benchmarked on: model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz sizeof(wchar_t) == 4 GLIBC 2.35 ``` ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision No wmemchr : 1000 loops, best of 100: 127 nsec per loop With wmemchr: 1000 loops, best of 100: 123 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision No wmemchr : 1000 loops, best of 100: 1.29 usec per loop With wmemchr: 1000 loops, best of 100: 123 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision No wmemchr : 1000 loops, best of 100: 136 nsec per loop With wmemchr: 1000 loops, best of 100: 130 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision No wmemchr : 1000 loops, best of 100: 1.35 usec per loop With wmemchr: 1000 loops, best of 100: 131 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision No wmemchr : 1000 loops, best of 100: 50.2 nsec per loop With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision No wmemchr : 1000 loops, best of 100: 69.1 nsec per loop With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision No wmemchr : 1000 loops, best of 100: 53.6 nsec per loop With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop ./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision No wmemchr : 1000 loops, best of 100: 69 nsec per loop With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop ```

serhiy-storchaka self-assigned this Aug 7, 2015

serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Aug 7, 2015

serhiy-storchaka added the 3.7 (EOL) end of life label Mar 6, 2017

serhiy-storchaka closed this as completed Mar 30, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

goldsteinn mentioned this issue May 20, 2022

Use wmemchr in stringlib if sizeof(STRINGLIB_CHAR) == sizeof(wchar_t) #93033

Closed

goldsteinn mentioned this issue May 20, 2022

gh-93033: Use wmemchr in find_char and replace_1char_inplace #93034

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The optimization of string search can cause pessimization #69009

The optimization of string search can cause pessimization #69009

serhiy-storchaka commented Aug 7, 2015

serhiy-storchaka commented Aug 7, 2015

vstinner commented Sep 10, 2015

serhiy-storchaka commented Sep 10, 2015

serhiy-storchaka commented Nov 13, 2015

vstinner commented Nov 14, 2015

python-dev mannequin commented Nov 14, 2015

serhiy-storchaka commented Nov 14, 2015

serhiy-storchaka commented Jun 21, 2016

serhiy-storchaka commented Sep 6, 2016

serhiy-storchaka commented Sep 12, 2016

ned-deily commented Sep 12, 2016

serhiy-storchaka commented Mar 29, 2017

mlouielu mannequin commented Mar 29, 2017

zhangyangyu commented Mar 29, 2017

vstinner commented Mar 29, 2017

serhiy-storchaka commented Mar 29, 2017

serhiy-storchaka commented Mar 30, 2017

The optimization of string search can cause pessimization #69009

The optimization of string search can cause pessimization #69009

Comments

serhiy-storchaka commented Aug 7, 2015

serhiy-storchaka commented Aug 7, 2015

vstinner commented Sep 10, 2015

serhiy-storchaka commented Sep 10, 2015

serhiy-storchaka commented Nov 13, 2015

vstinner commented Nov 14, 2015

python-dev mannequin commented Nov 14, 2015

serhiy-storchaka commented Nov 14, 2015

serhiy-storchaka commented Jun 21, 2016

serhiy-storchaka commented Sep 6, 2016

serhiy-storchaka commented Sep 12, 2016

ned-deily commented Sep 12, 2016

serhiy-storchaka commented Mar 29, 2017

mlouielu mannequin commented Mar 29, 2017

zhangyangyu commented Mar 29, 2017

vstinner commented Mar 29, 2017

serhiy-storchaka commented Mar 29, 2017

serhiy-storchaka commented Mar 30, 2017