Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] bpo-39465: _PyUnicode_FromId() now uses an hash table #20048

Closed
wants to merge 1 commit into from
Closed

[WIP] bpo-39465: _PyUnicode_FromId() now uses an hash table #20048

wants to merge 1 commit into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented May 12, 2020

Rewrote _Py_Identifier structure and _PyUnicode_FromId() function to
store Python objects in an hash table rather than a single-linked
list.

Add _PyUnicode_PreInit() to create the hash table: it must be called
before the first PyType_Ready() call.

https://bugs.python.org/issue39465

@vstinner
Copy link
Member Author

Result of bench.py microbenchmark of https://bugs.python.org/issue39465:

fromid a: Mean +- std dev: [ref] 2.38 ns +- 0.00 ns -> [patch] 15.5 ns +- 0.1 ns: 6.49x slower (+549%)
fromid abc: Mean +- std dev: [ref] 2.42 ns +- 0.00 ns -> [patch] 15.1 ns +- 0.2 ns: 6.23x slower (+523%)

Reading an hashtable entry is more than 6x slower than just just reading an object attribute: 15 ns vs 2 ns.

@vstinner
Copy link
Member Author

Context for these numbers:

  • PyUnicode_FromString("abc"): 35.8 ns +- 0.7 ns
  • PyUnicode_InternFromString("abc"): 89.8 ns +- 1.0 ns

_PyUnicode_FromId("abc") with this change (15.1 ns) remains 2.4x faster than PyUnicode_FromString (35.8 ns) and 5.9x faster than PyUnicode_InternFromString (89.8 ns)

@vstinner
Copy link
Member Author

With PR #20051 and one extra optimization, I optimized this PR to: 6.65 ns +- 0.09 ns that you should compared to 2.42 ns +- 0.00 ns without this change. It sounds better than my first attempt (before optimization): 15.1 ns +- 0.2 ns.

@vstinner
Copy link
Member Author

I optimized the code. Now the overhead of this PR is way smaller. Microbenchmark using LTO compilation:

[ref] 2.34 ns +- 0.00 ns -> [PR] 4.67 ns +- 0.07 ns: 1.99x slower (+99%)

It adds 2.33 nanoseconds per _PyUnicode_FromId() call.

It's hard to bet _PyUnicode_FromId() performance which is currently just if (id->object) { return id->object; } else { ... }.

@vstinner
Copy link
Member Author

cc @ericsnowcurrently @pablogsal @serhiy-storchaka @methane: Do you think that such slowdown is acceptable? The intent of this PR is to prepare _PyUnicode_FromId() to support subinterpreters.

The next step is to have per-interpreter hash table.

I'm targeting Python 3.10, not Python 3.9.

@vstinner
Copy link
Member Author

_Py_hashtable_get() in _PyUnicode_FormId() calls ht->get_func(ht, pkey, data) which is:

(gdb) disassemble _Py_hashtable_get_ptr 
Dump of assembler code for function _Py_hashtable_get_ptr:
   0x00000000004303b0 <+0>:	mov    rax,QWORD PTR [rdi]
   0x00000000004303b3 <+3>:	mov    rcx,rdx
   0x00000000004303b6 <+6>:	mov    rdx,QWORD PTR [rsi]
   0x00000000004303b9 <+9>:	sub    rax,0x1
   0x00000000004303bd <+13>:	ror    rdx,0x4
   0x00000000004303c1 <+17>:	and    rax,rdx
   0x00000000004303c4 <+20>:	mov    rdx,QWORD PTR [rdi+0x10]
   0x00000000004303c8 <+24>:	mov    rax,QWORD PTR [rdx+rax*8]
   0x00000000004303cc <+28>:	test   rax,rax
   0x00000000004303cf <+31>:	jne    0x4303e0 <_Py_hashtable_get_ptr+48>

   0x00000000004303d1 <+33>:	jmp    0x430401 <_Py_hashtable_get_ptr+81>

   0x00000000004303d3 <+35>:	nop    DWORD PTR [rax+rax*1+0x0]
   0x00000000004303d8 <+40>:	mov    rax,QWORD PTR [rax]
   0x00000000004303db <+43>:	test   rax,rax
   0x00000000004303de <+46>:	je     0x430400 <_Py_hashtable_get_ptr+80>

   0x00000000004303e0 <+48>:	mov    rdx,QWORD PTR [rsi]
   0x00000000004303e3 <+51>:	cmp    QWORD PTR [rax+0x10],rdx
   0x00000000004303e7 <+55>:	jne    0x4303d8 <_Py_hashtable_get_ptr+40>

   0x00000000004303e9 <+57>:	mov    rdx,QWORD PTR [rdi+0x18]
   0x00000000004303ed <+61>:	mov    rax,QWORD PTR [rax+rdx*1+0x10]
   0x00000000004303f2 <+66>:	mov    QWORD PTR [rcx],rax
   0x00000000004303f5 <+69>:	mov    eax,0x1
   0x00000000004303fa <+74>:	ret    

   0x00000000004303fb <+75>:	nop    DWORD PTR [rax+rax*1+0x0]
   0x0000000000430400 <+80>:	ret    

   0x0000000000430401 <+81>:	xor    eax,eax
   0x0000000000430403 <+83>:	ret    
End of assembler dump.

Machine code on x86-64 on Fedora 32 with gcc (GCC) 10.0.1 20200430 (Red Hat 10.0.1-0.14), compilation using LTO. It contains many NOP, likely for better code placement :-)

@encukou
Copy link
Member

encukou commented May 12, 2020

Could it be could be done with an array of identifiers instead of a hash table?
It would mean a preprocessing step to collect all identifiers across the stdlib. It would mean PyIdentifier could never become public API. On the other hand, common identifiers would be deduplicated across modules.

@serhiy-storchaka
Copy link
Member

I would use not hastables, but continuous arrays. At first access assign an index for _Py_Identifier, and then just read the corresponding element of the per-interpreter array of constants.

It will still much slower than the current code. Would be nice to add a compile time option to disable subinterpreters.

@ericsnowcurrently
Copy link
Member

+1 on avoiding a hash table.

What is the set of strings used with _Py_IDENTIFIER()? I would expect it to be relatively small, with lots of repetition. There probably isn't a need to make the mechanism so dynamic as we currently have.

Regardless, the same effect could be made (in the same way) using a table/struct holding all these known identifiers instead of having them spread all over as statics (and we could even get rid of _Py_IDENTIFIER).

@encukou
Copy link
Member

encukou commented May 12, 2020

It might be interesting to look at how MicroPython interns strings. There's a preprocessing step before C compilation, and new ones can also be added dynamically.

@serhiy-storchaka
Copy link
Member

_Py_IDENTIFIER is only a part of the story. There is a similar list of constant tuples for _PyArg_Parser. And Victor once proposed more general mechanism _Py_INIT_ONCE.

The number of all constants is limited, but it is hard to determine it at compile time. And the code that use the compile-time registry of constant would be non-readable. Also, it would be inefficient to initialize all constants at the interpreter start if most of them will not be used. It is better to build the list of constants at runtime.

@encukou
Copy link
Member

encukou commented May 12, 2020

It is better to build the objects on demand, but would it be worth it to allocate space for them at the beginning, and use build-time-constant indexes into the array?

@ericsnowcurrently
Copy link
Member

It is better to build the objects on demand, but would it be worth it to allocate space for them at the beginning, and use build-time-constant indexes into the array?

+1

This is more what I was saying. I didn't mean to initialize all the constants at start (just like we do not currently with _Py_IDENTIFIER()).

@ericsnowcurrently
Copy link
Member

It might be interesting to look at how MicroPython interns strings. There's a preprocessing step before C compilation, and new ones can also be added dynamically.

+1

@vstinner
Copy link
Member Author

I like the idea of an array and assign a global unique id when an identifier is initialized.

@vstinner
Copy link
Member Author

The problem of a generic API (I am not only talking about Py_IDENTIFIER here) which uses an array for per-interpreter variables is the memory usage if only a small part of variables are used. The advantage is that reading a variable is more efficient than using a hash table.

For Py_IDENTIFIER, in the benchmark, the hash table only has 138 entries (256 buckets), so an array is reasonable.

@vstinner
Copy link
Member Author

vstinner commented May 12, 2020

PR #20058 implements the array idea. The performance overhead is way lower, and it makes _PyUnicode_FromId() compatible with subinterpreters:

[ref] 2.35 ns +- 0.00 ns -> [array] 2.82 ns +- 0.00 ns: 1.20x slower (+20%)

Rewrote _Py_Identifier structure and _PyUnicode_FromId() function to
store Python objects in an hash table rather than a single-linked
list.

Add _PyUnicode_PreInit() to create the hash table: it must be called
before the first PyType_Ready() call.
@vstinner
Copy link
Member Author

I failed to make the hashtable as fast as an array, so I close this PR in favor of PR #20058 which uses an array.

@vstinner vstinner closed this May 19, 2020
@vstinner vstinner deleted the unicode_fromid_hashtable branch May 19, 2020 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants