bpo-39465: Fix _PyUnicode_FromId() for subinterpreters #20058
Conversation
Objects/unicodeobject.c (outdated)

    PyInterpreterState *interp = _PyInterpreterState_GET();
    struct _Py_unicode_ids *ids = &interp->unicode.ids;

    if (id->index < 0) {
I would use 0 as the default value. It is easier to initialize static variables with 0, and it may be slightly faster to compare for equality with 0.
"It is easier to initialize static variables with 0"
You cannot initialize a _Py_Identifier to zeros: the string field must be set. IMO it's reasonable to require users of this API to use the _Py_static_string_init(), _Py_static_string() or _Py_IDENTIFIER() macros and not initialize members manually.
I am not sure, but it looks to me that it may be cheaper for the loader to initialize a non-constant static variable with zeros than with some other value. For a non-zero value it needs to copy it from some (read-only) place to the read-write data area. For a zero value it can just keep the initial zeros (if the data page is filled with zeros on allocation). It can save a few bytes in the executable file and address space, and a few cycles at program start time.
I am not sure, but it looks to me that it may be cheaper for the loader to initialize non-constant static variable with zeros than with some other value.
_Py_Identifier is made of 2 members: string (non-NULL constant value) and index (set once at runtime, read many times). The structure is never initialized to zeros, even if we change the default "not set" index to 0.
The advantage of the hash table approach (PR #20048) is that it avoids the need for a unique index: it simply uses the variable's address as the key.
Oh. I'm no longer able to reproduce this benchmark :-( I re-ran the benchmark with gcc -O3, and then again with gcc -O3 and LTO. Both times I got similar results:
I also checked: volatile has no impact on performance.
I rebased my PR and squashed commits to be able to update the commit message (especially the benchmark result).
I rebased my PR on master, which became Python 3.10. I failed to make the hashtable as fast as an array, so I closed PR #20048 in favor of this PR, which uses an array.
We could use something like a pre-processor to initialize some identifiers at build time, but I'm not sure that it's a good idea to allocate space in advance for all possible identifiers. Python has a large standard library; many applications will never load some C extensions. I like the approach of assigning identifiers dynamically and only allocating more space on demand.

Let's say that Python has 200 identifier objects. With this PR, if an application only uses 10 identifiers, Python only allocates an array of 16 items, instead of 200 items. I chose to always allocate at least 16 items, to reduce the number of realloc() calls. We might adjust that later if needed.

Maybe for identifiers it's not critical, but I plan to use a similar approach for other objects which should be made "per-interpreter". If we pre-allocate "everything", we will likely waste a lot of memory which will never be used.
cc @encukou @ericsnowcurrently: Would you mind reviewing this PR? @serhiy-storchaka: Apart from your two remarks, are you ok with the overall approach? Does it sound like a reasonable overhead (+1.21 ns per function call)? If we consider that subinterpreters with a per-interpreter GIL can make Python (code written for subinterpreters) at least 4x faster on machines with at least 4 CPUs, IMO it's worth it.
Performance impact:
Context for these numbers:
(copy of my #20048 (comment) comment.)
Just out of curiosity, how many identifiers are allocated by
(I'm running "./python -m test", which is quite long :-p) I used this patch:
Concurrent programming without GIL is hard.
Objects/unicodeobject.c (outdated)

    struct _Py_unicode_ids *ids = &interp->unicode.ids;

    // Copy the index since _Py_Identifier.index is declared as volatile
    Py_ssize_t index = id->index;
Since reading the index is not guarded by a lock, it is possible that we read the index simultaneously with writing it in another thread. In that case, half of the index bits can be old and the other half new. We need to not just add volatile for index, but use an atomic integer instead of a plain Py_ssize_t.
https://bugs.python.org/issue39465 doesn't try to remove the GIL, but to have one GIL per interpreter: see https://bugs.python.org/issue40512
It seems like currently, CPython uses around 523 _Py_Identifier instances:
One GIL per interpreter does not help when working with data shared between interpreters.
Only _PyRuntime is shared by multiple interpreters: access to _PyRuntime is protected by a new lock.
Is it
Would you mind elaborating which shared data is not guarded by rt_ids->lock? Globals (shared by all interpreters):
Per-interpreter:
My apologies. You are right, now I see that. But I think there is still a problem with the non-atomic
Oops, note for myself: I must revert the _testcapi changes, only there for benchmarks.
Other than globally locking around id->index before we're sure it's been set, LGTM.
When you're done making the requested changes, leave the comment:
I wrote PR #20390 to check if C11
I could modify my PR to only access _Py_Identifier.index when the runtime lock is acquired. The problem is that it may become a new performance bottleneck if many threads of different subinterpreters call _PyUnicode_FromId() in parallel: threads would have to execute _PyUnicode_FromId() sequentially rather than being able to run it in parallel. The code protected by the lock is very short and very fast, so maybe it's not an issue? Still, a "global" lock for all identifiers may defeat the purpose of the per-interpreter GIL. Well, at least it makes _PyUnicode_FromId() "less parallel" :-)

If the C11 _Atomic specifier cannot be used, maybe we can identify a subset of atomic functions available on all C compilers supported by CPython. For example, MSC (Visual Studio) provides "Interlocked" functions for atomic operations on LONG or on 64-bit variables: https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-functions?redirectedfrom=MSDN
Another alternative is to use a Read/Write lock which allows parallel read access:
I wrote PR #20766 which adds functions to access variables atomically without having to declare variables as atomic. I rebased this PR on master and included PR #20766 in this PR to access _Py_Identifier.index atomically. Microbenchmark on the PR using atomic functions:
It seems like reading _Py_Identifier.index doesn't use any memory fence: it's just a regular MOV on x86. So the fast path doesn't pay any overhead for an atomic read.
Currently, my PR uses
But
Make _PyUnicode_FromId() function compatible with subinterpreters. Each interpreter now has an array of identifier objects (interned strings decoded from UTF-8).

* Add PyInterpreterState.unicode.identifiers: array of identifier objects.
* Add _PyRuntimeState.unicode_ids used to allocate unique indexes to _Py_Identifier.
* Rewrite the _Py_Identifier structure.

Benchmark of _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a):

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

This change adds 1 ns per _PyUnicode_FromId() call on average.
I plan to merge this PR in the next few days. cc @serhiy-storchaka @ericsnowcurrently IMO the latest version of the PR is now correct (no race condition) and its performance slowdown is acceptable. This PR has a long history:
I rebased this PR on master and re-ran the benchmarks:
It adds 1 ns per _PyUnicode_FromId() call on average. IMO it's reasonable, and no better approach was found to fix https://bugs.python.org/issue39465 (fix _PyUnicode_FromId() for subinterpreters). Context for these numbers:
I already pushed the non-controversial changes to make this PR as short as possible (to ease reviews). About the
I fixed the issue spotted by Eric
Oh, running
It was a mistake during my latest rebase. It's now fixed.
This PR is needed to fix https://bugs.python.org/issue40521: see PR #20085 "Per-interpreter interned strings".
Bisecting history, git tells me this PR is causing an issue detected in pybind11's embedding tests (pybind/pybind11#2774): https://bugs.python.org/issue42882. I'm happy to debug or help out, but I don't immediately see how to approach this easily.
Make _PyUnicode_FromId() function compatible with subinterpreters.
Each interpreter now has an array of identifier objects (interned
strings decoded from UTF-8).

* Add PyInterpreterState.unicode.identifiers: array of identifier
  objects.
* Add _PyRuntimeState.unicode_ids used to allocate unique indexes
  to _Py_Identifier.
* Rewrite the _Py_Identifier structure.

Microbenchmark on _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a):

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

This change adds 1 ns per _PyUnicode_FromId() call on average.

https://bugs.python.org/issue39465