
bpo-39465: Fix _PyUnicode_FromId() for subinterpreters #20058

Merged (2 commits, Dec 25, 2020)
Conversation

@vstinner commented May 12, 2020

Make _PyUnicode_FromId() function compatible with subinterpreters.
Each interpreter now has an array of identifier objects (interned
strings decoded from UTF-8).

  • Add PyInterpreterState.unicode.identifiers: array of identifier
    objects.
  • Add _PyRuntimeState.unicode_ids used to allocate unique indexes
    to _Py_Identifier.
  • Rewrite _Py_Identifier structure.
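A minimal sketch of the rewritten structure (made-up names, not the actual CPython definitions): the string is fixed at compile time and the index is assigned once at runtime, -1 meaning "not assigned yet".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch with made-up names: the rewritten identifier keeps the
   compile-time UTF-8 string plus a runtime-assigned index into each
   interpreter's identifier array (-1 means "not assigned yet"). */
typedef struct {
    const char *string;  /* UTF-8 literal, set at compile time */
    ptrdiff_t index;     /* unique index, shared by all interpreters */
} identifier_t;

/* Toy equivalent of the _Py_IDENTIFIER()/_Py_static_string_init()
   macro family: the string member is always set, the index starts
   at the "not assigned" sentinel. */
#define IDENTIFIER_INIT(str) { .string = (str), .index = -1 }

identifier_t make_identifier(const char *s)
{
    identifier_t id = IDENTIFIER_INIT(s);
    return id;
}
```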

Benchmark _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a):

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

This change adds 1 ns per _PyUnicode_FromId() call on average.

https://bugs.python.org/issue39465

PyInterpreterState *interp = _PyInterpreterState_GET();
struct _Py_unicode_ids *ids = &interp->unicode.ids;

if (id->index < 0) {
Member:

I would use 0 as the default value. It is easier to initialize static variables with 0, and it may be slightly faster to compare for equality with 0.

Member Author:


"It is easier to initialize static variables with 0"

You cannot initialize a _Py_Identifier to zeros: the string field must be set. IMO it's reasonable to require users of this API to use the _Py_static_string_init(), _Py_static_string() or _Py_IDENTIFIER() macros and not initialize members manually.

Member:


I am not sure, but it looks to me that it may be cheaper for the loader to initialize a non-constant static variable with zeros than with some other value. For a non-zero value it needs to copy it from some (read-only) place to the read-write data area. For a zero value it can just keep the initial zeros (if the data page is filled with zeros on allocation). It can save a few bytes in the executable file and address space, and a few cycles at program start time.

Member Author:


I am not sure, but it looks to me that it may be cheaper for the loader to initialize a non-constant static variable with zeros than with some other value.

_Py_Identifier is made of 2 members: string (a non-NULL constant value) and index (set once at runtime, read many times). The structure is never initialized to zeros, even if we change the default "not set" index to 0.

@vstinner

The advantage of the hash table approach (PR #20048) is that it avoids the need for a unique index: it simply uses the variable's address as the key.

@vstinner vstinner changed the title bpo-39465: Fix _PyUnicode_FromId() for subinterpreters [WIP] bpo-39465: Fix _PyUnicode_FromId() for subinterpreters May 13, 2020
@vstinner

[ref] 2.35 ns +- 0.00 ns -> [array] 2.82 ns +- 0.00 ns: 1.20x slower (+20%)

Oh. I'm no longer able to reproduce this benchmark :-( I re-ran the benchmark with gcc -O3, and then again with gcc -O3 and LTO. Both times I got similar results:

[ref] 2.35 ns +- 0.00 ns -> [array] 3.57 ns +- 0.09 ns: 1.52x slower (+52%)

I also checked: volatile has no impact on performance.

@vstinner

I rebased my PR and squashed the commits to be able to update the commit message (especially the benchmark result).

@brettcannon brettcannon removed their request for review May 13, 2020 23:23
@vstinner

I rebased my PR on master, which became Python 3.10.

I failed to make the hash table as fast as an array, so I closed PR #20048 in favor of this PR, which uses an array.

@vstinner vstinner changed the title [WIP] bpo-39465: Fix _PyUnicode_FromId() for subinterpreters bpo-39465: Fix _PyUnicode_FromId() for subinterpreters May 19, 2020
@vstinner

In PR #20048, @encukou wrote:

It might be interesting to look at how MicroPython interns strings. There's a preprocessing step before C compilation, and new ones can also be added dynamically.

and

It is better to build the objects on demand, but would it be worth it to allocate space for them at the beginning, and use build-time-constant indexes into the array?

We can use something like a pre-processor to initialize some identifiers at build time, but I'm not sure that it's a good idea to allocate space in advance for all possible identifiers. Python has a large standard library; many applications will never load some of its C extensions.

I like the approach of assigning identifiers dynamically and only allocating more space on demand. Only the first call to _PyUnicode_FromId() is slow; many C extensions even call it during their module execution function to avoid any overhead at runtime.

Let's say that Python has 200 identifier objects. With this PR, if an application only uses 10 identifiers, Python only allocates an array of 16 items instead of 200. I chose to always allocate at least 16 items to reduce the number of reallocations. We might adjust that later if needed.
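The sizing policy described above can be sketched with a hypothetical helper (the exact growth factor in real CPython may differ; this is an illustration, not the merged code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the sizing policy: start at 16 slots and
   double until the requested index fits, so the number of
   reallocations stays logarithmic in the number of identifiers. */
ptrdiff_t ids_new_capacity(ptrdiff_t current, ptrdiff_t index_needed)
{
    ptrdiff_t cap = current < 16 ? 16 : current;
    while (cap <= index_needed) {
        cap *= 2;
    }
    return cap;
}
```

A program using only 10 identifiers thus allocates 16 slots rather than one per possible identifier.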

Maybe for identifiers it's not critical, but I plan to use a similar approach for other objects which should be made per-interpreter. If we pre-allocate everything, we will likely waste a lot of memory that will never be used.

@vstinner

cc @encukou @ericsnowcurrently: Would you mind reviewing this PR?

@serhiy-storchaka: Apart from your two remarks, are you OK with the overall approach? Does the overhead sound reasonable (+1.21 ns per function call)?

If we consider that subinterpreters with a per-interpreter GIL can make Python code (written for subinterpreters) at least 4x faster on machines with at least 4 CPUs, IMO it's worth it.

@vstinner

Performance impact:

[ref] 2.35 ns +- 0.00 ns -> [array] 3.57 ns +- 0.09 ns: 1.52x slower (+52%)

Context for these numbers:

  • PyUnicode_FromString("abc"): 35.8 ns +- 0.7 ns
  • PyUnicode_InternFromString("abc"): 89.8 ns +- 1.0 ns

(copy of my #20048 (comment) comment.)

@serhiy-storchaka

Just out of curiosity, how many identifiers are allocated by python --version, python -c 'pass', python -m this, python -m test? This question is not related to this PR; I am just curious.

@vstinner

Just out of curiosity, how many identifiers are allocated by python --version, python -c 'pass', python -m this, python -m test? This question is not related to this PR; I am just curious.

$ ./python -V
(...)
ids# = 0

$ ./python -sS -c pass
ids# = 95

$ ./python -c pass
ids# = 104

$ ./python -m this
(...)
ids# = 130

(I'm running "./python -m test" which is quite long :-p)

I used this patch:

diff --git a/Modules/main.c b/Modules/main.c
index bc3a2ed8ed..fbdf418ccc 100644
--- a/Modules/main.c
+++ b/Modules/main.c
@@ -643,6 +643,7 @@ Py_RunMain(void)
         exitcode = exit_sigint();
     }
 
+fprintf(stderr, "ids# = %zd\n", _PyRuntime.unicode_ids.next_index);
     return exitcode;
 }
 
@@ -653,6 +654,7 @@ pymain_main(_PyArgv *args)
     PyStatus status = pymain_init(args);
     if (_PyStatus_IS_EXIT(status)) {
         pymain_free();
+fprintf(stderr, "ids# = %zd\n", _PyRuntime.unicode_ids.next_index);
         return status.exitcode;
     }
     if (_PyStatus_EXCEPTION(status)) {

@serhiy-storchaka left a comment

Concurrent programming without the GIL is hard.

struct _Py_unicode_ids *ids = &interp->unicode.ids;

// Copy the index since _Py_Identifier.index is declared as volatile
Py_ssize_t index = id->index;
Member:


Since reading the index is not guarded by a lock, it is possible to read the index simultaneously with another thread writing it. In such a case, half of the index bits can be old and the other half new. We need not just volatile for index, but an atomic integer instead of a plain Py_ssize_t.

@vstinner

Concurrent programming without the GIL is hard.

https://bugs.python.org/issue39465 doesn't try to remove the GIL, but to have one GIL per interpreter: see https://bugs.python.org/issue40512

@vstinner

It seems that CPython currently uses around 523 _Py_Identifier instances:

$ ./python -m test
(...)
0:48:06 load avg: 1.50 [424/424/32] test_zoneinfo
(...)
ids# = 523

@serhiy-storchaka

One GIL per interpreter does not help when working with data shared between interpreters.

@vstinner

vstinner commented May 19, 2020

One GIL per interpreter does not help when working with data shared between interpreters.

Only _PyRuntime is shared by multiple interpreters; access to _PyRuntime is protected by a new lock.

@serhiy-storchaka

Is it rt_ids->lock? It does not guard all operations on shared data.

@vstinner

Is it rt_ids->lock? It does not guard all operations on shared data.

Would you mind elaborating on which shared data is not guarded by rt_ids->lock?

Globals (shared by all interpreters):

  • _PyRuntime.unicode_ids.lock (rt_ids->lock) protects _PyRuntime.unicode_ids.next_index (rt_ids->next_index) and _Py_Identifier.index.
  • I wrote an optimistic optimization which reads _Py_Identifier.index with no lock, using volatile on _Py_Identifier.index to ensure that the compiler reads it again a few lines below (when the lock is acquired).

Per-interpreter:

  • The whole PyInterpreterState.unicode structure, including PyInterpreterState.unicode.ids (ids in the function), is guarded by the GIL.
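The locking scheme described above can be sketched as follows; pthreads and plain volatile stand in for CPython's own lock type and (in the final version) its atomic helpers, and the names are made up:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Sketch of the scheme: an optimistic unlocked read of the index,
   then a re-check under the runtime-wide lock before allocating a
   unique index exactly once. */
static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;
static ptrdiff_t runtime_next_index = 0;

ptrdiff_t identifier_get_index(volatile ptrdiff_t *index)
{
    ptrdiff_t i = *index;              /* optimistic read, no lock */
    if (i < 0) {
        pthread_mutex_lock(&runtime_lock);
        i = *index;                    /* re-read under the lock */
        if (i < 0) {
            i = runtime_next_index++;  /* assigned exactly once */
            *index = i;
        }
        pthread_mutex_unlock(&runtime_lock);
    }
    return i;
}
```

Once the index is set, the fast path never touches the lock; only the first call per identifier pays for it.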

@serhiy-storchaka

My apologies. You are right: now I see that PyInterpreterState.unicode does not need the global lock. I should have checked twice.

But I think there is still a problem with the non-atomic _Py_Identifier.index. It can cause rare, hard-to-reproduce bugs.

@vstinner left a comment


Oops, note to myself: I must revert the _testcapi changes; they are only there for benchmarks.

@ericsnowcurrently left a comment


Other than globally-locking around id->index before we're sure it's been set, LGTM.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@vstinner

I wrote PR #20390 to check whether the C11 _Atomic specifier could be used in Include/cpython/object.h. Sadly, the MSC compiler (Visual Studio) doesn't support it :-(

@vstinner

But I think there is still a problem with non-atomic _Py_Identifier.index. It can cause rare, hard to reproduce bugs.

I could modify my PR to only access _Py_Identifier.index while the runtime lock is acquired. The problem is that it may become a new performance bottleneck if many threads of different subinterpreters call _PyUnicode_FromId() in parallel: threads would have to execute _PyUnicode_FromId() sequentially rather than in parallel. The code protected by the lock is very short and very fast, so maybe it's not an issue?

A "global" lock for all identifiers may defeat the purpose of per-interpreter GIL. Well, at least, it makes _PyUnicode_FromId() "less parallel" :-)

If the C11 _Atomic specifier cannot be used, maybe we can identify a subset of functions available on all C compilers supported by CPython. For example, MSC (Visual Studio) provides "Interlocked" functions for atomic operations on LONG or 64-bit variables: https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-functions?redirectedfrom=MSDN

@vstinner

Another alternative would be a read/write lock, which allows parallel read access.

@vstinner vstinner requested a review from a team as a code owner June 9, 2020 18:29
@vstinner

vstinner commented Jun 9, 2020

I wrote PR #20766, which adds functions to access variables atomically without having to declare the variables as atomic.

I rebased this PR on master and included PR #20766 in this PR to access _Py_Identifier.index atomically.

Microbenchmark on the PR using atomic functions:

$ python3 -m pyperf compare_to ref.json atomic.json 
fromid a: Mean +- std dev: [ref] 2.38 ns +- 0.01 ns -> [atomic] 4.08 ns +- 0.01 ns: 1.71x slower (+71%)
fromid abc: Mean +- std dev: [ref] 2.38 ns +- 0.00 ns -> [atomic] 3.99 ns +- 0.01 ns: 1.68x slower (+68%)

It seems that reading _Py_Identifier.index doesn't use any memory fence; it's just a regular MOV on x86. So the fast path doesn't pay any overhead for the atomic read.
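The helpers from PR #20766 can be approximated like this (a sketch with made-up names: GCC/Clang builtins when available, a volatile fallback otherwise; the real header also covers MSVC):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the pycore_atomic_funcs.h approach: sequentially-consistent
   loads/stores via compiler builtins when available, with a volatile
   fallback that is not truly atomic but is harmless while all
   interpreters still share a single GIL. */
static inline ptrdiff_t atomic_ssize_get(ptrdiff_t *var)
{
#if defined(__GNUC__) || defined(__clang__)
    return __atomic_load_n(var, __ATOMIC_SEQ_CST);
#else
    volatile ptrdiff_t *v = (volatile ptrdiff_t *)var;
    return *v;
#endif
}

static inline void atomic_ssize_set(ptrdiff_t *var, ptrdiff_t value)
{
#if defined(__GNUC__) || defined(__clang__)
    __atomic_store_n(var, value, __ATOMIC_SEQ_CST);
#else
    volatile ptrdiff_t *v = (volatile ptrdiff_t *)var;
    *v = value;
#endif
}
```

On x86, a sequentially-consistent load compiles to a plain MOV, which is why the read side shows no extra cost.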

@vstinner

vstinner commented Jun 9, 2020

Currently, my PR uses the int type for _Py_Identifier.index. That requires a sanity check at runtime:

            if (rt_ids->next_index > INT_MAX) {
                Py_FatalError("_Py_Identifier index overflow");
            }

But Py_ssize_t could be used if PR #20766 were extended to support more types (e.g. int, Py_ssize_t, void*). I started with int since it's common and easy to implement.

@vstinner

I plan to merge this PR in the next few days. cc @serhiy-storchaka @ericsnowcurrently

IMO the latest version of the PR is now correct (no race condition) and its performance slowdown is acceptable.


This PR has a long history:

  • First, I tried to use a hash table: PR [WIP] bpo-39465: _PyUnicode_FromId() now uses a hash table #20048. I abandoned this approach since the performance overhead was too high.
  • I wrote this PR to add an array to PyInterpreterState, using a lock to prevent a race condition when allocating a unique index to an identifier variable.
  • Problem: @serhiy-storchaka was concerned about a non-atomic read of the identifier index (reading the index value may require multiple CPU instructions and so be non-atomic).
  • I tried to declare the index with _Atomic in PR [WIP] bpo-39465: Mark _Py_Identifier.object as atomic #20390 but the compilation failed on Windows (MSC).
  • I added pycore_atomic_funcs.h header (PR bpo-39465: Add pycore_atomic_funcs.h internal header #20766) to provide atomic get/set functions on Py_ssize_t. The implementation uses builtin atomic functions if available, or falls back on the volatile keyword.
  • I updated this PR to use _Py_atomic_size_get() and _Py_atomic_size_set(). I also replaced "int index" with "Py_ssize_t index" (which avoids the Py_FatalError() call if index is greater than INT_MAX).

I rebased this PR on master and re-ran the benchmarks:

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

It adds 1 ns per _PyUnicode_FromId() call on average. IMO that's reasonable, and no better approach was found to fix https://bugs.python.org/issue39465 (fix _PyUnicode_FromId() for subinterpreters).

Context for these numbers:

  • PyUnicode_FromString("abc"): 35.8 ns +- 0.7 ns
  • PyUnicode_InternFromString("abc"): 89.8 ns +- 1.0 ns

I already pushed the non-controversial changes to make this PR as short as possible (to ease reviews).


About the volatile fallback for atomic functions: if the functions are not atomic, regular Python is not affected; only hypothetical subinterpreters running with one GIL per interpreter would be affected. Today, all interpreters still share a single GIL, and so it's OK if _Py_atomic_size_get() is not atomic in practice ;-) If there are real bugs, I suggest attempting to fix _Py_atomic_size_get() and _Py_atomic_size_set() rather than trying to fix _PyUnicode_FromId().

@vstinner vstinner dismissed ericsnowcurrently’s stale review December 23, 2020 03:22

I fixed the issue spotted by Eric

@vstinner

Oh, running support.run_in_subinterp("") leaks 100 references with this change. I have to investigate why. Example:

$ ./python -m test -R 3:3 test_atexit -m test_callbacks_leak 
test_atexit leaked [100, 100, 100] references, sum=300
test_atexit leaked [1, 1, 1] memory blocks, sum=3

@vstinner

Oh, running support.run_in_subinterp("") leaks 100 references with this change.

It was a mistake during my latest rebase. It's now fixed.

@vstinner

This PR is needed to fix https://bugs.python.org/issue40521 : see PR #20085 "Per-interpreter interned strings".

@YannickJadoul

YannickJadoul commented Jan 10, 2021

Bisecting history, git tells me this PR is causing an issue, detected in pybind11's embedding tests (pybind/pybind11#2774): https://bugs.python.org/issue42882

I'm happy to debug or help out, but I don't immediately see how to approach this easily.

adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
6 participants