
bpo-39465: Fix _PyUnicode_FromId() for subinterpreters #20058

Merged (2 commits, Dec 25, 2020)
Conversation

@vstinner commented May 12, 2020

Make _PyUnicode_FromId() function compatible with subinterpreters.
Each interpreter now has an array of identifier objects (interned
strings decoded from UTF-8).

  • Add PyInterpreterState.unicode.identifiers: array of identifier
    objects.
  • Add _PyRuntimeState.unicode_ids used to allocate unique indexes
    to _Py_Identifier.
  • Rewrite _Py_Identifier structure.
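A minimal sketch of the rewritten structure (made-up names, not the actual CPython definitions): the string is fixed at compile time and the index is assigned once at runtime, -1 meaning "not assigned yet".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch with made-up names: the rewritten identifier keeps the
   compile-time UTF-8 string plus a runtime-assigned index into each
   interpreter's identifier array (-1 means "not assigned yet"). */
typedef struct {
    const char *string;  /* UTF-8 literal, set at compile time */
    ptrdiff_t index;     /* unique index, shared by all interpreters */
} identifier_t;

/* Toy equivalent of the _Py_IDENTIFIER()/_Py_static_string_init()
   macro family: the string member is always set, the index starts
   at the "not assigned" sentinel. */
#define IDENTIFIER_INIT(str) { .string = (str), .index = -1 }

identifier_t make_identifier(const char *s)
{
    identifier_t id = IDENTIFIER_INIT(s);
    return id;
}
```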

Benchmark _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a):

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

This change adds 1 ns per _PyUnicode_FromId() call on average.

https://bugs.python.org/issue39465

PyInterpreterState *interp = _PyInterpreterState_GET();
struct _Py_unicode_ids *ids = &interp->unicode.ids;

if (id->index < 0) {
Member:

I would use 0 as the default value. It is easier to initialize static variables with 0, and it may be slightly faster to compare for equality with 0.

Member Author:


"It is easier to initialize static variables with 0"

You cannot initialize a _Py_Identifier to zeros: the string field must be set. IMO it's reasonable to require users of this API to use the _Py_static_string_init(), _Py_static_string() or _Py_IDENTIFIER() macros and not initialize members manually.

Member:


I am not sure, but it looks to me that it may be cheaper for the loader to initialize a non-constant static variable with zeros than with some other value. For a non-zero value it needs to copy it from some (read-only) place to the read-write data area. For a zero value it can just keep the initial zeros (if the data page is filled with zeros on allocation). It can save a few bytes in the executable file and address space, and a few cycles at program start time.

Member Author:


I am not sure, but it looks to me that it may be cheaper for the loader to initialize a non-constant static variable with zeros than with some other value.

_Py_Identifier is made of 2 members: string (a non-NULL constant value) and index (set once at runtime, read many times). The structure is never initialized to zeros, even if we change the default "not set" index to 0.

@vstinner

The advantage of the hash table approach (PR #20048) is that it avoids the need for a unique index: it simply uses the variable's address as the key.

@vstinner vstinner changed the title bpo-39465: Fix _PyUnicode_FromId() for subinterpreters [WIP] bpo-39465: Fix _PyUnicode_FromId() for subinterpreters May 13, 2020
@vstinner

[ref] 2.35 ns +- 0.00 ns -> [array] 2.82 ns +- 0.00 ns: 1.20x slower (+20%)

Oh. I'm no longer able to reproduce this benchmark :-( I re-ran the benchmark with gcc -O3, and then again with gcc -O3 and LTO. Both times I got similar results:

[ref] 2.35 ns +- 0.00 ns -> [array] 3.57 ns +- 0.09 ns: 1.52x slower (+52%)

I also checked: volatile has no impact on performance.

@vstinner

I rebased my PR and squashed the commits to be able to update the commit message (especially the benchmark result).

@brettcannon brettcannon removed their request for review May 13, 2020 23:23
@vstinner

I rebased my PR on master, which became Python 3.10.

I failed to make the hash table as fast as an array, so I closed PR #20048 in favor of this PR, which uses an array.

@vstinner vstinner changed the title [WIP] bpo-39465: Fix _PyUnicode_FromId() for subinterpreters bpo-39465: Fix _PyUnicode_FromId() for subinterpreters May 19, 2020
@vstinner

In PR #20048, @encukou wrote:

It might be interesting to look at how MicroPython interns strings. There's a preprocessing step before C compilation, and new ones can also be added dynamically.

and

It is better to build the objects on demand, but would it be worth it to allocate space for them at the beginning, and use build-time-constant indexes into the array?

We can use something like a pre-processor to initialize some identifiers at build time, but I'm not sure that it's a good idea to allocate space in advance for all possible identifiers. Python has a large standard library; many applications will never load some of its C extensions.

I like the approach of assigning identifiers dynamically and only allocating more space on demand. Only the first call to _PyUnicode_FromId() is slow; many C extensions even call it during their module execution function to avoid any overhead at runtime.

Let's say that Python has 200 identifier objects. With this PR, if an application only uses 10 identifiers, Python only allocates an array of 16 items instead of 200. I chose to always allocate at least 16 items to reduce the number of reallocations. We might adjust that later if needed.
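The sizing policy described above can be sketched with a hypothetical helper (the exact growth factor in real CPython may differ; this is an illustration, not the merged code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the sizing policy: start at 16 slots and
   double until the requested index fits, so the number of
   reallocations stays logarithmic in the number of identifiers. */
ptrdiff_t ids_new_capacity(ptrdiff_t current, ptrdiff_t index_needed)
{
    ptrdiff_t cap = current < 16 ? 16 : current;
    while (cap <= index_needed) {
        cap *= 2;
    }
    return cap;
}
```

A program using only 10 identifiers thus allocates 16 slots rather than one per possible identifier.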

Maybe for identifiers it's not critical, but I plan to use a similar approach for other objects which should be made per-interpreter. If we pre-allocate everything, we will likely waste a lot of memory that will never be used.

@vstinner

cc @encukou @ericsnowcurrently: Would you mind reviewing this PR?

@serhiy-storchaka: Apart from your two remarks, are you OK with the overall approach? Does the overhead sound reasonable (+1.21 ns per function call)?

If we consider that subinterpreters with a per-interpreter GIL can make Python code (written for subinterpreters) at least 4x faster on machines with at least 4 CPUs, IMO it's worth it.

@vstinner

Performance impact:

[ref] 2.35 ns +- 0.00 ns -> [array] 3.57 ns +- 0.09 ns: 1.52x slower (+52%)

Context for these numbers:

  • PyUnicode_FromString("abc"): 35.8 ns +- 0.7 ns
  • PyUnicode_InternFromString("abc"): 89.8 ns +- 1.0 ns

(copy of my #20048 (comment) comment.)

@serhiy-storchaka

Just out of curiosity, how many identifiers are allocated by python --version, python -c 'pass', python -m this, python -m test? This question is not related to this PR; I am just curious.

@vstinner

Just out of curiosity, how many identifiers are allocated by python --version, python -c 'pass', python -m this, python -m test? This question is not related to this PR; I am just curious.

$ ./python -V
(...)
ids# = 0

$ ./python -sS -c pass
ids# = 95

$ ./python -c pass
ids# = 104

$ ./python -m this
(...)
ids# = 130

(I'm running "./python -m test" which is quite long :-p)

I used this patch:

diff --git a/Modules/main.c b/Modules/main.c
index bc3a2ed8ed..fbdf418ccc 100644
--- a/Modules/main.c
+++ b/Modules/main.c
@@ -643,6 +643,7 @@ Py_RunMain(void)
         exitcode = exit_sigint();
     }
 
+fprintf(stderr, "ids# = %zd\n", _PyRuntime.unicode_ids.next_index);
     return exitcode;
 }
 
@@ -653,6 +654,7 @@ pymain_main(_PyArgv *args)
     PyStatus status = pymain_init(args);
     if (_PyStatus_IS_EXIT(status)) {
         pymain_free();
+fprintf(stderr, "ids# = %zd\n", _PyRuntime.unicode_ids.next_index);
         return status.exitcode;
     }
     if (_PyStatus_EXCEPTION(status)) {

@serhiy-storchaka left a comment

Concurrent programming without the GIL is hard.

struct _Py_unicode_ids *ids = &interp->unicode.ids;

// Copy the index since _Py_Identifier.index is declared as volatile
Py_ssize_t index = id->index;
Member:


Since reading the index is not guarded by a lock, it is possible to read the index simultaneously with another thread writing it. In such a case, half of the index bits can be old and the other half new. We need not just volatile for index, but an atomic integer instead of a plain Py_ssize_t.

@vstinner

Concurrent programming without the GIL is hard.

https://bugs.python.org/issue39465 doesn't try to remove the GIL, but to have one GIL per interpreter: see https://bugs.python.org/issue40512

@vstinner

It seems that CPython currently uses around 523 _Py_Identifier instances:

$ ./python -m test
(...)
0:48:06 load avg: 1.50 [424/424/32] test_zoneinfo
(...)
ids# = 523

@serhiy-storchaka

One GIL per interpreter does not help when working with data shared between interpreters.

@vstinner

vstinner commented May 19, 2020

One GIL per interpreter does not help when working with data shared between interpreters.

Only _PyRuntime is shared by multiple interpreters; access to _PyRuntime is protected by a new lock.

@serhiy-storchaka

Is it rt_ids->lock? It does not guard all operations on shared data.

@vstinner

Is it rt_ids->lock? It does not guard all operations on shared data.

Would you mind elaborating on which shared data is not guarded by rt_ids->lock?

Globals (shared by all interpreters):

  • _PyRuntime.unicode_ids.lock (rt_ids->lock) protects _PyRuntime.unicode_ids.next_index (rt_ids->next_index) and _Py_Identifier.index.
  • I wrote an optimistic optimization which reads _Py_Identifier.index with no lock, using volatile on _Py_Identifier.index to ensure that the compiler reads it again a few lines below (when the lock is acquired).

Per-interpreter:

  • The whole PyInterpreterState.unicode structure, including PyInterpreterState.unicode.ids (ids in the function), is guarded by the GIL.
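The locking scheme described above can be sketched as follows; pthreads and plain volatile stand in for CPython's own lock type and (in the final version) its atomic helpers, and the names are made up:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Sketch of the scheme: an optimistic unlocked read of the index,
   then a re-check under the runtime-wide lock before allocating a
   unique index exactly once. */
static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;
static ptrdiff_t runtime_next_index = 0;

ptrdiff_t identifier_get_index(volatile ptrdiff_t *index)
{
    ptrdiff_t i = *index;              /* optimistic read, no lock */
    if (i < 0) {
        pthread_mutex_lock(&runtime_lock);
        i = *index;                    /* re-read under the lock */
        if (i < 0) {
            i = runtime_next_index++;  /* assigned exactly once */
            *index = i;
        }
        pthread_mutex_unlock(&runtime_lock);
    }
    return i;
}
```

Once the index is set, the fast path never touches the lock; only the first call per identifier pays for it.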

@serhiy-storchaka

My apologies. You are right: now I see that PyInterpreterState.unicode does not need the global lock. I should have checked twice.

But I think there is still a problem with the non-atomic _Py_Identifier.index. It can cause rare, hard-to-reproduce bugs.

@vstinner left a comment


Oops, note to myself: I must revert the _testcapi changes; they are only there for benchmarks.

@ericsnowcurrently left a comment


Other than globally-locking around id->index before we're sure it's been set, LGTM.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@vstinner

I wrote PR #20390 to check whether the C11 _Atomic specifier could be used in Include/cpython/object.h. Sadly, the MSC compiler (Visual Studio) doesn't support it :-(

@vstinner

But I think there is still a problem with non-atomic _Py_Identifier.index. It can cause rare, hard to reproduce bugs.

I could modify my PR to only access _Py_Identifier.index while the runtime lock is acquired. The problem is that it may become a new performance bottleneck if many threads of different subinterpreters call _PyUnicode_FromId() in parallel: threads would have to execute _PyUnicode_FromId() sequentially rather than in parallel. The code protected by the lock is very short and very fast, so maybe it's not an issue?

A "global" lock for all identifiers may defeat the purpose of per-interpreter GIL. Well, at least, it makes _PyUnicode_FromId() "less parallel" :-)

If the C11 _Atomic specifier cannot be used, maybe we can identify a subset of functions available on all C compilers supported by CPython. For example, MSC (Visual Studio) provides "Interlocked" functions for atomic operations on LONG or 64-bit variables: https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-functions?redirectedfrom=MSDN

@vstinner

Another alternative would be a read/write lock, which allows parallel read access.

@vstinner vstinner requested a review from a team as a code owner June 9, 2020 18:29
@vstinner

vstinner commented Jun 9, 2020

I wrote PR #20766, which adds functions to access variables atomically without having to declare the variables as atomic.

I rebased this PR on master and included PR #20766 in this PR to access _Py_Identifier.index atomically.

Microbenchmark on the PR using atomic functions:

$ python3 -m pyperf compare_to ref.json atomic.json 
fromid a: Mean +- std dev: [ref] 2.38 ns +- 0.01 ns -> [atomic] 4.08 ns +- 0.01 ns: 1.71x slower (+71%)
fromid abc: Mean +- std dev: [ref] 2.38 ns +- 0.00 ns -> [atomic] 3.99 ns +- 0.01 ns: 1.68x slower (+68%)

It seems that reading _Py_Identifier.index doesn't use any memory fence; it's just a regular MOV on x86. So the fast path doesn't pay any overhead for the atomic read.
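The helpers from PR #20766 can be approximated like this (a sketch with made-up names: GCC/Clang builtins when available, a volatile fallback otherwise; the real header also covers MSVC):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the pycore_atomic_funcs.h approach: sequentially-consistent
   loads/stores via compiler builtins when available, with a volatile
   fallback that is not truly atomic but is harmless while all
   interpreters still share a single GIL. */
static inline ptrdiff_t atomic_ssize_get(ptrdiff_t *var)
{
#if defined(__GNUC__) || defined(__clang__)
    return __atomic_load_n(var, __ATOMIC_SEQ_CST);
#else
    volatile ptrdiff_t *v = (volatile ptrdiff_t *)var;
    return *v;
#endif
}

static inline void atomic_ssize_set(ptrdiff_t *var, ptrdiff_t value)
{
#if defined(__GNUC__) || defined(__clang__)
    __atomic_store_n(var, value, __ATOMIC_SEQ_CST);
#else
    volatile ptrdiff_t *v = (volatile ptrdiff_t *)var;
    *v = value;
#endif
}
```

On x86, a sequentially-consistent load compiles to a plain MOV, which is why the read side shows no extra cost.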

@vstinner

vstinner commented Jun 9, 2020

Currently, my PR uses the int type for _Py_Identifier.index. That requires a sanity check at runtime:

            if (rt_ids->next_index > INT_MAX) {
                Py_FatalError("_Py_Identifier index overflow");
            }

But Py_ssize_t could be used if PR #20766 were extended to support more types (e.g. int, Py_ssize_t, void*). I started with int since it's common and easy to implement.

@vstinner

I plan to merge this PR in the next few days. cc @serhiy-storchaka @ericsnowcurrently

IMO the latest version of the PR is now correct (no race condition) and its performance slowdown is acceptable.


This PR has a long history:

  • First, I tried to use a hash table: PR [WIP] bpo-39465: _PyUnicode_FromId() now uses a hash table #20048. I abandoned this approach since the performance overhead was too high.
  • I wrote this PR to add an array to PyInterpreterState, using a lock to prevent a race condition when allocating a unique index to an identifier variable.
  • Problem: @serhiy-storchaka was concerned about a non-atomic read of the identifier index (reading the index value may require multiple CPU instructions and so be non-atomic).
  • I tried to declare the index with _Atomic in PR [WIP] bpo-39465: Mark _Py_Identifier.object as atomic #20390 but the compilation failed on Windows (MSC).
  • I added pycore_atomic_funcs.h header (PR bpo-39465: Add pycore_atomic_funcs.h internal header #20766) to provide atomic get/set functions on Py_ssize_t. The implementation uses builtin atomic functions if available, or falls back on the volatile keyword.
  • I updated this PR to use _Py_atomic_size_get() and _Py_atomic_size_set(). I also replaced "int index" with "Py_ssize_t index" (which avoids the Py_FatalError() call if index is greater than INT_MAX).

I rebased this PR on master and re-ran the benchmarks:

[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower

It adds 1 ns per _PyUnicode_FromId() call on average. IMO that's reasonable, and no better approach was found to fix https://bugs.python.org/issue39465 (fix _PyUnicode_FromId() for subinterpreters).

Context for these numbers:

  • PyUnicode_FromString("abc"): 35.8 ns +- 0.7 ns
  • PyUnicode_InternFromString("abc"): 89.8 ns +- 1.0 ns

I already pushed the non-controversial changes to make this PR as short as possible (to ease reviews).


About the volatile fallback for atomic functions: if the functions are not atomic, regular Python is not affected; only hypothetical subinterpreters running with one GIL per interpreter would be affected. Today, all interpreters still share a single GIL, and so it's OK if _Py_atomic_size_get() is not atomic in practice ;-) If there are real bugs, I suggest attempting to fix _Py_atomic_size_get() and _Py_atomic_size_set() rather than trying to fix _PyUnicode_FromId().

@vstinner vstinner dismissed ericsnowcurrently’s stale review December 23, 2020 03:22

I fixed the issue spotted by Eric

@vstinner

Oh, running support.run_in_subinterp("") leaks 100 references with this change. I have to investigate why. Example:

$ ./python -m test -R 3:3 test_atexit -m test_callbacks_leak 
test_atexit leaked [100, 100, 100] references, sum=300
test_atexit leaked [1, 1, 1] memory blocks, sum=3

@vstinner

Oh, running support.run_in_subinterp("") leaks 100 references with this change.

It was a mistake during my latest rebase. It's now fixed.

@vstinner

This PR is needed to fix https://bugs.python.org/issue40521 : see PR #20085 "Per-interpreter interned strings".

@YannickJadoul

YannickJadoul commented Jan 10, 2021

Bisecting history, git tells me this PR is causing an issue, detected in pybind11's embedding tests (pybind/pybind11#2774): https://bugs.python.org/issue42882

I'm happy to debug or help out, but I don't immediately see how to approach this easily.

adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
6 participants