-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for watching writes to selected dictionaries #91052
Comments
CPython extensions providing optimized execution of Python bytecode (e.g. the Cinder JIT), or even CPython itself (e.g. the faster-cpython project) may wish to inline-cache access to frequently-read and rarely-changed namespaces, e.g. module globals. Rather than requiring a dict version guard on every cached read, the best-performing way to do this is is to mark the dictionary as “watched” and set a callback on writes to watched dictionaries. This optimizes the cached-read fast-path at a small cost to the (relatively infrequent and usually less perf sensitive) write path. We have an implementation of this in Cinder ( https://docs.google.com/document/d/1l8I-FDE1xrIShm9eSNJqsGmY_VanMDX5-aK_gujhYBI/edit#heading=h.n2fcxgq6ypwl ), used already by the Cinder JIT and its specializing interpreter. We would like to make the Cinder JIT available as a third-party extension to CPython ( https://docs.google.com/document/d/1l8I-FDE1xrIShm9eSNJqsGmY_VanMDX5-aK_gujhYBI/ ), and so we are interested in adding dict watchers to core CPython. The intention in this issue is not to add any specific optimization or cache (yet); just the ability to mark a dictionary as “watched” and set a write callback. The callback will be global, not per-dictionary (no extra function pointer stored in every dict). CPython will track only one global callback; it is a well-behaved client’s responsibility to check if a callback is already set when setting a new one, and daisy-chain to the previous callback if so. Given that multiple clients may mark dictionaries as watched, a dict watcher callback may receive events for dictionaries that were marked as watched by other clients, and should handle this gracefully. There is no provision in the API for “un-watching” a watched dictionary; such an API could not be used safely in the face of potentially multiple dict-watching clients. The Cinder implementation marks dictionaries as watched using the least bit of the dictionary version (so version increments by 2); this also avoids any additional memory usage for marking a dict as watched. Initial proposed API, comments welcome: // Mark given dictionary as "watched" (global callback will be called if it is modified)
void PyDict_Watch(PyObject* dict);
// Check if given dictionary is already watched
int PyDict_IsWatched(PyObject* dict);
typedef enum {
PYDICT_EVENT_CLEARED,
PYDICT_EVENT_DEALLOCED,
PYDICT_EVENT_MODIFIED
} PyDict_WatchEvent;
// Callback to be invoked when a watched dict is cleared, dealloced, or modified.
// In clear/dealloc case, key and new_value will be NULL. Otherwise, new_value will be the
// new value for key, NULL if key is being deleted.
typedef void(*PyDict_WatchCallback)(PyDict_WatchEvent event, PyObject* dict, PyObject* key, PyObject* new_value);
// Set new global watch callback; supply NULL to clear callback
void PyDict_SetWatchCallback(PyDict_WatchCallback callback);
// Get existing global watch callback
PyDict_WatchCallback PyDict_GetWatchCallback(); The callback will be called immediately before the modification to the dict takes effect, thus the callback will also have access to the prior state of the dict. |
At first quick glance, this makes sense and the API looks reasonable. Question: what happens on interpreter shutdown? Shutdown obviously finalized and clears out most all dicts. I guess the C callback simply gets called for each of these? That makes sense. Just wondering if there are any ramifications of that. The callback is in C so it shouldn't have issues with this. A pyperformance suite run on an interpreter modified to support this but having no callbacks registered would be useful. (basically judging if there is measurable overhead added by the watched bit check - I doubt it'll be noticeable in most code) |
Thanks gps! Working on a PR and will collect pyperformance data as well. We haven't observed any issues in Cinder with the callback just being called at shutdown, too, but if there are problems with that it should be possible to just have CPython clear the callback at shutdown time. |
Hm, this is a bit scary. Could we (or others) end up with unguarded stale caches if some buggy extension forgets to chain the calls correctly? Core CPython seems most at risk of this, since we would most likely be registered first. |
Also, when you say "only one global callback": does that mean per-interpreter, or per-process? |
Yes. I can really go either way on this. I initially opted for simplicity in the core support at the cost of asking a bit more of clients, on the theory that a) there are lots of ways for a buggy C extension to cause crashes with bad use of the C API, and b) I don't expect there to be very many extensions using this API. But it's also true that the consequences of a mistake here could be hard to debug (and easily blamed to the wrong place), and there might turn out to be more clients for dict-watching than I expect! If the consensus is to prefer CPython tracking an array of callbacks instead, we can try that.
Good question! The currently proposed API suggests per-process, but it's not a question I've given a lot of thought to yet; open to suggestions. It seems like in general the preference is to avoid global state and instead tie things to an interpreter instance? I'll need to do a bit of research to understand exactly how that would affect the implementation. Doesn't seem like it should be a problem, though it might make the lookup at write time to see if we have a callback a bit slower. |
Per interpreter seems best. If someone using this feature writes a buggy implementation of a callback that doesn't chain reliably, that is a bug in their code and all of the fallout from that is "just" a bug to be fixed in said code. Think of it like a C signal handler, the OS doesn't chain for you - it is the signal handler installer's responsibility to chain to the previous one. Not an unusual pattern for callback hooks in C. We already have such global C callback hooks in SetProfile and SetTrace. https://docs.python.org/3/c-api/init.html#profiling-and-tracing Those don't even provide for chaining. We could do the same here and not let a hook be set more than once instead of providing a Get API to allow for manual chaining. That seems less useful. |
Why so coarse? Getting a notification for every change of a global in module, is likely to make use the use of global variables extremely expensive.
I may well want to be notified if -------------- What happens if a watched dictionary is modified in a callback? -------------- How do you plan to implement this? Steal a bit from You'd probably need a PEP to replace PEP-509, but I think this may need a PEP anyway. |
Perhaps a compromise is possible here: one global group/chain of callbacks registered for all dictionaries in an interpreter seems reasonable (to keep overhead low), but we could reduce the number of times it’s actually called by, say, tagging specific values of specific dictionaries to be watched. For example, we could just tag the low bit of any pointer in a dictionary’s values that we want to be notified of changes to. Something like that seems like it could really keep the cost of watching down without sacrificing power. |
Thanks for the feedback!
Simplicity of implementation is a strong advantage, all else equal :) And the coarse version is a) at least somewhat proven as useful and usable already by Cinder / Cinder JIT, and b) clearly doable without introducing memory or noticeable CPU overhead to unwatched dicts. Do you have thoughts about how you'd do a more granular version without overhead?
It's possible. We haven't ever observed this as an issue in practice, but we may have just not observed enough workloads with heavy writes to globals. I'd like to verify this problem with a real representative benchmark before making design decisions based on it, though. Calling a callback that is uninterested in a particular key doesn't need to be super-expensive if the callback is reasonably written, and this expense would occur only on the write path, for cases where the
Would you want to tag the value, or the key? If value, does that mean if the value is changed it would revert to unwatched unless you explicitly watched the new value? I'm a bit concerned about the performance overhead this would create for use of dicts outside the write path, e.g. the need to mask off the watch bit of returned value pointers on lookup.
It may be best to document that this isn't supported; it shouldn't be necessary or advisable for the intended uses of dict watching. That said, I think it should work fine if the callback can handle re-entrancy and doesn't create infinite recursion. Otherwise, I think it's a case of "you broke it, you get to keep all the pieces."
We currently steal the low bit from the version tag in Cinder; my plan was to keep that approach.
I'd prefer to avoid coupling this to removal of the version tag. Then we get into issues of backward compatibility that this proposal otherwise avoids. I don't think the current proposal is of a scope or level of user impact that should require a PEP, but I'm happy to write one if needed. |
Draft PR is up for consideration. Perf data in https://gist.github.com/carljm/987a7032ed851a5fe145524128bdb67a Overall it seems like the base implementation is perf neutral -- maybe a slight impact on the pickle benchmarks? With all module global dicts (uselessly) watched, there are a few more benchmarks with small regressions, but also some with small improvements (just noise I guess?) -- overall still pretty close to neutral. Comments welcome! |
A curiosity: have you considered watching dict keys rather than whole dicts? That way, changing global values would not have to de-optimize, only adding new global keys would. Indexing into dict values array wouldn't be as efficient as embedding direct jump targets in JIT-generated machine code, but as long as we're not doing that, maybe watching the keys is a happy medium? |
Hi Dennis, thanks for the questions!
There's a bit of discussion of this above. A core requirement is to avoid any memory overhead and minimize CPU overhead on unwatched dicts. Additional memory overhead seems like a nonstarter, given the sheer number of dict objects that can exist in a large Python system. The CPU overhead for unwatched dicts in the current PR consists of a single added It's not clear to me how to implement per-key watching under this constraint. One option Brandt mentioned above is to steal the low bit of a Open to suggestions if I've missed a good option here!
But we are doing that, in the Cinder JIT. Dict watching here is intentionally exposed for use by extensions, including hopefully in future the Cinder JIT as an installable extension. We burn exact pointer values for module globals into generated JIT code and deopt if they change (we are close to landing a change to code-patch instead of deopting.) This is quite a bit more efficient in the hot path than having to go through a layer of indirection. I don't want to assume too much about how dict watching will be used in future, or go for an implementation that limits its future usefulness. The current PR is quite flexible and can be used to implement a variety of caching strategies. The main downside of dict-level watching is that a lot of notifications will be fired if code does a lot of globals-rebinding in modules where globals are watched, but this doesn't appear to be a problem in practice, either in our workloads or in pyperformance. It seems likely that a workable strategy if this ever was observed to be a problem would be to notice at runtime that globals are being re-bound frequently in a particular module and just stop watching that module's globals. |
Just realized that I misunderstood this suggestion; you don't mean per-key watching necessarily, you just mean _not_ notifying on dict values changes. Now I understand better how that connects to the second part of your comment! But yeah, I don't want this limitation on dict watching use cases. |
There are three kinds of changes that we might want to watch (that I can think of right now):
One way to support the three cases above would be to replace the dict version with a pointer to a data structure describing what it watched. |
Thanks for outlining the use cases. They make sense. The current PR provides a flexible generic API that fully supports all three of those use cases (use cases 2 and 3 are strict subsets of use case 1.) Since the callback is called before the dict is modified, all the necessary information is available to the callback to decide whether the event is interesting to it or not. The question is how much of the bookkeeping to classify events as "interesting" or "uninteresting" should be embedded in the core dispatch vs being handled by the callback. One reason to prefer keeping this logic in the callback is that with potentially multiple chained callbacks in play, the filtering logic must always exist in the callback, regardless. E.g. if callback A wants to watch only keys-version changes to dict X, but callback B wants to watch all changes to it, events will fire for all changes, and callback A must still disregard "uninteresting" events that it may receive (just like it may receive events for dicts it never asked to watch at all.) So providing API for different "levels" of watching means that the "is this event interesting to me" predicate must effectively be duplicated both in the callback and in the watch level chosen. The proposed rationale for this complexity and duplication is the idea that filtering out uninteresting events at dispatch will provide better performance. But this is hypothetical: it assumes the existence of perf-bottleneck code paths that repeatedly rebind globals. The only benchmark workload with this characteristic that I know of is pystone, and it is not even part of the pyperformance suite, I think precisely because it is not representative of real-world code patterns. And even assuming that we do need to optimize for such code, it's also not obvious that it will be noticeably cheaper in practice to filter on the dispatch side. It may be more useful to focus on API. If we get the API right, internal implementation details can always be adjusted in future if a different implementation can be shown to be noticeably faster for relevant use cases. And if we get existing API right, we can always add new API if we have to. I don't think anything about the proposed simple API precludes adding One modification to the simple proposed API that should improve the performance (and ease of implementation) of use case #2 would be to split the current |
I've updated the PR to split |
You might not like global variables, they may not show up much in benchmarks, but people do use them. I suspect a lot of jupyter notebooks have quite a few global variables. There should not be much of a slowdown for this code when watching CONST = ... # watched |
Another use of this is to add watch points in debuggers. To that end, it would better if the callback were a Python object. The overhead is relatively small if using the vectorcall protocol. |
How and when (and based on what data?) would the adaptive interpreter make the decision that for this code sample the key The code sample also suggests that the module globals dict for a module is being watched while that module's own code object is being executed. In module body execution, writing to globals (vs reading them) is relatively much more common, compared to any other Python code execution context, and it's much less common for the same global to be read many times. Given this, how frequently would watching module globals dictionaries during module body execution be a net win at all? Certainly cases can be contrived in which it would be, but it seems unlikely that it would be a net win overall. And again, unless the optimizer can reliably (and in advance, since module bodies are executed only once) distinguish the cases where it's a win, it seems the example is not practically relevant.
It is easy to create a C callback that delegates to a Python callable if someone wants to implement this use case, so the vectorcall overhead is paid only when needed. The core API doesn't need to be made more complex for this, and there's no reason to impose any overhead at all on low-level interpreter-optimization use cases. |
Let me give you an example. #module eggs eggs_var = 0 # a variable, maybe a counter or similar
EGGS_CONST # a constant #module spam import eggs spam_var # Another variable def foo():
use(eggs.EGGS_CONST) We will want to treat This might not be necessary for us right now, but we will want to implement optimizations over larger regions than a single bytecode in 3.12. |
Thanks for the extended example. I think in order for this example to answer the question I asked, a few more assumptions should be made explicit:
It is certainly possible that this case could occur, where some module contains both a frequently-read-but-not-written global and also a global that is re-bound using
I think it's worth keeping in mind that |
@markshannon and I discussed this at the PyCon sprints and agreed on an alternative design that places more complexity in the CPython core in exchange for making it easier to implement watching clients (and probably somewhat more CPU-efficient overall in the presence of multiple watching clients). The design will still initially not support per-key watching, only watching of entire dicts; per-key watching can be layered on top of it later without disturbing the initial whole-dict semantics. The change is that now the core dispatcher will track individual watchers, limited to a maximum of 8 watchers total. This seems reasonable given that we don't really see a use case for more than 3 watchers (the interpreter loop, maybe a JIT, and maybe a debugger) to be active at one time. 640k should be more than enough for anybody, right?! Rather than stealing only one bit from the dict version, we will now steal 8, one bit for each registered watcher. When a watcher first registers its callback, it will get back an integer ID. Whenever it wants to watch or unwatch a dict, it must pass in that ID. Each dict will track a "watched" bit for each watcher. The advantages of this design are that a) watchers are no longer responsible to chain to other watchers, b) watchers no longer have to potentially handle events for dicts they did not ask to watch (and thus don't have to necessarily maintain their own internal data structure of dicts they are interested in and check that data structure for every event), and c) it can now be possible for a client to unwatch a dict, since that won't impact other clients. A future enhancement (which will probably need a PEP, to take over the dict version entirely) can replace the dict version with a pointer to an array that holds a byte per key, marking which watcher(s) are interested in each individual key, as suggested by Mark above. |
Sounds good to me! This still supports Cinder's use case and should require minimal effort on our end. |
This issue seems stalled. IIRC Mark and Carl have agreed on a plan -- who's on the hook for executing (getting a PR ready for review in 3.12)? |
I am on the hook, just not finding time... I definitely plan to make it happen in time for 3.12 though. |
Fixed with merge of #31787 |
* main: bpo-35540 dataclasses.asdict now supports defaultdict fields (pythongh-32056) pythonGH-91052: Add C API for watching dictionaries (pythonGH-31787) bpo-38693: Use f-strings instead of str.format() within importlib (python#17058)
* main: pythongh-68686: Retire eptag ptag scripts (python#98064) pythongh-97922: Run the GC only on eval breaker (python#97920) GitHub Workflows security hardening (python#96492) Add `@ezio-melotti` as codeowner for `.github/`. (python#98079) pythongh-97913 Docs: Add walrus operator to the index (python#97921) [doc] Fix broken links to C extensions accelerating stdlib modules (python#96914) pythongh-97822: Fix http.server documentation reference to test() function (python#98027) pythongh-91052: Add PyDict_Unwatch for unwatching a dictionary (python#98055) pythonGH-98023: Change default child watcher to PidfdChildWatcher on supported systems (python#98024) pythonGH-94182: Run the PidfdChildWatcher on the running loop (python#94184)
* main: (5519 commits) Minor edits to the Descriptor HowTo Guide (pythonGH-24901) Fix link to Lifecycle of a Pull Request in CONTRIBUTING (python#98102) pythonGH-94597: deprecate `SafeChildWatcher`, `FastChildWatcher` and `MultiLoopChildWatcher` child watchers (python#98089) Auto-cancel old builds when new commit pushed to branch (python#98009) pythongh-95011: Migrate syslog module to Argument Clinic (pythonGH-95012) pythongh-68686: Retire eptag ptag scripts (python#98064) pythongh-97922: Run the GC only on eval breaker (python#97920) GitHub Workflows security hardening (python#96492) Add `@ezio-melotti` as codeowner for `.github/`. (python#98079) pythongh-97913 Docs: Add walrus operator to the index (python#97921) [doc] Fix broken links to C extensions accelerating stdlib modules (python#96914) pythongh-97822: Fix http.server documentation reference to test() function (python#98027) pythongh-91052: Add PyDict_Unwatch for unwatching a dictionary (python#98055) pythonGH-98023: Change default child watcher to PidfdChildWatcher on supported systems (python#98024) pythonGH-94182: Run the PidfdChildWatcher on the running loop (python#94184) pythongh-92886: make test_ast pass with -O (assertions off) (pythonGH-98058) pythongh-92886: make test_coroutines pass with -O (assertions off) (pythonGH-98060) pythongh-57179: Add note on symlinks for os.walk (python#94799) pythongh-94808: Fix regex on exotic platforms (python#98036) pythongh-90085: Remove vestigial -t and -c timeit options (python#94941) ...
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: