You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[mlir] Fix race condition introduced in ThreadLocalCache (#93280)
Okay, so an apparently not-so-rare crash could occur if the
`perInstanceState` point got re-allocated to the same pointer as before.
All the values have been destroyed, so the TLC is left with dangling
pointers, but then the same `ValueT *` is pulled out of the TLC and
dereferenced, leading to a crash.
I suppose the purpose of the `weak_ptr` was that it would get reset to a
default state when the `perInstanceState` shared pointer got destryoed
(its reference count would only ever be 1, very briefly 2 when it gets
aliased to a `ValueT *` but then assigned to a `weak_ptr`).
Basically, there are circular references between TLC instances and
`perInstanceState` instances and we have to ensure there are no dangling
references.
1. Ensure the TLC entries are reset to a valid default state if the TLC
(i.e. owning thread) lives longer than the `perInstanceState`. a. This
is currently achieved by storing `weak_ptr` in the TLC.
2. If `perInstanceState` lives longer than the TLC, it cannot contain
dangling references to entries in destroyed TLCs. a. This is not
currently the case.
3. If both are being destroyed at the same time, we cannot race. a. The
destructors are synchronized because the TLC destructor locks `weak_ptr`
while it is destructing, preventing the owning `perInstanceState` of the
entry from destructing. If `perInstanceState` got destructed first, the
`weak_ptr` lock would fail.
And
4. Ensure `get` in the common (initialized) case is as fast as possible
(no atomics).
We need to change the TLC to store a `ValueT **` so that it can be
shared with entries owned by `perInstanceState` and written to null when
they are destroyed. However, this is no longer something synchronized by
an atomic, meaning that (2) becomes a problem. This is fine because when
TLC destructs, we remove the entries from `perInstanceState` that could
reference the TLC entries.
This patch shows the same perf gain as before but hopefully without the
bug.
0 commit comments