[Distributed] Worker local race condition between put! and fetch for Futures #42339

If multiple tasks are deserializing the same `ref`, only one of them should "win" here and update `found`. We probably need to do this with `:sequentially_consistent` order also, since that seems likely to be our intended memory model for `put!`/`take!` on AbstractRemoteRefs.
Addressed it with an `@atomicreplace` - it seems to be more fitting as it's sequentially consistent and also keeps our rule of always setting the cache once.
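
For illustration, here is a minimal, self-contained sketch of that cache-once pattern; `ToyRef` and `cache_once!` are invented names for this example, not the actual `Future` internals in Distributed:

```julia
# Stand-in for a remote reference with a write-once local value cache.
mutable struct ToyRef
    @atomic v::Union{Nothing, Some{Any}}
end

function cache_once!(r::ToyRef, value)
    # @atomicreplace is sequentially consistent by default and only succeeds
    # for the first caller, so the cache is set exactly once even if several
    # tasks deserialize or fetch the same ref concurrently.
    result = @atomicreplace r.v nothing => Some{Any}(value)
    return result.success
end

r = ToyRef(nothing)
@show cache_once!(r, 42)   # true  - this task "won" and installed the value
@show cache_once!(r, 99)   # false - cache already set; the value is unchanged
```
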
It seems a bit aggressive to make serialize/deserialize full atomic barriers (instead of their defaults). @tkf any thoughts too?

fyi this is just one leftover `future.v` usage that wasn't wrapped in `@atomic` and didn't error before in tests or usage.
Hmm... Can `serialize(s::ClusterSerializer, f::Future)` call `serialize(::AbstractSerializer, ::typeof(something(f.v)))`? If so, I guess we need to make it as strong as `fetch`, since the author of `MyType` can override `serialize(::AbstractSerializer, ::MyType)` and it can contain arbitrary code?
Looking closer, the issue here is that we need a lock around `f.v` for the duration of this function.
Put a lock on `f.lock` around the full serialize function.
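
A rough sketch of that idea, assuming a simplified stand-in type (the real method lives on `ClusterSerializer` and handles more fields than this):

```julia
using Serialization

# Hypothetical stand-in for a Future: just the lock plus the cached-value slot.
mutable struct ToyFuture
    lock::ReentrantLock
    @atomic v::Union{Nothing, Some{Any}}
end

# Hold f.lock for the whole serialize body so that a concurrent local put!
# cannot install or change the cached value halfway through writing it out.
function serialize_locked(io::IO, f::ToyFuture)
    lock(f.lock) do
        v = @atomic f.v
        serialize(io, v)   # only the payload is written; the lock itself is not
    end
end
```
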
We need to change the Future serializer to exclude the new fields, and the corresponding `deserialize` methods.
I've been trying to make it work, but without any success. If I serialize the fields manually as proposed by you, at some point the code expects a `Future` structure and I'm not really sure how to handle that.

Anyway, I was thinking about the issue, and as of right now:
- serialization takes the `Future` with the new lock and serializes it
- deserialization constructs a `Future` based on it, but I'm unsure whether it takes the serialized lock data or creates a new one (constructors only ever create a new one, apart from the default)

I made the serialization now serialize a copy of the `Future` with a new, unused lock, but I'm not sure if it's necessary. It used to work fine before any serialization change anyway.
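
A hedged sketch of the "serialize a copy with a fresh, unused lock" approach described above; the field layout and names here are illustrative only, not the actual Future definition:

```julia
using Serialization

# Illustrative stand-in; the real Future also carries where/whence/id fields.
mutable struct ToyFutureCopy
    id::Int
    lock::ReentrantLock
    @atomic v::Union{Nothing, Some{Any}}
end

function serialize_copy(io::IO, f::ToyFutureCopy)
    v = @atomic f.v
    # Serialize a copy that shares the payload but carries a brand-new lock,
    # so the (possibly locked) ReentrantLock state never goes over the wire;
    # the deserializing side ends up with a fresh, unlocked lock either way.
    serialize(io, ToyFutureCopy(f.id, ReentrantLock(), v))
end
```
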
@vtjnash I rebased and it's finally passing CI. What do you think of the above changes?

I couldn't get the serialization of only the "safe" fields working, so I just added a full copy in the serializer (so that the serialization always works on a copy with an unlocked lock). I'm not sure it's necessary, though, as serializing a locked lock doesn't matter in the end: during deserialization a new lock is constructed. Or is there some reason not to serialize a locked lock?
I am not sure why you were trying to use `invoke`, since there is definitely no hope of that approach working at all, while I had written the code just above here that I think should be roughly correct. Anyways, this approach now seems suitable enough, so I have merged it.
Can we lock `r::Future` here?
Didn't add a lock here since it's supposed to be a quick cache check.
I meant here exactly, as in after the quick lookup.

For (my own) reference on performance:

(note: that last lock `lk` needs to be optimized to be on par with the other `lk`s, but we didn't use to have the ability to express atomic operations in Julia)
If I put the whole thing in a `lock(r.lock)`, then a local `put!` will never be able to enter the lock, because the `fetch` will be stuck on the `fetch(rv.c)`. And in `put!`, if I put into the channel outside of the lock, I don't have control over who caches the local value first.

I'm not sure what else I could improve here. The local cache is always set once. The paths from `fetch(rv.c)` and `call_on_owner` will rightfully try to cache the value and only fail if it was already cached or `put!` happened locally.
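
To make the interplay concrete, here is a self-contained toy model of that flow, with a `Channel` standing in for `rv.c` and an `@atomic` field standing in for the local cache `r.v` (all names are invented for the example):

```julia
# Toy model: `chan` plays the role of rv.c and `v` is the local cache (r.v).
mutable struct ToyFetchable
    chan::Channel{Any}
    @atomic v::Union{Nothing, Some{Any}}
end
ToyFetchable() = ToyFetchable(Channel{Any}(1), nothing)

function toy_put!(r::ToyFetchable, x)
    put!(r.chan, x)                             # into the channel outside any lock
    @atomicreplace r.v nothing => Some{Any}(x)  # may lose the race to a fetcher
    return r
end

function toy_fetch(r::ToyFetchable)
    v = @atomic r.v
    v === nothing || return something(v)        # quick cache check, no lock
    x = fetch(r.chan)                           # blocks until toy_put! runs
    # Whoever arrives here first caches the value; later callers simply fail
    # the replace and return what they fetched.
    @atomicreplace r.v nothing => Some{Any}(x)
    return x
end

r = ToyFetchable()
t = @async toy_fetch(r)
toy_put!(r, :done)
@show fetch(t)   # :done
```
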
Okay, yeah, I see how it is complicated by the fact that we might be the owner, so we are waiting for someone else to set the value before we can return it here. But it seems like if this fails in the remote case, it is because there was a thread-synchronization error, which could happen because this lock was not being held across here.

The call-on-owner case won't fail due to this lock, because it fetches on a remote channel; this `fetch` function isn't used in the `call_on_owner` scenario. This looks safe to me as it is right now because of that, but I haven't delved that deep into the remote scenarios.
This could probably be locked, so that `call_on_owner` only gets called once, but I'm not sure if there are any hidden implications of doing that.
@vtjnash Just not sure here. So when a fetch gets the value from `call_on_owner` at line 573, we cache the value in `r.v`; should fetch then return the `v_local` obtained from the `call_on_owner`, or should I load `r.v` and get that cached value?

Right now it's just returning `v_local`.
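
For illustration, the second option looks roughly like this (a hypothetical helper, not the merged code):

```julia
# After attempting to cache, re-load the cache and return that, so every
# caller observes the identical object even if another task won the race.
function cache_and_return(r, v_local)
    @atomicreplace r.v nothing => Some{Any}(v_local)
    return something(@atomic r.v)   # the current behaviour would `return v_local`
end
```
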
I am not sure here. Since `fetch(rv.c)` succeeded, it seems like that set the value for everyone, but now we might decide to copy it locally too. In that case we must have strictly maintained that `rv.v === r.v`, since `r.v` is just a locally cached copy of the same pointer from `rv.v` (which has the canonical copy). Is that right?
Yes, that's correct. The `rv.v === r.v` invariant is maintained properly right now, I think.