Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global variable not updated on remote workers in some cases #31252

Closed
marius311 opened this issue Mar 4, 2019 · 5 comments
Closed

Global variable not updated on remote workers in some cases #31252

marius311 opened this issue Mar 4, 2019 · 5 comments
Labels
domain:parallelism Parallel or distributed computation kind:bug Indicates an unexpected problem or unintended behavior

Comments

@marius311
Copy link
Contributor

marius311 commented Mar 4, 2019

Looks like in some cases, if I switch a global variable, then switch it back, its value on remote workers when captured is only updated after the first switch. Here is a MWE (Julia 1.1.0):

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> x = :a
:a

julia> @fetch x
:a

julia> x = :b
:b

julia> @fetch x
:b

julia> x = :a
:a

julia> @fetch x
:b # <--- should be :a here

This doesn't always happen though. E.g. if x is instead an Int, it works as I would have expected from the docs:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> x = 1
1

julia> @fetch x
1

julia> x = 2
2

julia> @fetch x
2

julia> x = 1
1

julia> @fetch x
1

Note sure what triggers the bug, if it even has anything to do with the type of x, although is seems reproducible every time. I've found that e.g. x::Int and x::Vector work fine, but x::Symbol and x::String do not.

@raminammour
Copy link
Contributor

raminammour commented Mar 5, 2019

The logic here is wrong (you can redefine the function with @show s.glbs_sent to see what is going on):

s.glbs_sent[oid] != hash(sym, hash(v)) && push!(lst, sym)

In the steps you show before (for a a non-bits type):

x=:a //objectid(x)=hash(:a)
s.globs_sent=Dict(hash(:a)=>hash(x,hash(:a))) //new oid, send x
x=:b //objectid(x)=hash(:b) 
s.globs_sent=Dict(hash(:a)=>hash(x,hash(:a)),hash(:b)=>hash(x,hash(:b))) //new oid, send x
x=:a
s.globs_sent=Dict(hash(:a)=>hash(x,hash(:a)),hash(:b)=>hash(x,hash(:b))) //key and value match from the first run, don't send x

This is a bug and will happen for many types (as long as you go back to a previous value for the same variable), and I am not sure this PR helps either :(

@amitmurthy amitmurthy added kind:bug Indicates an unexpected problem or unintended behavior domain:parallelism Parallel or distributed computation labels Mar 6, 2019
@amitmurthy
Copy link
Contributor

The issue is here -

glbs_sent::Dict{UInt64, UInt64} # (key,value) -> (objectid, hash_value)

I think glbs_sent::Dict{UInt64, UInt64} should be glbs_sent::Dict{Symbol, Tuple{UInt64, UInt64}} i.e. (key,value) -> (global_symbol,(objectid, hash_value))`.

I'll submit a PR in a couple of days.

@raminammour
Copy link
Contributor

Yes, i figured that glbs_sent should be symbol => (oid,value), so the changing value can be overwritten. I was too scared to submit any PR since this is quite low level and I don't know what else relies on the ClusterSerializer struct.
Thanks for taking care of this!

@amitmurthy
Copy link
Contributor

I was too scared to submit any PR since this is quite low level and I don't know what else relies on the ClusterSerializer struct.

Please don't be! The more contributors the better. Most of us have been new to the Julia codebase at one time.

@Keno
Copy link
Member

Keno commented Apr 20, 2019

The linked PR that was merged claims to have fixed this.

@Keno Keno closed this as completed Apr 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:parallelism Parallel or distributed computation kind:bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants