Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass CoreWorker into wrapped C++ code #74

Merged
merged 20 commits into from
Sep 5, 2023
Merged

Conversation

omus
Copy link
Member

@omus omus commented Aug 31, 2023

Fix #61

@omus omus requested a review from kleinschmidt August 31, 2023 21:30
@omus omus marked this pull request as draft September 1, 2023 14:05
Comment on lines 113 to 119
function GetCoreWorker()
return ray_jll.GetCoreWorker()
# if !isassigned(CORE_WORKER)
# CORE_WORKER[] = ray_jll.GetCoreWorker()
# end
# return CORE_WORKER[]
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the write-once version of this function results in most of the Ray.jl test suite failing with:

C++ object of type N3ray4core10CoreWorkerE was deleted

I'll try a couple other things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. If I change our global reference to be a Ref{ray_jll.CoreWorker} then things work fine. It seems that the CxxRef doesn't hold onto the allocated memory.

@omus omus marked this pull request as ready for review September 1, 2023 14:22
@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2023

Codecov Report

Merging #74 (ec74d4d) into main (68c911a) will not change coverage.
The diff coverage is 83.33%.

@@           Coverage Diff           @@
##             main      #74   +/-   ##
=======================================
  Coverage   68.93%   68.93%           
=======================================
  Files           6        6           
  Lines         264      264           
=======================================
  Hits          182      182           
  Misses         82       82           
Flag Coverage Δ
Ray.jl 68.93% <83.33%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
Ray.jl/src/Ray.jl 100.00% <ø> (ø)
Ray.jl/src/runtime.jl 45.39% <75.00%> (ø)
Ray.jl/src/function_manager.jl 92.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@omus
Copy link
Member Author

omus commented Sep 1, 2023

CI failure will be fixed by #75

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my concern here is that it's not clear what the problem is that this is solving; it introduces some fragility by making the julia side keep track of teh C++ coreworker struct. That might be worth it if there's a clear benefit but from what I can tell right now, there isn't.

Get the current job ID for this worker or driver. Job ID is the id of your Ray drivers that
create tasks.
"""
get_job_id() = ray_jll.ToInt(ray_jll.GetCurrentJobId(GetCoreWorker()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why rename this function? I think it's better to keep the diff clean if there isn't a functional change here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 111 to 118
const CORE_WORKER = Ref{ray_jll.CoreWorker}()

function GetCoreWorker()
if !isassigned(CORE_WORKER)
CORE_WORKER[] = ray_jll.GetCoreWorker()[]
end
return CxxRef(CORE_WORKER[])
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pattern for other global refs like the function manager is to set this during initialization and use the ref value directly; is there a good reason to not use that pattern in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see it's because we need to wrap it in a CxxRef?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally the reason for this was to handle the start_worker case. In the scenario where start_worker is called we call the blocking call initialize_worker which is the point at which CoreWorker is instantiated. Doing this this way allows this case to work. It also ended up being useful when I changed the code to return a CxxRef.

deps/wrapper.cc Outdated
.method("GetCurrentTaskId", &ray::core::CoreWorker::GetCurrentTaskId)
.method("put", &put)
.method("get", &get)
.method("submit_task", &submit_task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear to me why we are exposing these are methods of this class when they're not defined that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these as methods of this class only to avoid headaches with moving these definitions around. The way we've set these up in Julia these methods are associated with this type so doing this just indicates the relationship here. Definitely optional

@omus
Copy link
Member Author

omus commented Sep 1, 2023

I think my concern here is that it's not clear what the problem is that this is solving; it introduces some fragility by making the julia side keep track of teh C++ coreworker struct. That might be worth it if there's a clear benefit but from what I can tell right now, there isn't.

Continuing on this thread the C++ code typically just calls GetCoreWorker() over and over so I don't think there's a major performance hit with doing this on the C++ side. However, if we did this in Julia we'd be allocating and deallocating over and over again so we would see a performance hit.

After implementing this PR I think what we should do is only pass in a CoreWorker for it's own class methods and just keep calling GetCoreWorker internally in the C++ code. For this to be performant however we'd need to keep the cached Ray.GetCoreWorker function. What do you think?

@omus
Copy link
Member Author

omus commented Sep 1, 2023

Last thing to do here is to move the caching GetCoreWorker() into the JLL.

@omus
Copy link
Member Author

omus commented Sep 1, 2023

RTM again

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still a bit nervous about handlign this on the julia side but at least in teh JLL it's confined to a safe space. and as discussed in huddle, if we're not handling it correctly, we don't really understand what we're doing here (in a general sense) so probably better to surface that.

@kleinschmidt kleinschmidt merged commit c32e03d into main Sep 5, 2023
4 checks passed
@kleinschmidt kleinschmidt deleted the cv/global-core-worker branch September 5, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid over-calling GetCoreWorker()
3 participants