Make workers inherit package environment by default, fix tests #17

Closed
wants to merge 1 commit into from

Conversation

jbphyswx
Contributor

@jbphyswx jbphyswx commented Dec 9, 2024

Fixes #16.

Solution adapted from https://github.com/JuliaLang/julia/pull/43270/files
There may be things I left out or other enhancements I should have included but this seems like a good first step.

Also, the tests weren't working correctly for me, so I updated them -- I'm not sure whether all the problems I encountered were specific to my machine or generally true across Slurm setups. See the new tests.

I also added some tests of the inheritance framework in test/test_inheritance (SlurmClusterManager is a dependency there since we don't use withenv for that test).

@jbphyswx
Contributor Author

jbphyswx commented Dec 9, 2024

I think in principle this should also include compatibility changes for Julia > 1.9?

@jbphyswx jbphyswx marked this pull request as draft December 9, 2024 18:26
@jbphyswx jbphyswx marked this pull request as ready for review December 9, 2024 18:27
@jbphyswx jbphyswx changed the title Maker workers inherit environment by default, fix tests Make workers inherit environment by default, fix tests Dec 9, 2024
@jbphyswx jbphyswx changed the title Make workers inherit environment by default, fix tests Maker workers inherit package environment by default, fix tests Dec 9, 2024
@DilumAluthge
Contributor

Could you split this up into two separate PRs? One PR to fix tests and CI, and a second PR for the Pkg environments?

@DilumAluthge
Contributor

Hmmm, also, I did some testing, and for me, the package environment is already getting propagated to the workers by default. Do you have a MWE that demonstrates otherwise?

@jbphyswx
Contributor Author

jbphyswx commented Dec 30, 2024

Thanks for checking in on this -- I don't know how to fix the CI issues (the jobs seem to have failed instantly).
I will need to do some checks and follow up about splitting the PR -- I believe I papered over some necessary compat changes. Or maybe, if I figure out what is wrong with the CI, that would help me see where, if anywhere, compatibility breaks.

I will see if I can construct a smaller MWE, but as written, the test suite in this PR should fail without the code changes in the rest of the PR, since the launch command for SlurmManager didn't implement any of the inheritance behavior.

It should also print out Base.ACTIVE_PROJECT[] etc.

Just running test/test_inheritance/runtests_inheritance.jl should perhaps work?

If you have a script MWE_script.jl (a subset of the things in test/test_inheritance/runtests_inheritance.jl):

#!/usr/bin/env julia

using Distributed, Test, SlurmClusterManager

addprocs(SlurmManager(verbose=true))
@assert nworkers() == parse(Int, ENV["SLURM_NTASKS"])
@everywhere println("Host $(myid()): $(gethostname())") # I created a separate test for assuring that the env is set properly, but without withenv it's hard to test -- you need another location that has Distributed installed

active_project = Base.ACTIVE_PROJECT[] # project is set in runtests_inheritance so it will have a value [returns path unlike `Base.active_project()`]
@info "Active Project: $active_project"
@everywhere begin
    _active_project = Base.ACTIVE_PROJECT[]

    # Check the active project is correctly set to what the env variable was set to
    if !(_active_project == $active_project) # This should be the same on all workers
        active_project_interp = $active_project # hack to get the variable because I can't figure out how to do the interpolation inside a string inside an @everywhere block
        active_toml = Base.active_project()
        error("Expected Base.ACTIVE_PROJECT[] to match $active_project_interp, but got $_active_project. [ Active toml from Base.active_project() = $active_toml ]")
    end
end

and call it with something like

project_path = <desired project path w/ SlurmClusterManager.jl available>
script_file = <path to MWE_script.jl>
log_file = <desired log file>

jobid = read(`sbatch --wait --export=ALL --time=0:02:00 --parsable -n 2 -N 2 -o $log_file --wrap="julia --project=$project_path $script_file"`, String) # test a minimal config in a different location without env settings

output = read(log_file, String)

@info "results from job $jobid: $output"

It should fail on the assertion in MWE_script.jl without the updates from this PR. I just confirmed this on my end at least.

@DilumAluthge
Contributor

Okay, so I don't think all of the changes in this PR are strictly necessary. Julia is already propagating all environment variables from the "manager process" to srun, because IIUC that is the default behavior of Base.run() and Base.open().

So we only need to propagate JULIA_PROJECT.

Can you try #19 and see if it fixes your MWE?


On a separate note, we should probably avoid ever calling Base.ACTIVE_PROJECT[]. IIUC, Base.ACTIVE_PROJECT[] is not public API. We should be calling Base.active_project() instead.
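
For reference, a minimal sketch of what building the worker environment from the public API might look like (illustrative only -- the helper name worker_env is hypothetical, and this is not the exact change in #19):

# Hypothetical sketch: propagate the active project to the workers via
# JULIA_PROJECT, using the public Base.active_project() instead of the
# internal Base.ACTIVE_PROJECT[].
function worker_env()
    env = Dict{String,String}(ENV)
    project = Base.active_project()  # path to the active Project.toml, or nothing
    if project !== nothing
        env["JULIA_PROJECT"] = project
    end
    return env
end

# The srun command could then be launched with this environment, e.g.
# run(setenv(`srun hostname`, worker_env()))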

DilumAluthge added a commit to DilumAluthge-forks/SlurmClusterManager.jl that referenced this pull request Dec 30, 2024
Extracted from kleinhenz#17.

Co-authored-by: jbphyswx <jbenjami@caltech.edu>
@jbphyswx
Contributor Author

Ok I will check later this evening (though a quick check seemed to work right now!). I originally just copied those 3 (JULIA_PROJECT, JULIA_LOAD_PATH, JULIA_DEPOT_PATH) because that's what was implemented in https://github.com/JuliaLang/julia/pull/43270/files and is still there in https://github.com/JuliaLang/Distributed.jl/blob/master/src/managers.jl.

The comment at Distributed.jl/src/managers.jl line 484 reads:

    # TODO: Maybe this belongs in base/initdefs.jl as a package_environment() function
    #       together with load_path() etc. Might be useful to have when spawning julia
    #       processes outside of Distributed.jl too.
    # JULIA_(LOAD|DEPOT)_PATH are used to populate (LOAD|DEPOT)_PATH on startup,
    # but since (LOAD|DEPOT)_PATH might have changed they are re-serialized here.
    # Users can opt-out of this by passing `env = ...` to addprocs(...).

But I don't have a test for changing JULIA_(LOAD|DEPOT)_PATH...

If I understand correctly what you're saying:

  • Setting command-line flags like --project doesn't necessarily set the corresponding variables in ENV, but...
  • JULIA_(LOAD|DEPOT)_PATH differ in that you can't set them directly from a command-line option, so they would always be set in ENV and thus get propagated automatically.

But wouldn't we still need this logic, to match the implementation in Distributed.jl, in case (LOAD|DEPOT)_PATH were changed at runtime?
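
For context, a rough sketch of the kind of re-serialization the upstream code performs (paraphrased, not the exact Distributed.jl source; the variable names here are illustrative):

# Re-serialize the current (possibly mutated) load and depot paths into the
# environment variables that populate LOAD_PATH and DEPOT_PATH on startup.
# Entries are joined with the platform path separator (':' on Unix, ';' on Windows).
pathsep = Sys.iswindows() ? ";" : ":"
env = Dict{String,String}(ENV)
env["JULIA_LOAD_PATH"]  = join(LOAD_PATH, pathsep)
env["JULIA_DEPOT_PATH"] = join(DEPOT_PATH, pathsep)
# A worker started with this env reconstructs the manager's package environment
# even if LOAD_PATH/DEPOT_PATH were mutated after the manager process started.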

@DilumAluthge
Contributor

DilumAluthge commented Dec 30, 2024

Ah, I see, so upstream Distributed.jl wants to handle the case where a user has mutated Base.DEPOT_PATH and/or Base.LOAD_PATH.

Hmmm. I'm not sure how common that use case is, but it is probably fine for us to propagate the mutated depot and load paths to the workers.

I do wonder what the use case here is. For the project, it makes sense that we need to pass along the correct JULIA_PROJECT, because we have to cover both of the following cases:

  1. The user started Julia with julia --project or julia --project=something
  2. The user did Pkg.activate() in their Julia session.

Both of those cases seem pretty common, and thus we have to cover them. The depot and load path cases seem less common -- I would expect that users would just set the relevant environment variables before starting Julia. It might be fine for us to still cover those use cases, though. I'll think a little more about it.
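
To make the distinction concrete, a small illustration (run interactively; the scratch directory is just an example) of why the project has to be read from the running session rather than from ENV:

using Pkg

# Case 2 from above: activate a project after Julia has started.
Pkg.activate(mktempdir())

# Pkg.activate does not update ENV, so JULIA_PROJECT is typically unset here...
@show get(ENV, "JULIA_PROJECT", nothing)

# ...while the public API reflects the switch and is what should be forwarded
# to the workers.
@show Base.active_project()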

@DilumAluthge
Contributor

CI is now green on SlurmClusterManager#master, so I don't think that the test suite changes in this PR are needed.

DilumAluthge added a commit to DilumAluthge-forks/SlurmClusterManager.jl that referenced this pull request Dec 31, 2024
Extracted from kleinhenz#17.

Co-authored-by: jbphyswx <jbenjami@caltech.edu>
@jbphyswx
Contributor Author

Awesome thanks so much for pushing this forward!

One of the test changes I made was changing @assert hosts == ["c1", "c1", "c2", "c2"] in test/script.jl (which seems to be a Docker-only thing -- I hadn't noticed the ci folder before) to something else...
So that works for CI, I guess, but not if you try to run the tests locally on your cluster -- should the tests only work on CI? That seemed suboptimal when I was trying to run tests on my cluster while developing this PR, but I'm not sure what the best practice is there.

Everything else was around testing inheritance on the workers -- it might still be good to do some form of that, though I admit what I'd constructed may be overkill.

@DilumAluthge
Contributor

So that works for CI, I guess, but not if you try to run the tests locally on your cluster -- should the tests only work on CI? That seemed suboptimal when I was trying to run tests on my cluster while developing this PR, but I'm not sure what the best practice is there.

In an ideal world, I think we would have the tests work correctly on both CI and when running locally.

How about we keep the hostname-specific tests on CI, but skip them when running locally? I've implemented that here: #29

Note that #29 depends on #28.
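
(For illustration, a guard along these lines could gate the Docker-specific hostname assertion; the exact mechanism in #29 may differ, and the "CI" environment-variable check here is an assumption:)

using Distributed, Test

hosts = [remotecall_fetch(gethostname, id) for id in workers()]

if get(ENV, "CI", "false") == "true"
    # The Docker-based CI setup launches two workers on each of the nodes c1 and c2.
    @test sort(hosts) == ["c1", "c1", "c2", "c2"]
else
    # Locally, only check that every worker reported a hostname.
    @test length(hosts) == nworkers()
end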

@DilumAluthge
Contributor

I'm working on breaking your PR up into smaller pieces. There are also some workarounds that need to be added.

I've added you as Co-authored-by to make sure your GitHub user gets credit for the commits - if you spot any commits where I forgot to add you, just ping me, and I'll amend the commit and add your Co-authored-by line.

The first piece is here: #20

It just adds the support for params[:env]; it doesn't touch the project/depot_path/load_path stuff. That stuff will come in a follow-up PR.

Hopefully that should make it easier to review the smaller individual PRs.
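
(A hedged usage sketch of what params[:env] enables, assuming the keyword follows the convention documented for Distributed.jl's built-in managers -- a collection of "var" => "value" string pairs; the exact form in #20 may differ:)

using Distributed, SlurmClusterManager

# Hypothetical usage: forward selected environment variables to the srun-launched workers.
addprocs(SlurmManager(); env=["MY_EXTRA_VAR" => "value"])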

@DilumAluthge
Contributor

The second piece is here: #30.

#30 depends on #20. So #20 should be merged first.

@jbphyswx
Contributor Author

Sounds good! Thanks for adding me to the commits but no worries about credit -- I'm just happy the package is moving forward. The PRs look good to me!

Thanks for working out the compatibility requirements too -- I've been pretty busy so I hadn't yet tracked down exactly what the constraints should be.

DilumAluthge added a commit to DilumAluthge-forks/SlurmClusterManager.jl that referenced this pull request Jan 1, 2025
Extracted from kleinhenz#17.

Co-authored-by: jbphyswx <jbenjami@caltech.edu>
@jbphyswx jbphyswx closed this Jan 15, 2025
@jbphyswx jbphyswx changed the title Maker workers inherit package environment by default, fix tests Make workers inherit package environment by default, fix tests Jan 15, 2025
Development

Successfully merging this pull request may close these issues.

addprocs does not activate workers in current project