
Incorrect description for env_extra for HTCondorCluster #393

Closed
matyasselmeci opened this issue Mar 16, 2020 · 20 comments · Fixed by #563

@matyasselmeci
Contributor

matyasselmeci commented Mar 16, 2020

Hi,

The description for env_extra in HTCondorCluster is not correct: the job that HTCondorCluster creates calls dask-worker directly instead of through a bash wrapper script, so you cannot put arbitrary shell commands into env_extra.

The interface supports environment variables as key=value pairs, which will be inserted into dask-worker's environment (via the "Environment" attribute in the submit file). (For consistency, you can write export foo=bar but the word "export" will be ignored.)

This is also important to keep in mind with regards to #323; renaming env_extra to job_script_extra or similar would be even more inaccurate (for the HTCondor case anyway).
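To illustrate the behaviour described above, here is a minimal, hypothetical sketch (not the actual dask-jobqueue code) of how key=value lines from env_extra could be folded into the submit file's Environment attribute, with a leading "export" stripped and ignored:

```python
def env_lines_to_environment(env_extra):
    """Hypothetical sketch: fold env_extra key=value lines into the
    submit file's Environment attribute; a leading "export " is
    stripped and otherwise ignored."""
    pairs = []
    for line in env_extra:
        line = line.strip()
        if line.startswith("export "):
            line = line[len("export "):]
        pairs.append(line)
    return 'Environment = "{}"'.format(" ".join(pairs))
```

So both `env_extra=["FOO=bar"]` and `env_extra=["export FOO=bar"]` would end up as `Environment = "FOO=bar"` in the submit file.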

@lesteve
Member

lesteve commented Mar 17, 2020

Thanks for the info!

I'm not sure what to do about this, I have to say. I would be tempted to say that there should be an HTCondor-specific argument to pass environment variables to HTCondor, and that env_extra should be lines inserted as-is into the job script.

I know nothing about HTCondor so that may not make that much sense ... I think it would help a lot if you had examples of

  • example use cases for job_extra and env_extra in HTCondor now
  • example use cases for job_script_extra if it existed, i.e. some lines that are inserted into the job script as-is.

@matyasselmeci
Contributor Author

This is a complicated answer because HTCondor has different semantics than the other batch schedulers. In PBS for example, you've got a shell script with some magic comments on top for controlling the batch scheduling, so essentially you have the entire job in a single file.

In HTCondor, the commands for controlling the scheduling are in a separate file, called a submit file, which is mostly a key=value file (aside from the Queue at the end). Here's an example:

Output = job.out
Error = job.err
Log = job.log

RequestCpus = 1
RequestMemory = 10MB
Executable = /usr/bin/sleep
Arguments = 30

Queue

HTCondor.job_script() essentially turns a dictionary into this file.
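As a rough illustration (a sketch under that description, not the real HTCondorJob implementation), turning a dictionary into a submit file could look like this:

```python
def render_submit_file(attrs):
    """Hypothetical sketch: render a dict of submit-file attributes into
    HTCondor submit-file syntax, ending with the Queue statement."""
    # Each key/value pair becomes one "Key = Value" line.
    lines = ["{} = {}".format(key, value) for key, value in attrs.items()]
    lines += ["", "Queue"]
    return "\n".join(lines)

submit = render_submit_file({
    "Executable": "/usr/bin/sleep",
    "Arguments": "30",
    "RequestCpus": "1",
})
```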

So if you want to run multiple remote commands in a single job, put them in a shell script and use that script as the Executable. (HTCondor will transfer the Executable to the remote machine.)

How this applies to your questions:

  • job_extra is for adding arbitrary KV pairs to the submit file; this can be useful for requesting custom resources (as in Request_GPUs=1) or requiring certain systems (Requirements=(OpSysAndVer=="CentOS7")).

  • env_extra is for adding environment variables to the remote job. I don't know how that's used in a dask worker but since you had that ability for the other batch systems, I added it for HTCondor as well.

  • job_script_extra would require adding code to HTCondorJob to write a wrapper script to a temporary file and use that as the executable. It's not very difficult; it would just be additional lines of code that are HTCondor-specific.

Does that help? Let me know if you have further questions.

@lesteve
Member

lesteve commented Mar 18, 2020

Thanks, that helps a lot! It would be great if HTCondor could move a bit closer to the other *Cluster classes.

If I understand your message correctly, it feels like one way to do that would be to create a temporary wrapper shell script that is used as the HTCondor executable.

Maybe something useful would be to add an additional method to show the wrapper script as well, e.g. job_htcondor_executable, or a better name?

@matyasselmeci
Contributor Author

That sounds fine. I'll also throw in htcondor_remote_executable as a possible name.

@jolange
Contributor

jolange commented Jul 25, 2022

I would be happy to help fix this. @matyasselmeci, @lesteve: do you really think we need a temporary wrapper script here?
In #556 I simply prepended commands to the command template, resulting in something like this in the submit file:

Arguments = "-c 'cd <whereever>; source venv/bin/activate; python3 -m distributed.cli.dask_worker tcp://131.169.168.86:43800 --nthreads 1 --memory-limit 2.00GB --name dummy-name --nanny --death-timeout 60'"

This is less readable, but I think it's ok, because the number of commands should not be too large in most cases.
On the other hand, a temporary wrapper script would be nicer to avoid the quoting mess...

I don't have a strong opinion on this, but would like to discuss it before starting to implement something.
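The in-lining approach from #556 can be sketched roughly like this (a hypothetical helper, not the actual patch):

```python
def inline_into_arguments(extra_commands, worker_command):
    """Hypothetical sketch of the in-lining approach: join the extra
    shell commands and the dask-worker invocation with ";" and wrap
    the result into a single Arguments line for the submit file."""
    joined = "; ".join(list(extra_commands) + [worker_command])
    return "Arguments = \"-c '{}'\"".format(joined)
```

Everything then lives in the submit file itself, at the cost of a longer, quoting-sensitive Arguments line.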

@guillaumeeb
Member

guillaumeeb commented Jul 25, 2022

So this came up again in #556.

My opinion is that we should make HTCondorCluster closer to other Cluster implementations, and so have an env_extra that does the same. If it is through a temporary script that's okay. But I'm open to other proposals as I really don't know HTCondor.

cc @riedel @jolange (edit: who beat me to answer here)

@matyasselmeci
Contributor Author

@jolange I think in this case you have already a wrapper script, it's just inlined into the Arguments line :-)

Aside from the quoting problems that you mentioned, having a hard-to-read Arguments line just makes problems harder to debug, so I would prefer having the wrapper script be a separate file.

@jolange
Contributor

jolange commented Jul 25, 2022

I think in this case you have already a wrapper script, it's just inlined into the Arguments line :-)

I would disagree; a wrapper script to me is a separate file -- but let's not focus on semantics here ;-)

having a hard-to-read Arguments line just makes problems harder to debug

Probably for many cases, yes. But here I am not so sure that it is true, because with the in-lining solution you at least have everything self-contained in the submit file, and thus in the output of condor_q -l or condor_history -l. So if users complain to their HTCondor admins, the admins can check more easily what's going on. (And even the users themselves might check the condor output before trying to figure out what was written to a script.) A temporary script is gone by construction. And honestly, the Arguments line will consist of a handful of commands -- I think that's not too hard to read.

Another drawback is that the temporary file has to be transferred. Probably not a big deal, because it is a small file, though.

Again, I am not against the temporary script file solution, I just want to discuss the pros and cons.

@guillaumeeb
Member

A temporary script is gone by construction.

In all other Cluster implementations, we use temporary scripts. It's true that this probably makes it more difficult for cluster admins to debug things, but they'd probably need to be a bit familiar with Dask anyway.

One of the first things to do when debugging dask-jobqueue is to print the job script that gets submitted from your main Python script or notebook, and if needed, submit it by hand.

@jolange
Contributor

jolange commented Jul 26, 2022

Alright, since you seem to agree that the temporary-wrapper-script solution is preferable, I thought a bit about how to realize it. The problem I encounter now is that I can't make that file truly temporary in the sense of "create wrapper script file; run submit command; delete wrapper script file", because of this (from the docs):

HTCondor will transfer these files, potentially delaying this transfer request, if starting the transfer right away would overload the submit machine.

That's indeed a problem with our HTCondor instance, and I think it is a problem in general. If I run this (for htc-test-script.sub, see [1] below),

$ echo -e '#!/usr/bin/bash\necho "temp script"' > htc-test-script-tmp.sh; chmod u+x htc-test-script-tmp.sh; cat htc-test-script-tmp.sh; condor_submit htc-test-script.sub; rm htc-test-script-tmp.sh
#!/usr/bin/bash
echo "temp script"
Submitting job(s).
1 job(s) submitted to cluster 20730236.

the job goes into HOLD state, because the file is gone by the time HTCondor tries to transfer it:

$ condor_q -l 20730236.0 | ack Reason
HoldReason = "Error from <worker node>: SHADOW at <IP> failed to send file(s) to <IP>: error reading from <full path to>/htc-test-script-tmp.sh: (errno 2) No such file or directory; STARTER failed to receive file(s) from <<IP>>"
HoldReasonCode = 13
HoldReasonSubCode = 2

Does anybody have a suggestion how to handle this?

About

In all other Cluster implementations, we use temporary scripts

Right, but the difference is that for those [I don't know if for all, really], you directly hand the temporary file over to the submission command and can delete it right afterwards, right? That's what the .sub file is for HTCondor, but it's not the case for the .sh in this example, where you have to wait for the transfer.


[1] htc-test-script.sub

executable              = htc-test-script-tmp.sh
log                     = htc-test-script.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
queue
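The timing problem can be reproduced without HTCondor at all; this sketch mimics what a contextmanager-style cleanup does to the wrapper script (the condor_submit step is only indicated by a comment):

```python
import os
import tempfile

# Create the wrapper script, as a temporary-file approach would do.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as script:
    script.write("#!/bin/bash\necho 'temp script'\n")
    path = script.name

# ... condor_submit would run here and return immediately ...

os.remove(path)  # cleanup right after submission, as a contextmanager would do

# HTCondor's deferred file transfer would only happen later; by then
# the executable is gone, which is what puts the job on HOLD.
print(os.path.exists(path))
```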

@jacobtomlinson
Member

I'm not entirely familiar with the contents of the temporary script, but if it contains nothing sensitive, it could just be written to /tmp and not cleaned up. We could rely on the operating system to handle that cleanup. That way it would be much more likely to still exist when HTCondor gets around to transferring it.

@guillaumeeb
Member

I don't have a strong preference between the two solutions. If using a temporary script is more complex to handle, then we might want to stay with inlined content at first.

I'm also OK with @jacobtomlinson's suggestion.

I'm not sure when other temporary scripts (and so the HTCondor submit file) are really deleted, see the code here: https://github.com/dask/dask-jobqueue/blob/main/dask_jobqueue/core.py#L334. This uses a contextmanager, so I guess it is deleted right after the submit command is run, which would apparently not work based on your tests.

@jolange
Contributor

jolange commented Jul 26, 2022

written to /tmp and not cleaned up

That might be a solution for many setups, although knowingly cluttering /tmp is not the nicest thing. But for our setup this does not work at all, because the schedds run on separate hosts and not on the submit machines themselves. So the location from which to transfer has to be mounted on the schedd host. That's true e.g. for our home directories (AFS), but obviously not for /tmp.
(I am currently investigating with our HTCondor experts if a transfer "submit node -> schedd host -> worker node" or direct "submit node -> worker node" could be configured at all, but I also don't know if this would be implemented, even if it is possible).

so I guess it is deleted right after the submit command is run, which would apparently not work based on your tests.

Right, that's also my understanding.

@jacobtomlinson
Member

although knowingly cluttering /tmp is not the nicest thing

Heh /tmp is there to be cluttered 😆 .

So the location from which to transfer has to be mounted on the schedd host.

Ah that's a shame.

@jolange
Contributor

jolange commented Jul 26, 2022

Heh /tmp is there to be cluttered 😆

😆

So by now I would prefer the in-lining solution to an extra temporary script. But maybe someone else has a good idea how to realize this.

@guillaumeeb
Member

I'd say go for the in-lining solution for the moment!

jolange added a commit to jolange/dask-jobqueue that referenced this issue Jul 26, 2022
Before only environment variables were considered using HTCondor's
`Environment =` feature, whereas the parameter description generally
says "Other commands to add".

If the `Environment =` is needed, one can still use the generic
`job_extra` parameter to set it.

fixes dask#393
related to dask#323 dask#556
@jolange
Contributor

jolange commented Jul 26, 2022

Ok, I made a proposal in #563. I'd suggest starting the renaming (#323) after we've agreed on the implementation and merged it. (I can also do that if you like.)

@riedel
Member

riedel commented Jul 26, 2022

Could we issue a warning if we detect something like X=1 without an export statement? I think this would be a de facto breaking change; some existing code could break, I guess.

@jolange
Contributor

jolange commented Jul 26, 2022

Yes, that could really be the case, unfortunately. (Although I would consider those cases "wrong" from the beginning, it could be a breaking change in these probably rare cases...)

But issuing a warning would also affect valid uses without "export", like MYDIR=/tmp; cd $MYDIR; ....
And this is potentially hard to differentiate from other occurrences of = in arguments (some_command --variable=3).
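A heuristic for such a warning might look like the following hypothetical sketch; it also shows why differentiating the cases reliably is hard, since compound lines like MYDIR=/tmp; cd $MYDIR would not match at all:

```python
import re

def looks_like_bare_assignment(line):
    """Hypothetical heuristic: flag a line that consists of a single
    KEY=VALUE pair without "export".  Under the old HTCondor behaviour
    this went into the Environment attribute; under the new behaviour
    it would run as a shell command instead."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*=\S*", line.strip()) is not None
```

Compound commands and option arguments containing "=" slip through this check, so any warning based on it would be best-effort only.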

@guillaumeeb
Member

I'm under the impression we are about to make some important changes, which will probably mean a major version update.

So I would say the warning is not mandatory here, but maybe we should add documentation about the change somewhere, perhaps here: https://jobqueue.dask.org/en/latest/examples.html.

guillaumeeb pushed a commit that referenced this issue Aug 1, 2022
* fix behaviour of `env_extra` for HTCondor

Before only environment variables were considered using HTCondor's
`Environment =` feature, whereas the parameter description generally
says "Other commands to add".

If the `Environment =` is needed, one can still use the generic
`job_extra` parameter to set it.

fixes #393
related to #323 #556

* adapt htcondor tests for new `env_extra` behaviour

* adapt htcondor tests for new `env_extra` behaviour

"export" is preserved now

* docs: made "Moab Deployments" heading the same level as the others

* docs: added description of HTCondorCluster + env_extra

- example in the docstring
- description in "example deployments"
- description in "advanced tips and tricks"

* docs: removed the HTCondorCluster section from examples

* formatting according to black and flake8
guillaumeeb added a commit that referenced this issue Aug 7, 2022
…ob scripts (#560)

* Use `--nworkers` instead of deprecated `--nprocs` in the generated job scripts

Fix #559.

* Update the requirements for dask and distributed

This is needed to support the `--nworkers` option.

* Fix behaviour of `env_extra` for HTCondor (#563)

* fix behaviour of `env_extra` for HTCondor

Before only environment variables were considered using HTCondor's
`Environment =` feature, whereas the parameter description generally
says "Other commands to add".

If the `Environment =` is needed, one can still use the generic
`job_extra` parameter to set it.

fixes #393
related to #323 #556

* adapt htcondor tests for new `env_extra` behaviour

* adapt htcondor tests for new `env_extra` behaviour

"export" is preserved now

* docs: made "Moab Deployments" heading the same level as the others

* docs: added description of HTCondorCluster + env_extra

- example in the docstring
- description in "example deployments"
- description in "advanced tips and tricks"

* docs: removed the HTCondorCluster section from examples

* formatting according to black and flake8

* Drop Python 3.7 (#562)

* Drop Python 3.7

* Fix cleanup fixture problem (see distributed#9137)

* Override cleanup distributed fixture, and reconfigure dask-jobqueue when called

* Use power of 2 for the memory checks in tests (see dask#7484)

* Apply Black

* Apply correct version of black...

* conda install Python in HTCondor Docker image

* Fix HTCondor Dockerfile

* Fix PBS Issue

* Add a timeout for every wait_for_workers call

* Fix HTCondor tests, leave room for scheduling cycle to take place for HTCondor

* flake check

* debugging HTCondor tests on CI

* Flake

* reduce negotiator interval for faster job queuing, more debugging logs

* move condor command at the right place

* always run cleanup step, and print more things on HTCondor

* Clean temporary debug and other not necessary modifications

* Disable HTCondor CI for now

* Import loop_in_thread fixture from distributed

* Override method from distributed to make tests pass

Co-authored-by: Johannes Lange <jolange@users.noreply.github.com>
Co-authored-by: Guillaume Eynard-Bontemps <g.eynard.bontemps@gmail.com>