Incorrect description for env_extra for HTCondorCluster #393
Thanks for the info! Not sure what to do about this I have to say. I would be tempted to say that there should be a HTCondor-specific argument to pass environment variables to HTCondor and that I know nothing about HTCondor so that may not make that much sense ... I think it would help a lot if you had examples of
This is a complicated answer because HTCondor has different semantics than the other batch schedulers. In PBS for example, you've got a shell script with some magic comments on top for controlling the batch scheduling, so essentially you have the entire job in a single file. In HTCondor, the commands for controlling the scheduling are in a separate file, called a submit file, which is mostly a key=value file (aside from the
So if you want to run multiple remote commands in a single job, put them in a shell script and use that script as the How this applies to your questions:
Does that help? Let me know if you have further questions. |
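To make the two-file layout described above concrete, here is a minimal sketch in Python. It only builds text and is not taken from dask-jobqueue: the helper `render_submit_description`, the file name `wrapper.sh`, the activation path, and the scheduler address are all hypothetical. `Executable`, `Arguments`, `Environment` and `Queue` are the HTCondor submit commands it illustrates.

```python
def render_submit_description(executable, arguments="", extra=None):
    """Render a minimal HTCondor submit description as key = value lines."""
    lines = [f"Executable = {executable}"]
    if arguments:
        lines.append(f"Arguments = {arguments}")
    for key, value in (extra or {}).items():
        lines.append(f"{key} = {value}")
    # The Queue statement closes the description and queues one job.
    lines.append("Queue")
    return "\n".join(lines) + "\n"


# Hypothetical wrapper script: several remote commands live in one shell
# script, and that script is what the submit file names as the Executable.
wrapper_script = "\n".join(
    [
        "#!/bin/bash",
        "source /path/to/env/bin/activate",  # hypothetical setup command
        'exec dask-worker "$@"',
    ]
) + "\n"

submit_text = render_submit_description(
    executable="wrapper.sh",  # hypothetical file name
    arguments="tcp://scheduler:8786",  # hypothetical scheduler address
    extra={"Environment": '"FOO=bar"'},
)
print(submit_text)
```

The scheduling directives stay in the submit description, while everything that needs a shell sits in the script.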
Thanks that helps a lot! It would be great if If I understand your message correctly, it feels like one way to do that would be to create a temporary wrapper shell script that is used as the HTCondor executable. Maybe something useful would be to add an additional method to show the wrapper script as well e.g. |
That sounds fine. I'll also throw in |
I would be happy to help fixing this. @matyasselmeci , @lesteve Do you really think we need a temporary wrapper script here?
This is less readable, but I think it's ok, because the number of commands should not be too large in most cases. I don't have a strong opinion on this, but would like to discuss it before starting to implement something. |
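For readers following along, the in-lining idea could look roughly like the sketch below. It is an illustration under assumptions, not the code that was eventually merged: `quote_arguments` and `inline_commands` are hypothetical helpers, `/bin/sh` as the executable is an assumption, and the quoting follows HTCondor's "new syntax" for `Arguments` (whole value in double quotes, literal double quotes doubled, arguments containing whitespace wrapped in single quotes, literal single quotes doubled inside them).

```python
import re


def quote_arguments(args):
    """Quote a list of arguments for an HTCondor 'Arguments' line (new syntax)."""
    quoted = []
    for arg in args:
        escaped = arg.replace('"', '""')  # literal double quotes are doubled
        if not escaped or re.search(r"\s", escaped) or "'" in escaped:
            # Whitespace, single quotes and empty arguments need single quotes.
            escaped = "'" + escaped.replace("'", "''") + "'"
        quoted.append(escaped)
    return '"' + " ".join(quoted) + '"'


def inline_commands(env_extra, worker_command):
    """Join the extra commands and the worker command into one 'sh -c' call."""
    script = "; ".join(list(env_extra) + [worker_command])
    return "\n".join(
        [
            "Executable = /bin/sh",  # assumption: a POSIX shell on the worker node
            "Arguments = " + quote_arguments(["-c", script]),
        ]
    )


print(inline_commands(["export FOO=bar"], "dask-worker tcp://scheduler:8786"))
```

The readability concern raised above shows up as soon as a command itself contains quotes, because every quote is doubled in the resulting `Arguments` line.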
So this came up again in #556. My opinion is that we should make HTCondorCluster closer to other Cluster implementations, and so have an env_extra that does the same. If it is through a temporary script that's okay. But I'm open to other proposals as I really don't know HTCondor. |
@jolange I think in this case you have already a wrapper script, it's just inlined into the Arguments line :-) Aside from the quoting problems that you mentioned, having a hard-to-read Arguments line just makes problems harder to debug, so I would prefer having the wrapper script be a separate file. |
I would disagree; a wrapper script to me is a separate file -- but let's not focus on semantics here ;-)
Probably for many cases, yes. But here I am not so sure that it is true, because with the in-lining solution you at least have everything self-contained in the submit file and thus in the output of Another drawback is that the temporary file has to be transferred. Probably not a big deal, because it is a small file, though. Again, I am not against the temporary script file solution, I just want to discuss the pros and cons. |
In all other Cluster implementations, we use temporary scripts. It's true that this probably makes it difficult for cluster admins to debug things, but anyway they'd probably need to be a bit familiar with Dask. One of the first things to do when debugging dask-jobqueue is to print the job that gets submitted from your main Python script or notebook and, if needed, submit it by hand. |
Alright, since you seem to agree to prefer the temporary-wrapper-script solution, I thought a bit about how to realize this. The problem I encounter now is that I can't make that file really temporary in the sense "create wrapper script file; run submit command; delete wrapper script file", because of this (docs)
That's indeed a problem with our HTCondor instance and I think it is in general. If I run this (for
$ echo '#!/usr/bin/bash\necho "temp script"' > htc-test-script-tmp.sh; chmod u+x htc-test-script-tmp.sh; cat htc-test-script-tmp.sh; condor_submit htc-test-script.sub; rm htc-test-script-tmp.sh
#!/usr/bin/bash
echo "temp script"
Submitting job(s).
1 job(s) submitted to cluster 20730236.
the job goes in HOLD state, because the file is gone by the time it tries to transfer:
$ condor_q -l 20730236.0 | ack Reason
HoldReason = "Error from <worker node>: SHADOW at <IP> failed to send file(s) to <IP>: error reading from <full path to>/htc-test-script-tmp.sh: (errno 2) No such file or directory; STARTER failed to receive file(s) from <<IP>>"
HoldReasonCode = 13
HoldReasonSubCode = 2
Does anybody have a suggestion how to handle this? About
Right, but the difference is that for those [I don't know if for all, really], you directly hand over the temporary file to the submission command and after that can delete it, right? That's what the [1]
I'm not entirely familiar with the contents of the temporary script but if it contains nothing sensitive it could just be written to |
I haven't a strong preference between the two solutions. If using a temporary script is more complex to handle, then we might want to stay with inlined content at first. I'm also OK with @jacobtomlinson's suggestion. I'm not sure when other temporary scripts (and so the HTCondor submit file) are really deleted, see the code here: https://github.com/dask/dask-jobqueue/blob/main/dask_jobqueue/core.py#L334. This uses a contextmanager, so I guess it is deleted right after the submit command is run, which would apparently not work based on your tests. |
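The lifetime problem can be reproduced without HTCondor. The sketch below is an illustration only and not a copy of the code in `core.py`; `temporary_script` is a hypothetical helper. It shows that a file managed by a context manager is gone as soon as the `with` block ends, which is before HTCondor would get around to transferring it.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_script(content):
    """Write content to a temporary file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=".sh")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        yield path
    finally:
        os.remove(path)


with temporary_script('#!/usr/bin/bash\necho "temp script"\n') as path:
    # A real setup would run the submit command here; submission returns
    # immediately, while the file transfer happens later.
    existed_during_submit = os.path.exists(path)

exists_afterwards = os.path.exists(path)
print(existed_during_submit, exists_afterwards)
```

Handing the file content directly to the submission command avoids the issue; anything HTCondor has to transfer later must outlive the block.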
That might be a solution for many setups, although knowingly cluttering
Right, that's also my understanding. |
Heh
Ah that's a shame. |
😆 So by now I would prefer the in-lining solution to an extra temporary script. But maybe someone else has a good idea how to realize this. |
I'd say go for the in-lining solution for the moment! |
Before only environment variables were considered using HTCondor's `Environment =` feature, whereas the parameter description generally says "Other commands to add". If the `Environment =` is needed, one can still use the generic `job_extra` parameter to set it. fixes dask#393 related to dask#323 dask#556
Could we issue a warning if we detect sth like |
Yes, that could really be the case, unfortunately. (Although I would consider those cases "wrong" from the beginning, but it could be a breaking change in these probably rare cases...) But issuing a warning would also affect valid uses without "export" like |
I'm under the impression we are about to make some important changes, which will probably mean a major version update. So I would say the warning is not mandatory here. But maybe we should add documentation about the change somewhere, maybe here: https://jobqueue.dask.org/en/latest/examples.html. |
* fix behaviour of `env_extra` for HTCondor
  Before only environment variables were considered using HTCondor's `Environment =` feature, whereas the parameter description generally says "Other commands to add". If the `Environment =` is needed, one can still use the generic `job_extra` parameter to set it.
  fixes #393, related to #323 #556
* adapt htcondor tests for new `env_extra` behaviour
* adapt htcondor tests for new `env_extra` behaviour
  "export" is preserved now
* docs: made "Moab Deployments" heading the same level as the others
* docs: added description of HTCondorCluster + env_extra
  - example in the docstring
  - description in "example deployments"
  - description in "advanced tips and tricks"
* docs: removed the HTCondorCluster section from examples
* formatting according to black and flake8
…ob scripts (#560)
* Use `--nworkers` instead of deprecated `--nprocs` in the generated job scripts
  Fix #559.
* Update the requirements for dask and distributed
  This is needed to support the `--nworkers` option.
* Fix behaviour of `env_extra` for HTCondor (#563)
* fix behaviour of `env_extra` for HTCondor
  Before only environment variables were considered using HTCondor's `Environment =` feature, whereas the parameter description generally says "Other commands to add". If the `Environment =` is needed, one can still use the generic `job_extra` parameter to set it.
  fixes #393, related to #323 #556
* adapt htcondor tests for new `env_extra` behaviour
* adapt htcondor tests for new `env_extra` behaviour
  "export" is preserved now
* docs: made "Moab Deployments" heading the same level as the others
* docs: added description of HTCondorCluster + env_extra
  - example in the docstring
  - description in "example deployments"
  - description in "advanced tips and tricks"
* docs: removed the HTCondorCluster section from examples
* formatting according to black and flake8
* Drop Python 3.7 (#562)
* Drop Python 3.7
* Fix cleanup fixture problem (see fistributed#9137)
* Override cleanup distributed fixture, and reconfigure dask-jobqueue when called
* Use power of 2 for the memory checks in tests (see dask#7484)
* Apply Black
* Apply correct version of black...
* conda install Python in HTCondor Docker image
* Fix HTCondor Dockerfile
* Fix PBS Issue
* Add a timeout for every wait_for_workers call
* Fix HTCondor tests, leave room for scheduling cycle to take place for HTCondor
* flake check
* debugging HTCondor tests on CI
* Flake
* reduce negotiator interval for faster job queuing, more debugging logs
* move condor command at the right place
* always run cleanup step, and print more things on HTCondor
* Clean temporary debug and other not necessary modifications
* Disable HTCondor CI for now
* Import loop_in_thread fixture from distributed
* Override method from distributed to make tests pass
Co-authored-by: Johannes Lange <jolange@users.noreply.github.com>
Co-authored-by: Guillaume Eynard-Bontemps <g.eynard.bontemps@gmail.com>
Hi,
The description for env_extra in HTCondorCluster is not correct: the job that HTCondorCluster creates calls dask-worker directly instead of through a bash wrapper script, so you cannot put arbitrary shell commands into env_extra.
The interface supports environment variables as `key=value` pairs, which will be inserted into dask-worker's environment (via the "Environment" attribute in the submit file). (For consistency, you can write `export foo=bar`, but the word "export" will be ignored.)
This is also important to keep in mind with regards to #323; renaming env_extra to job_script_extra or similar would be even more inaccurate (for the HTCondor case anyway).
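The behaviour described in this report can be sketched as follows. This is an approximation written for illustration, not the library's actual code: `env_lines_to_dict` and `quote_environment` are hypothetical helpers, and the quoting assumes HTCondor's "new syntax" for the `Environment` attribute (whole value in double quotes, values with whitespace in single quotes, literal quotes doubled).

```python
import re


def env_lines_to_dict(lines):
    """Parse 'key=value' lines, ignoring an optional leading 'export'."""
    env = {}
    for line in lines:
        line = line.strip()
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # Anything that is not key=value (an arbitrary shell command)
            # cannot be expressed as an Environment entry.
            raise ValueError(f"not a key=value pair: {line!r}")
        env[key.strip()] = value
    return env


def quote_environment(env):
    """Format a dict as the value of an HTCondor 'Environment' attribute."""
    entries = []
    for key, value in env.items():
        value = str(value).replace('"', '""')
        if re.search(r"\s", value) or "'" in value:
            value = "'" + value.replace("'", "''") + "'"
        entries.append(f"{key}={value}")
    return '"' + " ".join(entries) + '"'


env = env_lines_to_dict(["export foo=bar", "msg=hello world"])
print("Environment = " + quote_environment(env))
```

A line such as `module load something` has no `key=value` form, so the sketch rejects it with a `ValueError`, which is the limitation this issue points out.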