CI for HTCondor using Docker #247

Closed
guillaumeeb opened this issue Mar 7, 2019 · 20 comments

@guillaumeeb
Member

Now that #245 is in, we need some CI testing like what is done for PBS, SGE and Slurm. If someone has any hints or is interested in doing this, please come help!

@kaelancotter
Contributor

I'd be happy to take a stab at this if no one else is working on it. I haven't used HTCondor before, but it looks pretty cool.

@guillaumeeb
Member Author

Sure, that would be great!

Let me know if you need some guidance to begin this.

@jrbourbeau
Member

@kaelancotter FWIW the HTMap project within HTCondor uses a Dockerfile to set up an environment with HTCondor to run their tests (see https://github.com/htcondor/htmap/tree/master/docker). This could serve as a good starting point for us here.

@guillaumeeb
Member Author

@kaelancotter, still motivated?

@kaelancotter
Contributor

Still motivated indeed! Unfortunately I've been unexpectedly low on available bandwidth lately. If someone else wants to step up while I continue to putter along, by all means!

@mivade

mivade commented Aug 26, 2019

Looks like the Dockerfile for HTMap moved here: https://github.com/htcondor/htmap/blob/master/htmap-exec/Dockerfile

@lesteve
Member

lesteve commented Aug 27, 2019

Note the docker image they are using for their CI (in .travis.yml) is this one:
https://github.com/htcondor/htmap/blob/master/tests/_inf/Dockerfile

Just a quick note before I forget: our current test infrastructure for SGE, SLURM and PBS uses a docker-compose setup (this way it looks more like a real cluster, where you have the master node and some compute nodes). If a single Dockerfile is easier to set up for HTCondor, I think that is fine too.

@lesteve
Member

lesteve commented Aug 27, 2019

Here is a proof of concept that shows that a single Dockerfile setup seems promising:

git clone https://github.com/htcondor/htmap
cd htmap
docker build -t htmap-test --file tests/_inf/Dockerfile --build-arg HTCONDOR_VERSION=8.9 \
    --build-arg PYTHON_VERSION=3.7 .
docker run -it htmap-test bash -c 'git clone https://github.com/dask/dask-jobqueue;\
    cd dask-jobqueue;\
    pip install -e .;\
    pytest dask_jobqueue/tests/test_htcondor.py --verbose -E htcondor'

The pytest output shows that test_basic (which needs a real cluster) passes:

============================= test session starts ==============================
platform linux -- Python 3.7.4, pytest-5.1.1, py-1.8.0, pluggy-0.12.0 -- /home/mapper/conda/bin/python
cachedir: .pytest_cache
rootdir: /home/mapper/htmap/dask-jobqueue
plugins: forked-1.0.2, mock-1.10.4, xdist-1.29.0, cov-2.7.1
collected 4 items

dask_jobqueue/tests/test_htcondor.py::test_header PASSED                                    [ 25%]
dask_jobqueue/tests/test_htcondor.py::test_job_script PASSED                                [ 50%]
dask_jobqueue/tests/test_htcondor.py::test_basic PASSED                                     [ 75%]
dask_jobqueue/tests/test_htcondor.py::test_config_name_htcondor_takes_custom_config PASSED  [100%]

============================== 4 passed in 11.50s ==============================

For integrating HTCondor in our CI, here are a few pointers:

  • Dockerfile: reuse most of the htmap Dockerfile and install dask at the end of it (from conda, to be consistent with the CI for the other job schedulers)
  • docker-compose.yml: having a single master entry seems fine. I would still have a docker-compose.yml so that htcondor is set up like the other job schedulers in the CI.
  • ci/htcondor.sh: should be quite close to ci/sge.sh
  • ci/htcondor/start-htcondor.sh: you need a way to make sure that the workers can be seen on the master node.
  • ci/htcondor folder: should be reasonably similar to ci/sge, ci/pbs, etc.

If anything needs clarification, let me know!
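
A minimal sketch of the "workers can be seen on the master node" check, written in Python rather than in start-htcondor.sh itself. It assumes `condor_status` is on the PATH inside the container. The function name `wait_for_slots`, its parameters and its defaults are hypothetical and not part of dask-jobqueue or HTCondor:

```python
import subprocess
import time


def wait_for_slots(min_slots=1, timeout=60.0, interval=2.0, run=subprocess.run):
    """Poll `condor_status` until at least `min_slots` slots are advertised.

    `run` is injectable so the loop can be exercised without HTCondor installed.
    Returns the list of slot names, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        slots = []
        try:
            # `-autoformat Name` prints one slot name per line.
            result = run(
                ["condor_status", "-autoformat", "Name"],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,
            )
        except OSError:
            result = None  # condor_status is missing or cannot be started yet
        if result is not None and result.returncode == 0:
            slots = [line.strip() for line in result.stdout.splitlines() if line.strip()]
        if len(slots) >= min_slots:
            return slots
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d slot(s) visible after %.0f s" % (len(slots), timeout)
            )
        time.sleep(interval)
```

Raising after a deadline, instead of looping forever, means a broken pool shows up as a failed CI step rather than a stalled one.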

@mivade

mivade commented Aug 27, 2019

Just a quick note before I forget: our current test infrastructure for SGE, SLURM and PBS uses a docker-compose setup (this way it looks more like a real cluster, where you have the master node and some compute nodes). If a single Dockerfile is easier to set up for HTCondor, I think that is fine too.

I think the docker compose setup is probably better long term so we can better test a multinode setup, but Condor is pretty easy to get running on a single node. I'll focus on testing as outlined above in a single container for now and we can make a separate PR later to add the multinode version.

@mivade

mivade commented Sep 6, 2019

Sorry for the silence on this. I did try to get a Docker image set up with Condor running, but so far I've been running into difficulties: the script used as an entry point to start Condor hangs indefinitely waiting for everything to come up. I'll keep working at it as I find time.
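
One way to keep a hang like this from stalling CI is to bound the start-up step with a timeout and keep whatever it printed for debugging. A rough sketch that assumes nothing about the entry point beyond it being a command; `run_with_deadline` is a hypothetical helper name:

```python
import subprocess


def run_with_deadline(command, timeout):
    """Run `command` and give up after `timeout` seconds.

    Returns (status, output), where status is "ok", "failed" or "timeout"
    and output is the combined stdout/stderr captured so far.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child on timeout; exc.output may be None.
        partial = exc.output or b""
        return "timeout", partial.decode(errors="replace")
    status = "ok" if completed.returncode == 0 else "failed"
    return status, completed.stdout.decode(errors="replace")
```

With this, a start-up script that never returns yields a "timeout" status plus its partial output, which is usually enough to see where it stopped.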

@lesteve
Member

lesteve commented Sep 6, 2019

I am more than willing to help you on this if you provide a bit more information:

If you provide a branch with your WIP, I will try my best to have a look at it next week.

@mivade

mivade commented Sep 9, 2019

My branch is located here. Rather than using the Dockerfile from htmap directly, I attempted to slim it down to only include what was needed to get Condor installed and running (note it doesn't even install Dask into the image yet). When I tried to run it using the entrypoint.sh file (borrowed from htmap), Condor seemingly never started: it just said "HTCondor is starting..." and then nothing else happened.

@lesteve
Member

lesteve commented Sep 11, 2019

Rather than using the Dockerfile from htmap directly, I attempted to slim it down to only include what was needed to get Condor installed and running

I probably won't have time to look at this before next week. If I were to look at it I would do it in the reverse order:

  1. start from the htmap docker image, since #247 (comment) was quite encouraging
  2. once you get something working try to slim down the htmap image

@riedel
Member

riedel commented Apr 11, 2020

Did you see that @matyasselmeci has in the meantime created a whole set of docker containers (https://github.com/htcondor/htcondor/tree/master/build/docker/services)? In particular, he seems to maintain a base image for different versions (it would be great to have a historical one for the CI). However, I could not find a base image pushed to Docker Hub yet.

IMHO they would be the easiest way to go if available (no need to maintain them here). I guess there is still some documentation missing and the work is not finished yet. I didn't find a compose file yet, but I might give it a try (the docs for the execute node give the necessary hints, I guess).

@lesteve
Member

lesteve commented Apr 13, 2020

Great to hear! Last time I tried it seemed like adding HTCondor to the CI was definitely within reach.

About docker-compose, this is not a requirement at all. If you can get it to work with a single docker image, this would probably be simpler.

We use docker-compose for historical reasons and also because it allows us to test some edge cases (recently I added a test in #400 for when you have to use a different interface on the worker and on the scheduler).

For example, Dask-Gateway uses a single docker image in their CI.

@matyasselmeci
Contributor

We put images for htcondor/mini and htcondor/execute up on Dockerhub but they are very much in the "technology preview" stage. htcondor/mini is a single-machine all-in-one image, so if you don't need to test multi-machine support, you could use that.

We welcome comments and suggestions on how to improve those images -- there's a lot of room for improvement and we'd like to know what direction to go in.

@lesteve
Member

lesteve commented Apr 14, 2020

I know this is a bit much to ask, but given that I am unlikely to be able to look at this in the near future, I would encourage one of the people involved to have a go at it.

Here are some steps to help you get started (maybe #247 (comment) can also help fill in the blanks). Do let me know if you get stuck:

  • use the single-docker image which should be simpler to setup
  • look in the ci folder to see how it is done; ci/slurm is probably a good place to start. You need a ci/htcondor/htcondor.sh that should be something like this (not tested, but hopefully you can see what I mean):
#!/usr/bin/env bash

function jobqueue_before_install {
    # start the docker container in the background, with a name so later steps can refer to it
    # (mounting the checkout at /dask-jobqueue is an assumption; the steps below expect the sources there)
    docker run -d --name htcondor-container-name -v "$PWD":/dask-jobqueue htcondor/mini
}

function jobqueue_install {
    docker exec -it htcondor-container-name /bin/bash -c "cd /dask-jobqueue; pip install -e ."
}

function jobqueue_script {
    docker exec -it htcondor-container-name /bin/bash -c "pytest /dask-jobqueue/dask_jobqueue --verbose -E htcondor -s"
}

function jobqueue_after_script {
    # do something useful for debugging here if you think it is worth it
    : # no-op so that the function body is not empty (bash rejects an empty body)
}
  • add a simple test in dask_jobqueue/tests/test_htcondor.py that needs a real cluster i.e. that does a .scale (it needs a pytest.mark.env('htcondor') decorator). Simplest thing to do is to copy one from the slurm test e.g. test_basic and test_adaptive.
  • open a PR, make sure the htcondor tests that need a cluster run, fix the problems if you have some, get it merged
  • celebrate! Congrats, HTCondor is now a first-class citizen in Dask-Jobqueue
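
The core of a test that "does a .scale" could look roughly like the sketch below. The helper name `check_cluster_runs_tasks` and its defaults are hypothetical; it only relies on `cluster.scale`, `Client.wait_for_workers`, `Client.submit` and `Future.result`. The resource values in the usage comment are assumptions as well:

```python
def check_cluster_runs_tasks(cluster, client, n_workers=2, timeout=60):
    """Scale `cluster`, wait for the workers, then run one trivial task."""
    # Ask the job scheduler for workers; this is the part that needs a real cluster.
    cluster.scale(n_workers)
    # Block until the workers have registered with the Dask scheduler.
    client.wait_for_workers(n_workers, timeout=timeout)
    # Submit a tiny computation and check that a worker actually executed it.
    future = client.submit(lambda x: x + 1, 10)
    result = future.result(timeout)
    assert result == 11, result
    return result


# Hypothetical usage inside a test decorated with pytest.mark.env("htcondor"):
#     with HTCondorCluster(cores=1, memory="100MB", disk="100MB") as cluster:
#         with Client(cluster) as client:
#             check_cluster_runs_tasks(cluster, client)
```

Keeping the check free of HTCondor-specific calls means the same body can be compared directly against the slurm tests it is copied from.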

@riedel
Member

riedel commented Apr 15, 2020

This is totally not an unreasonable request!

The problem is mostly my personal availability. I think it could be done really quickly (I need to get familiar with the testing anyway if I finally want to get the stuff from #411 into the code base, since maintaining a fork is more work).

I will also try to get a student worker at our lab to support this work. That could accelerate things a lot, but it takes a bit of time. We are really grateful for your efforts and happy to support.

riedel added a commit to riedel/dask-jobqueue that referenced this issue Apr 25, 2020
using minicondor image and single docker

Signed-off-by: Till Riedel <riedel@teco.edu>
@riedel
Member

riedel commented May 6, 2020

I guess this can be closed for now with #420 merged. Will open a few other issues in order to improve/align the support (Dockerfile)

@lesteve
Member

lesteve commented May 6, 2020

Great to see that the Triage permissions work!
