This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Unable to start SGE workers via docker-compose on Windows 8.1 #64

Open
azjps opened this issue Mar 21, 2018 · 6 comments

Comments

@azjps
Collaborator

azjps commented Mar 21, 2018

I'm having trouble reproducing the unit tests on Windows based on the given instructions. I've done the following:

  • Replaced all carriage returns in all *.sh files in the top-level directory of this project.
  • Ran docker-compose build --no-cache

When I run ./start-sge.sh, it prints "Waiting for SGE slots to become available" and keeps waiting indefinitely:

$ ./start-sge.sh
sge_master is up-to-date
slave_two is up-to-date
slave_one is up-to-date
Waiting for SGE slots to become available
# output of docker exec -it sge_master qhost;
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
# and repeats

# docker containers are running
$ docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                     NAMES
efa6eebce34a        daskdrmaa_slave-one   "bash /run-slave.sh"     About an hour ago   Up About an hour                              slave_one
9fae83c48356        daskdrmaa_slave-two   "bash /run-slave.sh"     About an hour ago   Up About an hour                              slave_two
b13940296988        daskdrmaa_master      "bash -x /run-master"    About an hour ago   Up About an hour    6444-6446/tcp, 8000/tcp   sge_master

# example of output when I try to run unit tests
$ docker exec -it sge_master /bin/bash -c "cd /dask-drmaa; py.test dask_drmaa -sv -k test_adaptive_memory"
============================= test session starts ==============================
platform linux -- Python 3.6.4, pytest-3.4.2, py-1.5.2, pluggy-0.6.0 -- /opt/anaconda/bin/python
cachedir: .pytest_cache
rootdir: /dask-drmaa, inifile:
collected 23 items

dask_drmaa/tests/test_adaptive.py::test_adaptive_memory

distributed.utils - ERROR - code 17: denied: host "sge_master" is no submit host
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/utils.py", line 623, in log_errors
    yield
  File "/dask-drmaa/dask_drmaa/core.py", line 203, in start_workers
    ids = get_session().runBulkJobs(jt, 1, n, 1)
  File "/opt/anaconda/lib/python3.6/site-packages/drmaa/session.py", line 340, in runBulkJobs
    return list(run_bulk_job(jobTemplate, beginIndex, endIndex, step))
  File "/opt/anaconda/lib/python3.6/site-packages/drmaa/helpers.py", line 286, in run_bulk_job
    c(drmaa_run_bulk_jobs, jids, jt, start, end, incr)
  File "/opt/anaconda/lib/python3.6/site-packages/drmaa/helpers.py", line 303, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/opt/anaconda/lib/python3.6/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: denied: host "sge_master" is no submit host

Unfortunately I have no familiarity with SGE. Any suggestions as to how I can debug/fix this? Apologies if this is not appropriate for this issue tracker.

@jakirkham
Member

Sure. I think this is ok here (especially since it's our docker-compose setup :).

Could you please try the following SGE command to see what happens?

docker exec -it sge_master /bin/bash -c "qstat -f"

@azjps
Collaborator Author

azjps commented Mar 21, 2018

Doesn't seem to give any output:

Dev@az MINGW64 /dask-drmaa ((cc20f43...))
$ docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                     NAMES
efa6eebce34a        daskdrmaa_slave-one   "bash /run-slave.sh"     2 hours ago         Up 2 hours                                    slave_one
9fae83c48356        daskdrmaa_slave-two   "bash /run-slave.sh"     2 hours ago         Up 2 hours                                    slave_two
b13940296988        daskdrmaa_master      "bash -x /run-master"    2 hours ago         Up About a minute   6444-6446/tcp, 8000/tcp   sge_master

Dev@az MINGW64 /dask-drmaa ((cc20f43...))
$ docker exec -it sge_master /bin/bash -cx "qstat -f"
+ qstat -f
# No output

@jakirkham
Member

Hmm...that suggests the containers are not being connected for some reason. By contrast, this is what I see when running the same command:

$ docker exec -it sge_master /bin/bash -c "qstat -f"
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
dask.q@slave_one               BIP   0/0/4          0.78     lx26-amd64    
---------------------------------------------------------------------------------
dask.q@slave_two               BIP   0/0/4          0.79     lx26-amd64    

Is anything else trying to use the ports allocated to those Docker containers on your machine?

Also, I'm not sure how you are running Docker. I recall having various issues when using Docker Machine over a VPN, for instance. I would expect Docker for Windows to avoid many of these issues.

@azjps
Collaborator Author

azjps commented Mar 23, 2018

I am using Docker Toolbox (for Windows pre-10). No VPN, and those ports should be free.

I ran docker-compose build --no-cache again and made some progress -- this time ./start-sge.sh completes successfully:

$ docker exec -it sge_master /bin/bash -cx "qhost"
+ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
slave_one               lx26-amd64      1  0.36  995.8M  282.1M    1.1G    8.6M
slave_two               lx26-amd64      1  0.36  995.8M  282.1M    1.1G    8.6M
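
The two qhost listings in this thread differ only in whether execution hosts appear below the global row. As a minimal sketch (not part of dask-drmaa; the function name registered_hosts is hypothetical, and it assumes the eight-column qhost layout shown above), that check could be scripted roughly like this:

```python
from typing import List


def registered_hosts(qhost_output: str) -> List[str]:
    """Return execution host names found in `qhost` output.

    Assumes the eight-column table layout shown in this thread. The header
    row, the dashed rule, and the `global` pseudo-host are skipped.
    """
    hosts = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # The header and data rows have exactly eight columns; the dashed
        # rule, blank lines, and shell trace lines such as "+ qhost" do not.
        if len(fields) != 8:
            continue
        if fields[0] in ("HOSTNAME", "global"):
            continue
        hosts.append(fields[0])
    return hosts


if __name__ == "__main__":
    sample = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
slave_one               lx26-amd64      1  0.36  995.8M  282.1M    1.1G    8.6M
slave_two               lx26-amd64      1  0.36  995.8M  282.1M    1.1G    8.6M
"""
    print(registered_hosts(sample))
```

An empty list corresponds to the earlier run, where qhost listed only the global row.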

However, when I try to run the unit tests, they encounter an error when they try to submit a DRMAA job:

$ docker exec -it sge_master /bin/bash -c "cd /dask-drmaa; py.test dask_drmaa -sv -k test_adaptive_memory"
# following error repeats ..
  File "/opt/anaconda/lib/python3.6/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: warning: root your job is not allowed to run in any queue
error: no suitable queues

And qstat -f unfortunately still doesn't show anything:

$ docker exec -it sge_master /bin/bash -cx "qstat -f"
+ qstat -f
# no output

Maybe there's some configuration step that I need to run again?
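
On the configuration question, a few read-only qconf queries show what the master currently knows about submit hosts, execution hosts, and queues. The following is a rough sketch rather than a fix: it assumes the container name sge_master from the docker ps output above and that qconf is on the PATH inside the container in the same way qstat is. The helper names docker_exec_command and run_checks are hypothetical.

```python
import subprocess
from typing import Dict, List

CONTAINER = "sge_master"  # container name as shown by `docker ps` in this thread

# Read-only SGE queries; none of these change the configuration.
CHECKS: Dict[str, str] = {
    "submit hosts": "qconf -ss",
    "execution hosts": "qconf -sel",
    "cluster queues": "qconf -sql",
}


def docker_exec_command(sge_command: str, container: str = CONTAINER) -> List[str]:
    """Build the `docker exec` argument list that runs one command in the container."""
    return ["docker", "exec", container, "/bin/bash", "-c", sge_command]


def run_checks() -> None:
    """Run each query and print whatever the master reports (requires Docker)."""
    for label, sge_command in CHECKS.items():
        result = subprocess.run(
            docker_exec_command(sge_command),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        print("== {} ({})".format(label, sge_command))
        print(result.stdout.strip() or "(no output)")


if __name__ == "__main__":
    run_checks()
```

If sge_master is missing from the submit-host list, that would match the earlier "is no submit host" error; in SGE a submit host is normally added with qconf -as <hostname>. An empty queue list would be consistent with both the "no suitable queues" error and the empty qstat -f output. Whether the images' setup scripts are meant to handle either step is not confirmed anywhere in this thread.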

@jakirkham
Member

So are you using VirtualBox then, or a different VM?

@azjps
Collaborator Author

azjps commented Apr 2, 2018

Yes, using VirtualBox.
