Dask JobQueue and TCP connections between login and compute nodes #354
The Dask scheduler needs to connect by TCP to the Dask workers. The Dask scheduler runs in the process where you create your cluster object, typically on the login node. A hacky work-around would be to start an interactive job and launch the same script / Jupyter notebook inside your interactive job. That may work. Another longer-term fix would be #186, which seems to have been created with your use case in mind (in particular #186 (comment)). I have to say, I don't think anyone is planning to work on this in the medium term. Out of interest, do you have an idea why no TCP connections between login and compute nodes are allowed? I am still learning about all the possible configuration tweaks of HPC clusters; this feels like quite an endless task ...
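To make the connection pattern concrete, here is a minimal stdlib sketch (plain `socket`, no Dask involved, everything on localhost): one thread plays the scheduler listening on a port, and the main thread plays a worker dialing back. On a restricted cluster, it is this `connect` step between node types that a firewall would kill.

```python
import socket
import threading

# Toy stand-in for the scheduler: accept one connection and echo an ack.
def scheduler(server_sock):
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"ack:" + data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # on a real cluster: the login node's address
server.listen(1)
host, port = server.getsockname()

t = threading.Thread(target=scheduler, args=(server,))
t.start()

# Toy stand-in for a worker: connect back to the scheduler's host:port.
# This is the step that fails when TCP between node types is blocked.
with socket.create_connection((host, port), timeout=5) as c:
    c.sendall(b"hello")
    reply = c.recv(1024)

t.join()
server.close()
print(reply.decode())  # ack:hello
```

Of course on a real cluster the two ends run on different machines, which is exactly why firewall rules between login and compute nodes matter.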
As always, a PR would be more than welcome!
Not that hacky; on platforms with lots of users, it may be the best thing to do :)!
I guess hacky was not the best word; I meant more "not as convenient". For example:
@orbitfold do you know whether all the ports are blocked on your problematic cluster? The reason I am asking is that in #355 only a range of ports is blocked. I am trying to get a better feeling for the possible configurations on different clusters. Maybe @guillaumeeb has some sys-admin perspective on how common TCP/IP restrictions between login and compute nodes are, and whether a partial or a total port restriction is more likely.
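One quick way to tell a partial restriction from a total one is to probe a port range from the login node toward a compute node. A stdlib sketch of such a probe (the hostname and port range in the commented example are placeholders, not values from this thread):

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage from a login node; "compute-node-01" is a placeholder.
# for port in range(8786, 8800):
#     print(port, port_open("compute-node-01", port))
```

If only a range is blocked, the scan shows a mix of open and closed ports; if everything times out, the restriction is likely total (note a closed-but-allowed port usually fails fast with "connection refused", while a firewalled one tends to hang until the timeout).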
Apparently they don't want a pilot-job type situation where someone submits a large job that then orchestrates the computations. They want each process to be submitted via the batch system. I guess they don't trust you to do your own load balancing. At least that is what we were told by an admin on one such system.
I'm more than happy to contribute, but I'll have to piece together what needs to be done here.
I think one really cool thing you could do is check whether the simplest work-around we suggested works. The idea is to run your script or notebook inside an interactive job. If that works, it would be great to document the work-around (related to #356).
So I personally only have access to a cluster where it works. However, I have been told that of the two problem clusters, the work-around worked on one of them. On the other it failed to establish a TCP connection, which would imply they disallow TCP connections even between compute nodes. Which is nuts, but that is the world we live in. If you need more details I'm happy to try and provide them.
I also checked whether it works if I run the script on our cluster (where it already works) in interactive mode, and it does. So I think as long as the cluster admins let you open TCP connections between compute nodes, this is a valid work-around.
Thanks a lot for your feedback. It would be great if you wanted to contribute some docs with this content; I am thinking mostly of login node to compute node TCP/IP port restrictions, plus the interactive-job work-around.
If you haven't done it already, I would suggest contacting IT about this and explaining your use case. You can certainly use examples of "serious" clusters, like Cheyenne (look at the Pangeo doc) or Summit (first cluster in the Top 500 IIRC), see https://blog.dask.org/2019/08/28/dask-on-summit. The Pangeo community may be a good place to get involved as well, if you haven't already. It can be frustrating at times, but some issues cannot be fixed technically, only socially or politically, whatever you want to call it. I fully sympathise with the frustration part: I am currently trying to get an account on the newfangled IT cluster for Artificial Intelligence in France, and let me tell you, there is some room for improvement in the user-experience area ... Out of curiosity, could you give us a few more clues about the clusters you were mentioning? Info like: name, geographical location, main scientific domain of the cluster's users if any, etc ... My goal here is to get a better picture of the variability in HPC situations.
I personally work for LRZ in the Munich area, and we host SuperMUC-NG and a number of smaller clusters. There is no single focus scientific domain. They are mostly Skylake and Haswell nodes. We use SLURM for all clusters. I asked a colleague to provide info about the other clusters.
Thanks a lot, that was exactly the level of information I was after.
I've really no idea how common this is, but I imagine this would be a near-total restriction. In our setup, we've got login nodes that automatically redirect to what we call interactive nodes upon connection (but we can stay on these nodes indefinitely; it's not interactive in the interactive-job sense). Login nodes are highly secured, but interactive ones are fully open to compute nodes.
Closing in favor of #356, as what is needed here is documentation.
Hello all, sorry for the noob question. I am trying to understand why my software works on some clusters and not on others. Does dask-jobqueue require a TCP connection between login and compute nodes? That seems to be the difference between the working and non-working setups right now. If it does require one and it is not allowed, is there a work-around?