AWS Batch scheduler #208
Comments
Starting to question this, actually. Now that I actually went through an AWS Batch tutorial and successfully ran some R code in a job, I appreciate how straightforward it is.
AWS is definitely on the list of things I want to support, so thank you for raising this! That said, I've got zero experience with it, so I don't even know which steps I need to consider, and I'm afraid I won't be able to read up on this very soon. Should we coordinate roadmaps a bit with yours and @HenrikBengtsson's?
Awesome! I think all of us care about this, and I would love to coordinate. Also cc @davidkretch. I have been experimenting some with Batch through the console, and thanks to you I can now send jobs to workers inside a Docker container via SSH. SSH into Batch jobs sounds trickier (ref: paws-r/paws#330, https://stackoverflow.com/questions/64342893/ssh-into-aws-batch-jobs). However, the user guide repeatedly mentions the possibility of SSH for Batch container instances, so I am not convinced that it is impossible.
No worries, I think I have the least general knowledge here.
Here's another idea: for the moment, given the unexpected difficulty of tunneling into AWS Batch jobs, why not drop one level lower and work with EC2 instances directly? From https://gist.github.com/DavisVaughan/865d95cf0101c24df27b37f4047dd2e5, EC2 seems easier for us than Batch, and tackling the former first may help us work up to the latter later.
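For reference, a minimal sketch of what launching a single EC2 worker directly with `paws` could look like. The AMI ID, key pair, and security group below are placeholders (not values from this thread), and the sketch assumes AWS credentials and a default region are already configured.

```r
# Sketch: start one EC2 instance to act as a worker, then terminate it later.
ec2 <- paws::ec2()

res <- ec2$run_instances(
  ImageId          = "ami-0123456789abcdef0",      # hypothetical AMI with R installed
  InstanceType     = "t3.medium",
  KeyName          = "my-key-pair",                # hypothetical key pair for SSH access
  SecurityGroupIds = list("sg-0123456789abcdef0"), # hypothetical security group allowing SSH
  MinCount         = 1,
  MaxCount         = 1
)
instance_id <- res$Instances[[1]]$InstanceId

# ...SSH in and run work once the instance is up...

ec2$terminate_instances(InstanceIds = list(instance_id))
```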
There's a lot we can do to improve on https://gist.github.com/DavisVaughan/865d95cf0101c24df27b37f4047dd2e5, such as:
For the kinds of workflows we deal with, the only added value I see in Batch relative to EC2 is cost-optimized resource provisioning, e.g. waiting for spot instances to get cheap before submitting jobs. Not sure we can do that with EC2 alone. (That and the ability to automatically connect to S3, which …)
It is worth noting that in the AWS Batch console, I cannot select the …
Hi @wlandau: with respect to testing on AWS, I highly recommend applying for the AWS Open Source promotional credits program, which is described in more detail here. We got this for the Paws package. I haven't put nearly as much thought into this, but if R is going to be in control of starting and stopping instances, then I agree that Batch doesn't get you much extra. With respect to Batch, I think some of its pros and cons are:

Pros:

Cons:
To use spot prices with EC2, you could get current spot prices with the DescribeSpotPriceHistory API call, but I think supporting spot prices would likely be a lot of work, since it would also have to handle things like restarting jobs when instances are stopped due to changing prices.
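For reference, a minimal sketch of querying recent spot prices from R with `paws` (the instance type and product description are placeholder choices, and the sketch assumes configured AWS credentials):

```r
# Sketch: look up recent spot prices for one instance type via the EC2 API.
ec2 <- paws::ec2()
history <- ec2$describe_spot_price_history(
  InstanceTypes       = list("m5.large"),    # placeholder instance type
  ProductDescriptions = list("Linux/UNIX"),
  StartTime           = Sys.time() - 60 * 60 # prices from the last hour
)
# Each entry reports the price as a string; convert and take the minimum.
prices <- vapply(history$SpotPriceHistory, function(x) as.numeric(x$SpotPrice), numeric(1))
min(prices)
```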
Thanks for the advice. Your assessment is helpful, and I had no idea about the promotional credit.
How does the …?
We don't know the workers' node or IP address because the scheduler assigns them to nodes. Instead, each worker connects back to the main process (which is listening on a predefined port). So each worker needs to know either the host+port of the main process or of the SSH tunnel.
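For context, this is roughly how the existing SSH connector is configured on the user's side, per the clustermq user guide (the host name below is a placeholder):

```r
# Local R session: submit via an SSH connection to a remote head node.
# Workers started on the remote side connect back through the reverse tunnel.
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host  = "user@remote.head.node", # placeholder user/host
  clustermq.ssh.log   = "~/cmq_ssh.log"          # optional log on the remote end
)

library(clustermq)
Q(function(x) x * 2, x = 1:3, n_jobs = 1)
```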
Awesome! Sounds like my whole quixotic pursuit of worker IPs is moot! This makes me more optimistic about the potential to work through the scheduling software of Batch itself. I bet … By the way, I asked about ZeroMQ compatibility with Batch and got a great answer here. I do not have the expertise to understand all the cloud-specific network barriers, but I take it as more evidence that a new Batch QSys class is possible.
I think we will need to handle AWS Batch differently from the other schedulers. Instead of traditional template files, Batch uses JSON strings called "job definitions". In the AWS CLI, users can pass job definitions when submitting jobs. But for our purposes, I think we should require the user to create a job definition in advance through the AWS web console or other means, then pass the job definition name when submitting the job. @mschubert, what do you think? If this sounds reasonable, what help would be most useful?
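For reference, the one-time job definition setup could also be done from R rather than the web console. A rough sketch with `paws` follows; the resource values are placeholders, and the image name is the one that appears later in this thread:

```r
# One-time setup: register a Batch job definition for an R-capable Docker image.
batch <- paws::batch()
batch$register_job_definition(
  jobDefinitionName = "job-definition",
  type = "container",
  containerProperties = list(
    image   = "wlandau/cmq-docker",  # image name taken from this thread
    vcpus   = 2,                     # placeholder resources
    memory  = 2048,
    command = list("Rscript", "-e", "print('placeholder command')")
  )
)
```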
Array jobs in AWS Batch are straightforward with `paws::batch()$submit_job()`:

```r
paws::batch()$submit_job(
  jobDefinition = "job-definition",
  jobName = "example-job-array",
  jobQueue = "job-queue",
  arrayProperties = list(size = 3)
)
```

Prework:
```console
$ aws batch describe-job-definitions
{
    "jobDefinitions": [
        {
            "jobDefinitionName": "job-definition",
            "jobDefinitionArn": "arn:aws:batch:us-east-1:912265024257:job-definition/job-definition:3",
            "revision": 3,
            "status": "ACTIVE",
            "type": "container",
            "parameters": {},
            "containerProperties": {
                "image": "wlandau/cmq-docker",
                "vcpus": 2,
                "memory": 2048,
                "command": [
                    "Rscript",
                    "-e",
                    "print(Sys.getenv('AWS_BATCH_JOB_ARRAY_INDEX'))"
                ],
                "volumes": [],
                "environment": [],
                "mountPoints": [],
                "ulimits": [],
                "resourceRequirements": [],
                "linuxParameters": {
                    "devices": []
                }
            }
        }
    ]
}
```

So I think that sketches out how to submit array jobs to AWS Batch from R. If that looks good, is there anything else I can do to help get an implementation going?
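For completeness, a short sketch of how a worker script inside the container could use the array index that Batch sets in each child job (the task list below is hypothetical):

```r
# AWS Batch sets AWS_BATCH_JOB_ARRAY_INDEX (0-based) in every child job of an
# array job; a worker can use it to pick its slice of the work.
index <- as.integer(Sys.getenv("AWS_BATCH_JOB_ARRAY_INDEX", unset = "0"))
tasks <- c("task-a", "task-b", "task-c")  # hypothetical per-job inputs
message("Running task ", index, ": ", tasks[index + 1])
```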
This looks great, thank you so much! 👍 I still see a few issues before this can be implemented:
Other details: we want to override the command in the job definition so the user does not have to write it manually. For example:

```r
paws::batch()$submit_job(
  jobDefinition = "job-definition",
  jobName = "job-array",
  jobQueue = "job-queue",
  arrayProperties = list(size = 2),
  containerOverrides = list(
    command = list(
      "R",
      "--no-save",
      "--no-restore",
      "-e",
      "print(Sys.getenv('CMQ_AUTH'))"
    ),
    environment = list(
      list(
        name = "CMQ_AUTH",
        value = "auth_value"
      )
    )
  )
)
```
Ah, I misread this. We're just talking about making the broadcast more efficient. How does this work in traditional HPC?
HPC is on a local network (usually >= 10 Gbps), so it doesn't matter there. End-user internet to AWS is much slower. SSH+HPC will cache data on the HPC head node.

Looks like we can use SSH for EC2: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html (not sure about Lambda/Batch). In this case, we could start the container and then use the SSH connector.

Edit: the implementation could be through bastion hosts, with SSH+AWS working akin to SSH+HPC. Not sure if/how much this makes it more expensive.
Sorry for the radio silence and all, I'm struggling with time. I think it was an awesome idea to have a call on this (proposed in some thread, I think).

Drive-by comment regarding "note that we don't need this for HPC because the submitting and worker nodes are on the same network": some HPC environments are pretty locked down, where compute nodes don't have internet access or SSH clients or daemons installed. That is useful to keep in mind when designing things, i.e. give people in such environments a fallback solution to work with.
Awesome! I will propose an hour on Google Meet. I attempted to capture some of our ideas at a high level in https://github.com/wlandau/r-cloud-ideas.
I agree in principle, but we never require internet access for workers or SSH access between HPC compute nodes. If, however, a user wants the convenience of submitting remotely via SSH, then of course this is a requirement. The comment above was about caching data on the remote end, which only makes sense if the local→remote connection is much slower than remote→workers.
This answer claims reverse tunneling should be possible if the compute environment is unmanaged. The poster recommends using the metadata endpoint to find out which EC2 instance a worker is running on. @davidkretch, is that the same as paws-r/paws#330 (comment)?
@wlandau The metadata endpoint would be much, much easier than the example code I created. The metadata endpoint is a local web API on the instance/container that you can query to get info about the machine you're running on. Paws doesn't have any publicly exposed functions for accessing it at the moment, but it's pretty easy (example here). Is it true that if the worker is reverse SSH tunneling to your R session, it would need to know your IP, and it's less critical that you know the worker nodes' IPs?
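For reference, a minimal sketch of querying the instance metadata endpoint from R on the worker itself (this assumes IMDSv1 is enabled; IMDSv2 would require fetching a session token first):

```r
# The metadata endpoint (169.254.169.254) is only reachable from inside the
# instance/container, which is why it cannot help the local session directly.
instance_id <- readLines("http://169.254.169.254/latest/meta-data/instance-id", warn = FALSE)
local_ipv4  <- readLines("http://169.254.169.254/latest/meta-data/local-ipv4", warn = FALSE)
```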
Not quite: the SSH connection is established from the local session to AWS (or remote HPC), which attaches a reverse tunnel to the same connection (so we need to be able to access the remote session). The result is that the remote session can access this tunnel to connect to the local process.
Ah, I understand. In that case (local to worker), the metadata endpoint would not help, because it is only accessible from the worker, but we would need to know that information in our local session.
To clarify further, the remote session thinks it's connected to a port on localhost; it just happens to be tunneled back to a port on your local computer via that SSH reverse tunnel. Here's an illustration of how parallelly SSHes into a remote machine with a reverse tunnel so that localhost:12345 on the remote machine will talk to localhost:12345 on your local computer:

```r
> cl <- parallelly::makeClusterPSOCK("remote.server.org", port = 12345L, revtunnel = TRUE, dryrun = TRUE)
----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.server.org':
'/usr/bin/ssh' -R 12345:localhost:12345 remote.server.org
and (ii) start worker #1 from there:
'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=12345 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE
Alternatively, start worker #1 from the local machine by combining both steps in a single call:
'/usr/bin/ssh' -R 12345:localhost:12345 remote.server.org "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=12345 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE"
```
Maybe I am getting too far ahead here, but would it make sense to implement all this ZeroMQ + R + cloud functionality in an entirely new package of its own? (Say, …)
Then again, there is a lot more to …
My current impression is:
Automatic spot pricing and Docker image support seem compelling (#208 (comment)). And if we pursue Lambda and Fargate in similar ways, we might see shorter worker startup times.
Another thought: persistent …
I am thinking GHA runners could just make it easier to get R code onto the cloud, provided the setup and teardown happen automatically.
It didn't look to me like GitHub Actions self-hosted runners do automatic setup/teardown. It looks like they use an agent that runs on your infrastructure, polls GitHub, and waits for work. If you already have compute infrastructure, I think you could also hypothetically have clustermq talk to it directly rather than through GHA. In terms of clustermq, which assumes an SSH connection, I think the only real options on AWS are:
All other options I can think of would prohibit SSH connections and would need an S3 or other intermediary for delivering results. The two open questions I have about Batch are:
I am currently slowly working on a non-clustermq approach using Lambda, but will look into the Batch approach next, hopefully later this month.
Thanks Will & David! I'm not sure if routing AWS access via GHA would simplify or complicate things. I'm definitely still interested in exploring an SSH-based approach (but also trying to wrap up a science project over here, so unfortunately not much spare time right now).
This issue was moved to a discussion. You can continue the conversation there.
Original issue: I propose AWS Batch as a new `clustermq` scheduler. Batch has become extremely popular, especially as traditional HPC is waning. I have a strong personal interest in making Batch integrate nicely with R (ref: ropensci/targets#152, ropensci/tarchetypes#8, https://wlandau.github.io/targets-manual/cloud.html, #102) and I am eager to help on the implementation side.

Batch is super easy to set up through the AWS web console, and I think it would fit nicely into `clustermq`'s existing interface: `options(clustermq.scheduler = "aws_batch")` and `options(clustermq.template = "batch.tmpl")`, where `batch.tmpl` contains an AWS API call with the compute environment, job queue, job definition, and key pair. I think we could use `curl` directly instead of the much larger and rapidly developing `paws` package. I think this direct approach could be far more seamless and parsimonious than the existing SSH connector with multiple hosts.