-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS Batch runner support runtime-configurable Docker images. #73
Conversation
0fdfe27
to
8d68233
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jstoja! Thanks so much for working on this issue and submitting this PR! I've identified some changes that I think need to be made, detailed individually below, but this is already great progress. 😁 Let me know if you have questions or alternate suggestions, etc. I'm happy to chat more, and I'm hoping I'll be able to be more timely with feedback going forward.
I'm excited to have this functionality. ✨ It'll be super helpful when testing out new images or using custom images for certain builds.
dest = "aws_batch_image", | ||
help = "Docker image to use for the AWS Job Definition", | ||
metavar = "<name>", | ||
default = docker.DEFAULT_IMAGE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to use the existing --image
argument here, so that an invocation of the default Docker runner like:
nextstrain build --image <name> .
can simply be switched to AWS Batch like so:
nextstrain build --image <name> --aws-batch .
This would mean promoting the --image
argument from runner.docker.register_arguments()
into runner.register_arguments()
so it can be shared by all runners that support it, similar to how --exec
works. I imagine that runner.native
would ignore --image
and warn if it was used, but a future runner.singularity
might be able to make use of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please let me know if the error handling in runner.native
is OK for you.
docker_image_tag = docker_image.split(':') | ||
name = re.sub('[^0-9a-zA-Z-]+', '-', definition_name) | ||
image = re.sub('[^0-9a-zA-Z-]+', '-', docker_image_tag[0]) | ||
tag = re.sub('[^0-9a-zA-Z-]+', '-', docker_image_tag[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker_image
isn't guaranteed to have a colon in it, and thus this could throw a ValueError.
More robust image name splitting logic is implemented in runner.docker.split_image_name()
and that would be good to re-use here. That function may want to be moved into nextstrain.cli.utils
so that this module doesn't have to import runner.docker
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's much better indeed!
batch.register_job_definition( | ||
jobDefinitionName = definition_name, | ||
type = "container", | ||
containerProperties = { | ||
"image": image, | ||
"command": [], | ||
}, | ||
retryStrategy = { | ||
"attempts": 1, | ||
}, | ||
timeout = { | ||
"attemptDurationSeconds": 14400, | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The job definition created here should be based on the existing definition
(which is specified by --aws-batch-job
, the environment, or config), with just containerProperties.image
modified.
The code can fetch the existing job description by calling batch.describe_job_definitions()
.
How does this code behave if a job definition with the name already exists? Does it throw an error? Does it silently do nothing? Does it create a duplicate job definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a TODO for this one and I'll come back later to it. I'll read the documentation thoroughly and do some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today the job definition nextstrain-job
exists. If tomorrow nextstrain-job2
is specified, should it raise an error, try to create one based on this snippet (taken from the documentation), or use the default job nextstrain-job
as base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming a base job definition (one that isn't parameterized with the image name) is present, otherwise I raise an Exception mentioning this.
) | ||
except Exception as error: | ||
warn(error) | ||
raise Exception("Creation of job definition (%s) failed" % definition_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should catch a more specific exception class, likely something from botocore.exceptions. I think there are examples of this in other places of the code.
Error reporting will also be nicer if this except
block raises a subclass of NextstrainCliError
from the nextstrain.cli.errors
module. The existing UserError
subclass probably isn't appropriate, so a new subclass like AWSError
might be warranted. It would also be ok to use NextstrainCliError
directly for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've caught the most common error as per documentation, if it's something else, it probably shouldn't be caught and go upper as something would be off.
I've also added the new exception, please let me know if you feel something else should be added here.
To support runtime-configurable Docker image in AWS Batch, the creation of a Batch Job also created a Job Definition with the configured image. Fixes nextstrain#57
* Change --aws-batch-image with --image by promoting the argument to all runners. * Improve docker image parsing by using split_image_name from cli.utils. * Added a new cli.errors.AWSError to represent botocore.exceptions that is now caught. * TODO: Handle job definition creation/duplication.
1a8d5a2
to
5e7ec94
Compare
Hello @tsibley |
If the base job definition doesn't exist, an Error is thrown to alert the user.
Job definitions are dynamically created as needed based on the base definition specified by --aws-batch-job. This is necessary to allow runtime-configurable images since, unlike CPUs, memory, env vars, etc, a submitted job may not directly override the container image set in the job definition. Resolves #57. Implementation originally started as PR #73. Co-authored-by: Julien BORDELLIER <git@julienbordellier.com>
@jstoja Thanks for reworking this! I dug into it today and first ran into a couple bugs that prevented it from working (for example, I ended up rewriting quite a bit and working out all the edge cases using this PR as a starting point. It's pushed up to the |
Description of proposed changes
To support runtime-configurable Docker image in AWS Batch,
the creation of a Batch Job also created a Job Definition with the
configured image.
Related issue(s)
Fixes #57
Testing
I have to test on my AWS account first, then when I'll confirm it's working, I'd be happy to have someone from the core team to test this in your setup to confirm its working as expected.
As a side note, maybe it'd be helpful to add
doctest
(https://docs.python.org/3/library/doctest.html), which helps testing small functions, but also documents in a very practical manner the inputs and outputs.Please let me know everything you think should be changed in this code.