Add a consistent way to specify container image #489

Open
oavdeev opened this issue Apr 28, 2021 · 36 comments

Comments

@oavdeev
Collaborator

oavdeev commented Apr 28, 2021

Problem

Several existing and upcoming Metaflow plugins run your code in containers. They all allow you to specify a custom container image, but each does it in a slightly different way.

  1. For AWS Batch, you can:

    • specify per-step image by using @batch(image='foo') step decorator
    • ...or use a global default set via METAFLOW_BATCH_CONTAINER_IMAGE
    • or if neither is set, let @batch use default public image (vanilla official python)
  2. For Argo you can

    • specify per-step image by using @argo(image='foo') step decorator
    • specify flow-default image by using @argo_base(image='foo') flow decorator
    • or if neither is set, argo plugin will use default public image (vanilla official python)
  3. For AWS Lambda, you can

    • specify per-step image by using @lambda(image='foo') step decorator
    • specify global default via METAFLOW_LAMBDA_IMAGE_URI

    Note that AWS Lambda is a bit special as it doesn't allow using public images. Also, the image has to be baked in a special way to include Lambda Runtime (it cannot be installed on the fly).

  4. For @kubernetes, you can

    • specify per-step image by using @kubernetes(image='foo') step decorator
    • specify global default via METAFLOW_KUBERNETES_IMAGE_URI

Proposal v2

Unify all this by adding a new decorator, @image(name="foo"), and use it across all plugins that use containers. The new decorator can be used both at the step level and flow level.

In addition to that, add a global setting called METAFLOW_DEFAULT_IMAGE to point to the global default image.

Batch and other plugins would still support specifying the image directly for brevity (that is, @batch(image='foo')).

Open question: what should @image do in local execution mode?

  • Option A: Metaflow runs your step in docker.

  • Option B: Metaflow throws an error

  • Option C: Metaflow ignores it

Probably not going to implement (A) right away, so either (C) or (B)

Open question: should @awslambda use the same decorator?

Given that Lambda container images are special, and you cannot simply repurpose an AWS Batch or kubernetes image for AWS Lambda, would it be confusing to use the same decorator to specify the image?

Decision: use @image for Lambda too

Discussion

How much of a problem is this?

Not a huge deal per se — I'd assume that the users aren't routinely switching back and forth between e.g. AWS Batch and k8s, so this inconsistency isn't really in your face as a user. However as Metaflow adds more plugins, this becomes more acute. You can see people using both Lambda and Batch, or both @kubernetes and @argo.

Would @image ever support any other params besides name= ?

Probably not now. Hypothetically, you can imagine Metaflow doing other things like building container images for the user; in that case you could have something like @image(dockerfile=...) too. But there may be a better UX for that.

You could also imagine allowing users to override the container entrypoint, for plugins that support that.

@savingoyal
Collaborator

Good point! Do you think @docker would be a better terminology? Or maybe @image(name='foo')?
Regarding executing locally, one idea that we have been toying with is introducing --environment=docker that will run every step in a separate docker container.

@oavdeev
Collaborator Author

oavdeev commented Apr 28, 2021

Yep, I was thinking @docker at first, but then remembered that it would be technically inaccurate. In the list above only AWS Batch uses Docker per se, others use containerd or Firecracker. Technically, it is just an OCI container image. So I wanted to be consistent with the terminology used in k8s and Lambda.

@image sounds good too. I'm only unsure if it will be super obvious that it refers to container/docker if you see this without context in code (especially since you may not always see a @batch or @kubernetes decorator next to it).

@oavdeev
Collaborator Author

oavdeev commented Apr 28, 2021

As I think more about it, maybe it is a good idea to borrow the terminology from k8s container spec? In theory, we may decide to support overriding working_dir too, and maybe something similar to image_pull_policy. In fact, #463 already has something similar for AWS Lambda.

@sappier
Contributor

sappier commented Apr 28, 2021

Having a global setting METAFLOW_DEFAULT_CONTAINER_IMAGE is a good idea. It would reduce the number of config values users need to deal with, compared to remembering which one to pick among METAFLOW_BATCH_CONTAINER_IMAGE, METAFLOW_LAMBDA_IMAGE_URI or METAFLOW_ARGO_CONTAINER_IMAGE. It is also easier to stick to when developing a new plugin.

Would @container ever support any other params besides image= ?

Seems the image is the only common denominator. Metaflow already has dedicated @environment and @resources. Other parameters like name or commands are generated automatically by Metaflow/plugin. Extra attributes like image_pull_policy or working_dir could be added directly to specific step decorators @kubernetes and @argo when needed. @argo could re-use specifications from @kubernetes to avoid duplication.

Open question: what should @container do in local execution mode?

Ignoring it seems a pretty good default until the --environment=docker is introduced.

Open question: should @awslambda use the same decorator?

I'm also leaning towards using the same decorator. In the end it just specifies the name of the image, regardless of how the image is built.

How much of a problem is this?

IMO, it's not a big issue to have image= inside a particular plugin decorator.
The bigger challenge for us was to work out exactly which image should be used in a step, since an image could be specified through an envvar, the Metaflow config file, a command-line parameter, flow and step decorators, or the default python image. Duplicating such logic in different plugins is tedious and could lead to inconsistent behavior.
It would be great to have such functionality available along with the common METAFLOW_DEFAULT_CONTAINER_IMAGE configuration, either within the @container decorator or as a separate util function.

@sappier
Contributor

sappier commented Apr 28, 2021

Just a second thought. Having a dedicated @container with a single image attribute seems a little overkill, at least until --environment=docker or --with docker is introduced: that would need its own @docker or similar, and then it would make sense to re-use it the way @argo re-uses @resources (@argo doesn't have its own cpu/gpu/mem but reads them from @resources).

There is another observation though: Metaflow already has a separate @resources even though it works with @batch only and is ignored in local runs. From that perspective it's pretty natural to have @container in addition.

As a third thought :) it could even make sense to put resources specification inside such @container decorator and re-use it in all containerized plugins/runtimes.

@savingoyal
Collaborator

We should move forward with @image. The k8s container spec assembles the entire container which is what Metaflow does on behalf of the user, so @image feels more natural to me. We can also introduce --environment=docker for executing the flow locally.

A good question is whether we should allow @conda on the same step that is annotated with @image - we can disallow it for the time being since it doesn't make much sense without liberal use of the escape hatch - #487.

@savingoyal
Collaborator

savingoyal commented Apr 30, 2021

For backward compatibility, we would have to ensure the correctness of @batch(image='blah')

@savingoyal
Collaborator

Also, regarding @argo - hopefully after PR #488 is merged in, we wouldn't need that decorator anymore.

@russellbrooks
Contributor

russellbrooks commented May 3, 2021

I'm a fan of the image consistency across execution environments, but I'd vote to keep the local execution behavior decoupled.

As an example where this could fall off the rails – if using a docker container as a dev environment already, it could become quite messy trying to control other local docker containers within another one.

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

@russellbrooks good point on dev env in docker! Any thoughts on what the behavior should be in the local mode in the meanwhile? Ignore vs error out?

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

Note: updated the proposal to use @image instead of @container and METAFLOW_DEFAULT_IMAGE instead of METAFLOW_DEFAULT_CONTAINER_IMAGE. Also added a line that compute environment decorators would still support image parameter as a shortcut (i.e. @awslambda(image=...) would also work).

@savingoyal
Collaborator

The proposal looks good to me!

@romain-intel
Contributor

I like this proposal too but have a few questions:

  • what is the priority order? I would expect the following, but I'm not sure if you are thinking the same: global default image -> flow specification -> step specification -> image inside @batch (or whatever other decorator)
  • we should probably warn/error on something like:
  @image(name='foo')
  @batch(image='bar')
  @step
  def ...

@savingoyal
Collaborator

@romain-intel Not sure if I understood your comment properly, but my expectation is that the priority order will be the reverse of your comment -

  1. image attr inside @batch
  2. name attr inside @image
  3. env var - METAFLOW_BATCH_CONTAINER_IMAGE
  4. env var - METAFLOW_CONTAINER_IMAGE

RE: warn/error - today @batch overrides the resources specified with @resources and we can maintain the same behavior.

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

I was thinking step decorator (overrides) flow decorator (overrides) global default. If you use both @image and @batch(image=) on a step, that's an error. The reason is: it's clearly a mistake anyway, and it is not intuitively obvious to me as a user which one should "win".

@savingoyal
Collaborator

@oavdeev But then would you expect folks to not be able to execute python flow.py run --with batch:image=foo where flow.py already has an @image decorator as well? Today, you can execute python flow.py run --with batch:cpu=4 where flow.py already has a @resources decorator specified.

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

Nice I didn't even realize you can do that! But should i be able to do --with image:name=foo and expect it to override @batch(image="bar") then?

Actually, even in your example, today.. if i have @batch(image="foo") then run it with --with batch:image=bar, does that override? I vaguely remember that decorators aren't really ordered

@savingoyal
Collaborator

Currently, we pick the larger of resources specified via @batch or @resources. It's a good point with --with image:name=foo - I think throwing an error when there are conflicts does make sense.

In terms of the actual implementation of @image - does it need to exist before we have --environment=docker? We can still introduce the notion of METAFLOW_DEFAULT_IMAGE and refactor the code so that image processing logic isn't duplicated across @batch, @kubernetes, and @awslambda.

Re: CLI - doesn't override what's specified in the code.
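
The "pick the larger of the resources" rule mentioned above can be sketched as follows. This is a minimal illustration only, not Metaflow's actual code: `merge_resources` is a hypothetical helper, and the attribute names (`cpu`, `memory`, `gpu`) are taken from examples elsewhere in this thread.

```python
def merge_resources(batch_attrs: dict, resources_attrs: dict) -> dict:
    """Combine two resource specs by keeping the larger value per attribute.

    Hypothetical helper: both arguments map attribute name -> numeric value,
    with None (or a missing key) meaning "not specified".
    """
    merged = {}
    for key in set(batch_attrs) | set(resources_attrs):
        # Collect only the values that were actually specified.
        values = [
            spec[key]
            for spec in (batch_attrs, resources_attrs)
            if spec.get(key) is not None
        ]
        merged[key] = max(values) if values else None
    return merged


merged = merge_resources(
    {"cpu": 4, "memory": 4000},
    {"cpu": 2, "memory": 16000, "gpu": 1},
)
```

Taking a maximum works for numeric resources but has no obvious counterpart for an image name, which is why conflicting images need a separate rule.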

@romain-intel
Contributor

@savingoyal , yes, sorry, I listed the list in the reverse order (ie: global image default overridden by flow specification overridden by etc) so I think we are saying the same thing. I think my point though is that we definitely need to think about these combinations a bit and be very clear on what is happening. Definitely agree with @oavdeev that if the code can't resolve things clearly, it's better to error out than to just "pick one". So, we should have a clear priority order and clear rules on what constitutes a conflict.

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

Cool so to summarize: seems like the consensus is that it is not an error to have both @batch(image=) and @image on the same step, but error out if they disagree. If that sounds good i'll update the proposal.
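
A minimal sketch of the rule summarized above (both sources allowed on one step, error only if they disagree). The function and exception names are hypothetical and not part of Metaflow.

```python
from typing import Optional


class ImageConflictError(ValueError):
    """Raised when two decorators on one step name different images."""


def reconcile_step_image(
    compute_image: Optional[str], image_decorator: Optional[str]
) -> Optional[str]:
    """Return the single image for a step, or None if neither source sets one.

    compute_image:    value of image= on e.g. @batch (None if absent)
    image_decorator:  value of name= on @image (None if absent)
    """
    if compute_image is None:
        return image_decorator
    if image_decorator is None:
        return compute_image
    # Both are set: this is fine only when they name the same image.
    if compute_image != image_decorator:
        raise ImageConflictError(
            f"conflicting images on one step: "
            f"{compute_image!r} vs {image_decorator!r}"
        )
    return compute_image
```

Returning None when nothing is set leaves the fallback to flow-level or global defaults to the caller.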

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

Re: --environment=docker, yes, I think we can have @image before we have that. The motivation for this was mostly to keep API consistent between the PRs that are in flight (@kubernetes, @awslambda and Argo).

That brings me to the last open q: should @image error out or be ignored in the local mode?

If we had --environment=docker I'd argue it should error out if you don't use this flag but have @image somewhere. That would be consistent with @conda behavior that errors out if you don't specify --environment=conda today.

That makes me think that even before we have --environment=docker, the best thing is also to error out, so that behavior doesn't change once --environment=docker is released. Thoughts?

@romain-intel
Contributor

I would agree with that. i would also list the various priority and ordering (even if that is not 100% due to this proposal) so that it is clear in the future. I will also have to try what happens if you do --batch:image=foo and you already have stuff in @batch() (but not image).

Re the @image in local mode, I think it should be ignored. That is the current behavior for @resources at any rate (ignored if resources can't be controlled). Your point about --environment=docker is a good one though. We do seem to have two different behaviors already (as you point out -- @conda errors out if not in the proper environment but @resources does not).

@oavdeev
Collaborator Author

oavdeev commented May 5, 2021

Interesting. Intuitively @resources ignore behavior makes sense to me since it doesn't really change what code gets executed, unlike @conda or @image. That is, I wouldn't expect my tiny laptop to grow more RAM if i do @resources(memory=1e10) so it is not counter intuitive. But it is not unreasonable to expect that dependency specification is respected no matter what.

@romain-intel
Contributor

I see your point @oavdeev, and I do see that @image specifies dependencies while @resources is more of a performance thing. But one could also argue that if you specify @resources(gpu=...), it may be an error if you can't provide a GPU (whether the GPU is a perf thing or a functional thing is up for debate). And then what about @resources and @image together in a forthcoming --environment=docker, if the image will only run on certain hardware? One compromise is that we can warn and, if the step fails, point the user to this possibility. It's entirely possible that everything works fine. Another argument against failing on @image is that, for the time being, it makes it harder to "develop locally" and "deploy remotely": you could set up a virtualenv locally that works and want to test a step, and you would have to modify your code to remove @image or something. Sure, not a huge deal, but some friction. I do agree that we should, however, favor the long-term behavior.
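
The three candidate behaviors for local mode discussed so far (ignore, warn, error) can be compared side by side in a small sketch. Everything here is hypothetical: Metaflow has no `LocalImagePolicy`, and the message text is made up for illustration.

```python
import warnings
from enum import Enum
from typing import Optional


class LocalImagePolicy(Enum):
    IGNORE = "ignore"  # behave like @resources does today
    WARN = "warn"      # the compromise suggested above
    ERROR = "error"    # behave like @conda without --environment=conda


def check_image_in_local_mode(
    step_name: str, image: Optional[str], policy: LocalImagePolicy
) -> None:
    """Apply the chosen policy to a step that requests an image locally."""
    if image is None or policy is LocalImagePolicy.IGNORE:
        return
    message = (
        f"step {step_name!r} requests image {image!r}, "
        "but local runs do not execute in a container"
    )
    if policy is LocalImagePolicy.ERROR:
        raise RuntimeError(message)
    warnings.warn(message)
```

With WARN, a local run that happens to work in a virtualenv proceeds, and the warning gives the user a pointer if the step later fails.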

@oavdeev
Collaborator Author

oavdeev commented May 6, 2021

Yeah I can see that too. Another thing that @savingoyal suggested to me is to tweak the original proposal like this (let's call it proposal v3):

  1. add a global setting called METAFLOW_DEFAULT_IMAGE to point to the global default image.
  2. require (by convention) that all compute decorators that use docker (@batch, @awslambda, @kubernetes..) accept image= parameter and respect METAFLOW_DEFAULT_IMAGE
  3. require (by convention) that all compute decorators that use docker can also be used on the Flow level. In that case the behavior is the same as adding the decorator to every step.

...and just not add @image yet, so we defer the decision on this until there is a way in Metaflow to run flows locally in docker via --environment=docker or something.

@oavdeev
Collaborator Author

oavdeev commented May 18, 2021

After some offline discussions, I've compiled a few usage scenarios/stories below. Anything else I'm missing?

Container image management scenarios

Scenario 1. Uber-image

There is one blessed image, maintained by your friendly ML platform team, blessed by security people. Most or all data science code uses the uber-image, so it is set via Metaflow configuration when the Flow is scheduled, so data scientists never have to think about what image to use.

Scenario 2. Environment-specific uber-image

This is similar to the Uber-image scenario, but complicated by the fact that Lambda requires a specially baked image that includes Lambda runtime. Also, maybe some tasks require a GPU and a special image as well. So you'd have multiple uber images, but the choice of the image is mostly based on compute env you're using, not the task itself.

Scenario 3. An image per flow

ML engineers or Data Scientists build their own images for each flow, e.g. based on requirements.txt for that specific flow. To keep track of image versions, they specify the image in their Metaflow python code in the decorator -- that way the version of the image is captured in their git history. Using Metaflow configuration to set the image is less desirable here.

Scenario 4. An image per step

You can imagine using Metaflow to compare multiple versions of the same ML library. Or maybe you borrowed one step from someone else's Flow, and they're using their own special container image, so you have to use it as well. In this case, users specify images for certain steps, but fall back to use either uber-image(1) or flow-image(3) otherwise.

@oavdeev
Collaborator Author

oavdeev commented May 18, 2021

Also an updated proposal, v3.1:

  • add a global setting called METAFLOW_DEFAULT_IMAGE to point to the global default image.
  • require (by convention) that all compute decorators that use containers (@batch, @awslambda, @kubernetes..) accept image= parameter and respect METAFLOW_DEFAULT_IMAGE
  • require (by convention) that all compute decorators also provide a configuration variable METAFLOW_XXX_DEFAULT_IMAGE (e.g. METAFLOW_AWSLAMBDA_DEFAULT_IMAGE) that would set a compute plugin-specific default
  • require (by convention) that all compute decorators that use containers can also be used on the Flow level. In that case the behavior is the same as adding the decorator to every step.
  • all compute decorators should support METAFLOW_CONTAINER_REGISTRY which, if set, is used as a common prefix to image names specified in any of the above ways (purely a quality-of-life thing, so that people don't have to specify the full image URI every time).

In the end, the precedence is (using lambda as an example here, ⊱ means "overrides"):
step-level @awslambda ⊱ flow-level @awslambda ⊱ METAFLOW_AWSLAMBDA_DEFAULT_IMAGE ⊱ METAFLOW_DEFAULT_IMAGE

@savingoyal
Collaborator

@oavdeev The proposal LGTM!

@sappier
Contributor

sappier commented May 20, 2021

Have a question regarding the image on the cmdline like --batch:image=foo. How does it fit into the precedence order?
It can naturally "override" envvars and the flow level. Should a step-level image override it, or is that considered a conflict?

@oavdeev
Collaborator Author

oavdeev commented May 20, 2021

Good question, I believe right now --with attaches the decorator to all steps unless they already have it. So the step-level @batch(image=bar) will still override what you set via --with batch:image=foo.

For the purposes of this thread, I'd leave the behavior as it is today.

But it is certainly worth discussing whether that's optimal. There could even be use cases for other decorator attributes (not image) where overriding would be useful, e.g. I can see someone wanting to run the whole thing with more memory via --with batch:memory=100000.
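
The --with behavior described above (attach the decorator to every step that does not already have it) can be modeled with plain dictionaries. This is a sketch of the described semantics only, based on the "I believe" statement in this comment; `apply_with` and the data layout are hypothetical and do not reflect Metaflow internals.

```python
def apply_with(step_decorators: dict, name: str, attrs: dict) -> dict:
    """Model `--with name:attrs`.

    step_decorators maps step name -> {decorator name -> attribute dict}.
    The decorator is added only to steps that do not already carry one
    with the same name, so step-level settings in the code are kept.
    """
    result = {}
    for step, decorators in step_decorators.items():
        decorators = dict(decorators)  # copy so the input is not mutated
        decorators.setdefault(name, dict(attrs))
        result[step] = decorators
    return result


steps = {
    "start": {},
    "train": {"batch": {"image": "bar"}},  # @batch(image="bar") in the code
}
effective = apply_with(steps, "batch", {"image": "foo"})
```

Under this model the `train` step keeps `bar`, and a step-wide override such as more memory for every step would need different semantics.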

@oavdeev
Collaborator Author

oavdeev commented Sep 29, 2021

Actually for naming consistency, I'd call it METAFLOW_*CONTAINER*_IMAGE not METAFLOW_*DEFAULT*_IMAGE because there's already METAFLOW_BATCH_CONTAINER_IMAGE.

@oavdeev
Collaborator Author

oavdeev commented Sep 29, 2021

For posterity, updated naming, proposal v3.2:

  • add a global setting called METAFLOW_CONTAINER_IMAGE to point to the global default image.
  • require (by convention) that all compute decorators that use containers (@batch, @awslambda, @kubernetes..) accept image= parameter and respect METAFLOW_CONTAINER_IMAGE
  • require (by convention) that all compute decorators also provide a configuration variable METAFLOW_XXX_CONTAINER_IMAGE (e.g. METAFLOW_AWSLAMBDA_CONTAINER_IMAGE) that would set a compute plugin-specific default
  • require (by convention) that all compute decorators that use containers can also be used on the Flow level. In that case the behavior is the same as adding the decorator to every step.
  • all compute decorators should support METAFLOW_CONTAINER_REGISTRY which, if set, is used as a common prefix to image names specified in any of the above ways (purely a quality-of-life thing, so that people don't have to specify the full image URI every time).

In the end, the precedence is (using lambda as an example here, ⊱ means "overrides"):
step-level @awslambda ⊱ flow-level @awslambda ⊱ METAFLOW_AWSLAMBDA_CONTAINER_IMAGE ⊱ METAFLOW_CONTAINER_IMAGE
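
The precedence chain above amounts to "first value that is set wins". A minimal sketch follows, assuming the v3.2 variable names from this proposal; `resolve_image` is a hypothetical helper, and the settings are read only from environment variables here, whereas the real config could also come from a Metaflow config file.

```python
import os
from typing import Mapping, Optional


def resolve_image(
    plugin: str,
    step_image: Optional[str] = None,
    flow_image: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the image to use for a step, or None if nothing is configured.

    Order: step-level decorator, flow-level decorator,
    METAFLOW_<PLUGIN>_CONTAINER_IMAGE, METAFLOW_CONTAINER_IMAGE.
    """
    env = os.environ if env is None else env
    candidates = [
        step_image,
        flow_image,
        env.get(f"METAFLOW_{plugin.upper()}_CONTAINER_IMAGE"),
        env.get("METAFLOW_CONTAINER_IMAGE"),
    ]
    for candidate in candidates:
        if candidate:  # skip None and empty strings
            return candidate
    return None
```

Keeping this in one shared function is one way to avoid duplicating the lookup logic across @batch, @kubernetes and @awslambda.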

@sappier
Contributor

sappier commented Sep 30, 2021

step-level @awslambda ⊱ flow-level @awslambda ⊱ METAFLOW_AWSLAMBDA_CONTAINER_IMAGE ⊱ METAFLOW_CONTAINER_IMAGE

Shouldn't the config value METAFLOW_CONTAINER_IMAGE override the flow-level spec? The config could also be set as an envvar, and the purpose of an envvar is to change the default behaviour specified in the source.
As a use case, a flow developed and executed in the dev/test environments is passed to prod, where it could use a different registry. Using dedicated production settings/envvars is then a way to run the flow. Otherwise, it's impossible without changing the original flow.

What do you think about this order?

  • --with=awslambda:image ⊱ METAFLOW_AWSLAMBDA_CONTAINER_IMAGE ⊱ METAFLOW_CONTAINER_IMAGE ⊱ flow-level @awslambda
  • step-level @awslambda ⊱ flow-level @awslambda

Same rules are applied to other parameters, e.g. mem, cpu, iam_role, etc.

@oavdeev
Collaborator Author

oavdeev commented Oct 1, 2021

Right, but the idea is that the envvar sets the default, while whatever you specify in the .py file takes precedence over that. Note that this is also how this works today for AWS Batch: image= in the decorator always overrides the METAFLOW_BATCH_CONTAINER_IMAGE setting.

In the use case you are describing, the recommendation would be to not specify the registry in the decorator in either dev or prod, and rely on config variables for that. In fact I think this use case is exactly why there is a separate setting for the registry in Metaflow in the first place: Netflix folks wanted users to be able to specify the image (e.g. version of python) but leave it to the devops/platform team to set the registry to dev/prod.

I agree there is indeed sometimes a legitimate need to override whatever is set in Python code, but changing the precedence here would require changing existing config behavior for Batch decorator, so I wouldn't go down this path.
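
The dev/prod split described above (image name in code, registry in config) can be sketched like this. `qualify_image` is a hypothetical helper, the registry host names are made up, and joining with "/" is an assumption about how the "common prefix" would be applied.

```python
from typing import Optional


def qualify_image(image: str, registry: Optional[str]) -> str:
    """Prefix an image name with the configured registry, if any."""
    if not registry:
        return image
    registry = registry.rstrip("/")
    # Assumption: an image that already starts with the registry is left alone.
    if image.startswith(registry + "/"):
        return image
    return f"{registry}/{image}"


# The flow code says image="python:3.8" in both environments; only the
# value of METAFLOW_CONTAINER_REGISTRY differs between dev and prod.
dev_image = qualify_image("python:3.8", "dev-registry.example.com")
prod_image = qualify_image("python:3.8", "prod-registry.example.com/")
```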

@peternagy96

Hey guys, is this already in the public release? If so how could I use the decorator?

@marcellovictorino

Super interested in the described use case 3, where I can just use Poetry for package management and then point to a local dockerfile. Developing locally with the --environment=docker would be a beauty!
Any updates on this?
