Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarified Runtime and Added Hints #315

Closed
wants to merge 8 commits into from

Conversation

patmagee
Copy link
Member

The Problem

One of the key motivators for developing WDL has always been the dream of interopable workflows. That is Workflows that can run in any execution engine, on any platform, in the exact same way and resaonably expect the same outputs.

Recently several community members have tested this theory to see just how far we are from that target. Unfortunately, WDL did not get a passing grade as it currently is. One of the primary reasons for this is the runtime section within a task. We have always had the desire to rework this section, moving it away from something that is a grab bag of parameters and more into minimum execution requirements. However, this was never done, and because of that, a lot of properties have crept in there which are not part of the WDL specification but have largely become used within the community to execute a workflow on a specific platform or infrastructure. This does not help interopability or sharing.

My Solution

To me the best solution moving forward is to start to be a bit opinionated about what goes into the runtime section and ensure that there will ALWAYS be enough information present there to run a task. To this end, I have codified the keys supported within the runtime section, and have made certain properties required for every task (even if the engine supports runtime defaults). While some people may groan at the idea of this, since it means more lines to write, I think its important to remember what a task is meant to do. It is essentially a definition of a wrapped command and the execution environment that command is supposed to run in. Previously, we only required the command part, and allowed the engine to specify defaults if it sees fit for the environment. This unfortunately meant that for some tasks, there was no concrete definition of what was required to run it, immediately making the task bound to the engine and infrastructure the author intended it for. Requiring a set of minimal attributes in the runtime gets around this, without being TOO obtrusive IMO.

I have additionally removed the ability to specify arbitrary KV pairs within the runtime block, and have introduced a new section called hints. There is a very important distinction between the two sections: The runtime defines what is REQUIRED to run the task (and engine that can provide those resources should be able to run the task), and hints are optimizations that an engine can CHOOSE to apply if they understand the meaning of the hint.

I have opted for making hints a semi-structured section similar to how the runtime used to operate in V1.0 and earlier. There are a set of reserved keys which should carry similar meaning between engines (whether the engine chooses to adopt them is completely optional), and then additionally I have added a strong emphasis on convention and reuse of arbitrary properties.


#### Reserved Keys

Although `hints` are arbitrary, there are a number of keys which are reserved by the language in or oder to try and encourage interoperability of tasks and workflows between different execution engines. The list of Reserved keys is likely to grow, and in order to prevent name space collisions, engines should follow the [best practices](#conventions-and-best-practices) for specifying hints.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo or oder

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a few other typos.. You are eagle eyed

@cjllanwarne
Copy link
Contributor

A few other attributes exist which change behavior (and though "runtime" is a weird description for them, they're certainly not optional):

  • continueOnReturnCode
    • If a boolean, whether to allow the task to "succeed" even if the return code is non-zero
    • If an integer, a single return code value which represents success
    • If an integer array, a list of return code values which represent success
  • failOnStderr
    • A boolean used to indicate that the task has failed if any content exists in the stderr output

* **maxCpu**: Specify the maximum CPU to be provisioned for a task. It is up to the engine, whether or not it enforces the `maxCpu` hint. The type of this hint should be the same as `runtime.cpu`
* **maxMemory**: Specify the maximum memory provisioned for a task. It is up to the engine, whether or not it enforces the `maxMemory` hint. the type of this hint should be the same as `runtime.memory`
* **shortTask**: Tell the execution engine that this task is not expected to take long to execute, and therefore the engine can attempt to optimize the execution in whatever way it defines. This flag can be used to tell an engine to use the cost optimized instance types that many clouds provide (but are available only for a limited time). An example of this would be `preemptible` instances on `gcp` and `spot` instances on `aws`.
* **localizationOptional**: Tell the execution engine that whenever possible it does not need to localize the defined `File` type inputs for this task. Important to note, is this directive should not have any impact on the success or failure of a task (ie it should run with or without localization). The type of this hint is a boolean value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to anyone else who almost makes the same mistake as me:

Athough this global localizationOptional is like a less useful version of the Cromwell parameter_meta hint, keep reading because there's a separate inputs hints section which covers the specific-per-input version of this hint.

@patmagee
Copy link
Member Author

@cjllanwarne to the runtime section I added a returnCodes entry, which is essentially the same as the continueOnReturnCodes, however makes a few tweeks from what cromwell does.

Additionally, I personally find the "failOnStdErr" a bit of an odd feature even in cromwell, which is why I did not include it in the current runtime section. I can add it here of course, however I just wanted to make the point that if a task ALWAYS returns with a 200 status, there is no reason why the user could not capture the stdout and exit if the file is nonempty.


The container key must accept one ore more locations which inform the engine where to retrieve a container image to execute the task. It is expected (but not enforced) that each container image provided are identical, and will provide the same final results when the task is run. It is the responsibility of the individual execution engine to define the specific image sources which it supports, and to determine which image is the "best" one to use at runtime. Defining multiple images enables greater portability across a broad range of execution environments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should 100% be considered invalid if different images lead to different results

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the only way to guarantee that though is 1, execute the WF under different images, or 2, assert checksums of the different images (in which case each engine might not know the actual access method). I will add stronger language here though

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn’t it be sufficient to just compare image SHA in the engine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dinvlad the image SHA is not required to be passed in by the user, and I think would actually be to cumbersome a restriction. The best approach IMO is just making it a non enforced requirment

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patmagee sorry, I guess what I meant is more on the implementation side (so perhaps it doesn't belong to this discussion). IIRC, engine implementations like Cromwell use image SHA in cache comparisons anyway. So this would be a low-hanging fruit to just compare SHAs that are already collected for these images anyway (I'm not talking about having to specify SHA inside the image URL, but rather the native caching mechanism already retrieves it from the Docker repo, if I understand that correctly).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dinvlad Keep in mind that the period for comments. At this point the PR is set in stone and we're voting on it


#### docker
#### cpu (**Required**)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about increasing the amount of boilerplate this means adding into every. single. WDL. task. definition.

Can we instead say "if a cpu value is not specified then the task must work with exactly '1' CPU"? And something similar for memory?

Copy link
Member Author

@patmagee patmagee May 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is a good suggestion, hardcoding the default in the spec. my worry though, is that people will just ignore that entirely and continue on using things like the default-runtime attributes, and relying on things that are very cromwell or other engine specific. This is what lead to alot of the motivation for redefining the runtime block. I suppose we could state:

If a cpu key is not defined, then then it must be interpreted that the task will run with >exactly 1 CPU. engines should not provide a mechanism to set their own default values >which circumvents this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjllanwarne the other idea I had, was to allow for a runtime section in the 'workflow' definition. This would allow for the definition of runtime defaults which are explicitly stated and it Also provides a mechanism for documenting them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjllanwarne on second thought, I Think I prefer enforcing an implicit definition of a default within the WDL as opposed to allowing defaults at the workflow level. Not every task will have a workflow, therefore people will likely abuse this approach for tasks and just assunme a user will define the default value

python process.py ${arg}
#### gpu (**Optional**)

The `gpu` key provides a way to accommodate modern workflows which are increasingly becoming reliant on GPU computations. This key is a `true` or `false` value and simply indicates to the engine that a task requires a GPU to run to completion. A task with this flag set to `true` is guaranteed to only run if a GPU is a available within the runtime environment. It is the responsibility of the engine to check prior to execution whether a GPU is provisionable, and if not, preemptively fail the task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe someone with more experience using GPUs can chime in:

  • Is "I just need a GPU, I don't care what type" enough information? (ie is it possible to be agnostic about the number, type and driver version required?)
  • Is having a GPU a genuine requirement for running such a task
    • ... or is it more a case of "if there isn't a GPU the task will still succeed but will be inefficient?"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I am a little naive about, so I will rely on someone else to have more information. I was trying to capture the requirement for a GPU without throwing in all of the provider specific terminology in this part of the block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to provide some color here.

The GPU instance types DNAnexus supports today, on AWS, are the g2 and p3 families. The p3 is the current generation. To allow users to actually make use of these, we install an NVIDIA driver. An alternative is nvidia-docker.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orodeh does the user need to know specifics about this? or is it enough for them to know that a GPU was added, and then allow additional config in the hints section?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my motivatin here was to abstract as much of the specific vendor terminology into the hints section, then have an engine guarantee a gpu was added. the flavor and size of the gpu would need to be either configured in the hints or the engine would need to make a resonable default value here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good start to say have a GPU flag in the runtime section, and leave the rest for hints.


The following is a basic set of guidelines that engines and authors should follow when writing hints:

1. A hint should never be required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should => must?

@illusional
Copy link
Contributor

Really excited that docker has been renamed to container (while still defaulting to Dockerhub), I think this is a super valuable step towards agnosticism of containers.

@@ -925,6 +968,7 @@ task test {
command <<<
ps ~{flags}
>>>
.....
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of consistency:

Suggested change
.....
....


The container key must accept one ore more locations which inform the engine where to retrieve a container image to execute the task. It is expected that each container image provided are identical, and will MUST the same final results when the task is run. It is the responsibility of the individual execution engine to define the specific image sources which it supports, and to determine which image is the "best" one to use at runtime. Defining multiple images enables greater portability across a broad range of execution environments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one sentence here which reads a bit strangely to me:

It is expected that each container image provided are identical, and will MUST the same final results when the task is run.

Perhaps it should be:

It is expected that all container images provided are identical. They MUST produce the same final results when the task is run.

There are several ways to specify the `disks` attribute:

* `Int` - GB of disk space to request, eg: `100`, `200`.
* `String` (`"<size>" || "<mount-point> <size>"`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should some system similar to the one for memory be allowed here? ie. specifying whether the size is in KBs or MBs or TBs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantically that makes sense. A disk is a a size, and sizes can be specified in blocks. In practice however I feel like we will only ever see GB or TB definitions (since specifying a disk in MB seems archaic). And from that we will NORMALLY only see GB. So I figured the simplification makes sense. Wheras with memory, people can and regularily do specify both GB and MB for the size

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, are disk specifications like "100GB, or "3TB" valid?

In my personal opinion, they should be legal.


Memory requirements for this task. Two kinds of values are supported for this attributes:
The `memory` key defines the _minimum_ memory required for this task which must be available prior to the engine starting execution. The engine does not need to provide the exact amount of memory requested, however it may ONLY provision more memory then requested and not less. For example, if the wdl requested `1 GB` but only blocks of `4 GB` were available, the engine might choose to provision `4.0 GB` instead. Two kinds of values are supported for this attribute:

* `Int` - Interpreted as bytes
* `String` - This should be a decimal value with suffixes like `B`, `KB`, `MB` or binary suffixes `KiB`, `MiB`. For example: `6.2 GB`, `5MB`, `2GiB`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there or should there not be a space between the number and the suffix? Or does it not matter?

@rhpvorderman
Copy link
Contributor

Wow! Great PR!
We have exactly this problem. We are running on a SGE cluster where java jobs require insane memory allocations not to get killed. So the biowdl/tasks sharing is hindered by that. Having a way to specify this in a more general way is a blessing. Thanks a lot @patmagee !

}
```


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest having a default flag of false for this field. Otherwise, we would be a making tasks more verbose than necessary.

@orodeh
Copy link
Contributor

orodeh commented Jun 4, 2019

I think this is good. My only comment is that I like Cromwell's default runtime attributes, and they are supported by dxWDL. I would prefer to avoid adding required boilerplate to WDL tasks, if it isn't strictly necessary.

@patmagee
Copy link
Member Author

patmagee commented Jun 4, 2019

@orodeh As a language, this is one thing that we ultimately need to resolve. the runtime attributes is one of the culprits IMO behind making a workflow completely unrunnable in a different environment. It provides a way for users to move task configuration outside of the actual task/wdl and completely detracts from the portability of that workflow. The balance I tried to strike can be seen in this comment: #315 (comment), where users can avoid setting a value but the task MUST be run with the defaults if there is nothing set. This guarantess portability. IT does however explicitly dissallow things like the runtime-defaults.

@orodeh
Copy link
Contributor

orodeh commented Jun 4, 2019

@patmagee. Very well, can we then make defaults for cpu, memory, and disk, so that the only thing you absolutely must have in the runtime section is the container image?

@patmagee
Copy link
Member Author

patmagee commented Jun 4, 2019

@orodeh yes, I am open to suggestions on what those values should be. for CPU I would think that 1 is fine, disk I would so 10GB but am a bit hazy on memory

@orodeh
Copy link
Contributor

orodeh commented Jun 4, 2019

How about 2GiB as a default number for memory.

@patmagee
Copy link
Member Author

patmagee commented Jun 4, 2019

that sounds reasonable to me

@rhpvorderman
Copy link
Contributor

disk I would so 10GB

This very much depends on the size of the input file. If I am running on a whole-genome sample and my input fastqs are 100 GB, then I need 100GB of disk space as well for the output when using cutadapt.

But If I run a test workflow with 200 kb files I don't want cromwell to crash by default because there is no 10GB available. All my test workflows run in /tmp by default (pytest-workflow's default). I do have some spare gigabytes on / but not 10. 10 is a bit overly cautious IMHO.

Why not set it at 1GiB ? We have a size(FILE) operator in WDL so it is extremely easy to set the requirements according to the input files. I think it would be good to promote that as best practice.

@patmagee
Copy link
Member Author

patmagee commented Jun 5, 2019

@orodeh I have added the default values with restrictions on execution engines providing their own defaults. WDYT

@jtratner
Copy link
Contributor

jtratner commented Sep 14, 2019 via email

@@ -2647,6 +3026,39 @@ In JSON, the inputs to the workflow in the previous section might be:

It's important to note that the type in JSON must be coercible to the WDL type. For example `wf.int_val` expects an integer, but if we specified it in JSON as `"wf.int_val": "three"`, this coercion from string to integer is not valid and would result in a coercion error. See the section on [Type Coercion](#type-coercion) for more details.

## Specifying / Overriding Runtime Attributes in JSON

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 👍

@jdidion
Copy link
Collaborator

jdidion commented Nov 3, 2020

This is now implemented in wdlTools

@patmagee
Copy link
Member Author

patmagee commented Nov 3, 2020

@jdidion incredible news!! I can coordinate merging this and resolving any conflicts

@mlin
Copy link
Member

mlin commented Nov 6, 2020

Hi all, belated discussion point here. The PR is written with the hints section grammar modeled on the meta sections, meaning the right-hand side values are literals only, and can not be WDL expressions with calculations or references to other declarations. That's in contrast to the runtime section where expressions are useful & widely used. It would seem curious that you can write a calculation for runtime.memory but not hints.maxMemory.

On the other hand, if we want to allow general expressions for the hints values, there's a corner-case grammar collision between map literals and the nested runtime section, e.g. if we wrote

String bar = "xyz"
hints {
  inputs: { bar: true }
}

it's grammatically ambiguous whether the right-hand side of inputs: is a nested hints section or a Map[String,Boolean]. Various solutions are possible but we'd have to settle on one.

@mlin
Copy link
Member

mlin commented Nov 11, 2020

Also importing a discussion point from #406 (comment) with @patmagee @jdidion @illusional.

runtime.container is just an alias for runtime.docker right now, presumptively referring to a docker registry/tag. How would one specify that another kind of container/image is meant? That seems to me important if the goal is portable generalization to other container tech like Singularity, and thus shouldn't be left to engine configuration. More structure might also be needed for the future possibility of supplying a Dockerfile (or Singularity recipe) instead of a prebuilt image tag, which really streamlines workflow "devops" for simple cases (also provided by CWL).

The concern IMO is introducing runtime.container in a way that's just an alias for runtime.docker at first, and whether that's going to place constraints on the likely need for more structure in the future.

@rhpvorderman
Copy link
Contributor

One solution could be that the container field always needs to specify a protocol. I.e. docker://biowdl/chunked-scatter:0.1.0 or singularity://ubuntu:16.04. Or the container field becomes a map:

runtime {
    container: {protocol: "docker", image: "biowdl/chunked-scatter:0.1.0"}
} 

The last solution also allows for several fields if the need arises, but since I can not think of any use cases right now, I prefer the simpler version with the protocol.

@jdidion
Copy link
Collaborator

jdidion commented Nov 12, 2020

I like the simple solution. For better backward compatibility, if there is no protocol, assume Docker.

@jdidion
Copy link
Collaborator

jdidion commented Nov 12, 2020

(I was worried about having to escape URIs with colons in them, but apparently colons are allowed without escaping https://stackoverflow.com/questions/1737575/are-colons-allowed-in-urls#:~:text=Yes%2C%20unless%20it's%20in%20the,org%2Fwiki%2FTemplate%3AWelcome)

@patmagee
Copy link
Member Author

I agree with @jdidion, in this case simple is better. If we allow a protocol to be defined, and by default use docker that probably captures the vast majority of use cases

@mlin
Copy link
Member

mlin commented Nov 12, 2020

  • implied docker:// saounds good to me; I noticed the pull commands in singularity, podman, and rkt have similar schemes
  • nit: if we're trying to avoid backwards incompatibility for "WDL 1.1" then we mustn't declare runtime.container "required" as it is now (nor can we forbid other arbitrary runtime fields yet)
  • bumping Clarified Runtime and Added Hints #315 (comment) the previous comment on hints grammar

@jdidion
Copy link
Collaborator

jdidion commented Nov 13, 2020

@mlin I agree on your point 2 - this is explicitly stated in the RFC (container is an alias, not a replacement; arbitrary keys are deprecated but still allowed)


Since values are expressions, they can also reference variables in the task:
#### container (**Required**)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdidion this is the "Required" I was referring to

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah - I agree that it shouldn't be required in 1.1, and I think there's a good argument to not ever make it required, for example if I just want to run my workflow locally in an environment where I already have all the dependencies installed.

@mlin
Copy link
Member

mlin commented Nov 14, 2020

Here are several ideas for how to allow expressions in hints. This is a tedious issue obviously, but probably much less that it would be ex post facto!

  1. Keep as-is -- don't allow them. (A valid answer, but might lead to regret?)
  2. Make hints just like runtime, without any meta-style nesting. Say custom hints should give their names appropriate prefixes (e.g. aws_, azure_) to avoid conflict.
  3. Make all hints exist at exactly one level of nesting (a namespace), so every hint is shaped like hints { namespace { name: <expr> } }
  4. Say expressions are allowed in hints "except for map literals" so the key: value syntax always represents meta-style nesting
  5. Allow expressions in hints only within ~{} interpolations
  6. from @jdidion on slack:

require the user to define a struct for any object-like value they want to put in hints, and then use the struct literal syntax
or for engine-specific hints, the engine pre-defines the struct format and other engines just ignore any hints they don’t recognize (including not trying to type-check their values)

@jdidion
Copy link
Collaborator

jdidion commented Nov 19, 2020

@mlin I actually think #3 is the best solution. In fact, I'd be in favor of also disallowing map literals in runtime and the meta sections as well - it just makes parsing more difficult and confusing to the end user for very little benefit.

My second choices would be 0 or 5, and I'm pretty opposed to 1 and 2.

@jdidion jdidion mentioned this pull request Dec 16, 2020
jdidion added a commit that referenced this pull request Feb 12, 2021
Replace development spec with updated version of 1.1 spec. It is the 1.1 spec with all deprecated items removed, and with Directory and PR #315 added in.
@jdidion
Copy link
Collaborator

jdidion commented Feb 12, 2021

Merged as part of #435

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.