New flow CLI, SDK #612

Draft: wants to merge 54 commits into master


@ryan-williams (Contributor) commented Jul 15, 2021

This is a large change, factored into 4 sequential groups of commits, that addresses pain points I hit as a new user/developer of Metaflow (work leading up to this has been discussed in Slack). It aims to:

  • Loosen requirement that "1 file = 1 flow = 1 class = 1 Python process":
    • support multiple flows per file (new CLI: metaflow flow <file[:flow_name]>)
    • support multiple flows being defined in one Python process
  • Reduce flow-definition boilerplate
  • Define graph structure using decorators, not self.next calls
  • Compose flows via inheritance
  • Add pytest tests under metaflow/tests (in addition to existing tests under test/core)

Everything should be backwards-compatible; flows can optionally be defined using a new API under metaflow.api; existing Metaflow public members work as they always have.

Examples

See metaflow/api/README.md @ dsl for a detailed write-up and examples, but here is a short overview:

Support multiple flows per file

Using new (optional) metaflow flow CLI entrypoint:

-python my_flow.py <help|run|show|…> …
+metaflow flow my_flow.py[:MyFlow] <help|run|show|…> …

The flow name ([:MyFlow] above) can be omitted if there is just one flow in the file.
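
As a rough illustration of how a `<file>[:<flow_name>]` argument could be split, here is a minimal sketch. `parse_flow_spec` is a hypothetical helper, not a function from this PR, and it assumes the flow name is a plain Python identifier:

```python
from typing import Optional, Tuple


def parse_flow_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split '<file>[:<flow_name>]' into (file, flow_name or None).

    Hypothetical helper for illustration; the PR's actual parsing may differ.
    """
    # Split on the last colon so that colons earlier in the path survive.
    path, sep, name = spec.rpartition(":")
    if not sep:
        # No colon at all: the whole argument is the file path.
        return spec, None
    if not name.isidentifier():
        # Whatever follows the colon is not a class name (e.g. a Windows
        # drive path), so treat the whole argument as the file path.
        return spec, None
    return path, name
```

With this sketch, `my_flow.py:MyFlow` yields the file and the flow name, while `my_flow.py` yields the file and `None`, leaving the caller to pick the only flow in the file.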

Reduce flow-definition boilerplate

start and end steps, self.next calls, and __main__ handler are often superfluous:

-from metaflow import FlowSpec, step
+from metaflow.api import FlowSpec, step

 class LinearFlow(FlowSpec):
-    @step
-    def start(self):
-        self.next(self.one)
-
     @step
     def one(self):
         self.a = 111
-        self.next(self.two)

     @step
     def two(self):
         self.b = self.a * 2
-        self.next(self.three)

     @step
     def three(self):
         assert (self.a, self.b) == (111, 222)
-        self.next(self.end)
-
-    @step
-    def end(self): pass

-if __name__ == '__main__':
-    LinearFlow()

Define graph structure using decorators, not self.next calls

Mixing graph structure logic (self.next calls) into step implementations is clumsy; they are two different levels of abstraction and should be better delineated/separated:

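from metaflow import Parameter
from metaflow.api import FlowSpec, foreach, join, step

class SumSquares(FlowSpec):
    # `required=True` dropped: it is redundant when a default is given
    num = Parameter("num", type=int, default=4)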

    @step
    def start(self):
        self.nums = list(range(1, self.num + 1))

    @foreach('nums')
    def square(self, num):
        self.num2 = num**2

    @join
    def sum(self, inputs):
        self.sum2 = sum(input.num2 for input in inputs)

(this resolves #604, I believe)
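
To make the fan-out/fan-in semantics concrete, the following plain-Python sketch computes what `SumSquares` is expected to produce for a given `num`. It does not use Metaflow: `sum_squares` is a hypothetical stand-in, and it assumes `@foreach('nums')` runs `square` once per element and `@join` receives all of the per-element results.

```python
def sum_squares(num: int = 4) -> int:
    """Plain-Python analogue of the SumSquares flow (illustration only)."""
    # start: build the list that the foreach fans out over
    nums = list(range(1, num + 1))
    # square: one "task" per element of nums
    squares = [n ** 2 for n in nums]
    # sum: the join step reduces over the per-element results
    return sum(squares)
```

For the default `num=4` this adds 1, 4, 9 and 16.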

Compose flows via inheritance

Flows are modeled as Python classes, but currently there's no way to compose them (via inheritance, or otherwise). This PR implements a simple inheritance-based composition scheme:

class FlowOne(FlowSpec):
    @step
    def one(self):
        self.a = 111

class FlowTwo(FlowSpec):
    @step
    def two(self):
        self.b = self.a * self.a

class MyFlow(FlowOne, FlowTwo):
    """Contains steps `one`, `two`"""
    pass

(resolves #245)
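
The idea can be mimicked without Metaflow. The sketch below uses toy `FlowSpec` and `step` stand-ins (not the PR's implementation) and assumes that parent flows contribute their steps in method-resolution order, so `FlowOne`'s steps run before `FlowTwo`'s:

```python
def step(fn):
    """Toy stand-in for a @step decorator: tags the function as a step."""
    fn.is_step = True
    return fn


class FlowSpec:
    """Toy base class; not Metaflow's FlowSpec."""

    @classmethod
    def step_names(cls):
        names = []
        # Walk the MRO so that steps from earlier bases come first.
        for klass in cls.__mro__:
            for name, member in vars(klass).items():
                if getattr(member, "is_step", False) and name not in names:
                    names.append(name)
        return names


class FlowOne(FlowSpec):
    @step
    def one(self):
        self.a = 111


class FlowTwo(FlowSpec):
    @step
    def two(self):
        self.b = self.a * self.a


class MyFlow(FlowOne, FlowTwo):
    """Contains steps `one`, `two`"""


def run(flow_cls):
    """Run the collected steps in order on one instance and return it."""
    flow = flow_cls()
    for name in flow_cls.step_names():
        getattr(flow, name)()
    return flow
```

Because both steps run on the same instance, `two` can read the `a` attribute that `one` set, which is the behaviour the composed flow relies on.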

Other methods of composing flows may be desirable, but this solves many use cases. Celsius has been using this in production since August 2021.

PR Status

  • #661, #662, #665, #666 were factored out of this PR.
  • d0 is a merge of all of them, and the 4 "checkpoints" described below start from it as a base.
  • Most recent rebase: Jan 8 2022 (master @ 53ab9d4)

PR Structure

Since there are many commits, I've broken out 4 "checkpoints"; each one basically stands alone, and they can be reviewed/merged in sequence:

  • d0...d1 (minor): nits, mostly related to graph construction
  • d1...d2 (major): new CLI (metaflow flow <file>[:<flow>] …), multiple flow definitions per file, multiple FlowSpecs can be run within one Python process, more pytest tests
  • d2...d3 (minor): more pytest tests
  • d3...d4 (major): new FlowSpec SDK: graph structure in decorators, hide self.next calls, support flow composition via inheritance

Let me know if a different decomposition would make more sense, or if this isn't clear.

Invariants

I'll preserve these as updates are made to this PR:

d0...d1: misc nits, mostly related to graph construction

Introduces an IS_STEP constant as a start to formalizing the ad hoc API around certain function attrs (e.g. is_step) indicating they are "steps". I build more on this later.
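
A minimal sketch of what such a constant buys: the marker attribute name lives in one place, and both the decorator and any code that checks for steps go through it. The value `"is_step"` and the helper `is_step` are assumptions for illustration, not copied from the PR.

```python
IS_STEP = "is_step"  # name of the marker attribute (assumed value)


def step(fn):
    """Toy decorator: mark a function as a step via the shared constant."""
    setattr(fn, IS_STEP, True)
    return fn


def is_step(obj) -> bool:
    """Return True only for callables that carry the step marker."""
    return callable(obj) and getattr(obj, IS_STEP, False) is True


@step
def one():
    pass


def helper():
    pass
```

Here `is_step(one)` holds and `is_step(helper)` does not, without either call site spelling out the attribute name.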

This chunk could be folded into #666, or the subsequent d1...d2 chunk.

d1...d2: support multiple flows per {file, process} + new CLI (metaflow flow <file>[:<flow>] …)

This chunk of commits makes flows behave more like the Python classes they are modeled as:

  • multiple flows can exist in one file
  • multiple flows can be executed within one Python process
  • if __name__ == '__main__': … handler no longer required

New CLI: metaflow flow <file>[:<flow>] …

To support multiple flow definitions in one file, a new, optional CLI entrypoint is implemented:

metaflow flow <file>[:<flow>] …

as a drop-in replacement for / alternative to:

python <file>

If there is just one flow in a file, the :<flow> portion can be omitted:

metaflow flow <file>
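
One way the "omit the name if there is only one flow" rule could be resolved is sketched below. `find_flow` and the toy `FlowSpec` are hypothetical names for illustration; the PR's actual lookup may work differently.

```python
import types


class FlowSpec:
    """Toy base class standing in for Metaflow's FlowSpec."""


def find_flow(module, flow_name=None, base=FlowSpec):
    """Pick a flow class from a module, by name or by being the only one."""
    # Collect classes in the module that subclass the base (not the base itself).
    flows = {
        name: obj
        for name, obj in vars(module).items()
        if isinstance(obj, type) and issubclass(obj, base) and obj is not base
    }
    if flow_name is not None:
        if flow_name not in flows:
            raise ValueError(f"no flow named {flow_name!r}; found: {sorted(flows)}")
        return flows[flow_name]
    if len(flows) != 1:
        raise ValueError(
            f"expected exactly one flow, found: {sorted(flows)}; use <file>:<flow>"
        )
    return next(iter(flows.values()))


# Demo: a throwaway module that initially holds a single flow.
demo = types.ModuleType("my_flow")


class MyFlow(FlowSpec):
    pass


class OtherFlow(FlowSpec):
    pass


demo.MyFlow = MyFlow
```

With only `MyFlow` in the module, `find_flow(demo)` succeeds; once a second flow is added, the name must be given explicitly.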

Users can still use the old form (python <file> … and a __name__ == '__main__' handler; all relevant changes are backwards-compatible), but Metaflow internally uses the new form everywhere.

Expanded Parameter / "main flow" global bookkeeping

In addition to supporting multiple flow definitions per file, this change allows multiple flows to be invoked in one Python process.

This requires tracking a (global, mutable) "main" flow (and corresponding Parameters that should be fed to click parsing). See changes to parameters.py.
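
A minimal sketch of this kind of bookkeeping, assuming parameters are registered per flow name and that only the "main" flow's parameters are handed to the CLI parser. All names here (`register_parameter`, `set_main_flow`, `main_flow_parameters`) are hypothetical and do not come from parameters.py.

```python
from typing import Dict, List, Optional

# Global, mutable state: parameters registered per flow, plus the flow
# that is currently designated as "main".
_params_by_flow: Dict[str, List[str]] = {}
_main_flow: Optional[str] = None


def register_parameter(flow_name: str, param_name: str) -> None:
    """Record that `flow_name` declares a parameter called `param_name`."""
    _params_by_flow.setdefault(flow_name, []).append(param_name)


def set_main_flow(flow_name: str) -> None:
    """Select which flow's parameters should be exposed on the command line."""
    global _main_flow
    _main_flow = flow_name


def main_flow_parameters() -> List[str]:
    """Return the parameters of the main flow only."""
    if _main_flow is None:
        raise RuntimeError("no main flow has been selected")
    return list(_params_by_flow.get(_main_flow, []))
```

Keying the registry by flow keeps one flow's parameters from leaking into another's command line when several flows are defined in the same process.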

d2...d3: more pytest tests

3 commits, each adding a new test file:

d3...d4: new FlowSpec SDK: graph structure in decorators, hide self.next calls, flow composition via inheritance

This introduces the metaflow.api package, containing a FlowSpec base-class and @step decorator (which are drop-in replacements for metaflow.{FlowSpec,step}) as well as new @foreach and @join decorators.

This alternate FlowSpec class and decorators allow constructing flows in a more ergonomic fashion:

  • Graph structure is specified by decorators (not self.next calls in steps' function-bodies)
    • self.next calls are still synthesized+injected under the hood, but abstracted from users
    • This also improves flows' reusability, as step definitions needn't additionally fix what is to be done downstream with their outputs (as they must do today with required trailing self.next calls).
  • start/end steps are optional
    • Flows with one step (or zero steps) are possible
    • Trivial start/end steps are still synthesized under the hood, if not explicitly defined, for minimal disruption to lower layers
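
For a purely linear flow, the synthesis described above can be pictured as follows. This is a sketch under stated assumptions, not the PR's graph builder: `linear_graph` is a hypothetical helper, steps are chained in definition order, and an explicitly defined `start` or `end` is assumed to already sit at the right position.

```python
from typing import Dict, List, Optional


def linear_graph(step_names: List[str]) -> Dict[str, Optional[str]]:
    """Map each step to its successor, adding trivial start/end if absent."""
    names = list(step_names)
    if "start" not in names:
        names.insert(0, "start")
    if "end" not in names:
        names.append("end")
    # Each step flows to the next one in order; 'end' has no successor.
    successors: List[Optional[str]] = [*names[1:], None]
    return dict(zip(names, successors))
```

Each key-to-value pair corresponds to one `self.next(...)` call that the user no longer has to write; a flow with zero steps degenerates to `start` followed by `end`.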

Flow composition via inheritance

Addresses #245: flows can be "mixed-in" to factor out common sets of steps; see metaflow/tests/test_inheritance.py / metaflow/tests/flows/inherited_flows.py.

There are some rough edges:

  • two mixed-in flows can't have step-names that collide with one another
  • in particular, "classic" flows can't be mixed in (since start/end would collide; otherwise this should Just Work)
    • it's possible to work around this; the combined flow already analyzes the graphs of its "parent" flows, and builds its own graph based on them
    • it might require a more principled approach to namespacing flows and steps (afaik flows with the same name, defined in different files, can already collide / cause problems in the metastore?)
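
The collision restriction can be illustrated with a small check over the step names of each parent flow. `find_step_collisions` is a hypothetical helper, not part of the PR:

```python
from collections import defaultdict
from typing import Dict, List


def find_step_collisions(parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return {step_name: [flows defining it]} for steps defined more than once."""
    owners = defaultdict(list)
    for flow_name, steps in parents.items():
        for step_name in steps:
            owners[step_name].append(flow_name)
    return {name: flows for name, flows in owners.items() if len(flows) > 1}
```

Two new-style flows with distinct step names produce no collisions, whereas two "classic" flows each defining `start` and `end` collide on both, which is why they cannot currently be mixed in.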

However, it allowed us (Celsius) to combine 5 production flows that were previously being shelled out to separately in sequence (also allowing resume to work across them, which previously didn't work), so hopefully it is a good start. We've been using this in production since August 2021.

Longer term, I'd like to come up with syntax for directly "injecting" another flow into the body of a FlowSpec class (supporting direct composition instead of relying on the class-inheritance mechanism), and/or supporting @step decorators on top-level Python functions (not just methods of FlowSpec classes).

@ryan-williams
Contributor Author

I can't figure out what failed in "R tests on macos-latest"; it just says "Workflow failed." at one point. The R tests run for me locally on macOS:

cd R/tests
Rscript run_tests.R

@ryan-williams force-pushed the dsl branch 3 times, most recently from 00873e5 to db0a601, on January 8, 2022 04:50
@ryan-williams
Contributor Author

Some updates here:

  • I rebased the whole stack here on top of master (53ab9d4, or 2.4.7+6) a few days ago
  • I updated all the PRs that feed d0: #661, #662, #665, #666
  • I updated the PR description (which had drifted from the actual PR structure a bit)

Celsius has been using this in production (mostly in Batch mode) since August 2021, and some roadblocks to upstreaming have been lifted since then. Notably, it looks like Python 2 support is on the way out (I never attempted to support it here), and while I didn't support R initially, I do now. I've rebased this a few times since opening it, and generally the only issue has been adding the new "entrypoint" (python -m metaflow.main_cli flow <flow> … instead of python <script> …) to new places (e.g. parallel_decorator.py, card_decorator.py, test_unbounded_foreach_decorator.py).

It would be nice to check in about interest in upstreaming any of this!

One simple change I'm considering here is renaming the new package I introduce from metaflow.api to something like metaflow.sdk or metaflow.sdk2. I feel that "SDK" more accurately captures what I've created: a new/alternate Python SDK for defining flows, that compiles to the same graph structure / Metaflow backend. Thoughts welcome on that.

@ekamperi

Hey folks! Is there an estimate of when (or whether) this PR will be merged? I love the graph composition and the new decorator-enabled syntax. I don't have the technical background to contribute code, but I could help with testing if needed.

Successfully merging this pull request may close these issues.

  • Annotation @join
  • Composable Flows and Steps