Update engine README for Params #7600

Merged
merged 2 commits into from
Aug 2, 2019
191 changes: 91 additions & 100 deletions src/python/pants/engine/README.md
@@ -17,63 +17,74 @@ Once the engine is instantiated with a valid set of `@rule`s, a caller can synch
computation of any of the product types provided by those `@rule`s by calling:

```python
# Request a ThingINeed (a `Product`) for the thing_i_have (a `Subject`).
# Request a ThingINeed (a `Product`) for a thing_i_have (a `Param`).
thing_i_need, = scheduler.product_request(ThingINeed, [thing_i_have])
```

Contributor:

Is the comma after `thing_i_need` a typo?

Member Author:

No: it's unpacking a single result from a list. I feel like I like this syntax better than:

    thing_i_need = scheduler.product_request(ThingINeed, [thing_i_have])[0]

... but if we want to avoid this, we can.

Contributor:

`thing_i_need = scheduler.product_request(ThingINeed, [thing_i_have])[0]` is less surprising to me. I recommend using that style both in documentation and source code.

Contributor (@jsirois, Apr 22, 2019):

The comma is subtle. Perhaps `.pop()` as the idiom is a compromise, but it would seem nicer to assert a single item, which the tuple unpack does. Maybe wrapping this case up in an API pulls its weight and `scheduler.single_product_request(ThingINeed, [thing_i_have])` should be a thing (I think it was at some point?). Maybe `request` and `request_single` would be better names - the product bit is perhaps redundant - the only thing you can request is a product.

Contributor:

There's always `pants.util.collections.assert_single_element()`! That is what I have used for this exact purpose.
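The trade-off discussed in this thread can be seen with plain lists. The following is a minimal sketch that assumes nothing about the scheduler itself: `results` stands in for the list returned by a product request, and `single` is a hypothetical helper in the spirit of the `request_single` idea above.

```python
def single(items):
    """Return the only element of `items`, or raise ValueError (hypothetical helper)."""
    items = list(items)
    if len(items) != 1:
        raise ValueError('expected exactly one element, got {}'.format(len(items)))
    return items[0]


# Stand-in for a request that unexpectedly returned two values.
results = ['a', 'b']

# Indexing silently ignores the extra element.
first = results[0]

# Tuple unpacking asserts that there is exactly one element.
try:
    only, = results
    unpack_failed = False
except ValueError:
    unpack_failed = True

# An explicit helper makes the same assertion, but is easier to spot than the comma.
try:
    single(results)
    helper_failed = False
except ValueError:
    helper_failed = True
```

Both the unpacking form and the helper fail loudly on an unexpected result count, while `[0]` does not; the helper only trades brevity for visibility.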

The engine then takes care of concurrently executing all dependencies of the matched `@rule`s to
produce the requested value.

### Products and Subjects
### Products and Params

The engine executes your `@rule`s in order to (recursively) compute a `Product` of the requested
type for a given `Subject`. This recursive type search leads to a very loosely coupled (and yet
type for a set of `Param`s. This recursive type search leads to a loosely coupled (and yet
Contributor:

Can we give formal definitions of `Product` and `Param` here? So that there's something to hang the subsequent explanation and examples off?

still statically checked) form of dependency injection.

When an `@rule` runs, it runs for a particular `Subject` value, which is part of the unique
identity for that instance of the `@rule`. An `@rule` can request dependencies for different
`Subject` values as it runs (see the section on `Get` requests below). Because the subject for
an `@rule` is chosen by callers, a `Subject` can be of any (hashable) type that a user might want
to compute a product for.
When an `@rule` runs, it requires a set of `Param`s that the engine has determined are needed
to compute its transitive `@rule` dependencies. So although an `@rule` might not have a particular
`Param` type in its signature, it might depend on another `@rule` that does need that `Param`, and
would thus need that `Param` in order to run. To see which `Params` the engine needs to run each
`@rule`, refer to the `Visualization` section below.

The return value of an `@rule` for a particular `Subject` is known as a `Product`. At some level,
you can think of (`subject_value`, `product_type`) as a "key" that uniquely identifies a particular
Product value and `@rule` execution.
Any hashable type with useful equality may be used as a `Param`, and additional `Params` can be
provided to an `@rule`'s dependencies via `Get` requests (see below). Each `Param` value in a set
of `Params` is unique by type, so if `@rules` recursively introduce a particular `Param` type,
there will still only be one value for that type in each `@rule`, but it will change as you move
deeper into the dependency graph.

The return value of an `@rule` is known as a `Product`. At some level, you can think
of `(product_type, params_set)` as a "key" that uniquely identifies a particular `Product` value
and `@rule` execution. If an `@rule` is able to produce a `Product` without consuming any `Params`,
Contributor:

This sentence confused me because I thought it was closely related to the prior sentence. I recommend adding a transition word, like "Further, if an `@rule`".

then the `@rule` will run exactly once, and the value that it produces will be a singleton.
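To make the `(product_type, params_set)` key concrete, here is a minimal sketch of a memo table. It is a toy model rather than the engine's implementation: `run_rule`, `MEMO`, `CALLS`, and `Greeting` are hypothetical names, and rules are plain functions that receive a dict of `Params` keyed by type.

```python
from collections import namedtuple

Greeting = namedtuple('Greeting', ['text'])

MEMO = {}
CALLS = []


def run_rule(product_type, rule_fn, consumed_types, params):
    """Run `rule_fn` at most once per (product_type, consumed Params) key.

    `params` maps each Param type to its single value, mirroring the
    "unique by type" property of a set of Params.
    """
    consumed = frozenset((t, params[t]) for t in consumed_types)
    key = (product_type, consumed)
    if key not in MEMO:
        CALLS.append(key)
        MEMO[key] = rule_fn({t: params[t] for t in consumed_types})
    return MEMO[key]


def int_to_str(params):
    return str(params[int])


def greeting(params):
    # Consumes no Params, so it behaves as a singleton.
    return Greeting('hello')


a = run_rule(str, int_to_str, [int], {int: 1})
# Same key as above: the float Param is in scope but is not consumed.
b = run_rule(str, int_to_str, [int], {int: 1, float: 2.0})
# A different int value produces a new key and a new execution.
c = run_rule(str, int_to_str, [int], {int: 2})
# The singleton runs once, regardless of which Params are in scope.
g1 = run_rule(Greeting, greeting, [], {int: 1})
g2 = run_rule(Greeting, greeting, [], {int: 2})
```

Only the `Params` that are actually consumed take part in the key, which is why the second call is a cache hit and why the `Param`-free rule runs exactly once.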

#### Example
Contributor:

Great! This is really helpful.

We might want to move the ambiguity information to a new subsection called "Rule ambiguity". Right now, it's all in the same section as the example, and I fear 1) some people will skip it as a result, 2) the example looks scarier than necessary, and 3) it will be harder to discover that rule ambiguity is a thing.

Contributor:

Rereading this, I echo this suggestion. The example doesn't seem closely related to the actual topic of rule ambiguity.


As a very simple example, you might register the following `@rule` that can compute a `String`
Product given a single `Int` input.
Product given a single `Int` argument.

```python
@rule(StringType, [IntType])
@rule(str, [int])
def int_to_str(an_int):
  return '{}'.format(an_int)
  return str(an_int)
```

The first argument to the `@rule` decorator is the Product (ie, return) type for the `@rule`. The
second argument is a list of parameter selectors that declare the types of the input parameters for
the `@rule`. In this case, because the Product type is `StringType` and there is one parameter
selector (for `IntType`), this `@rule` represents a conversion from `IntType` to `StrType`, with no
other inputs.
The first argument to the `@rule` decorator is the `Product` (ie, return) type for the `@rule`. The
second argument is a list of "parameter selectors" that declare the types of the input parameters for
the `@rule`. In this case, because the `Product` type is `str` and there is one parameter
selector (for `int`), this `@rule` represents a conversion from `int` to `str`, with no other inputs.

When the engine encounters this `@rule` while compiling the rule graph for `str`-producing-`@rules`,
Contributor:

Once more on the pure function theme. Given that rules are just pure functions, there are really just two interesting bits:

  1. How does the engine decide to call a function (you address a bit of this here).
  2. How does the user call a function (`Get`).

it will next go hunting for the dependency `@rule` that can produce an `int` using the fewest number
of `Params`. For example, if there was an `@rule` that could produce an `int` without consuming any
`Params` at all (ie, a singleton), then that `@rule` would always be chosen first. If all `@rules` to
produce `int`s required at least one `Param`, then the engine would next see whether the input `Params`
contained an `int`, or whether there were any `@rules` that required only one `Param`, then two
Contributor:

This sentence buries the lede a bit: you start by talking about needing to produce an `int` from `Params`, but here you imply that the `int` itself can be a `Param`.

Does the output of some other rule count as a `Param`, or are `Params` just the things that are injected into the boundary of the graph "from outside"?

In other words, if I have `rule1` that takes a `Foo` and returns a `Bar`, and then `rule2` that takes a `Bar` and returns a `Baz`, is it correct to refer to the `Foo` as a `Param`? What about the `Bar`?

Sorry to keep banging on this, but this README is going to be extremely useful and important, so it's best to make sure it's crystal clear.

`Params`, and so on.

When the engine statically checks whether it can use this `@rule` to create a string for a
Subject, it will first see whether there are any ways to get an IntType for that Subject. If
the subject is already of `type(subject) == IntType`, then the `@rule` will be satisfiable without
any other dependencies. On the other hand, if the type _doesn't_ match, the engine doesn't give up:
it will next look for any other registered `@rule`s that can compute an IntType Product for the
Subject (and so on, recursively).
In cases where this search detects any ambiguity (generally because there are two or more `@rules` that
can provide the same product with the same number of parameters), rule graph compilation will fail with
a useful error message.
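The selection order described here (fewest `Params` first, with ties treated as ambiguity) can be sketched as follows. This is a toy model under stated assumptions, not the engine's actual rule graph algorithm: `Candidate`, `choose_rule`, and `AmbiguousRulesError` are hypothetical names, and each candidate is summarized only by the set of `Param` types it would consume.

```python
from collections import namedtuple

# A way to produce a product: `name` is a label, and `param_types` is the set
# of Param types that would be consumed (transitively) by choosing it.
Candidate = namedtuple('Candidate', ['name', 'param_types'])


class AmbiguousRulesError(Exception):
    pass


def choose_rule(candidates, available_param_types):
    """Pick the satisfiable candidate that consumes the fewest Params."""
    available = frozenset(available_param_types)
    satisfiable = [c for c in candidates if frozenset(c.param_types) <= available]
    if not satisfiable:
        raise LookupError('no rule can be satisfied by the available Params')
    fewest = min(len(c.param_types) for c in satisfiable)
    best = [c for c in satisfiable if len(c.param_types) == fewest]
    if len(best) > 1:
        names = ', '.join(sorted(c.name for c in best))
        raise AmbiguousRulesError('ambiguous rules: {}'.format(names))
    return best[0]


singleton = Candidate('default_int', frozenset())
from_str = Candidate('parse_int', frozenset([str]))
from_bytes = Candidate('decode_int', frozenset([bytes]))

# A candidate that needs no Params wins over one that needs a str.
chosen = choose_rule([from_str, singleton], [str])

# Two candidates that each need exactly one available Param are ambiguous.
try:
    choose_rule([from_str, from_bytes], [str, bytes])
    ambiguous = False
except AmbiguousRulesError:
    ambiguous = True
```

In this model the failure happens at selection time, before anything runs, which mirrors the point that ambiguity is reported when the rule graph is compiled.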

### Datatypes

In practical use, using basic types like `StringType` or `IntType` does not provide enough
information to disambiguate between various types of data. So declaring small `datatype`
definitions to provide a unique and descriptive type is strongly recommended:
In practical use, builtin types like `str` or `int` do not provide enough information to disambiguate
Contributor:

Is the issue that builtin types don't provide enough information, or rather that you can't easily compose complex data types like `List[int]` or `Tuple[MyClass, int]`?

Currently this suggests it's wrong to ever use `str` or `int`, because it suggests they result in ambiguity. Instead, I think what we're after is that it would be hard to express complex data types.

Contributor:

Hm, well, after reading the example, maybe both reasons are cause to use a `datatype`. It would be useful to mention both.

Member Author:

Yea, it's both. A feature, not a bug =)

Contributor:

Perhaps "to create compound types and to disambiguate between various types of data"? I do think the idea of collections is an important one to explain.

When reading this, I wasn't immediately thinking "huh, how would I make a list of ints?" But it's important to know that it's possible, so I think it's worth proactively mentioning it.

between various types of data in `@rule` signatures, so declaring small `datatype` definitions to
provide a unique and descriptive type is highly recommended:

```python
class FormattedInt(datatype(['content'])): pass

@rule(FormattedInt, [IntType])
@rule(FormattedInt, [int])
def int_to_str(an_int):
  return FormattedInt('{}'.format(an_int))

@@ -105,29 +116,32 @@ class TypedDatatype(datatype([('field_name', Exactly(str, int))])):
```

Assigning a specific type to a field can be somewhat unidiomatic in Python, and may be unexpected or
unnatural to use. Additionally, the engine already applies a form of implicit type checking by
ensuring there is a unique path from subject to product when a product request is made. However,
regardless of whether the object is created directly with type-checked fields or whether it's
produced from a set of rules by the engine's dependency injection, it is extremely useful to
formalize the assumptions made about the value of an object into a specific type, even if the type
just wraps a single field. The `datatype()` function makes it simple and efficient to apply that
strategy.
unnatural to use. However, regardless of whether the object is created directly with type-checked
fields or whether it's produced from a set of rules by the engine's dependency injection, it is
extremely useful to formalize the assumptions made about the value of an object into a specific type,
even if the type just wraps a single field. The `datatype()` function makes it simple and efficient
to apply that strategy.
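As a rough illustration of why wrapper types help, the sketch below uses `collections.namedtuple` as a stand-in for `datatype`. That substitution is an assumption made to keep the example self-contained, and the real helper may behave differently (for example, in type checking or equality). `FormattedInt` comes from the example above; `UserId` and `Ints` are hypothetical names.

```python
from collections import namedtuple


class FormattedInt(namedtuple('FormattedInt', ['content'])):
    pass


class UserId(namedtuple('UserId', ['content'])):
    pass


class Ints(namedtuple('Ints', ['values'])):
    """A compound type: a hashable collection of ints with its own identity."""


formatted = FormattedInt(str(42))
user_id = UserId(str(42))
ints = Ints((1, 2, 3))

# Both wrap the same str, but their types differ, so a signature that asks
# for one type is never satisfied by the other.
same_type = type(formatted) is type(user_id)

# Caveat of this stand-in: plain namedtuples of different classes still
# compare equal when their fields match, so only the type tells them apart.

# The instances stay hashable, which is required of anything used as a Param.
lookup = {formatted: 'formatted', ints: 'ints'}
```

Wrapping a tuple of ints in `Ints` is one way to get a "list of ints" that is hashable and has a descriptive type of its own.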

### Parameter selectors and Gets
### Gets and RootRules

As demonstrated above, parameter selectors select `@rule` inputs in the context of a particular
`Subject` (and its `Variants`: discussed below). But it is frequently necessary to "change" the
subject and request products for subjects other than the one that the `@rule` is running for.
As demonstrated above, parameter selectors select `@rule` arguments in the context of a set of `Params`.
But where do `Params` come from?

In cases where this is necessary, `@rule`s may be written as coroutines (ie, using the python
`yield` statement) that yield "`Get` requests" that request products for other subjects. Just like
`@rule` parameter selectors, `Get` requests instantiated in the body of an `@rule` are statically
checked to be satisfiable in the set of installed `@rule`s.
One source of `Params` is the root of a request, where a `Param` type that may be provided by a caller
of the engine can be declared using a `RootRule`. Installing a `RootRule` is sometimes necessary to
seal the rule graph in cases where a `Param` could only possibly be computed outside of the rule graph
and then passed in.
Contributor:

I don't understand this. I think it would help to have an example specific to the idea of a `RootRule`. My main confusion is: what does "root" mean in "root of a request"?


The second case for introducing new `Params` occurs within the running graph when an `@rule` needs
to pass values to its dependencies that are necessary to compute a product. In this case, `@rule`s may
be written as coroutines (ie, using the python `yield` statement) that yield "`Get` requests" that request
products for other `Params`. Just like `@rule` parameter selectors, `Get` requests instantiated in the
body of an `@rule` are statically checked to be satisfiable in the set of installed `@rule`s.

#### Example

For example, you could declare an `@rule` that requests FileContent for each entry in a Files list,
and then concatenates that content into a (typed) string:
and then concatenates that content into a (datatype-wrapped) string:

```python
@rule(ConcattedFiles, [Files])
@@ -136,27 +150,27 @@ def concat(files):
  yield ConcattedFiles(''.join(fc.content for fc in file_content_list))
```

This `@rule` declares that: "for any Subject for which we can compute `Files`, we can also compute
`ConcattedFiles`". Each yielded `Get` request results in FileContent for a different File Subject
from the Files list.
This `@rule` declares that: "for any `Params` for which we can compute `Files`, we can also compute
`ConcattedFiles`". Each yielded `Get` request results in FileContent for a different File `Param`
from the Files list. And, happily, all of these requests can proceed in parallel.
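The coroutine style can be imitated with a plain generator and a small driver loop. The sketch below is a toy model, not the engine's API: `Get`, `run`, `read_file`, and `FAKE_FS` are hypothetical stand-ins, file content is faked with a dict, and the requests are resolved sequentially rather than in parallel.

```python
from collections import namedtuple

# Hypothetical stand-in: "give me `product_type` for this `param`".
Get = namedtuple('Get', ['product_type', 'param'])
FileContent = namedtuple('FileContent', ['content'])
ConcattedFiles = namedtuple('ConcattedFiles', ['content'])

FAKE_FS = {'a.txt': 'hello ', 'b.txt': 'world'}


def read_file(path):
    return FileContent(FAKE_FS[path])


def concat(files):
    # Yield a batch of requests; the driver sends the results back in.
    file_content_list = yield [Get(FileContent, path) for path in files]
    # Yield the final product.
    yield ConcattedFiles(''.join(fc.content for fc in file_content_list))


def run(rule_coroutine, resolve):
    """Drive a generator-based rule until it yields something other than Gets."""
    value = next(rule_coroutine)
    while isinstance(value, list) and all(isinstance(g, Get) for g in value):
        value = rule_coroutine.send([resolve(g) for g in value])
    return value


result = run(concat(['a.txt', 'b.txt']), lambda get: read_file(get.param))
```

Because the rule hands the whole batch of requests to the driver at once, a real scheduler is free to resolve them concurrently; this driver simply does so in order.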

### Advanced Param Usage

### Variants
Sometimes `@rule`s will need to consume multiple `Params` in order to tailor their output Products
to their consumers.

Certain `@rule`s will also need parameters provided by their dependents in order to tailor their output
Products to their consumers. For example, a javac `@rule` might need to know the version of the java
platform for a given dependent binary target (say Java 9), or an ivy `@rule` might need to identify a
globally consistent ivy resolve for a test target. To allow for this the engine introduces the
concept of `Variants`, which are passed recursively from dependents to dependencies.
For example, a javac `@rule` might need to know the version of the java platform for a given
dependent binary target, or an ivy `@rule` might need to identify a globally consistent ivy resolve
for a test target. In both of these cases, the `@rule` requires two `Params` to be in scope. But
due to the fact that `Params` are implicitly propagated from dependents to dependencies, it's possible
for these `Params` to be provided much higher in the graph, without intermediate `@rules` needing to
be aware of them.

If a Rule uses a `SelectVariants` Selector to indicate that a variant is required, consumers can use
a `@[type]=[name]` address syntax extension to pass a variant that matches a particular configuration
for a `@rule`. A dependency declared as `src/java/com/example/lib:lib` specifies no particular variant, but
`src/java/com/example/lib:lib@java=java8` asks for the configured variant of the lib named "java8".
The result would be that any subgraph that transitively consumed a `Param` to produce Java 11 (for
example) would be safely isolated and distinct from one that produced Java 9.

Additionally, it is possible to specify the "default" variants for an Address by installing an `@rule`
function that can provide `Variants(default=..)`. Since the purpose of variants is to collect
information from dependents, only default variant values which have not been set by a dependent
will be used.
_(This section needs an example, but that will have to wait for
[#7490](https://github.com/pantsbuild/pants/issues/7490)!)_

## Internal API

@@ -168,44 +182,32 @@ To compute a value for a Node, the engine uses the `Node.run` method starting fr
roots. If a Node needs more inputs, it requests them via `Context.get`, which will declare a
dependency, and memoize the computation represented by the requested `Node`.

The initial Nodes are [launched by the engine](https://github.com/pantsbuild/pants/blob/16d43a06ba3751e22fdc7f69f009faeb59a33930/src/rust/engine/src/scheduler.rs#L116-L126),
but the rest of execution is driven by Nodes recursively calling `Context.get` to request their
dependencies.
This recorded `Graph` tracks all dependencies between `@rules` and builtin "intrinsic" rules that
provide filesystem and network access. That dependency tracking allows for invalidation and dirtying
of `Nodes` as their dependencies change.
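The memoization and dirtying described here can be sketched with a small dependency-recording graph. This is a toy model with hypothetical names (`ToyGraph`, `get`, `invalidate`), written in Python for consistency with the other examples; it says nothing about how the real `Graph` is implemented.

```python
class ToyGraph(object):
    """Memoizes node values and records which nodes depend on which."""

    def __init__(self, node_fns):
        self._node_fns = node_fns  # name -> function(get) -> value
        self._values = {}          # name -> memoized value
        self._dependents = {}      # name -> set of nodes that requested it
        self.runs = []             # names, in the order their functions ran

    def get(self, name, requested_by=None):
        if requested_by is not None:
            self._dependents.setdefault(name, set()).add(requested_by)
        if name not in self._values:
            self.runs.append(name)
            # The node requests its own inputs through the same `get`,
            # which is what records the dependency edge.
            context_get = lambda dep: self.get(dep, requested_by=name)
            self._values[name] = self._node_fns[name](context_get)
        return self._values[name]

    def invalidate(self, name):
        """Clear `name` and, transitively, everything that depends on it."""
        stack, seen = [name], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            self._values.pop(current, None)
            stack.extend(self._dependents.get(current, ()))


source = {'text': 'a b c'}
graph = ToyGraph({
    'file': lambda get: source['text'],
    'words': lambda get: get('file').split(),
    'count': lambda get: len(get('words')),
    'banner': lambda get: 'toy graph',
})

first = graph.get('count')
graph.get('banner')

# Change the input and dirty the node that read it.
source['text'] = 'a b c d'
graph.invalidate('file')

second = graph.get('count')
graph.get('banner')
```

After the invalidation, `file`, `words`, and `count` are recomputed, while `banner`, which never depended on `file`, keeps its memoized value.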

### Registering Rules
## Registering Rules

Currently, it is only possible to load rules into the pants scheduler in two ways: by importing and
using them in `src/python/pants/bin/engine_initializer.py`, or by adding them to the list returned
by a `rules()` method defined in `src/python/backend/<backend_name>/register.py`. Plugins cannot add
new rules yet. Unit tests, however, can mix in `TestBase` from
`tests/python/pants_test/test_base.py` to generate and execute a scheduler from a given set of
rules.
The recommended way to install `@rules` is to return them as a list from a `def rules()` definition
in a plugin's `register.py` file. Unit tests can either invoke `@rules` with fully mocked
Contributor:

A short example of what the `def rules()` looks like would be helpful. You don't need to show the implementation of the rules, just the signatures. Something like this:

class List:

  @rule(List, [Console, List.Options, Specs])
  def list_targets(console, list_options, specs):
    ...
  
  def rules():
    return [
      list_targets,
    ]

I'd think of this example like an ~integration example. You showed above how the building blocks work like datatype and @rule. Here, you show how it all comes together.

dependencies via `pants_test.engine.util.run_rule`, or extend `pants_test.test_base.TestBase` to
construct and execute a scheduler for a given set of rules.
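A `register.py` along these lines might look like the sketch below. To keep it self-contained, `rule` here is a hypothetical stand-in that only records the declared types; in a real plugin the decorator would be imported from the engine rather than defined locally, and `encode` is an illustrative rule that does not come from the README.

```python
def rule(product_type, input_selectors):
    """Hypothetical stand-in for the engine's `@rule` decorator."""
    def decorator(func):
        func.product_type = product_type
        func.input_selectors = list(input_selectors)
        return func
    return decorator


@rule(str, [int])
def int_to_str(an_int):
    return str(an_int)


@rule(bytes, [str])
def encode(a_str):
    return a_str.encode('utf-8')


def rules():
    # Return every rule that this plugin wants to install.
    return [
        int_to_str,
        encode,
    ]


# Summarize what would be installed: (rule name, product type) pairs.
installed = [(r.__name__, r.product_type) for r in rules()]
```

Because this stand-in leaves each function callable, a test can also invoke a rule directly with hand-built arguments, which is roughly the idea behind running a rule with mocked dependencies.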

In general, there are two types of rules that you can define:

1. an `@rule`, which has a single product type and selects its inputs as described above.
2. a `RootRule`, which declares a type that can be used as a *subject*, which means it can be
provided as an input to a `product_request()`.

In more depth, a `RootRule` for some type is required when no other rule might provide that
type (i.e. it is not provided as the product of any `@rule`) in some context. In the absence of a
`RootRule`, any subject type involved in a request "at runtime" (i.e. via `product_request()`),
would show up as an unused or impossible path in the rule graph. Another potential name for
`RootRule` might be `ParamRule`, or something similar, as it can be thought of as saying that the
type represents a sort of "public API entrypoint" via a `product_request()`.

Note that `Get` requests do not require a `RootRule`, as their requests are statically verified when
the `@rule` definition is parsed, so we know before runtime that they might be requested.
2. a `RootRule`, which declares a type that a caller of the engine may provide as a `Param` in a
call to `Scheduler.product_request(..)` (ie, at the "root" of the graph).

This interface is being actively developed at this time and this documentation may be out of
date. Please feel free to file an issue or pull request if you notice any outdated or incorrect
information in this document!

## Execution
## Visualization

The engine executes work concurrently wherever possible; to help visualize executions, a visualization
tool is provided that, after executing a `Graph`, generates a `dot` file that can be rendered using
Graphviz:
To help visualize executions, the engine can render both the static rule graph that is compiled
on startup, and also the content of the `Graph` that is produced while `@rules` run. This generates
`dot` files that can be rendered using Graphviz:
Contributor:

Could link to #7024 or #7509 here.


```console
$ mkdir viz
@@ -214,17 +216,6 @@ $ ls viz
run.0.dot
```

## Native Engine

The native engine is integrated into the pants codebase via `native.py` in
this directory along with `build-support/bin/native/bootstrap.sh` which ensures a
pants native engine library is built and available for linking. The glue is the
sha1 hash of the native engine source code used as its version by the `Native`
class. This hash is maintained by `build-support/bin/native/bootstrap.sh` and
output to the `native_engine_version` file in this directory. Any modification
to this resource file's location will need adjustments in
`build-support/bin/native/bootstrap.sh` to ensure the linking continues to work.

## History

The need for an engine that could schedule all work as a result of linking required products to
@@ -240,7 +231,7 @@ Work stalled on the later phases of the `RoundEngine` and talks re-booted about
it stood and proposed the idea of a "tuple-engine". With some license taken in representation, this
idea took the `RoundEngine` to the extreme of generating a round for each target-task pair. The
pair formed the tuple of schedulable work and this concept combined with others to form the design
[here][tuple-design].
[here](https://docs.google.com/document/d/1OARyIZSnw6XQiPlMydi57l_tS_JbFTJH6KLX61kPInI/edit?usp=sharing).
Contributor:

Thanks for fixing!


Meanwhile, need for fine-grained parallelism was acute to help speed up jvm compilation, especially
in the context of scala and mixed scala & java builds. Twitter spiked on a project to implement