Skip to content

Latest commit

 

History

History
49 lines (25 loc) · 10.8 KB

README.md

File metadata and controls

49 lines (25 loc) · 10.8 KB

The Runtime Graph

Overview

Pants' runtime Graph contains Nodes that represent memoized user logic and have a known output type. Edges between Nodes represent data dependencies, which allow Nodes to be invalidated when their inputs have changed.

Nodes

The Graph trait is generic for testing purposes, but in Pants' production usage, the Node trait is implemented for @rules and intrinsics (essentially, @rules implemented in Rust). The Node is a single generic parameter, and so from the Graph's perspective, all Node instances are identical: the production usage implements Node using an enum in order to isolate data and logic used for different purposes (running @rules, running intrinsics, etc).

Inputs and Identity

The Node trait requires both Eq and Hash, so a Node is its own memoization key and "identity". When an instance of a Node runs (via its run method), it has two sources of information that it can consume: its own fields/identity, and a NodeContext object which allows it to request the values of other Nodes and access any other information which should not contribute to the Node's identity.

For example: the Node implementation for a @rule (called "Task" for historical reasons) holds information that uniquely identifies that instance of the @rule: the Task struct differentiates it from other @rule definitions, and the Params (the inputs that will be used to compute its positional arguments and Gets) differentiate it from other instances of the same @rule definition.

It's important to differentiate the "identity" of a node from its current value, because the Graph has optimizations for the case of re-running an existing node with potentially changed inputs. Requesting a brand new Node (perhaps with changed Params, as in the example above) will always run the logic in that Node. On the other hand, re-requesting an existing Node might be able to skip re-running the Node's logic if its inputs have not changed. See the "Dirtying and Cleaning" section for more information on those optimizations.

Execution

When an external caller requests the output for a Node (via get), the Graph does a HashMap lookup for the Node to locate an Entry object for that node. The Entry object stores the current state of the node, including whether it is NotStarted, Running, or Completed. To get the value of a Node, Graph::get calls Entry::get_node_result.

In the simplest case, if the Node is already Completed, the Graph immediately returns the Node's result value from the Entry. If the Node is currently Running, the caller will be added as a waiter on the Entry, and pushed a value when it is ready. Finally, if the Node is NotStarted, its run method is launched (concurrently on a tokio task) and the Entry moves to the Running state.

Continuing the example above: the Node implementation for a @rule uses information (pre-computed by the RuleGraph) about the @rule's signature to request the values of Nodes for the positional arguments of the @rule, and then runs the @rule function. If the result of running the @rule is a generator value, the Node then continues running to request more Node values for Gets: otherwise it completes with the result value.

Invalidation

When a Node requests the value of another Node, it does so via NodeContext::get (implemented for the Context struct in production usage), which ends up calling Graph::get. Unlike an external caller, when a Node requests another Node, an edge is created in the Graph to record the dependency between the two Node instances. This tracking is used to implement invalidation of Nodes.

At any point (perhaps concurrently with Nodes running), external threads may call Graph::invalidate_from_roots to indicate that the result values of particular Node instances are no longer valid for some reason. Because the identity of a Node is immutable (see the "Inputs and Identity" section above), invalidation almost always needs to occur because some information that the Node fetched using the Context (or direct use of syscalls, etc) is no longer valid. In production usage, invalidation occurs due to filesystem changes: a file watching thread calls invalidate_from_roots for Node types that are associated with syscalls to read files when it suspects that files on disk have changed.

Confusingly, "root" in the context of invalidate_from_roots means a root in the dependent graph: calling the method for a root Node will "clear" the value of that Node, and mark the transitive dependents of that Node "dirty". No Nodes are ever deleted from the Graph, so "clearing" a Node means moving its Entry to the NotStarted state. Because it will be in the NotStarted state, the cleared Node will definitely re-run when it is next requested, but any dirty Nodes that might depend on it will only re-run if they cannot be "cleaned": see below.

Dirtying and Cleaning

When the Entry for a Node is marked dirty by "Invalidation" (see above) we only suspect that it might need to re-run, because one of its dependencies has been cleared. To determine whether it actually needs to re-run, the next time the dirty Node is requested, Graph::get (via Entry::get_node_result) will compare the Generation values of the dependencies used to compute the previous value of the Node with their current Generation values. If none of the (Generation values) of the Node have changed, then it does not need to re-run: this is called "cleaning" or "early cutoff" (in Build Systems à la Carte).

When a Node does need to re-run for some reason (either due to having been cleared by invalidation, or being marked dirty and then failing to be cleaned), its previous and new result are compared to determine whether its Generation value should increment. The Generation value avoids the need to keep multiple copies of a Node's result value in memory over time, and makes cleaning cheaper by converting potentially expensive equality checks into integer comparisons.

Uncacheable Nodes

Nodes can be marked "uncacheable" (via), which causes them to re-run once per "session" (controlled by a RunId on the Context), regardless of whether any of their inputs have changed. In production usage, a session always corresponds to a single run of Pants (or each iteration of a --loop), regardless of whether pantsd is in use.

A Node that is uncacheable is given an Uncacheable EntryState, which records which RunId it is valid for. When the value of an uncacheable Node is requested, the current RunId is compared to the RunId on the Node's EntryState, and the Node re-runs if it mismatches.

Uncacheable Nodes may have dependents, but in order to ensure that the uncacheable Node is always detected below them in the Graph, any cacheable Nodes that transitively depend on uncacheable Nodes complete in a different EntryState: [UncacheableDependencies](

// A value of type UncacheableDependencies has Uncacheable dependencies, and is treated as
// equivalent to Dirty in all cases except when `poll`d: since `poll` requests are waiting for
// meaningful work to do, they need to differentiate between a truly invalidated/changed (Dirty)
// Node and a Node that would be re-cleaned once per session.
UncacheableDependencies(N::Item),
). An UncacheableDependencies node for a mismatched RunId acts just like the Dirty state: the value it holds cannot be directly consumed until it is cleaned (or until it re-runs because its dependencies have changed). An UncacheableDependencies node is cleaned by updating its RunId.

As an example: if there is a cacheable Node A that depends on an uncacheable Node B, and both of them ran in some session, then they will be in the states UncacheableDependencies and Uncacheable, respectively. If a caller requests A in a new session, the Graph will observe the UncacheableDependencies state has a mismatched SessonId, and will attempt to clean A by checking that its dependencies have not changed. That check will recursively inspect B to see whether it has changed, and will unconditionally re-run B because it is in the Uncacheable state for a mismatched RunId. Just like in "Dirtying and Cleaning", if B returns the same value as it did in the first session, then its Generation will not change: A will observe that the Generation values of its dependencies have not changed, and will be cleaned with its previous value without re-running. But unlike a Node in the Dirty state, A will stay in the UncacheableDependencies state: only its RunId changes.