There are two parts to Hamilton:
- Hamilton Functions. Hamilton Functions are what you, the end user, write.
- Hamilton Driver. Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate execution.
Let's dive deeper into these parts below, but first a word on terminology.
We use the following terms interchangeably, e.g. a ____ in Hamilton is ... :
- column
- variable
- node
- function
That's because we're representing columns as functions, which are parts of a directed acyclic graph (DAG). That is, a column is a part of a dataframe. To compute a column, we write a function that has input variables. From these functions we create a DAG and represent each function as a node, linking each input variable by an edge to its respective node.
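To make the mapping from functions to graph nodes concrete, here is a minimal sketch that uses only the standard library. It is not Hamilton's implementation: the functions spend_per_signup and spend_per_signup_doubled and the helper build_edges are hypothetical names chosen for illustration.

```python
import inspect


# Hypothetical "column" functions: each function name is a node,
# and each parameter name refers to another node.
def spend_per_signup(spend: float, signups: float) -> float:
    """Marketing spend divided by the number of signups."""
    return spend / signups


def spend_per_signup_doubled(spend_per_signup: float) -> float:
    """Depends on the node produced by the function above."""
    return spend_per_signup * 2


def build_edges(functions):
    """Return (upstream, downstream) pairs derived from parameter names."""
    edges = []
    for fn in functions:
        for param_name in inspect.signature(fn).parameters:
            edges.append((param_name, fn.__name__))
    return edges


edges = build_edges([spend_per_signup, spend_per_signup_doubled])
for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
```

Each printed pair is one edge: the parameter name on the left is the upstream node, and the function name on the right is the node that consumes it.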
Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.
A simple (but rather contrived) example of a Hamilton function that adds two numbers is as follows:
    def _sum(*vars):
        """Helper function to sum numbers.

        This is here to demonstrate that functions starting with _ do not get processed by hamilton.
        """
        return sum(vars)


    def sum_a_b(a: int, b: int) -> int:
        """Adds a and b together

        :param a: The first number to add
        :param b: The second number to add
        :return: The sum of a and b
        """
        return _sum(a, b)  # Delegates to a helper function
While this looks like a simple python function, there are a few components to note:
- The function name sum_a_b is a globally unique key. In the DAG there can only be one function named sum_a_b. While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated, and separate out that logic for debugging/iterating.
- The function sum_a_b depends on two upstream nodes, a and b. This means that these values must either be:
  - Defined by another function
  - Passed in by the user as a configuration variable (see Hamilton Driver Code below)
- The function sum_a_b makes full use of the python type-hint system. This is required in Hamilton, as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case, we know that the input a has to be an integer, the input b has to also be an integer, and anything that declares sum_a_b as an input has to declare it as an integer.
- Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and nodes, each function documentation also describes a piece of business logic.
- Functions that start with _ are ignored, and not included in the DAG. Hamilton tries to make use of every function in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.
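The points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Hamilton's code: the module name my_functions, the extra function sum_a_b_squared, and the helpers collect_nodes and required_inputs are all hypothetical.

```python
import inspect
import types

# Build a throwaway module in memory so the example is self-contained.
source = '''
def _sum(*vars):
    return sum(vars)

def sum_a_b(a: int, b: int) -> int:
    return _sum(a, b)

def sum_a_b_squared(sum_a_b: int) -> int:
    return sum_a_b ** 2
'''
module = types.ModuleType("my_functions")  # hypothetical module name
exec(source, module.__dict__)


def collect_nodes(mod):
    """Map node name -> function, skipping names that start with an underscore."""
    return {
        name: fn
        for name, fn in inspect.getmembers(mod, inspect.isfunction)
        if not name.startswith("_")
    }


def required_inputs(nodes):
    """Parameter names that no function produces; the user must supply these."""
    needed = set()
    for fn in nodes.values():
        for param in inspect.signature(fn).parameters:
            if param not in nodes:
                needed.add(param)
    return sorted(needed)


nodes = collect_nodes(module)
print(sorted(nodes))           # node names found in the module
print(required_inputs(nodes))  # inputs that must come from the user
```

The helper _sum never becomes a node, and a and b show up as required inputs because no function in the module is named a or b.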
Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However, this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly. The following will not work:
    from typing import Dict

    def some_func() -> Dict[str, int]:
        return {"a": 2}
The following will both work:
    from typing import Dict

    def some_func() -> Dict:
        return {1: 2}

    def some_func() -> dict:  # alternative using the built-in type
        return {1: 2}
While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.
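One way to see the limitation is with the standard library itself. The sketch below assumes the check is a plain issubclass comparison. That is an assumption made for illustration, not a description of Hamilton's actual validation code, and the helper name is_compatible is hypothetical.

```python
from typing import Dict


def is_compatible(produced, expected) -> bool:
    """Naive check: is the produced type a subclass of the expected type?"""
    try:
        return issubclass(produced, expected)
    except TypeError:
        # Parameterized generics such as Dict[str, int] cannot be used
        # with issubclass, so a naive check cannot validate them.
        return False


print(is_compatible(dict, dict))            # True
print(is_compatible(dict, Dict))            # True
print(is_compatible(dict, Dict[str, int]))  # False: issubclass raises TypeError
print(is_compatible(bool, int))             # True: bool is a subclass of int
```

Unparameterized Dict and dict behave like ordinary classes here, which is why they are the safe choice for annotations.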
For documentation on the actual Hamilton Driver code, we invite the reader to read the Driver class source code directly.
At a high level, the driver code does two things:
- Creates a Directed Acyclic Graph (DAG) from the functions you define.
    from hamilton import driver
    dr = driver.Driver(config, *modules_to_load)  # this creates the DAG from the modules you pass in.
- Orchestrates execution given the expected output and provided input.
    df = dr.execute(final_vars, overrides, display_graph)  # this executes the DAG appropriately to create the dataframe.
The driver object also has a few other methods, e.g. display_all_functions() and list_available_variables(), but they're really only used for debugging purposes.
Let's dive into the driver constructor call, and the execute method.
The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration. So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.
The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG. As such, it is passed in to the constructor, and used during DAG creation. This enables decorators such as @config.when.
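The idea that configuration shapes the graph can be illustrated with a toy. Everything here is hypothetical and stands in for the real mechanism: the decorator when, the functions tax_rate__us and tax_rate__uk, the helper build_nodes, and the convention of dropping the text after a double underscore are assumptions of this sketch, not Hamilton's API.

```python
def when(**conditions):
    """Hypothetical decorator: tag a function with the config it requires."""
    def decorator(fn):
        fn.required_config = conditions
        return fn
    return decorator


@when(region="US")
def tax_rate__us() -> float:
    return 0.07


@when(region="UK")
def tax_rate__uk() -> float:
    return 0.20


def build_nodes(config, functions):
    """Keep only the functions whose conditions match the config.

    In this sketch the text after a double underscore is dropped, so both
    variants compete for the same node name and the config picks one.
    """
    nodes = {}
    for fn in functions:
        conditions = getattr(fn, "required_config", {})
        if all(config.get(key) == value for key, value in conditions.items()):
            nodes[fn.__name__.split("__")[0]] = fn
    return nodes


candidates = [tax_rate__us, tax_rate__uk]
us_nodes = build_nodes({"region": "US"}, candidates)
uk_nodes = build_nodes({"region": "UK"}, candidates)
print(us_nodes["tax_rate"]())
print(uk_nodes["tax_rate"]())
```

Because the set of nodes is fixed when the graph is built, switching from one region to the other in this toy means calling build_nodes again, which mirrors the need to create a new Driver object when the configuration changes.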
Otherwise, the contents of the config dictionary should include all the inputs required for whatever final output you want to create. The configuration dictionary should not be used for overriding what Hamilton will compute. To do that, use the overrides parameter of execute() (see below).
You can pass any number of modules to the constructor. We traverse the modules in the order they are provided.
The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want in the dataframe. It also ensures that you have provided everything to execute properly.
As it executes, it uses a dictionary to memoize results, so that everything is computed only once. It executes the DAG via a recursive depth-first traversal, which leads to the possibility (although highly unlikely) of hitting python recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest displaying the DAG to verify this.
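The memoized, recursive walk described above can be sketched as follows. This is a simplified stand-in, not the Driver's code: it returns a plain dictionary rather than a dataframe, and the functions doubled and total, the call_log list, and this toy execute are hypothetical.

```python
import inspect

call_log = []  # records each function invocation, to make memoization visible


def sum_a_b(a: int, b: int) -> int:
    call_log.append("sum_a_b")
    return a + b


def doubled(sum_a_b: int) -> int:
    call_log.append("doubled")
    return sum_a_b * 2


def total(sum_a_b: int, doubled: int) -> int:
    call_log.append("total")
    return sum_a_b + doubled


def execute(nodes, final_vars, inputs):
    """Compute final_vars by walking dependencies depth-first."""
    memo = dict(inputs)  # user-provided inputs count as already computed

    def compute(name):
        if name in memo:
            return memo[name]
        if name not in nodes:
            raise ValueError(f"Missing input: {name}")
        fn = nodes[name]
        # Recurse into every parameter before calling the function itself.
        kwargs = {p: compute(p) for p in inspect.signature(fn).parameters}
        memo[name] = fn(**kwargs)
        return memo[name]

    return {name: compute(name) for name in final_vars}


nodes = {fn.__name__: fn for fn in (sum_a_b, doubled, total)}
result = execute(nodes, ["total"], {"a": 1, "b": 2})
print(result)    # {'total': 9}
print(call_log)  # ['sum_a_b', 'doubled', 'total']
```

Although sum_a_b is needed by both doubled and total, it appears once in call_log because its result is served from the dictionary the second time. In this sketch, a cycle between nodes would recurse until Python raises RecursionError, which is the failure mode the paragraph above warns about.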
To help speed up development of new or existing Hamilton Functions, we enable you to override parts of the DAG. What this means is that before calling execute(), you have computed some result that you want to use instead of what Hamilton would produce. To do so, you just pass in a dictionary of {'col_name': YOUR_VALUE} as the overrides argument to the execute function.
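A small sketch shows one way overrides can take effect: the override values are placed in the result cache before the walk starts, so the corresponding functions are never called. This is an assumption about the mechanism made for illustration, not the Driver's implementation, and sum_a_b_squared and this toy execute are hypothetical.

```python
import inspect


def sum_a_b(a: int, b: int) -> int:
    return a + b


def sum_a_b_squared(sum_a_b: int) -> int:
    return sum_a_b ** 2


def execute(nodes, final_vars, inputs, overrides=None):
    """Toy executor: overrides are seeded into the cache before the walk."""
    memo = {**inputs, **(overrides or {})}

    def compute(name):
        if name not in memo:
            fn = nodes[name]
            kwargs = {p: compute(p) for p in inspect.signature(fn).parameters}
            memo[name] = fn(**kwargs)
        return memo[name]

    return {name: compute(name) for name in final_vars}


nodes = {fn.__name__: fn for fn in (sum_a_b, sum_a_b_squared)}
normal = execute(nodes, ["sum_a_b_squared"], {"a": 1, "b": 2})
overridden = execute(nodes, ["sum_a_b_squared"], {}, overrides={"sum_a_b": 10})
print(normal)      # {'sum_a_b_squared': 9}
print(overridden)  # {'sum_a_b_squared': 100}
```

In the overridden run, a and b are not needed at all, because sum_a_b is taken from the overrides dictionary instead of being computed.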
To visualize the DAG that would be executed, pass the flag display_graph=True to execute(). It will render an image in PDF format.
For the backstory on Hamilton, we invite you to watch the ~9 minute lightning talk on it that we gave at the apply conference: video, slides.