Use @card(type='notebook') to programmatically run and render notebooks in your flows.
pip install metaflow-card-notebook
You may have seen the series of blog posts about Notebook Infrastructure at Netflix. Of particular interest is how notebooks are programmatically run, often in DAGs, to generate reports and dashboards:
Parameterized Execution of Notebooks | Notebooks in DAGs | Dependency Management & Scheduling |
---|---|---|
This way of generating reports and dashboards is very compelling, as it lets data scientists create content using environments and tools that they are familiar with. With @card(type='notebook') you can programmatically run and render notebooks as part of a DAG. This card allows you to accomplish the following:
- Run notebook(s) programmatically in your Metaflow DAGs.
- Access data from any step in your DAG so you can visualize it or otherwise use it to generate reports in a notebook.
- Render your notebooks as reports or model cards that can be embedded in various apps.
- Inject custom parameters into your notebook for execution.
- Ensure that notebook outputs are reproducible.
Additionally, you can use all of the features of Metaflow to manage the execution of notebooks, for example:
- Managing dependencies (ex: @conda)
- Requesting compute (ex: @resources)
- Parallel execution (ex: foreach)
- etc.
Here is an example of a dashboard generated by a notebook card:
You can see real examples of flows that generate these dashboards in examples/.
The notebook card injects the following five variables into your notebook:
- run_id
- step_name
- task_id
- flow_name
- pathspec
You can use these variables to retrieve the data you need from a flow. It is recommended that the first cell in your notebook defines these variables and that you designate this cell with the tag "parameters".
For an example of this, see tests/nbflow.ipynb.
Note: in the example notebook these variables are set to None; however, for prototyping you can set them to real values taken from flows that have previously been executed.
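As a minimal sketch, the first cell of a notebook (tagged "parameters") could look like the following. The None placeholders mirror what the note above describes for the example notebook; the injected values override them when the card runs the notebook.

```python
# First notebook cell, tagged "parameters".
# These are placeholders: the notebook card overrides them at execution time.
# While prototyping, you can temporarily replace None with values from a real run.
run_id = None
step_name = None
task_id = None
flow_name = None
pathspec = None
```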
You can render cards from notebooks using the @card(type='notebook') decorator on a step. For example, in tests/nbflow.py, the notebook tests/nbflow.ipynb is run and rendered programmatically:
from metaflow import step, current, FlowSpec, Parameter, card

class NBFlow(FlowSpec):

    # When True, cell inputs are hidden in the rendered card.
    exclude_nb_input = Parameter('exclude_nb_input', default=True, type=bool)

    @step
    def start(self):
        # Store data that the notebook will read later.
        self.data_for_notebook = "I Will Print Myself From A Notebook"
        self.next(self.end)

    @card(type='notebook')
    @step
    def end(self):
        # Options for the notebook card; only input_path is required.
        self.nb_options_dict = dict(input_path='nbflow.ipynb', exclude_input=self.exclude_nb_input)

if __name__ == '__main__':
    NBFlow()
Note how the start step stores some data that we want to access from a notebook later. We will discuss how to access this data from a notebook in the next step.
By default, a step that is decorated with @card(type='notebook') expects the variable nb_options_dict to be defined in the step. This variable is a dictionary of arguments that is passed to papermill.execute_notebook. Only the input_path argument is required. If output_path is absent, it is automatically set to _rendered_<run_id>_<step_name>_<task_id>_<your_input_notebook_name>.ipynb.
Furthermore, exclude_input is an additional boolean argument that specifies whether to hide cell inputs in the rendered card. It is False by default, which means inputs are shown.
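The default output_path pattern described above can be illustrated with a small helper. This is only a sketch of the documented naming scheme, not the card's actual implementation; default_output_path is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def default_output_path(run_id, step_name, task_id, input_path):
    """Hypothetical helper that mimics the documented default output name."""
    # Take the notebook file name without its ".ipynb" extension.
    notebook_name = Path(input_path).stem
    return f"_rendered_{run_id}_{step_name}_{task_id}_{notebook_name}.ipynb"


# Example values are made up for illustration.
print(default_output_path("1234", "end", "5", "nbflow.ipynb"))
# prints "_rendered_1234_end_5_nbflow.ipynb"
```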
Recall that run_id, step_name, task_id, flow_name and pathspec are injected into the notebook. We can use these variables in a notebook together with Metaflow's utilities for inspecting Flows and Results. We demonstrate this in tests/nbflow.ipynb.
Some notes about this notebook:
- We recommend printing the variables injected into the notebook. This can help with debugging and provide an easy-to-locate lineage.
- We demonstrate how to access your flow's data via a Step or a Task object. You can read more about the relationship between these objects in these docs. In short, a Task is a child of a Step because a Step can have many tasks (for example if you use a foreach construct for parallelism).
- We recommend executing a run manually and prototyping the notebook by temporarily supplying the run_id, flow_name, etc. to achieve the desired result.
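The notes above can be sketched in code as follows. This is an illustrative sketch, not the contents of tests/nbflow.ipynb: the helper names (step_pathspec, task_pathspec, read_artifact) are hypothetical, and the sketch assumes the artifact is read from the start step of the NBFlow example using Metaflow's Step client object.

```python
def step_pathspec(flow_name, run_id, step_name):
    """Build the pathspec string that identifies a step within a run."""
    return f"{flow_name}/{run_id}/{step_name}"


def task_pathspec(flow_name, run_id, step_name, task_id):
    """Build the pathspec string that identifies one task of a step."""
    return f"{step_pathspec(flow_name, run_id, step_name)}/{task_id}"


def read_artifact(flow_name, run_id, step_name, artifact_name):
    """Read an artifact stored by a step (hypothetical helper)."""
    # Imported lazily so the pathspec helpers work without Metaflow installed.
    from metaflow import Step

    step = Step(step_pathspec(flow_name, run_id, step_name))
    # step.task returns one task of the step; its data holds the artifacts.
    return getattr(step.task.data, artifact_name)


# Inside the notebook, after the injected variables are set, you might run:
# print(run_id, step_name, task_id, flow_name, pathspec)
# print(read_artifact(flow_name, run_id, "start", "data_for_notebook"))
```

Printing the injected variables first, as recommended above, makes it easy to see which run a rendered notebook belongs to.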
To test the card in the example outlined above, you must first run the flow (the parentheses allow the commands to run in a subshell):
(cd tests && python nbflow.py run)
Then, render the card:
(cd tests && python nbflow.py card view end)
In this flow, cell inputs are hidden by default when the card is rendered, because the exclude_nb_input parameter defaults to True. For learning purposes, it can be useful to render the card with the inputs to see how the notebook is executed. You can do this by setting the exclude_nb_input parameter defined in the flow to False:
(cd tests && python nbflow.py run --exclude_nb_input=False && python nbflow.py card view end)
The @card(type='notebook') decorator is an opinionated way to execute and render notebooks that trades flexibility for significantly less code. While some customization is possible by passing the appropriate arguments to nb_options_dict, as listed in papermill.execute_notebook, you can achieve more fine-grained control by executing and rendering the notebook yourself and using the html card. We show an example of this in examples/deep_learning/dl_flow.py:
@card(type='html')
@step
def nb_manual(self):
    """
    Run & Render Jupyter Notebook Manually With The HTML Card.

    Using the html card provides you greater control over notebook execution and rendering.
    """
    import papermill as pm
    output_nb_path = 'notebooks/rendered_Evaluate.ipynb'
    output_html_path = output_nb_path.replace('.ipynb', '.html')
    # Execute the notebook, injecting run_id and flow_name as parameters.
    pm.execute_notebook('notebooks/Evaluate.ipynb',
                        output_nb_path,
                        parameters=dict(run_id=current.run_id,
                                        flow_name=current.flow_name,)
                        )
    # Convert the executed notebook to HTML without inputs or prompts.
    # `run` is a shell-command helper defined or imported elsewhere in dl_flow.py (not shown in this excerpt).
    run(f'jupyter nbconvert --to html --no-input --no-prompt {output_nb_path}')
    # Store the rendered HTML in self.html for the html card.
    with open(output_html_path, 'r') as f:
        self.html = f.read()
    self.next(self.end)
You can run the following command in your terminal to see the output of this step (this may take several minutes):
(cd examples/deep_learning && python dl_flow.py run && python dl_flow.py card view nb_manual)
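If you prefer not to shell out to jupyter nbconvert, the conversion step can also be sketched with nbconvert's Python API. This is an alternative sketch, not what dl_flow.py does: html_path_for and render_notebook_html are hypothetical helper names, and the exclude options are assumed here to approximate the --no-input and --no-prompt flags.

```python
from pathlib import Path


def html_path_for(notebook_path):
    """Derive the .html path for a notebook (hypothetical helper)."""
    # with_suffix only changes the final extension, unlike str.replace,
    # which would also rewrite ".ipynb" appearing elsewhere in the path.
    return Path(notebook_path).with_suffix(".html").as_posix()


def render_notebook_html(notebook_path):
    """Return the HTML for an already executed notebook (hypothetical helper)."""
    # Imported lazily so html_path_for works without nbconvert installed.
    from nbconvert import HTMLExporter

    exporter = HTMLExporter()
    exporter.exclude_input = True          # hide cell inputs
    exporter.exclude_input_prompt = True   # hide "In [ ]:" prompts
    exporter.exclude_output_prompt = True  # hide "Out[ ]:" prompts
    body, _resources = exporter.from_filename(notebook_path)
    return body


# In a step decorated with @card(type='html'), you might then write:
# self.html = render_notebook_html('notebooks/rendered_Evaluate.ipynb')
```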
Many issues can be resolved by providing the right arguments to papermill.execute_notebook. Below are some common issues and examples of how to resolve them:
- Kernel Name: The name of the Python kernel you use locally may be different from your remote execution environment. By default, papermill will attempt to find a kernel name in the metadata of your notebook, which is often automatically created when you select a kernel while running a notebook. You can use the kernel_name argument to specify a kernel. Below is an example:
@card(type='notebook')
@step
def end(self):
    self.nb_options_dict = dict(input_path='nbflow.ipynb', kernel_name='Python3')
- Working Directory: The working directory may be important when your notebook is executed, especially if your notebooks rely on certain files or other assets. You can set the working directory the notebook is executed in with the cwd argument. For example, to set the working directory to data/:
@card(type='notebook')
@step
def end(self):
    self.nb_options_dict = dict(input_path='nbflow.ipynb', cwd='data/')
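When debugging kernel issues, it can help to check which kernel name a notebook records in its metadata, since that is the value papermill falls back to when kernel_name is not given. The following is a minimal sketch using only the standard library; both function names are hypothetical, and it assumes the usual notebook JSON layout where the name lives under metadata.kernelspec.name.

```python
import json


def kernel_name_from_metadata(notebook):
    """Return the kernel name stored in a parsed notebook dict, or None."""
    return notebook.get("metadata", {}).get("kernelspec", {}).get("name")


def load_notebook(path):
    """Parse a .ipynb file, which is a JSON document, into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example usage (assumes nbflow.ipynb is in the current directory):
# print(kernel_name_from_metadata(load_notebook("nbflow.ipynb")))
```

If this returns None, or a name that does not exist in the remote environment, passing kernel_name explicitly in nb_options_dict is the safer choice.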
If you are running your flow remotely, for example with @batch, you must remember to include the dependencies for this notebook card itself! One way to do this is using pip as illustrated below:
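@card(type='notebook')
@step
def end(self):
    import os, sys
    # Install the card's dependencies with the interpreter running this step.
    # The requirements are quoted so the shell does not treat ">=" as a redirect.
    os.system(f'{sys.executable} -m pip install "ipykernel>=6.4.1" "papermill>=2.3.3" "nbconvert>=6.4.1" "nbformat>=5.1.3"')
    self.nb_options_dict = dict(input_path='nbflow.ipynb')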
Note: You can omit the pip install step above if your target environment already includes all the dependencies listed in settings.ini. If you do omit pip install, make sure that your environment pins the correct version numbers as well.
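One way to check whether an environment already has these dependencies is to look up the installed versions from Python. The sketch below uses only the standard library (importlib.metadata, Python 3.8+); installed_versions is a hypothetical helper, and the package list is taken from the pip command above rather than read from settings.ini.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip install command above.
REQUIRED = ["ipykernel", "papermill", "nbconvert", "nbformat"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(REQUIRED).items():
    print(f"{name}: {installed or 'NOT INSTALLED'}")
```

This only reports what is installed; comparing the versions against the minimums is left to you.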
If you are running steps remotely, you must ensure that your notebooks are uploaded to the remote environment with the CLI argument --package-suffixes=".ipynb". For example, to execute examples/deep_learning/dl_flow.py with this argument:
(cd examples/deep_learning && python dl_flow.py --package-suffixes=".ipynb" run)
We provide several examples of flows that contain the notebook card in examples/.