Commit 3eb3b3f

Update for publishing
1 parent 8f36b0d commit 3eb3b3f

4 files changed

+198
-43
lines changed

README.md

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
1+
# Millrun
2+
3+
## A Python library and CLI tool for automating the execution of papermill
4+
5+
### Motivation
6+
7+
Papermill is great: it parameterizes a single notebook for you. Ok, so what about this whole directory of notebooks that I would like to execute with this list of different parameters?
8+
9+
**Millrun** will execute either a single notebook or all of the notebooks in a directory (recursively, if you want), using either a list of alternative parameter dictionaries or a dictionary with a list of variations.
10+
11+
In short, it iterates both over notebooks in a directory AND over lists of parameters.
12+
13+
_When executed as a CLI tool, notebooks are executed in parallel using multi-processing_.
14+
15+
## Installation
16+
17+
`pip install millrun`
18+
19+
## Usage: Python Library
20+
21+
```python
22+
import millrun
23+
24+
millrun.execute_run(
25+
notebook_dir_or_file: pathlib.Path | str,
26+
bulk_params: list | dict,
27+
output_dir: Optional[pathlib.Path | str] = None,
28+
output_prepend_components: Optional[list[str]] = None,
29+
output_append_components: Optional[list[str]] = None,
30+
recursive: bool = False,
31+
exclude_glob_pattern: Optional[str] = None,
32+
include_glob_pattern: Optional[str] = None,
33+
use_multiprocessing: bool = False,
34+
**kwargs, # kwargs are passed through to papermill
35+
)
36+
```
37+
38+
## Usage: CLI tool
39+
40+
```
41+
millrun --help
42+
43+
Usage: millrun [OPTIONS] NOTEBOOK_DIR_OR_FILE PARAMS
44+
45+
Executes a notebook or directory of notebooks using the provided bulk parameters JSON file
46+
47+
48+
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
49+
│ * notebook_dir_or_file TEXT Path to a notebook file or a directory containing notebooks. │
50+
│ [default: None] │
51+
│ [required] │
52+
│ * params TEXT JSON file that contains parameters for notebook execution. Can │
53+
│ either be a 'list of dict' or 'dict of list'. │
54+
│ [default: None] │
55+
│ [required] │
56+
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
57+
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
58+
│ --output-dir TEXT Directory to place output files into. If not │
59+
│ provided the file directory will be used. │
60+
│ [default: None] │
61+
│ --prepend TEXT Prepend components to use on output filename.Can │
62+
│ use dict keys from 'params' which will be │
63+
│ evaluated.(Comma-separated values). │
64+
│ [default: None] │
65+
│ --append TEXT Append components to use on output filename.Can │
66+
│ use dict keys from 'params' which will be │
67+
│ evaluated.(Comma-separated values). │
68+
│ [default: None] │
69+
│ --recursive --no-recursive [default: no-recursive] │
70+
│ --exclude-glob-pattern TEXT [default: None] │
71+
│ --include-glob-pattern TEXT [default: None] │
72+
│ --help Show this message and exit. │
73+
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
74+
75+
```
76+
77+
### Example
78+
79+
While the `prepend` argument is optional, it is highly recommended that you take advantage of it. Otherwise, your output file names will be automatically prepended with an integer index to differentiate the output files.
80+
81+
```
82+
millrun ./Notebooks_Dir params.json --prepend id_key_in_params
83+
```
84+
85+
Where `id_key_in_params` is one of the keys in your params.json that you can use to uniquely identify each iteration. If you do not have a single unique key, you can provide a list of keys and they will all be prepended:
86+
87+
Let's say my params.json looked like this:
88+
89+
```json
90+
{
91+
"x_values": [0, 1, 2],
92+
"y_values": [45, 32, 60]
93+
}
94+
95+
```
96+
97+
I could execute like this:
98+
99+
```
100+
millrun ./Notebooks_Dir params.json --prepend x_values,y_values,results
101+
```
102+
103+
And my output files would look like:
104+
105+
```
106+
0-45-results-special_calculation.ipynb
107+
1-32-results-special_calculation.ipynb
108+
2-60-results-special_calculation.ipynb
109+
```
110+
111+
**Notice**: Since "results" was not a key in my params.json, it gets passed through as a string literal.
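
The key-or-literal rule above can be sketched in a few lines. This is a simplified illustration, not millrun's own function: `sketch_output_name` and `first_variation` are hypothetical names, and the sketch handles only prepended components.

```python
import pathlib
from typing import Any


def sketch_output_name(
    notebook_filename: str,
    prepend_components: list[str],
    params: dict[str, Any],
) -> str:
    """Illustrative only: build '<prepends>-<stem><suffix>' for one variation."""
    # A component that is a key in params is replaced by its value;
    # anything else is kept as a literal string.
    prepends = [str(params.get(comp, comp)) for comp in prepend_components]
    path = pathlib.Path(notebook_filename)
    return "-".join(prepends + [path.stem]) + path.suffix


# The first variation from the params.json example, as a single dict.
first_variation = {"x_values": 0, "y_values": 45}
name = sketch_output_name(
    "special_calculation.ipynb",
    ["x_values", "y_values", "results"],
    first_variation,
)
print(name)  # 0-45-results-special_calculation.ipynb
```

Converting each value with `str()` matters here because the parameter values are integers and `"-".join` only accepts strings.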
112+
113+
## Organizing your parameters
114+
115+
You can have your parameters dictionary/JSON in one of two formats:
116+
117+
### Format 1: A list of dicts
118+
119+
```python
120+
[
121+
{"param1": 0, "param2": "hat", "param3": 21.2},
122+
{"param1": 1, "param2": "cat", "param3": 34.3},
123+
{"param1": 2, "param2": "bat", "param3": 200.0}
124+
]
125+
```
126+
127+
Where each notebook given to millrun will execute against each dictionary in the list.
128+
129+
130+
### Format 2: A dict of lists
131+
132+
```python
133+
{
134+
"param1": [0, 1, 2],
135+
"param2": ["hat", "cat", "bat"],
136+
"param3": [21.2, 34.3, 200.0]
137+
}
138+
```
139+
140+
This format is offered as a convenience format. Internally, it is converted into "Format 1" prior to execution.
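
As a rough sketch of what that conversion amounts to (not the library's actual implementation), the lists can be zipped position by position. The function name `dict_of_lists_to_list_of_dicts` and the equal-length check are assumptions made for this illustration.

```python
from typing import Any


def dict_of_lists_to_list_of_dicts(bulk: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Illustrative only: pair up the i-th value of every list into one dict."""
    lengths = {len(values) for values in bulk.values()}
    if len(lengths) > 1:
        # Assumption for this sketch: every list must have the same length.
        raise ValueError("all parameter lists must have the same length")
    keys = list(bulk)
    return [dict(zip(keys, row)) for row in zip(*bulk.values())]


format_2 = {
    "param1": [0, 1, 2],
    "param2": ["hat", "cat", "bat"],
    "param3": [21.2, 34.3, 200.0],
}
format_1 = dict_of_lists_to_list_of_dicts(format_2)
print(format_1[0])  # {'param1': 0, 'param2': 'hat', 'param3': 21.2}
```

The values are paired by position rather than combined as a Cartesian product, which matches the output file names shown in the CLI example (`0-45`, `1-32`, `2-60`).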
141+
142+
143+
## CLI parallel execution
144+
145+
Since millrun iterates over two dimensions (each notebook and then each dict of parameters in the list), there are two ways of parallelizing:
146+
147+
1. Execute each notebook in sequence and parallelize the execution of the different parameter variations
148+
2. Execute each notebook in parallel and sequentialize the execution of the different parameter variations
149+
150+
Because of my own personal use cases, it is more efficient for me to use **1.** because I have way more parameter variations than I do notebooks.
151+
152+
However, this method becomes inefficient if you have MANY notebooks and only 1-3 variations. In that case, you would probably prefer method **2**. It is still faster than single-process execution (like you get )
153+
154+
If you need this use case then feel free to raise an issue and/or contribute a PR to implement it as an option for execution.
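
The two options described above can be sketched with `concurrent.futures`. This is only an illustration of the scheduling difference, not millrun's code: `fake_execute` is a hypothetical stand-in for running one notebook with one set of parameters, and a thread pool is used to keep the sketch self-contained. A `ProcessPoolExecutor` could be passed in instead, since only top-level functions and `functools.partial` objects are submitted.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import partial
from typing import Any, Callable

Work = Callable[[str, dict[str, Any]], str]


def fake_execute(notebook: str, params: dict[str, Any]) -> str:
    """Hypothetical stand-in for executing one notebook with one set of parameters."""
    return f"{params['id']}-{notebook}"


def _run_all_variations(work: Work, variations: list[dict[str, Any]], notebook: str) -> list[str]:
    # Sequential inner loop used by option 2.
    return [work(notebook, params) for params in variations]


def notebooks_in_sequence(notebooks, variations, work: Work, executor: Executor) -> list[str]:
    """Option 1: loop over notebooks, fan out the variations of each one."""
    results = []
    for notebook in notebooks:
        results.extend(executor.map(partial(work, notebook), variations))
    return results


def notebooks_in_parallel(notebooks, variations, work: Work, executor: Executor) -> list[str]:
    """Option 2: fan out the notebooks, loop over the variations inside each task."""
    batches = executor.map(partial(_run_all_variations, work, variations), notebooks)
    return [result for batch in batches for result in batch]


notebooks = ["a.ipynb", "b.ipynb"]
variations = [{"id": 0}, {"id": 1}, {"id": 2}]

with ThreadPoolExecutor(max_workers=2) as pool:
    option_1 = notebooks_in_sequence(notebooks, variations, fake_execute, pool)
    option_2 = notebooks_in_parallel(notebooks, variations, fake_execute, pool)
```

Both functions produce the same results in the same order. What differs is the unit of parallel work: option 1 keeps the pool busy when there are many variations, while option 2 does so when there are many notebooks.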

src/millrun/__init__.py

Lines changed: 8 additions & 2 deletions
@@ -1,2 +1,8 @@
1-
def main() -> None:
2-
print("Hello from millrun!")
1+
"""
2+
Millrun: A Python library and CLI tool to automate the execution of notebooks
3+
with papermill.
4+
"""
5+
6+
__version__ = "0.1.0"
7+
8+
from .millrun import execute_run

src/millrun/cli.py

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
44
import pathlib
55

66
import typer
7-
from .millrun import execute_batch
7+
from .millrun import execute_run
88

99

1010
def _parse_json(filepath: str) -> dict:
@@ -69,7 +69,7 @@ def run(
6969
output_dir = pathlib.Path(output_dir)
7070
else:
7171
output_dir = pathlib.Path.cwd()
72-
execute_batch(
72+
execute_run(
7373
notebook_dir_or_file,
7474
params,
7575
output_dir,

src/millrun/millrun.py

Lines changed: 34 additions & 39 deletions
@@ -9,7 +9,7 @@
99

1010

1111

12-
def execute_batch(
12+
def execute_run(
1313
notebook_dir_or_file: pathlib.Path | str,
1414
bulk_params: list | dict,
1515
output_dir: Optional[pathlib.Path | str] = None,
@@ -90,15 +90,15 @@ def execute_batch(
9090
output_dir = notebook_dir
9191
if not output_dir.exists():
9292
output_dir.mkdir(exist_ok=True, parents=True)
93-
9493
if notebook_filename is not None:
9594
execute_notebooks(
9695
notebook_dir / notebook_filename,
9796
bulk_params_list,
9897
output_prepend_components,
9998
output_append_components,
10099
output_dir,
101-
use_multiprocessing
100+
use_multiprocessing,
101+
**kwargs
102102
)
103103
else:
104104
glob_method = notebook_dir.glob
@@ -115,19 +115,17 @@ def execute_batch(
115115
included_paths = set(glob_method(glob_pattern))
116116

117117
notebook_paths = sorted(included_paths - excluded_paths)
118-
119118
for notebook_path in notebook_paths:
119+
120120
execute_notebooks(
121121
notebook_path,
122122
bulk_params_list,
123123
output_prepend_components,
124124
output_append_components,
125125
output_dir,
126126
use_multiprocessing,
127-
)
128-
# Multiprocessing approach inspired by
129-
# https://www.deanmontgomery.com/2022/03/24/rich-progress-and-multiprocessing/
130-
127+
**kwargs
128+
)
131129

132130

133131

@@ -163,35 +161,6 @@ def convert_bulk_params_to_list(bulk_params: dict[str, list]):
163161
return bulk_params_list
164162

165163

166-
def get_output_name(
167-
notebook_filename: str,
168-
output_prepend_components: list[str] | None,
169-
output_append_components: list[str] | None,
170-
notebook_params: dict[str, Any]
171-
) -> str:
172-
"""
173-
Returns the output name given the included components.
174-
"""
175-
if output_prepend_components is None:
176-
output_prepend_components = []
177-
if output_append_components is None:
178-
output_append_components = []
179-
prepends = [notebook_params[comp] for comp in output_prepend_components]
180-
appends = [notebook_params[comp] for comp in output_append_components]
181-
prepend_str = "-".join(prepends)
182-
append_str = "-".join(appends)
183-
notebook_filename = pathlib.Path(notebook_filename)
184-
return "-".join([elem for elem in [prepend_str, notebook_filename.stem, append_str] if elem]) + notebook_filename.suffix
185-
186-
# notebook_path,
187-
# bulk_params_list,
188-
# output_prepend_components,
189-
# output_append_components,
190-
# output_dir,
191-
# _progress,
192-
# task_id
193-
194-
195164
def execute_notebooks(
196165
notebook_path: pathlib.Path,
197166
bulk_params_list: dict[str, Any],
@@ -229,6 +198,8 @@ def execute_notebooks(
229198
TimeElapsedColumn(),
230199
refresh_per_second=1, # bit slower updates
231200
) as progress:
201+
# Multiprocessing approach inspired by
202+
# https://www.deanmontgomery.com/2022/03/24/rich-progress-and-multiprocessing/
232203
futures = [] # keep track of the jobs
233204
with multiprocessing.Manager() as manager:
234205
_progress = manager.dict()
@@ -275,7 +246,8 @@ def execute_notebook(
275246
notebook_filename,
276247
output_prepend_components,
277248
output_append_components,
278-
notebook_params
249+
notebook_params,
250+
current_iteration
279251
)
280252
pm.execute_notebook(
281253
notebook_filename,
@@ -286,4 +258,27 @@ def execute_notebook(
286258
**kwargs
287259
)
288260
if _progress is not None:
289-
_progress[_task_id] = {"progress": current_iteration / total_variations, "total": total_variations}
261+
_progress[_task_id] = {"progress": current_iteration, "total": total_variations}
262+
263+
264+
def get_output_name(
265+
notebook_filename: str,
266+
output_prepend_components: list[str] | None,
267+
output_append_components: list[str] | None,
268+
notebook_params: dict[str, Any],
269+
current_index: int
270+
) -> str:
271+
"""
272+
Returns the output name given the included components.
273+
"""
274+
if output_prepend_components is None:
275+
output_prepend_components = [str(current_index)]
276+
if output_append_components is None:
277+
output_append_components = []
278+
prepends = [str(notebook_params.get(comp, comp)) for comp in output_prepend_components]
279+
appends = [str(notebook_params.get(comp, comp)) for comp in output_append_components]
280+
prepend_str = "-".join(prepends)
281+
append_str = "-".join(appends)
282+
notebook_filename = pathlib.Path(notebook_filename)
283+
output_name = "-".join([elem for elem in [prepend_str, notebook_filename.stem, append_str] if elem]) + notebook_filename.suffix
284+
return output_name
