Commit 74631e3

Add a guide to migrate from scripts to pytask. (#330)
1 parent a3b4a5c commit 74631e3

5 files changed

+266
-6
lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ repos:
 (?x)^(
 docs/source/how_to_guides/bp_structure_of_task_files.md|
 docs/source/how_to_guides/how_to_influence_build_order.md|
+docs/source/how_to_guides/migrating_from_scripts_to_pytask.md|
 docs/source/how_to_guides/repeating_tasks_with_different_inputs_the_pytest_way.md|
 docs/source/reference_guides/hookspecs.md|
 docs/source/tutorials/configuration.md|
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
<div class="termy">

```console

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python <span style="color: var(--termynal-blue)">3.10.0</span>, pytask <span style="color: var(--termynal-blue)">0.2.0</span>, pluggy <span style="color: var(--termynal-blue)">1.0.0</span>
Root: C:\Users\pytask-dev\git\my_project
Collected <span style="color: var(--termynal-blue)">1</span> task.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ <span class="termynal-dim">task_data_preparation.py::</span>task_prepare_data │ <span class="termynal-success">.</span> │
└─────────────────────────────────────────────┴─────────┘

<span class="termynal-dim">──────────────────────────────────────────────────────────────────────────────</span>
<span class="termynal-success">╭───────────</span> <span style="font-weight: bold;">Summary</span> <span class="termynal-success">────────────╮</span>
<span class="termynal-success">│</span> <span style="font-weight: bold;"> 1 Collected tasks </span> <span class="termynal-success">│</span>
<span class="termynal-success">│</span> <span class="termynal-success-textonly"> 1 Succeeded (100.0%) </span> <span class="termynal-success">│</span>
<span class="termynal-success">╰────────────────────────────────╯</span>
<span class="termynal-success">───────────────────────── Succeeded in 30.6 seconds ──────────────────────────</span>
```

</div>

docs/source/changes.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
 - {pull}`313` refactors the configuration. INI configurations are no longer supported.
 - {pull}`326` fixes the badge for status of the workflow.
 - {pull}`329` adds ruff to pre-commit hooks.
+- {pull}`330` adds a guide for migrating from scripts to pytask.
 - {pull}`332` refactors `database.py`.

 ## 0.2.7 - 2022-12-14

docs/source/how_to_guides/index.md

Lines changed: 6 additions & 6 deletions
@@ -4,13 +4,14 @@ This section contains two collections of documents.

 ## How-to Guides

-The first collection of how-to guides provide detailed explanations on how to accomplish
+The first collection of how-to guides provides detailed explanations on accomplishing
 specific tasks with pytask.

 ```{toctree}
 ---
 maxdepth: 1
 ---
+migrating_from_scripts_to_pytask
 invoking_pytask_extended
 capture_warnings
 repeating_tasks_with_different_inputs_the_pytest_way
@@ -22,12 +23,11 @@ how_to_write_a_plugin

 The second collection comprises best practice guides for pytask. The guides combine
 experience with pytask and build systems in general, research projects, and software
-engineering to provide useful and easily understandable instructions.
+engineering to provide practical and easily understandable instructions.

-Contributions in any form - additions, comments, own experiences, request for
-clarifications - are highly appreciated. File either
-[issue](https://github.com/pytask-dev/pytask/issues) or start
-[discussion](https://github.com/pytask-dev/pytask/discussions).
+Contributions - additions, comments, experiences, and requests for clarification - are
+highly appreciated. Either file an [issue](https://github.com/pytask-dev/pytask/issues)
+or start a [discussion](https://github.com/pytask-dev/pytask/discussions).

 ```{toctree}
 ---
Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
# Migrating from scripts to pytask

Are you tired of managing tasks in your research workflows with scripts that get harder
to maintain over time? Then pytask is here to help!

With pytask, you can enjoy features like:

- **Lazy builds**. Only execute the scripts that need to be run or re-run because
  something has changed, saving you lots of time.
- **Parallelization**. Use
  [pytask-parallel](https://github.com/pytask-dev/pytask-parallel) to speed up your
  scripts by running them in parallel.
- **Cross-language projects**. pytask has several plugins for running scripts written in
  other popular languages: [pytask-r](https://github.com/pytask-dev/pytask-r),
  [pytask-julia](https://github.com/pytask-dev/pytask-julia), and
  [pytask-stata](https://github.com/pytask-dev/pytask-stata).

The following guide will walk you through a series of steps to quickly migrate your
scripts to a workflow managed by pytask. The focus is first on Python scripts, but the
guide concludes with an additional example of an R script.

## Installation

To get started with pytask, install it with pip or conda:

```console
$ pip install pytask pytask-parallel

$ conda install -c conda-forge pytask pytask-parallel
```

## From Python script to task

We must rewrite your scripts and move the executable part to a task function. Your code
might live in the main namespace of your script, like in this example.

```python
# Content of task_data_management.py
import pandas as pd


df = pd.read_csv("data.csv")

# Many operations.

df.to_pickle("data.pkl")
```

Or, you might use an `if __name__ == "__main__"` block like this example.

```python
# Content of task_data_management.py
import pandas as pd


def main():
    df = pd.read_csv("data.csv")

    # Many operations.

    df.to_pickle("data.pkl")


if __name__ == "__main__":
    main()
```

For pytask, you need to move the code into a task, which is a function whose name starts
with `task_`, in a module with the same prefix, like `task_data_management.py`.

```python
# Content of task_data_management.py
import pandas as pd


def task_prepare_data():
    df = pd.read_csv("data.csv")

    # Many operations.

    df.to_pickle("data.pkl")
```

Delete the `if __name__ == "__main__"` block if your script has one.
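
The naming convention matters because a task runner has to find your functions by name
instead of running the whole script on import. The following is a minimal sketch of
that idea using only the standard library. `collect_tasks` is a hypothetical helper
written for this illustration; it is not pytask's collection code, which may work
differently.

```python
from types import SimpleNamespace


def collect_tasks(namespace):
    """Return callables whose names start with ``task_``, keyed by name."""
    return {
        name: obj
        for name, obj in vars(namespace).items()
        if name.startswith("task_") and callable(obj)
    }


# Stand-ins for the contents of a module such as task_data_management.py.
def task_prepare_data():
    return "prepared"


def helper():
    return "not a task"


# SimpleNamespace plays the role of an imported module in this sketch.
module = SimpleNamespace(task_prepare_data=task_prepare_data, helper=helper)
tasks = collect_tasks(module)
print(sorted(tasks))
```

Only `task_prepare_data` is picked up; `helper` is ignored because it lacks the prefix.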

## Extracting dependencies and products

To let pytask know the order in which to execute tasks and when to re-run them, you'll
need to specify task dependencies and products using `@pytask.mark.depends_on` and
`@pytask.mark.produces`. Extract the paths to the inputs and outputs of your script and
pass them to the decorators. For example:

```python
# Content of task_data_management.py
import pandas as pd
import pytask


@pytask.mark.depends_on("data.csv")
@pytask.mark.produces("data.pkl")
def task_prepare_data(depends_on, produces):
    df = pd.read_csv(depends_on)

    # Many operations.

    df.to_pickle(produces)
```

The decorators allow you to use `depends_on` and `produces` as arguments to the
function and access the paths to the dependencies and products as {class}`pathlib.Path`.
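
Declared dependencies and products are what make it possible to decide when a task
needs to run again. As a rough illustration, here is a sketch of one way such a check
could work: a task is stale if a product is missing or a dependency's content hash has
changed since the last run. The helpers `needs_rerun` and `record_run` and the state
file are assumptions made for this example; pytask's own change detection may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(dependencies, products, state_file):
    """Return True if a product is missing or a dependency's hash changed."""
    state_file = Path(state_file)
    recorded = json.loads(state_file.read_text()) if state_file.exists() else {}
    if any(not Path(p).exists() for p in products):
        return True
    return any(recorded.get(str(d)) != file_hash(d) for d in dependencies)


def record_run(dependencies, state_file):
    """Store the current dependency hashes after a successful run."""
    hashes = {str(d): file_hash(d) for d in dependencies}
    Path(state_file).write_text(json.dumps(hashes))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    product = Path(tmp) / "data.pkl"
    state = Path(tmp) / "state.json"  # hypothetical bookkeeping file
    data.write_text("a,b\n1,2\n")

    first = needs_rerun([data], [product], state)  # product is missing
    product.write_text("result")
    record_run([data], state)
    second = needs_rerun([data], [product], state)  # nothing changed
    data.write_text("a,b\n3,4\n")
    third = needs_rerun([data], [product], state)  # dependency changed

print(first, second, third)
```

The script prints `True False True`: the task runs once, is skipped while nothing
changes, and becomes stale again after `data.csv` is edited.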

You can pass a dictionary to these decorators if you have multiple dependencies or
products. The dictionary's keys are the names of the dependencies or products, and the
values are the paths. Here is an example:

```python
import pandas as pd
import pytask


@pytask.mark.depends_on({"data_1": "data_1.csv", "data_2": "data_2.csv"})
@pytask.mark.produces("data.pkl")
def task_merge_data(depends_on, produces):
    df1 = pd.read_csv(depends_on["data_1"])
    df2 = pd.read_csv(depends_on["data_2"])

    df = df1.merge(df2, on=...)

    df.to_pickle(produces)
```

:::{seealso}
If you want to learn more about dependencies and products, read the
[tutorial](../tutorials/defining_dependencies_products.md).
:::
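
To see why declaring dependencies and products is enough to derive an execution order,
consider the following sketch. It links each task to the tasks that produce its inputs
and sorts them topologically with the standard library's `graphlib` (Python 3.9+). The
task names and file names are hypothetical, and the code illustrates the general
technique, not pytask's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each one lists the files it reads and the files it writes.
tasks = {
    "task_merge_data": {
        "depends_on": ["data_1.pkl", "data_2.pkl"],
        "produces": ["data.pkl"],
    },
    "task_prepare_data_1": {"depends_on": ["data_1.csv"], "produces": ["data_1.pkl"]},
    "task_prepare_data_2": {"depends_on": ["data_2.csv"], "produces": ["data_2.pkl"]},
}


def execution_order(tasks):
    """Order tasks so that every product exists before a task that needs it runs."""
    # Map each product to the task that creates it.
    producer = {p: name for name, t in tasks.items() for p in t["produces"]}
    # For each task, collect the tasks that must run first. Files without a
    # producer (raw inputs such as data_1.csv) impose no ordering.
    graph = {
        name: {producer[d] for d in t["depends_on"] if d in producer}
        for name, t in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(tasks)
print(order)
```

Both preparation tasks come before `task_merge_data`. Since they do not depend on each
other, they are also candidates for running in parallel.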

## Execution

Finally, execute your newly defined tasks with pytask. Assuming your scripts lie in the
current directory of your terminal or a subdirectory, run the following.

```{include} ../_static/md/migrating-from-scripts-to-pytask.md
```

Otherwise, pass the paths explicitly to the pytask executable.

If you have rewritten multiple scripts that can be run in parallel, use the
`-n/--n-workers` option to define the number of parallel tasks. pytask-parallel will
then automatically spawn multiple processes to run the workflow in parallel.

```console
$ pytask -n 4
```

:::{seealso}
You can find more information on pytask-parallel in the
[readme](https://github.com/pytask-dev/pytask-parallel) on GitHub.
:::

## Bonus: From R script to task

pytask wants to help you get your job done, and sometimes a different programming
language can make your life easier. Thus, pytask has several plugins to integrate code
written in R, Julia, and Stata. Here, we explore how to incorporate an R script with
[pytask-r](https://github.com/pytask-dev/pytask-r). You can also find more information
about the plugin in the repo's readme.

First, we will install the package.

```console
$ pip install pytask-r

$ conda install -c conda-forge pytask-r
```

:::{seealso}
Check out [pytask-julia](https://github.com/pytask-dev/pytask-julia) and
[pytask-stata](https://github.com/pytask-dev/pytask-stata), too!
:::

And here is the R script `prepare_data.r` that we want to integrate.

```r
# Content of prepare_data.r
df <- read.csv("data.csv")

# Many operations.

saveRDS(df, "data.rds")
```

Next, we create a task function to point pytask to the script and the dependencies and
products.

```python
# Content of task_data_management.py
import pytask


@pytask.mark.r(script="prepare_data.r")
@pytask.mark.depends_on("data.csv")
@pytask.mark.produces("data.rds")
def task_prepare_data():
    pass
```

pytask automatically makes the paths to the dependencies and products available to the
R file via a JSON file. Let us amend the R script to load the information from the JSON
file.

```r
# Content of prepare_data.r
library(jsonlite)

# Read the JSON file whose path is passed to the script
args <- commandArgs(trailingOnly=TRUE)
path_to_json <- args[length(args)]
config <- read_json(path_to_json)

df <- read.csv(config$depends_on)

# Many operations.

saveRDS(df, config$produces)
```
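
If the JSON handoff feels abstract, the following Python sketch mimics both sides of
it: one function writes the paths to a JSON file, and another reads them back from the
last command-line argument, as the R script does. The file name `task_config.json` and
the exact layout with top-level `depends_on` and `produces` keys are assumptions based
on the R code above; check the pytask-r readme for the format the plugin actually
writes.

```python
import json
import tempfile
from pathlib import Path


def write_task_config(path, depends_on, produces):
    """Serialize the task's paths to a JSON file (assumed layout)."""
    payload = {"depends_on": str(depends_on), "produces": str(produces)}
    Path(path).write_text(json.dumps(payload))


def read_task_config(argv):
    """Mirror the R script: the JSON path is the last command-line argument."""
    return json.loads(Path(argv[-1]).read_text())


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "task_config.json"  # hypothetical file name
    write_task_config(config_path, "data.csv", "data.rds")
    # Simulated argument vector; a real script would pass sys.argv instead.
    config = read_task_config(["prepare_data.py", str(config_path)])

print(config["depends_on"], config["produces"])
```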

## Conclusion

Congrats! You have just set up your first workflow with pytask!

If you enjoyed what you have seen, explore the other parts of the documentation. The
[tutorials](../tutorials/index.md) are a good entry point to start with pytask and learn
about many concepts step-by-step.
