| 1 | +# Migrating from scripts to pytask |
| 2 | + |
| 3 | +Are you tired of managing tasks in your research workflows with scripts that get harder |
| 4 | +to maintain over time? Then pytask is here to help! |
| 5 | + |
| 6 | +With pytask, you can enjoy features like: |
| 7 | + |
| 8 | +- **Lazy builds**. Only execute the scripts that need to be run or re-run because |
| 9 | + something has changed, saving you lots of time. |
| 10 | +- **Parallelization**. Use |
| 11 | + [pytask-parallel](https://github.com/pytask-dev/pytask-parallel) to speed up your |
| 12 | + scripts by running them in parallel. |
| 13 | +- **Cross-language projects**. pytask has several plugins for running scripts written in |
| 14 | + other popular languages: [pytask-r](https://github.com/pytask-dev/pytask-r), |
| 15 | + [pytask-julia](https://github.com/pytask-dev/pytask-julia), and |
| 16 | + [pytask-stata](https://github.com/pytask-dev/pytask-stata). |
| 17 | + |
| 18 | +The following guide will walk you through a series of steps to quickly migrate your |
| 19 | +scripts to a workflow managed by pytask. The focus is first on Python scripts, but the |
| 20 | +guide concludes with an additional example of an R script. |
| 21 | + |
| 22 | +## Installation |
| 23 | + |
| 24 | +To get started with pytask, simply install it with pip or conda: |
| 25 | + |
| 26 | +```console |
| 27 | +$ pip install pytask pytask-parallel |
| 28 | + |
| 29 | +$ conda install -c conda-forge pytask pytask-parallel |
| 30 | +``` |
| 31 | + |
| 32 | +## From Python script to task |
| 33 | + |
| 34 | +You need to rewrite your scripts and move the executable part into a task function. Your |
| 35 | +script might keep its code in the top-level namespace, as in this example. |
| 36 | + |
| 37 | +```python |
| 38 | +# Content of task_data_management.py |
| 39 | +import pandas as pd |
| 40 | + |
| 41 | + |
| 42 | +df = pd.read_csv("data.csv") |
| 43 | + |
| 44 | +# Many operations. |
| 45 | + |
| 46 | +df.to_pickle("data.pkl") |
| 47 | +``` |
| 48 | + |
| 49 | +Or, you might use an `if __name__ == "__main__"` block, as in this example. |
| 50 | + |
| 51 | +```python |
| 52 | +# Content of task_data_management.py |
| 53 | +import pandas as pd |
| 54 | + |
| 55 | + |
| 56 | +def main(): |
| 57 | + df = pd.read_csv("data.csv") |
| 58 | + |
| 59 | + # Many operations. |
| 60 | + |
| 61 | + df.to_pickle("data.pkl") |
| 62 | + |
| 63 | + |
| 64 | +if __name__ == "__main__": |
| 65 | + main() |
| 66 | +``` |
| 67 | + |
| 68 | +For pytask, move the code into a task: a function whose name starts with `task_`, placed |
| 69 | +in a module whose file name has the same prefix, such as `task_data_management.py`. |
| 70 | + |
| 71 | +```python |
| 72 | +# Content of task_data_management.py |
| 73 | +import pandas as pd |
| 74 | + |
| 75 | + |
| 76 | +def task_prepare_data(): |
| 77 | + df = pd.read_csv("data.csv") |
| 78 | + |
| 79 | + # Many operations. |
| 80 | + |
| 81 | + df.to_pickle("data.pkl") |
| 82 | +``` |
| 83 | + |
| 84 | +If your script has an `if __name__ == "__main__"` block, delete it. |
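
To make the naming convention concrete, here is a minimal sketch of how a tool could pick out task functions by their `task_` prefix. The helper `collect_tasks` and the in-memory stand-in module are hypothetical and only illustrate the idea; pytask's own collection logic is not shown here and may work differently.

```python
import inspect
import types


def collect_tasks(module, prefix="task_"):
    """Return the names of functions in ``module`` whose names start with ``prefix``."""
    return sorted(
        name
        for name, obj in vars(module).items()
        if name.startswith(prefix) and inspect.isfunction(obj)
    )


# Build a stand-in module in memory instead of importing a real file (assumption
# made for the sake of a self-contained example).
module = types.ModuleType("task_data_management")
exec(
    "def task_prepare_data():\n"
    "    pass\n"
    "\n"
    "def helper():\n"
    "    pass\n",
    module.__dict__,
)

print(collect_tasks(module))
```

Only `task_prepare_data` is reported; `helper` is ignored because its name lacks the prefix.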
| 85 | + |
| 86 | +## Extracting dependencies and products |
| 87 | + |
| 88 | +To let pytask know the order in which to execute tasks and when to re-run them, you'll |
| 89 | +need to specify task dependencies and products using `@pytask.mark.depends_on` and |
| 90 | +`@pytask.mark.produces`. Extract the paths to the inputs and outputs of your script and |
| 91 | +pass them to the decorator. For example: |
| 92 | + |
| 93 | +```python |
| 94 | +# Content of task_data_management.py |
| 95 | +import pandas as pd |
| 96 | +import pytask |
| 97 | + |
| 98 | + |
| 99 | +@pytask.mark.depends_on("data.csv") |
| 100 | +@pytask.mark.produces("data.pkl") |
| 101 | +def task_prepare_data(depends_on, produces): |
| 102 | + df = pd.read_csv(depends_on) |
| 103 | + |
| 104 | + # Many operations. |
| 105 | + |
| 106 | + df.to_pickle(produces) |
| 107 | +``` |
| 108 | + |
| 109 | +The decorators allow you to use `depends_on` and `produces` as arguments to the |
| 110 | +function and access the paths to the dependencies and products as {class}`pathlib.Path`. |
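
Declared dependencies and products are what allow a build tool to derive an execution order. The following sketch shows the idea with the standard library's `graphlib` (Python 3.9+). The task `task_plot_data`, the file `plot.png`, and the helper `execution_order` are hypothetical, and this is not a description of pytask's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each maps to the files it depends on and the files it produces.
tasks = {
    "task_plot_data": {"depends_on": {"data.pkl"}, "produces": {"plot.png"}},
    "task_prepare_data": {"depends_on": {"data.csv"}, "produces": {"data.pkl"}},
}


def execution_order(tasks):
    """Order tasks so that every product exists before a task that needs it runs."""
    # Map each produced file to the task that creates it.
    producer_of = {
        path: name for name, spec in tasks.items() for path in spec["produces"]
    }
    # A task's predecessors are the producers of its dependencies. Files without
    # a producer (such as raw data) are treated as already present.
    graph = {
        name: {producer_of[p] for p in spec["depends_on"] if p in producer_of}
        for name, spec in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(tasks))
```

Although `task_plot_data` is listed first, it is scheduled after `task_prepare_data` because it depends on `data.pkl`.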
| 111 | + |
| 112 | +You can pass a dictionary to these decorators if you have multiple dependencies or |
| 113 | +products. The dictionary's keys are the names of the dependencies or products, and the |
| 114 | +values are the paths. Here is an example: |
| 115 | + |
| 116 | +```python |
| 117 | +import pandas as pd |
| 118 | +import pytask |
| 119 | + |
| 120 | + |
| 121 | +@pytask.mark.depends_on({"data_1": "data_1.csv", "data_2": "data_2.csv"}) |
| 122 | +@pytask.mark.produces("data.pkl") |
| 123 | +def task_merge_data(depends_on, produces): |
| 124 | + df1 = pd.read_csv(depends_on["data_1"]) |
| 125 | + df2 = pd.read_csv(depends_on["data_2"]) |
| 126 | + |
| 127 | + df = df1.merge(df2, on=...) |
| 128 | + |
| 129 | + df.to_pickle(produces) |
| 130 | +``` |
| 131 | + |
| 132 | +:::{seealso} |
| 133 | +If you want to learn more about dependencies and products, read the |
| 134 | +[tutorial](../tutorials/defining_dependencies_products.md). |
| 135 | +::: |
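
Knowing a task's dependencies and products also makes it possible to decide whether the task has to run again. Below is a minimal sketch of such a check based on content hashes. The helpers `file_hash` and `needs_rerun` and the `recorded_hashes` mapping are assumptions for illustration; how pytask actually tracks changes may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(dependencies, products, recorded_hashes):
    """Decide whether a task must run again.

    ``recorded_hashes`` maps file names to the hash seen at the last successful run.
    """
    # A missing product always forces a run.
    if any(not Path(p).exists() for p in products):
        return True
    # Otherwise, run only if a dependency or product differs from the record.
    paths = [*dependencies, *products]
    return any(recorded_hashes.get(str(p)) != file_hash(p) for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "data.csv")
    target = Path(tmp, "data.pkl")
    source.write_text("a,b\n1,2\n")

    first = needs_rerun([source], [target], {})  # product does not exist yet
    target.write_bytes(b"result")
    recorded = {str(p): file_hash(p) for p in (source, target)}
    second = needs_rerun([source], [target], recorded)  # nothing changed
    source.write_text("a,b\n3,4\n")
    third = needs_rerun([source], [target], recorded)  # dependency changed

print(first, second, third)
```

The three checks print `True False True`: the task runs when its product is missing, is skipped when nothing changed, and runs again after the input was edited.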
| 136 | + |
| 137 | +## Execution |
| 138 | + |
| 139 | +Finally, execute your newly defined tasks with pytask. Assuming your scripts lie in the |
| 140 | +current directory of your terminal or one of its subdirectories, run the following. |
| 141 | + |
| 142 | +```{include} ../_static/md/migrating-from-scripts-to-pytask.md |
| 143 | +``` |
| 144 | + |
| 145 | +Otherwise, pass the paths explicitly to the pytask executable. |
| 146 | + |
| 147 | +If you have rewritten multiple scripts that can be run in parallel, use the |
| 148 | +`-n/--n-workers` option to define the number of parallel tasks. pytask-parallel will |
| 149 | +then automatically spawn multiple processes to run the workflow in parallel. |
| 150 | + |
| 151 | +```console |
| 152 | +$ pytask -n 4 |
| 153 | +``` |
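
The idea behind running independent tasks with several workers can be sketched with the standard library. This example uses threads for simplicity, whereas the text above describes pytask-parallel as spawning processes. The task functions and the helper `run_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def task_clean_survey():
    return "survey cleaned"


def task_clean_prices():
    return "prices cleaned"


def run_in_parallel(tasks, n_workers):
    """Run independent callables concurrently and return results keyed by task name."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Submit everything first so the tasks can overlap, then collect results.
        futures = {task.__name__: pool.submit(task) for task in tasks}
        return {name: future.result() for name, future in futures.items()}


results = run_in_parallel([task_clean_survey, task_clean_prices], n_workers=4)
print(results)
```

This only works safely when the tasks do not depend on each other's products, which is why declaring dependencies and products matters for parallel runs.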
| 154 | + |
| 155 | +:::{seealso} |
| 156 | +You can find more information on pytask-parallel in the |
| 157 | +[readme](https://github.com/pytask-dev/pytask-parallel) on GitHub. |
| 158 | +::: |
| 159 | + |
| 160 | +## Bonus: From R script to task |
| 161 | + |
| 162 | +pytask wants to help you get your job done, and sometimes a different programming |
| 163 | +language can make your life easier. Thus, pytask has several plugins to integrate code |
| 164 | +written in R, Julia, and Stata. Here, we explore how to incorporate an R script with |
| 165 | +[pytask-r](https://github.com/pytask-dev/pytask-r). You can also find more information |
| 166 | +about the plugin in the repo's readme. |
| 167 | + |
| 168 | +First, we will install the package. |
| 169 | + |
| 170 | +```console |
| 171 | +$ pip install pytask-r |
| 172 | + |
| 173 | +$ conda install -c conda-forge pytask-r |
| 174 | +``` |
| 175 | + |
| 176 | +:::{seealso} |
| 177 | +Check out [pytask-julia](https://github.com/pytask-dev/pytask-julia) and |
| 178 | +[pytask-stata](https://github.com/pytask-dev/pytask-stata), too! |
| 179 | +::: |
| 180 | + |
| 181 | +And here is the R script `prepare_data.r` that we want to integrate. |
| 182 | + |
| 183 | +```r |
| 184 | +# Content of prepare_data.r |
| 185 | +df <- read.csv("data.csv") |
| 186 | + |
| 187 | +# Many operations. |
| 188 | + |
| 189 | +saveRDS(df, "data.rds") |
| 190 | +``` |
| 191 | + |
| 192 | +Next, we create a task function to point pytask to the script and the dependencies and |
| 193 | +products. |
| 194 | + |
| 195 | +```python |
| 196 | +# Content of task_data_management.py |
| 197 | +import pytask |
| 198 | + |
| 199 | + |
| 200 | +@pytask.mark.r(script="prepare_data.r") |
| 201 | +@pytask.mark.depends_on("data.csv") |
| 202 | +@pytask.mark.produces("data.rds") |
| 203 | +def task_prepare_data(): |
| 204 | + pass |
| 205 | +``` |
| 206 | + |
| 207 | +pytask automatically makes the paths to the dependencies and products available to the |
| 208 | +R file via a JSON file. Let us amend the R script to load the information from the JSON |
| 209 | +file. |
| 210 | + |
| 211 | +```r |
| 212 | +# Content of prepare_data.r |
| 213 | +library(jsonlite) |
| 214 | + |
| 215 | +# Read the JSON file whose path is passed to the script |
| 216 | +args <- commandArgs(trailingOnly=TRUE) |
| 217 | +path_to_json <- args[length(args)] |
| 218 | +config <- read_json(path_to_json) |
| 219 | + |
| 220 | +df <- read.csv(config$depends_on) |
| 221 | + |
| 222 | +# Many operations. |
| 223 | + |
| 224 | +saveRDS(df, config$produces) |
| 225 | +``` |
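
For readers more at home in Python, the same handoff can be sketched as follows: paths are written to a JSON file, and the receiving side reads the file named by the last command-line argument, mirroring the R script above. The helpers `write_config` and `read_config`, the file name `config.json`, and the flat layout with only the keys `depends_on` and `produces` are assumptions for illustration; the file that pytask-r writes may contain more or be structured differently.

```python
import json
import tempfile
from pathlib import Path


def write_config(path, depends_on, produces):
    """Serialize the paths a script needs into a JSON file and return its location."""
    payload = {"depends_on": str(depends_on), "produces": str(produces)}
    Path(path).write_text(json.dumps(payload))
    return Path(path)


def read_config(argv):
    """Mirror the R script: take the last command-line argument as the JSON path."""
    return json.loads(Path(argv[-1]).read_text())


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(Path(tmp, "config.json"), "data.csv", "data.rds")
    # Stand-in for the argument list a script would receive.
    config = read_config(["prepare_data.r", str(config_path)])

print(config["depends_on"], config["produces"])
```

The script itself no longer hard-codes any paths; it only needs to know where to find the JSON file.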
| 226 | + |
| 227 | +## Conclusion |
| 228 | + |
| 229 | +Congrats! You have just set up your first workflow with pytask! |
| 230 | + |
| 231 | +If you enjoyed what you have seen, explore the other parts of the |
| 232 | +documentation. The [tutorials](../tutorials/index.md) are a good entry point to start |
| 233 | +with pytask and learn about many concepts step-by-step. |