Add a guide to migrate from scripts to pytask. (#330)

tobiasraabe · web-flow · commit 74631e353481 · 2022-12-31T15:06:27.000+01:00
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -82,6 +82,7 @@ repos:
             (?x)^(
                 docs/source/how_to_guides/bp_structure_of_task_files.md|
                 docs/source/how_to_guides/how_to_influence_build_order.md|
+                docs/source/how_to_guides/migrating_from_scripts_to_pytask.md|
                 docs/source/how_to_guides/repeating_tasks_with_different_inputs_the_pytest_way.md|
                 docs/source/reference_guides/hookspecs.md|
                 docs/source/tutorials/configuration.md|
diff --git a/docs/source/_static/md/migrating-from-scripts-to-pytask.md b/docs/source/_static/md/migrating-from-scripts-to-pytask.md
@@ -0,0 +1,25 @@
+<div class="termy">
+
+```console
+
+$ pytask
+──────────────────────────── Start pytask session ────────────────────────────
+Platform: win32 -- Python <span style="color: var(--termynal-blue)">3.10.0</span>, pytask <span style="color: var(--termynal-blue)">0.2.0</span>, pluggy <span style="color: var(--termynal-blue)">1.0.0</span>
+Root: C:\Users\pytask-dev\git\my_project
+Collected <span style="color: var(--termynal-blue)">1</span> task.
+
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ Task                                        ┃ Outcome ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ <span class="termynal-dim">task_data_management.py::</span>task_prepare_data  │ <span class="termynal-success">.</span>       │
+└─────────────────────────────────────────────┴─────────┘
+
+<span class="termynal-dim">──────────────────────────────────────────────────────────────────────────────</span>
+<span class="termynal-success">╭───────────</span> <span style="font-weight: bold;">Summary</span> <span class="termynal-success">────────────╮</span>
+<span class="termynal-success">│</span> <span style="font-weight: bold;"> 1  Collected tasks </span>           <span class="termynal-success">│</span>
+<span class="termynal-success">│</span> <span class="termynal-success-textonly"> 1  Succeeded        (100.0%) </span> <span class="termynal-success">│</span>
+<span class="termynal-success">╰────────────────────────────────╯</span>
+<span class="termynal-success">───────────────────────── Succeeded in 30.6 seconds ──────────────────────────</span>
+```
+
+</div>
diff --git a/docs/source/changes.md b/docs/source/changes.md
@@ -10,6 +10,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
 - {pull}`313` refactors the configuration. INI configurations are no longer supported.
 - {pull}`326` fixes the badge for status of the workflow.
 - {pull}`329` adds ruff to pre-commit hooks.
+- {pull}`330` adds a guide for migrating from scripts to pytask.
 - {pull}`332` refactors `database.py`.
 
 ## 0.2.7 - 2022-12-14
diff --git a/docs/source/how_to_guides/index.md b/docs/source/how_to_guides/index.md
@@ -4,13 +4,14 @@ This section contains two collections of documents.
 
 ## How-to Guides
 
-The first collection of how-to guides provide detailed explanations on how to accomplish
+The first collection of how-to guides provides detailed explanations on accomplishing
 specific tasks with pytask.
 
 ```{toctree}
 ---
 maxdepth: 1
 ---
+migrating_from_scripts_to_pytask
 invoking_pytask_extended
 capture_warnings
 repeating_tasks_with_different_inputs_the_pytest_way
@@ -22,12 +23,11 @@ how_to_write_a_plugin
 
 The second collection comprises best practice guides for pytask. The guides combine
 experience with pytask and build systems in general, research projects, and software
-engineering to provide useful and easily understandable instructions.
+engineering to provide practical and easily understandable instructions.
 
-Contributions in any form - additions, comments, own experiences, request for
-clarifications - are highly appreciated. File either
-[issue](https://github.com/pytask-dev/pytask/issues) or start
-[discussion](https://github.com/pytask-dev/pytask/discussions).
+Contributions - additions, comments, experiences, and requests for clarification - are
+highly appreciated. File either an [issue](https://github.com/pytask-dev/pytask/issues)
+or start a [discussion](https://github.com/pytask-dev/pytask/discussions).
 
 ```{toctree}
 ---
diff --git a/docs/source/how_to_guides/migrating_from_scripts_to_pytask.md b/docs/source/how_to_guides/migrating_from_scripts_to_pytask.md
@@ -0,0 +1,233 @@
+# Migrating from scripts to pytask
+
+Are you tired of managing tasks in your research workflows with scripts that get harder
+to maintain over time? Then pytask is here to help!
+
+With pytask, you can enjoy features like:
+
+- **Lazy builds**. Only execute the scripts that need to be run or re-run because
+  something has changed, saving you lots of time.
+- **Parallelization**. Use
+  [pytask-parallel](https://github.com/pytask-dev/pytask-parallel) to speed up your
+  scripts by running them in parallel.
+- **Cross-language projects**. pytask has several plugins for running scripts written in
+  other popular languages: [pytask-r](https://github.com/pytask-dev/pytask-r),
+  [pytask-julia](https://github.com/pytask-dev/pytask-julia), and
+  [pytask-stata](https://github.com/pytask-dev/pytask-stata).
+
+The following guide will walk you through a series of steps to quickly migrate your
+scripts to a workflow managed by pytask. The focus is first on Python scripts, but the
+guide concludes with an additional example of an R script.
+
+## Installation
+
+To get started with pytask, simply install it with pip or conda:
+
+```console
+$ pip install pytask pytask-parallel
+
+$ conda install -c conda-forge pytask pytask-parallel
+```
+
+## From Python script to task
+
+First, move the executable part of your script into a task function. Your code might
+sit at the top level of the script, as in this example.
+
+```python
+# Content of task_data_management.py
+import pandas as pd
+
+
+df = pd.read_csv("data.csv")
+
+# Many operations.
+
+df.to_pickle("data.pkl")
+```
+
+Or, you might use an `if __name__ == "__main__"` block, as in this example.
+
+```python
+# Content of task_data_management.py
+import pandas as pd
+
+
+def main():
+    df = pd.read_csv("data.csv")
+
+    # Many operations.
+
+    df.to_pickle("data.pkl")
+
+
+if __name__ == "__main__":
+    main()
+```
+
+For pytask, move the code into a task: a function whose name starts with `task_`,
+defined in a module whose name has the same prefix, like `task_data_management.py`.
+
+```python
+# Content of task_data_management.py
+import pandas as pd
+
+
+def task_prepare_data():
+    df = pd.read_csv("data.csv")
+
+    # Many operations.
+
+    df.to_pickle("data.pkl")
+```
+
+Delete the `if __name__ == "__main__"` block if your script has one.
+
+## Extracting dependencies and products
+
+To let pytask know the order in which to execute tasks and when to re-run them, you'll
+need to specify task dependencies and products using `@pytask.mark.depends_on` and
+`@pytask.mark.produces`. Extract the paths to the inputs and outputs of your script and
+pass them to the decorator. For example:
+
+```python
+# Content of task_data_management.py
+import pandas as pd
+import pytask
+
+
+@pytask.mark.depends_on("data.csv")
+@pytask.mark.produces("data.pkl")
+def task_prepare_data(depends_on, produces):
+    df = pd.read_csv(depends_on)
+
+    # Many operations.
+
+    df.to_pickle(produces)
+```
+
+The decorators allow you to use `depends_on` and `produces` as arguments to the
+function and access the paths to the dependencies and products as {class}`pathlib.Path`.
+
+You can pass a dictionary to these decorators if you have multiple dependencies or
+products. The dictionary's keys are the names of the dependencies or products, and the
+values are the paths. Here is an example:
+
+```python
+import pandas as pd
+import pytask
+
+
+@pytask.mark.depends_on({"data_1": "data_1.csv", "data_2": "data_2.csv"})
+@pytask.mark.produces("data.pkl")
+def task_merge_data(depends_on, produces):
+    df1 = pd.read_csv(depends_on["data_1"])
+    df2 = pd.read_csv(depends_on["data_2"])
+
+    df = df1.merge(df2, on=...)
+
+    df.to_pickle(produces)
+```
+
+:::{seealso}
+If you want to learn more about dependencies and products, read the
+[tutorial](../tutorials/defining_dependencies_products.md).
+:::
+
+## Execution
+
+Finally, execute your newly defined tasks with pytask. Assuming your scripts lie in the
+current directory of your terminal or one of its subdirectories, run the following.
+
+```{include} ../_static/md/migrating-from-scripts-to-pytask.md
+```
+
+Otherwise, pass the paths explicitly to the pytask executable.
+
+If you have rewritten multiple scripts that can be run in parallel, use the
+`-n/--n-workers` option to define the number of parallel tasks. pytask-parallel will
+then automatically spawn multiple processes to run the workflow in parallel.
+
+```console
+$ pytask -n 4
+```
+
+:::{seealso}
+You can find more information on pytask-parallel in the
+[readme](https://github.com/pytask-dev/pytask-parallel) on GitHub.
+:::
+
+## Bonus: From R script to task
+
+pytask wants to help you get your job done, and sometimes a different programming
+language can make your life easier. Thus, pytask has several plugins to integrate code
+written in R, Julia, and Stata. Here, we explore how to incorporate an R script with
+[pytask-r](https://github.com/pytask-dev/pytask-r). You can also find more information
+about the plugin in the repo's readme.
+
+First, we will install the package.
+
+```console
+$ pip install pytask-r
+
+$ conda install -c conda-forge pytask-r
+```
+
+:::{seealso}
+Check out [pytask-julia](https://github.com/pytask-dev/pytask-julia) and
+[pytask-stata](https://github.com/pytask-dev/pytask-stata), too!
+:::
+
+And here is the R script `prepare_data.r` that we want to integrate.
+
+```r
+# Content of prepare_data.r
+df <- read.csv("data.csv")
+
+# Many operations.
+
+saveRDS(df, "data.rds")
+```
+
+Next, we create a task function to point pytask to the script and the dependencies and
+products.
+
+```python
+# Content of task_data_management.py
+import pytask
+
+
+@pytask.mark.r(script="prepare_data.r")
+@pytask.mark.depends_on("data.csv")
+@pytask.mark.produces("data.rds")
+def task_prepare_data():
+    pass
+```
+
+pytask automatically makes the paths to the dependencies and products available to the
+R file via a JSON file. Let us amend the R script to load the information from the JSON
+file.
+
+```r
+# Content of prepare_data.r
+library(jsonlite)
+
+# Read the JSON file whose path is passed to the script
+args <- commandArgs(trailingOnly=TRUE)
+path_to_json <- args[length(args)]
+config <- read_json(path_to_json)
+
+df <- read.csv(config$depends_on)
+
+# Many operations.
+
+saveRDS(df, config$produces)
+```
+
+## Conclusion
+
+Congrats! You have just set up your first workflow with pytask!
+
+If you enjoyed what you have seen, explore the other parts of the documentation. The
+[tutorials](../tutorials/index.md) are a good entry point to start with pytask and
+learn about many concepts step by step.