Evaluate orchestration systems for ooni/data #46
Not in the same general category as the ones you mention above, but one feature I like from Toil is the support for a workflow DSL (CWL/WDL) and the ability to run the same jobs locally with minimal overhead (file store) or in a fully-fledged HPC (thinking about reproducibility and supporting research with smaller subsets of the global dataset).
hellais added a commit that referenced this issue on Apr 15, 2024:
Major refactoring of oonidata into two separate packages:
* oonidata, the end-user pip-installable package to download and parse measurements. It should have minimal dependencies and not require additional components (e.g. clickhouse) to run.
* oonipipeline, the component that actually performs the analysis and processing of data. This does require external dependencies to run.

Eventually we might want to move them into their own respective repos.

This fixes the following issues:
* #41: we keep oonidata as the name of the CLI tool, but oonipipeline is the actual component doing the analysis.
* #46: we are currently using temporal.io. After this lands we might want to build prototypes with the other orchestration systems to compare.
* #57: we might eventually not go for a monorepo, but for the moment this is what we've got.
I would say we are pretty happy with temporal, so we can consider this done.
At the moment OONI/data just runs through cronjob-based scheduling.
This is a bit suboptimal because we don't have support for nice logging, retries, and monitoring of task execution. It's also not so simple to clearly define task dependencies.
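To make the gap concrete: retries and per-attempt logging are exactly the bookkeeping that a bare cronjob lacks and that orchestration tools provide out of the box. A minimal stdlib-only sketch of that bookkeeping (the names `run_with_retries`, `max_attempts`, and `backoff` are illustrative, not part of any actual OONI code):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(task, max_attempts=3, backoff=2.0):
    """Run a zero-argument task, logging each attempt and retrying
    with exponential backoff on failure.

    An orchestration tool gives you this (plus persistence and a UI)
    for free; with cron you'd have to hand-roll it per job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            log.info("task %s succeeded on attempt %d", task.__name__, attempt)
            return result
        except Exception as exc:
            log.warning("task %s failed on attempt %d: %s",
                        task.__name__, attempt, exc)
            if attempt == max_attempts:
                raise
            time.sleep(backoff ** attempt)
```

With cron, a failed run is simply gone until the next scheduled tick; a wrapper like this at least records and retries the failure, which is the behaviour the orchestrators below formalize.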
In the past we used airflow for this use-case (and even before that, luigi). In the current pipeline we don't use anything and just rely on systemd, because airflow was such a pain to administer and manage.
It looks like the orchestration space has moved forward quite a bit, and there are several nice-looking tools available at the moment:
It might be worth spending some time evaluating these options and seeing if it makes sense to use one of them.
If we pick one of these orchestration tools, given that most of them also support parallelization, we could even get rid of dask and replace it with whatever orchestration tool we choose.
That would simplify the codebase and make monitoring and troubleshooting in production more robust.
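The dask-replacement idea boils down to: if the orchestrator already knows the task dependency graph, it can run independent tasks in parallel itself. A stdlib-only sketch of that pattern, using `graphlib.TopologicalSorter` (Python 3.9+) plus a thread pool (the task names and the `run_dag` helper are illustrative, not the actual ooni/data task graph):

```python
import concurrent.futures
import graphlib  # stdlib topological sorting (Python 3.9+)


def run_dag(tasks, deps, max_workers=4):
    """Execute tasks respecting dependencies; independent tasks run in parallel.

    `tasks` maps a task name to a zero-argument callable; `deps` maps a
    task name to the set of task names that must complete before it.
    """
    sorter = graphlib.TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError on circular dependencies
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All tasks whose prerequisites are done can run concurrently.
            ready = list(sorter.get_ready())
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
                sorter.done(name)
    return results
```

Any of the orchestration tools under evaluation implements this same ready-set scheduling loop, but with the workers distributed across machines and the task state persisted, which is what would let dask be dropped without losing parallelism.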