Evaluate orchestration systems for ooni/data #46

Closed
hellais opened this issue Nov 27, 2023 · 2 comments

@hellais
Member

hellais commented Nov 27, 2023

At the moment ooni/data just runs on cronjob-based scheduling.

This is suboptimal because we have no proper support for logging, retries, or monitoring of task execution. It's also not simple to clearly define task dependencies.

In the past we used Airflow for this use case (and, before that, Luigi). In the current pipeline we don't use an orchestrator at all and just rely on systemd, because Airflow was such a pain to administer and manage.

It looks like the space of orchestration has moved forward quite a bit and there are several nice looking tools in this space at the moment:

It might be worth spending some time evaluating these options and seeing if it makes sense to use one of them.

If we pick one of these orchestration tools, given that most of them also support parallelization, we could even get rid of Dask and handle parallel execution with whatever orchestration tool we choose.

That would simplify the codebase and make monitoring and troubleshooting the pipeline in production more robust.

@ainghazal

Not in the same general category as the ones you mention above, but one feature I like in Toil is its support for a workflow DSL (CWL/WDL) and the ability to run the same jobs either locally with minimal overhead (file store) or on a full-fledged HPC cluster. I'm thinking about reproducibility and about supporting research on smaller subsets of the global dataset.

hellais added a commit that referenced this issue Apr 15, 2024
Major refactoring of oonidata into two separate
packages:
* oonidata, which is the end-user, pip-installable package to download
and parse measurements (it should have minimal dependencies and not require
additional components to run, e.g. ClickHouse)
* oonipipeline, which is the component that actually performs the analysis
and processing of data. This does require external dependencies to run.

Eventually we might want to move them into their own respective repos.

This fixes the following issues:
* #41, we keep oonidata as the name
for the CLI tool, but oonipipeline is the actual component doing
analysis.
* #46, we are currently using
temporal.io. After this lands we might want to make prototypes of the
other orchestration systems to compare.
* #57, we might eventually not go for
monorepo, but for the moment this is what we got.
@hellais hellais mentioned this issue Apr 25, 2024
13 tasks
@hellais
Member Author

hellais commented Jul 26, 2024

I would say we are pretty happy with Temporal, so we can consider this done.

@hellais hellais closed this as completed Jul 26, 2024