Evaluate orchestration systems for ooni/data #46
Not in the same general category as the ones you mention above, but one feature I like from Toil is the support for a workflow DSL (CWL/WDL) and the ability to run the same jobs locally with minimal overhead (file store) or in a fully-fledged HPC (thinking about reproducibility and supporting research with smaller subsets of the global dataset).
hellais added a commit that referenced this issue on Apr 15, 2024:
Major refactoring of oonidata into two separate packages:
* oonidata, the end-user pip-installable package to download and parse measurements. It should have minimal dependencies and not require additional components (e.g. clickhouse) to run.
* oonipipeline, the component that actually performs the analysis and processing of data. This does require external dependencies to run.

Eventually we might want to move them into their own respective repos.

This fixes the following issues:
* #41: we keep oonidata as the name of the CLI tool, but oonipipeline is the actual component doing the analysis.
* #46: we are currently using temporal.io. After this lands we might want to build prototypes with the other orchestration systems to compare.
* #57: we might eventually not go for a monorepo, but for the moment this is what we've got.
I would say we are pretty happy with temporal, so we can consider this done.
At the moment OONI/data just runs through cronjob-based scheduling.
This is a bit suboptimal because we don't have support for nice logging, retries, and monitoring of task execution. It's also not so simple to clearly define task dependencies.
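To make the gap concrete: retries and per-attempt logging are exactly the bookkeeping that a bare cronjob lacks and that orchestration tools provide out of the box. A minimal stdlib-only sketch of that bookkeeping (the names `run_with_retries`, `max_attempts`, and `backoff` are illustrative, not part of any actual OONI code):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(task, max_attempts=3, backoff=2.0):
    """Run a zero-argument task, logging each attempt and retrying
    with exponential backoff on failure.

    An orchestration tool gives you this (plus persistence and a UI)
    for free; with cron you'd have to hand-roll it per job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            log.info("task %s succeeded on attempt %d", task.__name__, attempt)
            return result
        except Exception as exc:
            log.warning("task %s failed on attempt %d: %s",
                        task.__name__, attempt, exc)
            if attempt == max_attempts:
                raise
            time.sleep(backoff ** attempt)
```

With cron, a failed run is simply gone until the next scheduled tick; a wrapper like this at least records and retries the failure, which is the behaviour the orchestrators below formalize.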
In the past we used airflow for this use-case (and even before that, luigi). In the current pipeline we don't use anything and just rely on systemd, because airflow was such a pain to administer and manage.
It looks like the orchestration space has moved forward quite a bit, and there are several nice-looking tools available at the moment:
It might be worth spending some time evaluating these options and seeing if it makes sense to use one of them.
If we pick one of these orchestration tools, given that most of them also support parallelization, we could even get rid of dask and replace it with whatever orchestration tool we choose.
That would simplify the codebase and make monitoring and troubleshooting in production more robust.
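The dask-replacement idea boils down to: if the orchestrator already knows the task dependency graph, it can run independent tasks in parallel itself. A stdlib-only sketch of that pattern, using `graphlib.TopologicalSorter` (Python 3.9+) plus a thread pool (the task names and the `run_dag` helper are illustrative, not the actual ooni/data task graph):

```python
import concurrent.futures
import graphlib  # stdlib topological sorting (Python 3.9+)


def run_dag(tasks, deps, max_workers=4):
    """Execute tasks respecting dependencies; independent tasks run in parallel.

    `tasks` maps a task name to a zero-argument callable; `deps` maps a
    task name to the set of task names that must complete before it.
    """
    sorter = graphlib.TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError on circular dependencies
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All tasks whose prerequisites are done can run concurrently.
            ready = list(sorter.get_ready())
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
                sorter.done(name)
    return results
```

Any of the orchestration tools under evaluation implements this same ready-set scheduling loop, but with the workers distributed across machines and the task state persisted, which is what would let dask be dropped without losing parallelism.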