Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ETL stage to the platform pipeline #24

Merged
merged 9 commits into from
Sep 19, 2024
Merged

feat: add ETL stage to the platform pipeline #24

merged 9 commits into from
Sep 19, 2024

Conversation

javfg
Copy link
Member

@javfg javfg commented Sep 16, 2024

This PR adds the ETL stage to the platform pipeline, completing the platform side of things in the orchestrator.

There are few extra additions. I've separated the code different commits which should not be squashed when merging. The important ones are:

  • d3a8a63: New HOCON parse features. This is the config format used by the ETL application.
  • f34ee7b: New Labels class that provides various utilities for handling Google Cloud labels.
  • b7f3701: New operators to handle file upload, from a string and from a remote location using a URL.
  • 79b1ee1: Dataproc operators to handle creation of clusters and submission of jobs. They simplify these tasks by having our defaults put in. These are tailored for ETL but could easily be made more generic for other parts.
  • 7390670: Some changes to the way we decide if we rerun a PIS step in subsequent run tries.

The main commit is 9af6ca1, where the platform-input-support dag is removed and platform is put in its place with the full pipeline. Some details:

  • The idea is to have a platform.yaml defining the run parameters. These are the things we change with each release, like bucket path, efo/cheml/ensembl versions, etc.
  • There is a platform.py file in the configuration which will provide all the values the DAG needs by using the previous yaml to override values in PIS/ETL configurations. That way, we don't have to edit those for each release.

I will detail this more in a design document later one, hopefully :).

This closes opentargets/issues#3396.

Copy link
Collaborator

@project-defiant project-defiant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the comments are purely to ensure I understand the overall process. Otherwise approved :)

pyproject.toml Show resolved Hide resolved
src/ot_orchestration/dags/config/platform.py Show resolved Hide resolved
src/ot_orchestration/dags/config/platform.py Show resolved Hide resolved
src/ot_orchestration/dags/config/platform.py Show resolved Hide resolved
src/ot_orchestration/dags/config/platform.py Show resolved Hide resolved
src/ot_orchestration/utils/dataproc_platform.py Outdated Show resolved Hide resolved
src/ot_orchestration/utils/dataproc_platform.py Outdated Show resolved Hide resolved
src/ot_orchestration/utils/dataproc_platform.py Outdated Show resolved Hide resolved
src/ot_orchestration/utils/labels.py Show resolved Hide resolved
characters, underscores and dashes. The value can be at most 63 characters
long.
"""
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:63]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By looking at this I had to check if slicing string with right index being bigger, then the length of the string will not break it. 🐙 For the sanity I would change it to

Suggested change
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:63]
max_char_count = 63
rid = max_char_count if len(label) > max_char_count else len(label)
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:rid]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need that check, the behavior of slices has been there for a long time (26 years). It's safe to assume it will not change.

You can even do

something[3:2]

although I'm not sure why would you. :D

@project-defiant
Copy link
Collaborator

@javfg align the PR name to the semantics

@javfg javfg changed the title Add ETL stage to the platform pipeline feat: add ETL stage to the platform pipeline Sep 19, 2024
@javfg javfg merged commit 21beac7 into dev Sep 19, 2024
2 checks passed
project-defiant pushed a commit that referenced this pull request Sep 23, 2024
This PR also adds some small features to support the Platform pipeline.

* feat: add hocon file parsing

* chore: update pis config fields

* feat: labels class

* feat: more file upload operators

* feat: platform etl dataproc operators

* fix: improve pis diff computation

* feat: complete platform dag

* fix: broken ppp conditional deps

* fix: apply suggestions from code review

Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>

---------

Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port the ETL launch to Airflow
2 participants