-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ETL stage to the platform pipeline #24
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the comments are purely to ensure I understand the overall process. Otherwise approved :)
characters, underscores and dashes. The value can be at most 63 characters | ||
long. | ||
""" | ||
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:63] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By looking at this I had to check if slicing string with right index being bigger, then the length of the string will not break it. 🐙 For the sanity I would change it to
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:63] | |
max_char_count = 63 | |
rid = max_char_count if len(label) > max_char_count else len(label) | |
return re.sub(r"[^a-z0-9-_]", "-", label.lower())[0:rid] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need that check, the behavior of slices has been there for a long time (26 years). It's safe to assume it will not change.
You can even do
something[3:2]
although I'm not sure why would you. :D
@javfg align the PR name to the semantics |
Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
This PR also adds some small features to support the Platform pipeline. * feat: add hocon file parsing * chore: update pis config fields * feat: labels class * feat: more file upload operators * feat: platform etl dataproc operators * fix: improve pis diff computation * feat: complete platform dag * fix: broken ppp conditional deps * fix: apply suggestions from code review Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk> --------- Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
This PR adds the ETL stage to the platform pipeline, completing the platform side of things in the orchestrator.
There are few extra additions. I've separated the code different commits which should not be squashed when merging. The important ones are:
Labels
class that provides various utilities for handling Google Cloud labels.The main commit is 9af6ca1, where the
platform-input-support
dag is removed andplatform
is put in its place with the full pipeline. Some details:platform.yaml
defining the run parameters. These are the things we change with each release, like bucket path, efo/cheml/ensembl versions, etc.platform.py
file in the configuration which will provide all the values the DAG needs by using the previous yaml to override values in PIS/ETL configurations. That way, we don't have to edit those for each release.I will detail this more in a design document later one, hopefully :).
This closes opentargets/issues#3396.