feat: application configuration #386
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##              dev     #386      +/-   ##
==========================================
+ Coverage   85.67%   86.16%   +0.49%
==========================================
  Files          89       97       +8
  Lines        2101     2624     +523
==========================================
+ Hits         1800     2261     +461
- Misses        301      363      +62
Great addition. I still haven't tried it in Airflow, but all components are there. I have added a few suggestions.
My summary after review:
- We now register a config store where each node is a dataclass for one step. For our use case, the YAMLs take the parameters in the config store as a reference and extend them (see the sketch after this list).
- Every YAML has the same attributes as the respective config store node, so that they can be overridden.
- The session class is registered at the time of instantiating the step.
- Steps are not dataclasses anymore. All the logic has been moved to the generator method.
- `local.yaml` (as opposed to `dataproc.yaml`) has been removed. Default params are provided in the config class.
- `cli.py` and the DAGs have been adapted to accommodate the new changes.
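A minimal sketch of that pattern, assuming Hydra's `ConfigStore`; the step name, parameters and defaults below are made up for illustration and are not the actual ones in this PR:

```python
from dataclasses import dataclass

from hydra.core.config_store import ConfigStore


@dataclass
class MyStepConfig:
    """Schema and defaults for a hypothetical step."""

    input_path: str = "data/input.parquet"
    output_path: str = "data/output.parquet"
    threshold: float = 0.05


cs = ConfigStore.instance()
# Register the dataclass as a node in the "step" group; a YAML with the same
# attributes can then reference it and override any of these defaults.
cs.store(group="step", name="my_step", node=MyStepConfig)

# A config/step/my_step.yaml overriding a single default would only need, e.g.:
#   threshold: 0.1
```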
I realise this is going to be a dissenting opinion, but I still maintain that a config store adds an unnecessary extra layer of complexity and, more importantly, duplication.
If I were an external user, it would probably be easier for me to just add/edit a single YAML file rather than modifying several semantically redundant files across the entire code base.
I am certain that, rather than increasing duplication, we should be striving to reduce it even further. I think that an ideal situation would be for every parameter, its name, default value, type and description to be defined exactly once in the code.
This is certainly possible. GATK is an example of an extremely sophisticated bioinformatics software package with dozens of steps and many hundreds of parameters. (Admittedly it is written in Java, but the same general principles apply to Python as well.) If we take a random obscure parameter, it is defined in precisely one place in the entire code base, and everything, including documentation, is automatically generated from that one definition.
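Loosely, and not GATK's or this repository's actual code, the single-definition idea could look like this in Python: each parameter's name, type, default and description live in exactly one dataclass field, and CLI help/documentation are derived from that one definition:

```python
from dataclasses import dataclass, field, fields


@dataclass
class StepParams:
    # Hypothetical parameters; each is defined exactly once, with its
    # type, default and description in one place.
    min_quality: float = field(
        default=30.0,
        metadata={"help": "Minimum variant quality to keep."},
    )
    max_missingness: float = field(
        default=0.1,
        metadata={"help": "Maximum fraction of missing genotypes allowed."},
    )


def render_help(cls) -> str:
    """Generate CLI help/documentation from the single definition."""
    return "\n".join(
        f"--{f.name} (default: {f.default}): {f.metadata.get('help', '')}"
        for f in fields(cls)
    )


print(render_help(StepParams))
```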
I think that, given the complexity of our code base and how it is likely to grow even further, the current level of semantic duplication will make it very hard to maintain.
If there is truly no way to achieve what we are trying to achieve with Hydra, then it's probably a better decision to replace it with something else, rather than adding so much duplication.
I've been thinking about your points @tskir. I think it's important to think twice about these kinds of changes. The problem that this PR comes to solve is the fact that the YAML was a dependency of the package: someone could not pip install the package and run a step.

Talking about the problems: I also see a bit of redundancy. The configs and the steps both have a very similar set of parameters (not exactly the same). I think there is room to think about how to use the config classes to parametrise the steps without having to write/document every parameter again. This is the only layer that could be considered redundant and imply some maintenance, in my opinion.

The other bit to reduce, and potentially eliminate, are the YAML configurations for the steps. A much slimmer Airflow configuration with a list of inputs/outputs could be enough, and we could supersede the current configs/ directory.

As for Hydra, I had similar thoughts in the past about whether it is worth it for our case. But after thinking quite a lot about it, I concluded we are using it for quite a lot of things: composition, interpolation, instantiation, the CLI, etc. If we didn't use Hydra, we would probably need quite a lot of code to get reasonably similar functionality.
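To make that list concrete, here is a rough sketch of those Hydra features in one place; the config paths, the `session`/`step` nodes and the step's `run()` method are assumptions for illustration, not this repository's actual layout:

```python
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


@hydra.main(config_path="config", config_name="config", version_base=None)
def run(cfg: DictConfig) -> None:
    # Composition: the top-level config pulls in a step config via its defaults list.
    # Interpolation: YAML values such as ${base_path}/output resolve at runtime.
    # Instantiation: a node with a _target_ key is turned into a Python object.
    session = instantiate(cfg.session)
    step = instantiate(cfg.step, session=session)
    step.run()  # assumes the instantiated step exposes a run() method


if __name__ == "__main__":
    run()

# CLI: overrides come for free, e.g.
#   python run.py step=my_step step.threshold=0.1
```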
This is yet another revised version of the application configuration.
Since the removal of the config store, the application could not run independently of a configuration YAML. This was a major limitation, as we could not expect a user of the package to create a YAML every time they want to run a step.
The intention of this PR is to allow running the application in at least three ways (sketched below):
a) as a CLI, providing the appropriate parameters necessary to run a step, or overriding the default parameters
b) as a CLI in combination with a YAML configuration to provide parameters
c) imported as a package
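For illustration, the three modes might look like the following; the entry-point name, module path and step class are hypothetical, while the Hydra flags (`--config-dir`, `--config-name`) and the `key=value` override syntax are standard:

```python
# (a) CLI with inline overrides (hypothetical entry point and step name):
#       my_package step=my_step step.input_path=in.parquet step.output_path=out.parquet
#
# (b) CLI combined with a YAML that provides the parameters:
#       my_package --config-dir my_configs --config-name my_step
#
# (c) imported as a package (hypothetical module path and class):
from my_package.my_step import MyStep

step = MyStep(input_path="in.parquet", output_path="out.parquet")
step.run()
```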
This PR implements this by recovering the config store and the step configuration dataclasses. They contain all the necessary knowledge about how to run every step: which parameters are required, which are optional, and some default values. By using the config store, the application is aware of which steps are available and which parameters are needed to run each of them, and complains otherwise.
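A minimal sketch of how required vs. optional parameters can be expressed in such step dataclasses, assuming OmegaConf's `MISSING` sentinel; the names are illustrative:

```python
from dataclasses import dataclass

from omegaconf import MISSING


@dataclass
class MyStepConfig:
    """Illustrative step config: required fields vs. provided defaults."""

    input_path: str = MISSING   # required: must be supplied via YAML or CLI override
    output_path: str = MISSING  # required
    threshold: float = 0.05     # optional, sensible default provided
```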
Our layer of YAML configuration (`/config`) is therefore no longer a dependency of the package. It only contains the information necessary to populate/override the parameters we are interested in for our own use case of the package. This layer is getting slimmer and slimmer, and I can see the Airflow layer completely taking over, as it has become our interface with the Python package. Now that the YAML is no longer a dependency of the package, we could try to migrate the config to Airflow.

To further document how to use the package, I added two how-to guides.
I also added some additional tests for the CLI. I hope they become living documentation that ensures new steps are properly included and available to the CLI.
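As an illustration of what such a "living documentation" test could look like (assuming the steps are registered under a `step` group in the Hydra config store; names are made up, this is not the actual test added in the PR):

```python
import pytest
from hydra.core.config_store import ConfigStore


def registered_steps() -> list[str]:
    """Step names registered in the Hydra config store under the 'step' group."""
    return [name.replace(".yaml", "") for name in ConfigStore.instance().list("step")]


@pytest.mark.parametrize("step_name", registered_steps())
def test_step_available_from_cli(step_name: str) -> None:
    # A real test would compose the config and/or invoke the CLI entry point;
    # here we only check that the step is known to the config store.
    assert step_name
```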
I also updated the Airflow DAGs. This is not really tested, so I hope I made no mistakes.