Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: application configuration #386

Merged
merged 18 commits into from
Jan 12, 2024
Merged

feat: application configuration #386

merged 18 commits into from
Jan 12, 2024

Conversation

d0choa
Copy link
Collaborator

@d0choa d0choa commented Jan 5, 2024

This is yet another revised version of the application configuration.

Since the removal of the config store, the application could not run independently of a configuration YAML. This was a major limitation, as we could not expect a user of the package, to create a YAML every time they want to run a step.

The intention of this PR is to allow running the application in at least 3 ways:

a) as a CLI providing the appropiate parameters necessary to run a step, or overriding the default parameters
b) as a CLI in combination with YAML configuration to provide parameters
c) imported as a package

This PR implements this by recovering the configstore and the step configuration dataclasses. They contain all the necessary knowledge about what is required to run every step, what's required and what's not and some default parameters. By using, the config store, the application is aware of what are the available steps and what are the necessary parameters to run each of the steps. The application complains otherwise.

Our layer of YAML configuration /config is therefore no longer a dependency of the package. It only contains, the necessary information to populate/override the parameters that we are interested on for our own use-case of the package. This layer is getting slimer and slimer and I can see how the airflow layer should completely take over as it became our interface with the python package. Now that the YAML is no longer a dependency of the package, we could try to migrate the config to Airflow.

To further document how to use the package I added 2 how-to guides:

  • how to run a step using the CLI
  • how to run a step using the CLI + YAML configuration

I also added some additional tests for the CLI. I hope they become living documentation that ensure new steps are properly included and available to the CLI.

I also updated the airflow DAGs. This is not really tested so I hope I made no mistakes.

@codecov-commenter
Copy link

codecov-commenter commented Jan 5, 2024

Codecov Report

Attention: 127 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (8934dea) 86.16%.
Report is 50 commits behind head on dev.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #386      +/-   ##
==========================================
+ Coverage   85.67%   86.16%   +0.49%     
==========================================
  Files          89       97       +8     
  Lines        2101     2624     +523     
==========================================
+ Hits         1800     2261     +461     
- Misses        301      363      +62     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <100.00%> (ø)
src/airflow/dags/dag_preprocess.py 100.00% <ø> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/airflow/dags/gwas_catalog_harmonisation.py 43.47% <ø> (ø)
src/airflow/dags/gwas_curation_update.py 100.00% <100.00%> (ø)
src/otg/cli.py 91.66% <100.00%> (+21.66%) ⬆️
src/otg/common/session.py 87.50% <100.00%> (+0.32%) ⬆️
src/otg/config.py 100.00% <100.00%> (ø)
src/otg/dataset/dataset.py 91.80% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py 82.92% <ø> (+7.31%) ⬆️
... and 36 more

@d0choa d0choa marked this pull request as ready for review January 5, 2024 20:27
Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition. I still haven't tried it in Airflow, but all components are there. I have added a few suggestions.
My summary after review:

  • We now register a configstore where each node is a dataclass for each step. For our use case, YAMLs take the parameters in the config store as reference and extend them.
  • Every YAML has the same attributes as the respective config store, so that they can be overriden.
  • The session class is registered at the time of instantiating the step.
  • Steps are not dataclasses anymore. All the logic has been moved to the generator method.
  • local.yaml (as opposed to dataproc.yaml) has been removed. Default params are provided in the config class.
  • cli.py and the DAGs have been adapted to accommodate new changes

src/otg/config.py Outdated Show resolved Hide resolved
src/otg/config.py Outdated Show resolved Hide resolved
src/otg/l2g.py Outdated Show resolved Hide resolved
src/otg/clump.py Outdated Show resolved Hide resolved
src/otg/clump.py Outdated Show resolved Hide resolved
src/otg/config.py Outdated Show resolved Hide resolved
src/otg/l2g.py Outdated Show resolved Hide resolved
src/otg/l2g.py Outdated Show resolved Hide resolved
docs/howto/run_step_using_config.md Show resolved Hide resolved
tests/test_cli.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tskir tskir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise this is going to be a dissenting opinion, but I still maintain that a config store adds an unnecessary extra layer of complexity and, more importantly, duplication.

If I was an external user, it would probably be easier for me to just add/edit a single YAML file rather than modifying several semantically redundant files across the entire code base.

I am certain that, rather than increasing duplication, we should be striving to reduce it even further. I think that an ideal situation would be for every parameter, its name, default value, type and description to be defined exactly once in the code.

This is certainly possible. GATK is an example of an extremely sophisticated bioinformatics software package with dozens of steps and many hundreds of parameters. (Admittedly it is written in Java, but all of the same general principles apply for Python as well). If we take a random obscure parameter, it is defined in precisely one place of the entire code base and everything, including documentation, is automatically generated from that one definition:

image

I think that, given the complexity of our code base and how it is likely to increase even further, with the current level of semantic duplication it will be very hard to maintain.

If there is truly no way to achieve what we are trying to achieve with Hydra, then it's probably a better decision to replace it with something else, rather than adding so much duplication.

@d0choa
Copy link
Collaborator Author

d0choa commented Jan 12, 2024

I've been thinking about your points @tskir. I think it's important to think twice about these kind of changes.

The problem that this PR comes to solve is the fact that the YAML was a dependency of the package. Someone could not pip install the package and run a step with otg step=XXX without having a yaml along it. This comes to protect the user of the package rather than the developer. Now the CLI will tell you what are the required parameters necessary to run any step and complain if you don't provide them. This is something that the YAML-only version can't do. I know the CLI use case is one you are not particularly keen but I think it's important to simplify how users can run the code beyond our use cases. This also adds another benefit which is a better separation of concerns between what the package allows (configstore) and what we use the package for (YAML).

Talking about the problems. I also see a bit of redundancy. The configs and the steps both have a very similar set of parameters (not exactly the same). I think there is room to think about how to use the config classes to parametrise the steps without having to write/document every parameter again. I think this is the only layer that could be considered redundant and imply some maintanance in my opinion. The other bit to reduce and potentially eliminate are the YAML configurations for the steps. A much more slimmer Airflow configuration with a list of inputs/outputs could be enough and we could superseed the current configs/ directory. Instead of the current Airflow -> configs -> otg we would like a configs -> Airflow -> otg setup.

As for hydra, I had similar thoughts in the past about whether is worth it for our case. But after thinking quite a lot about it, I concluded we are using it for quite a lot of things: composition, interpolation, instantiation, cli,... If we wouldn't use hydra we would probably require quite a lot of code to a reasonably similar functionality.

@d0choa d0choa merged commit c6ca806 into dev Jan 12, 2024
3 checks passed
@d0choa d0choa deleted the do_config_refactor branch January 12, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants