feat: application configuration #386

d0choa · 2024-01-05T17:30:11Z

This is yet another revised version of the application configuration.

Since the removal of the config store, the application could not run independently of a configuration YAML. This was a major limitation, as we could not expect a user of the package, to create a YAML every time they want to run a step.

The intention of this PR is to allow running the application in at least 3 ways:

a) as a CLI providing the appropiate parameters necessary to run a step, or overriding the default parameters
b) as a CLI in combination with YAML configuration to provide parameters
c) imported as a package

This PR implements this by recovering the configstore and the step configuration dataclasses. They contain all the necessary knowledge about what is required to run every step, what's required and what's not and some default parameters. By using, the config store, the application is aware of what are the available steps and what are the necessary parameters to run each of the steps. The application complains otherwise.

Our layer of YAML configuration /config is therefore no longer a dependency of the package. It only contains, the necessary information to populate/override the parameters that we are interested on for our own use-case of the package. This layer is getting slimer and slimer and I can see how the airflow layer should completely take over as it became our interface with the python package. Now that the YAML is no longer a dependency of the package, we could try to migrate the config to Airflow.

To further document how to use the package I added 2 how-to guides:

how to run a step using the CLI
how to run a step using the CLI + YAML configuration

I also added some additional tests for the CLI. I hope they become living documentation that ensure new steps are properly included and available to the CLI.

I also updated the airflow DAGs. This is not really tested so I hope I made no mistakes.

for more information, see https://pre-commit.ci

codecov-commenter · 2024-01-05T20:25:45Z

Codecov Report

Attention: 127 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (8934dea) 86.16%.
Report is 50 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #386      +/-   ##
==========================================
+ Coverage   85.67%   86.16%   +0.49%     
==========================================
  Files          89       97       +8     
  Lines        2101     2624     +523     
==========================================
+ Hits         1800     2261     +461     
- Misses        301      363      +62

Files	Coverage Δ
src/airflow/dags/common_airflow.py	`90.38% <100.00%> (ø)`
src/airflow/dags/dag_preprocess.py	`100.00% <ø> (ø)`
src/airflow/dags/finngen_preprocess.py	`100.00% <100.00%> (ø)`
src/airflow/dags/gwas_catalog_harmonisation.py	`43.47% <ø> (ø)`
src/airflow/dags/gwas_curation_update.py	`100.00% <100.00%> (ø)`
src/otg/cli.py	`91.66% <100.00%> (+21.66%)`	⬆️
src/otg/common/session.py	`87.50% <100.00%> (+0.32%)`	⬆️
src/otg/config.py	`100.00% <100.00%> (ø)`
src/otg/dataset/dataset.py	`91.80% <100.00%> (ø)`
src/otg/dataset/l2g_feature_matrix.py	`82.92% <ø> (+7.31%)`	⬆️
... and 36 more

ireneisdoomed

Great addition. I still haven't tried it in Airflow, but all components are there. I have added a few suggestions.
My summary after review:

We now register a configstore where each node is a dataclass for each step. For our use case, YAMLs take the parameters in the config store as reference and extend them.
Every YAML has the same attributes as the respective config store, so that they can be overriden.
The session class is registered at the time of instantiating the step.
Steps are not dataclasses anymore. All the logic has been moved to the generator method.
local.yaml (as opposed to dataproc.yaml) has been removed. Default params are provided in the config class.
cli.py and the DAGs have been adapted to accommodate new changes

src/otg/config.py

src/otg/l2g.py

src/otg/clump.py

src/otg/config.py

src/otg/l2g.py

docs/howto/run_step_using_config.md

tests/test_cli.py

tskir

I realise this is going to be a dissenting opinion, but I still maintain that a config store adds an unnecessary extra layer of complexity and, more importantly, duplication.

If I was an external user, it would probably be easier for me to just add/edit a single YAML file rather than modifying several semantically redundant files across the entire code base.

I am certain that, rather than increasing duplication, we should be striving to reduce it even further. I think that an ideal situation would be for every parameter, its name, default value, type and description to be defined exactly once in the code.

This is certainly possible. GATK is an example of an extremely sophisticated bioinformatics software package with dozens of steps and many hundreds of parameters. (Admittedly it is written in Java, but all of the same general principles apply for Python as well). If we take a random obscure parameter, it is defined in precisely one place of the entire code base and everything, including documentation, is automatically generated from that one definition:

I think that, given the complexity of our code base and how it is likely to increase even further, with the current level of semantic duplication it will be very hard to maintain.

If there is truly no way to achieve what we are trying to achieve with Hydra, then it's probably a better decision to replace it with something else, rather than adding so much duplication.

d0choa · 2024-01-12T11:04:07Z

I've been thinking about your points @tskir. I think it's important to think twice about these kind of changes.

The problem that this PR comes to solve is the fact that the YAML was a dependency of the package. Someone could not pip install the package and run a step with otg step=XXX without having a yaml along it. This comes to protect the user of the package rather than the developer. Now the CLI will tell you what are the required parameters necessary to run any step and complain if you don't provide them. This is something that the YAML-only version can't do. I know the CLI use case is one you are not particularly keen but I think it's important to simplify how users can run the code beyond our use cases. This also adds another benefit which is a better separation of concerns between what the package allows (configstore) and what we use the package for (YAML).

Talking about the problems. I also see a bit of redundancy. The configs and the steps both have a very similar set of parameters (not exactly the same). I think there is room to think about how to use the config classes to parametrise the steps without having to write/document every parameter again. I think this is the only layer that could be considered redundant and imply some maintanance in my opinion. The other bit to reduce and potentially eliminate are the YAML configurations for the steps. A much more slimmer Airflow configuration with a list of inputs/outputs could be enough and we could superseed the current configs/ directory. Instead of the current Airflow -> configs -> otg we would like a configs -> Airflow -> otg setup.

As for hydra, I had similar thoughts in the past about whether is worth it for our case. But after thinking quite a lot about it, I concluded we are using it for quite a lot of things: composition, interpolation, instantiation, cli,... If we wouldn't use hydra we would probably require quite a lot of code to a reasonably similar functionality.

d0choa and others added 9 commits January 5, 2024 15:06

feat: restoring config store

cd1eaa9

test: adjusting tests

6a38c70

fix: adjust dags to new configs

7107d0c

fix: missing config rename

2b8806d

docs: cli and config how-to to assist using the package

81aeff5

chore: merge dev

47dcdad

[pre-commit.ci] auto fixes from pre-commit.com hooks

99b036e

for more information, see https://pre-commit.ci

fix: optional parameter

b5742d7

fix: incorrect id

86850b4

d0choa marked this pull request as ready for review January 5, 2024 20:27

d0choa requested review from ireneisdoomed, tskir and DSuveges January 5, 2024 20:27

ireneisdoomed approved these changes Jan 8, 2024

View reviewed changes

tskir reviewed Jan 9, 2024

View reviewed changes

d0choa added 7 commits January 10, 2024 15:09

chore: merge dev

eb284a3

chore: merge dev

a8a2454

test: update CLI options

343de45

fix: rename class

c8a8b6b

test: include missing steps in CLI tests

51c1992

fix: optional arguments are proper composition

64ce9bb

chore: merge dev

831a3e2

d0choa added 2 commits January 12, 2024 11:32

test: migrate tests to unittest.mock

8934dea

chore: merge dev

3c5eb47

d0choa merged commit c6ca806 into dev Jan 12, 2024
3 checks passed

d0choa deleted the do_config_refactor branch January 12, 2024 12:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: application configuration #386

feat: application configuration #386

d0choa commented Jan 5, 2024

codecov-commenter commented Jan 5, 2024 •

edited

Loading

ireneisdoomed left a comment

tskir left a comment •

edited

Loading

d0choa commented Jan 12, 2024

feat: application configuration #386

feat: application configuration #386

Conversation

d0choa commented Jan 5, 2024

codecov-commenter commented Jan 5, 2024 • edited Loading

Codecov Report

ireneisdoomed left a comment

Choose a reason for hiding this comment

tskir left a comment • edited Loading

Choose a reason for hiding this comment

d0choa commented Jan 12, 2024

codecov-commenter commented Jan 5, 2024 •

edited

Loading

tskir left a comment •

edited

Loading