-
Notifications
You must be signed in to change notification settings - Fork 1
Introduction of a YAML configuration file for CodeEntropy: #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Modularize 'main_mcc.py' to separate configuration and input stages - Functions include: 'load_config", 'setup_argsparse' and 'merge_configs' - Improved error handling by adding specific checks on inputs that are required - Improve error handling to catch and raise specific errors with clear messages - Ensure YAML configuration values are correctly merged with CLI argument - Add detailed comments to explain the purpose and functionality of each section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I thought that this was fine, but I have been thinking a little bit more about it and I think what we want to do is have a really nice clean format for yaml file. This yaml file is for users to configure the run of the CodeEntropy application, so we want to hide away as much of the complexity with what parameter defaults etc are.
So with this in mind I think we can redesign just slightly what we are doing. So I think we should consider making the following changes:
for the yaml spec, I think that it should go something like this:
<run_name>:
top_traj_file: <param>
selection_string: <param>
start: <param>
end: <param>
step: <param>
bin_width: <param>
tempra: <param>
verbose: <param>
thread: <param>
outfile: <param>
resfile: <param>
mout: <param>
force_partitioning: <param>
waterEntropy: <param>
This gives a more simplified yaml structure for the user, and gives the flexibility in future for example to specify the ability to do high throughput processing and configure per run. We should not add this multiprocessing now, but for example you could see the ability to run yamls of the form:
run1:
top_traj_file: <param>
selection_string: <param>
run2:
top_traj_file: <param>
selection_string: <param>
And control certain variables on a per run basis, we could even have a global setting in yaml in future too that applies to all runs unless overridden!
This then brings me to the information we have taken out of the yaml format in your original PR. I think for example that we can create some kind of structure to hold the information internal to the program. Say if we had a dictionary:
arg_map = {
"tempra": { "type": "float",
"help": "Temperature for entropy calculation (K)",
"default": 298,},
}
You could essentially then do things like the parser add in a for loop instead of lots of explicit declarations over many 10s of lines, and you can use this internal information to set defaults for things not declared in yaml files provided by the user. This dict would then set the place where all things defaulted in the program should then be documented properly later when we rewrite the docs.
In addition to making the above changes, having a dict like that inside the program with the internal logic for defaults, it is then quite easy to write a set of tests for this new part of the code to check that the following scenarios are working properly:
- What happens if nothing on the CLI and no yaml is given (do we set defaults and try to run, or crash gracefully, you can test that the correct exception is raised for example)
- You can test if the minimum amount of information to run is given (ie files), does it set defaults from the dict correctly
- You can test if full run params are on the CLI, do they override the defaults
- You can test if the full run params are given in YAML, do they override the defaults
- You can test if the above YAML is given and a param on the CLI, is the YAML overriden correctly.
- For all of these tests you might need to use mock, to mock out the call to actually run CodeEntropy, so you can just run tests on the input parts of the code.
Another option you might want to consider as well, is that you could for example bring in a library like https://pypi.org/project/voluptuous/ or https://docs.pydantic.dev/latest/ where you would declare your data schema in one for their dict forms and they will give you validation methods for things like type etc once you have brought in data.
Does this all make sense?
- Simplified the YAML input structure to make it more user-friendly and flexible. - The YAML configuration now supports multiple job runs using a dictionary format, allowing easier scaling for future additions like multiprocessing. - Changed the YAML format to store run configurations in a dictionary for better internal management. - If no configuration file is provided or if the config is empty, the logic defaults to a single run using the provided CLI arguments. - Ensured that necessary arguments (e.g., `top_traj_file`, `selection_string`) are validated, even when falling back to CLI arguments. - Added tests to ensure the new behavior works as expected, including cases where no YAML file is provided and CLI arguments are used instead.
…e pyproject.toml file
|
I think it is just a case of adding pyyaml to the pyproject.toml for the failing test CI. This is as a CE dep. |
jimboid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty awesome! Tests also look very complete and test the scenarios indicated from the previous PR feedback. This is really great work. Approved to merge.
Summary
This PR introduces a YAML configuration file for CodeEntropy, modularizes the main script, and improves error handling to ensure required inputs are properly managed.
Changes
Modularize
main_mcc.py:load_config,setup_argparse, andmerge_configsforce_partitioningandwaterEntropythis is mentioned in issue Implement Dual Input Interface (CLI Args & YAML) #32Enhance error handling :
top_traj_fileandselection_stringMerge YAML and CLI configurations :
Impact