Introduction of a YAML configuration file for CodeEntropy: #65

harryswift01 · 2025-03-14T17:14:15Z

Summary

This PR introduces a YAML configuration file for CodeEntropy, modularizes the main script, and improves error handling to ensure required inputs are properly managed.

Changes

Modularize main_mcc.py :

Split the script into modular functions: load_config, setup_argparse, and merge_configs
Removed any argument related to the poseidon code and introduced two new argument force_partitioning and waterEntropy this is mentioned in issue Implement Dual Input Interface (CLI Args & YAML) #32
Improve readability and maintainability by separating configuration and input stages.

Enhance error handling :

Add specific checks for required inputs top_traj_file and selection_string
Raise descriptive errors if required arguments are missing
Improve error handling to catch and raise specific errors with clear messages

Merge YAML and CLI configurations :

Ensure YAML configuration values are correctly merged with CLI arguments, YAML is the default with CLI arguments overiding any specific value in the YAML file
Add detailed comments to explain the purpose and functionality of each section

Impact

Improves the organization and maintainability of the codebase
Ensures required inputs are properly handled, reducing the likelihood of runtime errors
Provides clear error messages, making it easier to debug issues
Enhances the overall robustness and user-friendliness of the script

- Modularize 'main_mcc.py' to separate configuration and input stages - Functions include: 'load_config", 'setup_argsparse' and 'merge_configs' - Improved error handling by adding specific checks on inputs that are required - Improve error handling to catch and raise specific errors with clear messages - Ensure YAML configuration values are correctly merged with CLI argument - Add detailed comments to explain the purpose and functionality of each section

jimboid

Initially I thought that this was fine, but I have been thinking a little bit more about it and I think what we want to do is have a really nice clean format for yaml file. This yaml file is for users to configure the run of the CodeEntropy application, so we want to hide away as much of the complexity with what parameter defaults etc are.

So with this in mind I think we can redesign just slightly what we are doing. So I think we should consider making the following changes:

for the yaml spec, I think that it should go something like this:

<run_name>:
  top_traj_file: <param>
  selection_string: <param>
  start: <param>
  end: <param>
  step: <param>
  bin_width: <param>
  tempra: <param>
  verbose: <param>
  thread: <param>
  outfile: <param>
  resfile: <param>
  mout: <param>
  force_partitioning: <param>
  waterEntropy: <param>

This gives a more simplified yaml structure for the user, and gives the flexibility in future for example to specify the ability to do high throughput processing and configure per run. We should not add this multiprocessing now, but for example you could see the ability to run yamls of the form:

run1:
  top_traj_file: <param>
  selection_string: <param>

run2:
  top_traj_file: <param>
  selection_string: <param>

And control certain variables on a per run basis, we could even have a global setting in yaml in future too that applies to all runs unless overridden!

This then brings me to the information we have taken out of the yaml format in your original PR. I think for example that we can create some kind of structure to hold the information internal to the program. Say if we had a dictionary:

arg_map = {
"tempra": { "type": "float",
"help": "Temperature for entropy calculation (K)",
"default": 298,},
}

You could essentially then do things like the parser add in a for loop instead of lots of explicit declarations over many 10s of lines, and you can use this internal information to set defaults for things not declared in yaml files provided by the user. This dict would then set the place where all things defaulted in the program should then be documented properly later when we rewrite the docs.

In addition to making the above changes, having a dict like that inside the program with the internal logic for defaults, it is then quite easy to write a set of tests for this new part of the code to check that the following scenarios are working properly:

What happens if nothing on the CLI and no yaml is given (do we set defaults and try to run, or crash gracefully, you can test that the correct exception is raised for example)
You can test if the minimum amount of information to run is given (ie files), does it set defaults from the dict correctly
You can test if full run params are on the CLI, do they override the defaults
You can test if the full run params are given in YAML, do they override the defaults
You can test if the above YAML is given and a param on the CLI, is the YAML overriden correctly.
For all of these tests you might need to use mock, to mock out the call to actually run CodeEntropy, so you can just run tests on the input parts of the code.

Another option you might want to consider as well, is that you could for example bring in a library like https://pypi.org/project/voluptuous/ or https://docs.pydantic.dev/latest/ where you would declare your data schema in one for their dict forms and they will give you validation methods for things like type etc once you have brought in data.

Does this all make sense?

- Simplified the YAML input structure to make it more user-friendly and flexible. - The YAML configuration now supports multiple job runs using a dictionary format, allowing easier scaling for future additions like multiprocessing. - Changed the YAML format to store run configurations in a dictionary for better internal management. - If no configuration file is provided or if the config is empty, the logic defaults to a single run using the provided CLI arguments. - Ensured that necessary arguments (e.g., `top_traj_file`, `selection_string`) are validated, even when falling back to CLI arguments. - Added tests to ensure the new behavior works as expected, including cases where no YAML file is provided and CLI arguments are used instead.

…e pyproject.toml file

jimboid · 2025-03-17T16:23:17Z

I think it is just a case of adding pyyaml to the pyproject.toml for the failing test CI. This is as a CE dep.

jimboid

This looks pretty awesome! Tests also look very complete and test the scenarios indicated from the previous PR feedback. This is really great work. Approved to merge.

harryswift01 added 2 commits March 14, 2025 16:49

Addition of 'force_partitioning' and 'waterEntropy' arguments

3bc5250

harryswift01 added this to the WP4 - Input/Output milestone Mar 14, 2025

harryswift01 requested a review from jimboid March 14, 2025 17:14

harryswift01 self-assigned this Mar 14, 2025

harryswift01 linked an issue Mar 14, 2025 that may be closed by this pull request

Implement Dual Input Interface (CLI Args & YAML) #32

Closed

2 tasks

This comment was marked as outdated.

Sign in to view

jimboid self-requested a review March 15, 2025 11:14

jimboid requested changes Mar 15, 2025

View reviewed changes

harryswift01 added 2 commits March 17, 2025 16:12

Removal of .coveragerc file and included PyYAML as a dependency in th…

531b58a

…e pyproject.toml file

Updated version of PyYAML

ad0af42

jimboid approved these changes Mar 17, 2025

View reviewed changes

harryswift01 merged commit 6228b54 into main Mar 17, 2025
6 checks passed

harryswift01 deleted the 32-implement-dual-input-interface branch March 17, 2025 16:33

harryswift01 mentioned this pull request Mar 20, 2025

Argument Mismatch When Implementing Dual Input Interface #68

Closed

jimboid modified the milestones: WP4 - Input/Output, 1.0.0 release Jun 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Introduction of a YAML configuration file for CodeEntropy: #65

Introduction of a YAML configuration file for CodeEntropy: #65

Uh oh!

harryswift01 commented Mar 14, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

jimboid left a comment •

edited

Loading

Uh oh!

jimboid commented Mar 17, 2025 •

edited

Loading

Uh oh!

jimboid left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Introduction of a YAML configuration file for CodeEntropy: #65

Introduction of a YAML configuration file for CodeEntropy: #65

Uh oh!

Conversation

harryswift01 commented Mar 14, 2025

Summary

Changes

Impact

Uh oh!

This comment was marked as outdated.

Uh oh!

jimboid left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jimboid commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jimboid left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jimboid left a comment •

edited

Loading

jimboid commented Mar 17, 2025 •

edited

Loading