Skip to content

Conversation

@harryswift01
Copy link
Contributor

Summary

This PR introduces a YAML configuration file for CodeEntropy, modularizes the main script, and improves error handling to ensure required inputs are properly managed.

Changes

Modularize main_mcc.py :

  • Split the script into modular functions: load_config, setup_argparse, and merge_configs
  • Removed any argument related to the poseidon code and introduced two new argument force_partitioning and waterEntropy this is mentioned in issue Implement Dual Input Interface (CLI Args & YAML) #32
  • Improve readability and maintainability by separating configuration and input stages.

Enhance error handling :

  • Add specific checks for required inputs top_traj_file and selection_string
  • Raise descriptive errors if required arguments are missing
  • Improve error handling to catch and raise specific errors with clear messages

Merge YAML and CLI configurations :

  • Ensure YAML configuration values are correctly merged with CLI arguments, YAML is the default with CLI arguments overiding any specific value in the YAML file
  • Add detailed comments to explain the purpose and functionality of each section

Impact

  • Improves the organization and maintainability of the codebase
  • Ensures required inputs are properly handled, reducing the likelihood of runtime errors
  • Provides clear error messages, making it easier to debug issues
  • Enhances the overall robustness and user-friendliness of the script

- Modularize 'main_mcc.py' to separate configuration and input stages
 - Functions include: 'load_config", 'setup_argsparse' and 'merge_configs'
- Improved error handling by adding specific checks on inputs that are required
- Improve error handling to catch and raise specific errors with clear messages
- Ensure YAML configuration values are correctly merged with CLI argument
- Add detailed comments to explain the purpose and functionality of each section
@harryswift01 harryswift01 added this to the WP4 - Input/Output milestone Mar 14, 2025
@harryswift01 harryswift01 requested a review from jimboid March 14, 2025 17:14
@harryswift01 harryswift01 self-assigned this Mar 14, 2025
@harryswift01 harryswift01 linked an issue Mar 14, 2025 that may be closed by this pull request
2 tasks
jimboid

This comment was marked as outdated.

@jimboid jimboid self-requested a review March 15, 2025 11:14
Copy link
Member

@jimboid jimboid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I thought that this was fine, but I have been thinking a little bit more about it and I think what we want to do is have a really nice clean format for yaml file. This yaml file is for users to configure the run of the CodeEntropy application, so we want to hide away as much of the complexity with what parameter defaults etc are.

So with this in mind I think we can redesign just slightly what we are doing. So I think we should consider making the following changes:

for the yaml spec, I think that it should go something like this:

<run_name>:
  top_traj_file: <param>
  selection_string: <param>
  start: <param>
  end: <param>
  step: <param>
  bin_width: <param>
  tempra: <param>
  verbose: <param>
  thread: <param>
  outfile: <param>
  resfile: <param>
  mout: <param>
  force_partitioning: <param>
  waterEntropy: <param>

This gives a more simplified yaml structure for the user, and gives the flexibility in future for example to specify the ability to do high throughput processing and configure per run. We should not add this multiprocessing now, but for example you could see the ability to run yamls of the form:

run1:
  top_traj_file: <param>
  selection_string: <param>

run2:
  top_traj_file: <param>
  selection_string: <param>

And control certain variables on a per run basis, we could even have a global setting in yaml in future too that applies to all runs unless overridden!

This then brings me to the information we have taken out of the yaml format in your original PR. I think for example that we can create some kind of structure to hold the information internal to the program. Say if we had a dictionary:

arg_map = {
"tempra": { "type": "float",
"help": "Temperature for entropy calculation (K)",
"default": 298,},
}

You could essentially then do things like the parser add in a for loop instead of lots of explicit declarations over many 10s of lines, and you can use this internal information to set defaults for things not declared in yaml files provided by the user. This dict would then set the place where all things defaulted in the program should then be documented properly later when we rewrite the docs.

In addition to making the above changes, having a dict like that inside the program with the internal logic for defaults, it is then quite easy to write a set of tests for this new part of the code to check that the following scenarios are working properly:

  1. What happens if nothing on the CLI and no yaml is given (do we set defaults and try to run, or crash gracefully, you can test that the correct exception is raised for example)
  2. You can test if the minimum amount of information to run is given (ie files), does it set defaults from the dict correctly
  3. You can test if full run params are on the CLI, do they override the defaults
  4. You can test if the full run params are given in YAML, do they override the defaults
  5. You can test if the above YAML is given and a param on the CLI, is the YAML overriden correctly.
  6. For all of these tests you might need to use mock, to mock out the call to actually run CodeEntropy, so you can just run tests on the input parts of the code.

Another option you might want to consider as well, is that you could for example bring in a library like https://pypi.org/project/voluptuous/ or https://docs.pydantic.dev/latest/ where you would declare your data schema in one for their dict forms and they will give you validation methods for things like type etc once you have brought in data.

Does this all make sense?

- Simplified the YAML input structure to make it more user-friendly and flexible.
- The YAML configuration now supports multiple job runs using a dictionary format, allowing easier scaling for future additions like multiprocessing.
- Changed the YAML format to store run configurations in a dictionary for better internal management.
- If no configuration file is provided or if the config is empty, the logic defaults to a single run using the provided CLI arguments.
- Ensured that necessary arguments (e.g., `top_traj_file`, `selection_string`) are validated, even when falling back to CLI arguments.
- Added tests to ensure the new behavior works as expected, including cases where no YAML file is provided and CLI arguments are used instead.
@jimboid
Copy link
Member

jimboid commented Mar 17, 2025

I think it is just a case of adding pyyaml to the pyproject.toml for the failing test CI. This is as a CE dep.

Copy link
Member

@jimboid jimboid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty awesome! Tests also look very complete and test the scenarios indicated from the previous PR feedback. This is really great work. Approved to merge.

@harryswift01 harryswift01 merged commit 6228b54 into main Mar 17, 2025
6 checks passed
@harryswift01 harryswift01 deleted the 32-implement-dual-input-interface branch March 17, 2025 16:33
@jimboid jimboid modified the milestones: WP4 - Input/Output, 1.0.0 release Jun 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Dual Input Interface (CLI Args & YAML)

3 participants