feat: Enhance process_systems to recursively search all paths in systems list (#5033)
### Description
This PR changes how the `process_systems` utility function handles list
inputs.
Previously, if the `systems` argument was a `str`, the function would
recursively search that path for systems. However, if `systems` was a
`list`, the function would return the list as-is, assuming it was
already a complete list of system paths.
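
A minimal sketch of the earlier branching described above, for illustration only. It assumes a system directory is one that contains a `type.raw` file, as the updated documentation states; `legacy_process_systems_sketch` is a hypothetical name, not the project's actual code.

```python
from __future__ import annotations

from pathlib import Path


def legacy_process_systems_sketch(systems: str | list[str]) -> list[str]:
    """Illustrative model of the OLD behavior (hypothetical helper)."""
    if isinstance(systems, str):
        # A single string was treated as a root to search recursively.
        # Assumption: a system directory is one that holds 'type.raw'.
        root = Path(systems)
        return sorted(str(p.parent) for p in root.rglob("type.raw"))
    # A list was assumed to be complete already and returned without searching.
    return list(systems)
```

Under this model, passing `["/path/to/training_data"]` yields the parent directory itself rather than the systems inside it.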
This update unifies the logic. The function now treats *every* string
path—whether it's a single `str` input or an item within a `list`—as a
directory to be recursively searched. It also refactors the internal
logic to first normalize the input into a list of paths and then process
them uniformly, improving code clarity and maintainability.
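
The normalize-then-search flow can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged implementation: `find_system_dirs` and `process_systems_sketch` are hypothetical names, a system is assumed to be any directory containing `type.raw`, and custom `rglob` patterns are not modeled.

```python
from __future__ import annotations

from pathlib import Path


def find_system_dirs(root: str | Path) -> list[str]:
    """Return every directory at or below `root` that contains 'type.raw'."""
    # rglob also matches a 'type.raw' sitting directly in `root`, so a path
    # that is itself a system directory is returned as-is.
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw"))


def process_systems_sketch(systems: str | list[str]) -> list[str]:
    """Illustrative model of the NEW behavior (hypothetical helper)."""
    # Step 1: normalize the input into a list of paths.
    paths = [systems] if isinstance(systems, str) else list(systems)
    # Step 2: search every path the same way and combine the results.
    found: list[str] = []
    for path in paths:
        found.extend(find_system_dirs(path))
    return found
```

With this shape, `process_systems_sketch("/data/A")` and `process_systems_sketch(["/data/A"])` return the same list, and a list of several parent directories returns the systems found under each of them.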
### Motivation and Justification
The original implementation's inconsistent handling of `str` versus
`list` inputs caused two significant problems:
1. **Broken JSON Configurations:** A very common use case, specifying a
single data directory in `input.json` like `"systems":
["/path/to/training_data"]`, would fail. The function would not search
inside `/path/to/training_data` for the actual system directories (those
containing `type.raw`).
2. **Inability to Aggregate Data:** It was impossible for users to
combine multiple datasets by providing a list of top-level directories,
such as `"systems": ["/path/to/dataset_A", "/path/to/dataset_B"]`.
This change solves both problems by ensuring that paths provided in a
list are searched recursively, just as a single string path would be.
### Benefits
* **Fixes Bug:** Correctly processes the common configuration of a
single-item list in `input.json`.
* **Enables Data Aggregation:** Users can now successfully provide a
list of multiple data directories to be searched and combined.
* **Improves Consistency:** The function's behavior is now intuitive and
consistent, regardless of whether the user provides a single `str` or a
`list[str]`.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
  * Clarified how training/validation system paths may be specified: a
    single system directory or a parent directory to recursively discover
    systems; lists of paths are explicitly supported and processed per-item.
* **Improvements**
  * Input handling for system paths enhanced to accept multiple paths and
    consolidate discovered systems, with expanded pattern-based discovery
    for more flexible data input.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all training systems. "

-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all validation systems. "
doc/train/training-advanced.md (2 additions & 2 deletions)
@@ -76,8 +76,8 @@ Other training parameters are given in the {ref}`training <training>` section.
 The sections {ref}`training_data <training/training_data>` and {ref}`validation_data <training/validation_data>` give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
 
 - {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
-  - `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
-  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.
+  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. It can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories.
+  - `list`: {ref}`systems <training/training_data/systems>` gives a list of paths. Each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories.
 - At each training step, DeePMD-kit randomly picks {ref}`batch_size <training/training_data/batch_size>` frame(s) from one of the systems. The probability of using a system is by default in proportion to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key {ref}`auto_prob <training/training_data/auto_prob>` to
   - `"prob_uniform"` all systems are used with the same probability.
   - `"prob_sys_size"` the probability of using a system is proportional to its size (number of frames).