Commit 25fa707

feat: Enhance process_systems to recursively search all paths in systems list (#5033)
### Description

This PR modifies the `process_systems` utility function to change how it handles list inputs. Previously, if the `systems` argument was a `str`, the function would recursively search that path for systems. However, if `systems` was a `list`, the function would return the list as-is, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats *every* string path, whether it's a single `str` input or an item within a `list`, as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.

### Motivation and Justification

The original implementation's inconsistent handling of `str` versus `list` inputs caused two significant problems:

1. **Broken JSON Configurations:** A very common use case, specifying a single data directory in `input.json` like `"systems": ["/path/to/training_data"]`, would fail. The function would not search inside `/path/to/training_data` for the actual system directories (e.g., `set.000`, `set.001`, etc.).
2. **Inability to Aggregate Data:** It was impossible for users to combine multiple datasets by providing a list of top-level directories, such as `"systems": ["/path/to/dataset_A", "/path/to/dataset_B"]`.

This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.

### Benefits

* **Fixes Bug:** Correctly processes the common configuration of a single-item list in `input.json`.
* **Enables Data Aggregation:** Users can now successfully provide a list of multiple data directories to be searched and combined.
* **Improves Consistency:** The function's behavior is now intuitive and consistent, regardless of whether the user provides a single `str` or a `list[str]`.
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Clarified how training/validation system paths may be specified: a single system directory or a parent directory to recursively discover systems; lists of paths are explicitly supported and processed per-item.
* **Improvements**
  * Input handling for system paths enhanced to accept multiple paths and consolidate discovered systems, with expanded pattern-based discovery for more flexible data input.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
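
To make the motivation concrete, here is a minimal sketch contrasting the list handling before and after this change. It is not DeePMD-kit code: `find_systems`, `old_behavior`, and `new_behavior` are hypothetical, simplified stand-ins, and the sketch assumes a directory counts as a system when it directly contains a `type.raw` file (the marker named in the updated documentation). The directory names `sys_1`, `sys_2`, and `sys_3` are made up for the demo.

```python
import tempfile
from pathlib import Path


def find_systems(root: str) -> list[str]:
    """Hypothetical stand-in for the recursive search.

    Assumption: a directory is a system if it directly contains 'type.raw'.
    """
    root_dir = Path(root)
    found = [str(d) for d in root_dir.rglob("*") if (d / "type.raw").is_file()]
    if (root_dir / "type.raw").is_file():
        found.append(str(root_dir))
    return sorted(found)


def old_behavior(systems):
    """Simplified model of the pre-change logic: lists are returned as-is."""
    if isinstance(systems, str):
        return find_systems(systems)
    return list(systems)


def new_behavior(systems):
    """Simplified model of the post-change logic: every path is searched."""
    paths = [systems] if isinstance(systems, str) else systems
    result = []
    for path in paths:
        result.extend(find_systems(path))
    return result


def demo():
    """Build two parent directories and compare both models on a list input."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("dataset_A/sys_1", "dataset_A/sys_2", "dataset_B/sys_3"):
            system_dir = Path(tmp) / name
            system_dir.mkdir(parents=True)
            (system_dir / "type.raw").touch()
        parents = [str(Path(tmp) / "dataset_A"), str(Path(tmp) / "dataset_B")]
        old_unchanged = old_behavior(parents) == parents
        new_names = [Path(p).name for p in new_behavior(parents)]
        return old_unchanged, new_names


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, the old model hands the two parent directories back untouched, while the new model returns the three system directories found beneath them.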
1 parent 7778e2e commit 25fa707

3 files changed: +29 -12 lines changed

deepmd/utils/argcheck.py

Lines changed: 6 additions & 4 deletions
@@ -2993,8 +2993,9 @@ def training_data_args() -> list[
     link_sys = make_link("systems", "training/training_data/systems")
     doc_systems = (
         "The data systems for training. "
-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all training systems. "
@@ -3074,8 +3075,9 @@ def validation_data_args() -> list[
     link_sys = make_link("systems", "training/validation_data/systems")
     doc_systems = (
         "The data systems for validation. "
-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all validation systems. "

deepmd/utils/data_system.py

Lines changed: 21 additions & 6 deletions
@@ -790,6 +790,7 @@ def process_systems(
     """Process the user-input systems.
 
     If it is a single directory, search for all the systems in the directory.
+    If it is a list, each item in the list is treated as a directory to search.
     Check if the systems are valid.
 
     Parameters
@@ -801,17 +802,31 @@ def process_systems(
 
     Returns
     -------
-    list of str
+    result_systems: list of str
         The valid systems
     """
+    # Normalize input to a list of paths to search
     if isinstance(systems, str):
+        search_paths = [systems]
+    elif isinstance(systems, list):
+        search_paths = systems
+    else:
+        # Handle unsupported input types
+        raise ValueError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )
+
+    # Iterate over the search_paths list and apply expansion logic to each path
+    result_systems = []
+    for path in search_paths:
         if patterns is None:
-            systems = expand_sys_str(systems)
+            expanded_paths = expand_sys_str(path)
         else:
-            systems = rglob_sys_str(systems, patterns)
-    elif isinstance(systems, list):
-        systems = systems.copy()
-    return systems
+            expanded_paths = rglob_sys_str(path, patterns)
+
+        result_systems.extend(expanded_paths)
+
+    return result_systems
 
 
 def get_data(
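
Two consequences of the new control flow can be checked with a small sketch. It mirrors the branching in the diff above but is not the library function: `expand_stub` is an assumed stand-in for `expand_sys_str` (its real implementation is not shown in this commit), the `patterns` branch is omitted for brevity, and the directory names are invented. First, results are concatenated with `extend`, so overlapping entries in the list appear to be returned more than once. Second, any input that is neither `str` nor `list`, such as a tuple, now raises `ValueError`.

```python
import tempfile
from pathlib import Path


def expand_stub(root: str) -> list[str]:
    """Assumed stand-in for expand_sys_str: directories containing 'type.raw'."""
    root_dir = Path(root)
    found = [str(d) for d in root_dir.rglob("*") if (d / "type.raw").is_file()]
    if (root_dir / "type.raw").is_file():
        found.append(str(root_dir))
    return sorted(found)


def process_systems_sketch(systems):
    """Mirror of the diff's control flow, without the `patterns` branch."""
    if isinstance(systems, str):
        search_paths = [systems]
    elif isinstance(systems, list):
        search_paths = systems
    else:
        raise ValueError(
            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
        )
    result = []
    for path in search_paths:
        result.extend(expand_stub(path))
    return result


def demo_overlap():
    """Return (number of paths found, number of distinct paths)."""
    with tempfile.TemporaryDirectory() as tmp:
        system_dir = Path(tmp) / "parent" / "sys_1"
        system_dir.mkdir(parents=True)
        (system_dir / "type.raw").touch()
        # The second entry lies inside the first, so it is discovered twice.
        overlapping = [str(Path(tmp) / "parent"), str(system_dir)]
        found = process_systems_sketch(overlapping)
        unique = list(dict.fromkeys(found))  # order-preserving deduplication
        return len(found), len(unique)


def rejects_tuple() -> bool:
    """Check that a tuple input raises ValueError in this sketch."""
    try:
        process_systems_sketch(("a", "b"))
    except ValueError:
        return True
    return False
```

If overlapping parent directories are a realistic input, deduplicating on the caller's side (for example with `dict.fromkeys`, as in the sketch) is one option; whether duplicates matter in practice depends on how the resulting list is consumed downstream.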

doc/train/training-advanced.md

Lines changed: 2 additions & 2 deletions
@@ -76,8 +76,8 @@ Other training parameters are given in the {ref}`training <training>` section.
 The sections {ref}`training_data <training/training_data>` and {ref}`validation_data <training/validation_data>` give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
 
 - {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
-  - `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
-  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.
+  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. It can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories.
+  - `list`: {ref}`systems <training/training_data/systems>` gives a list of paths. Each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories.
 - At each training step, DeePMD-kit randomly picks {ref}`batch_size <training/training_data/batch_size>` frame(s) from one of the systems. The probability of using a system is by default in proportion to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key {ref}`auto_prob <training/training_data/auto_prob>` to
   - `"prob_uniform"` all systems are used with the same probability.
   - `"prob_sys_size"` the probability of using a system is proportional to its size (number of frames).
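
As an illustration of the two documented forms, the sketch below builds the relevant fragment of an `input.json` as a Python dictionary and serializes it. Only the `training/training_data/systems` key path and the example paths come from the commit message; the variable names are hypothetical, and a real configuration would need the other keys that DeePMD-kit requires.

```python
import json

# Form 1: a single string, either a system directory or a parent directory.
single_path_config = {
    "training": {"training_data": {"systems": "/path/to/training_data"}}
}

# Form 2: a list of paths; per the updated docs, each item is processed
# the same way as a single string input.
multi_path_config = {
    "training": {
        "training_data": {
            "systems": ["/path/to/dataset_A", "/path/to/dataset_B"]
        }
    }
}

# Serialize the list form as it might appear inside input.json.
config_text = json.dumps(multi_path_config, indent=2)

if __name__ == "__main__":
    print(config_text)
```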
