feat: Enhance process_systems to recursively search all paths in systems list (#5033)

OutisLi · web-flow · commit 25fa7079eed9 · 2025-11-07T14:08:59.000Z
### Description

This PR modifies the `process_systems` utility function to change how it
handles list inputs.

Previously, if the `systems` argument was a `str`, the function would
recursively search that path for systems. However, if `systems` was a
`list`, the function would return a copy of the list without searching
it, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats *every* string
path, whether it is a single `str` input or an item within a `list`, as a
directory to be recursively searched. Internally, it first normalizes the
input into a list of paths and then expands each path with the same code,
raising a `ValueError` for any input that is neither a `str` nor a `list`.
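
The normalize-then-expand flow described above can be sketched roughly as
follows. This is a simplified illustration, not the code in the diff:
`find_systems_sketch` is a hypothetical stand-in for `expand_sys_str` that
assumes a system is any directory containing `type.raw`, and the `patterns`
branch (`rglob_sys_str`) is left out.

```python
from pathlib import Path


def find_systems_sketch(root: str) -> list[str]:
    """Hypothetical stand-in for expand_sys_str.

    Assumption: a system is any directory (root included) holding type.raw.
    """
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw"))


def process_systems_sketch(systems) -> list[str]:
    """Normalize the input to a list of paths, then expand each one."""
    if isinstance(systems, str):
        search_paths = [systems]
    elif isinstance(systems, list):
        search_paths = systems
    else:
        raise ValueError(
            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
        )

    result_systems = []
    for path in search_paths:
        # Every path goes through the same expansion, whatever the input form.
        result_systems.extend(find_systems_sketch(path))
    return result_systems
```

With this shape, `process_systems_sketch("data")` and
`process_systems_sketch(["data"])` return the same systems.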

### Motivation and Justification

The original implementation's inconsistent handling of `str` versus
`list` inputs caused two significant problems:

1. **Broken JSON Configurations:** A very common use case, specifying a
single data directory in `input.json` like `"systems":
["/path/to/training_data"]`, would fail. The function returned the list
unchanged and never searched inside `/path/to/training_data` for the
actual system directories (those containing `type.raw`).
2. **Inability to Aggregate Data:** It was impossible for users to
combine multiple datasets by providing a list of top-level directories,
such as `"systems": ["/path/to/dataset_A", "/path/to/dataset_B"]`.

This change solves both problems by ensuring that paths provided in a
list are searched recursively, just as a single string path would be.
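
To make the difference concrete, the sketch below builds a throwaway
directory layout and compares the old and new handling of a list input.
All function names here are hypothetical, and `discover` only approximates
the real search by assuming a system is a directory containing `type.raw`.

```python
import tempfile
from pathlib import Path


def discover(root: str) -> list[str]:
    """Hypothetical helper: every directory under root that holds type.raw."""
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw"))


def old_list_handling(systems: list[str]) -> list[str]:
    """Pre-change behaviour for list input: copy the list, no search."""
    return systems.copy()


def new_list_handling(systems: list[str]) -> list[str]:
    """Post-change behaviour for list input: search each item and combine."""
    found = []
    for path in systems:
        found.extend(discover(path))
    return found


def run_demo() -> tuple[list[str], list[str]]:
    """Return (old, new) results as paths relative to the temporary root."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Two datasets holding three systems in total.
        for rel in ("dataset_A/sys1", "dataset_A/sys2", "dataset_B/sys3"):
            (base / rel).mkdir(parents=True)
            (base / rel / "type.raw").touch()

        parents = [str(base / "dataset_A"), str(base / "dataset_B")]

        def relative(paths: list[str]) -> list[str]:
            return [Path(p).relative_to(base).as_posix() for p in paths]

        return relative(old_list_handling(parents)), relative(new_list_handling(parents))


old, new = run_demo()
print("old:", old)
print("new:", new)
```

The old handling hands back the two parent directories, neither of which
is itself a system, while the new handling returns the three systems found
beneath them.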

### Benefits

* **Fixes Bug:** Correctly processes the common configuration of a
single-item list in `input.json`.
* **Enables Data Aggregation:** Users can now successfully provide a
list of multiple data directories to be searched and combined.
* **Improves Consistency:** The function's behavior is now intuitive and
consistent, regardless of whether the user provides a single `str` or a
`list[str]`.

## Summary by CodeRabbit

* **Documentation**
  * Clarified how training/validation system paths may be specified: a
    single system directory or a parent directory to recursively discover
    systems; lists of paths are explicitly supported and processed per-item.

* **Improvements**
  * Input handling for system paths enhanced to accept multiple paths and
    consolidate discovered systems, with expanded pattern-based discovery
    for more flexible data input.
diff --git a/deepmd/utils/argcheck.py b/deepmd/utils/argcheck.py
@@ -2993,8 +2993,9 @@ def training_data_args() -> list[
     link_sys = make_link("systems", "training/training_data/systems")
     doc_systems = (
         "The data systems for training. "
-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all training systems. "
@@ -3074,8 +3075,9 @@ def validation_data_args() -> list[
     link_sys = make_link("systems", "training/validation_data/systems")
     doc_systems = (
         "The data systems for validation. "
-        "This key can be provided with a list that specifies the systems, or be provided with a string "
-        "by which the prefix of all systems are given and the list of the systems is automatically generated."
+        "This key can be a list or a str. "
+        "When provided as a string, it can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories. "
+        "When provided as a list, each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories."
     )
     doc_patterns = (
         "The customized patterns used in `rglob` to collect all validation systems. "
diff --git a/deepmd/utils/data_system.py b/deepmd/utils/data_system.py
@@ -790,6 +790,7 @@ def process_systems(
     """Process the user-input systems.
 
     If it is a single directory, search for all the systems in the directory.
+    If it is a list, each item in the list is treated as a directory to search.
     Check if the systems are valid.
 
     Parameters
@@ -801,17 +802,31 @@ def process_systems(
 
     Returns
     -------
-    list of str
+    result_systems: list of str
         The valid systems
     """
+    # Normalize input to a list of paths to search
     if isinstance(systems, str):
+        search_paths = [systems]
+    elif isinstance(systems, list):
+        search_paths = systems
+    else:
+        # Handle unsupported input types
+        raise ValueError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )
+
+    # Iterate over the search_paths list and apply expansion logic to each path
+    result_systems = []
+    for path in search_paths:
         if patterns is None:
-            systems = expand_sys_str(systems)
+            expanded_paths = expand_sys_str(path)
         else:
-            systems = rglob_sys_str(systems, patterns)
-    elif isinstance(systems, list):
-        systems = systems.copy()
-    return systems
+            expanded_paths = rglob_sys_str(path, patterns)
+
+        result_systems.extend(expanded_paths)
+
+    return result_systems
 
 
 def get_data(
diff --git a/doc/train/training-advanced.md b/doc/train/training-advanced.md
@@ -76,8 +76,8 @@ Other training parameters are given in the {ref}`training <training>` section.
 The sections {ref}`training_data <training/training_data>` and {ref}`validation_data <training/validation_data>` give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
 
 - {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
-  - `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
-  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.
+  - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. It can be a system directory path (containing 'type.raw') or a parent directory path to recursively search for all system subdirectories.
+  - `list`: {ref}`systems <training/training_data/systems>` gives a list of paths. Each string item in the list is processed the same way as individual string inputs, i.e., each path can be a system directory or a parent directory to recursively search for all system subdirectories.
 - At each training step, DeePMD-kit randomly picks {ref}`batch_size <training/training_data/batch_size>` frame(s) from one of the systems. The probability of using a system is by default in proportion to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key {ref}`auto_prob <training/training_data/auto_prob>` to
   - `"prob_uniform"` all systems are used with the same probability.
   - `"prob_sys_size"` the probability of using a system is proportional to its size (number of frames).