feat: Performance Optimization: Data Loading and Statistics Acceleration #5040
Conversation
Pull Request Overview
This PR introduces several performance optimizations for loading and processing data in the DeepMD-kit framework:
- Adds a `parent` property to the `DPPath` abstract class and its implementations
- Optimizes file system traversal by directly searching for `type.raw` files instead of iterating over all directories
- Implements parallel loading of statistics files using `ThreadPoolExecutor`
- Optimizes frame count retrieval by reading only `.npy` file headers instead of loading entire arrays
- Parallelizes dataset processing for statistics gathering
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| deepmd/utils/path.py | Adds abstract parent property to DPPath and implements it in DPOSPath and DPH5Path |
| deepmd/common.py | Optimizes expand_sys_str to use direct file globbing instead of directory iteration |
| deepmd/utils/env_mat_stat.py | Parallelizes statistics file loading with ThreadPoolExecutor and adds error handling |
| deepmd/utils/data.py | Optimizes _get_nframes to read only .npy headers instead of loading full arrays |
| deepmd/pt/utils/stat.py | Refactors make_stat_input to process datasets in parallel using ThreadPoolExecutor |
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant get_numb_batch
    participant _get_nframes as _get_nframes(set_name)
    participant NPYReader as NPY Header Reader
    rect rgb(220, 240, 255)
        Note over get_numb_batch,_get_nframes: NEW: Header-only approach
        get_numb_batch->>_get_nframes: _get_nframes(set_name)
        alt DPH5Path detected
            _get_nframes->>_get_nframes: nframes = path.root[path._name].shape[0]
        else Standard path
            _get_nframes->>NPYReader: Read coord.npy header only
            NPYReader->>NPYReader: Parse NPY v1.x/2.x/3.x format
            NPYReader-->>_get_nframes: nframes from header shape[0]
        end
        _get_nframes-->>get_numb_batch: Return nframes
        get_numb_batch->>get_numb_batch: ret = max(1, nframes // batch_size)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
✨ Finishing touches
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
602-602: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (4)
Actionable comments posted: 0
🧹 Nitpick comments (2)
deepmd/utils/data.py (1)
585-600: Efficient header-only reading implementation with minor style fix. The implementation correctly reads numpy file headers across versions 1.x, 2.x, and 3.x to extract frame count without loading data. This is a solid optimization.
Prefix unused unpacked variables with underscores to follow Python conventions:
```diff
 with open(str(path), "rb") as f:
     version = np.lib.format.read_magic(f)
     if version[0] == 1:
-        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+        shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
     elif version[0] in [2, 3]:
-        shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
+        shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(f)
     else:
         raise ValueError(f"Unsupported .npy file version: {version}")
```

deepmd/utils/env_mat_stat.py (1)
228-242: Consider more specific exception handling. The helper function correctly validates shape and logs warnings for failures. However, catching bare `Exception` could mask critical errors such as `MemoryError`. Catch more specific exceptions to avoid hiding critical issues:
```diff
 @staticmethod
 def _load_stat_file(file_path: DPPath) -> tuple[str, StatItem]:
     """Helper function for parallel loading of stat files."""
     try:
         arr = file_path.load_numpy()
         if arr.shape == (3,):
             return file_path.name, StatItem(
                 number=arr[0], sum=arr[1], squared_sum=arr[2]
             )
         else:
             log.warning(f"Skipping malformed stat file: {file_path.name}")
             return file_path.name, None
-    except Exception as e:
+    except (IOError, OSError, ValueError, RuntimeError) as e:
         log.warning(f"Failed to load stat file {file_path.name}: {e}")
         return file_path.name, None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deepmd/common.py (1 hunks)
- deepmd/pt/utils/stat.py (3 hunks)
- deepmd/utils/data.py (2 hunks)
- deepmd/utils/env_mat_stat.py (4 hunks)
- deepmd/utils/path.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/pt/utils/stat.py
- deepmd/utils/path.py
- deepmd/utils/data.py
- deepmd/utils/env_mat_stat.py
- deepmd/common.py
🧬 Code graph analysis (3)
deepmd/pt/utils/stat.py (1)
deepmd/pt/utils/utils.py (1)
dict_to_device(279-289)
deepmd/utils/env_mat_stat.py (1)
deepmd/utils/path.py (10)
glob (80-92), glob (218-232), glob (382-404), name (145-146), name (271-273), name (478-480), DPPath (28-163), load_numpy (50-57), load_numpy (185-193), load_numpy (345-353)
deepmd/common.py (1)
deepmd/utils/path.py (9)
parent (162-163), parent (276-278), parent (483-488), rglob (95-108), rglob (234-248), rglob (406-420), is_file (111-112), is_file (250-252), is_file (443-447)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
596-596: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
596-596: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
598-598: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/utils/env_mat_stat.py
161-161: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (8)
deepmd/common.py (1)
205-208: LGTM! Efficient path discovery using parent property. The refactored logic directly searches for `type.raw` files and leverages the new `parent` property to collect directories, replacing the previous approach of filtering all paths. This aligns with the ~10× speedup claim in the PR objectives.
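As a rough, self-contained sketch of this lookup pattern, using plain pathlib rather than the actual `DPPath` API (the function name is illustrative, not deepmd code):

```python
from pathlib import Path

def find_system_dirs(root: str) -> list[str]:
    # Search for type.raw files directly and take each file's parent directory,
    # instead of walking every path under root and filtering afterwards.
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw") if p.is_file())
```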
deepmd/utils/data.py (1)
344-349: Excellent memory optimization using header-only reads. The change to use `_get_nframes` for batch count computation eliminates the need to load entire coordinate arrays into memory, significantly reducing memory footprint for large datasets.

deepmd/utils/path.py (3)
160-164: LGTM! Well-designed abstract property. The abstract `parent` property establishes a consistent interface for path navigation across all DPPath implementations, enabling optimizations like the one in deepmd/common.py line 205.
275-279: LGTM! Clean implementation for filesystem paths. The implementation correctly delegates to pathlib's `parent` property while preserving the mode parameter and returning the same concrete type.
482-489: LGTM! Correct HDF5 path parent navigation. The implementation correctly computes the parent path in the HDF5 hierarchy by removing the last path component, with proper handling of the root ("/") case.
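For orientation, a minimal sketch of the pattern described in these three comments (the class names are simplified stand-ins; the real DPOSPath and DPH5Path carry additional state such as the access mode and the open HDF5 file):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class PathBase(ABC):
    @property
    @abstractmethod
    def parent(self) -> "PathBase":
        """Return the parent of this path."""


class OSPathSketch(PathBase):
    def __init__(self, path: str) -> None:
        self._path = Path(path)

    @property
    def parent(self) -> "OSPathSketch":
        # Filesystem paths simply delegate to pathlib.
        return OSPathSketch(str(self._path.parent))


class H5PathSketch(PathBase):
    def __init__(self, name: str) -> None:
        self._name = name  # path inside the HDF5 file, e.g. "/sys0/set.000"

    @property
    def parent(self) -> "H5PathSketch":
        # Drop the last component of the in-file path; the root stays "/".
        return H5PathSketch(self._name.rsplit("/", 1)[0] or "/")
```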
deepmd/utils/env_mat_stat.py (1)
147-169: Excellent parallelization of statistics file loading. The parallel loading with ThreadPoolExecutor accelerates I/O-bound statistics loading. The empty files check and None-filtering provide robust error handling while allowing processing to continue despite individual file failures.
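A minimal sketch of this style of parallel, fault-tolerant loading (illustrative only; the real code operates on DPPath objects and wraps each array in a StatItem rather than returning raw arrays):

```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

log = logging.getLogger(__name__)


def load_one_stat(path):
    # Load a single stat file; return (path, array) or (path, None) on failure.
    try:
        arr = np.load(path)
        if arr.shape == (3,):  # expected layout: [number, sum, squared_sum]
            return path, arr
        log.warning("Skipping malformed stat file: %s", path)
    except (OSError, ValueError) as exc:
        log.warning("Failed to load stat file %s: %s", path, exc)
    return path, None


def load_all_stats(paths):
    # The work is I/O-bound, so a thread pool with a bounded worker count suffices.
    workers = min(64, (os.cpu_count() or 1) * 4)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(load_one_stat, paths))
    # Keep only the files that loaded successfully.
    return {p: arr for p, arr in results if arr is not None}
```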
deepmd/pt/utils/stat.py (2)
42-66: Excellent parallelization of data packing for statistics. The refactored `make_stat_input` effectively parallelizes I/O-intensive dataloader processing using 256 workers. The clear separation of per-dataset work into `_process_one_dataset` maintains code clarity while enabling significant performance improvements.
69-131: LGTM! Well-structured helper for parallel dataset processing. The `_process_one_dataset` helper cleanly encapsulates the per-dataset logic, including CPU context, batch iteration, fparam handling, and data accumulation. The implementation preserves all original processing logic while enabling parallelization.
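A simplified sketch of the overall pattern (the real `make_stat_input` and `_process_one_dataset` additionally handle fparam, device placement, and accumulation into a stat dict; the function bodies here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def process_one_dataset(job):
    # Placeholder for the per-dataset work: pull nbatches batches from the
    # dataloader on CPU and return them for later accumulation.
    dataloader, nbatches = job
    return list(islice(iter(dataloader), nbatches))


def make_stat_input_sketch(dataloaders, nbatches, max_workers=256):
    # Each dataset is independent and the work is dominated by I/O,
    # so a thread pool over datasets gives large wall-clock speedups.
    jobs = [(dl, nbatches) for dl in dataloaders]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one_dataset, jobs))
```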
Actionable comments posted: 1
♻️ Duplicate comments (1)
deepmd/utils/data.py (1)
590-605: Critical: `_get_nframes` fails for HDF5 paths. Opening a `DPH5Path` using `str(path)` will fail because the string representation includes a `#` separator (e.g., "file.h5#/path/to/data"), which is not a valid filesystem path. This method is called at line 142 on all directories in `self.dirs`, which may include `DPH5Path` instances, causing runtime failures. Apply this diff to handle both path types:
```diff
 def _get_nframes(self, set_name: DPPath) -> int:
     # get nframes
     if not isinstance(set_name, DPPath):
         set_name = DPPath(set_name)
     path = set_name / "coord.npy"
-    # Read only the header to get shape
-    with open(str(path), "rb") as f:
-        version = np.lib.format.read_magic(f)
-        if version[0] == 1:
-            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
-        elif version[0] in [2, 3]:
-            shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
-        else:
-            raise ValueError(f"Unsupported .npy file version: {version}")
-    nframes = shape[0]
+    # For HDF5, we must load the array to get shape; for filesystem, read only the header
+    if isinstance(path, DPH5Path):
+        nframes = path.load_numpy().shape[0]
+    else:
+        with open(str(path), "rb") as f:
+            version = np.lib.format.read_magic(f)
+            if version[0] == 1:
+                shape, _, _ = np.lib.format.read_array_header_1_0(f)
+            elif version[0] in [2, 3]:
+                shape, _, _ = np.lib.format.read_array_header_2_0(f)
+            else:
+                raise ValueError(f"Unsupported .npy file version: {version}")
+        nframes = shape[0]
     return nframes
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deepmd/pt/utils/stat.py (3 hunks)
- deepmd/utils/data.py (2 hunks)
- deepmd/utils/env_mat_stat.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/utils/data.py
- deepmd/pt/utils/stat.py
- deepmd/utils/env_mat_stat.py
🧬 Code graph analysis (3)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
deepmd/pt/utils/stat.py (1)
deepmd/pt/utils/utils.py (1)
dict_to_device(279-289)
deepmd/utils/env_mat_stat.py (1)
deepmd/utils/path.py (10)
glob (80-92), glob (218-232), glob (382-404), name (145-146), name (271-273), name (478-480), DPPath (28-163), load_numpy (50-57), load_numpy (185-193), load_numpy (345-353)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
601-601: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
601-601: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
603-603: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/utils/env_mat_stat.py
162-162: Avoid specifying long messages outside the exception class
(TRY003)
241-241: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
deepmd/utils/env_mat_stat.py (2)
148-169: LGTM! The parallel loading implementation is well-structured. The worker count calculation `min(64, (os.cpu_count() or 1) * 4)` is reasonable for I/O-bound operations and addresses previous concerns about excessive workers. The error handling via `_load_stat_file` returning `None` for failed loads is appropriate.
229-243: Error handling is appropriate for parallel loading. Catching `Exception` at line 241 is flagged by static analysis, but it's acceptable here since this helper is designed to be fault-tolerant during parallel file loading. Returning `None` on error allows the caller to filter out failed loads while continuing with successful ones.

deepmd/pt/utils/stat.py (2)
43-67: Parallel processing structure is sound. The refactoring to use `ThreadPoolExecutor` is well-implemented. The worker count `min(128, (os.cpu_count() or 1) * 6)` is aggressive but justified for I/O-intensive data loading workloads. The comment on line 63 correctly identifies this as I/O-bound.
70-132: LGTM! The helper function correctly encapsulates per-dataset processing for parallel execution. The use of the `torch.device("cpu")` context (line 88) is appropriate for data loading. Line 124 properly checks list length before accessing the first element, addressing previous review concerns.
Actionable comments posted: 0
🧹 Nitpick comments (3)
deepmd/utils/data.py (3)
142-147: Optimize HDF5 frame counting to avoid loading full datasets. The fix correctly handles `DPH5Path` instances, but calling `self._load_set(item)["coord"].shape[0]` loads the entire coordinate array just to read its shape. HDF5 datasets expose shape metadata without requiring data transfer. Apply this optimization to read shape directly for HDF5 paths:
```diff
 frames_list = [
-    self._load_set(item)["coord"].shape[0]
-    if isinstance(item, DPH5Path)
+    ((item / "coord.npy").root[(item / "coord.npy")._name].shape[0]
+     if isinstance(item, DPH5Path)
+     else self._get_nframes(item))
-    else self._get_nframes(item)
     for item in self.dirs
 ]
```

This avoids the memory and I/O overhead of loading potentially gigabyte-sized coordinate arrays during initialization.
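For background on why this works, HDF5 datasets expose their shape as metadata, so the dimensions can be read without transferring any array data. A generic h5py sketch (not deepmd code; the file and dataset names are made up):

```python
import h5py

# Opening the file and looking up a dataset only touches metadata.
with h5py.File("systems.h5", "r") as f:
    dset = f["sys0/set.000/coord.npy"]  # an h5py.Dataset object
    # .shape comes from the dataset header; no frame data is read here.
    nframes = dset.shape[0]
print(nframes)
```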
349-359: Optimize HDF5 frame counting to avoid loading full datasets. Similar to the issue at lines 142-147, calling `self._load_set(set_name)` for `DPH5Path` instances loads the entire coordinate array when only the shape is needed. Apply this optimization:
```diff
 set_name = self.dirs[set_idx]
 if isinstance(set_name, DPH5Path):
-    data = self._load_set(set_name)
-    nframes = data["coord"].shape[0]
+    coord_path = set_name / "coord.npy"
+    nframes = coord_path.root[coord_path._name].shape[0]
 else:
     # Directly obtain the number of frames to avoid loading the entire dataset
     nframes = self._get_nframes(set_name)
```

This avoids unnecessary I/O when computing batch counts for HDF5-backed datasets.
600-608: Address static analysis warnings for unused variables. The unpacked variables `fortran_order` and `dtype` at lines 604 and 606 are never used, triggering Ruff warnings (RUF059). As per coding guidelines, apply this fix to silence the warnings:
```diff
 version = np.lib.format.read_magic(f)
 if version[0] == 1:
-    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+    shape, _, _ = np.lib.format.read_array_header_1_0(f)
 elif version[0] in [2, 3]:
-    shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
+    shape, _, _ = np.lib.format.read_array_header_2_0(f)
 else:
     raise ValueError(f"Unsupported .npy file version: {version}")
```

This follows the dummy variable pattern to indicate intentionally unused values.
Based on coding guidelines
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
606-606: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
606-606: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
608-608: Avoid specifying long messages outside the exception class
(TRY003)
Actionable comments posted: 0
🧹 Nitpick comments (3)
deepmd/utils/data.py (3)
142-147: Critical bug fixed, but consider optimizing HDF5 shape access. Good work fixing the crash for HDF5 paths! The type check correctly prevents passing `DPH5Path` to `_get_nframes`. However, calling `self._load_set(item)` loads the entire dataset into memory just to get the coordinate shape, which can be expensive during initialization for large HDF5 files. Consider accessing the HDF5 dataset shape directly without loading data:
```diff
 frames_list = [
-    self._load_set(item)["coord"].shape[0]
-    if isinstance(item, DPH5Path)
+    (item / "coord.npy").root[(item / "coord.npy")._name].shape[0]
+    if isinstance(item, DPH5Path)
     else self._get_nframes(item)
     for item in self.dirs
 ]
```

Or extract this into a helper method for clarity:
```python
def _get_nframes_h5(self, h5path: DPH5Path) -> int:
    """Get number of frames from HDF5 coord dataset without loading data."""
    coord_path = h5path / "coord.npy"
    return coord_path.root[coord_path._name].shape[0]
```
349-359: Same HDF5 efficiency concern as line 142. The type-based branching correctly handles both path types, and the comment on line 354 is helpful. However, the same memory efficiency issue applies here: loading the entire dataset just to get `coord.shape[0]` is expensive for large HDF5 files. Apply the same HDF5 shape optimization suggested for lines 142-147, or extract a shared helper method to avoid code duplication:
```python
def _get_nframes_any(self, path: DPPath) -> int:
    """Get number of frames from any path type without loading full data."""
    if isinstance(path, DPH5Path):
        coord_path = path / "coord.npy"
        return coord_path.root[coord_path._name].shape[0]
    else:
        return self._get_nframes(path)
```

Then use it in both `__init__` (line 145) and here.
595-610: Excellent header-only optimization with minor style nitpicks. The header-only read is a significant performance improvement for large datasets. The version handling correctly supports numpy format versions 1.x through 3.x.
A few optional style improvements:
Simplify line 609 logic: Since `np.lib.format.read_array_header_*` always returns a tuple, the `isinstance` check is unnecessary:

```diff
- nframes = shape[0] if (len(shape) if isinstance(shape, tuple) else 0) > 1 else 1
+ nframes = shape[0] if len(shape) > 1 else 1
```

Address static analysis hints (lines 604, 606): Prefix unused unpacked variables with underscore per Ruff RUF059:

```diff
- shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+ shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
```

And similarly for line 606.
Shorten error message (line 608) per Ruff TRY003: Consider moving to a constant or using a shorter inline message.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
606-606: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
606-606: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
608-608: Avoid specifying long messages outside the exception class
(TRY003)
Actionable comments posted: 1
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)
139-144: Good fix for the past review concern, but consider HDF5 optimization. The type check correctly addresses the critical bug identified in previous reviews. However, there's an optimization inconsistency:
- Filesystem paths: Read only the header via `_get_nframes` (memory efficient)
- HDF5 paths: Load the entire dataset via `_load_set` to get shape (memory intensive)

For HDF5, you could access the shape directly without loading the full array by reaching into the dataset object. Consider adding a method to the `DPPath` interface that returns array shape without loading, then use it here:

```python
frames_list = [
    self._get_array_shape(item / "coord.npy")[0] for item in self.dirs
]
```

Where `_get_array_shape` would:

- For filesystem: use the current header-reading logic
- For HDF5: access `dataset.shape` directly

This would make the optimization consistent across both storage backends.
346-356: Same HDF5 optimization opportunity as in `__init__`. The type check correctly handles both path types, but the HDF5 branch loads the entire dataset just to obtain `nframes`. Consider the same shape-access optimization suggested for lines 139-144 to avoid loading full datasets during batch size calculation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deepmd/common.py (1 hunks)
- deepmd/utils/data.py (3 hunks)
- deepmd/utils/path.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/utils/path.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/common.py
- deepmd/utils/data.py
🧬 Code graph analysis (2)
deepmd/common.py (1)
deepmd/utils/path.py (9)
parent (162-163), parent (276-278), parent (483-488), rglob (95-108), rglob (234-248), rglob (406-420), is_file (111-112), is_file (250-252), is_file (443-447)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
603-603: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
603-603: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
605-605: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
deepmd/common.py (1)
205-205: LGTM! Efficient path discovery optimization. The refactored approach directly targets `type.raw` files and extracts their parent directories, eliminating unnecessary iteration and filtering. This aligns with the reported ~10× speedup for system path lookup.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
340-350: Suggest refactoring: Use `_get_nframes` consistently for both path types. The current implementation loads the entire dataset for HDF5 paths (`self._load_set(set_name)`) but uses the efficient header-only read for regular paths. Since `_get_nframes` (lines 589-591) now handles HDF5 paths efficiently by accessing the shape attribute without loading data, you can simplify this method and improve performance for HDF5 datasets. Apply this diff to use `_get_nframes` consistently:

```diff
 def get_numb_batch(self, batch_size: int, set_idx: int) -> int:
     """Get the number of batches in a set."""
     set_name = self.dirs[set_idx]
-    if isinstance(set_name, DPH5Path):
-        data = self._load_set(set_name)
-        nframes = data["coord"].shape[0]
-    else:
-        # Directly obtain the number of frames to avoid loading the entire dataset
-        nframes = self._get_nframes(set_name)
+    # Directly obtain the number of frames to avoid loading the entire dataset
+    nframes = self._get_nframes(set_name)
     ret = nframes // batch_size
     if ret == 0:
         ret = 1
     return ret
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (2)
DPH5Path (295-508), DPPath (28-163)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
606-606: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
deepmd/utils/data.py (2)
138-138: LGTM: Frame counting now works correctly for all path types. The past review concern about DPH5Path handling has been properly addressed. The updated `_get_nframes` method (lines 586-608) now handles both HDF5 paths and regular filesystem paths efficiently, so this line works correctly for all items in `self.dirs`.
586-608: Well-implemented header-only frame counting with good type handling. The implementation efficiently reads only the .npy header to extract shape information, avoiding memory-intensive full array loads. The method correctly handles:
- Both DPH5Path (via HDF5 shape access) and regular filesystem paths
- NumPy format versions 1.x, 2.x, and 3.x
- Proper conversion of string paths to DPPath instances
Past review feedback has been addressed: unused variables are prefixed with underscores (lines 598, 602) and the error message is concise (line 606).
Minor note: The static analysis hint (TRY003) on line 606 is pedantic; the error message is appropriately brief. As per coding guidelines, please run `ruff check .` and `ruff format .` before committing to ensure all linting requirements are satisfied.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
583-605: Well-implemented header-only frame counting with proper HDF5 support. The `_get_nframes` implementation correctly handles both HDF5 and filesystem paths, reading only the `.npy` header to extract shape information without loading full arrays. All previously suggested refactors have been applied. The static analysis hint (TRY003) about the error message at line 603 appears to be a false positive; the message is already concise and clear.
If you prefer to silence the TRY003 hint, you could define a custom exception, but this is entirely optional:
```python
class UnsupportedNumpyVersionError(ValueError):
    """Raised when an unsupported .npy file version is encountered."""
    pass

# Then at line 603:
raise UnsupportedNumpyVersionError(f"version {version}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (2)
DPPath (28-163), DPH5Path (295-508)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
603-603: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
deepmd/utils/data.py (3)
17-17: LGTM: Import addition is appropriate. The `Union` import is correctly used for the `_get_nframes` signature to accept both `DPPath` and `str` types.
139-139: Critical issue resolved: HDF5 path handling is now correct. The previous critical bug where `_get_nframes` was called without type checking for HDF5 paths has been successfully fixed. The `_get_nframes` method (lines 583-605) now properly handles both `DPH5Path` and filesystem paths.
341-347: Excellent optimization: Avoids loading full dataset. The refactored `get_numb_batch` now uses `_get_nframes` to obtain frame counts without loading the entire dataset, delivering the memory efficiency improvements described in the PR objectives.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@           Coverage Diff            @@
##            devel    #5040    +/- ##
=======================================
  Coverage   84.18%   84.18%
=======================================
  Files         709      709
  Lines       70217    70219     +2
  Branches     3620     3618     -2
=======================================
+ Hits        59114    59117     +3
+ Misses       9936     9935     -1
  Partials     1167     1167
```

☔ View full report in Codecov by Sentry.
…n find type.raw. 10x accelerate
Overview
This PR introduces performance optimizations for data loading and statistics computation in deepmd-kit. The changes focus on multi-threading parallelization, header-only array metadata reads, and efficient filesystem operations.
Changes Summary
1. Multi-threaded Statistics Computation (deepmd/pt/utils/stat.py)
- Uses `ThreadPoolExecutor` for parallel processing of multiple datasets
- Refactors `make_stat_input` to use a thread pool with 256 workers
- Adds a `_process_one_dataset` helper function for individual dataset processing

2. Efficient System Path Lookup (deepmd/common.py)
- Changes `expand_sys_str` to use `rglob("type.raw")` instead of `rglob("*")` + filtering
- Adds a `parent` property to the `DPOSPath` and `DPH5Path` classes in deepmd/utils/path.py

3. Header-only Data Loading (deepmd/utils/data.py)
- Adds a `_get_nframes` method to read numpy file headers without loading data
- Updates `get_numb_batch` to use the new method instead of loading the entire dataset
- Uses `np.lib.format.read_magic` and `read_array_header_*` to extract shape information (see the sketch after this list)

4. Parallel Statistics File Loading (deepmd/utils/env_mat_stat.py)
- Uses `ThreadPoolExecutor` for parallel loading of stat files
- Adds a `_load_stat_file` static method with error handling
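The sketch referenced in item 3, as a standalone illustration of header-only frame counting (the function name is illustrative; the real `_get_nframes` additionally handles `DPH5Path` objects):

```python
import numpy as np

def count_frames(npy_path: str) -> int:
    # Read only the .npy header; the array data itself is never loaded.
    with open(npy_path, "rb") as f:
        major, _minor = np.lib.format.read_magic(f)
        if major == 1:
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
        elif major in (2, 3):
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(f)
        else:
            raise ValueError(f"Unsupported .npy file version: {major}")
    # For coord.npy the first dimension is the number of frames.
    return shape[0]
```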
Performance Impact

Compatibility
✅ Backward Compatible: All API interfaces remain unchanged
✅ Data Format: No changes to data file formats
✅ Functionality: All existing features work normally
Summary by CodeRabbit