feat: Performance Optimization: Data Loading and Statistics Acceleration #5040
Conversation
Pull Request Overview
This PR introduces several performance optimizations for loading and processing data in the DeepMD-kit framework:
- Adds a `parent` property to the `DPPath` abstract class and its implementations
- Optimizes file system traversal by directly searching for `type.raw` files instead of iterating over all directories
- Implements parallel loading of statistics files using `ThreadPoolExecutor`
- Optimizes frame count retrieval by reading only `.npy` file headers instead of loading entire arrays
- Parallelizes dataset processing for statistics gathering
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| deepmd/utils/path.py | Adds abstract parent property to DPPath and implements it in DPOSPath and DPH5Path |
| deepmd/common.py | Optimizes expand_sys_str to use direct file globbing instead of directory iteration |
| deepmd/utils/env_mat_stat.py | Parallelizes statistics file loading with ThreadPoolExecutor and adds error handling |
| deepmd/utils/data.py | Optimizes _get_nframes to read only .npy headers instead of loading full arrays |
| deepmd/pt/utils/stat.py | Refactors make_stat_input to process datasets in parallel using ThreadPoolExecutor |
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant get_numb_batch
    participant _get_nframes as _get_nframes(set_name)
    participant NPYReader as NPY Header Reader
    rect rgb(220, 240, 255)
        Note over get_numb_batch,_get_nframes: NEW: Header-only approach
        get_numb_batch->>_get_nframes: _get_nframes(set_name)
        alt DPH5Path detected
            _get_nframes->>_get_nframes: nframes = path.root[path._name].shape[0]
        else Standard path
            _get_nframes->>NPYReader: Read coord.npy header only
            NPYReader->>NPYReader: Parse NPY v1.x/2.x/3.x format
            NPYReader-->>_get_nframes: nframes from header shape[0]
        end
        _get_nframes-->>get_numb_batch: Return nframes
        get_numb_batch->>get_numb_batch: ret = max(1, nframes // batch_size)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
✨ Finishing touches
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
602-602: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (4)
Actionable comments posted: 0
🧹 Nitpick comments (2)
deepmd/utils/data.py (1)
585-600: Efficient header-only reading implementation with minor style fix. The implementation correctly reads numpy file headers across versions 1.x, 2.x, and 3.x to extract frame count without loading data. This is a solid optimization.
Prefix unused unpacked variables with underscores to follow Python conventions:
```diff
 with open(str(path), "rb") as f:
     version = np.lib.format.read_magic(f)
     if version[0] == 1:
-        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+        shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
     elif version[0] in [2, 3]:
-        shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
+        shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(f)
     else:
         raise ValueError(f"Unsupported .npy file version: {version}")
```

deepmd/utils/env_mat_stat.py (1)
228-242: Consider more specific exception handling. The helper function correctly validates shape and logs warnings for failures. However, catching bare `Exception` could mask critical errors such as `MemoryError`. Catch more specific exceptions to avoid hiding critical issues:
```diff
 @staticmethod
 def _load_stat_file(file_path: DPPath) -> tuple[str, StatItem]:
     """Helper function for parallel loading of stat files."""
     try:
         arr = file_path.load_numpy()
         if arr.shape == (3,):
             return file_path.name, StatItem(
                 number=arr[0], sum=arr[1], squared_sum=arr[2]
             )
         else:
             log.warning(f"Skipping malformed stat file: {file_path.name}")
             return file_path.name, None
-    except Exception as e:
+    except (IOError, OSError, ValueError, RuntimeError) as e:
         log.warning(f"Failed to load stat file {file_path.name}: {e}")
         return file_path.name, None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deepmd/common.py (1 hunks)
- deepmd/pt/utils/stat.py (3 hunks)
- deepmd/utils/data.py (2 hunks)
- deepmd/utils/env_mat_stat.py (4 hunks)
- deepmd/utils/path.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/pt/utils/stat.py
- deepmd/utils/path.py
- deepmd/utils/data.py
- deepmd/utils/env_mat_stat.py
- deepmd/common.py
🧬 Code graph analysis (3)
deepmd/pt/utils/stat.py (1)
deepmd/pt/utils/utils.py (1)
dict_to_device(279-289)
deepmd/utils/env_mat_stat.py (1)
deepmd/utils/path.py (10)
glob (80-92), glob (218-232), glob (382-404), name (145-146), name (271-273), name (478-480), DPPath (28-163), load_numpy (50-57), load_numpy (185-193), load_numpy (345-353)
deepmd/common.py (1)
deepmd/utils/path.py (9)
parent (162-163), parent (276-278), parent (483-488), rglob (95-108), rglob (234-248), rglob (406-420), is_file (111-112), is_file (250-252), is_file (443-447)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
596-596: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
596-596: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
598-598: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/utils/env_mat_stat.py
161-161: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (8)
deepmd/common.py (1)
205-208: LGTM! Efficient path discovery using parent property. The refactored logic directly searches for `type.raw` files and leverages the new `parent` property to collect directories, replacing the previous approach of filtering all paths. This aligns with the ~10× speedup claim in the PR objectives.
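As a rough, self-contained sketch of this lookup pattern, using plain pathlib rather than the actual `DPPath` API (the function name is illustrative, not deepmd code):

```python
from pathlib import Path

def find_system_dirs(root: str) -> list[str]:
    # Search for type.raw files directly and take each file's parent directory,
    # instead of walking every path under root and filtering afterwards.
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw") if p.is_file())
```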
deepmd/utils/data.py (1)
344-349: Excellent memory optimization using header-only reads. The change to use `_get_nframes` for batch count computation eliminates the need to load entire coordinate arrays into memory, significantly reducing memory footprint for large datasets.

deepmd/utils/path.py (3)
160-164: LGTM! Well-designed abstract property. The abstract `parent` property establishes a consistent interface for path navigation across all DPPath implementations, enabling optimizations like the one in deepmd/common.py line 205.
275-279: LGTM! Clean implementation for filesystem paths. The implementation correctly delegates to pathlib's `parent` property while preserving the mode parameter and returning the same concrete type.
482-489: LGTM! Correct HDF5 path parent navigation. The implementation correctly computes the parent path in the HDF5 hierarchy by removing the last path component, with proper handling of the root ("/") case.
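For orientation, a minimal sketch of the pattern described in these three comments (the class names are simplified stand-ins; the real DPOSPath and DPH5Path carry additional state such as the access mode and the open HDF5 file):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class PathBase(ABC):
    @property
    @abstractmethod
    def parent(self) -> "PathBase":
        """Return the parent of this path."""


class OSPathSketch(PathBase):
    def __init__(self, path: str) -> None:
        self._path = Path(path)

    @property
    def parent(self) -> "OSPathSketch":
        # Filesystem paths simply delegate to pathlib.
        return OSPathSketch(str(self._path.parent))


class H5PathSketch(PathBase):
    def __init__(self, name: str) -> None:
        self._name = name  # path inside the HDF5 file, e.g. "/sys0/set.000"

    @property
    def parent(self) -> "H5PathSketch":
        # Drop the last component of the in-file path; the root stays "/".
        return H5PathSketch(self._name.rsplit("/", 1)[0] or "/")
```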
deepmd/utils/env_mat_stat.py (1)
147-169: Excellent parallelization of statistics file loading. The parallel loading with ThreadPoolExecutor accelerates I/O-bound statistics loading. The empty files check and None-filtering provide robust error handling while allowing processing to continue despite individual file failures.
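A minimal sketch of this style of parallel, fault-tolerant loading (illustrative only; the real code operates on DPPath objects and wraps each array in a StatItem rather than returning raw arrays):

```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

log = logging.getLogger(__name__)


def load_one_stat(path):
    # Load a single stat file; return (path, array) or (path, None) on failure.
    try:
        arr = np.load(path)
        if arr.shape == (3,):  # expected layout: [number, sum, squared_sum]
            return path, arr
        log.warning("Skipping malformed stat file: %s", path)
    except (OSError, ValueError) as exc:
        log.warning("Failed to load stat file %s: %s", path, exc)
    return path, None


def load_all_stats(paths):
    # The work is I/O-bound, so a thread pool with a bounded worker count suffices.
    workers = min(64, (os.cpu_count() or 1) * 4)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(load_one_stat, paths))
    # Keep only the files that loaded successfully.
    return {p: arr for p, arr in results if arr is not None}
```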
deepmd/pt/utils/stat.py (2)
42-66: Excellent parallelization of data packing for statistics. The refactored `make_stat_input` effectively parallelizes I/O-intensive dataloader processing using 256 workers. The clear separation of per-dataset work into `_process_one_dataset` maintains code clarity while enabling significant performance improvements.
69-131: LGTM! Well-structured helper for parallel dataset processing. The `_process_one_dataset` helper cleanly encapsulates the per-dataset logic, including CPU context, batch iteration, fparam handling, and data accumulation. The implementation preserves all original processing logic while enabling parallelization.
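A simplified sketch of the overall pattern (the real `make_stat_input` and `_process_one_dataset` additionally handle fparam, device placement, and accumulation into a stat dict; the function bodies here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def process_one_dataset(job):
    # Placeholder for the per-dataset work: pull nbatches batches from the
    # dataloader on CPU and return them for later accumulation.
    dataloader, nbatches = job
    return list(islice(iter(dataloader), nbatches))


def make_stat_input_sketch(dataloaders, nbatches, max_workers=256):
    # Each dataset is independent and the work is dominated by I/O,
    # so a thread pool over datasets gives large wall-clock speedups.
    jobs = [(dl, nbatches) for dl in dataloaders]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one_dataset, jobs))
```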
Actionable comments posted: 1
♻️ Duplicate comments (1)
deepmd/utils/data.py (1)
590-605: Critical: `_get_nframes` fails for HDF5 paths. Opening a `DPH5Path` using `str(path)` will fail because the string representation includes a `#` separator (e.g., "file.h5#/path/to/data"), which is not a valid filesystem path. This method is called at line 142 on all directories in `self.dirs`, which may include `DPH5Path` instances, causing runtime failures. Apply this diff to handle both path types:
```diff
 def _get_nframes(self, set_name: DPPath) -> int:
     # get nframes
     if not isinstance(set_name, DPPath):
         set_name = DPPath(set_name)
     path = set_name / "coord.npy"
-    # Read only the header to get shape
-    with open(str(path), "rb") as f:
-        version = np.lib.format.read_magic(f)
-        if version[0] == 1:
-            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
-        elif version[0] in [2, 3]:
-            shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
-        else:
-            raise ValueError(f"Unsupported .npy file version: {version}")
-    nframes = shape[0]
+    # For HDF5, we must load the array to get shape; for filesystem, read only the header
+    if isinstance(path, DPH5Path):
+        nframes = path.load_numpy().shape[0]
+    else:
+        with open(str(path), "rb") as f:
+            version = np.lib.format.read_magic(f)
+            if version[0] == 1:
+                shape, _, _ = np.lib.format.read_array_header_1_0(f)
+            elif version[0] in [2, 3]:
+                shape, _, _ = np.lib.format.read_array_header_2_0(f)
+            else:
+                raise ValueError(f"Unsupported .npy file version: {version}")
+        nframes = shape[0]
     return nframes
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deepmd/pt/utils/stat.py (3 hunks)
- deepmd/utils/data.py (2 hunks)
- deepmd/utils/env_mat_stat.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/utils/data.py
- deepmd/pt/utils/stat.py
- deepmd/utils/env_mat_stat.py
🧬 Code graph analysis (3)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
deepmd/pt/utils/stat.py (1)
deepmd/pt/utils/utils.py (1)
dict_to_device(279-289)
deepmd/utils/env_mat_stat.py (1)
deepmd/utils/path.py (10)
glob (80-92), glob (218-232), glob (382-404), name (145-146), name (271-273), name (478-480), DPPath (28-163), load_numpy (50-57), load_numpy (185-193), load_numpy (345-353)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
601-601: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
601-601: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
603-603: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/utils/env_mat_stat.py
162-162: Avoid specifying long messages outside the exception class
(TRY003)
241-241: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
deepmd/utils/env_mat_stat.py (2)
148-169: LGTM! The parallel loading implementation is well-structured. The worker count calculation `min(64, (os.cpu_count() or 1) * 4)` is reasonable for I/O-bound operations and addresses previous concerns about excessive workers. The error handling via `_load_stat_file` returning `None` for failed loads is appropriate.
229-243: Error handling is appropriate for parallel loading. Catching `Exception` at line 241 is flagged by static analysis, but it's acceptable here since this helper is designed to be fault-tolerant during parallel file loading. Returning `None` on error allows the caller to filter out failed loads while continuing with successful ones.

deepmd/pt/utils/stat.py (2)
43-67: Parallel processing structure is sound. The refactoring to use `ThreadPoolExecutor` is well-implemented. The worker count `min(128, (os.cpu_count() or 1) * 6)` is aggressive but justified for I/O-intensive data loading workloads. The comment on line 63 correctly identifies this as I/O-bound.
70-132: LGTM! The helper function correctly encapsulates per-dataset processing for parallel execution. The use of the `torch.device("cpu")` context (line 88) is appropriate for data loading. Line 124 properly checks list length before accessing the first element, addressing previous review concerns.
Actionable comments posted: 0
🧹 Nitpick comments (3)
deepmd/utils/data.py (3)
142-147: Optimize HDF5 frame counting to avoid loading full datasets. The fix correctly handles `DPH5Path` instances, but calling `self._load_set(item)["coord"].shape[0]` loads the entire coordinate array just to read its shape. HDF5 datasets expose shape metadata without requiring data transfer. Apply this optimization to read shape directly for HDF5 paths:
```diff
 frames_list = [
-    self._load_set(item)["coord"].shape[0]
-    if isinstance(item, DPH5Path)
+    ((item / "coord.npy").root[(item / "coord.npy")._name].shape[0]
+     if isinstance(item, DPH5Path)
+     else self._get_nframes(item))
-    else self._get_nframes(item)
     for item in self.dirs
 ]
```

This avoids the memory and I/O overhead of loading potentially gigabyte-sized coordinate arrays during initialization.
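For background on why this works, HDF5 datasets expose their shape as metadata, so the dimensions can be read without transferring any array data. A generic h5py sketch (not deepmd code; the file and dataset names are made up):

```python
import h5py

# Opening the file and looking up a dataset only touches metadata.
with h5py.File("systems.h5", "r") as f:
    dset = f["sys0/set.000/coord.npy"]  # an h5py.Dataset object
    # .shape comes from the dataset header; no frame data is read here.
    nframes = dset.shape[0]
print(nframes)
```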
349-359: Optimize HDF5 frame counting to avoid loading full datasets. Similar to the issue at lines 142-147, calling `self._load_set(set_name)` for `DPH5Path` instances loads the entire coordinate array when only the shape is needed. Apply this optimization:
```diff
 set_name = self.dirs[set_idx]
 if isinstance(set_name, DPH5Path):
-    data = self._load_set(set_name)
-    nframes = data["coord"].shape[0]
+    coord_path = set_name / "coord.npy"
+    nframes = coord_path.root[coord_path._name].shape[0]
 else:
     # Directly obtain the number of frames to avoid loading the entire dataset
     nframes = self._get_nframes(set_name)
```

This avoids unnecessary I/O when computing batch counts for HDF5-backed datasets.
600-608: Address static analysis warnings for unused variables. The unpacked variables `fortran_order` and `dtype` at lines 604 and 606 are never used, triggering Ruff warnings (RUF059). As per coding guidelines, apply this fix to silence the warnings:
```diff
 version = np.lib.format.read_magic(f)
 if version[0] == 1:
-    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+    shape, _, _ = np.lib.format.read_array_header_1_0(f)
 elif version[0] in [2, 3]:
-    shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
+    shape, _, _ = np.lib.format.read_array_header_2_0(f)
 else:
     raise ValueError(f"Unsupported .npy file version: {version}")
```

This follows the dummy variable pattern to indicate intentionally unused values.
Based on coding guidelines
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
606-606: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
606-606: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
608-608: Avoid specifying long messages outside the exception class
(TRY003)
Actionable comments posted: 0
🧹 Nitpick comments (3)
deepmd/utils/data.py (3)
142-147: Critical bug fixed, but consider optimizing HDF5 shape access. Good work fixing the crash for HDF5 paths! The type check correctly prevents passing `DPH5Path` to `_get_nframes`. However, calling `self._load_set(item)` loads the entire dataset into memory just to get the coordinate shape, which can be expensive during initialization for large HDF5 files. Consider accessing the HDF5 dataset shape directly without loading data:
```diff
 frames_list = [
-    self._load_set(item)["coord"].shape[0]
-    if isinstance(item, DPH5Path)
+    (item / "coord.npy").root[(item / "coord.npy")._name].shape[0]
+    if isinstance(item, DPH5Path)
     else self._get_nframes(item)
     for item in self.dirs
 ]
```

Or extract this into a helper method for clarity:
```python
def _get_nframes_h5(self, h5path: DPH5Path) -> int:
    """Get number of frames from HDF5 coord dataset without loading data."""
    coord_path = h5path / "coord.npy"
    return coord_path.root[coord_path._name].shape[0]
```
349-359: Same HDF5 efficiency concern as line 142. The type-based branching correctly handles both path types, and the comment on line 354 is helpful. However, the same memory efficiency issue applies here: loading the entire dataset just to get `coord.shape[0]` is expensive for large HDF5 files. Apply the same HDF5 shape optimization suggested for lines 142-147, or extract a shared helper method to avoid code duplication:
```python
def _get_nframes_any(self, path: DPPath) -> int:
    """Get number of frames from any path type without loading full data."""
    if isinstance(path, DPH5Path):
        coord_path = path / "coord.npy"
        return coord_path.root[coord_path._name].shape[0]
    else:
        return self._get_nframes(path)
```

Then use it in both `__init__` (line 145) and here.
595-610: Excellent header-only optimization with minor style nitpicks. The header-only read is a significant performance improvement for large datasets. The version handling correctly supports numpy format versions 1.x through 3.x.
A few optional style improvements:
Simplify line 609 logic: Since `np.lib.format.read_array_header_*` always returns a tuple, the `isinstance` check is unnecessary:

```diff
- nframes = shape[0] if (len(shape) if isinstance(shape, tuple) else 0) > 1 else 1
+ nframes = shape[0] if len(shape) > 1 else 1
```

Address static analysis hints (lines 604, 606): Prefix unused unpacked variables with underscore per Ruff RUF059:

```diff
- shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
+ shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
```

And similarly for line 606.
Shorten error message (line 608) per Ruff TRY003: Consider moving to a constant or using a shorter inline message.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.3)
deepmd/utils/data.py
606-606: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
606-606: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
608-608: Avoid specifying long messages outside the exception class
(TRY003)
Actionable comments posted: 1
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)
139-144: Good fix for the past review concern, but consider HDF5 optimization. The type check correctly addresses the critical bug identified in previous reviews. However, there's an optimization inconsistency:
- Filesystem paths: Read only the header via `_get_nframes` (memory efficient)
- HDF5 paths: Load the entire dataset via `_load_set` to get shape (memory intensive)

For HDF5, you could access the shape directly without loading the full array by reaching into the dataset object. Consider adding a method to the `DPPath` interface that returns array shape without loading, then use it here:

```python
frames_list = [
    self._get_array_shape(item / "coord.npy")[0] for item in self.dirs
]
```

Where `_get_array_shape` would:

- For filesystem: use the current header-reading logic
- For HDF5: access `dataset.shape` directly

This would make the optimization consistent across both storage backends.
346-356: Same HDF5 optimization opportunity as in `__init__`. The type check correctly handles both path types, but the HDF5 branch loads the entire dataset just to obtain `nframes`. Consider the same shape-access optimization suggested for lines 139-144 to avoid loading full datasets during batch size calculation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deepmd/common.py (1 hunks)
- deepmd/utils/data.py (3 hunks)
- deepmd/utils/path.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/utils/path.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/common.py
- deepmd/utils/data.py
🧬 Code graph analysis (2)
deepmd/common.py (1)
deepmd/utils/path.py (9)
parent (162-163), parent (276-278), parent (483-488), rglob (95-108), rglob (234-248), rglob (406-420), is_file (111-112), is_file (250-252), is_file (443-447)
deepmd/utils/data.py (1)
deepmd/utils/path.py (1)
DPH5Path(295-508)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
603-603: Unpacked variable fortran_order is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
603-603: Unpacked variable dtype is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
605-605: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
deepmd/common.py (1)
205-205: LGTM! Efficient path discovery optimization. The refactored approach directly targets `type.raw` files and extracts their parent directories, eliminating unnecessary iteration and filtering. This aligns with the reported ~10× speedup for system path lookup.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
340-350: Suggest refactoring: Use `_get_nframes` consistently for both path types. The current implementation loads the entire dataset for HDF5 paths (`self._load_set(set_name)`) but uses the efficient header-only read for regular paths. Since `_get_nframes` (lines 589-591) now handles HDF5 paths efficiently by accessing the shape attribute without loading data, you can simplify this method and improve performance for HDF5 datasets. Apply this diff to use `_get_nframes` consistently:

```diff
 def get_numb_batch(self, batch_size: int, set_idx: int) -> int:
     """Get the number of batches in a set."""
     set_name = self.dirs[set_idx]
-    if isinstance(set_name, DPH5Path):
-        data = self._load_set(set_name)
-        nframes = data["coord"].shape[0]
-    else:
-        # Directly obtain the number of frames to avoid loading the entire dataset
-        nframes = self._get_nframes(set_name)
+    # Directly obtain the number of frames to avoid loading the entire dataset
+    nframes = self._get_nframes(set_name)
     ret = nframes // batch_size
     if ret == 0:
         ret = 1
     return ret
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (2)
DPH5Path (295-508), DPPath (28-163)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
606-606: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
deepmd/utils/data.py (2)
138-138: LGTM: Frame counting now works correctly for all path types. The past review concern about DPH5Path handling has been properly addressed. The updated `_get_nframes` method (lines 586-608) now handles both HDF5 paths and regular filesystem paths efficiently, so this line works correctly for all items in `self.dirs`.
586-608: Well-implemented header-only frame counting with good type handling. The implementation efficiently reads only the .npy header to extract shape information, avoiding memory-intensive full array loads. The method correctly handles:
- Both DPH5Path (via HDF5 shape access) and regular filesystem paths
- NumPy format versions 1.x, 2.x, and 3.x
- Proper conversion of string paths to DPPath instances
Past review feedback has been addressed: unused variables are prefixed with underscores (lines 598, 602) and the error message is concise (line 606).
Minor note: The static analysis hint (TRY003) on line 606 is pedantic; the error message is appropriately brief. As per coding guidelines, please run `ruff check .` and `ruff format .` before committing to ensure all linting requirements are satisfied.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
583-605: Well-implemented header-only frame counting with proper HDF5 support. The `_get_nframes` implementation correctly handles both HDF5 and filesystem paths, reading only the `.npy` header to extract shape information without loading full arrays. All previously suggested refactors have been applied. The static analysis hint (TRY003) about the error message at line 603 appears to be a false positive; the message is already concise and clear.
If you prefer to silence the TRY003 hint, you could define a custom exception, but this is entirely optional:
```python
class UnsupportedNumpyVersionError(ValueError):
    """Raised when an unsupported .npy file version is encountered."""
    pass

# Then at line 603:
raise UnsupportedNumpyVersionError(f"version {version}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (2)
DPPath (28-163), DPH5Path (295-508)
🪛 Ruff (0.14.4)
deepmd/utils/data.py
603-603: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
deepmd/utils/data.py (3)
17-17: LGTM: Import addition is appropriate. The `Union` import is correctly used for the `_get_nframes` signature to accept both `DPPath` and `str` types.
139-139: Critical issue resolved: HDF5 path handling is now correct. The previous critical bug where `_get_nframes` was called without type checking for HDF5 paths has been successfully fixed. The `_get_nframes` method (lines 583-605) now properly handles both `DPH5Path` and filesystem paths.
341-347: Excellent optimization: Avoids loading full dataset. The refactored `get_numb_batch` now uses `_get_nframes` to obtain frame counts without loading the entire dataset, delivering the memory efficiency improvements described in the PR objectives.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@           Coverage Diff            @@
##            devel    #5040    +/- ##
=======================================
  Coverage   84.18%   84.18%
=======================================
  Files         709      709
  Lines       70217    70219     +2
  Branches     3620     3618     -2
=======================================
+ Hits        59114    59117     +3
+ Misses       9936     9935     -1
  Partials     1167     1167
```

☔ View full report in Codecov by Sentry.
…n find type.raw. 10x accelerate
Overview
This PR introduces performance optimizations for data loading and statistics computation in deepmd-kit. The changes focus on multi-threading parallelization, header-only array metadata reads, and efficient filesystem operations.
Changes Summary
1. Multi-threaded Statistics Computation (deepmd/pt/utils/stat.py)
- Uses `ThreadPoolExecutor` for parallel processing of multiple datasets
- Refactors `make_stat_input` to use a thread pool with 256 workers
- Adds a `_process_one_dataset` helper function for individual dataset processing

2. Efficient System Path Lookup (deepmd/common.py)
- Changes `expand_sys_str` to use `rglob("type.raw")` instead of `rglob("*")` + filtering
- Adds a `parent` property to the `DPOSPath` and `DPH5Path` classes in deepmd/utils/path.py

3. Header-only Data Loading (deepmd/utils/data.py)
- Adds a `_get_nframes` method to read numpy file headers without loading data
- Updates `get_numb_batch` to use the new method instead of loading the entire dataset
- Uses `np.lib.format.read_magic` and `read_array_header_*` to extract shape information (see the sketch after this list)

4. Parallel Statistics File Loading (deepmd/utils/env_mat_stat.py)
- Uses `ThreadPoolExecutor` for parallel loading of stat files
- Adds a `_load_stat_file` static method with error handling
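The sketch referenced in item 3, as a standalone illustration of header-only frame counting (the function name is illustrative; the real `_get_nframes` additionally handles `DPH5Path` objects):

```python
import numpy as np

def count_frames(npy_path: str) -> int:
    # Read only the .npy header; the array data itself is never loaded.
    with open(npy_path, "rb") as f:
        major, _minor = np.lib.format.read_magic(f)
        if major == 1:
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
        elif major in (2, 3):
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(f)
        else:
            raise ValueError(f"Unsupported .npy file version: {major}")
    # For coord.npy the first dimension is the number of frames.
    return shape[0]
```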
Performance Impact

Compatibility
✅ Backward Compatible: All API interfaces remain unchanged
✅ Data Format: No changes to data file formats
✅ Functionality: All existing features work normally
Summary by CodeRabbit