LitData Refactor PR1: Get individual functions for data pipelines #90
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff            @@
##             main      #90       +/- ##
==========================================
+ Coverage   96.64%   97.49%    +0.85%
==========================================
  Files          23       37       +14
  Lines        1818     3515     +1697
==========================================
+ Hits         1757     3427     +1670
- Misses         61       88       +27
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Outside diff range and nitpick comments (5)
tests/data/test_normalization.py (1)

37-42: LGTM! Consider parameterizing the image dimensions.

The test function `test_apply_normalization` is well implemented and covers the essential aspects of normalization. It ensures that the output is correctly normalized and has the expected data type.

To further improve the test, consider parameterizing the image dimensions to test different image sizes. This can be done with pytest's `parametrize` decorator.

Here's an example of how you can parameterize the image dimensions:

```python
import pytest


@pytest.mark.parametrize("height,width", [(200, 200), (100, 100), (300, 400)])
def test_apply_normalization(height, width):
    """Test `apply_normalization` function."""
    img = torch.randint(0, 255, (3, height, width))
    res = apply_normalization(img)

    assert torch.max(res) <= 1.0 and torch.min(res) == 0.0
    assert res.dtype == torch.float32
```

sleap_nn/data/normalization.py (1)
44-48: LGTM! Consider adding a type hint for the return value.

The function correctly normalizes the input image tensor: it checks whether the image is a floating-point type and, if not, converts it to `float32` and scales the pixel values to the range [0, 1].

To further improve code quality and readability, consider adding a type hint for the return value.

Apply this diff:

```diff
-def apply_normalization(image: torch.Tensor):
+def apply_normalization(image: torch.Tensor) -> torch.Tensor:
     """Normalize image tensor."""
     if not torch.is_floating_point(image):
         image = image.to(torch.float32) / 255.0
     return image
```

tests/data/test_providers.py (1)
96-103: Great job adding a test for the `process_lf` function!

The test follows pytest conventions and covers the essential functionality by validating the shapes of the output tensors. It also checks for the presence of NaN values in the expected instances, which is a good edge case to test.

To further improve the test, consider adding assertions that compare the content of the output tensors, not just their shapes. For example, you could compare the "image" tensor with the original image from the loaded labels:

```python
import numpy as np


def test_process_lf(minimal_instance):
    labels = sio.load_slp(minimal_instance)
    lf = labels[0]
    ex = process_lf(lf, 0, 4)

    assert ex["image"].shape == torch.Size([1, 1, 384, 384])
    assert ex["instances"].shape == torch.Size([1, 4, 2, 2])
    assert torch.isnan(ex["instances"][:, 2:, :, :]).all()

    # Compare the content of the "image" tensor with the original image
    assert np.array_equal(ex["image"].squeeze().numpy(), lf.image)
```

sleap_nn/data/providers.py (1)
31-91: LGTM! Consider simplifying the nested `if` statements.

The `process_lf` function is well structured and effectively modularizes the processing of a labeled frame. It handles the filtering of user instances, image format conversion, and instance stacking correctly. Appending NaN values to match the maximum number of instances is a good approach to ensure consistent tensor dimensions for batching (a small illustrative sketch follows the lint note below). The output dictionary is well structured and contains all the necessary information for downstream tasks.

To improve readability, consider collapsing the nested `if` statements at lines 52-53 into a single `if` statement, as suggested by the static analysis tool:

```diff
-    if user_instances_only:
-        if lf.user_instances is not None and len(lf.user_instances) > 0:
-            lf.instances = lf.user_instances
+    if user_instances_only and lf.user_instances is not None and len(lf.user_instances) > 0:
+        lf.instances = lf.user_instances
```

Tools

Ruff

52-53: Use a single `if` statement instead of nested `if` statements (SIM102)
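To make the NaN-padding idea concrete, here is a minimal illustrative sketch; `pad_instances` and its shapes are assumptions based on the tensor shapes shown in the tests above, not the project's actual code:

```python
import torch


def pad_instances(instances: torch.Tensor, max_instances: int) -> torch.Tensor:
    """Pad a (n_instances, n_nodes, 2) keypoints tensor with NaN rows.

    Frames with fewer instances are padded up to `max_instances` so that
    every example has a fixed shape and can be collated into a batch.
    """
    n_inst, n_nodes, _ = instances.shape
    if n_inst < max_instances:
        nans = torch.full((max_instances - n_inst, n_nodes, 2), float("nan"))
        instances = torch.cat([instances, nans], dim=0)
    return instances


# Example: 2 real instances padded to 4; the padded slots are all NaN.
padded = pad_instances(torch.zeros(2, 2, 2), max_instances=4)
assert padded.shape == (4, 2, 2) and torch.isnan(padded[2:]).all()
```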
sleap_nn/data/edge_maps.py (1)

250-323: Excellent implementation of the `generate_pafs` function!

The function follows a clear and logical sequence of steps to generate part-affinity fields (PAFs) from input instances. It handles edge cases, such as filtering instances outside the image bounds and validating tensor shapes, which ensures robustness.

To address the static analysis hints and align with best practices, consider moving the `attrs.field` and `attrs.converters.optional` calls out of the argument defaults. You can either perform the calls within the function or read the defaults from module-level singleton variables (a sketch of the sentinel pattern follows the lint notes below).

Additionally, to enhance the function's usability and maintainability, consider providing more detailed documentation for the input parameters and return values. Include information about the expected tensor shapes, the purpose of each parameter, and the format of the returned PAFs tensor.

Tools

Ruff

255-257: Do not perform function call `attrs.field` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

256-256: Do not perform function call `attrs.converters.optional` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)
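One common way to resolve these B008 hints is to replace the call in the default with a `None` sentinel that is resolved inside the function body. A minimal sketch of the pattern; the parameter names here are illustrative, not the actual `generate_pafs` signature:

```python
from typing import Optional

import torch


# Before (triggers B008): a function call like attrs.field(...) in a default
# runs at definition time. After: a sentinel default resolved at call time.
def generate_pafs_sketch(
    instances: torch.Tensor,
    edge_inds: Optional[torch.Tensor] = None,  # sentinel instead of a call
):
    if edge_inds is None:
        # Resolved at call time; nothing executes when the module is imported.
        edge_inds = torch.empty((0, 2), dtype=torch.int32)
    ...  # PAF generation proceeds as before, using `edge_inds`
```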
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (18)
- sleap_nn/data/augmentation.py (1 hunks)
- sleap_nn/data/confidence_maps.py (1 hunks)
- sleap_nn/data/edge_maps.py (1 hunks)
- sleap_nn/data/instance_centroids.py (3 hunks)
- sleap_nn/data/instance_cropping.py (2 hunks)
- sleap_nn/data/normalization.py (1 hunks)
- sleap_nn/data/providers.py (2 hunks)
- sleap_nn/data/resizing.py (4 hunks)
- sleap_nn/inference/predictors.py (3 hunks)
- sleap_nn/inference/topdown.py (3 hunks)
- tests/data/test_augmentation.py (2 hunks)
- tests/data/test_confmaps.py (2 hunks)
- tests/data/test_edge_maps.py (2 hunks)
- tests/data/test_instance_centroids.py (3 hunks)
- tests/data/test_instance_cropping.py (2 hunks)
- tests/data/test_normalization.py (2 hunks)
- tests/data/test_providers.py (2 hunks)
- tests/data/test_resizing.py (2 hunks)
Files skipped from review due to trivial changes (3)
- sleap_nn/data/instance_centroids.py
- sleap_nn/inference/predictors.py
- sleap_nn/inference/topdown.py
Additional context used
Ruff
tests/data/test_instance_cropping.py

12-12: `sleap_nn.data.resizing.PadToStride` imported but unused. Remove unused import: `sleap_nn.data.resizing.PadToStride` (F401)

sleap_nn/data/providers.py

52-53: Use a single `if` statement instead of nested `if` statements (SIM102)

sleap_nn/data/edge_maps.py

255-257: Do not perform function call `attrs.field` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

256-256: Do not perform function call `attrs.converters.optional` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

sleap_nn/data/augmentation.py

210-210: f-string without any placeholders. Remove extraneous `f` prefix (F541)
Additional comments not posted (19)
tests/data/test_instance_centroids.py (2)

10-35: LGTM!

The new test function `test_generate_centroids` is well implemented and covers the important scenarios:

- It tests the `generate_centroids` function with a minimal instance and a partial instance containing NaN values.
- The assertions correctly validate the generated centroids against the expected tensor outputs.
- It uses the `minimal_instance` fixture to load the test data.

Great job on adding this comprehensive test case!
Line range hint 37-72: LGTM!

The existing test function `test_instance_centroids` has been correctly updated to use the new `generate_centroids` function:

- The function name `find_centroids` has been replaced with `generate_centroids` in multiple test cases.
- The test cases cover the important scenarios: undefined anchor index, defined anchor index, and defined anchor index with missing data.
- The assertions correctly validate the generated centroids against the expected tensor outputs.
- It uses the `minimal_instance` fixture to load the test data.

The updates maintain comprehensive test coverage for the centroid generation functionality.
tests/data/test_instance_cropping.py (1)

89-103: LGTM!

The `test_generate_crops` function is a well-structured test that covers the essential aspects of the `generate_crops` function. It processes a minimal instance, generates centroids, and calls `generate_crops` with the required arguments. The test asserts the expected shapes of the output, ensuring the correctness of the `generate_crops` function.

tests/data/test_augmentation.py (2)
44-65: LGTM!

The test function `test_apply_intensity_augmentation` is well implemented and covers the essential aspects of the `apply_intensity_augmentation` method. The test uses a real-world example by loading labels from a file, which is good practice. The assertions are comprehensive and cover the important properties of the output, such as the data type, shape, and that the augmentation has been applied.

67-97: LGTM!

The test function `test_apply_geometric_augmentation` is well implemented and covers the essential aspects of the `apply_geometric_augmentation` method. The test uses a real-world example by loading labels from a file, which is good practice. The assertions are comprehensive and cover the important properties of the output, such as the data type, shape, and that the augmentation has been applied. The test also includes a negative test case to ensure that the method raises a `ValueError` when invalid parameters are provided, which is good practice (see the sketch below for the general pattern).
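For reference, the negative-case pattern praised here usually looks like the following sketch; the fixtures and keyword arguments are illustrative assumptions, not the actual test:

```python
import pytest

from sleap_nn.data.augmentation import apply_geometric_augmentation


def test_invalid_random_crop_params(example_image, example_instances):
    """Invalid random-crop sizes should raise ValueError."""
    with pytest.raises(ValueError):
        apply_geometric_augmentation(
            example_image,
            example_instances,
            random_crop_p=1.0,
            random_crop_height=0,  # invalid: must be > 0
            random_crop_width=0,   # invalid: must be > 0
        )
```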
tests/data/test_resizing.py (3)

85-93: LGTM!

The test function `test_apply_resizer` is well implemented and covers the essential functionality of the `apply_resizer` function. It correctly validates that the image is resized to the expected dimensions and that the instances are scaled accordingly.

96-106: LGTM!

The test function `test_apply_pad_to_stride` is well implemented and covers the essential functionality of the `apply_pad_to_stride` function. It correctly validates that the image is padded to the expected dimensions based on the provided `max_stride` value.

109-131: LGTM!

The test function `test_apply_sizematcher` is well implemented and covers the essential functionality of the `apply_sizematcher` function. It correctly validates that the image is size-matched to the expected dimensions based on the provided `max_height` and `max_width` values. Additionally, it tests the error handling by asserting that exceptions are raised when the provided dimensions are smaller than the current image dimensions.

tests/data/test_confmaps.py (2)
128-137: LGTM!

The test correctly validates the functionality of `generate_confmaps`. It loads a label file, processes it, calls `generate_confmaps` with the correct arguments, and checks the expected shape of the output confidence maps.

140-157: LGTM!

The test correctly validates the functionality of `generate_multiconfmaps` for both centroid and non-centroid cases. It loads a label file, processes it, calls `generate_multiconfmaps` with the correct arguments for each case, and checks the expected shape of the output confidence maps. The test covers the basic functionality well.

tests/data/test_edge_maps.py (1)
178-189: LGTM!

The `test_generate_pafs` function is a well-structured test that covers the key functionality of the `generate_pafs` function. It follows the Arrange-Act-Assert pattern and uses real data to ensure that the function produces the expected output shape.

Great job on adding this comprehensive test! It will help catch potential issues and ensure the correctness of the PAF generation logic.

sleap_nn/data/instance_cropping.py (1)
sleap_nn/data/instance_cropping.py (1)
109-154
: LGTM! Thegenerate_crops
function is well-implemented and enhances the functionality of the module.The function follows a clear logical flow and is well-documented with a comprehensive docstring. It correctly handles the input tensors and performs the necessary computations to generate the cropped image and adjust the keypoints and centroid. The use of appropriate variable names and comments makes the code easy to understand.
Returning a dictionary with all the relevant information is a good design choice, as it makes it convenient to use the results in subsequent processing steps.
Overall, the
generate_crops
function is a valuable addition to thesleap_nn/data/instance_cropping.py
module, enabling the cropping of instances based on their centroids.sleap_nn/data/resizing.py (3)
Line range hint 36-67: LGTM!

The function renaming and the added conditional check are great improvements. The new name `apply_pad_to_stride` is more descriptive, and the conditional check improves efficiency by skipping unnecessary padding operations (an illustrative sketch follows).
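As an illustration of the conditional-check idea, a sketch of pad-to-stride that short-circuits when no padding is needed; this is an assumption-based sketch, not the project's implementation:

```python
import torch
import torch.nn.functional as F


def pad_to_stride_sketch(image: torch.Tensor, max_stride: int) -> torch.Tensor:
    """Pad a (..., H, W) image on the bottom/right so H and W divide max_stride."""
    height, width = image.shape[-2:]
    pad_h = (max_stride - height % max_stride) % max_stride
    pad_w = (max_stride - width % max_stride) % max_stride
    if pad_h == 0 and pad_w == 0:
        return image  # conditional check: skip the padding op entirely
    return F.pad(image, (0, pad_w, 0, pad_h))


# Example: a 100x120 image padded for max_stride=32 becomes 128x128.
out = pad_to_stride_sketch(torch.zeros(1, 1, 100, 120), max_stride=32)
assert out.shape == (1, 1, 128, 128)
```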
88-103: Looks good!

The new `apply_resizer` function is a great addition. It provides useful functionality for flexible image manipulation and correctly handles the scaling of both the image and the keypoints. The function also has clear and concise documentation.

106-134: Nice work!

The new `apply_sizematcher` function is a valuable addition. It provides useful functionality for ensuring that images meet certain size requirements and includes robust error handling. The function also has clear and concise documentation.

sleap_nn/data/confidence_maps.py (2)
11-46: LGTM!

The `generate_confmaps` function is well implemented and properly documented. It handles different input shapes correctly, uses helper functions effectively, and follows a clear logical flow. The code changes look good to me.

49-94: LGTM!

The `generate_multiconfmaps` function is well implemented and properly documented. It correctly handles generating confidence maps for either keypoints or centroids based on the `is_centroids` flag. The function uses helper functions effectively and follows a clear logical flow. The code changes look good to me.

sleap_nn/data/augmentation.py (2)
13-106: Excellent implementation of intensity-based augmentations!

The `apply_intensity_augmentation` function is well structured and follows best practices. It leverages the Kornia library's augmentation classes to apply various intensity-based transformations to the image and instances. The function handles the reshaping of the instances tensor correctly to ensure compatibility with the augmentations.

The use of probabilities and parameters for each augmentation type provides flexibility in controlling the augmentations. The docstring is comprehensive and clearly explains the purpose, arguments, and return values of the function.

Overall, this function is a great addition to the augmentation capabilities of the library.

109-231: Excellent implementation of geometric augmentations!

The `apply_geometric_augmentation` function is well structured and follows best practices, similar to the `apply_intensity_augmentation` function. It leverages the Kornia library's augmentation classes to apply various geometric transformations to the image and instances. The function handles the reshaping of the instances tensor correctly to ensure compatibility with the augmentations.

The use of probabilities and parameters for each augmentation type provides flexibility in controlling the augmentations. The docstring is comprehensive and clearly explains the purpose, arguments, and return values of the function.

The function also includes appropriate error handling for the random cropping parameters, raising a `ValueError` if the random crop height and width are not greater than 0 when random cropping is enabled.

Overall, this function is a great addition to the augmentation capabilities of the library.

Tools

Ruff

210-210: f-string without any placeholders. Remove extraneous `f` prefix (F541)
… all model types (#91)

* Add function to get data chunks
* LitData Refactor PR3: Add custom StreamingDataset (#92)
* Add custom streamingdatasets
* LitData Refactor PR4: Integrate LitData with ModelTrainer class (#94)
* Add flag for augmentation
* Modify exception
* Fix tests
* Add litdata to trainer
* Modify test
* Add tests for data loaders
* Fix tests
* Remove files in trainer
* Remove val chunks dir
* Remove shutil.rmtree
* Remove shutil.rmtree
* Skip ubuntu test
* Fix skip ubuntu test
* Fix changes
* Save training config before fit
Actionable comments posted: 29
🧹 Outside diff range and nitpick comments (22)
environment_cpu.yml (1)

36-36: Add a newline at the end of the file.

To improve file formatting and adhere to best practices, please add a newline character at the end of the file:

```diff
   - pip:
     - "--editable=.[dev]"
     - litdata
+
```

🧰 Tools

🪛 yamllint

[error] 36-36: no new line character at the end of file (new-line-at-end-of-file)
environment.yml (2)

37-37: Approve the addition of `litdata`, with a suggestion.

The addition of `litdata` as a dependency aligns with the PR objectives for refactoring the data processing modules. However, to ensure reproducibility, it's recommended to pin the version of `litdata`:

```diff
-  - litdata
+  - litdata==X.Y.Z  # Replace X.Y.Z with the desired version
```

🧰 Tools

🪛 yamllint

[error] 37-37: no new line character at the end of file (new-line-at-end-of-file)

37-37: Add a new line at the end of the file.

To adhere to best practices and improve compatibility with version control systems, please add an empty line after the last line of the file.

🧰 Tools

🪛 yamllint

[error] 37-37: no new line character at the end of file (new-line-at-end-of-file)
tests/data/test_providers.py (1)

96-104: LGTM: the test covers the essential aspects, with room for improvement.

The `test_process_lf` function effectively tests the basic functionality of the `process_lf` function. It checks the shapes of the output, the presence of NaN values for partial annotations, and the data type of the image; these are all important aspects to verify.

However, consider the following suggestions to enhance test coverage:

- Add a test case that verifies the actual content of the processed image, not just its shape and data type.
- Include tests for edge cases, such as empty labels or labels with all-NaN values.
- Test different input scenarios, such as varying numbers of instances or different image sizes (see the sketch below).

Would you like assistance in implementing these additional test cases?
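A sketch of the last suggestion, assuming the two-instance `minimal_instance` fixture, the `process_lf(lf, frame_index, max_instances)` call shape used in the existing test, and the imports already present in the test file:

```python
import pytest
import torch


@pytest.mark.parametrize("max_instances", [2, 4, 6])
def test_process_lf_max_instances(minimal_instance, max_instances):
    """Instances should be NaN-padded up to max_instances."""
    labels = sio.load_slp(minimal_instance)
    ex = process_lf(labels[0], 0, max_instances)

    assert ex["instances"].shape[1] == max_instances
    # Slots beyond the two real instances should be all-NaN padding.
    assert torch.isnan(ex["instances"][:, 2:, :, :]).all()
```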
tests/data/test_instance_cropping.py (2)

89-104: LGTM: well-structured test for the `generate_crops` functionality.

The new test function `test_generate_crops` is well implemented and aligns with the PR objectives. It effectively tests the refactored individual functions for data processing, and the assertions cover the essential aspects of the cropped examples.

Consider adding an assertion that verifies the content of the cropped instance, not just its shape. This could help catch any potential issues in the cropping logic itself. For example:

```python
assert torch.allclose(cropped_ex["instance"], expected_instance_tensor, atol=1e-6)
```

where `expected_instance_tensor` is the expected result after cropping.
Test Coverage Gaps Identified

Several functions in the data processing modules lack corresponding test functions:

- `sleap_nn/data/utils.py`: `def ensure_list(x: Any) -> List[Any]:`
- `sleap_nn/data/utils.py`: `def make_grid_vectors(`
- `sleap_nn/data/utils.py`: `def expand_to_rank(`
- `sleap_nn/data/utils.py`: `def gaussian_pdf(x: torch.Tensor, sigma: float) -> torch.Tensor:`
- `sleap_nn/data/resizing.py`: `def find_padding_for_stride(`
- `sleap_nn/data/resizing.py`: `def apply_pad_to_stride(image: torch.Tensor, max_stride: int) -> torch.Tensor:`
- `sleap_nn/data/resizing.py`: `def resize_image(image: torch.Tensor, scale: float):`
- `sleap_nn/data/resizing.py`: `def apply_resizer(image: torch.Tensor, instances: torch.Tensor, scale: float = 1.0):`
- `sleap_nn/data/providers.py`: `def get_max_instances(labels: sio.Labels):`
- `sleap_nn/data/providers.py`: `def process_lf(`
- …and others as identified by the script.

It is recommended to add test cases for these functions to ensure comprehensive test coverage and maintain the reliability of the data processing pipeline.
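As a starting point, a hedged sketch of a test for one of the listed helpers, assuming `find_padding_for_stride(image_height, image_width, max_stride)` returns a `(pad_height, pad_width)` pair:

```python
from sleap_nn.data.resizing import find_padding_for_stride


def test_find_padding_for_stride():
    """Padded dimensions should become divisible by max_stride."""
    pad_h, pad_w = find_padding_for_stride(200, 300, max_stride=32)
    assert (200 + pad_h) % 32 == 0
    assert (300 + pad_w) % 32 == 0

    # Already-divisible dimensions should need no padding.
    assert find_padding_for_stride(256, 256, max_stride=32) == (0, 0)
```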
🔗 Analysis chain

Line range hint 1-104: Overall, the changes align well with the PR objectives.

The additions to this test file effectively support the refactoring of operations into individual functions. The new test function `test_generate_crops` provides coverage for the newly refactored functionality, which is crucial for ensuring the reliability of the data processing pipeline.

These changes contribute to the overall goal of enhancing the code's structure and maintainability, setting a good foundation for future improvements to the data processing modules.

To ensure that all refactored functions are properly tested, let's verify whether any other newly added functions need test coverage. This will help identify any gaps in test coverage for the refactored functions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for new functions in the data processing modules that might need tests

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
rg --type python '^\s*def\s+\w+\s*\(' src/sleap_nn/data/

# Search for corresponding test functions
echo "\nCorresponding test functions:"
rg --type python '^\s*def\s+test_\w+\s*\(' tests/data/
```

Length of output: 343

Script:

```bash
#!/bin/bash
# Description: Check for new functions in the data processing modules and their corresponding test functions using file extensions.

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
fd --extension py --search-path src/sleap_nn/data/ --type file | xargs rg '^\s*def\s+\w+\s*\('

# Search for corresponding test functions
echo "\nCorresponding test functions:"
fd --extension py --search-path tests/data/ --type file | xargs rg '^\s*def\s+test_\w+\s*\('
```

Length of output: 36245

Script:

```bash
#!/bin/bash
# Description: Check for new functions in the data processing modules and their corresponding test functions using corrected paths.

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
fd --extension py --search-path sleap_nn/data/ --type file | xargs rg '^\s*def\s+\w+\s*\('

# Search for corresponding test functions
echo "\nCorresponding test functions:"
fd --extension py --search-path tests/data/ --type file | xargs rg '^\s*def\s+test_\w+\s*\('
```

Length of output: 10173
tests/data/test_get_data_chunks.py (5)

11-62: LGTM: comprehensive test for `bottomup_data_chunks`.

The test function thoroughly checks the `bottomup_data_chunks` method for both grayscale and RGB cases. It verifies the presence of the expected keys and the shapes of the output arrays.

Consider adding a test case with `max_instances` set to a value less than the actual number of instances, to ensure the function handles this scenario correctly.

65-125: LGTM: thorough test for `centered_instance_data_chunks`.

The test function comprehensively checks the `centered_instance_data_chunks` method for both grayscale and RGB cases. It verifies the presence of the expected keys and the shapes of the output arrays.

Consider adding assertions that verify the values of `instance_bbox` and `centroid` to ensure they are calculated correctly.

128-187: LGTM: comprehensive test for `centroid_data_chunks`.

The test function thoroughly checks the `centroid_data_chunks` method for both grayscale and RGB cases. It verifies the presence of the expected keys and the shapes of the output arrays, including the 'centroids' key.

Consider adding a test case with different `anchor_ind` values to ensure the function handles various anchor indices correctly.

190-239: LGTM: well-structured test for `single_instance_data_chunks`.

The test function effectively checks the `single_instance_data_chunks` method for both grayscale and RGB cases. It appropriately modifies the input data to contain only a single instance and verifies the presence of the expected keys and the shapes of the output arrays.

Consider adding a test case where the input data already contains only one instance, to ensure the function handles this scenario correctly without modification.

1-239: Overall: comprehensive test suite with room for enhancement.

This test file provides a solid foundation for testing the data chunking functions. It consistently covers both grayscale and RGB cases for each function, ensuring basic functionality and output structure.

To further improve the test suite:

- Consider adding edge cases, such as empty input data or the maximum number of instances.
- Implement parametrized tests to cover a wider range of input scenarios efficiently.
- Add tests for error handling, ensuring the functions raise appropriate exceptions for invalid inputs.
- Consider testing with different image sizes to ensure proper scaling and cropping behavior.
docs/config.md (5)

18-18: LGTM! Consider adjusting indentation for consistency.

The addition of the `user_instances_only` parameter is valuable, giving users control over the types of instances used in training. The description is clear and informative.

To maintain consistency with the document's formatting, consider adjusting the indentation to match other list items at this level (2 spaces instead of 4).

🧰 Tools

🪛 Markdownlint

18-18: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)
19-19: Enhance the description and correct the default value format.

The addition of the `chunk_size` parameter is useful for controlling data processing. However, the description could be more informative, and the default value format needs correction.

- Consider expanding the description to explain the purpose and impact of this parameter. For example: "Size of each data chunk (in MB). Larger chunks may process faster but require more memory."
- Remove the quotes around the default value to maintain consistency with the integer type: `*Default*: 100`.
- Adjust the indentation to 2 spaces for consistency with other list items at this level.

🧰 Tools

🪛 Markdownlint

19-19: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)
62-64: Enhance the descriptions for the random crop parameters and adjust formatting.

The addition of the random crop parameters is a valuable enhancement to the data augmentation capabilities. However, the descriptions could be more informative, and the formatting needs adjustment.

- Expand the descriptions to provide more context:
  - `random_crop_p`: (float) Probability of applying random crop augmentation. Range: 0.0 to 1.0.
  - `random_crop_height`: (int) Desired height of the random crop in pixels.
  - `random_crop_width`: (int) Desired width of the random crop in pixels.
- Adjust the indentation to 6 spaces to align with other items in the `preprocessing` subsection.
- Consider adding a brief explanation of how these parameters work together, e.g., "When applied, a random crop of size (random_crop_height, random_crop_width) will be taken from the input image." (See the sketch after the lint notes below.)

🧰 Tools

🪛 Markdownlint

62-62: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)

63-63: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)

64-64: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)
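To make the interplay concrete, a small illustrative sketch (not the project's code) of how the three parameters typically map onto a Kornia augmentation; `random_crop_p` gates whether the crop is applied to a given sample:

```python
import torch
from kornia.augmentation import RandomCrop

random_crop_height, random_crop_width = 160, 160

# p=1.0 here for determinism; in the pipeline, random_crop_p would be passed
# as the probability of applying the crop to a given sample.
crop = RandomCrop((random_crop_height, random_crop_width), p=1.0)
out = crop(torch.rand(1, 1, 384, 384))
assert out.shape == (1, 1, 160, 160)
```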
Line range hint 201-209: LGTM! Consider adding links to the PyTorch documentation.

The updates to the `trainer_config` section, particularly for data loaders and model checkpointing, significantly improve the clarity and usefulness of the documentation. The added flexibility and detailed explanations will be valuable for users.

To further enhance the documentation, consider adding direct links to the relevant PyTorch and Lightning documentation:

- For DataLoader: `[Torch's DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)`
- For ModelCheckpoint: `[Lightning's ModelCheckpoint](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html)`

This will allow users to easily access the full documentation for these components if needed.

🧰 Tools

🪛 Markdownlint

15-15: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

16-16: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

17-17: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

18-18: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

19-19: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

21-21: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

22-22: Expected: 4; Actual: 8. Unordered list indentation (MD007, ul-indent)
Line range hint 1-230: Overall, excellent improvements to the configuration documentation.

The changes to this file significantly enhance the clarity and flexibility of the `sleap_nn.ModelTrainer` configuration. The new parameters for data processing, such as `user_instances_only` and `chunk_size`, align well with the PR objectives of refactoring the data pipeline. The expanded documentation for data loaders and model checkpointing provides users with valuable information for fine-tuning their training process.

To further improve the documentation:

- Consider adding a brief introduction at the beginning of the file explaining the purpose and importance of this configuration file in the context of the SLEAP-NN project.
- If possible, include examples of common configuration scenarios to help users get started quickly.
- Implement a consistent indentation scheme throughout the document (e.g., 2 spaces for all list items at the same level) to improve readability.
- Consider adding cross-references between related parameters in different sections to help users understand the relationships between various configuration options.

🧰 Tools

🪛 Markdownlint

15-15: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

16-16: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

17-17: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

18-18: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

19-19: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

21-21: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

22-22: Expected: 4; Actual: 8. Unordered list indentation (MD007, ul-indent)
sleap_nn/data/providers.py (1)

31-94: Consider adding unit tests for the new `process_lf` function.

Adding unit tests will help ensure that `process_lf` works correctly under different scenarios and will facilitate future maintenance.

Would you like assistance in generating unit tests for this function?

🧰 Tools

🪛 Ruff

52-53: Use a single `if` statement instead of nested `if` statements (SIM102)
sleap_nn/data/streaming_datasets.py (4)

121-121: Fix typo in class docstring.

There's a typo in the docstring of the `CenteredInstanceStreamingDataset` class: "Cenetered" should be "Centered".

Apply this diff to fix the typo:

```diff
-    """StreamingDataset for CeneteredInstance pipeline.
+    """StreamingDataset for CenteredInstance pipeline.
```

23-24: Clarify the use of `.bin` files in the docstring.

The mention of `.bin` files might be confusing if they are not a standard part of the data pipeline or if users are unfamiliar with their purpose.

Consider elaborating on or generalizing the reference to data storage:

```diff
-    data sample stored in `.bin` files.
+    data sample stored in the dataset.
```

145-145: Annotate the `crop_hw` parameter type for clarity.

The type annotation for `crop_hw` is `Tuple[int]`, which implies a tuple of variable length containing integers. If `crop_hw` should specifically be a tuple of two integers (height and width), it would be clearer to specify `Tuple[int, int]`.

Apply this diff to specify the expected tuple size:

```diff
-        crop_hw: Tuple[int],
+        crop_hw: Tuple[int, int],
```

223-226: Use consistent docstring formatting for clarity.

To maintain consistency across all class docstrings, ensure that descriptions and parameter explanations are uniform and clear. This improves readability and helps users understand each dataset's purpose.

Consider reviewing and aligning the docstrings of all classes.
sleap_nn/training/model_trainer.py (1)

476-487: Address the TODO and uncomment the code for cleaning up training files.

The code for deleting training and validation files is commented out, with a TODO note indicating test failures. Leaving this code commented out may result in unused temporary files consuming disk space.

Would you like assistance in diagnosing the test failures related to this code segment? I can help troubleshoot and provide a solution that safely deletes the temporary files without affecting the tests.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (17)
- docs/config.md (2 hunks)
- docs/config_bottomup.yaml (1 hunks)
- docs/config_centroid.yaml (2 hunks)
- docs/config_topdown_centered_instance.yaml (1 hunks)
- environment.yml (1 hunks)
- environment_cpu.yml (1 hunks)
- environment_mac.yml (1 hunks)
- sleap_nn/data/get_data_chunks.py (1 hunks)
- sleap_nn/data/providers.py (2 hunks)
- sleap_nn/data/streaming_datasets.py (1 hunks)
- sleap_nn/training/model_trainer.py (9 hunks)
- tests/data/test_get_data_chunks.py (1 hunks)
- tests/data/test_instance_cropping.py (2 hunks)
- tests/data/test_providers.py (2 hunks)
- tests/data/test_streaming_datasets.py (1 hunks)
- tests/fixtures/datasets.py (1 hunks)
- tests/training/test_model_trainer.py (14 hunks)
🧰 Additional context used

🪛 Markdownlint

docs/config.md

18-18: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

19-19: Expected: 2; Actual: 4. Unordered list indentation (MD007, ul-indent)

62-62: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)

63-63: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)

64-64: Expected: 6; Actual: 12. Unordered list indentation (MD007, ul-indent)

🪛 yamllint

environment.yml

[warning] 36-36: wrong indentation: expected 6 but found 4 (indentation)

[error] 37-37: no new line character at the end of file (new-line-at-end-of-file)

environment_cpu.yml

[warning] 35-35: wrong indentation: expected 6 but found 4 (indentation)

[error] 36-36: no new line character at the end of file (new-line-at-end-of-file)

environment_mac.yml

[warning] 34-34: wrong indentation: expected 6 but found 4 (indentation)

[error] 35-35: no new line character at the end of file (new-line-at-end-of-file)
🪛 Ruff

sleap_nn/data/providers.py

52-53: Use a single `if` statement instead of nested `if` statements (SIM102)

sleap_nn/data/streaming_datasets.py

5-5: `typing.List` imported but unused. Remove unused import (F401)

5-5: `typing.Optional` imported but unused. Remove unused import (F401)

sleap_nn/training/model_trainer.py

9-9: `shutil` imported but unused. Remove unused import: `shutil` (F401)

12-12: `torch.utils.data.DataLoader` imported but unused. Remove unused import: `torch.utils.data.DataLoader` (F401)

41-41: Redefinition of unused `sio` from line 11. Remove definition: `sio` (F811)

43-43: `sleap_nn.data.cycler.CyclerIterDataPipe` imported but unused. Remove unused import: `sleap_nn.data.cycler.CyclerIterDataPipe` (F401)

110-112: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)

tests/data/test_instance_cropping.py

12-12: `sleap_nn.data.resizing.PadToStride` imported but unused. Remove unused import: `sleap_nn.data.resizing.PadToStride` (F401)
🔇 Additional comments (30)

environment_mac.yml (1)

35-35: LGTM: addition of the `litdata` dependency.

The addition of `litdata` as a pip dependency is appropriate and aligns with the PR objectives to refactor the data processing modules.

🧰 Tools

🪛 yamllint

[error] 35-35: no new line character at the end of file (new-line-at-end-of-file)

environment_cpu.yml (1)

35-36: LGTM! Indentation fixed and new package added.

The changes look good:

- The indentation of the "--editable=.[dev]" entry has been corrected to 6 spaces, aligning with the YAML structure.
- The new `litdata` package has been added, which aligns with the PR objectives for the data pipeline refactoring.

🧰 Tools

🪛 yamllint

[warning] 35-35: wrong indentation: expected 6 but found 4 (indentation)

[error] 36-36: no new line character at the end of file (new-line-at-end-of-file)
docs/config_bottomup.yaml (1)

5-5: Approve the new configuration option, with suggestions.

The addition of `user_instances_only: True` aligns with the PR objectives of refactoring the data processing modules. This option will likely affect how data is processed in the `__iter__()` method.

Suggestions:

- Consider adding a comment to explain the purpose and impact of this option, for better maintainability.
- Ensure that this change is properly handled in the corresponding Python code that processes this configuration.

To verify the impact of this change, let's check for its usage in the codebase. This will help ensure that the new option is properly handled in the code that processes this configuration.

✅ Verification successful

Verified the addition of the `user_instances_only` configuration option.

The `user_instances_only: True` setting is properly integrated and utilized across the codebase, aligning with the PR objectives. Consider adding comments to `docs/config_bottomup.yaml` to explain the purpose and impact of this option, for better maintainability.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for usage of 'user_instances_only' in Python files

# Search for 'user_instances_only' in Python files
echo "Searching for 'user_instances_only' usage:"
rg --type python 'user_instances_only'

# Search for data configuration parsing
echo "\nSearching for data configuration parsing:"
rg --type python 'data_config'
```

Length of output: 329

Script:

```bash
#!/bin/bash
# Description: Retry searching for 'user_instances_only' in Python files with the corrected file type

# Option 1: Use the correct file type identifier 'py' with ripgrep
echo "Searching for 'user_instances_only' usage (using type 'py'):"
rg --type py 'user_instances_only'

# Option 2: Search using file extensions without specifying a file type
echo "\nSearching for 'user_instances_only' usage (using file extensions):"
rg --glob '*.py' 'user_instances_only'

# Additionally, search for 'data_config' in Python files
echo "\nSearching for data configuration parsing (using type 'py'):"
rg --type py 'data_config'

echo "\nSearching for data configuration parsing (using file extensions):"
rg --glob '*.py' 'data_config'
```

Length of output: 61488
docs/config_topdown_centered_instance.yaml (2)

Line range hint 1-124: Consider adding documentation and clarifying the commented-out sections.

While the addition of `user_instances_only` is a good improvement, consider the following suggestions:

- Add a comment explaining the purpose and impact of the `user_instances_only` parameter, for better maintainability.
- The commented-out sections for alternative backbone configurations (ConvNeXt and Swin Transformer) suggest ongoing work. Consider adding a TODO comment explaining the status of these alternatives, or remove them if they're no longer relevant.

Here's a suggested addition for documentation:

```yaml
data_config:
  provider: LabelsReader
  train_labels_path: minimal_instance.pkg.slp
  val_labels_path: minimal_instance.pkg.slp
  # Only use instances defined by users, ignoring auto-generated ones
  user_instances_only: True
```

To check the relevance of the commented-out sections, run:

```bash
#!/bin/bash
# Description: Check for usage of alternative backbones

echo "Searching for ConvNeXt usage:"
rg --type python "ConvNeXt" -A 5

echo "Searching for Swin Transformer usage:"
rg --type python "Swin_T|swint" -A 5
```

🧰 Tools

🪛 yamllint

[error] 7-7: trailing spaces (trailing-spaces)

[error] 8-8: trailing spaces (trailing-spaces)

5-5: LGTM! Verify the impact on the data processing pipeline.

The addition of `user_instances_only: True` aligns with the PR objectives of refactoring the data processing modules. This parameter will likely change how instances are handled during data processing.

To ensure this change is properly integrated, please verify:

- The `LabelsReader` provider supports this new parameter.
- The impact on data augmentation and preprocessing steps.
- Any necessary updates to the model or trainer configurations.

Run the following script to check for related configurations:

✅ Verification successful

Verified: integration of the `user_instances_only` parameter.

The addition of `user_instances_only: True` has been successfully integrated across the relevant modules. `LabelsReader` and `model_trainer.py` handle the new parameter appropriately, ensuring consistent behavior in data processing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for related configurations and usage of user instances

# Test 1: Search for LabelsReader implementation
echo "Searching for LabelsReader implementation:"
rg --type python "class LabelsReader" -A 10

# Test 2: Search for user instances handling in data processing
echo "Searching for user instances handling:"
rg --type python "user_instances_only|user_defined_instances" -A 5

# Test 3: Check for any TODOs or FIXMEs related to instance handling
echo "Checking for TODOs or FIXMEs related to instance handling:"
rg --type python "TODO|FIXME" -A 3 | rg "instance"
```

Length of output: 382

Script:

```bash
#!/bin/bash
# Description: Check for related configurations and usage of user instances (corrected file type)

# Test 1: Search for LabelsReader implementation
echo "Searching for LabelsReader implementation:"
rg --type py "class LabelsReader" -A 10

# Test 2: Search for user instances handling in data processing
echo "Searching for user instances handling:"
rg --type py "user_instances_only|user_defined_instances" -A 5

# Test 3: Check for any TODOs or FIXMEs related to instance handling
echo "Checking for TODOs or FIXMEs related to instance handling:"
rg --type py "TODO|FIXME" -A 3 | rg "instance"
```

Length of output: 12466
tests/data/test_providers.py (1)

3-3: LGTM: the import statement is correctly placed and relevant.

The addition of `process_lf` to the import statement is appropriate and aligns with the new test function introduced in this file.

docs/config_centroid.yaml (4)
Line range hint 44-44: Acknowledge indentation correction.

The indentation of `pre_trained_weights` under `model_config` has been corrected to align with the other keys in this section. This improves the structure and readability of the configuration file. Good catch!

🧰 Tools

🪛 yamllint

[error] 7-7: trailing spaces (trailing-spaces)

[error] 8-8: trailing spaces (trailing-spaces)
Line range hint 1-160: Summary of configuration changes.

This PR introduces several changes to the configuration file:

- Added `user_instances_only: True` in the data_config section.
- Restructured the random crop configuration in the augmentation_config section.
- Corrected the indentation of `pre_trained_weights` in the model_config section.

These changes could have significant implications for the data processing pipeline and model training.

To ensure these changes align with the PR objectives and overall project goals:

- Confirm that these changes are intentional and necessary for the "LitData Refactor PR1" objectives.
- Verify that corresponding changes have been made in the codebase to handle these configuration updates.
- Consider updating the project documentation to reflect these configuration changes and their impact on the data pipeline and model training process.

```bash
#!/bin/bash
# Search for files that might need updates based on these configuration changes
rg --type python 'user_instances_only|random_crop|crop_height|crop_width|pre_trained_weights'
```

🧰 Tools

🪛 yamllint

[error] 7-7: trailing spaces (trailing-spaces)

[error] 8-8: trailing spaces (trailing-spaces)
38-40: Review the changes to the random crop configuration.

The structure of the random crop configuration has been changed:

- The `random_crop` subsection has been removed.
- `random_crop_p`, `crop_height`, and `crop_width` are now at the same level as the other augmentation parameters.

Additionally, `random_crop_p` is set to 0, which effectively disables random cropping.

Please confirm:

- Is this structural change intentional? How does it affect the augmentation pipeline?
- Is disabling random cropping (by setting `random_crop_p: 0`) intended?
- Are the `crop_height` and `crop_width` values of 160 appropriate for your use case?
5-5: Verify the intention and impact of `user_instances_only`.

The addition of `user_instances_only: True` in the data_config section could significantly affect the data pipeline. This parameter suggests that only user-defined instances will be used in data processing.

Please confirm:

- Is this change intentional and aligned with the PR objectives?
- What is the expected impact on the data pipeline and model training?
- Are there any corresponding changes in the data processing code to handle this new parameter?

✅ Verification successful

Addition of `user_instances_only: True` verified.

The `user_instances_only` parameter is consistently used across the relevant modules to filter user-defined instances. Its addition in `docs/config_centroid.yaml` aligns with its implementation in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Search for usage of 'user_instances_only' in Python files
rg --type python 'user_instances_only'
```

Length of output: 69

Script:

```bash
#!/bin/bash
# Search for usage of 'user_instances_only' in Python files
rg --type py 'user_instances_only'
```

Length of output: 2607
tests/data/test_instance_cropping.py (2)

4-4: LGTM: new imports added for enhanced testing.

The new imports (`generate_centroids`, `generate_crops`, `apply_normalization`, `process_lf`) are correctly added to support the new test function. These additions improve the test coverage and align with the PR objectives of refactoring operations into individual functions.

Also applies to: 8-8, 11-11, 13-13

87-88: LGTM: improved readability with proper spacing.

The addition of two blank lines between test functions improves the code's readability and adheres to PEP 8 style guidelines for separating top-level functions.
tests/data/test_get_data_chunks.py (1)

1-8: LGTM: the import statements are appropriate.

The import statements are well structured and import the necessary modules and functions for the tests. The use of `import sleap_io as sio` is a good convention for brevity.

tests/training/test_model_trainer.py (9)
13-13: LGTM: `sys` module import added.

The addition of the `sys` module import is appropriate. It's likely being used for system-specific operations or checks in the test file.

91-92: LGTM: cleanup operations added.

The addition of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories after tests is good practice. This ensures proper cleanup of temporary files and directories, preventing potential interference between tests and reducing disk usage.

Also applies to: 165-166, 188-189, 217-218
300-301: LGTM: consistent cleanup added.

The addition of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories is consistent with the cleanup operations added in other parts of the file. This ensures proper cleanup after the test.
344-345: LGTM: additional cleanup added.

The addition of another set of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories is consistent with the cleanup operations added earlier in the same function. This ensures thorough cleanup after different parts of the test.
376-377: LGTM: cleanup operations added to `test_centroid_model`.

The addition of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories in the `test_centroid_model` function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the test.
420-421: LGTM: cleanup operations added to `test_single_instance_model`.

The addition of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories in the `test_single_instance_model` function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the test.
457-458: LGTM: cleanup operations added to `test_bottomup_model` (first instance).

The addition of `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories in the first part of the `test_bottomup_model` function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the first part of the test.
491-493: LGTM: improved readability and cleanup in `test_bottomup_model`.

The addition of an empty line improves readability by separating the test assertions from the cleanup operations. The `shutil.rmtree` calls to remove the `train_chunks` and `val_chunks` directories are consistent with the cleanup operations added in other parts of the file, ensuring proper cleanup after the second part of the test.
Line range hint 1-493: Overall improvements to test suite cleanup and handling of platform-specific issues.

The changes in this file consistently improve the test suite by adding cleanup operations after each test, which is good practice. The addition of the `sys` module import and the skip condition for a potentially flaky test on Linux systems addresses platform-specific issues, although the root cause of the flakiness should be investigated further.

Key improvements:

- Consistent cleanup of the temporary directories (`train_chunks` and `val_chunks`) after tests.
- Improved test isolation and reduced disk usage due to proper cleanup.
- Handling of platform-specific test issues (Linux-specific test skip).

Suggestions for future work:

- Investigate and address the root cause of the flaky test on Linux systems.
- Consider adding a cleanup fixture to handle the repetitive cleanup operations more efficiently (see the sketch below).

Overall, these changes enhance the reliability and maintainability of the test suite.
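A hedged sketch of such a fixture; the directory names come from the tests above, while their exact location is an assumption:

```python
import shutil

import pytest


@pytest.fixture
def cleanup_chunk_dirs(tmp_path):
    """Yield the chunk directories and remove them after the test."""
    train_dir = tmp_path / "train_chunks"
    val_dir = tmp_path / "val_chunks"
    yield train_dir, val_dir
    # Teardown: runs after the test body, replacing inline rmtree calls.
    for d in (train_dir, val_dir):
        shutil.rmtree(d, ignore_errors=True)
```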
sleap_nn/data/providers.py (1)

3-3: The import statement addition is acceptable.

The import of `Any` from the `typing` module is appropriate.

sleap_nn/data/streaming_datasets.py (7)
99-102: Verify the `is_centroids` parameter value.

In the `generate_multiconfmaps` call within the `BottomUpStreamingDataset`, the `is_centroids` parameter is set to `False`. Please verify that this aligns with the intended behavior for the BottomUp model, as an incorrect setting may affect confidence map generation.

181-183: Ensure correct calculation of instance bounding boxes.

In the re-cropping logic, make sure that the `make_centered_bboxes` function and the subsequent operations correctly calculate the bounding boxes, especially considering batch dimensions and tensor shapes. This is crucial for accurate cropping of instances.
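For reference, a hedged usage sketch of `make_centered_bboxes`; its exact signature and return layout are assumptions and should be verified against `sleap_nn.data.instance_cropping`:

```python
import torch

from sleap_nn.data.instance_cropping import make_centered_bboxes

# One centroid at (x=100, y=120); a 160x160 box centered on it. The assumed
# return layout is (n, 4, 2) corner points, which downstream cropping consumes.
centroids = torch.tensor([[100.0, 120.0]])
boxes = make_centered_bboxes(centroids, 160, 160)  # assumed: (n, 4, 2) corners
```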
187-190: Check the centering of instances and centroids.

When adjusting the `instance` and `centroid` tensors by subtracting `point`, ensure that this operation correctly centers the keypoints and centroids relative to the cropped images.

267-269: Confirm the correct usage of `apply_intensity_augmentation`.

In the `CentroidStreamingDataset`, verify that `apply_intensity_augmentation` is correctly applied to `ex["image"]` and `ex["centroids"]`. Since centroids represent point coordinates, intensity augmentations may not affect them. Ensure this call behaves as expected.

349-351: Validate the augmentation functions for `SingleInstanceStreamingDataset`.

In the `SingleInstanceStreamingDataset`, confirm that the augmentation functions are compatible with single-instance data and that they are applied correctly to `ex["image"]` and `ex["instances"]`.

370-376: Ensure correct generation of confidence maps.

In the `SingleInstanceStreamingDataset`, verify that the `generate_confmaps` function is called with appropriate parameters and that the output aligns with the expected input shape for the model.

270-274: 🛠️ Refactor suggestion

Avoid redundant augmentation on centroids.

Intensity augmentations typically modify pixel values and may not require adjustments to centroid coordinates. If `apply_intensity_augmentation` does not alter coordinates, passing `ex["centroids"]` might be unnecessary.

Consider modifying the augmentation call:

```diff
-        ex["image"], ex["centroids"] = apply_intensity_augmentation(
+        ex["image"], _ = apply_intensity_augmentation(
```

Ensure that geometric augmentations still correctly adjust centroid positions.

Likely invalid or redundant comment.
This is the first PR for #80. Here, we break down the operations in the `__iter__()` methods of all data processing modules into individual functions that will be useful for implementing `get_chunks()` (the second PR for #80).
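As a concrete illustration of the pattern, using the `apply_normalization` function quoted in the review above: logic that previously lived inside a datapipe's `__iter__()` becomes a plain function that both the datapipe and the future `get_chunks()` can call.

```python
import torch


def apply_normalization(image: torch.Tensor) -> torch.Tensor:
    """Normalize an image tensor to float32 in [0, 1]."""
    if not torch.is_floating_point(image):
        image = image.to(torch.float32) / 255.0
    return image


# Callable directly, outside any datapipe:
img = torch.randint(0, 255, (1, 1, 384, 384), dtype=torch.uint8)
assert apply_normalization(img).max() <= 1.0
```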
Summary by CodeRabbit
Release Notes
New Features

- Introduced new configuration options: `user_instances_only`, `chunk_size`, and additional parameters for data augmentation.
- New dependency `litdata` added to the environment configuration files.

Bug Fixes

Tests

- Added tests for the `process_lf` function and updated existing tests to enhance coverage.