
LitData Refactor PR1: Get individual functions for data pipelines #90

Merged (3 commits) on Oct 3, 2024

Conversation

Contributor

@gitttt-1234 commented on Sep 12, 2024

This is the first PR for #80. Here, we break down the operations in __iter__() of all the data processing modules into individual functions that will be useful for implementing get_chunks() (the second PR for #80).
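As a rough sketch of the idea (hypothetical names, not the actual sleap_nn code), per-frame logic is lifted out of __iter__() into a standalone function that a later get_chunks()-style step can also call:

from typing import Dict, Iterator, List


def process_frame(frame: Dict) -> Dict:
    """Per-frame processing that previously lived inside __iter__()."""
    processed = dict(frame)
    processed["processed"] = True  # placeholder for normalization, resizing, etc.
    return processed


class FramePipeline:
    """Minimal pipeline whose __iter__() now just delegates to the shared function."""

    def __init__(self, frames: List[Dict]):
        self.frames = frames

    def __iter__(self) -> Iterator[Dict]:
        for frame in self.frames:
            yield process_frame(frame)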

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced configuration options for data processing, including user_instances_only, chunk_size, and additional parameters for data augmentation.
    • Introduced new functions for generating data chunks from labeled frames, improving data handling for training.
    • Added custom streaming datasets for various model pipelines, optimizing data retrieval processes.
    • New dependency litdata added to environment configuration files.
  • Bug Fixes

    • Corrected the indentation for dependencies in environment configuration files.
  • Tests

    • Implemented unit tests for new data chunking functions and streaming dataset classes, ensuring reliability and performance.
    • Added tests for the process_lf function and updated existing tests to enhance coverage.

Contributor

coderabbitai bot commented Sep 12, 2024

Walkthrough

The pull request introduces significant enhancements to the sleap_nn library, particularly in the configuration documentation and data processing modules. New parameters have been added to the configuration files, including options for data augmentation and user-defined instances. Additionally, new functions for generating data chunks and processing labeled frames have been implemented. The changes also include the introduction of new streaming dataset classes and modifications to the model trainer, enhancing the overall data handling and training process.

Changes

Files Change Summary
docs/config.md Updated configuration documentation for sleap_nn.ModelTrainer, adding new parameters in data_config and trainer_config, enhancing clarity and functionality.
docs/config_bottomup.yaml, .../centroid.yaml, .../topdown_centered_instance.yaml Added user_instances_only: True in data_config section to specify user-defined instances for data processing.
environment.yml, environment_cpu.yml, environment_mac.yml Added new dependency litdata in the pip section and corrected indentation for ndx-pose in environment_cpu.yml.
sleap_nn/data/get_data_chunks.py Introduced functions for generating data chunks for training, including bottomup_data_chunks, centered_instance_data_chunks, centroid_data_chunks, and single_instance_data_chunks.
sleap_nn/data/providers.py Added process_lf function to process labeled frames, returning relevant attributes and handling user instances.
sleap_nn/data/streaming_datasets.py Introduced custom streaming dataset classes: BottomUpStreamingDataset, CenteredInstanceStreamingDataset, CentroidStreamingDataset, and SingleInstanceStreamingDataset, implementing data retrieval and processing methods.
sleap_nn/training/model_trainer.py Enhanced ModelTrainer class with _get_data_chunks method, refactored data loader creation to utilize streaming datasets, and improved model checkpoint handling.
tests/data/test_get_data_chunks.py Added unit tests for data chunking functions to validate output structures and dimensions.
tests/data/test_instance_cropping.py Added test_generate_crops function to test the cropping functionality based on processed data.
tests/data/test_providers.py Introduced test_process_lf function to validate the process_lf functionality.
tests/data/test_streaming_datasets.py Added unit tests for streaming dataset classes to ensure correct data retrieval and structure.
tests/fixtures/datasets.py Modified augmentation_config to include contrast_p parameter set to 1.0.
tests/training/test_model_trainer.py Updated tests for ModelTrainer, simplifying the structure by removing redundant assertions and standardizing cleanup procedures.

Possibly related PRs

  • Refactor model pipeline #51: Modifications to the docs/config.md file include updates to the data_config section relevant to the ModelTrainer class, similar to changes in this PR.
  • Add BottomUp model pipeline #52: Introduces a new pipeline for BottomUp models and modifies the docs/config.md to support new configurations, relating to the enhancements in this PR.
  • Refactor model config #61: The refactoring of model configuration and removal of the inference_config section relates to updates in trainer_config and data_config, indicating a broader restructuring of configuration files.

Poem

🐇
In the meadow of code, new paths we find,
With chunks of data, all neatly aligned.
User instances dance, as parameters grow,
Streaming datasets ready, let the training flow!
Hooray for the changes, let progress be seen,
In the world of Sleap, we leap like a dream!
🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between ad8b4d7 and c545857.

📒 Files selected for processing (1)
  • sleap_nn/data/get_data_chunks.py (1 hunks)
🔇 Additional comments (6)
sleap_nn/data/get_data_chunks.py (6)

1-17: LGTM!

The imports are well-organized and follow the standard conventions. The docstring provides a clear overview of the module's purpose.


20-65: Excellent work on the bottomup_data_chunks function!

The function is well-documented with a clear docstring explaining its purpose, arguments, and return value. The code is structured logically and follows best practices for data processing.

A few observations:

  • The function applies normalization, color conversion (RGB or grayscale), and resizing using the apply_sizematcher function.
  • The user_instances_only parameter allows filtering labels to only user instances if desired.
  • The function returns a dictionary containing the processed image, instances, frame index, video index, original image size, and number of instances.

Overall, the implementation looks solid and ready to be used in the training pipeline.
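A minimal structural check of such a sample might look like the following sketch (key names are assumed for illustration; the authoritative names are defined in bottomup_data_chunks and its tests):

import torch

EXPECTED_KEYS = {
    "image",          # normalized / converted / resized image tensor
    "instances",      # keypoint coordinates for all instances
    "frame_idx",      # frame index within the video
    "video_idx",      # video index within the labels file
    "orig_size",      # original image size before resizing
    "num_instances",  # number of instances in the frame
}


def check_sample(sample: dict) -> None:
    """Assert that a generated data chunk has the documented structure."""
    missing = EXPECTED_KEYS - sample.keys()
    assert not missing, f"missing keys: {missing}"
    assert isinstance(sample["image"], torch.Tensor)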


68-138: Great implementation of the centered_instance_data_chunks function!

The function follows a similar structure to bottomup_data_chunks but with additional functionality for generating crops around instances based on centroids. Here are a few key points:

  • The function takes additional parameters crop_size and anchor_ind for specifying the crop size and anchor index for centroid calculation.
  • It uses the generate_centroids function to calculate centroids based on the anchor index.
  • The function yields results for each instance using a generator, allowing efficient processing of large datasets.
  • It includes the frame index, video index, number of instances, and original image size in the yielded dictionary.

The code is well-organized, and the docstring provides clear documentation of the function's purpose and arguments.
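A generic sketch of that per-instance generator pattern (illustrative names, not the actual centered_instance_data_chunks code):

from typing import Dict, Iterator

import torch


def per_instance_samples(sample: Dict) -> Iterator[Dict]:
    """Yield one sample per instance so crops can be generated lazily."""
    for inst, centroid in zip(sample["instances"], sample["centroids"]):
        yield {
            "image": sample["image"],
            "instance": inst.unsqueeze(0),      # (1, n_nodes, 2)
            "centroid": centroid.unsqueeze(0),  # (1, 2)
            "frame_idx": sample["frame_idx"],
            "video_idx": sample["video_idx"],
        }


# Example: a frame with two instances yields two samples.
frame = {
    "image": torch.rand(1, 1, 384, 384),
    "instances": torch.rand(2, 4, 2),
    "centroids": torch.rand(2, 2),
    "frame_idx": 0,
    "video_idx": 0,
}
assert len(list(per_instance_samples(frame))) == 2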


125-126:

The previous comment about avoiding reassigning sample["instances"] and centroids to single elements is still valid. Consider removing or adjusting the reassignment based on your intended logic.


141-195: Excellent work on the centroid_data_chunks function!

The function is similar to bottomup_data_chunks but includes the additional functionality of generating centroids based on the anchor index. The code is well-structured and follows best practices.

A few observations:

  • The function uses the generate_centroids function to calculate centroids based on the anchor index.
  • It includes the calculated centroids in the returned dictionary.
  • The docstring provides clear documentation of the function's purpose, arguments, and return value.

Overall, the implementation looks solid and ready to be integrated into the training pipeline.


198-243: Great job on the single_instance_data_chunks function!

The function is similar to bottomup_data_chunks but focuses on processing a single instance from the labeled frame. The code is well-organized and follows best practices.

A few observations:

  • The function sets max_instances to 1 when calling process_lf to ensure only a single instance is processed.
  • It applies normalization, color conversion (RGB or grayscale), and resizing using the apply_sizematcher function.
  • The docstring provides clear documentation of the function's purpose, arguments, and return value.

The implementation looks good and is ready to be used in scenarios where processing a single instance is required.



@gitttt-1234 linked an issue (6 tasks) on Sep 12, 2024 that may be closed by this pull request

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 99.16435% with 3 lines in your changes missing coverage. Please review.

Project coverage is 97.49%. Comparing base (f093ce2) to head (c545857).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
sleap_nn/training/model_trainer.py 94.44% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   96.64%   97.49%   +0.85%     
==========================================
  Files          23       37      +14     
  Lines        1818     3515    +1697     
==========================================
+ Hits         1757     3427    +1670     
- Misses         61       88      +27     


Contributor

coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (5)
tests/data/test_normalization.py (1)

37-42: LGTM! Consider parameterizing the image dimensions.

The test function test_apply_normalization is well-implemented and covers the essential aspects of normalization. It ensures that the output is correctly normalized and has the expected data type.

To further improve the test, consider parameterizing the image dimensions to test different image sizes. This can be done using pytest's parametrize decorator.

Here's an example of how you can parameterize the image dimensions:

import pytest

@pytest.mark.parametrize("height,width", [(200, 200), (100, 100), (300, 400)])
def test_apply_normalization(height, width):
    """Test `apply_normalization` function."""
    img = torch.randint(0, 255, (3, height, width))
    res = apply_normalization(img)
    assert torch.max(res) <= 1.0 and torch.min(res) == 0.0
    assert res.dtype == torch.float32
sleap_nn/data/normalization.py (1)

44-48: LGTM! Consider adding a type hint for the return value.

The function correctly normalizes the input image tensor by checking if it's a floating-point type and if not, converting it to float32 and scaling the pixel values to the range [0, 1].

To further improve the code quality and readability, consider adding a type hint for the return value.

Apply this diff to add a type hint for the return value:

-def apply_normalization(image: torch.Tensor):
+def apply_normalization(image: torch.Tensor) -> torch.Tensor:
     """Normalize image tensor."""
     if not torch.is_floating_point(image):
         image = image.to(torch.float32) / 255.0
     return image
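A quick usage example of the function above (uint8 input is scaled to [0, 1] float32; float inputs pass through unchanged):

import torch
from sleap_nn.data.normalization import apply_normalization

img = torch.randint(0, 255, (1, 3, 64, 64), dtype=torch.uint8)
out = apply_normalization(img)
assert out.dtype == torch.float32
assert out.min() >= 0.0 and out.max() <= 1.0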
tests/data/test_providers.py (1)

96-103: Great job adding a test for the process_lf function!

The test follows the pytest conventions and covers the essential functionality by validating the shapes of the output tensors. It also checks for the presence of NaN values in the expected instances, which is a good edge case to test.

To further improve the test, consider adding assertions to compare the content of the output tensors, not just their shapes. For example, you could compare the "image" tensor with the original image from the loaded labels.

import numpy as np

def test_process_lf(minimal_instance):
    labels = sio.load_slp(minimal_instance)
    lf = labels[0]
    ex = process_lf(lf, 0, 4)

    assert ex["image"].shape == torch.Size([1, 1, 384, 384])
    assert ex["instances"].shape == torch.Size([1, 4, 2, 2])
    assert torch.isnan(ex["instances"][:, 2:, :, :]).all()

    # Compare the content of the "image" tensor with the original image
    assert np.array_equal(ex["image"].squeeze().numpy(), lf.image)
sleap_nn/data/providers.py (1)

31-91: LGTM! Consider simplifying the nested if statements.

The process_lf function is well-structured and modularizes the processing of a labeled frame effectively. It handles the filtering of user instances, image format conversion, and instance stacking correctly. Appending NaN values to match the maximum number of instances is a good approach to ensure consistent tensor dimensions for batching. The output dictionary is well-structured and contains all the necessary information for downstream tasks.
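For readers unfamiliar with the padding step, a minimal sketch (shapes assumed, not the actual process_lf code) could look like:

import torch


def pad_instances(instances: torch.Tensor, max_instances: int) -> torch.Tensor:
    """Pad a (n_instances, n_nodes, 2) keypoint tensor with NaN rows up to max_instances."""
    n_instances, n_nodes, _ = instances.shape
    n_pad = max_instances - n_instances
    if n_pad <= 0:
        return instances[:max_instances]
    nans = torch.full((n_pad, n_nodes, 2), float("nan"))
    return torch.cat([instances, nans], dim=0)


padded = pad_instances(torch.zeros(2, 4, 2), max_instances=4)
assert padded.shape == (4, 4, 2)
assert torch.isnan(padded[2:]).all()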

To improve readability, consider simplifying the nested if statements at lines 52-53 into a single if statement as suggested by the static analysis tool:

-    if user_instances_only:
-        if lf.user_instances is not None and len(lf.user_instances) > 0:
+    if user_instances_only and lf.user_instances is not None and len(lf.user_instances) > 0:
            lf.instances = lf.user_instances
Tools
Ruff

52-53: Use a single if statement instead of nested if statements

(SIM102)

sleap_nn/data/edge_maps.py (1)

250-323: Excellent implementation of the generate_pafs function!

The function follows a clear and logical sequence of steps to generate part-affinity fields (PAFs) from input instances. It handles edge cases, such as filtering instances outside the image bounds and validating tensor shapes, which ensures robustness.

To address the static analysis hints and align with best practices, consider moving the attrs.field and attrs.converters.optional calls outside the argument defaults. You can either perform the calls within the function or read the defaults from module-level singleton variables.

Additionally, to enhance the function's usability and maintainability, consider providing more detailed documentation for the input parameters and return values. Include information about the expected tensor shapes, the purpose of each parameter, and the format of the returned PAFs tensor.
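A generic sketch of that remedy (illustrative names, not the actual generate_pafs signature): keep a module-level singleton or sentinel as the default and resolve it inside the function body.

from typing import Optional

import torch

DEFAULT_EDGE_INDS: Optional[torch.Tensor] = None  # module-level singleton default


def generate_pafs_sketch(
    instances: torch.Tensor,
    edge_inds: Optional[torch.Tensor] = DEFAULT_EDGE_INDS,
) -> torch.Tensor:
    """Resolve the default inside the body instead of calling attrs.field in the signature."""
    if edge_inds is None:
        edge_inds = torch.tensor([[0, 1]])
    # ... PAF computation would go here ...
    return edge_inds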

Tools
Ruff

255-257: Do not perform function call attrs.field in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


256-256: Do not perform function call attrs.converters.optional in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c3be90b and af6a7a4.

Files selected for processing (18)
  • sleap_nn/data/augmentation.py (1 hunks)
  • sleap_nn/data/confidence_maps.py (1 hunks)
  • sleap_nn/data/edge_maps.py (1 hunks)
  • sleap_nn/data/instance_centroids.py (3 hunks)
  • sleap_nn/data/instance_cropping.py (2 hunks)
  • sleap_nn/data/normalization.py (1 hunks)
  • sleap_nn/data/providers.py (2 hunks)
  • sleap_nn/data/resizing.py (4 hunks)
  • sleap_nn/inference/predictors.py (3 hunks)
  • sleap_nn/inference/topdown.py (3 hunks)
  • tests/data/test_augmentation.py (2 hunks)
  • tests/data/test_confmaps.py (2 hunks)
  • tests/data/test_edge_maps.py (2 hunks)
  • tests/data/test_instance_centroids.py (3 hunks)
  • tests/data/test_instance_cropping.py (2 hunks)
  • tests/data/test_normalization.py (2 hunks)
  • tests/data/test_providers.py (2 hunks)
  • tests/data/test_resizing.py (2 hunks)
Files skipped from review due to trivial changes (3)
  • sleap_nn/data/instance_centroids.py
  • sleap_nn/inference/predictors.py
  • sleap_nn/inference/topdown.py
Additional context used
Ruff
tests/data/test_instance_cropping.py

12-12: sleap_nn.data.resizing.PadToStride imported but unused

Remove unused import: sleap_nn.data.resizing.PadToStride

(F401)

sleap_nn/data/providers.py

52-53: Use a single if statement instead of nested if statements

(SIM102)

sleap_nn/data/edge_maps.py

255-257: Do not perform function call attrs.field in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


256-256: Do not perform function call attrs.converters.optional in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

sleap_nn/data/augmentation.py

210-210: f-string without any placeholders

Remove extraneous f prefix

(F541)

Additional comments not posted (19)
tests/data/test_instance_centroids.py (2)

10-35: LGTM!

The new test function test_generate_centroids is well-implemented and covers the important scenarios:

  • It tests the generate_centroids function with a minimal instance and a partial instance containing NaN values.
  • The assertions correctly validate the generated centroids against the expected tensor outputs.
  • It uses the minimal_instance fixture to load the test data.

Great job on adding this comprehensive test case!


Line range hint 37-72: LGTM!

The existing test function test_instance_centroids has been correctly updated to use the new generate_centroids function:

  • The function name find_centroids has been replaced with generate_centroids in multiple test cases.
  • The test cases cover the important scenarios: undefined anchor index, defined anchor index, and defined anchor index with missing data.
  • The assertions correctly validate the generated centroids against the expected tensor outputs.
  • It uses the minimal_instance fixture to load the test data.

The updates maintain the comprehensive test coverage for the centroid generation functionality.

tests/data/test_instance_cropping.py (1)

89-103: LGTM!

The test_generate_crops function is a well-structured test that covers the essential aspects of the generate_crops function. It processes a minimal instance, generates centroids, and calls the generate_crops function with the required arguments. The test asserts the expected shapes of the output, ensuring the correctness of the generate_crops function.

tests/data/test_augmentation.py (2)

44-65: LGTM!

The test function test_apply_intensity_augmentation is well-implemented and covers the essential aspects of the apply_intensity_augmentation method. The test uses a real-world example by loading labels from a file, which is a good practice. The assertions are comprehensive and cover the important properties of the output, such as the data type, shape, and that the augmentation has been applied.


67-97: LGTM!

The test function test_apply_geometric_augmentation is well-implemented and covers the essential aspects of the apply_geometric_augmentation method. The test uses a real-world example by loading labels from a file, which is a good practice. The assertions are comprehensive and cover the important properties of the output, such as the data type, shape, and that the augmentation has been applied. The test also includes a negative test case to ensure that the method raises a ValueError when invalid parameters are provided, which is a good practice.

tests/data/test_resizing.py (3)

85-93: LGTM!

The test function test_apply_resizer is well-implemented and covers the essential functionality of the apply_resizer function. It correctly validates that the image is resized to the expected dimensions and that the instances are scaled accordingly.


96-106: LGTM!

The test function test_apply_pad_to_stride is well-implemented and covers the essential functionality of the apply_pad_to_stride function. It correctly validates that the image is padded to the expected dimensions based on the provided max_stride value.


109-131: LGTM!

The test function test_apply_sizematcher is well-implemented and covers the essential functionality of the apply_sizematcher function. It correctly validates that the image is size-matched to the expected dimensions based on the provided max_height and max_width values. Additionally, it tests the error handling by asserting that exceptions are raised when the provided dimensions are smaller than the current image dimensions.

tests/data/test_confmaps.py (2)

128-137: LGTM!

The test correctly validates the functionality of generate_confmaps. It loads a label file, processes it, calls generate_confmaps with the correct arguments, and checks the expected shape of the output confidence maps.


140-157: LGTM!

The test correctly validates the functionality of generate_multiconfmaps for both centroid and non-centroid cases. It loads a label file, processes it, calls generate_multiconfmaps with the correct arguments for each case, and checks the expected shape of the output confidence maps. The test covers the basic functionality well.

tests/data/test_edge_maps.py (1)

178-189: LGTM!

The test_generate_pafs function is a well-structured test that covers the key functionality of the generate_pafs function. It follows the Arrange-Act-Assert pattern and uses real data to ensure that the function produces the expected output shape.

Great job on adding this comprehensive test! It will help catch potential issues and ensure the correctness of the PAF generation logic.

sleap_nn/data/instance_cropping.py (1)

109-154: LGTM! The generate_crops function is well-implemented and enhances the functionality of the module.

The function follows a clear logical flow and is well-documented with a comprehensive docstring. It correctly handles the input tensors and performs the necessary computations to generate the cropped image and adjust the keypoints and centroid. The use of appropriate variable names and comments makes the code easy to understand.

Returning a dictionary with all the relevant information is a good design choice, as it makes it convenient to use the results in subsequent processing steps.

Overall, the generate_crops function is a valuable addition to the sleap_nn/data/instance_cropping.py module, enabling the cropping of instances based on their centroids.
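A simplified sketch of centroid-based cropping with keypoint adjustment (illustrative only; the real generate_crops also handles padding, scaling, and border cases):

import torch


def crop_around_centroid(image: torch.Tensor, instance: torch.Tensor,
                         centroid: torch.Tensor, crop_size: int):
    """Crop a (C, H, W) image around a centroid and shift keypoints into crop coordinates."""
    half = crop_size // 2
    x0 = max(int(centroid[0]) - half, 0)
    y0 = max(int(centroid[1]) - half, 0)
    crop = image[:, y0:y0 + crop_size, x0:x0 + crop_size]
    shifted = instance - torch.tensor([x0, y0], dtype=instance.dtype)
    return crop, shifted


img = torch.rand(1, 384, 384)
inst = torch.tensor([[100.0, 120.0], [110.0, 130.0]])
crop, keypoints = crop_around_centroid(img, inst, inst.mean(dim=0), crop_size=160)
assert crop.shape == (1, 160, 160)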

sleap_nn/data/resizing.py (3)

Line range hint 36-67: LGTM!

The function renaming and the added conditional check are great improvements. The new name apply_pad_to_stride is more descriptive and the conditional check enhances efficiency by avoiding unnecessary padding operations.


88-103: Looks good!

The new apply_resizer function is a great addition. It provides a useful functionality for flexible image manipulation and correctly handles the scaling of both the image and the keypoints. The function also has clear and concise documentation.


106-134: Nice work!

The new apply_sizematcher function is a valuable addition. It provides a useful functionality for ensuring that images meet certain size requirements and includes robust error handling. The function also has clear and concise documentation.

sleap_nn/data/confidence_maps.py (2)

11-46: LGTM!

The generate_confmaps function is well-implemented and properly documented. It handles different input shapes correctly, uses helper functions effectively, and follows a clear logical flow. The code changes look good to me.


49-94: LGTM!

The generate_multiconfmaps function is well-implemented and properly documented. It correctly handles generating confidence maps for either keypoints or centroids based on the is_centroids flag. The function uses helper functions effectively and follows a clear logical flow. The code changes look good to me.

sleap_nn/data/augmentation.py (2)

13-106: Excellent implementation of intensity-based augmentations!

The apply_intensity_augmentation function is well-structured and follows best practices. It leverages the Kornia library's augmentation classes to apply various intensity-based transformations to the image and instances. The function handles the reshaping of the instances tensor correctly to ensure compatibility with the augmentations.

The use of probabilities and parameters for each augmentation type provides flexibility in controlling the augmentations. The docstring is comprehensive and clearly explains the purpose, arguments, and return values of the function.

Overall, this function is a great addition to the augmentation capabilities of the library.


109-231: Excellent implementation of geometric augmentations!

The apply_geometric_augmentation function is well-structured and follows best practices, similar to the apply_intensity_augmentation function. It leverages the Kornia library's augmentation classes to apply various geometric transformations to the image and instances. The function handles the reshaping of the instances tensor correctly to ensure compatibility with the augmentations.

The use of probabilities and parameters for each augmentation type provides flexibility in controlling the augmentations. The docstring is comprehensive and clearly explains the purpose, arguments, and return values of the function.

The function also includes appropriate error handling for the random cropping parameters, raising a ValueError if the random crop height and width are not greater than 0 when random cropping is enabled.

Overall, this function is a great addition to the augmentation capabilities of the library.

Tools
Ruff

210-210: f-string without any placeholders

Remove extraneous f prefix

(F541)

Resolved review threads: tests/data/test_instance_cropping.py (outdated), sleap_nn/data/augmentation.py
… all model types (#91)

* Add function to get data chunks

* LitData Refactor PR3: Add custom StreamingDataset (#92)

* Add custom streamingdatasets

* LitData Refactor PR4: Integrate LitData with ModelTrainer class (#94)

* Add flag for augmentation

* Modify exception

* Fix tests

* Add litdata to trainer

* Modify test

* Add tests for data loaderS

* Fix tests

* Remove files in trainer

* Remove val chunks dir

* Remove shutil.rmtree

* Remove shutil.rmtree

* Skip ubuntu test

* fix skip ubuntu test

* Fix changes

* Save training config before fit
Contributor

coderabbitai bot left a comment


Actionable comments posted: 29

🧹 Outside diff range and nitpick comments (22)
environment_cpu.yml (1)

36-36: Add a newline at the end of the file.

To improve file formatting and adhere to best practices, please add a newline character at the end of the file.

You can fix this by adding an empty line at the end of the file:

  - pip:
    - "--editable=.[dev]"
    - litdata
+
🧰 Tools
🪛 yamllint

[error] 36-36: no new line character at the end of file

(new-line-at-end-of-file)

environment.yml (2)

37-37: Approve addition of 'litdata' with a suggestion.

The addition of 'litdata' as a dependency aligns with the PR objectives for refactoring data processing modules. However, to ensure reproducibility, it's recommended to specify the version of 'litdata'.

Consider updating the line to include a version specification:

-    - litdata
+    - litdata==X.Y.Z  # Replace X.Y.Z with the desired version
🧰 Tools
🪛 yamllint

[error] 37-37: no new line character at the end of file

(new-line-at-end-of-file)


37-37: Add a new line at the end of the file.

To adhere to best practices and improve compatibility with version control systems, please add a new line at the end of the file.

Add an empty line after the last line of the file.

🧰 Tools
🪛 yamllint

[error] 37-37: no new line character at the end of file

(new-line-at-end-of-file)

tests/data/test_providers.py (1)

96-104: LGTM: Test function covers essential aspects, with room for improvement.

The test_process_lf function effectively tests the basic functionality of the process_lf function. It checks the shapes of the output, the presence of NaN values for partial annotations, and the data type of the image. These are all important aspects to verify.

However, consider the following suggestions to enhance the test coverage:

  1. Add a test case to verify the actual content of the processed image, not just its shape and data type.
  2. Include tests for edge cases, such as empty labels or labels with all NaN values.
  3. Test with different input scenarios, like varying numbers of instances or different image sizes.

Would you like assistance in implementing these additional test cases?

tests/data/test_instance_cropping.py (2)

89-104: LGTM: Well-structured test for generate_crops functionality.

The new test function test_generate_crops is well-implemented and aligns with the PR objectives. It effectively tests the refactored individual functions for data processing. The assertions cover the essential aspects of the cropped examples.

Consider adding an assertion to verify the content of the cropped instance, not just its shape. This could help catch any potential issues with the cropping logic itself. For example:

assert torch.allclose(cropped_ex["instance"], expected_instance_tensor, atol=1e-6)

Where expected_instance_tensor is the expected result after cropping.


Test Coverage Gaps Identified

Several functions in the data processing modules lack corresponding test functions:

  • sleap_nn/data/utils.py:def ensure_list(x: Any) -> List[Any]:
  • sleap_nn/data/utils.py:def make_grid_vectors(
  • sleap_nn/data/utils.py:def expand_to_rank(
  • sleap_nn/data/utils.py:def gaussian_pdf(x: torch.Tensor, sigma: float) -> torch.Tensor:
  • sleap_nn/data/resizing.py:def find_padding_for_stride(
  • sleap_nn/data/resizing.py:def apply_pad_to_stride(image: torch.Tensor, max_stride: int) -> torch.Tensor:
  • sleap_nn/data/resizing.py:def resize_image(image: torch.Tensor, scale: float):
  • sleap_nn/data/resizing.py:def apply_resizer(image: torch.Tensor, instances: torch.Tensor, scale: float = 1.0):
  • sleap_nn/data/providers.py:def get_max_instances(labels: sio.Labels):
  • sleap_nn/data/providers.py:def process_lf(

...and others as identified by the script.

It is recommended to add test cases for these functions to ensure comprehensive test coverage and maintain the reliability of the data processing pipeline.

🔗 Analysis chain

Line range hint 1-104: Overall, the changes align well with the PR objectives.

The additions to this test file effectively support the refactoring of operations into individual functions. The new test function test_generate_crops provides coverage for the newly refactored functionality, which is crucial for ensuring the reliability of the data processing pipeline.

These changes contribute to the overall goal of enhancing the code's structure and maintainability, setting a good foundation for future improvements to the data processing modules.

To ensure that all refactored functions are properly tested, let's verify if there are any other newly added functions that might need test coverage:

This will help identify any gaps in test coverage for the refactored functions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for new functions in the data processing modules that might need tests

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
rg --type python '^\s*def\s+\w+\s*\(' src/sleap_nn/data/

# Search for corresponding test functions
echo "\nCorresponding test functions:"
rg --type python '^\s*def\s+test_\w+\s*\(' tests/data/

Length of output: 343


Script:

#!/bin/bash
# Description: Check for new functions in the data processing modules and their corresponding test functions using file extensions.

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
fd --extension py --search-path src/sleap_nn/data/ --type file | xargs rg '^\s*def\s+\w+\s*\('

# Search for corresponding test functions
echo "\nCorresponding test functions:"
fd --extension py --search-path tests/data/ --type file | xargs rg '^\s*def\s+test_\w+\s*\('

Length of output: 36245


Script:

#!/bin/bash
# Description: Check for new functions in the data processing modules and their corresponding test functions using corrected paths.

# Search for new function definitions in the data processing modules
echo "New functions in data processing modules:"
fd --extension py --search-path sleap_nn/data/ --type file | xargs rg '^\s*def\s+\w+\s*\('

# Search for corresponding test functions
echo "\nCorresponding test functions:"
fd --extension py --search-path tests/data/ --type file | xargs rg '^\s*def\s+test_\w+\s*\('

Length of output: 10173

tests/data/test_get_data_chunks.py (5)

11-62: LGTM: Comprehensive test for bottomup_data_chunks.

The test function thoroughly checks the bottomup_data_chunks method for both grayscale and RGB cases. It verifies the presence of expected keys and the shapes of output arrays.

Consider adding a test case with max_instances set to a value less than the actual number of instances to ensure the function handles this scenario correctly.


65-125: LGTM: Thorough test for centered_instance_data_chunks.

The test function comprehensively checks the centered_instance_data_chunks method for both grayscale and RGB cases. It verifies the presence of expected keys and the shapes of output arrays.

Consider adding assertions to verify the values of instance_bbox and centroid to ensure they are calculated correctly.


128-187: LGTM: Comprehensive test for centroid_data_chunks.

The test function thoroughly checks the centroid_data_chunks method for both grayscale and RGB cases. It verifies the presence of expected keys and the shapes of output arrays, including the 'centroids' key.

Consider adding a test case with different anchor_ind values to ensure the function handles various anchor indices correctly.


190-239: LGTM: Well-structured test for single_instance_data_chunks.

The test function effectively checks the single_instance_data_chunks method for both grayscale and RGB cases. It appropriately modifies the input data to contain only a single instance and verifies the presence of expected keys and the shapes of output arrays.

Consider adding a test case where the input data already contains only one instance to ensure the function handles this scenario correctly without modification.


1-239: Overall: Comprehensive test suite with room for enhancement.

This test file provides a solid foundation for testing the data chunking functions. It consistently covers both grayscale and RGB cases for each function, ensuring basic functionality and output structure.

To further improve the test suite:

  1. Consider adding edge cases, such as empty input data or maximum number of instances.
  2. Implement parametrized tests to cover a wider range of input scenarios efficiently.
  3. Add tests for error handling, ensuring functions raise appropriate exceptions for invalid inputs.
  4. Consider testing with different image sizes to ensure proper scaling and cropping behavior.
docs/config.md (5)

18-18: LGTM! Consider adjusting indentation for consistency.

The addition of the user_instances_only parameter is valuable, providing users with control over the types of instances used in training. The description is clear and informative.

To maintain consistency with the document's formatting, consider adjusting the indentation to match other list items at this level (2 spaces instead of 4).

🧰 Tools
🪛 Markdownlint

18-18: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


19-19: Enhance description and correct default value format.

The addition of the chunk_size parameter is useful for controlling data processing. However, the description could be more informative, and the default value format needs correction.

  1. Consider expanding the description to explain the purpose and impact of this parameter. For example: "Size of each data chunk (in MB). Larger chunks may process faster but require more memory."
  2. Remove the quotes around the default value to maintain consistency with the integer type: *Default*: 100.
  3. Adjust the indentation to 2 spaces for consistency with other list items at this level.
🧰 Tools
🪛 Markdownlint

19-19: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


62-64: Enhance descriptions for random crop parameters and adjust formatting.

The addition of random crop parameters is a valuable enhancement to the data augmentation capabilities. However, the descriptions could be more informative, and the formatting needs adjustment.

  1. Expand the descriptions to provide more context:
    • random_crop_p: (float) Probability of applying random crop augmentation. Range: 0.0 to 1.0.
    • random_crop_height: (int) Desired height of the random crop in pixels.
    • random_crop_width: (int) Desired width of the random crop in pixels.
  2. Adjust the indentation to 6 spaces to align with other items in the preprocessing subsection.
  3. Consider adding a brief explanation of how these parameters work together, e.g., "When applied, a random crop of size (random_crop_height, random_crop_width) will be taken from the input image."
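For instance, the three parameters map onto a Kornia random-crop op roughly as follows (illustrative wiring; the actual sleap_nn augmentation code may differ):

import torch
from kornia.augmentation import RandomCrop

random_crop_p, random_crop_height, random_crop_width = 1.0, 160, 160
aug = RandomCrop(size=(random_crop_height, random_crop_width), p=random_crop_p)
image = torch.rand(1, 1, 384, 384)  # (batch, channels, height, width)
cropped = aug(image)
assert cropped.shape[-2:] == (random_crop_height, random_crop_width)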
🧰 Tools
🪛 Markdownlint

62-62: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)


63-63: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)


64-64: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)


Line range hint 201-209: LGTM! Consider adding links to PyTorch documentation.

The updates to the trainer_config section, particularly for data loaders and model checkpointing, significantly improve the clarity and usefulness of the documentation. The added flexibility and detailed explanations will be valuable for users.

To further enhance the documentation, consider adding direct links to the relevant PyTorch and Lightning documentation:

  1. For DataLoader: [Torch's DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
  2. For ModelCheckpoint: [Lightning's ModelCheckpoint](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html)

This will allow users to easily access the full documentation for these components if needed.

🧰 Tools
🪛 Markdownlint

15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


17-17: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


18-18: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


19-19: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


22-22: Expected: 4; Actual: 8
Unordered list indentation

(MD007, ul-indent)


Line range hint 1-230: Overall, excellent improvements to the configuration documentation.

The changes to this file significantly enhance the clarity and flexibility of the sleap_nn.ModelTrainer configuration. The new parameters for data processing, such as user_instances_only and chunk_size, align well with the PR objectives of refactoring the data pipeline. The expanded documentation for data loaders and model checkpointing provides users with valuable information for fine-tuning their training process.

To further improve the documentation:

  1. Consider adding a brief introduction at the beginning of the file explaining the purpose and importance of this configuration file in the context of the SLEAP-NN project.
  2. If possible, include examples of common configuration scenarios to help users get started quickly.
  3. Implement a consistent indentation scheme throughout the document (e.g., 2 spaces for all list items at the same level) to improve readability.
  4. Consider adding cross-references between related parameters in different sections to help users understand the relationships between various configuration options.
🧰 Tools
🪛 Markdownlint

15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


17-17: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


18-18: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


19-19: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


22-22: Expected: 4; Actual: 8
Unordered list indentation

(MD007, ul-indent)

sleap_nn/data/providers.py (1)

31-94: Consider adding unit tests for the new process_lf function.

Adding unit tests will help ensure that process_lf works correctly under different scenarios and will facilitate future maintenance.

Would you like assistance in generating unit tests for this function?

🧰 Tools
🪛 Ruff

52-53: Use a single if statement instead of nested if statements

(SIM102)

sleap_nn/data/streaming_datasets.py (4)

121-121: Fix typo in class docstring

There's a typo in the docstring of the CenteredInstanceStreamingDataset class. "Cenetered" should be corrected to "Centered".

Apply this diff to fix the typo:

-    """StreamingDataset for CeneteredInstance pipeline.
+    """StreamingDataset for CenteredInstance pipeline.

23-24: Clarify the use of .bin files in the docstring

The mention of .bin files might be confusing if they are not a standard part of the data pipeline or if users are unfamiliar with their purpose.

Consider elaborating or generalizing the reference to data storage:

-        data sample stored in `.bin` files.
+        data sample stored in the dataset.

145-145: Annotate crop_hw parameter type for clarity

The type annotation for crop_hw is Tuple[int], which implies a tuple of variable length containing integers. If crop_hw should specifically be a tuple of two integers (height and width), it would be clearer to specify Tuple[int, int].

Apply this diff to specify the expected tuple size:

-            crop_hw: Tuple[int],
+            crop_hw: Tuple[int, int],

223-226: Consistent docstring formatting and clarity

To maintain consistency across all class docstrings, ensure that descriptions and parameter explanations are uniform and clear. This improves readability and aids users in understanding each dataset's purpose.

Consider reviewing and aligning the docstrings of all classes.
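For context, the subclassing pattern these classes share might look roughly like this sketch (assuming litdata's StreamingDataset can be subclassed with a __getitem__ override; the class name and post-processing here are illustrative):

from typing import Tuple

from litdata import StreamingDataset


class CenteredInstanceStreamingSketch(StreamingDataset):
    """Sketch: fetch a chunked sample, then post-process it for the pipeline."""

    def __init__(self, *args, crop_hw: Tuple[int, int] = (160, 160), **kwargs):
        super().__init__(*args, **kwargs)
        self.crop_hw = crop_hw  # (height, width) of the instance crop

    def __getitem__(self, index):
        # Retrieve the serialized sample written by the data-chunk functions,
        # then apply augmentation, cropping, and confidence-map generation.
        sample = super().__getitem__(index)
        return sample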

sleap_nn/training/model_trainer.py (1)

476-487: Address the TODO and uncomment code for cleaning up training files.

The code for deleting training and validation files is commented out with a TODO note indicating test failures. Leaving this code commented may result in unused temporary files consuming disk space.

Would you like assistance in diagnosing the test failures related to this code segment? I can help troubleshoot and provide a solution to safely delete the temporary files without affecting the tests.
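One possible cleanup pattern, once the test failures are resolved (directory names are illustrative): remove the chunk directories after training and tolerate missing paths so repeated runs do not fail.

import shutil
from pathlib import Path


def cleanup_chunk_dirs(*dirs: str) -> None:
    """Delete temporary chunk directories, ignoring ones that are already gone."""
    for d in dirs:
        shutil.rmtree(Path(d), ignore_errors=True)


cleanup_chunk_dirs("train_chunks", "val_chunks")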

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between af6a7a4 and ad8b4d7.

📒 Files selected for processing (17)
  • docs/config.md (2 hunks)
  • docs/config_bottomup.yaml (1 hunks)
  • docs/config_centroid.yaml (2 hunks)
  • docs/config_topdown_centered_instance.yaml (1 hunks)
  • environment.yml (1 hunks)
  • environment_cpu.yml (1 hunks)
  • environment_mac.yml (1 hunks)
  • sleap_nn/data/get_data_chunks.py (1 hunks)
  • sleap_nn/data/providers.py (2 hunks)
  • sleap_nn/data/streaming_datasets.py (1 hunks)
  • sleap_nn/training/model_trainer.py (9 hunks)
  • tests/data/test_get_data_chunks.py (1 hunks)
  • tests/data/test_instance_cropping.py (2 hunks)
  • tests/data/test_providers.py (2 hunks)
  • tests/data/test_streaming_datasets.py (1 hunks)
  • tests/fixtures/datasets.py (1 hunks)
  • tests/training/test_model_trainer.py (14 hunks)
🧰 Additional context used
🪛 Markdownlint
docs/config.md

18-18: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


19-19: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


62-62: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)


63-63: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)


64-64: Expected: 6; Actual: 12
Unordered list indentation

(MD007, ul-indent)

🪛 yamllint
environment.yml

[warning] 36-36: wrong indentation: expected 6 but found 4

(indentation)


[error] 37-37: no new line character at the end of file

(new-line-at-end-of-file)

environment_cpu.yml

[warning] 35-35: wrong indentation: expected 6 but found 4

(indentation)


[error] 36-36: no new line character at the end of file

(new-line-at-end-of-file)

environment_mac.yml

[warning] 34-34: wrong indentation: expected 6 but found 4

(indentation)


[error] 35-35: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff
sleap_nn/data/providers.py

52-53: Use a single if statement instead of nested if statements

(SIM102)

sleap_nn/data/streaming_datasets.py

5-5: typing.List imported but unused

Remove unused import

(F401)


5-5: typing.Optional imported but unused

Remove unused import

(F401)

sleap_nn/training/model_trainer.py

9-9: shutil imported but unused

Remove unused import: shutil

(F401)


12-12: torch.utils.data.DataLoader imported but unused

Remove unused import: torch.utils.data.DataLoader

(F401)


41-41: Redefinition of unused sio from line 11

Remove definition: sio

(F811)


43-43: sleap_nn.data.cycler.CyclerIterDataPipe imported but unused

Remove unused import: sleap_nn.data.cycler.CyclerIterDataPipe

(F401)


110-112: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

tests/data/test_instance_cropping.py

12-12: sleap_nn.data.resizing.PadToStride imported but unused

Remove unused import: sleap_nn.data.resizing.PadToStride

(F401)

🔇 Additional comments (30)
environment_mac.yml (1)

35-35: LGTM: Addition of 'litdata' dependency

The addition of 'litdata' as a pip dependency is appropriate and aligns with the PR objectives to refactor data processing modules.

🧰 Tools
🪛 yamllint

[error] 35-35: no new line character at the end of file

(new-line-at-end-of-file)

environment_cpu.yml (1)

35-36: LGTM! Indentation fixed and new package added.

The changes look good:

  1. The indentation for the "--editable=.[dev]" package has been corrected to 6 spaces, aligning with the YAML structure.
  2. The new "litdata" package has been added, which aligns with the PR objectives for data pipeline refactoring.
🧰 Tools
🪛 yamllint

[warning] 35-35: wrong indentation: expected 6 but found 4

(indentation)


[error] 36-36: no new line character at the end of file

(new-line-at-end-of-file)

docs/config_bottomup.yaml (1)

5-5: Approve new configuration option with suggestions.

The addition of user_instances_only: True aligns with the PR objectives of refactoring data processing modules. This option will likely affect how data is processed in the __iter__() method.

Suggestions:

  1. Consider adding a comment to explain the purpose and impact of this option for better maintainability.
  2. Ensure that this change is properly handled in the corresponding Python code that processes this configuration.

To verify the impact of this change, let's check for its usage in the codebase:

This will help ensure that the new option is properly handled in the code that processes this configuration.

✅ Verification successful

Verified addition of user_instances_only configuration option.

The user_instances_only: True setting is properly integrated and utilized across the codebase, aligning with the PR objectives. Consider adding comments to docs/config_bottomup.yaml to explain the purpose and impact of this option for better maintainability.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for usage of 'user_instances_only' in Python files

# Search for 'user_instances_only' in Python files
echo "Searching for 'user_instances_only' usage:"
rg --type python 'user_instances_only'

# Search for data configuration parsing
echo "\nSearching for data configuration parsing:"
rg --type python 'data_config'

Length of output: 329


Script:

#!/bin/bash
# Description: Retry searching for 'user_instances_only' in Python files with corrected file type

# Option 1: Use correct file type identifier 'py' with ripgrep
echo "Searching for 'user_instances_only' usage (using type 'py'):"
rg --type py 'user_instances_only'

# Option 2: Search using file extensions without specifying file type
echo "\nSearching for 'user_instances_only' usage (using file extensions):"
rg --type-add 'py:*.py' --type py 'user_instances_only'

# Option 1: Use correct file type identifier 'py' with ripgrep
echo "Searching for 'user_instances_only' usage (using type 'py'):"
rg --type py 'user_instances_only'

# Option 2: Search using file extensions without specifying file type
echo "\nSearching for 'user_instances_only' usage (using file extensions):"
rg --glob '*.py' 'user_instances_only'

# Additionally, search for 'data_config' in Python files
echo "\nSearching for data configuration parsing (using type 'py'):"
rg --type py 'data_config'

echo "\nSearching for data configuration parsing (using file extensions):"
rg --glob '*.py' 'data_config'

Length of output: 61488

docs/config_topdown_centered_instance.yaml (2)

Line range hint 1-124: Consider adding documentation and clarifying commented sections.

While the addition of user_instances_only is a good improvement, consider the following suggestions:

  1. Add a comment explaining the purpose and impact of the user_instances_only parameter for better maintainability.
  2. The commented out sections for alternative backbone configurations (ConvNeXt and Swin Transformer) suggest ongoing work. Consider adding a TODO comment explaining the status of these alternatives or remove them if they're no longer relevant.

Here's a suggested addition for documentation:

data_config:
  provider: LabelsReader
  train_labels_path: minimal_instance.pkg.slp
  val_labels_path: minimal_instance.pkg.slp
  # Only use instances defined by users, ignoring auto-generated ones
  user_instances_only: True

To check the relevance of the commented sections, run:

#!/bin/bash
# Description: Check for usage of alternative backbones

echo "Searching for ConvNeXt usage:"
rg --type python "ConvNeXt" -A 5

echo "Searching for Swin Transformer usage:"
rg --type python "Swin_T|swint" -A 5
🧰 Tools
🪛 yamllint

[error] 7-7: trailing spaces

(trailing-spaces)


[error] 8-8: trailing spaces

(trailing-spaces)


5-5: LGTM! Verify impact on data processing pipeline.

The addition of user_instances_only: True aligns with the PR objectives of refactoring data processing modules. This parameter will likely change how instances are handled during data processing.

To ensure this change is properly integrated, please verify:

  1. The LabelsReader provider supports this new parameter.
  2. The impact on data augmentation and preprocessing steps.
  3. Any necessary updates to the model or trainer configurations.

Run the following script to check for related configurations:

✅ Verification successful

Verified: Integration of user_instances_only Parameter

The addition of user_instances_only: True has been successfully integrated across relevant modules. The LabelsReader and model_trainer.py handle the new parameter appropriately, ensuring consistent behavior in data processing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for related configurations and usage of user instances

# Test 1: Search for LabelsReader implementation
echo "Searching for LabelsReader implementation:"
rg --type python "class LabelsReader" -A 10

# Test 2: Search for user instances handling in data processing
echo "Searching for user instances handling:"
rg --type python "user_instances_only|user_defined_instances" -A 5

# Test 3: Check for any TODOs or FIXMEs related to instance handling
echo "Checking for TODOs or FIXMEs related to instance handling:"
rg --type python "TODO|FIXME" -A 3 | rg "instance"

Length of output: 382


Script:

#!/bin/bash
# Description: Check for related configurations and usage of user instances

# Test 1: Search for LabelsReader implementation
echo "Searching for LabelsReader implementation:"
rg --type py "class LabelsReader" -A 10

# Test 2: Search for user instances handling in data processing
echo "Searching for user instances handling:"
rg --type py "user_instances_only|user_defined_instances" -A 5

# Test 3: Check for any TODOs or FIXMEs related to instance handling
echo "Checking for TODOs or FIXMEs related to instance handling:"
rg --type py "TODO|FIXME" -A 3 | rg "instance"

Length of output: 12466
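
To make the verified behavior concrete, here is a minimal sketch of the filtering that user_instances_only implies, written directly against sleap_io; the function name and structure are hypothetical and not the actual LabelsReader/process_lf code:

import sleap_io as sio


def keep_user_instances(labels_path: str, user_instances_only: bool = True):
    """Yield labeled frames, optionally dropping predicted instances.

    Sketch only: the real provider logic in sleap_nn may differ.
    """
    labels = sio.load_slp(labels_path)
    for lf in labels.labeled_frames:
        if user_instances_only:
            # Keep instances annotated by a user; drop model predictions.
            lf.instances = [
                inst
                for inst in lf.instances
                if not isinstance(inst, sio.PredictedInstance)
            ]
        if len(lf.instances) > 0:
            yield lf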

tests/data/test_providers.py (1)

3-3: LGTM: Import statement is correctly placed and relevant.

The addition of process_lf to the import statement is appropriate and aligns with the new test function introduced in this file.

docs/config_centroid.yaml (4)

Line range hint 44-44: Acknowledge indentation correction

The indentation of pre_trained_weights under model_config has been corrected to align with other keys in this section. This improves the structure and readability of the configuration file.

This change enhances the consistency of the YAML structure. Good catch!

🧰 Tools
🪛 yamllint: trailing spaces on lines 7-8 (trailing-spaces)


Line range hint 1-160: Summary of configuration changes

This PR introduces several changes to the configuration file:

  1. Added user_instances_only: True in the data_config section.
  2. Restructured the random crop configuration in the augmentation_config section.
  3. Corrected the indentation of pre_trained_weights in the model_config section.

These changes could have significant implications for the data processing pipeline and model training.

To ensure these changes align with the PR objectives and overall project goals:

  1. Confirm that these changes are intentional and necessary for the "LitData Refactor PR1" objectives.
  2. Verify that corresponding changes have been made in the codebase to handle these configuration updates.
  3. Consider updating the project documentation to reflect these configuration changes and their impact on the data pipeline and model training process.
#!/bin/bash
# Search for files that might need updates based on these configuration changes
rg --type py 'user_instances_only|random_crop|crop_height|crop_width|pre_trained_weights'
🧰 Tools
🪛 yamllint: trailing spaces on lines 7-8 (trailing-spaces)


38-40: Review changes to random crop configuration

The structure of the random crop configuration has been changed:

  1. The random_crop subsection has been removed.
  2. random_crop_p, crop_height, and crop_width are now at the same level as other augmentation parameters.

Additionally, random_crop_p is set to 0, which effectively disables random cropping.

Please confirm:

  1. Is this structural change intentional? How does it affect the augmentation pipeline?
  2. Is disabling random cropping (by setting random_crop_p: 0) intended?
  3. Are the crop_height and crop_width values of 160 appropriate for your use case?
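
To make the structural change concrete, the flattened keys map naturally onto a Kornia-style random crop; the sketch below shows how they might be consumed, assuming a Kornia-based augmenter (which is not necessarily how sleap_nn wires it up):

import torch
import kornia.augmentation as K

# Flattened augmentation keys from the config (illustrative values).
random_crop_p = 0.0  # probability of applying the crop; 0 disables it
crop_height = 160
crop_width = 160

crop_aug = None
if random_crop_p > 0:
    # Crop a (crop_height, crop_width) window from a fraction
    # `random_crop_p` of the incoming samples.
    crop_aug = K.RandomCrop((crop_height, crop_width), p=random_crop_p)

image = torch.rand(1, 1, 384, 384)  # (batch, channels, height, width)
image = crop_aug(image) if crop_aug is not None else image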

5-5: Verify the intention and impact of user_instances_only

The addition of user_instances_only: True in the data_config section could significantly affect the data pipeline. This parameter suggests that only user-defined instances will be used in the data processing.

Please confirm:

  1. Is this change intentional and aligned with the PR objectives?
  2. What is the expected impact on the data pipeline and model training?
  3. Are there any corresponding changes in the data processing code to handle this new parameter?
✅ Verification successful

Addition of user_instances_only: True Verified

The user_instances_only parameter is consistently used across relevant modules to filter user-defined instances. Its addition in docs/config_centroid.yaml aligns with its implementation in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of 'user_instances_only' in Python files
rg --type python 'user_instances_only'

Length of output: 69


Script:

#!/bin/bash
# Search for usage of 'user_instances_only' in Python files
rg --type py 'user_instances_only'

Length of output: 2607

tests/data/test_instance_cropping.py (2)

4-4: LGTM: New imports added for enhanced testing.

The new imports (generate_centroids, generate_crops, apply_normalization, process_lf) are correctly added to support the new test function. These additions improve the test coverage and align with the PR objectives of refactoring operations into individual functions.

Also applies to: 8-8, 11-11, 13-13


87-88: LGTM: Improved readability with proper spacing.

The addition of two blank lines between test functions improves the code's readability and adheres to PEP 8 style guidelines for separating top-level functions.

tests/data/test_get_data_chunks.py (1)

1-8: LGTM: Import statements are appropriate.

The import statements are well-structured and import the necessary modules and functions for the tests. The use of import sleap_io as sio is a good convention for brevity.

tests/training/test_model_trainer.py (9)

13-13: LGTM: sys module import added

The addition of the sys module import is appropriate. It's likely being used for system-specific operations or checks in the test file.


91-92: LGTM: Cleanup operations added

The addition of shutil.rmtree calls to remove the train_chunks and val_chunks directories after tests is a good practice. This ensures proper cleanup of temporary files and directories, preventing potential interference between tests and reducing disk usage.

Also applies to: 165-166, 188-189, 217-218


300-301: LGTM: Consistent cleanup added

The addition of shutil.rmtree calls to remove the train_chunks and val_chunks directories is consistent with the cleanup operations added in other parts of the file. This ensures proper cleanup after the test.


344-345: LGTM: Additional cleanup added

The addition of another set of shutil.rmtree calls to remove the train_chunks and val_chunks directories is consistent with the cleanup operations added earlier in the same function. This ensures thorough cleanup after different parts of the test.


376-377: LGTM: Cleanup operations added to test_centroid_model

The addition of shutil.rmtree calls to remove the train_chunks and val_chunks directories in the test_centroid_model function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the test.


420-421: LGTM: Cleanup operations added to test_single_instance_model

The addition of shutil.rmtree calls to remove the train_chunks and val_chunks directories in the test_single_instance_model function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the test.


457-458: LGTM: Cleanup operations added to test_bottomup_model (first instance)

The addition of shutil.rmtree calls to remove the train_chunks and val_chunks directories in the first part of the test_bottomup_model function is consistent with the cleanup operations added in other test functions. This ensures proper cleanup after the first part of the test.


491-493: LGTM: Improved readability and cleanup in test_bottomup_model

The addition of an empty line improves the readability of the code by separating the test assertions from the cleanup operations. The shutil.rmtree calls to remove the train_chunks and val_chunks directories are consistent with the cleanup operations added in other parts of the file, ensuring proper cleanup after the second part of the test.


Line range hint 1-493: Overall improvements to test suite cleanup and handling of platform-specific issues

The changes in this file consistently improve the test suite by adding cleanup operations after each test, which is a good practice. The addition of the sys module import and the skip condition for a potentially flaky test on Linux systems addresses platform-specific issues, although the root cause of the flakiness should be investigated further.

Key improvements:

  1. Consistent cleanup of temporary directories (train_chunks and val_chunks) after tests.
  2. Improved test isolation and reduced disk usage due to proper cleanup.
  3. Handling of platform-specific test issues (Linux-specific test skip).

Suggestions for future work:

  1. Investigate and address the root cause of the flaky test on Linux systems.
  2. Consider adding a cleanup fixture to handle the repetitive cleanup operations more efficiently (a sketch follows below).

Overall, these changes enhance the reliability and maintainability of the test suite.
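
As a possible follow-up on point 2, a cleanup fixture along these lines could replace the repeated shutil.rmtree calls (the fixture name and directory layout are assumptions based on the directories mentioned above):

import shutil
from pathlib import Path

import pytest


@pytest.fixture
def chunk_dir_cleanup(tmp_path: Path):
    """Yield a base directory for chunk files and remove them afterwards."""
    yield tmp_path
    # Remove the litdata chunk directories regardless of test outcome.
    for name in ("train_chunks", "val_chunks"):
        shutil.rmtree(tmp_path / name, ignore_errors=True)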

sleap_nn/data/providers.py (1)

3-3: Import statement addition is acceptable.

The import of Any from the typing module is appropriate.

sleap_nn/data/streaming_datasets.py (7)

99-102: Verify the is_centroids parameter value

In the generate_multiconfmaps function call within the BottomUpStreamingDataset, the is_centroids parameter is set to False. Please verify that this aligns with the intended behavior for the BottomUp model, as incorrect settings may affect confidence map generation.


181-183: Ensure correct calculation of instance bounding boxes

In the re-cropping logic, make sure that the make_centered_bboxes function and subsequent operations correctly calculate the bounding boxes, especially considering batch dimensions and tensor shapes. This is crucial for accurate cropping of instances.


187-190: Check centering of instances and centroids

When adjusting the instance and centroid tensors by subtracting point, ensure that this operation correctly centers the keypoints and centroids relative to the cropped images.
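
For reference, the coordinate bookkeeping being checked in the two comments above boils down to something like the following schematic in plain torch (not the dataset's actual make_centered_bboxes / cropping code, and boundary handling is omitted):

import torch


def crop_around_centroid(image, instance, centroid, crop_hw):
    """Crop `image` around `centroid` and shift keypoints into crop coordinates.

    image: (C, H, W) tensor; instance: (n_nodes, 2) xy keypoints;
    centroid: (2,) xy point; crop_hw: (height, width).
    """
    crop_h, crop_w = crop_hw
    # Top-left corner of a box centered on the centroid.
    top_left = torch.floor(
        torch.stack([centroid[0] - crop_w / 2, centroid[1] - crop_h / 2])
    )
    x0, y0 = int(top_left[0]), int(top_left[1])
    crop = image[:, y0 : y0 + crop_h, x0 : x0 + crop_w]
    # Subtracting the corner re-expresses the keypoints and centroid in
    # crop-local coordinates, which is what the subtraction of `point` does.
    return crop, instance - top_left, centroid - top_left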


267-269: Confirm correct usage of apply_intensity_augmentation

In the CentroidStreamingDataset, verify that apply_intensity_augmentation is correctly applied to ex["image"] and ex["centroids"]. Since centroids represent point coordinates, intensity augmentations may not affect them. Ensure this call behaves as expected.


349-351: Validate augmentation functions for SingleInstanceStreamingDataset

In the SingleInstanceStreamingDataset, confirm that the augmentation functions are compatible with single-instance data and that they are applied correctly to ex["image"] and ex["instances"].


370-376: Ensure correct generation of confidence maps

In the SingleInstanceStreamingDataset, verify that the generate_confmaps function is called with appropriate parameters and that the output aligns with the expected input shape for the model.
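
For context, the confidence-map generation that generate_confmaps performs amounts to placing a Gaussian at each keypoint on a (possibly strided) output grid. The following self-contained sketch illustrates the idea; it is not the sleap_nn implementation, and the parameter names are assumptions:

import torch


def gaussian_confmaps(points_xy, img_hw, sigma=1.5, output_stride=2):
    """Render one confidence map per keypoint.

    points_xy: (n_nodes, 2) keypoints in image coordinates.
    Returns a (n_nodes, H // stride, W // stride) tensor with values in [0, 1].
    """
    h, w = img_hw[0] // output_stride, img_hw[1] // output_stride
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pts = points_xy / output_stride  # scale keypoints down to the output grid
    dx = xx[None] - pts[:, 0, None, None]
    dy = yy[None] - pts[:, 1, None, None]
    return torch.exp(-(dx**2 + dy**2) / (2 * sigma**2))


# Example: two keypoints on a 384x384 image -> confmaps of shape (2, 192, 192).
cms = gaussian_confmaps(torch.tensor([[100.0, 120.0], [200.0, 64.0]]), (384, 384))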


270-274: 🛠️ Refactor suggestion

Avoid redundant augmentation on centroids

Intensity augmentations typically modify pixel values and may not require adjustments to centroid coordinates. If apply_intensity_augmentation does not alter coordinates, passing ex["centroids"] might be unnecessary.

Consider modifying the augmentation call:

-                    ex["image"], ex["centroids"] = apply_intensity_augmentation(
+                    ex["image"], _ = apply_intensity_augmentation(

Ensure that geometric augmentations still correctly adjust centroid positions.

Likely invalid or redundant comment.

Resolved review threads: environment_mac.yml, environment.yml, tests/fixtures/datasets.py, tests/training/test_model_trainer.py, tests/data/test_streaming_datasets.py, and sleap_nn/training/model_trainer.py (5 threads).
@gitttt-1234 gitttt-1234 merged commit 2fdca47 into main Oct 3, 2024
7 checks passed
@gitttt-1234 gitttt-1234 deleted the divya/datapipe-funcs branch October 3, 2024 18:39
Linked issue that may be closed by this PR: Moving to LitData: Refactoring data pipeline