
[Docs] Leaky splits #5189

Merged
merged 12 commits into release/v1.1.0 from leaky-splits-docs
Nov 26, 2024
Conversation

jacobsela
Contributor

@jacobsela jacobsela commented Nov 25, 2024

What changes are proposed in this pull request?

Added docs for new leaky-splits module

How is this patch tested? If it is not, please explain why.

Built docs and checked that they looked generally sane.

Release Notes

New brain method for detecting leaky splits! Try it out with `from fiftyone.brain import compute_leaky_splits`.
Documentation can be found in the brain docs.

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • [x] Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

see voxel51/fiftyone-brain#203

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • [x] Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features

    • Introduced a new documentation section on "Leaky Splits" for identifying potential data leaks between dataset splits.
    • Included a visual representation of the leaky splits analysis.
  • Documentation

    • Updated existing documentation to include details on the compute_leaky_splits() method, input requirements, expected outputs, and leak detection mechanisms.

Contributor

coderabbitai bot commented Nov 25, 2024

Walkthrough

The documentation for the FiftyOne Brain has been expanded to include a new section titled "Leaky Splits," which introduces the concept of identifying potential data leaks between dataset splits. This section details the compute_leaky_splits() method, including its input requirements, expected outputs, and how it can automatically clean datasets or tag identified leaks. A visual representation of the leaky splits analysis is also included, while the existing documentation remains unchanged.

Changes

File Path Change Summary
docs/source/brain.rst Added a new section on "Leaky Splits," detailing the compute_leaky_splits() method, its usage, and visual aids.
fiftyone/brain.py Introduced the method compute_leaky_splits(dataset, split_tags=None, split_field=None, split_views=None).

Possibly related PRs

  • Docs integration updates #4940: The changes in this PR involve documentation updates related to integrations, but they do not directly relate to the compute_leaky_splits() method or the concept of leaky splits introduced in the main PR.

Suggested reviewers

  • mwoodson1
  • brimoor

🐰 In the land of data, where splits may leak,
A method was born, oh so unique!
With thresholds and tags, it cleans with grace,
Ensuring our models find their true place.
So hop along, friends, let’s celebrate,
For leaky splits now have a new fate! 🌟



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
docs/source/brain.rst (3)

1781-1808: Fix typo in split_field example.

The code examples effectively demonstrate different ways to define splits. However, there's a typo in the split_field example:

-split_field = ['split'] # holds split values e.g. 'train' or 'test'
+split_field = 'split' # holds split values e.g. 'train' or 'test'

The split_field parameter should be a string, not a list, as it represents a single field name.


1809-1818: Fix typos and formality in output description.

The input/output documentation is comprehensive, but there are a few issues to address:

-This is not only visually appealing and fun to look at, but can also give you an
+This is not only visually informative but can also provide you with an
-visiual and semantic similarity
+visual and semantic similarity

1819-1847: Add missing punctuation.

The technical documentation is thorough and well-explained. Fix the missing period:

-process
+process.

Otherwise, the explanation of similarity backends, models, embeddings, and thresholds is clear and comprehensive.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ee88926 and 5d835ef.

⛔ Files ignored due to path filters (1)
  • docs/source/images/brain/brain-leaky-splits.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)
🔇 Additional comments (2)
docs/source/brain.rst (2)

1769-1780: Well-written introduction!

The introduction effectively explains the problem of data leaks between splits and their impact on model evaluation. The explanation is clear and technically accurate.


1848-1851: LGTM!

The image is properly included with appropriate alt text and alignment.

jacobmarks previously approved these changes Nov 25, 2024
Contributor

@jacobmarks jacobmarks left a comment

Please make the requested changes before merging. On the whole, LGTM 👍

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)

1792-1793: Add a concrete example dataset.

As suggested in the past review comments, providing a real example like COCO would make this more practical and easier to follow.


1821-1825: Fix typos and clarify model training statement.

There are a few issues in this paragraph:

  1. "visiual" should be "visual"
  2. The statement about models not trained on the dataset needs clarification
-Leakiness find leaks by embedding samples with a powerful
-model and finding very close samples in different splits in this space. Large,
-powerful models that were *not* trained on a dataset can provide insight into
-visiual and semantic similarity between images, without creating further leaks
-in the process.
+Leakiness finds leaks by embedding samples with a powerful
+model and finding very close samples in different splits in this space. Large,
+powerful models that were *not* trained on the target dataset can provide insight into
+visual and semantic similarity between images, without creating further leaks
+in the process.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5d835ef and 02d6e70.

📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)

77-83: LGTM! Clear and consistent introduction.

The new bullet point follows the established pattern and effectively introduces the leaky splits feature.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
docs/source/brain.rst (4)

1775-1780: Improve clarity of the introduction.

The introduction could be more precise about the consequences of data leaks. Consider mentioning specific impacts like inflated validation metrics or overconfidence in model performance.

-Despite our best efforts, duplicates and other forms of non-IID samples
-show up in our data. When these samples end up in different splits, this
-can have consequences when evaluating a model. It can often be easy to
-overestimate model capability due to this issue.
+Despite our best efforts, duplicates and other forms of non-IID samples
+can appear in our data. When these samples end up in different splits (e.g., train/test),
+they violate the independence assumption and can lead to:
+- Inflated validation/test metrics
+- Overestimation of model generalization capability
+- Unreliable model evaluation results

1794-1807: Add docstrings to code examples.

The code examples would be more helpful with brief comments explaining what each approach does and when to use it.

-    # splits via tags
+    # Use when splits are defined using dataset tags
+    # Example: samples tagged with 'train' or 'test'
     split_tags = ['train', 'test']
     index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

-    # splits via field
+    # Use when splits are stored in a dataset field
+    # Example: a 'split' field containing values like 'train' or 'test'
     split_field = ['split']
     index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

-    # splits via views
+    # Use when splits are defined as dataset views
+    # Example: filtered views for train and test data
     split_views = {
         'train' : some_view
         'test' : some_other_view
     }

1821-1825: Clarify the model selection criteria.

The explanation about model selection should be more specific about what constitutes a "powerful model" and why using models not trained on the dataset is important.

-**What to expect**: Leakiness find leaks by embedding samples with a powerful
-model and finding very close samples in different splits in this space. Large,
-powerful models that were *not* trained on a dataset can provide insight into
-visiual and semantic similarity between images, without creating further leaks
-in the process.
+**What to expect**: Leakiness finds leaks by embedding samples using a foundation
+model (like CLIP or ResNet) and identifying very close samples across different
+splits in the embedding space. Using pre-trained models that were *not* trained
+on your specific dataset is crucial because:
+1. They can capture general visual and semantic similarities
+2. They avoid introducing additional data leakage
+3. They provide unbiased similarity metrics

1841-1848: Enhance threshold documentation.

The threshold documentation should include guidance on how to choose appropriate values and examples of typical ranges.

-**Thresholds**: The leaky-splits module uses a threshold to decide what samples
-are 'too close' and mark them as potential leaks. This threshold can be changed
-either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
-method, or by using the
-:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
-method. The best value for your use-case may vary depending on your dataset, as well
-as the embeddings used. A threshold that's too big will have a lot of false positives,
-a threshold that's too small will have a lot of false negatives.
+**Thresholds**: The leaky-splits module uses a threshold to identify potential leaks
+by measuring embedding distances between samples. You can configure this threshold:
+
+- Via the `threshold` argument in `compute_leaky_splits()`
+- Using the :meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>` method
+
+Choosing the right threshold:
+- Typical range: 0.1 to 0.5 for normalized embeddings
+- Higher values (>0.3): More aggressive leak detection, higher false positives
+- Lower values (<0.2): More conservative, higher false negatives
+- Start with 0.2 and adjust based on manual inspection of results
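Stripped of FiftyOne specifics, the mechanism discussed in these suggestions — embed each sample with a model not trained on the dataset, then flag cross-split pairs that sit suspiciously close in embedding space — can be sketched in a few lines. Everything below is a toy illustration: `find_leaks`, the hand-written vectors, and the threshold value are hypothetical stand-ins, not the actual `compute_leaky_splits()` implementation.

```python
from itertools import product


def cosine_distance(u, v):
    # 1 - cosine similarity; 0 means the vectors point the same way
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return 1.0 - dot / (norm_u * norm_v)


def find_leaks(train_embeddings, test_embeddings, threshold):
    """Return (train_idx, test_idx) pairs closer than `threshold`."""
    return [
        (i, j)
        for (i, u), (j, v) in product(
            enumerate(train_embeddings), enumerate(test_embeddings)
        )
        if cosine_distance(u, v) < threshold
    ]


# Toy embeddings: test[0] is a near-duplicate of train[1]
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]]
test = [[0.0, 0.98, 0.21], [0.5, 0.5, -0.7]]

print(find_leaks(train, test, threshold=0.1))  # [(1, 0)]
```

In practice the embeddings would come from a large pre-trained model and the pairwise search would use an approximate-nearest-neighbor index rather than a brute-force product, but the flagging logic is the same.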
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 02d6e70 and 259e14e.

📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)

1781-1808: Add more context to the code examples

While the code examples show different ways to specify splits, they would benefit from additional explanation of when to use each approach.

Consider adding explanatory text before each example:

 # splits via tags
+# Use when your samples are tagged with split names
 split_tags = ['train', 'test']
 index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

 # splits via field
+# Use when split information is stored in a sample field
 split_field = ['split'] # holds split values e.g. 'train' or 'test'
 index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

 # splits via views
+# Use when splits are defined by dataset views
 split_views = {
     'train' : some_view
     'test' : some_other_view
 }

1809-1827: Enhance the COCO example

The COCO example could be more educational with additional comments explaining each step.

Add more context:

+# Load the COCO test split
 coco = foz.load_zoo_dataset("coco-2017", split="test")
+# Clear existing tags to prepare for our split
 coco.untag_samples(coco.distinct("tags"))
 
+# Create a random 70/30 train/test split
 four.random_split(coco, {"train": 0.7, "test": 0.3})
+# Analyze the splits for potential leaks
 index, leaks = compute_leaky_splits(coco, split_tags=['train', 'test'])
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 259e14e and 19ebdad.

📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)
🔇 Additional comments (3)
docs/source/brain.rst (3)

1770-1773: LGTM! Well-structured section header

The section is properly placed in the documentation hierarchy and follows the established format.


1775-1780: LGTM! Clear introduction of the concept

The introduction effectively explains the problem of data leaks between splits and their impact on model evaluation.


1901-1904: ⚠️ Potential issue

Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression

The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:

  • Increase clone and fetch times
  • Consume excessive storage
  • Impact repository performance

Please compress this image while maintaining acceptable visual quality. Recommended actions:

  • Use image optimization tools (e.g., ImageOptim, TinyPNG)
  • Reduce dimensions if unnecessarily large
  • Consider converting to a more efficient format (e.g., WebP)
  • Aim for a compressed size under 1MB
✅ Verification successful

Large image file confirmed - requires optimization to reduce 26MB size

The verification confirms the image file is 26MB, which is excessively large for version control. This will negatively impact repository performance and clone times.

Please take the following actions:

  • Compress the image while maintaining acceptable visual quality using tools like ImageOptim or TinyPNG
  • Consider reducing dimensions if they exceed documentation needs
  • Convert to a more efficient format like WebP if supported by your documentation toolchain
  • Target a compressed size under 1MB for optimal repository performance
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'

Length of output: 79

@brimoor brimoor changed the title leaky splits docs [Docs] Leaky splits Nov 26, 2024
@jacobmarks jacobmarks self-requested a review November 26, 2024 15:09
jacobmarks previously approved these changes Nov 26, 2024
Contributor

@jacobmarks jacobmarks left a comment

Overall LGTM 👍, but please consider making the suggested enhancements before merging

coco.untag_samples(coco.distinct("tags"))

four.random_split(coco, {"train": 0.7, "test": 0.3})
index, leaks = compute_leaky_splits(coco, split_tags=['train', 'test'])
Contributor

Maybe print out the number of leaks? Right now it isn't 100% clear what value this provides. You're underselling it.

Contributor Author

I think the best way to convince the reader is either through having them run it and seeing the visual once they open the app, or having them read the blog post. I think just the number doesn't tell the reader much, because how can they trust that the algorithm doesn't have a lot of false positives?

can be used to find potential leaks between dataset splits. These splits can
be misleading when evaluating a model, giving an overly optimistic measure
for the quality of training.

.. note::

Check out the :ref:`tutorials page <tutorials>` for detailed examples
Contributor

Are you adding a tutorial on data leakage to the tutorials section?

Contributor Author

Should I? What would this look like?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
docs/source/brain.rst (3)

1775-1780: Enhance the introduction with more specific details.

Consider expanding the introduction to:

  1. Explain what "non-IID" means for readers unfamiliar with the term
  2. Provide concrete examples of how leaky splits can affect model evaluation
  3. Quantify the potential impact on model performance metrics

1781-1808: Add explanatory comments to the code examples.

The code examples would be more helpful with:

  1. Comments explaining when to use each split specification method
  2. More descriptive variable names (e.g., some_view → train_subset_view)
  3. Expected output format description

1820-1824: Improve the example's clarity and add performance metrics.

Consider enhancing the example by:

  1. Using dataset instead of coco for consistency with other examples
  2. Adding a print statement to show the number of leaks found
  3. Explaining why these specific split ratios were chosen
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 19ebdad and 0900594.

⛔ Files ignored due to path filters (1)
  • docs/source/images/brain/brain-leaky-splits.png is excluded by !**/*.png, !**/*.png
📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)

1901-1904: ⚠️ Potential issue

Compress the brain-leaky-splits.png image.

The image file is 26MB which is too large for version control. Please compress it to reduce repository bloat while maintaining acceptable visual quality.

Comment on lines +1836 to +1839
Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

Contributor

⚠️ Potential issue

Fix method name inconsistency.

The documentation refers to no_leaks_view() but the actual method name is get_no_leaks_view().

-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.
Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)

1775-1780: Enhance the introduction with concrete examples.

Consider adding specific examples of what constitutes a "leak" between splits to make the concept more tangible for users. For example, mention cases like identical/near-identical images, same scene from different angles, or temporally adjacent frames from a video sequence.


1781-1811: Add expected output and return value explanations to code examples.

The code examples clearly show different ways to specify splits, but would benefit from:

  1. Comments showing example output format
  2. Explanation of what the returned index and leaks objects contain
  3. Example values for the split field/tags
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0900594 and e948e4d.

📒 Files selected for processing (1)
  • docs/source/brain.rst (2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)

1841-1844: ⚠️ Potential issue

Fix method name inconsistency.

The method name no_leaks_view() is referenced here, but it's inconsistent with the actual method name in the codebase.

-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`

Likely invalid or redundant comment.

Comment on lines +1812 to +1832
Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
Try it for yourself and see what you may find.

.. code-block:: python
:linenos:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.random as four
from fiftyone.brain import compute_leaky_splits

# load coco
dataset = foz.load_zoo_dataset("coco-2017", split="test")

# set up splits via tags
dataset.untag_samples(dataset.distinct("tags"))
four.random_split(dataset, {"train": 0.7, "test": 0.3})

# compute leaks
index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])

Contributor

🛠️ Refactor suggestion

Expand the COCO example with real-world insights.

The COCO example would be more valuable with:

  1. Expected output format and typical findings
  2. Common types of leaks found in COCO
  3. Interpretation guidelines for the results
  4. Performance impact examples (e.g., accuracy difference with/without leaks)

Comment on lines +1833 to +1864
Once you have these leaks, it is wise to look through them. You may gain some insight
into the source of the leaks.

.. code-block:: python
:linenos:

session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

.. code-block:: python
:linenos:

# if you already have it
test_set = some_view

# can also be found with the variable `split_views` from the index
# make sure to put in the right string based on the field/tag/key in view dict
# passed when building the index
test_set = index.split_views['test']

test_set_no_leaks = index.no_leaks_view(test_set) # return a view with leaks removed
session.view = test_set_no_leaks

# do evaluations on test_set_no_leaks rather than test_set

Performance on the clean test set can be closer to the performance of the
model in the wild. If you found some leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
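The before/after comparison recommended above can be illustrated with a purely synthetic example. Assume a "model" that has memorized its training labels, so it answers leaked test samples perfectly; every name and number below is made up for illustration:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match the ground-truth labels
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


# Synthetic test set: the first two samples are leaks (duplicates of train
# samples), so the memorizing model gets them right for free
labels =      [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 1, 1, 0, 1]  # perfect on the leaks, 3/4 on the rest
leak_indices = {0, 1}

clean = [i for i in range(len(labels)) if i not in leak_indices]
full_acc = accuracy(predictions, labels)
clean_acc = accuracy([predictions[i] for i in clean], [labels[i] for i in clean])

print(f"with leaks: {full_acc:.2f}")      # 0.83 -- overly optimistic
print(f"leaks removed: {clean_acc:.2f}")  # 0.75 -- closer to the wild
```

The gap between the two numbers is exactly the over-estimation that evaluating on the leak-free view is meant to remove.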

Contributor

🛠️ Refactor suggestion

Add error handling and edge cases documentation.

The section should include:

  1. How to handle cases when no leaks are found
  2. Error scenarios (e.g., invalid split specifications)
  3. Performance considerations for large datasets
  4. Best practices for threshold selection

Comment on lines +1897 to +1905
**Thresholds**: The leaky-splits module uses a threshold to decide what samples
are 'too close' and mark them as potential leaks. This threshold can be changed
either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
method. The best value for your use-case may vary depending on your dataset, as well
as the embeddings used. A threshold that's too big will have a lot of false positives,
a threshold that's too small will have a lot of false negatives.

Contributor

🛠️ Refactor suggestion

Add threshold guidance and examples.

The threshold explanation would benefit from:

  1. Recommended threshold ranges for common use cases
  2. Examples of how different thresholds affect results
  3. A method to automatically determine optimal threshold values
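To make the false-positive/false-negative trade-off concrete, here is a minimal sketch over made-up cross-split distances; the function name and the distance values are hypothetical, not part of the FiftyOne API:

```python
def count_flagged(distances, threshold):
    # Pairs closer than the threshold are marked as potential leaks
    return sum(1 for d in distances if d < threshold)


# Hypothetical cross-split embedding distances: two true near-duplicates
# (~0.01-0.03), the rest merely similar scenes
cross_split_distances = [0.01, 0.03, 0.25, 0.40, 0.55, 0.70]

for threshold in (0.05, 0.30, 0.60):
    n = count_flagged(cross_split_distances, threshold)
    print(f"threshold={threshold}: {n} potential leaks")
```

A tight threshold (0.05) catches only the two true duplicates; looser thresholds sweep in progressively more merely-similar pairs, which is the false-positive behavior the documentation warns about.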

Comment on lines +1906 to +1909
.. image:: /images/brain/brain-leaky-splits.png
:alt: leaky-splits
:align: center

Contributor

💡 Codebase verification

⚠️ Potential issue

Image file size is smaller than reported but still needs optimization

The actual size of 'brain-leaky-splits.png' is 7.6MB, not 26MB as initially stated. While this is better than previously thought, it's still larger than optimal for version control. The recommendation to compress remains valid but with less urgency:

  • Current size: 7.6MB
  • Target size: <1MB
  • Use image optimization tools while maintaining visual quality
  • Consider WebP format if browser support allows
🔗 Analysis chain

Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression

The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:

  • Increase clone and fetch times
  • Consume excessive storage
  • Impact repository performance

Please compress this image while maintaining acceptable visual quality. Recommended actions:

  • Use image optimization tools (e.g., ImageOptim, TinyPNG)
  • Reduce dimensions if unnecessarily large
  • Consider converting to a more efficient format (e.g., WebP)
  • Aim for a compressed size under 1MB
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'

Length of output: 80

Contributor

@brimoor brimoor left a comment

LGTM, appreciate you adding this! 🥇

@jacobsela jacobsela merged commit 738b128 into release/v1.1.0 Nov 26, 2024
13 checks passed
@jacobsela jacobsela deleted the leaky-splits-docs branch November 26, 2024 21:33