[Docs] Leaky splits #5189
Conversation
Walkthrough

The documentation for the FiftyOne Brain has been expanded to include a new section titled "Leaky Splits," which introduces the concept of identifying potential data leaks between dataset splits. This section details the compute_leaky_splits() method, its input requirements, expected outputs, and leak detection mechanisms.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
docs/source/brain.rst (3)
1781-1808: Fix typo in split_field example.

The code examples effectively demonstrate different ways to define splits. However, there's a typo in the split_field example:

```diff
-split_field = ['split']  # holds split values e.g. 'train' or 'test'
+split_field = 'split'  # holds split values e.g. 'train' or 'test'
```

The split_field parameter should be a string, not a list, as it represents a single field name.
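For reference, a minimal sketch of the corrected call in context (`dataset` is assumed to be loaded already):

```python
import fiftyone.brain as fob

split_field = "split"  # a single field name (string), not a list
index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)
```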
1809-1818: Fix typos and formality in output description.

The input/output documentation is comprehensive, but there are a few issues to address:

```diff
-This is not only visually appealing and fun to look at, but can also give you an
+This is not only visually informative but can also provide you with an

-visiual and semantic similarity
+visual and semantic similarity
```
1819-1847: Add missing punctuation.

The technical documentation is thorough and well-explained. Fix the missing period:

```diff
-process
+process.
```

Otherwise, the explanation of similarity backends, models, embeddings, and thresholds is clear and comprehensive.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
docs/source/images/brain/brain-leaky-splits.png is excluded by !**/*.png
📒 Files selected for processing (1)
docs/source/brain.rst
(2 hunks)
🔇 Additional comments (2)
docs/source/brain.rst (2)
1769-1780: Well-written introduction!
The introduction effectively explains the problem of data leaks between splits and their impact on model evaluation. The explanation is clear and technically accurate.
1848-1851: LGTM!
The image is properly included with appropriate alt text and alignment.
Please make the requested changes before merging. On the whole, LGTM 👍
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)
1792-1793: Add a concrete example dataset.

As suggested in the past review comments, providing a real example like COCO would make this more practical and easier to follow.
1821-1825: Fix typos and clarify model training statement.

There are a few issues in this paragraph:

- "visiual" should be "visual"
- The statement about models not trained on the dataset needs clarification

```diff
-Leakiness find leaks by embedding samples with a powerful
-model and finding very close samples in different splits in this space. Large,
-powerful models that were *not* trained on a dataset can provide insight into
-visiual and semantic similarity between images, without creating further leaks
-in the process.
+Leakiness finds leaks by embedding samples with a powerful
+model and finding very close samples in different splits in this space. Large,
+powerful models that were *not* trained on the target dataset can provide insight into
+visual and semantic similarity between images, without creating further leaks
+in the process.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
docs/source/brain.rst
(2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)
77-83: LGTM! Clear and consistent introduction.
The new bullet point follows the established pattern and effectively introduces the leaky splits feature.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
docs/source/brain.rst (4)
1775-1780: Improve clarity of the introduction.

The introduction could be more precise about the consequences of data leaks. Consider mentioning specific impacts like inflated validation metrics or overconfidence in model performance.

```diff
-Despite our best efforts, duplicates and other forms of non-IID samples
-show up in our data. When these samples end up in different splits, this
-can have consequences when evaluating a model. It can often be easy to
-overestimate model capability due to this issue.
+Despite our best efforts, duplicates and other forms of non-IID samples
+can appear in our data. When these samples end up in different splits (e.g., train/test),
+they violate the independence assumption and can lead to:
+- Inflated validation/test metrics
+- Overestimation of model generalization capability
+- Unreliable model evaluation results
```
1794-1807: Add docstrings to code examples.

The code examples would be more helpful with brief comments explaining what each approach does and when to use it.

```diff
-# splits via tags
+# Use when splits are defined using dataset tags
+# Example: samples tagged with 'train' or 'test'
 split_tags = ['train', 'test']
 index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

-# splits via field
+# Use when splits are stored in a dataset field
+# Example: a 'split' field containing values like 'train' or 'test'
 split_field = ['split']
 index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

-# splits via views
+# Use when splits are defined as dataset views
+# Example: filtered views for train and test data
 split_views = {
     'train': some_view,
     'test': some_other_view,
 }
```
1821-1825: Clarify the model selection criteria.

The explanation about model selection should be more specific about what constitutes a "powerful model" and why using models not trained on the dataset is important.

```diff
-**What to expect**: Leakiness find leaks by embedding samples with a powerful
-model and finding very close samples in different splits in this space. Large,
-powerful models that were *not* trained on a dataset can provide insight into
-visiual and semantic similarity between images, without creating further leaks
-in the process.
+**What to expect**: Leakiness finds leaks by embedding samples using a foundation
+model (like CLIP or ResNet) and identifying very close samples across different
+splits in the embedding space. Using pre-trained models that were *not* trained
+on your specific dataset is crucial because:
+1. They can capture general visual and semantic similarities
+2. They avoid introducing additional data leakage
+3. They provide unbiased similarity metrics
```
1841-1848: Enhance threshold documentation.

The threshold documentation should include guidance on how to choose appropriate values and examples of typical ranges.

```diff
-**Thresholds**: The leaky-splits module uses a threshold to decide what samples
-are 'too close' and mark them as potential leaks. This threshold can be changed
-either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
-method, or by using the
-:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
-method. The best value for your use-case may vary depending on your dataset, as well
-as the embeddings used. A threshold that's too big will have a lot of false positives,
-a threshold that's too small will have a lot of false negatives.
+**Thresholds**: The leaky-splits module uses a threshold to identify potential leaks
+by measuring embedding distances between samples. You can configure this threshold:
+
+- Via the `threshold` argument in `compute_leaky_splits()`
+- Using the :meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>` method
+
+Choosing the right threshold:
+- Typical range: 0.1 to 0.5 for normalized embeddings
+- Higher values (>0.3): More aggressive leak detection, higher false positives
+- Lower values (<0.2): More conservative, higher false negatives
+- Start with 0.2 and adjust based on manual inspection of results
```
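A minimal sketch of both configuration options (the `0.2` and `0.15` values are illustrative only, not documented recommendations):

```python
# Option 1: set the threshold up front
index, leaks = fob.compute_leaky_splits(
    dataset, split_tags=["train", "test"], threshold=0.2
)

# Option 2: tune it on the returned index afterwards
index.set_threshold(0.15)
```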
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)
1781-1808: Add more context to the code examples

While the code examples show different ways to specify splits, they would benefit from additional explanation of when to use each approach.
Consider adding explanatory text before each example:
```diff
 # splits via tags
+# Use when your samples are tagged with split names
 split_tags = ['train', 'test']
 index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

 # splits via field
+# Use when split information is stored in a sample field
 split_field = ['split']  # holds split values e.g. 'train' or 'test'
 index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

 # splits via views
+# Use when splits are defined by dataset views
 split_views = {
     'train': some_view,
     'test': some_other_view,
 }
```
1809-1827: Enhance the COCO example

The COCO example could be more educational with additional comments explaining each step.

Add more context:

```diff
+# Load the COCO test split
 coco = foz.load_zoo_dataset("coco-2017", split="test")

+# Clear existing tags to prepare for our split
 coco.untag_samples(coco.distinct("tags"))

+# Create a random 70/30 train/test split
 four.random_split(coco, {"train": 0.7, "test": 0.3})

+# Analyze the splits for potential leaks
 index, leaks = compute_leaky_splits(coco, split_tags=['train', 'test'])
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
docs/source/brain.rst
(2 hunks)
🔇 Additional comments (3)
docs/source/brain.rst (3)
1770-1773: LGTM! Well-structured section header
The section is properly placed in the documentation hierarchy and follows the established format.
1775-1780: LGTM! Clear introduction of the concept
The introduction effectively explains the problem of data leaks between splits and their impact on model evaluation.
1901-1904: Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression
The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:
- Increase clone and fetch times
- Consume excessive storage
- Impact repository performance
Please compress this image while maintaining acceptable visual quality. Recommended actions:
- Use image optimization tools (e.g., ImageOptim, TinyPNG)
- Reduce dimensions if unnecessarily large
- Consider converting to a more efficient format (e.g., WebP)
- Aim for a compressed size under 1MB
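For instance, a minimal sketch of shrinking the screenshot with Pillow (assuming Pillow is installed; the paths and size cap are illustrative):

```python
from PIL import Image

img = Image.open("docs/source/images/brain/brain-leaky-splits.png")
img.thumbnail((1600, 1600))  # cap the longest side; preserves aspect ratio
img.save(
    "docs/source/images/brain/brain-leaky-splits.webp",
    "WEBP",
    quality=80,  # lossy, but typically fine for docs screenshots
)
```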
✅ Verification successful
Large image file confirmed - requires optimization to reduce 26MB size
The verification confirms the image file is 26MB, which is excessively large for version control. This will negatively impact repository performance and clone times.
Please take the following actions:
- Compress the image while maintaining acceptable visual quality using tools like ImageOptim or TinyPNG
- Consider reducing dimensions if they exceed documentation needs
- Convert to a more efficient format like WebP if supported by your documentation toolchain
- Target a compressed size under 1MB for optimal repository performance
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'
Length of output: 79
Overall LGTM 👍, but please consider making the suggested enhancements before merging
docs/source/brain.rst
Outdated
```python
coco.untag_samples(coco.distinct("tags"))

four.random_split(coco, {"train": 0.7, "test": 0.3})
index, leaks = compute_leaky_splits(coco, split_tags=['train', 'test'])
```
Maybe print out the number of leaks? Right now the full value this provides isn't 100% clear. You're underselling it
I think the best way to convince the reader is either through having them run it and seeing the visuals once they open the App, or having them read the blog post. I think just the number doesn't tell the reader much, because how can they trust that the algorithm doesn't have a lot of false positives?
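For reference, a minimal sketch combining both ideas (`leaks` and `dataset` as returned/used in the snippet under review):

```python
# Surface the count up front...
print(f"Found {len(leaks)} potential leaks")

# ...but let the App view do the convincing
session = fo.launch_app(leaks)
```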
```rst
can be used to find potential leaks between dataset splits. These splits can
be misleading when evaluating a model, giving an overly optimistic measure
for the quality of training.

.. note::

    Check out the :ref:`tutorials page <tutorials>` for detailed examples
```
Are you adding a tutorial on data leakage to the tutorials section?
Should I? What would this look like?
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
docs/source/brain.rst (3)
1775-1780: Enhance the introduction with more specific details.

Consider expanding the introduction to:

- Explain what "non-IID" means for readers unfamiliar with the term
- Provide concrete examples of how leaky splits can affect model evaluation
- Quantify the potential impact on model performance metrics
1781-1808: Add explanatory comments to the code examples.

The code examples would be more helpful with:

- Comments explaining when to use each split specification method
- More descriptive variable names (e.g., some_view → train_subset_view)
- Expected output format description
1820-1824: Improve the example's clarity and add performance metrics.

Consider enhancing the example by:

- Using dataset instead of coco for consistency with other examples
- Adding a print statement to show the number of leaks found
- Explaining why these specific split ratios were chosen
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
docs/source/images/brain/brain-leaky-splits.png is excluded by !**/*.png
📒 Files selected for processing (1)
docs/source/brain.rst
(2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)
1901-1904: Compress the brain-leaky-splits.png image.
The image file is 26MB which is too large for version control. Please compress it to reduce repository bloat while maintaining acceptable visual quality.
```rst
Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.
```
Fix method name inconsistency.
The documentation refers to no_leaks_view() but the actual method name is get_no_leaks_view().
-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rst
Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`.
```
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (2)
docs/source/brain.rst (2)
1775-1780: Enhance the introduction with concrete examples.

Consider adding specific examples of what constitutes a "leak" between splits to make the concept more tangible for users. For example, mention cases like identical/near-identical images, same scene from different angles, or temporally adjacent frames from a video sequence.
1781-1811: Add expected output and return value explanations to code examples.

The code examples clearly show different ways to specify splits, but would benefit from:

- Comments showing example output format
- Explanation of what the returned index and leaks objects contain
- Example values for the split field/tags
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
docs/source/brain.rst
(2 hunks)
🔇 Additional comments (1)
docs/source/brain.rst (1)
1841-1844: Fix method name inconsistency.

The method name no_leaks_view() is referenced here, but it's inconsistent with the actual method name in the codebase.
-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`
Likely invalid or redundant comment.
```rst
Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
Try it for yourself and see what you may find.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.zoo as foz
    import fiftyone.utils.random as four
    from fiftyone.brain import compute_leaky_splits

    # load coco
    dataset = foz.load_zoo_dataset("coco-2017", split="test")

    # set up splits via tags
    dataset.untag_samples(dataset.distinct("tags"))
    four.random_split(dataset, {"train": 0.7, "test": 0.3})

    # compute leaks
    index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])
```
🛠️ Refactor suggestion
Expand the COCO example with real-world insights.
The COCO example would be more valuable with:
- Expected output format and typical findings
- Common types of leaks found in COCO
- Interpretation guidelines for the results
- Performance impact examples (e.g., accuracy difference with/without leaks)
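As a starting point, a hedged sketch of surfacing typical findings (`leaks` and `dataset` as in the snippet above; the tag name is an arbitrary choice for this example):

```python
# Summarize what the index found; len() works on FiftyOne views
print(f"{len(leaks)} potential leaks among {len(dataset)} samples")

# Tag the flagged samples so they are easy to filter in the App
leaks.tag_samples("leak")
```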
```rst
Once you have these leaks, it is wise to look through them. You may gain some insight
into the source of the leaks.

.. code-block:: python
    :linenos:

    session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

.. code-block:: python
    :linenos:

    # if you already have it
    test_set = some_view

    # can also be found with the variable `split_views` from the index
    # make sure to put in the right string based on the field/tag/key in view dict
    # passed when building the index
    test_set = index.split_views['test']

    test_set_no_leaks = index.no_leaks_view(test_set)  # return a view with leaks removed
    session.view = test_set_no_leaks

    # do evaluations on test_set_no_leaks rather than test_set

Performance on the clean test set can be closer to the performance of the
model in the wild. If you found some leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
```
🛠️ Refactor suggestion
Add error handling and edge cases documentation.
The section should include:
- How to handle cases when no leaks are found
- Error scenarios (e.g., invalid split specifications)
- Performance considerations for large datasets
- Best practices for threshold selection
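For example, a minimal sketch of guarding the no-leaks case (assuming `leaks` is a possibly-empty view, as returned by `compute_leaky_splits()`):

```python
if not len(leaks):
    # An empty view may mean clean splits, or a threshold that is too strict
    print("No leaks found - consider lowering the threshold before "
          "concluding the splits are clean")
else:
    session = fo.launch_app(leaks)
```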
```rst
**Thresholds**: The leaky-splits module uses a threshold to decide what samples
are 'too close' and mark them as potential leaks. This threshold can be changed
either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
method. The best value for your use-case may vary depending on your dataset, as well
as the embeddings used. A threshold that's too big will have a lot of false positives,
a threshold that's too small will have a lot of false negatives.
```
🛠️ Refactor suggestion
Add threshold guidance and examples.
The threshold explanation would benefit from:
- Recommended threshold ranges for common use cases
- Examples of how different thresholds affect results
- A method to automatically determine optimal threshold values
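One possible approach, sketched under the assumption that `set_threshold()` updates the index in place and that the recomputed leaks can be re-queried; the `leaks_view()` accessor used here is hypothetical:

```python
# Sweep a few candidate thresholds and inspect the leak counts;
# the values are illustrative, not documented recommendations
for t in (0.05, 0.1, 0.2, 0.3):
    index.set_threshold(t)
    print(f"threshold={t}: {len(index.leaks_view())} candidate leaks")
```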
```rst
.. image:: /images/brain/brain-leaky-splits.png
    :alt: leaky-splits
    :align: center
```
💡 Codebase verification
Image file size is smaller than reported but still needs optimization
The actual size of 'brain-leaky-splits.png' is 7.6MB, not 26MB as initially stated. While this is better than previously thought, it's still larger than optimal for version control. The recommendation to compress remains valid but with less urgency:
- Current size: 7.6MB
- Target size: <1MB
- Use image optimization tools while maintaining visual quality
- Consider WebP format if browser support allows
🔗 Analysis chain
Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression
The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:
- Increase clone and fetch times
- Consume excessive storage
- Impact repository performance
Please compress this image while maintaining acceptable visual quality. Recommended actions:
- Use image optimization tools (e.g., ImageOptim, TinyPNG)
- Reduce dimensions if unnecessarily large
- Consider converting to a more efficient format (e.g., WebP)
- Aim for a compressed size under 1MB
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'
Length of output: 80
LGTM, appreciate you adding this! 🥇
What changes are proposed in this pull request?
Added docs for new leaky-splits module
How is this patch tested? If it is not, please explain why.
Built docs and checked that they looked generally sane.
Release Notes

New brain method for detecting leaky splits! Try it out by importing
`from fiftyone.brain import compute_leaky_splits`. Documentation can be found in the brain docs.
Is this a user-facing change that should be mentioned in the release notes?

Yes. Give a description of this change to be included in the release
notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)
see voxel51/fiftyone-brain#203
What areas of FiftyOne does this PR affect?
fiftyone
Python library changes

Summary by CodeRabbit
New Features

- New brain method for detecting leaky splits via compute_leaky_splits().

Documentation

- Added a "Leaky Splits" section to the brain docs covering the compute_leaky_splits() method, input requirements, expected outputs, and leak detection mechanisms.