-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checking for null fields in dataset schema #4596
Conversation
Warning Rate limit exceeded@minhtuev has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 48 seconds before requesting another review. How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe recent changes enhance the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant SerializableDocument
participant Logger
User->>SerializableDocument: Call from_dict(input_dict)
SerializableDocument->>SerializableDocument: _simple_filter(input_dict)
alt None values detected
SerializableDocument->>Logger: Log warning
end
SerializableDocument->>SerializableDocument: Process filtered data
SerializableDocument-->>User: Return serialized document
sequenceDiagram
participant User
participant BaseField
participant Logger
User->>BaseField: Call to_python(mongo_list)
BaseField->>BaseField: Filter out None values
alt None values detected
BaseField->>Logger: Log warning
end
BaseField-->>User: Return list of Python objects
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
Nice! Is it possible to add a test that asserts the fix? The change looks correct to me, just a sanity check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minhtuev @benjaminpkane I don't think this is quite the error that users have reported. I believe they are seeing cases where dataset._doc.sample_fields[i] is None
.
Here we're attempting to gracefully continue in cases where dataset._doc.samples_fields[i].fields[j] is None
.
I think the first step here is to manage to reproduce how the corruption is occurring in the first place.
For example, I bet that updating this PR to gracefully omit dataset._doc.sample_fields[i]
if None would not actually do much good. It would allow load_dataset()
to work, but then I bet the user would immediately get a "found unknown field 'foo'" error upon calling sample = dataset.first()
because whatever the None
value was we omitted from the schema was probably supposed to be defining an actual field.
Note that last year we made a fix in this area:
fiftyone/fiftyone/core/odm/mixins.py
Lines 251 to 253 in 53c4466
# This fixes https://github.com/voxel51/fiftyone/issues/3185 | |
# @todo improve list field updates in general so this isn't necessary | |
cls._reload_fields() |
As mentioned in the comments, this was done to prevent issues that could arise when concurrently modifying a dataset's schema in multiple processes: #3185
My bet is that those _reload_fields()
we added are not fully solving the issue.
Either way, I think we must find proactive fixes to prevent corruption here, as reactive ones that attempt to recover after data loss are unlikely to fully work.
@brimoor : oh nested fields, very interesting... @ehofesmann reported in the ticket that the customers circumvented this issue by going into the DB and deleted null items under |
A first order problem of loading from MongoDB feels unlikely to me. That could happen for any field, anywhere, at any time, right!? Noting special about the I do still think that "going into the DB and deleted null items under |
@brimoor: makes sense, I was able to reproduce this issue by adding "null" values to Any thought on this @ehofesmann ? :) |
Full traceback:
|
SampleFieldDocument
2. Revert previous change
@brimoor @benjaminpkane : verified that simply filtering out null values from the list fixes the issue ~ @coderabbitai review |
@minhtuev: The changes in the PR correctly filter out Here is the new test case to ensure the fix works correctly: import fiftyone as fo
import fiftyone.core.odm as foo
def test_load_dataset_with_null_sample_fields():
# Create a sample dataset document with None values in sample_fields
dataset_doc = {
"name": "test_dataset",
"sample_fields": [
{"name": "field1", "ftype": "string"},
None,
{"name": "field2", "ftype": "int"},
None,
],
"version": foo.VERSION,
}
# Convert the dictionary to a DatasetDocument
dataset_document = foo.DatasetDocument.from_dict(dataset_doc)
# Verify that None values are filtered out
assert len(dataset_document.sample_fields) == 2
assert dataset_document.sample_fields[0]["name"] == "field1"
assert dataset_document.sample_fields[1]["name"] == "field2"
# Load the dataset and verify it loads successfully
dataset = fo.Dataset(name="test_dataset")
dataset._doc = dataset_document
assert dataset.name == "test_dataset"
# Add this test to the test suite
def test_load_dataset():
test_load_dataset_with_null_sample_fields() This test creates a dataset document with Analysis chainScripts executedThe following scripts were executed for the analysis: Script: #!/bin/bash
# Fetch the changes made in the PR to review the fix.
gh pr diff 4596
# Search for existing tests related to loading a dataset to see if a new test needs to be added.
fd -e py -x grep -Hn 'def test_load_dataset' {}
Length of output: 2508 Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (1)
- fiftyone/core/odm/document.py (4 hunks)
Additional comments not posted (4)
fiftyone/core/odm/document.py (4)
11-11
: Initialize logging correctly.The logging module is correctly imported.
25-26
: Ensure logger initialization follows best practices.The logger is correctly initialized using
__name__
. This is a good practice as it allows the logger to inherit the configuration of the root logger.
460-460
: Ensure correct usage of_simple_filter
infrom_dict
method.The
_simple_filter
method is correctly invoked to filterNone
values before processing the input dictionary. This ensures data integrity during the deserialization process.
468-468
: Ensure correct usage of_simple_filter
infrom_dict
method.The
_simple_filter
method is correctly invoked again to filterNone
values after processing the input dictionary. This double-check ensures that noNone
values are present in the final dictionary.
8c73a14
to
d18765b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (1)
- fiftyone/core/odm/document.py (4 hunks)
Files skipped from review as they are similar to previous changes (1)
- fiftyone/core/odm/document.py
fiftyone/core/odm/document.py
Outdated
return cls._from_son(d) | ||
except Exception: | ||
pass | ||
|
||
# Construct any necessary extended JSON components like ObjectIds | ||
# @todo can we optimize this? | ||
d = json_util.loads(json_util.dumps(d)) | ||
|
||
d = cls._simple_filter(d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, list fields can contain None
values. For example:
import fiftyone as fo
sample = fo.Sample(filepath="image.jpg", list_field=[None, None])
dataset = fo.Dataset()
dataset.add_sample(sample)
sample.reload()
assert sample.list_field == [None, None]
So we can't change MongoEngineBaseDocument
as this is the base class for all documents, eg sample and frame documents, not just DatasetDocument
.
A more robust patch here would be to gracefully catch validation errors related to sample_fields
or frame_fields
being None in a DatasetDocument.from_dict()
super method, and, if found, immediately update these field(s) in the database so this patch doesn't get triggered permanently thereafter.
Ideally we'd still have a hypothesis for why this is happening and confirmation that this won't just lead to a downstream error as I described earlier:
For example, I bet that updating this PR to gracefully omit dataset._doc.sample_fields[i] if None would not actually do much good. It would allow load_dataset() to work, but then I bet the user would immediately get a "found unknown field 'foo'" error upon calling sample = dataset.first() because whatever the None value was we omitted from the schema was probably supposed to be defining an actual field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, now it all makes sense after digging into this code! We can certainly do both, with both patching for graceful handling and trying to figure out why dataset._doc.sample_fields[i]
contains None value sometimes.
@ehofesmann : do you know if the customers encounter any other problem after removing null
values from sample_fields
? Do they do any additional update to the record?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's correct to assume that a none value field is a bug. This problem should be handled so None doesn't throw an error/is handled appropriately where the error is being thrown and not messing with the schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaixi-wang agreed with the argument from Brian above that filtering out None
value is not the solution. In general, do we have a sense when None
value is acceptable as a value and when it is not? Pymongo throws an exception here because it is not accepting None
as a value, so in this case the conversion fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should automatically be filtering out None fields as I've seen multiple instances where users filter by None to be able to identify groups of samples
@kaixi-wang : we were discussing doing a try-catch exception for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (2)
- fiftyone/core/fields.py (2 hunks)
- fiftyone/core/odm/document.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
- fiftyone/core/odm/document.py
Additional comments not posted (1)
fiftyone/core/fields.py (1)
1506-1525
: LGTM! Verify the logging setup.The method correctly filters out
None
values and logs a warning if any are found. Ensure that the logging mechanism is properly set up in the application to capture these warnings.Verification successful
LGTM!
The method correctly filters out
None
values and logs a warning if any are found. The logging mechanism is properly set up in the application through custom configurations usinglogging.getLogger
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that the logging mechanism is properly set up to capture warnings. # Test: Search for logging setup in the codebase. Expect: Proper logging configuration. rg --type python 'logging.basicConfig'Length of output: 69
Script:
#!/bin/bash # Description: Verify that the logging mechanism is properly set up to capture warnings. # Test: Search for logging setup in the codebase. Expect: Proper logging configuration. rg --type py 'logging.basicConfig'Length of output: 34
Script:
#!/bin/bash # Description: Verify that the logging mechanism is properly set up to capture warnings. # Test: Search for logging configurations or usages of the logging module in the codebase. rg --type py 'import logging' -A 5Length of output: 27332
Script:
#!/bin/bash # Description: Verify that the logging mechanism is properly set up to capture warnings. # Test: Search for custom logging configurations, handlers, or setup functions in the codebase. rg --type py 'logging.getLogger' -A 5Length of output: 31014
8d46873
to
c40384c
Compare
c40384c
to
d805fb6
Compare
@coderabbitai full review |
Actions performedFull review triggered. |
Closing this PR since we discussed a different approach, check the new PR: |
What changes are proposed in this pull request?
Some customers reported that if a field in
sample_fields
list is None, it causes Dataset not to load. We can filter outNone
value field to address this issue.How is this patch tested? If it is not, please explain why.
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
notes for FiftyOne users.
(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)
What areas of FiftyOne does this PR affect?
fiftyone
Python library changesSummary by CodeRabbit
New Features
None
values from lists and dictionaries during document serialization.None
values.Bug Fixes
None
values are detected and ignored during data filtering.