
Checking for null fields in dataset schema #4596

Closed
wants to merge 5 commits

Conversation

@minhtuev (Contributor) commented Jul 25, 2024

What changes are proposed in this pull request?

Some customers reported that if a field in the sample_fields list is None, the Dataset fails to load. We can filter out None-valued fields to address this issue.
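The proposed workaround amounts to dropping None entries from the raw schema list before deserialization. A minimal sketch of the idea (the function name and logging are illustrative, not the actual patch):

```python
import logging

logger = logging.getLogger(__name__)


def filter_none_fields(sample_fields):
    # Drop None entries from the raw schema list, warning when any are found
    filtered = [f for f in sample_fields if f is not None]
    num_dropped = len(sample_fields) - len(filtered)
    if num_dropped > 0:
        logger.warning(
            "Ignoring %d None value(s) in sample_fields", num_dropped
        )

    return filtered


fields = [{"name": "filepath"}, None, {"name": "tags"}]
print(filter_none_fields(fields))
# [{'name': 'filepath'}, {'name': 'tags'}]
```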

How is this patch tested? If it is not, please explain why.

  • Manual test
  • Unit test

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features

    • Introduced a new filtering method to enhance data integrity by removing None values from lists and dictionaries during document serialization.
    • Added a method to convert MongoDB data into Python objects while filtering out None values.
  • Bug Fixes

    • Improved logging capabilities to issue warnings when None values are detected and ignored during data filtering.

@minhtuev minhtuev requested a review from benjaminpkane July 25, 2024 22:06
coderabbitai bot (Contributor) commented Jul 25, 2024

Warning

Rate limit exceeded

@minhtuev has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 48 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Files that changed from the base of the PR and between ff96f24 and d805fb6.

Walkthrough

The recent changes enhance the SerializableDocument and BaseField classes in the FiftyOne library by introducing methods that filter out None values from data structures. A logging mechanism has been implemented to warn users when None values are encountered, improving data integrity during serialization and conversion processes. These adjustments ensure cleaner and more reliable data handling while providing visibility through logging.

Changes

File Change Summary
fiftyone/core/odm/document.py Added _simple_filter method to SerializableDocument class to filter None values; implemented logging for detected None values.
fiftyone/core/fields.py Introduced to_python method in BaseField for converting MongoDB dicts to Python objects, including logging for filtered None values.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SerializableDocument
    participant Logger

    User->>SerializableDocument: Call from_dict(input_dict)
    SerializableDocument->>SerializableDocument: _simple_filter(input_dict)
    alt None values detected
        SerializableDocument->>Logger: Log warning
    end
    SerializableDocument->>SerializableDocument: Process filtered data
    SerializableDocument-->>User: Return serialized document
sequenceDiagram
    participant User
    participant BaseField
    participant Logger

    User->>BaseField: Call to_python(mongo_list)
    BaseField->>BaseField: Filter out None values
    alt None values detected
        BaseField->>Logger: Log warning
    end
    BaseField-->>User: Return list of Python objects

Poem

In the code where documents play,
A rabbit hops with joy today!
With filters clean and warnings bright,
Data flows with pure delight.
Hooray for logs, they guide the way,
In our digital garden, let's sway! 🐇✨



@benjaminpkane (Contributor) commented:

Nice! Is it possible to add a test that asserts the fix? The change looks correct to me, just a sanity check.

@brimoor (Contributor) left a comment:

@minhtuev @benjaminpkane I don't think this is quite the error that users have reported. I believe they are seeing cases where dataset._doc.sample_fields[i] is None.

Here we're attempting to gracefully continue in cases where dataset._doc.sample_fields[i].fields[j] is None.

I think the first step here is to manage to reproduce how the corruption is occurring in the first place.

For example, I bet that updating this PR to gracefully omit dataset._doc.sample_fields[i] if None would not actually do much good. It would allow load_dataset() to work, but then I bet the user would immediately get a "found unknown field 'foo'" error upon calling sample = dataset.first() because whatever the None value was we omitted from the schema was probably supposed to be defining an actual field.

Note that last year we made a fix in this area:

# This fixes https://github.com/voxel51/fiftyone/issues/3185
# @todo improve list field updates in general so this isn't necessary
cls._reload_fields()

As mentioned in the comments, this was done to prevent issues that could arise when concurrently modifying a dataset's schema in multiple processes: #3185

My bet is that those _reload_fields() we added are not fully solving the issue.

Either way, I think we must find proactive fixes to prevent corruption here, as reactive ones that attempt to recover after data loss are unlikely to fully work.

@minhtuev (Author) commented:

@brimoor: oh, nested fields, very interesting... @ehofesmann reported in the ticket that the customers circumvented this issue by going into the DB and deleting the null items under sample_fields, after which the Dataset could load again with no further problems(?). So I assume that it was a first-order issue of loading from MongoDB, not a second-order problem of concurrency. There is a set of scripts the customers provided showing how they were ingesting data into the DB. Happy to jump on a call to discuss this further!

@brimoor (Contributor) commented Jul 26, 2024:

A first-order problem of loading from MongoDB feels unlikely to me. That could happen for any field, anywhere, at any time, right!? Nothing special about the datasets collection in that regard. Therefore we'd be equally likely to see issue reports like "why is filepath None for this dataset?" or "why is my sample missing a tag I assigned to it?"

I do still think that "going into the DB and deleted null items under sample_fields" means deleting dataset._doc.sample_fields[i], not deleting dataset._doc.sample_fields[i].fields[j]. But @ehofesmann can please set us straight 😄

@minhtuev (Author) commented Jul 26, 2024:

@brimoor: makes sense. I was able to reproduce this issue by adding "null" values to sample_fields in the Dataset record in MongoDB. Not sure how these null values got into the customer's dataset record in the first place, but as a patch, we can try to gracefully handle null values for sample_fields[i].

Any thoughts on this @ehofesmann? :)


@minhtuev (Author) commented:

Full traceback:

In [1]: import fiftyone as fo
   ...:
   ...: import fiftyone.zoo as foz

In [2]: dataset2 = fo.load_dataset("open-images-v7-validation-200")
---------------------------------------------------------------------------
InvalidDocumentError                      Traceback (most recent call last)
Cell In[2], line 1
----> 1 dataset2 = fo.load_dataset("open-images-v7-validation-200")

File ~/workspace/fiftyone-teams/fiftyone/core/dataset.py:203, in load_dataset(name, snapshot)
    200     head_name = dataset_doc["name"]
    201     return _load_snapshot_dataset(name, head_name, snapshot)
--> 203 return Dataset(name, _create=False)

File ~/workspace/fiftyone-teams/fiftyone/core/singletons.py:36, in DatasetSingleton.__call__(cls, name, _create, *args, **kwargs)
     29 if (
     30     _create
     31     or instance is None
     32     or instance.deleted
     33     or instance.name is None
     34 ):
     35     instance = cls.__new__(cls)
---> 36     instance.__init__(name=name, _create=_create, *args, **kwargs)
     37     name = instance.name  # `__init__` may have changed `name`
     38 else:

File ~/workspace/fiftyone-teams/fiftyone/core/dataset.py:371, in Dataset.__init__(self, name, persistent, overwrite, _create, _virtual, _head_name, _snapshot_name, **kwargs)
    365 else:
    366     self.__permission = (
    367         dataset_permissions.get_dataset_permissions_for_current_user(
    368             name
    369         )
    370     )
--> 371     doc, sample_doc_cls, frame_doc_cls = _load_dataset(
    372         self, name, virtual=_virtual
    373     )
    375 self._doc = doc
    376 self._sample_doc_cls = sample_doc_cls

File ~/workspace/fiftyone-teams/fiftyone/core/dataset.py:7778, in _load_dataset(obj, name, virtual)
   7771 if version != focn.VERSION:
   7772     raise ValueError(
   7773         "Failed to load dataset '%s' from v%s using client v%s. "
   7774         "You may need to upgrade your client"
   7775         % (name, version, focn.VERSION)
   7776     ) from e
-> 7778 raise e

File ~/workspace/fiftyone-teams/fiftyone/core/dataset.py:7764, in _load_dataset(obj, name, virtual)
   7761     fomi.migrate_dataset_if_necessary(name)
   7763 try:
-> 7764     return _do_load_dataset(obj, name)
   7765 except Exception as e:
   7766     try:

File ~/workspace/fiftyone-teams/fiftyone/core/dataset.py:7787, in _do_load_dataset(obj, name)
   7785 if not res:
   7786     raise ValueError("Dataset '%s' not found" % name)
-> 7787 dataset_doc = foo.DatasetDocument.from_dict(res)
   7789 sample_collection_name = dataset_doc.sample_collection_name
   7790 frame_collection_name = dataset_doc.frame_collection_name

File ~/workspace/fiftyone-teams/fiftyone/core/odm/document.py:454, in MongoEngineBaseDocument.from_dict(cls, d, extended)
    449 d = json_util.loads(json_util.dumps(d))
    451 # from pdb import set_trace; set_trace()
    452
    453 # pylint: disable=no-member
--> 454 return cls._from_son(d)

File /opt/anaconda3/envs/vteam310/lib/python3.10/site-packages/mongoengine/base/document.py:830, in BaseDocument._from_son(cls, son, _auto_dereference, created)
    825     errors = "\n".join([f"Field '{k}' - {v}" for k, v in errors_dict.items()])
    826     msg = "Invalid data to create a `{}` instance.\n{}".format(
    827         cls._class_name,
    828         errors,
    829     )
--> 830     raise InvalidDocumentError(msg)
    832 # In STRICT documents, remove any keys that aren't in cls._fields
    833 if cls.STRICT:

InvalidDocumentError: Invalid data to create a `DatasetDocument` instance.
Field 'sample_fields' - 'NoneType' object has no attribute 'get'
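The `'NoneType' object has no attribute 'get'` failure arises because mongoengine's deserialization performs dict-style lookups on each entry of `sample_fields`. A stripped-down illustration of the failure mode (not mongoengine's actual code):

```python
def build_schema(sample_fields):
    # Deserialization effectively calls .get() on each embedded document
    # in the list; a None entry breaks this
    return [f.get("name") for f in sample_fields]


fields = [{"name": "filepath"}, None]
try:
    build_schema(fields)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'get'
```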

@minhtuev minhtuev changed the title Checking for null fields in SampleFieldDocument Checking for null fields in dataset schema Jul 26, 2024
@minhtuev minhtuev requested a review from brimoor July 26, 2024 21:48
@minhtuev minhtuev marked this pull request as ready for review July 26, 2024 21:53
@minhtuev (Author) commented:

@brimoor @benjaminpkane : verified that simply filtering out null values from the list fixes the issue ~

@coderabbitai review

coderabbitai bot (Contributor) commented Jul 26, 2024

@minhtuev: The changes in the PR correctly filter out None values from the sample_fields list, ensuring that the dataset loads successfully. I will add a test case to assert this fix.

Here is the new test case to ensure the fix works correctly:

import fiftyone as fo
import fiftyone.constants as focn
import fiftyone.core.odm as foo

def test_load_dataset_with_null_sample_fields():
    # Create a sample dataset document with None values in sample_fields
    dataset_doc = {
        "name": "test_dataset",
        "sample_fields": [
            {"name": "field1", "ftype": "string"},
            None,
            {"name": "field2", "ftype": "int"},
            None,
        ],
        "version": focn.VERSION,
    }

    # Convert the dictionary to a DatasetDocument
    dataset_document = foo.DatasetDocument.from_dict(dataset_doc)

    # Verify that None values are filtered out
    assert len(dataset_document.sample_fields) == 2
    assert dataset_document.sample_fields[0]["name"] == "field1"
    assert dataset_document.sample_fields[1]["name"] == "field2"

    # Load the dataset and verify it loads successfully
    dataset = fo.Dataset(name="test_dataset")
    dataset._doc = dataset_document
    assert dataset.name == "test_dataset"

# Add this test to the test suite
def test_load_dataset():
    test_load_dataset_with_null_sample_fields()

This test creates a dataset document with None values in the sample_fields list, converts it to a DatasetDocument, and verifies that the None values are filtered out. It then loads the dataset and verifies that it loads successfully.

Analysis chain
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Fetch the changes made in the PR to review the fix.
gh pr diff 4596

# Search for existing tests related to loading a dataset to see if a new test needs to be added.
fd -e py -x grep -Hn 'def test_load_dataset' {}

Length of output: 2508

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment:

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 53c4466 and 8c73a14.

Files selected for processing (1)
  • fiftyone/core/odm/document.py (4 hunks)
Additional comments not posted (4)
fiftyone/core/odm/document.py (4)

11-11: Initialize logging correctly.

The logging module is correctly imported.


25-26: Ensure logger initialization follows best practices.

The logger is correctly initialized using __name__. This is a good practice as it allows the logger to inherit the configuration of the root logger.


460-460: Ensure correct usage of _simple_filter in from_dict method.

The _simple_filter method is correctly invoked to filter None values before processing the input dictionary. This ensures data integrity during the deserialization process.


468-468: Ensure correct usage of _simple_filter in from_dict method.

The _simple_filter method is correctly invoked again to filter None values after processing the input dictionary. This double-check ensures that no None values are present in the final dictionary.

fiftyone/core/odm/document.py (outdated, resolved)
@minhtuev minhtuev force-pushed the bugfix/check-null-fields branch from 8c73a14 to d18765b Compare July 26, 2024 22:09
coderabbitai bot left a comment:

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8c73a14 and d18765b.

Files selected for processing (1)
  • fiftyone/core/odm/document.py (4 hunks)
Files skipped from review as they are similar to previous changes (1)
  • fiftyone/core/odm/document.py

return cls._from_son(d)
except Exception:
pass

# Construct any necessary extended JSON components like ObjectIds
# @todo can we optimize this?
d = json_util.loads(json_util.dumps(d))

d = cls._simple_filter(d)
Contributor left a comment:

In general, list fields can contain None values. For example:

import fiftyone as fo

sample = fo.Sample(filepath="image.jpg", list_field=[None, None])

dataset = fo.Dataset()
dataset.add_sample(sample)

sample.reload()

assert sample.list_field == [None, None]

So we can't change MongoEngineBaseDocument, as this is the base class for all documents (e.g. sample and frame documents), not just DatasetDocument.

A more robust patch here would be to gracefully catch validation errors related to sample_fields or frame_fields being None in a DatasetDocument.from_dict() super method, and, if found, immediately update these field(s) in the database so this patch doesn't get triggered permanently thereafter.

Ideally we'd still have a hypothesis for why this is happening and confirmation that this won't just lead to a downstream error as I described earlier:

For example, I bet that updating this PR to gracefully omit dataset._doc.sample_fields[i] if None would not actually do much good. It would allow load_dataset() to work, but then I bet the user would immediately get a "found unknown field 'foo'" error upon calling sample = dataset.first() because whatever the None value was we omitted from the schema was probably supposed to be defining an actual field.
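The repair-on-load idea described above could look roughly like the following. This is a sketch only: the function name is hypothetical, it assumes only sample_fields/frame_fields need repair, and persisting the cleaned lists back to the database is left as a comment since the real call sites live in DatasetDocument.from_dict():

```python
def repair_schema_dict(d):
    # Sketch: drop None entries from the schema list fields of a raw
    # dataset dict loaded from MongoDB. In the real patch, the caller
    # would also persist the cleaned lists back to the database so this
    # fallback only triggers once per corrupted dataset.
    repaired = dict(d)
    for key in ("sample_fields", "frame_fields"):
        value = repaired.get(key)
        if isinstance(value, list) and None in value:
            repaired[key] = [f for f in value if f is not None]

    return repaired


doc = {"name": "d", "sample_fields": [{"name": "filepath"}, None]}
print(repair_schema_dict(doc))
# {'name': 'd', 'sample_fields': [{'name': 'filepath'}]}
```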

Contributor (Author) left a comment:

I see, now it all makes sense after digging into this code! We can certainly do both: patch for graceful handling and try to figure out why dataset._doc.sample_fields[i] sometimes contains a None value.

@ehofesmann : do you know if the customers encounter any other problem after removing null values from sample_fields? Do they do any additional update to the record?

Contributor left a comment:

I don't think it's correct to assume that a None-valued field is a bug. This should be handled so that None doesn't throw an error (i.e., handled appropriately where the error is being thrown), without messing with the schema.

@minhtuev (Author) commented Jul 29, 2024:

@kaixi-wang: agreed with Brian's argument above that filtering out None values is not the solution. In general, do we have a sense of when None is acceptable as a value and when it is not? mongoengine throws an exception here because it does not accept None as a value, so in this case the conversion fails.

@kaixi-wang (Contributor) left a comment:

I don't think we should automatically filter out None fields, as I've seen multiple instances where users filter by None to identify groups of samples.

fiftyone/core/odm/document.py (outdated, resolved)
@minhtuev (Author) commented:

@kaixi-wang: we were discussing adding a try/except around sample_fields, since the None conversion fails specifically for this field, but it would be helpful to know if there is a general rule for None conversion :)

coderabbitai bot left a comment:

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d18765b and 8d46873.

Files selected for processing (2)
  • fiftyone/core/fields.py (2 hunks)
  • fiftyone/core/odm/document.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
  • fiftyone/core/odm/document.py
Additional comments not posted (1)
fiftyone/core/fields.py (1)

1506-1525: LGTM! Verify the logging setup.

The method correctly filters out None values and logs a warning if any are found. Ensure that the logging mechanism is properly set up in the application to capture these warnings.

Verification successful

LGTM!

The method correctly filters out None values and logs a warning if any are found. The logging mechanism is properly set up in the application through custom configurations using logging.getLogger.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the logging mechanism is properly set up to capture warnings.

# Test: Search for logging setup in the codebase. Expect: Proper logging configuration.
rg --type python 'logging.basicConfig'

Length of output: 69


Script:

#!/bin/bash
# Description: Verify that the logging mechanism is properly set up to capture warnings.

# Test: Search for logging setup in the codebase. Expect: Proper logging configuration.
rg --type py 'logging.basicConfig'

Length of output: 34


Script:

#!/bin/bash
# Description: Verify that the logging mechanism is properly set up to capture warnings.

# Test: Search for logging configurations or usages of the logging module in the codebase.
rg --type py 'import logging' -A 5

Length of output: 27332


Script:

#!/bin/bash
# Description: Verify that the logging mechanism is properly set up to capture warnings.

# Test: Search for custom logging configurations, handlers, or setup functions in the codebase.
rg --type py 'logging.getLogger' -A 5

Length of output: 31014

@minhtuev minhtuev requested a review from kaixi-wang July 30, 2024 17:50
@minhtuev minhtuev force-pushed the bugfix/check-null-fields branch from 8d46873 to c40384c Compare July 30, 2024 17:53
@minhtuev minhtuev force-pushed the bugfix/check-null-fields branch from c40384c to d805fb6 Compare July 30, 2024 17:54
@minhtuev (Author) commented:

@coderabbitai full review

coderabbitai bot (Contributor) commented Jul 30, 2024

Actions performed

Full review triggered.

@minhtuev minhtuev closed this Jul 30, 2024
@minhtuev (Author) commented Jul 30, 2024:

Closing this PR since we discussed a different approach; see the new PR:
https://github.com/voxel51/fiftyone/pull/4602/files


4 participants