Added WriteConcern as a param for dataset #4696

Merged
merged 2 commits into from
Aug 20, 2024

@minhtuev minhtuev commented Aug 16, 2024

What changes are proposed in this pull request?

For MongoDB write operations, WriteConcern controls the level of acknowledgment that the client requests from the server. Setting w=0 requests no acknowledgment, so the client can return immediately instead of waiting for the server (or the other members of a replica set) to confirm the write.

For more information on WriteConcern: https://www.mongodb.com/docs/manual/reference/write-concern/

Similarly related, ReadConcern: https://www.mongodb.com/docs/manual/reference/read-concern/

How is this patch tested? If it is not, please explain why.

  • Manual test (✅ )

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features

    • Enhanced index creation with options for acknowledgment and completion wait, allowing users more control during index creation.
    • Improved handling of MongoDB write operations, providing greater flexibility in dataset management through customizable write concerns.
  • Bug Fixes

    • Resolved issues related to collection statistics retrieval by ensuring consistent application of write concerns.
  • Documentation

    • Updated method documentation to clarify the new parameters and their intended use.

@minhtuev minhtuev requested a review from benjaminpkane August 16, 2024 22:09
Contributor

coderabbitai bot commented Aug 16, 2024


Commits

Files that changed from the base of the PR and between 90b6fb9 and 9a1654b.

Walkthrough

The recent updates enhance the functionality of the fiftyone framework by introducing improved control over MongoDB write operations. Key changes include the addition of parameters for acknowledgment in index creation and flexible write concerns in dataset management methods. These enhancements streamline database interactions, promoting better performance and maintainability in handling collections and statistics.

Changes

Files | Change Summary
fiftyone/core/collections.py, fiftyone/core/dataset.py | Updated create_index method to include acknowledged and wait parameters, enhancing index creation control. Introduced _get_sample_collection and _get_frame_collection methods for customizable MongoDB interactions, applying write concerns consistently.


@minhtuev minhtuev force-pushed the feat/fast-index-creation branch from 01eadcf to 8e83944 Compare August 16, 2024 22:10
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 991c1d0 and 8e83944.

Files selected for processing (2)
  • fiftyone/core/collections.py (4 hunks)
  • fiftyone/core/dataset.py (4 hunks)
Additional comments not posted (10)
fiftyone/core/dataset.py (5)

24-31: Import of WriteConcern is appropriate.

The import of WriteConcern from pymongo is necessary for the new functionality related to controlling MongoDB write operations.


333-333: Addition of _write_concern attribute is appropriate.

The _write_concern attribute is added to manage MongoDB write concerns, initialized to None to allow for optional configuration.


1184-1188: Inclusion of write_concern in _sample_collstats is beneficial.

This change ensures that the configured write concern is respected when retrieving collection statistics, enhancing control over MongoDB operations.


1195-1199: Inclusion of write_concern in _frame_collstats is beneficial.

This change ensures that the configured write concern is respected when retrieving frame collection statistics, enhancing control over MongoDB operations.


7045-7064: Enhancement of collection methods with write_concern is appropriate.

These methods now accept a write_concern parameter, providing flexibility in configuring write operations for both sample and frame collections.

fiftyone/core/collections.py (5)

22-22: Import WriteConcern.

The WriteConcern class is imported from pymongo. Ensure that it's used correctly in the context of MongoDB operations.


9129-9130: Update function signature to include acknowledged parameter.

The create_index method now includes an acknowledged parameter, allowing control over whether index creation should be acknowledged by the server. This is a useful addition for performance tuning in scenarios where acknowledgment is not necessary.


9165-9166: Clarify acknowledged parameter behavior.

The docstring correctly explains that setting acknowledged to False results in w=0 for the WriteConcern, meaning the operation does not require acknowledgment from the server. Ensure this behavior is consistent with the intended use cases.


9245-9246: Set write concern based on acknowledgment.

The code correctly sets the write_concern to WriteConcern(w=0) when acknowledged is False, which aligns with the intended behavior of non-acknowledged operations. This is a good implementation for scenarios requiring faster operations without server acknowledgment.


9249-9255: Use write_concern when getting collections.

The _get_frame_collection and _get_sample_collection methods are called with write_concern to ensure the correct acknowledgment behavior is applied. This ensures that the collection operations respect the acknowledged parameter.

@minhtuev
Contributor Author

minhtuev commented Aug 16, 2024

We can set a global write concern argument too :) https://www.mongodb.com/developer/products/mongodb/global-read-write-concerns/

Contributor

@brimoor brimoor left a comment

@minhtuev just confirming: you've tested acknowledged=False on large index creation, right? And it indeed has the desired effect of returning roughly immediately, before the index is fully constructed?

@@ -9126,7 +9126,9 @@ def get_index_information(self, include_stats=False):

 return index_info

-def create_index(self, field_or_spec, unique=False, **kwargs):
+def create_index(
+self, field_or_spec, unique=False, acknowledged=True, **kwargs
Contributor

What do you think about calling this parameter wait=True rather than acknowledged=True? We have one precedent of something like this: session.wait()

Contributor Author

The name is fine with me. Usually wait means we are waiting for something specific; in this case, we are waiting for the index to finish building.

@@ -9238,10 +9242,17 @@ def create_index(self, field_or_spec, unique=False, **kwargs):
 # Satisfactory index already exists
 return index_name

+# Setting `w=0` sets `acknowledged=False` in pymongo
+write_concern = WriteConcern(w=0) if not acknowledged else None
Contributor

Can we support this for drop_index() as well?

Contributor Author

We don't need this for drop index - it terminates immediately!

-return foo.get_db_conn()[self._sample_collection_name]
+return self._get_sample_collection(write_concern=self._write_concern)

+def _get_sample_collection(self, write_concern=None) -> Collection:
Contributor

Would prefer not to add type hints here for consistency with the rest of this module not having them.

@@ -322,6 +330,7 @@ def __init__(
 self._run_cache = cachetools.LRUCache(5)

 self._deleted = False
+self._write_concern = None
Contributor

@brimoor brimoor Aug 19, 2024

The idea of globally configuring a dataset's write concern is creative, but it's scary: built-in methods and user code alike could start breaking if one were to globally set w=0, because certain routines perform a sequence of steps where some steps implicitly rely on previous steps having fully completed (e.g., populating a new field and then immediately computing something about it).

I'd suggest that we not add Dataset._write_concern at this time.
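
To make the concern above concrete, here is a deliberately exaggerated toy model of the "populate a field, then immediately compute on it" hazard. `ToyStore` is a hypothetical in-memory stand-in, not MongoDB or pymongo; it only assumes that an unacknowledged write may return before it has been applied, whereas real timing depends on the server.

```python
from collections import deque


class ToyStore:
    """Toy in-memory stand-in for a database; NOT MongoDB or pymongo."""

    def __init__(self):
        self.docs = {}
        self._pending = deque()

    def write(self, key, value, acknowledged=True):
        if acknowledged:
            # Acknowledged: the write is applied before the call returns
            self.docs[key] = value
        else:
            # Unacknowledged: the call returns while the write is still queued
            self._pending.append((key, value))

    def drain(self):
        """Apply queued writes, as the server eventually would."""
        while self._pending:
            key, value = self._pending.popleft()
            self.docs[key] = value


def populate_then_count(store, acknowledged):
    """Populate a field on three documents, then immediately count them."""
    for i in range(3):
        store.write(f"sample{i}", {"new_field": i}, acknowledged=acknowledged)
    return sum("new_field" in doc for doc in store.docs.values())


safe = populate_then_count(ToyStore(), acknowledged=True)

lagging_store = ToyStore()
stale = populate_then_count(lagging_store, acknowledged=False)
lagging_store.drain()
eventual = sum("new_field" in doc for doc in lagging_store.docs.values())

print(safe, stale, eventual)  # 3 0 3
```

In this model the second step sees none of the writes it depends on, which is the kind of silent breakage that scoping w=0 to a single opt-in call avoids.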

Contributor Author

Sure!

@minhtuev minhtuev force-pushed the feat/fast-index-creation branch from f6203a3 to 90b6fb9 Compare August 20, 2024 02:38
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8e83944 and 90b6fb9.

Files selected for processing (2)
  • fiftyone/core/collections.py (4 hunks)
  • fiftyone/core/dataset.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
  • fiftyone/core/collections.py
Additional context used
Ruff
fiftyone/core/dataset.py

30-30: pymongo.WriteConcern imported but unused

Remove unused import: pymongo.WriteConcern

(F401)


32-32: pymongo.collection.Collection imported but unused

Remove unused import: pymongo.collection.Collection

(F401)

Additional comments not posted (2)
fiftyone/core/dataset.py (2)

7042-7045: Verify write_concern parameter integration.

The write_concern parameter is added to the _get_sample_collection method. Ensure that this parameter is correctly passed and utilized in MongoDB operations where necessary.


7055-7061: Verify write_concern parameter integration.

The write_concern parameter is added to the _get_frame_collection method. Ensure that this parameter is correctly passed and utilized in MongoDB operations where necessary.

Comment on lines 30 to 31
WriteConcern,
)
from pymongo.collection import Collection
Contributor

Remove unused imports.

The imports WriteConcern and Collection are not used in the code and should be removed to clean up the codebase.

-    WriteConcern,
- from pymongo.collection import Collection
Tools
Ruff

30-30: pymongo.WriteConcern imported but unused

Remove unused import: pymongo.WriteConcern

(F401)


32-32: pymongo.collection.Collection imported but unused

Remove unused import: pymongo.collection.Collection

(F401)

@minhtuev minhtuev force-pushed the feat/fast-index-creation branch from 90b6fb9 to 9a1654b Compare August 20, 2024 02:49
@minhtuev
Contributor Author

minhtuev commented Aug 20, 2024

@brimoor : setting WriteConcern(w=0) does what we want it to do, but I am still learning more about WriteConcern & ReadConcern :)

Normal index creation is slow for 10M dataset:
https://www.loom.com/share/3df7adf8ce0943d1b376a01fa4f07e22?sid=c87c3c15-0eca-416f-939c-051e61b0824f

Setting w=0 for WriteConcern allows the command to exit:
https://www.loom.com/share/9ec56cce87244656a5b1943d74ba03cb?sid=6d83e741-9747-469c-8508-0e43ea2efc2b

I also discovered that certain operations, such as dataset.summary(), hang until index creation is finished, probably because the summary runs dataset.count(). We probably need to either cache the value or experiment with the ReadConcern argument (setting level=available) so that our DB ops can keep working during long index creation.
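
One possible shape for the "cache the value" idea is sketched below. `LastKnownCount` and `slow_count` are hypothetical names for illustration only, not part of FiftyOne; the sketch assumes a slightly stale count is acceptable while an index build is in progress.

```python
class LastKnownCount:
    """Hypothetical cache that serves the last computed count."""

    def __init__(self, count_fn):
        self._count_fn = count_fn  # e.g. a callable wrapping dataset.count()
        self._value = None

    def refresh(self):
        """Recompute the count by calling the (possibly blocking) count_fn."""
        self._value = self._count_fn()
        return self._value

    def get(self):
        """Return the cached count, computing it only on first use."""
        if self._value is None:
            return self.refresh()
        return self._value


calls = []


def slow_count():
    calls.append(1)  # stand-in for a query that may block during an index build
    return 10_000_000


cached = LastKnownCount(slow_count)
first = cached.get()
second = cached.get()
print(first, second, len(calls))  # 10000000 10000000 1
```

The trade-off is staleness: the cached value has to be refreshed explicitly after samples are added or removed, and the very first call can still block.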

@minhtuev minhtuev requested a review from brimoor August 20, 2024 03:57
Contributor

@brimoor brimoor left a comment

@minhtuev LGTM! Can you retarget this change at develop since it's fully functional as a standalone addition?

@minhtuev minhtuev changed the base branch from feat/index-management to develop August 20, 2024 04:29
@minhtuev minhtuev requested a review from brimoor August 20, 2024 04:29
Contributor

@brimoor brimoor left a comment

LGTM

@minhtuev minhtuev merged commit 033a7e2 into develop Aug 20, 2024
13 checks passed
@minhtuev minhtuev deleted the feat/fast-index-creation branch August 20, 2024 04:57
2 participants