
Updated migration #8543

Merged 7 commits into develop from bs/updated_migration on Oct 16, 2024
Conversation

@bsekachev (Member) commented Oct 15, 2024

Motivation and context

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary
    (cvat-canvas,
    cvat-core,
    cvat-data and
    cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

  • New Features
    • Introduced new validation models: ValidationParams, ValidationLayout, and ValidationFrame to enhance data validation capabilities.
    • Implemented a function to clean up redundant ground truth jobs, improving data integrity.
  • Improvements
    • Updated the RelatedFile model to strengthen its relationship with the Image model, enhancing data management.

@bsekachev bsekachev requested a review from Marishka17 as a code owner October 15, 2024 10:04
coderabbitai bot (Contributor) commented Oct 15, 2024

Walkthrough

The changes in the migration file 0084_honeypot_support.py introduce a new function cleanup_invalid_data for managing ground truth jobs, ensuring data integrity by deleting redundant jobs while retaining at least one per task. Additionally, three new models—ValidationParams, ValidationLayout, and ValidationFrame—are created to enhance validation capabilities. The RelatedFile model is modified to establish a ManyToMany relationship with the Image model, removing the previous ForeignKey relationship with primary_image and updating the images field for better reverse lookups.

Changes

File: cvat/apps/engine/migrations/0084_honeypot_support.py
Change summary:
  • Added function cleanup_invalid_data.
  • Created models: ValidationParams, ValidationLayout, ValidationFrame.
  • Modified the RelatedFile model: added a ManyToMany relationship with Image, removed primary_image, and set related_name="related_files" on the images field.

Poem

In the garden where data grows,
A rabbit hops where the clean stream flows.
With valid frames and layouts bright,
We tidy up jobs, making things right.
Hooray for the changes, let’s dance and play,
For a better tomorrow, hip-hip-hooray! 🐇✨



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (1)
cvat/apps/engine/migrations/0084_honeypot_support.py (1)

Line range hint 118-146: Handle exceptions in reverse migration revert_m2m_for_related_files

The reverse migration raises an exception if any RelatedFile has more than one associated Image. This could prevent rolling back the migration in certain cases.

  • Provide a clear message and guidance on how to resolve the issue if the exception is raised.
  • Consider whether it's feasible to programmatically resolve or merge multiple images into a single primary_image or adjust the data model accordingly.
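The pre-check the reverse migration needs could be sketched in plain Python, without Django. This is an illustrative model only: `related_file_images` maps a RelatedFile id to the ids of its linked Images, and both function names are hypothetical, not the actual migration code.

```python
def check_reversible(related_file_images):
    """Return ids of RelatedFiles that block the rollback (more than one image)."""
    return sorted(
        rf_id for rf_id, image_ids in related_file_images.items()
        if len(image_ids) > 1
    )

def pick_primary_images(related_file_images):
    """Map each RelatedFile to its single image, or raise with guidance."""
    blockers = check_reversible(related_file_images)
    if blockers:
        # Actionable message instead of a bare exception, as suggested above
        raise RuntimeError(
            f"Cannot reverse migration: RelatedFile ids {blockers} reference "
            "multiple images. Detach or merge the extra images before rolling back."
        )
    # Files with no image simply get no primary_image on rollback
    return {rf_id: ids[0] for rf_id, ids in related_file_images.items() if ids}
```

Running the check first lets the rollback fail with a list of every offending row at once, rather than stopping at the first one.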
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between the base of the PR (ac01fff) and f82de29.

📒 Files selected for processing (1)
  • cvat/apps/engine/migrations/0084_honeypot_support.py (3 hunks)
🧰 Additional context used
🔇 Additional comments (6)
cvat/apps/engine/migrations/0084_honeypot_support.py (6)

248-251: Ensure data integrity during the migration

When performing data cleanup in migrations, it's crucial to handle exceptions and ensure that the database remains in a consistent state if an error occurs.

Wrap the data manipulation code in a transaction to ensure atomicity:

from django.db import transaction

@transaction.atomic
def cleanup_invalid_data(apps):
    # Existing code

Line range hint 98-112: Check ManyToMany initialization for related files and images

The function init_m2m_for_related_files populates the intermediate table for the ManyToMany relationship between RelatedFile and Image. Ensure that:

  • The bulk creation handles all existing RelatedFile instances with a non-null primary_image.
  • Data integrity is maintained, and there are no duplicate entries.

Consider adding logging or progress indicators if the dataset is large to monitor the migration progress.
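The duplicate-avoidance part of that bulk creation can be illustrated with a Django-free sketch. All names here are hypothetical: `related_files` stands in for the old rows as (related_file_id, primary_image_id) tuples, where the image id may be None.

```python
def build_m2m_rows(related_files):
    """Yield unique (related_file_id, image_id) pairs, skipping null images."""
    seen = set()
    rows = []
    for rf_id, image_id in related_files:
        if image_id is None:
            continue  # no primary image: nothing to migrate for this row
        pair = (rf_id, image_id)
        if pair in seen:
            continue  # guard against duplicate entries in the through table
        seen.add(pair)
        rows.append(pair)
    return rows
```

In the real migration the resulting pairs would feed a `bulk_create` on the intermediate model; the sketch only shows the filtering invariant (no nulls, no duplicates).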


Line range hint 188-241: Review field choices and default values in new models

In the ValidationParams and ValidationLayout models:

  • Ensure that the choices for mode and frame_selection_method fields accurately reflect all valid options.
  • Verify that fields like random_seed, frame_count, and frame_share handle null values appropriately.

Line range hint 68-93: Verify correct initialization of validation layouts

The init_validation_layout_in_tasks_with_gt_job function initializes ValidationLayout instances. Ensure that:

  • The frames field is correctly calculated using get_segment_rel_frame_set.
  • All possible db_segment.type values are handled appropriately in get_segment_rel_frame_set.

To confirm, you can run:

#!/bin/bash
# Description: Verify all segment types are accounted for in get_segment_rel_frame_set.

# Expect: No unhandled segment types.
ast-grep --lang python --pattern '
def get_segment_rel_frame_set($_) -> $_:
    $_
    else:
        raise ValueError($_)
' 0084_honeypot_support.py
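To make the "all types handled" expectation concrete, here is a hypothetical model of `get_segment_rel_frame_set`. The type names ("range", "specific_frames") and fields are assumptions for illustration, not CVAT's actual schema; the point is the explicit else-raise on unknown types.

```python
def get_segment_rel_frame_set(segment_type, start_frame=0, stop_frame=0, frames=()):
    """Return the set of frames covered by a segment, by segment type."""
    if segment_type == "range":
        # Inclusive frame range, like a contiguous segment
        return set(range(start_frame, stop_frame + 1))
    elif segment_type == "specific_frames":
        # Explicitly enumerated frames, like a honeypot/validation segment
        return set(frames)
    else:
        # Fail loudly on unknown types instead of silently returning nothing
        raise ValueError(f"Unknown segment type: {segment_type!r}")
```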

Line range hint 252-261: Update references due to changes in RelatedFile model fields

The primary_image field is removed, and an images ManyToMany field with related_name="related_files" is added to the RelatedFile model. Ensure that all code referencing primary_image is updated to use the new relationship.

Run this script to identify potential code that needs updating:

#!/bin/bash
# Description: Find all references to 'primary_image' in the codebase.

# Expect: All references should be reviewed and updated.
rg --type py 'primary_image' cvat/apps/

61-64: Ensure at least one ground truth job remains per task

The loop removes ground truth jobs until only one remains. However, without safeguards, there's a risk of accidentally removing all ground truth jobs for a task if groups[task_id] becomes empty due to unexpected data conditions.

Consider adding a check to ensure that the loop stops when one job remains:

while len(groups[task_id]) > 1:
    # Existing deletion logic

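The keep-one-per-task policy is easy to reason about in isolation. Below is a pure-Python sketch (no Django; names illustrative) where each job is a `(task_id, created_date)` tuple and the input is sorted by creation date, mirroring the queryset's `order_by('created_date')`:

```python
from collections import defaultdict

def jobs_to_delete(gt_jobs):
    """Given jobs sorted by created_date, return those to delete,
    keeping the oldest job per task."""
    groups = defaultdict(list)
    for job in gt_jobs:
        groups[job[0]].append(job)  # job[0] is the task id

    doomed = []
    for task_id in groups:
        while len(groups[task_id]) > 1:  # stops with exactly one job left
            doomed.append(groups[task_id].pop())  # pop() drops the newest first
    return doomed
```

Because the loop condition is `> 1`, the list can never be emptied, and a task with zero or one ground truth job is left untouched.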

Comment on lines 248 to 251

        migrations.RunPython(
            cleanup_invalid_data,
            reverse_code=migrations.RunPython.noop,
        ),

⚠️ Potential issue

Provide a meaningful reverse migration for cleanup_invalid_data

Currently, reverse_code is set to migrations.RunPython.noop, which means there is no operation to reverse the data changes made by cleanup_invalid_data. This could be problematic if a rollback is necessary.

Consider implementing a reverse function that can restore the deleted ground truth jobs if possible or document clearly why a reverse migration is not feasible due to data constraints.
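If restoring the deleted jobs is not feasible, one hedged option (illustrative, not the PR's code) is to replace the silent noop with a reverse callback that fails with an explanation. The signature mirrors Django's `RunPython` callbacks:

```python
def reverse_cleanup_invalid_data(apps, schema_editor):
    """Explicitly refuse to reverse, documenting why, instead of a silent noop."""
    raise RuntimeError(
        "cleanup_invalid_data deleted redundant ground truth jobs; the deleted "
        "rows are not recoverable, so this step cannot be reversed. Restore from "
        "a database backup taken before this migration if a rollback is needed."
    )
```

The operation would then be wired as `migrations.RunPython(cleanup_invalid_data, reverse_code=reverse_cleanup_invalid_data)`, making the rollback behavior explicit to whoever attempts it.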

Comment on lines 44 to 66

    def cleanup_invalid_data(apps):
        Task = apps.get_model("engine", "Task")
        Job = apps.get_model("engine", "Job")

        broken_tasks = Task.objects.annotate(
            ground_truth_jobs_count=Count(
                'segment__job', filter=Q(segment__job__type='ground_truth')
            )
        ).filter(ground_truth_jobs_count__gt=1).values_list('segment__task__id', flat=True)
        gt_jobs = Job.objects.filter(
            segment__task__id__in=broken_tasks
        ).filter(type='ground_truth').order_by('created_date').all()

        groups = defaultdict(list)
        for gt_job in gt_jobs:
            groups[gt_job.segment.task.id].append(gt_job)

        for task_id in groups:
            while len(groups[task_id]) > 1:
                gt_job = groups[task_id].pop()

                assert gt_job.type == 'ground_truth'
                gt_job.delete()

🛠️ Refactor suggestion

Optimize database queries for better performance

The cleanup_invalid_data function retrieves and processes data in a way that may not be optimal for large datasets.

  • Use iterator() with a suitable chunk_size when dealing with large querysets to avoid loading all objects into memory at once.
  • Consider combining queries or using subqueries to minimize database hits.

For example:

gt_jobs = (
    Job.objects.filter(
        segment__task__id__in=broken_tasks,
        type='ground_truth'
    )
    .order_by('created_date')
    .iterator(chunk_size=1000)
)

⚠️ Potential issue

Replace assert with explicit exception handling for reliability

In the cleanup_invalid_data function, using assert statements can be bypassed if Python is run with optimizations (-O flag), as assertions are removed. It's safer to use explicit exception handling to ensure that the check is always performed.

Apply this diff to replace the assert statement with a conditional raise:

             while len(groups[task_id]) > 1:
                 gt_job = groups[task_id].pop()

-                assert gt_job.type == 'ground_truth'
+                if gt_job.type != 'ground_truth':
+                    raise ValueError(f"Expected gt_job.type to be 'ground_truth', got {gt_job.type}")

                 gt_job.delete()

@bsekachev bsekachev changed the title [WIP] Updated migration Updated migration Oct 15, 2024
@bsekachev (Member, Author) commented:

@SpecLad applied proposed changes

@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.23%. Comparing base (ac01fff) to head (9e4a03e).
Report is 1 commit behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8543      +/-   ##
===========================================
- Coverage    74.30%   74.23%   -0.07%     
===========================================
  Files          400      400              
  Lines        43218    43218              
  Branches      3909     3909              
===========================================
- Hits         32114    32085      -29     
- Misses       11104    11133      +29     
Components Coverage Δ
cvat-ui 78.66% <ø> (-0.07%) ⬇️
cvat-server 70.47% <ø> (-0.08%) ⬇️

@bsekachev bsekachev merged commit c557f70 into develop Oct 16, 2024
34 of 36 checks passed
@bsekachev bsekachev deleted the bs/updated_migration branch October 24, 2024 05:15