Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added DataChain.diff() #718

Merged
merged 11 commits into from
Dec 20, 2024
Merged

Added DataChain.diff() #718

merged 11 commits into from
Dec 20, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Dec 17, 2024

This is wrapper method around more generic DataChain.compare() which is used to calculate diff between file based chains (those which have File objects in it).
For matching same rows file signals source and path are used.
For comparing if row is modified / unchanged, file signals version and etag are used.

@ilongin ilongin linked an issue Dec 17, 2024 that may be closed by this pull request
Copy link

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.43%. Comparing base (983cbd8) to head (d1c5109).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #718      +/-   ##
==========================================
+ Coverage   87.39%   87.43%   +0.03%     
==========================================
  Files         114      114              
  Lines       10951    10967      +16     
  Branches     1508     1508              
==========================================
+ Hits         9571     9589      +18     
+ Misses        999      998       -1     
+ Partials      381      380       -1     
Flag Coverage Δ
datachain 87.37% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

```py
diff = images.diff(
new_images,
on=["file"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update example here or update typings above?

on: str = "file",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, fixed!

Copy link

cloudflare-workers-and-pages bot commented Dec 18, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: d1c5109
Status: ✅  Deploy successful!
Preview URL: https://97418e03.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-716-file-diff.datachain-documentation.pages.dev

View logs

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing ✨
Please see comments inline. I'm approving to not to block it.

@@ -3350,3 +3350,108 @@ def test_compare_right_compare_wrong_length(test_session):
assert str(exc_info.value) == (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test file is massive 😱

Could you please extract compare and diff test to a separate file like test_diff.py. See test_merge.py as an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

added: bool = True,
deleted: bool = True,
modified: bool = True,
unchanged: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diff() should have only added=True and modified=True. the rest should be False.

We need to optimize it for delta update use case which process only new and changed files.

unchanged: bool = False,
status_col: Optional[str] = None,
) -> "DataChain":
"""Similar as .compare() but for file based chains, i.e. those that have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention the similarity to compare() in the end of the description. We cannot assum that user knows compare() alread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added better docs

deleted=True,
modified=True,
unchanged=True,
status_col="diff"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's empty by default. The same in the description.

@pytest.mark.parametrize("deleted", (True, False))
@pytest.mark.parametrize("modified", (True, False))
@pytest.mark.parametrize("unchanged", (True, False))
@pytest.mark.parametrize("status_col", ("diff", None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test looks overcomplicated 🙂

Is it possible to separate this to small tests instead of large parametrized one?

Also, I don't see value in all modified, deleted, unchanged statuses since compare() tests have to cover this, not diff(). The same for status_col. 2-3 simple tests should cover diff() pretty well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified it

@pytest.mark.parametrize("deleted", (True, False))
@pytest.mark.parametrize("modified", (True, False))
@pytest.mark.parametrize("unchanged", (True, False))
@pytest.mark.parametrize("status_col", ("diff", None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - to complicated to unit test and it test mostly compare(), not diff().
I'd appreciate it if you could simplify it.

if deleted:
expected.append(("D", fs3, 3))
if unchanged:
expected.append(("U", fs4, 4))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I was sure that we decided to get rid of U and use S (Same) instead, didn't we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will change it in my followup PR. I forgot to change it in that main one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to S .. also changed flag name from unchanged to same

@ilongin ilongin merged commit 20c73b2 into main Dec 20, 2024
34 checks passed
@ilongin ilongin deleted the ilongin/716-file-diff branch December 20, 2024 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create DataChain.diff() method
3 participants