Added DataChain.diff()
#718
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #718      +/-   ##
==========================================
+ Coverage   87.39%   87.43%   +0.03%
==========================================
  Files         114      114
  Lines       10951    10967      +16
  Branches     1508     1508
==========================================
+ Hits         9571     9589      +18
+ Misses        999      998       -1
+ Partials      381      380       -1
src/datachain/lib/dc.py
```py
diff = images.diff(
    new_images,
    on=["file"],
Should we update the example here, or update the typings above?
on: str = "file",
Nice catch, fixed!
Amazing ✨
Please see the inline comments. I'm approving so as not to block it.
tests/unit/lib/test_datachain.py
@@ -3350,3 +3350,108 @@ def test_compare_right_compare_wrong_length(test_session):
    assert str(exc_info.value) == (
The test file is massive 😱
Could you please extract the compare and diff tests into a separate file, such as test_diff.py? See test_merge.py as an example.
Done
src/datachain/lib/dc.py
added: bool = True,
deleted: bool = True,
modified: bool = True,
unchanged: bool = False,
I think diff() should default to only added=True and modified=True; the rest should be False.
We need to optimize it for the delta-update use case, which processes only new and changed files.
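To make the proposed defaults concrete, here is a minimal plain-Python sketch of how such flags could filter diff results. It is not DataChain code: `select_statuses` is a hypothetical helper, and the status letters `A` and `M` are assumptions (only `D` and `U` appear in the tests under review).

```python
from typing import Iterable


def select_statuses(
    rows: Iterable[tuple],
    added: bool = True,
    modified: bool = True,
    deleted: bool = False,
    unchanged: bool = False,
) -> list:
    """Keep only rows whose leading status letter is enabled by the flags.

    The defaults mirror the delta-update proposal: only added and
    modified rows are returned unless the caller asks for more.
    """
    # Map each flag to its (partly assumed) status letter.
    enabled = {
        letter
        for letter, on in (
            ("A", added),
            ("M", modified),
            ("D", deleted),
            ("U", unchanged),
        )
        if on
    }
    # Each row is a tuple whose first element is the status letter.
    return [row for row in rows if row[0] in enabled]
```

With these defaults, a delta-update job would see only the files it actually needs to reprocess, and callers that want a full report would opt in with `deleted=True` and `unchanged=True`.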
src/datachain/lib/dc.py
    unchanged: bool = False,
    status_col: Optional[str] = None,
) -> "DataChain":
    """Similar as .compare() but for file based chains, i.e. those that have
Please mention the similarity to compare() at the end of the description. We cannot assume that the user already knows compare().
Added better docs
deleted=True,
modified=True,
unchanged=True,
status_col="diff"
status_col is empty by default. The same applies to the description.
tests/unit/lib/test_datachain.py
@pytest.mark.parametrize("deleted", (True, False))
@pytest.mark.parametrize("modified", (True, False))
@pytest.mark.parametrize("unchanged", (True, False))
@pytest.mark.parametrize("status_col", ("diff", None))
The test looks overcomplicated 🙂
Is it possible to split this into small tests instead of one large parametrized test?
Also, I don't see value in covering all the modified, deleted, and unchanged statuses here, since the compare() tests have to cover them, not diff(). The same goes for `status_col`. 2-3 simple tests should cover diff() pretty well.
Simplified it
tests/unit/lib/test_datachain.py
@pytest.mark.parametrize("deleted", (True, False))
@pytest.mark.parametrize("modified", (True, False))
@pytest.mark.parametrize("unchanged", (True, False))
@pytest.mark.parametrize("status_col", ("diff", None))
Same here - too complicated for a unit test, and it mostly tests compare(), not diff().
I'd appreciate it if you could simplify it.
tests/unit/lib/test_datachain.py
if deleted:
    expected.append(("D", fs3, 3))
if unchanged:
    expected.append(("U", fs4, 4))
🤔 I was sure that we decided to get rid of `U` and use `S` (Same) instead, didn't we?
Yes, I will change it in my follow-up PR. I forgot to change it in the main one.
Changed to `S`. Also changed the flag name from `unchanged` to `same`.
…n into ilongin/716-file-diff
This is a wrapper method around the more generic `DataChain.compare()`. It is used to calculate the diff between file-based chains (those which have `File` objects in them).
For matching the same rows, the file signals `source` and `path` are used.
For checking whether a row is modified or unchanged, the file signals `version` and `etag` are used.
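
The matching rule in this description can be illustrated without DataChain itself. The following is a minimal plain-Python sketch, not the library's implementation: `FileRecord` and `file_diff` are hypothetical names, the status letters `A` and `M` are assumptions (`D` and `S` come from the review thread), and treating the second argument as the "new" side is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for a File object; only four signals matter here."""

    source: str
    path: str
    version: str
    etag: str


def file_diff(
    old: Iterable[FileRecord], new: Iterable[FileRecord]
) -> list[tuple[str, FileRecord]]:
    """Toy diff: match rows on (source, path), compare on (version, etag)."""
    old_by_key = {(f.source, f.path): f for f in old}
    new_by_key = {(f.source, f.path): f for f in new}

    result = []
    for key, record in new_by_key.items():
        previous = old_by_key.get(key)
        if previous is None:
            result.append(("A", record))  # assumed letter: added
        elif (record.version, record.etag) != (previous.version, previous.etag):
            result.append(("M", record))  # assumed letter: modified
        else:
            result.append(("S", record))  # same
    for key, record in old_by_key.items():
        if key not in new_by_key:
            result.append(("D", record))  # deleted
    return result
```

In this sketch a file that keeps its `source` and `path` but gets a new `version` or `etag` is reported as modified, while a renamed file shows up as one deleted row plus one added row, because the match key no longer lines up.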