-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding DataChain.diff(...)
#666
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #666 +/- ##
==========================================
+ Coverage 87.34% 87.44% +0.09%
==========================================
Files 113 114 +1
Lines 10796 10879 +83
Branches 1480 1496 +16
==========================================
+ Hits 9430 9513 +83
Misses 989 989
Partials 377 377
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Deploying datachain-documentation with
|
Latest commit: |
dc50a72
|
Status: | ✅ Deploy successful! |
Preview URL: | https://c92bf09f.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-636-datachain-diff.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change! Added a few comment.
Approving to not to block.
src/datachain/lib/dc.py
Outdated
modified: bool = True, | ||
unchanged: bool = False, | ||
status_col: Optional[str] = None, | ||
) -> "Self": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you can extract these 100 lines of logic outside of dc.py while keep the method with the docs here 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted to datachain.lib.diff.py
src/datachain/lib/dc.py
Outdated
num_statuses = sum(1 if s else 0 for s in [added, deleted, modified, unchanged]) | ||
|
||
if num_statuses > 1 and not status_col: | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not necessary. these columns are completely independent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Made status_col
optional even when there are multiple statuses requested.
src/datachain/lib/dc.py
Outdated
elif deleted and not any([added, modified, unchanged]): | ||
res = right_left_merge | ||
else: | ||
res = left_right_merge.union(right_left_merge) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please double check that it does not have duplicates after the joins 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't have any duplicates
src/datachain/lib/dc.py
Outdated
diff_cond = [] | ||
|
||
if added: | ||
added_cond = sqlalchemy.and_( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace all sqlalchamy function to our built-in functions.
It can look like this:
from datachain.func import _case, _if
l = dc1.mutate(ldiff=1)
r = dc2.mutate(rdiff=1)
dc_diff = (
l.join(r, on="id")
.mutate(d1=_case(_isnon(ldiff), "A", "D") )
.mutate(d2=_case(_isnon(rdiff), "A", "D") )
.mutate(diff=_case(d1!=d2, "A", "D") )
.select_except("d1", "d2", "ldiff", "rdiff")
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we talked, I will create a followup issue for this as improvement, not to block additional work on delta updated etc.
Adding new
DataChain.diff(...)
. method that returns difference between two datasets / chains.Result is new chain that has additional column with possible values:
A
,D
,M
,U
representing added, deleted, modified and unchanged row.Note that if only one "status" is asked, by setting proper flags, this additional column is not created as it would have only one value for all rows. Beside additional diff column, new chain has schema of the chain on which method was called.
Example usage: