Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding DataChain.diff(...) #666

Merged
merged 21 commits into from
Dec 15, 2024
Merged

Adding DataChain.diff(...) #666

merged 21 commits into from
Dec 15, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Dec 6, 2024

Adding new DataChain.diff(...). method that returns difference between two datasets / chains.
Result is new chain that has additional column with possible values: A, D, M, U representing added, deleted, modified and unchanged row.
Note that if only one "status" is asked, by setting proper flags, this additional column is not created as it would have only one value for all rows. Beside additional diff column, new chain has schema of the chain on which method was called.

Example usage:

diff = persons.diff(
      new_persons,
      on=["id"],
      right_on=["other_id"],
      compare=["name"],
      added=True,
      deleted=True,
      modified=True,
      unchanged=True,
      status_col="diff"
)

@ilongin ilongin marked this pull request as draft December 6, 2024 09:02
@ilongin ilongin linked an issue Dec 6, 2024 that may be closed by this pull request
Copy link

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 97.59036% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.44%. Comparing base (eaf0412) to head (dc50a72).

Files with missing lines Patch % Lines
src/datachain/lib/diff.py 97.46% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #666      +/-   ##
==========================================
+ Coverage   87.34%   87.44%   +0.09%     
==========================================
  Files         113      114       +1     
  Lines       10796    10879      +83     
  Branches     1480     1496      +16     
==========================================
+ Hits         9430     9513      +83     
  Misses        989      989              
  Partials      377      377              
Flag Coverage Δ
datachain 87.37% <97.59%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

cloudflare-workers-and-pages bot commented Dec 6, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: dc50a72
Status: ✅  Deploy successful!
Preview URL: https://c92bf09f.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-636-datachain-diff.datachain-documentation.pages.dev

View logs

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! Added a few comment.
Approving to not to block.

modified: bool = True,
unchanged: bool = False,
status_col: Optional[str] = None,
) -> "Self":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you can extract these 100 lines of logic outside of dc.py while keep the method with the docs here 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted to datachain.lib.diff.py

num_statuses = sum(1 if s else 0 for s in [added, deleted, modified, unchanged])

if num_statuses > 1 and not status_col:
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not necessary. these columns are completely independent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Made status_col optional even when there are multiple statuses requested.

elif deleted and not any([added, modified, unchanged]):
res = right_left_merge
else:
res = left_right_merge.union(right_left_merge)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please double check that it does not have duplicates after the joins 😅

Copy link
Contributor Author

@ilongin ilongin Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't have any duplicates

diff_cond = []

if added:
added_cond = sqlalchemy.and_(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace all sqlalchamy function to our built-in functions.
It can look like this:

from datachain.func import _case, _if

l = dc1.mutate(ldiff=1)
r = dc2.mutate(rdiff=1)

dc_diff = (
 l.join(r, on="id")
 .mutate(d1=_case(_isnon(ldiff), "A", "D") )
 .mutate(d2=_case(_isnon(rdiff), "A", "D") )
 .mutate(diff=_case(d1!=d2, "A", "D") )
 .select_except("d1", "d2", "ldiff", "rdiff")
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we talked, I will create a followup issue for this as improvement, not to block additional work on delta updated etc.

@ilongin ilongin marked this pull request as ready for review December 10, 2024 14:07
@ilongin ilongin merged commit 1279ab0 into main Dec 15, 2024
34 checks passed
@ilongin ilongin deleted the ilongin/636-datachain-diff branch December 15, 2024 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement DataChain.diff(...)
2 participants