Commit

added docs
ilongin committed Dec 17, 2024
1 parent 87694b2 commit 1d40f26
Showing 1 changed file with 52 additions and 0 deletions.
52 changes: 52 additions & 0 deletions src/datachain/toolkit/diff.py
@@ -24,6 +24,58 @@ def compare(
modified: bool = True,
unchanged: bool = False,
) -> dict[str, "DataChain"]:
"""Compares two chains by identifying rows that are added, deleted, modified,
or unchanged. The result is a new chain with an additional column whose possible
values are `A`, `D`, `M`, and `U`, representing added, deleted, modified, and
unchanged rows respectively. Note that if only one status is requested, by
setting the corresponding flags, this additional column is not created, as it
would hold the same value for every row. Apart from the additional diff column,
the new chain has the schema of the chain on which the method was called.
Compares two chains and returns multiple chains, one for each of the `added`,
`deleted`, `modified`, and `unchanged` statuses. The result is a dictionary in
which each item holds the chain for one status, keyed by `A`, `D`, `M`, or `U`
respectively. Note that the status column is not present in the resulting
chains.
Parameters:
left: Chain to calculate diff on.
right: Chain to calculate diff from.
on: Column or list of columns to match on. If both chains have the
same columns, this parameter is enough for the match. Otherwise, the
`right_on` parameter has to specify the columns for the other chain.
This value is used to find the corresponding row in the other chain. If no
such row is found, the row is considered added (or deleted, in the opposite
direction); if it is found, the row is either modified or unchanged.
right_on: Optional column or list of columns
for the `right` chain to match on.
compare: Column or list of columns to compare on. If both chains have
the same columns, this parameter is enough for the comparison. Otherwise,
the `right_compare` parameter has to specify the columns for the other
chain. This value is used to decide whether a row is modified or unchanged.
If not set, all columns are used for the comparison.
right_compare: Optional column or list of columns
for the `right` chain to compare to.
added (bool): Whether to return a chain containing only added rows.
deleted (bool): Whether to return a chain containing only deleted rows.
modified (bool): Whether to return a chain containing only modified rows.
unchanged (bool): Whether to return a chain containing only unchanged rows.
Example:
```py
chains = compare(
    persons,
    new_persons,
    on=["id"],
    right_on=["other_id"],
    compare=["name"],
    added=True,
    deleted=True,
    modified=True,
    unchanged=True,
)
```
"""
from datachain.lib.diff import compare as chain_compare

status_col = "diff_" + "".join(
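The matching rules described in the docstring can be illustrated without DataChain at all. The following is a minimal sketch over plain lists of dictionaries, not the library's implementation: `compare_rows` is a hypothetical helper, the `added`/`deleted`/`modified`/`unchanged` flags are omitted for brevity, and two details are assumptions rather than documented behavior (deleted rows are returned with the right side's columns, and when `compare` is not given, every non-key column of the left row is compared under the same name on the right).

```python
from typing import Any, Optional

Row = dict[str, Any]


def compare_rows(
    left: list[Row],
    right: list[Row],
    on: list[str],
    right_on: Optional[list[str]] = None,
    compare: Optional[list[str]] = None,
    right_compare: Optional[list[str]] = None,
) -> dict[str, list[Row]]:
    """Hypothetical helper: split rows into added/deleted/modified/unchanged."""
    right_on = right_on or on

    def values(row: Row, cols: list[str]) -> tuple:
        # Tuple of the row's values for the given columns (must be hashable).
        return tuple(row[c] for c in cols)

    # Index the right side by its match key; collect the left side's keys.
    right_by_key = {values(r, right_on): r for r in right}
    left_keys = {values(row, on) for row in left}

    result: dict[str, list[Row]] = {"A": [], "D": [], "M": [], "U": []}
    for row in left:
        match = right_by_key.get(values(row, on))
        if match is None:
            # Present on the left only: added.
            result["A"].append(row)
            continue
        # Assumption: with no `compare`, use all non-key columns of the left row.
        cols = compare or [c for c in row if c not in on]
        other_cols = right_compare or cols
        same = values(row, cols) == values(match, other_cols)
        result["U" if same else "M"].append(row)

    # Present on the right only: deleted (kept with the right side's columns).
    result["D"] = [r for r in right if values(r, right_on) not in left_keys]
    return result


persons = [
    {"id": 1, "name": "John"},
    {"id": 2, "name": "Doe"},
    {"id": 4, "name": "Andy"},
]
new_persons = [
    {"other_id": 1, "name": "John"},
    {"other_id": 2, "name": "Dana"},
    {"other_id": 3, "name": "Mark"},
]

chains = compare_rows(
    persons, new_persons, on=["id"], right_on=["other_id"], compare=["name"]
)
```

With this sample data, `chains["A"]` holds the row with `id` 4, `chains["D"]` the row with `other_id` 3, `chains["M"]` the row with `id` 2, and `chains["U"]` the row with `id` 1. None of the returned rows carries a status column, which matches the note in the docstring.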