Implement DataChain.diff(...)
#636
Comments
[Q] @dmpetrov is using |
@ilongin great summary, thank you for scoping it!
A couple of additions:
You are right. We need:
I'd suggest starting from (3) and implementing (1)-(2) as a special case. (4) is needed only if it's easy to implement 🙂 So the signature should be like:
def diff(self,
    other: "DataChain",
    added: bool = True,
    deleted: bool = True,
    changed: bool = True,
    unchanged: bool = False,
    on_file: str = "file",
    right_on_file: Optional[str] = None,
    on: Union[str, Sequence[str]] = None,
    right_on: Union[str, Sequence[str]] = None,
    status_col: Optional[str] = None,
) -> "Self": |
One core requirement: it has to be executed by a single merge/join command (full outer join). It's required for 2 reasons:
It's OK to add custom columns before the join using cheap |
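To make the "single full outer join" requirement concrete, here is a minimal pure-Python sketch over lists of dict rows. It is not DataChain code: `full_outer_join`, `diff_rows`, and the `compare` parameter are hypothetical names used only for illustration, and the status labels assume that "added" means a row present only in the chain the method is called on (the convention from the issue description).

```python
from typing import Any, Dict, Iterator, List, Optional, Sequence, Tuple

Row = Dict[str, Any]


def full_outer_join(
    left: List[Row], right: List[Row], on: Sequence[str]
) -> Iterator[Tuple[Optional[Row], Optional[Row]]]:
    """Yield one (left_row, right_row) pair per key; a missing side is None."""
    def key(row: Row) -> tuple:
        return tuple(row[c] for c in on)

    left_by_key = {key(r): r for r in left}
    right_by_key = {key(r): r for r in right}
    # Keys from the left side first, then keys that exist only on the right.
    for k, l_row in left_by_key.items():
        yield l_row, right_by_key.get(k)
    for k, r_row in right_by_key.items():
        if k not in left_by_key:
            yield None, r_row


def diff_rows(
    left: List[Row],
    right: List[Row],
    on: Sequence[str],
    compare: Sequence[str],
    added: bool = True,
    deleted: bool = True,
    changed: bool = True,
    unchanged: bool = False,
    status_col: str = "diff",
) -> List[Row]:
    """Classify every joined pair in one pass and keep the requested statuses."""
    wanted = {"added": added, "deleted": deleted,
              "changed": changed, "unchanged": unchanged}
    result = []
    for l_row, r_row in full_outer_join(left, right, on):
        if r_row is None:
            status = "added"      # assumption: only in the left (calling) side
        elif l_row is None:
            status = "deleted"    # assumption: only in the right (other) side
        elif any(l_row[c] != r_row[c] for c in compare):
            status = "changed"
        else:
            status = "unchanged"
        if wanted[status]:
            row = dict(l_row if l_row is not None else r_row)
            row[status_col] = status
            result.append(row)
    return result


old = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
new = [{"id": 2, "name": "b"}, {"id": 3, "name": "C"}, {"id": 4, "name": "d"}]
for row in diff_rows(new, old, on=["id"], compare=["name"]):
    print(row)
```

All four statuses fall out of the same joined pair stream, so the `added` / `deleted` / `changed` / `unchanged` flags become a filter on the join result rather than separate queries.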
Updated specification. Output is not needed by default and should be |
@dmpetrov qq just to double check something. What is the exact definition of modified ( My understanding (correct me if I'm wrong):
Am I missing / misunderstanding anything, or is this correct? |
@ilongin we should not compare additional columns - only file columns. For file diff:
Note, PS: we need to think about how to handle non-latest files |
@dmpetrov ok, thanks, that is when
What should be the output (table) of these commands? |
Good question. It should be about changes in For example:
The schema change does not affect the diff:
@ilongin WDYT? |
@dmpetrov I think I'm just not sure about the results of rows with id
Do you agree with it? Regarding which schema to keep, I think it should be the left one (the one on which the method was called)
Honestly I'm not 100% sure about this. My first thought was to consider them as |
@ilongin I thought a bit more about this - yes, you are right. We should compare column values as well. If user needs only files - they filter out everything but file.
ha... it's just a matter of convention. Do we expect |
@dmpetrov so just to double check, WDYT on schema changes (adding / removing signals) and how that affects picking |
I think I got your point... The challenge is that in the majority of the cases that we know today, users need file diff only (match on We should probably implement general diff first as you are suggesting (without In this case, I'd keep @ilongin WDYT? |
@dmpetrov yes, I think it's a good idea to split those into two methods and make file diff just a simple wrapper around general diff. I'm just not sure about naming. Maybe I would keep the general one as So the new signatures would be:
Let me know if this makes sense. |
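A rough sketch of the "file diff as a thin wrapper around general diff" idea, using plain dict rows. The names `general_diff` and `file_diff`, the dotted column notation, and the choice to match on `source` + `path` while comparing `version` + `etag` are all assumptions made for illustration; the actual names and signatures are not settled in this thread.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def get(row: Row, dotted: str) -> Any:
    """Resolve a dotted column name such as 'file.path' in a nested row."""
    value: Any = row
    for part in dotted.split("."):
        value = value[part]
    return value


def general_diff(
    left: List[Row], right: List[Row], on: Sequence[str], compare: Sequence[str]
) -> List[Tuple[str, Row]]:
    """Generic diff: match rows on `on`, detect changes via `compare`."""
    def key(row: Row) -> tuple:
        return tuple(get(row, c) for c in on)

    right_by_key = {key(r): r for r in right}
    seen = set()
    result: List[Tuple[str, Row]] = []
    for l_row in left:
        k = key(l_row)
        seen.add(k)
        r_row = right_by_key.get(k)
        if r_row is None:
            result.append(("added", l_row))
        elif any(get(l_row, c) != get(r_row, c) for c in compare):
            result.append(("changed", l_row))
    for k, r_row in right_by_key.items():
        if k not in seen:
            result.append(("deleted", r_row))
    return result


def file_diff(left: List[Row], right: List[Row], file_col: str = "file"):
    """File diff: only fixes which columns the general diff matches and compares."""
    return general_diff(
        left,
        right,
        on=[f"{file_col}.source", f"{file_col}.path"],      # assumed match keys
        compare=[f"{file_col}.version", f"{file_col}.etag"],  # assumed change signal
    )
```

With this split, extra signals such as labels never affect the file diff, while users who do care about them can call the general method with their own `on` / `compare` columns.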
We should add a new method to DataChain with signature:
The method should return a new DataChain instance having the same schema as the instance on which the method was called, but with one additional column called sys__diff (or something similar, TBD). That additional column can have 3 values in it:
A -> this row is in the first chain, but not in the other
D -> this row is in the second chain, but not in the first
M -> this row is in both chains, but different / modified
We should look at file signals (the file_obj argument is for determining where to find the file signal) and compare their "hash" functions to find out those 3 values of diff. The hash is calculated in the File object using columns:
source
path
version
etag
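As an illustration of the hash-based comparison, the sketch below derives a digest from the four listed columns and maps a pair of matched rows to `A`, `D`, or `M`. `FileSignal`, `file_hash`, and `diff_status` are hypothetical stand-ins, not the real `File` class or its hashing code, and the use of SHA-256 is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileSignal:
    """Hypothetical stand-in holding only the columns used for the hash."""
    source: str
    path: str
    version: str
    etag: str


def file_hash(f: FileSignal) -> str:
    # Join with a separator that cannot appear in the fields by accident,
    # so ("ab", "c") and ("a", "bc") do not collide.
    payload = "\x00".join((f.source, f.path, f.version, f.etag))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_status(first: Optional[FileSignal],
                second: Optional[FileSignal]) -> Optional[str]:
    """Return 'A', 'D', 'M', or None when both sides hash the same."""
    if first is None and second is None:
        raise ValueError("at least one side must be present")
    if second is None:
        return "A"  # in the first chain, but not in the other
    if first is None:
        return "D"  # in the second chain, but not in the first
    if file_hash(first) != file_hash(second):
        return "M"  # in both chains, but different
    return None
```

Comparing a single digest instead of four columns keeps the join condition small, at the cost of computing the hash for every row on both sides.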
As a follow-up, we should create a util function to return multiple DataChain instances, one each for the added, deleted and modified values of the diff column - it should wrap DataChain.diff() and just filter by the value of sys__diff.
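The follow-up helper could look roughly like the sketch below. It works on a list of dict rows standing in for the result of `DataChain.diff()`; `split_diff` is a hypothetical name, and a real version would filter chains rather than Python lists.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_diff(
    rows: List[Row], diff_col: str = "sys__diff"
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split diff output into (added, deleted, modified) by the status column."""
    groups: Dict[str, List[Row]] = {"A": [], "D": [], "M": []}
    for row in rows:
        status = row[diff_col]
        if status not in groups:
            raise ValueError(f"unexpected diff status: {status!r}")
        groups[status].append(row)
    return groups["A"], groups["D"], groups["M"]


diff_output = [
    {"path": "a.jpg", "sys__diff": "A"},
    {"path": "b.jpg", "sys__diff": "M"},
    {"path": "c.jpg", "sys__diff": "D"},
]
added, deleted, modified = split_diff(diff_output)
```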