Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataChain.diff(...) #636

Closed
ilongin opened this issue Nov 26, 2024 · 13 comments · Fixed by #666
Closed

Implement DataChain.diff(...) #636

ilongin opened this issue Nov 26, 2024 · 13 comments · Fixed by #666
Assignees
Labels
enhancement New feature or request

Comments

@ilongin
Copy link
Contributor

ilongin commented Nov 26, 2024

We should add new method to DataChain with signature:

def diff(self, other: "DataChain", file_obj: str, added: bool = True, deleted: bool = True, changed: bool = True) -> "Self":
    ....

Method should return new DataChain instance having the same schema as instance on which method was called, but with one additional column called sys__diff (or something similar, TBD). That additional column can have 3 values in it:

  1. A -> this row is in first chain, but not in the other
  2. D -> this row is in second chain, but not in the first
  3. M -> this row is in both chains, but different / modified

We should look at file signals (file_obj argument is for determining where to find file signal) and compare it's "hash" functions to find out those 3 values of diff. Hash is calculated in File object using columns:

  1. source
  2. path
  3. version
  4. etag

As a follow up, we should create util function to return multiple DataChain instances, each for added, deleted and modified value of diff column - it should wrap DataChain.diff() and just filter by value of sys__diff.

@ilongin ilongin self-assigned this Nov 26, 2024
@ilongin
Copy link
Contributor Author

ilongin commented Nov 26, 2024

[Q] @dmpetrov is using File subobject to determine diff too limited? In current implementation of subtract we can add any arbitrary combination of columns on which to calculate equality. Maybe we can start with File and later on add ability for any columns?

@ilongin ilongin added the enhancement New feature or request label Nov 26, 2024
@dmpetrov
Copy link
Member

dmpetrov commented Nov 26, 2024

@ilongin great summary, thank you for scoping it!

def diff(self, other: "DataChain", file_obj: str, added: bool = True, deleted: bool = True, changed: bool = True) -> "Self":

A couple of additions:

  • We also need unchanged: bool=False (the only False by default)
  • In additional to A/D/M statuses we need something like U - unchanged
  • Default value for file_obj: str = "file",
  • Please do not use abstract terms like obj: file_obj --> file_col or on_file which is better
  • sys__diff should be status_col: Optional[str]=None with None by default since in many cases it's defined by params: ds_new_only = ds.diff(other, deleted=False, changed=False). When user needs to distinguish statuses: ds.diff(other, status_col="diff_status")
    • Also, please do not use __ for signal naming - it's implementation detail - use sys.diff instead.

[Q] @dmpetrov is using File subobject to determine diff too limited?

You are right. We need:

  1. on_file: str="file"
  2. right_on_file: Optional[str]=None (default value means - using the on_filefor right as well). So, user can use ds.diff(... on_file="file", right_on_file="laion.file") or ds.diff(... on_file="laion.file")
  3. on: Union[str, Sequence[str]] = None, right_on: Union[str, Sequence[str]] = None as a more general way of diff-ing. on_file has to be ignored if it's set.
  4. (optional) if all set to None - it can return a full diff on all columns ds.diff(other, on=None)

I'd suggest start from (3) and implement (1)-(2) as a special case. (4) is needed only if it's easy to implement 🙂

So the signature should be like:

def diff(self,
  other: "DataChain",
  added: bool = True,
  deleted: bool = True,
  changed: bool = True,
  unchanged: bool = False,
  on_file: str = "file",
  right_on_file: Optional[str] = None,
  on: Union[str, Sequence[str]] = None,
  right_on: Union[str, Sequence[str]] = None,
  status_col: Optional[str]=None,
) -> "Self":

@dmpetrov
Copy link
Member

One core requirement: it has to be executed by a single merge/join command (full outer join). It's required for 2 reasons:

  1. Efficiency - diff-ing buckets with billions of files.
  2. Feature completion - we need to make sure our API allows these use cases.

It ok to add custom columns before the join using cheap mutate() if needed - ds.mutate(custome=1). This is a standard trick to set a not None values and check result of outer-join if it gets None after the join. However, these have to be clean up after the join - user should not see these.

@dmpetrov
Copy link
Member

Updated specification. Output is not needed by default and should be status_col instead: output="diff" --> status_col=Optional[str]=None

@ilongin
Copy link
Contributor Author

ilongin commented Dec 3, 2024

@dmpetrov qq just do double check something. What is the exact definition of modified (M) and unchanged (U) scenarios?

My understanding (correct me if I'm wrong):
Let's say we have row with same id (e.g let's say we match by it) in both left and right datasets, then:

  1. If right has one additional column compared to left, it's automatically marked as M
  2. If left has one additional column compared to right, it's automatically marked as M
  3. If both left and right have the same columns but there is at least one difference in column values it is marked as M
  4. If both left and right have the same columns and all the values in all columns are the same - only then it is marked as U

Am I missing / missunderstanding anything or this is correct?

@dmpetrov
Copy link
Member

dmpetrov commented Dec 4, 2024

@ilongin we should not compare additional columns - only file columns source, path, version, etag (except special cases - all columns).

For file diff:

  • M - file with the same path exist in both datasets but version/etag do not match (I assume you work with is_latest file)
  • U - file with the same path exist in both datasets AND version/etag do match (I assume you work with is_latest file)

Note, id of record is not needed.

PS: we need to think how to handle not latest files not is_latest

@ilongin
Copy link
Contributor Author

ilongin commented Dec 4, 2024

@dmpetrov ok, thanks, that is when on_file is used. But what about general case (when on and optionally right_on are used). How to determine when the row is modified or unchanged? Let's say we have this example:

           ds1
            id  |   name    |
            ------------------
            1       John_new
            2       Doe 
            4       Andy

            ds2
            id  |   name    |
            -----------------
            1       John
            3       Mark
            4       Andy

           ds3
            id  |   name    |  city | 
            -----------------------------------
            1       John_new          London
            2       Doe               New York
            4       Andy              Tokyo

What should be the output (table) of this commands?
1, ds1.diff(ds2, on=["id"], status_column="diff")
2. ds1.diff(ds3, on=["id"], status_column="diff")

@dmpetrov
Copy link
Member

dmpetrov commented Dec 5, 2024

what about general case (when on and optionally right_on are used).

Good question. It should be about changes in on and right_on columns but not the others. The question is - what to keep in the output - left or right schema. Since we use "return new DataChain instance having the same schema as instance on which method was called" - we should keep schema of the dataset on which method was called.

For example:

1, ds1.diff(ds2, on=["id"], status_column="diff")

RESULT:
            id  |   name | diff |
            ----------------------
            1     John_new     M
            2     Doe          D
            3                  A
            4     Andy         U
  1. ds1.diff(ds3, on=["id"], status_column="diff")

The schema change does not affect the diff:

RESULT:
            id  |   name | diff |
            ----------------------
            1     John_new     U
            2     Doe          U
            4     Andy         U

@ilongin WDYT?

@ilongin
Copy link
Contributor Author

ilongin commented Dec 5, 2024

Good question. It should be about changes in on and right_on columns but not the others. The question is - what to keep in the output - left or right schema. Since we use "return new DataChain instance having the same schema as instance on which method was called" - we should keep schema of the dataset on which method was called.

For example:

1, ds1.diff(ds2, on=["id"], status_column="diff")

RESULT:
            id  |   name | diff |
            ----------------------
            1     John_new     M
            2     Doe          D
            3                  A
            4     Andy         U

@dmpetrov I think on and right_on columns are used only to match rows, but we should look to other columns to determine M or U and you actually did that in your resulting table by setting row with id of 1 and name John_new as M because you saw the name was changed.

I'm just not sure about the results of rows with id 1 and 2. I think row 2 (Doe) should be considered as A and not D as we are comparing what's added or removed in dataset on which method was called (ds1) compared to other dataset (ds2) and not the other way around. Also, I would maybe keep the value of columns in rows which are removed. So my solution is this:

        id  |   name    |   diff    |
        -----------------------------
        1       John1       M
        2       Doe         A
        3       Mark        D
        4       Andy        U

Do you agree with it?

Regarding which schema to keep, I think it should be left one (on which method was called)

  1. ds1.diff(ds3, on=["id"], status_column="diff")

The schema change does not affect the diff:

RESULT:
            id  |   name | diff |
            ----------------------
            1     John_new     U
            2     Doe          U
            4     Andy         U

@ilongin WDYT?

Honestly I'm not 100% sure about this. My fist thought was to consider them as M as something new was added or removed from the rows but maybe we can leave this to the user to decide by adding additional flag in arguments, e.g schema_change_sensitive = False

@ilongin ilongin linked a pull request Dec 6, 2024 that will close this issue
@dmpetrov
Copy link
Member

dmpetrov commented Dec 7, 2024

@ilongin I thought a bit more about this - yes, you are right. We should compare column values as well. If user needs only files - they filter out everything but file.

I think row 2 (Doe) should be considered as A and not D

ha... it's just a matter of convention. Do we expect delta = new.diff(old, ...) or delta = old.diff(new, ...). Please implement the way that you proposed. We can play a bit and decide if reverse logic is needed 🙂

@ilongin
Copy link
Contributor Author

ilongin commented Dec 7, 2024

@dmpetrov so just do double check, on_file and right_on_file will stay and we use combo of source + path to match rows, but we will use all other columns to see if row was M or U? ... This makes us have custom logic for File matching. Wondering if we can just leave this to the user though .. meaning, to only have on and right_on and if user want's to match on some File subobject it can do by specifying those fields , e.g ds1.diff(ds2, on=["some_obj.file.source", "some_obj.file.path"])

WDYT on schema changes (adding / removing signals) and how that affects picking M or U ? Do you think we should have that additional flag to let user decide about it, as I suggested, or not?

@dmpetrov
Copy link
Member

dmpetrov commented Dec 8, 2024

I think I got you point... The challenge is - in majority of the cases that we know today, users need file diff only (match on source+path and see if version+etag changed). This is why I didn't expect to take the other fields into account. The general diff is a bit more abstract case and it would be challenging for users to make it practical old.diff(new, on=("file.source", "file.path"), on_right=("laion.f1.source", "laion.f1.path")).

We should probably implement general diff first as you suggesting (without on_file). And then implement file specific diff.

In this case, I'd keep diff() name for file based method and name the general on as compare() and so.

@ilongin WDYT?

@ilongin
Copy link
Contributor Author

ilongin commented Dec 8, 2024

@dmpetrov yes, I think that's a good idea to split those into two method and make file diff just a simple wrapper around general diff.

I'm just not sure about naming. Maybe I would keep the general one as diff and name the file one just as file_diff as they are basically the same thing, just file one being more specific. Anyway, we can think about that along the way.

So the new signatures would be:

def diff(self,
  other: "DataChain",
  added: bool = True,
  deleted: bool = True,
  changed: bool = True,
  unchanged: bool = False,
  on: Union[str, Sequence[str]] = None,
  right_on: Union[str, Sequence[str]] = None,
  compare: Union[str, Sequence[str]] = None,  #  list of columns which to use to determine if row is modified or unchanged. If left None then all columns will be checked and row will be considered changed if any column is added or removed from other dataset.
  right_compare: Union[str, Sequence[str]] = None,
  status_col: Optional[str]=None,
) -> "Self":

def file_diff(self,
  other: "DataChain",
  added: bool = True,
  deleted: bool = True,
  changed: bool = True,
  unchanged: bool = False,
  on: Union[str, Sequence[str]] = None,  # file object on which to match
  right_on: Union[str, Sequence[str]] = None, 
  status_col: Optional[str]=None,
) -> "Self":

Let me know if this makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants