Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for using cell ID in diffing and merging #639

Merged
merged 35 commits into from
Nov 1, 2023

Conversation

vidartf
Copy link
Collaborator

@vidartf vidartf commented Nov 23, 2022

Fixes #553.

Previous cell ID efforts have been focused on having them not have nbdime fall over, and make sure nbdime doesn't drop them completely (#566). This PR makes nbdime start using the IDs to make better diffs! While the PR does a bunch of extra things as well (some optimization, some refactoring to have nbdime treat the id field as atomic, some test fixes, improving the union merge strategy), the main logical changes for the cell IDs are relatively small (the changes to compare_cell_X functions in nbdime/diffing/notebooks.py). To break it down:

  • In the strictest check, cells without an ID is never considered equal to cells with IDs.
  • In the two less-strict comparisons (moderate and approximate), it will consider cells as unequal if they both have ids and they differ. However, they will allow a cell without an ID to be considered similar to a cell with an ID. This allows:
    • Commits where cell IDs are added to notebooks that previously didn't have it should still be able to match cells.
    • If a notebook ends up with only some cells having IDs (e.g. during a partial merge or rebase), and then IDs are added to those cells later, then in the first comparison pass (strict) the existing cells that all had IDs are first matched to each other, and only in the next passes are the cells that gained IDs in the current change compared with their old ID-less versions.

I want to run some extra manual tests on the change of the differ for the source field, so leaving as draft for now. We also need to have a way to surface the cell ID in the web apps. Maybe include it in the [N] input prompt? This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

@vidartf vidartf closed this May 10, 2023
@vidartf vidartf reopened this May 10, 2023
@krassowski
Copy link
Member

We also need to have a way to surface the cell ID in the web apps. Maybe include it in the [N] input prompt?

Markdown and raw cells do not have execution count prompts:

Screenshot from 2023-06-05 00-17-51

The cell IDs tend to be long and non-user-readable. The would likely fit next to the prompt:

Screenshot from 2023-06-05 00-24-29

Is this what you had in mind? I can send a PR against this one.

What would you say about only showing those when hovering over the cell to reduce visual clutter?

This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

Another common habit which will lead to this is when working with e.g. visualisation, if a user duplicates the cell, modifies the code, runs the other cell, compares visualisation results and depending on satisfaction removes one or the other.

I was wondering whether it should be:

if cell_a_id == cell_b_id:
    return True

rather than:

return cell_a_id == cell_b_id

I guess this would have much less of a performance benefit (i.e. only benefit if notebook structure is largely unchanged).

@krassowski
Copy link
Member

I opened vidartf#3 implementing the UI changes exposing the cell IDs. I think that these could be improved iteratively later.

I had also measured the execution time to understand how much improvement this PR brings. In a notebook of 1000 code cells (10 unique cells duplicated 100 times), benchmarked with hyperfine using nbdime diff path/to/notebook.ipynb:

Scenario Before (mean ± σ) After (mean ± σ) Speedup
no changes 1.133 s ± 0.056 s 460.2 ms ± 17.7 ms x 2.5
content of last cell changed 10.442 s ± 0.442 s 2.486 s ± 0.060 s x 4.2
content of first cell changed 10.717 s ± 0.843 s 2.448 s ± 0.093 s x 4.4
last cell added 10.735 s ± 0.748 s 2.398 s ± 0.159 s x 4.5
first cell added 13.872 s ± 1.710 s 2.311 s ± 0.254 s x 6.0
cells randomly resuffled 16.925 s ± 0.232 s 2.555 s ± 0.053 s x 6.6
content of every cell changed 387.859 s ± 18.706 s 2.664 s ± 0.078 s x 145.6

This is pretty impressive, especially for the case of a notebook where content of each cell was modified.

@krassowski
Copy link
Member

This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

With vidartf#3 this case would be seen as:

Screenshot from 2023-06-10 20-48-42

@vidartf
Copy link
Collaborator Author

vidartf commented Jul 7, 2023

@krassowski I changed how ID comparison is done now. Now it simply uses ID equality as the first pass (any cells that have the same ID will be considered equal, and used for the first pass of creating coherent snakes). It should be more consistent with current behavior when diffing/merging notebooks without IDs (or partial transitions). I'm pretty confident that this should not affect the performance for notebooks with cell IDs, but it might affect performance for mixed notebooks (not sure if positively or negatively).

Note that both the current and previous implementation allows for cells to be considered equal even if they have different ID fields. @minrk suggested at JupyterCon that this should be safe in the case where no cell IDs match each on either side, but it might not for other scenarios, so maybe that should be the only scenario where the non-ID comparison is invoked? I.e. if a base notebook with IDs branch, and two sides each add the same exact cell other than the ID, nbdime should not consider them equal. Formulated as code:

if {a.id for a in notebook_A} & {b.id for b in notebook_B}:  # check for intersection of IDs
    notebook_predicates['cells'] = [ compare_cell_by_ids ]  # only ID comparison is considered
else:
    notebook_predicates['cells'] = [  # IDs have no impact here, so ignore and fall back on old implementation
        compare_cell_approximate,
        compare_cell_moderate,
        compare_cell_strict,
    ]

I would be very happy to have opinions on either of the above points!

@vidartf
Copy link
Collaborator Author

vidartf commented Jul 7, 2023

This will be useful e.g. if a user has removed an empty cell at the bottom of a notebook, then automatically readded it by executing the last cell with Shift + Enter. They will look identical in the web UI, but nbdime will now insist that an empty cell was removed, and a new one added.

Another common habit which will lead to this is when working with e.g. visualisation, if a user duplicates the cell, modifies the code, runs the other cell, compares visualisation results and depending on satisfaction removes one or the other.

These arguments are probably why I would tilt for the current behavior of allowing non-ID identical cells to match, but there might be counter-argument that I haven't considered(?).

@vidartf vidartf marked this pull request as ready for review July 7, 2023 18:23
@krassowski
Copy link
Member

I'm pretty confident that this should not affect the performance for notebooks with cell IDs

For the most part this is true, except for not very realistic scenario of a notebook with cells randomly reshuffled, in which case there is still a performance boost by a factor of two (rather than six as before).

Scenario Before (mean ± σ) After (mean ± σ) Speedup
no changes 1.133 s ± 0.056 s 469.5 ms ± 14.3 ms x 2.4
content of last cell changed 10.442 s ± 0.442 s 2.357 s ± 0.027 s x 4.4
content of first cell changed 10.717 s ± 0.843 s 2.371 s ± 0.071 x 4.5
last cell added 10.735 s ± 0.748 s 2.404 s ± 0.089 s x 4.5
first cell added 13.872 s ± 1.710 s 2.391 s ± 0.091 s x 5.8
cells randomly resuffled 16.925 s ± 0.232 s 8.568 s ± 0.075 s x 2.0
content of every cell changed 387.859 s ± 18.706 s 2.558 s ± 0.107 s x 151.6

Copy link
Member

@krassowski krassowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether there is anything else you would like to do here, but it seems good to go for me. If you believe this needs further user testing, maybe it would be a good idea to merge it together with JupyterLab 4.0 port and release an alpha/beta for more users to test out?

@vidartf
Copy link
Collaborator Author

vidartf commented Sep 8, 2023

Not sure whether there is anything else you would like to do here, but it seems good to go for me. If you believe this needs further user testing, maybe it would be a good idea to merge it together with JupyterLab 4.0 port and release an alpha/beta for more users to test out?

I was waiting on input on vidartf#3, but I can merge and iterate on that PR.

@fcollonval
Copy link
Collaborator

bot please update playwright snapshots

@fcollonval
Copy link
Collaborator

This works for notebook diff (tested on example8) but not on the merge view (tested on example8).

I don't have time to address the merge case.

@github-actions
Copy link
Contributor

Playwright windows-latest snapshots updated.

@krassowski
Copy link
Member

So this PR does not solve the pre-existing problem with merge (#690) but still provides a notable performance benefit. In my opinion we should merge it and include in the upcoming release.

Any objections @vidartf?

An object that encapsulates differs and predicates, and also the new "is_atomic", which is added so that we can mark e.g. cell IDs as atomic. DiffConfig could potentially also be used in the future to  better config notebook ignores etc?
When comparing identical sources we can avoid a full diff by a quick equality check.
Avoid recursing for string mimetypes that are identical
In the strictest check, cells without ID is never considered equal to cells with IDs, but they can be in the two less-strict comparisons. I.e. if a notebook ends up with only some cells having IDs, and then IDs are added to those cells later, then in the first pass the existing cells that all had IDs are first matched to each other, and only in the next passes are the cells that gained IDs compared with their old ID-less versions.
Always use line based diffing for source.
Enable previously not working test
@vidartf
Copy link
Collaborator Author

vidartf commented Oct 27, 2023

Note: While I have a fix for the failure locally, we still don't have a way to visualize a conflict on cell IDs. The changes in vidartf#3 only display the change in the diff case. I can add a placeholder for now: "conflict on cell ID, use raw text editor to resolve".

@krassowski
Copy link
Member

Once you push your changes I can thread the cell IDs in the merge case too, though I am not sure on the design.

@vidartf
Copy link
Collaborator Author

vidartf commented Oct 27, 2023

@krassowski pushed changes here, I expect some UI test will need re-rendering, but I will let you have a look at improvements first (or maybe in follow-up PR, as branches here seem to be getting rebased, so new PRs are probably less messy).

Adds tests for the recently added ability to output merge decisions as raw JSON. As the tests indicate, this is a breaking behavior, but this is a major version release, so should be ok.
@vidartf
Copy link
Collaborator Author

vidartf commented Oct 30, 2023

bot please update playwright snapshots

@github-actions
Copy link
Contributor

Playwright windows-latest snapshots updated.

@github-actions
Copy link
Contributor

Playwright ubuntu-22.04 snapshots updated.

@vidartf vidartf closed this Oct 30, 2023
@vidartf vidartf reopened this Oct 30, 2023
This is needed to have download test work (assumes no conflicts).
@vidartf
Copy link
Collaborator Author

vidartf commented Nov 1, 2023

bot please update playwright snapshots

Copy link
Contributor

github-actions bot commented Nov 1, 2023

Playwright ubuntu-22.04 snapshots updated.

Copy link
Contributor

github-actions bot commented Nov 1, 2023

Playwright windows-latest snapshots updated.

Click should only register as final after mouse up with primary button.
@vidartf vidartf merged commit c1ea9b9 into jupyter:master Nov 1, 2023
14 checks passed
@vidartf
Copy link
Collaborator Author

vidartf commented Nov 1, 2023

@krassowski merged. Let's do any further improvement for cell ID conflicts display/resolution in a follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support cell IDs
3 participants