Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve data serialization #483

Merged
merged 5 commits into from
Mar 21, 2024

Conversation

martinRenou
Copy link
Member

@martinRenou martinRenou commented Feb 28, 2024

Make ipydatagrid more performant, achieving two things:

  • binary buffers usage in data serialization, this improves the communication between the back-end and the front-end quite a lot. (e.g. a million cells datagrid used to take 6 seconds to show up with the old approach with a local jupyter server on my laptop, it now takes half a second).
  • reducing memory footprint in the front-end by improving the data structure.

What's remaining to make the PR ready to review:

  • update JS test code
  • update Python test code
  • cell editing seems broken, needs data serialization in the front-end and deserialization in the back-end
  • filtering transform is not completely done yet
  • Fix support for heterogeneous data types in columns (do not use binary buffers in that case)

In follow up PRs, the next items should be resolved:

  • use binary buffer for _visible_rows attribute
  • use binary buffer for schema and fields attribute?
  • improve the transforms/view logic to prevent making any copies of the original data. Views should be a way to "view" the original data, it shouldn't make any copy of the original one as much as it can.

js/datagrid.ts Outdated Show resolved Hide resolved
@ianthomas23
Copy link
Contributor

I have tried this locally and I see the same dramatic speed improvements. It would be good to continue with this as it will be a good basis for experiments in filtering and sorting on the backend that I'd like to look at.

@paddymul
Copy link

I have been working just this week to better understand binary serialization from pandas through ipywidgets to js. I think I'm going to use arrow-js. I'm hoping to publish a very rough early repo later today.

I'm currently fleshing out a simple IPYWidget library that lets me prototype simple examples, and it will be easier to collaborate with other people since it's a simple library.

Trevor Manz and Kyle Barron have been doing work in this space too.

I'd love to collaborate with others on this.

@paddymul
Copy link

FWIW I just pushed the first commits to the serialization playground df_cereal
https://github.com/paddymul/df_cereal

I have examples of arrow-js serialization working entirely in js.
I currently can't get the python side to work to communicate bytes or base64 to JS

Benchmarks and more docs coming soon.

BTW I looked at what bqplot is doing. I suspect arrow based serialization will be much faster since it doesn't deal with json at all.

@martinRenou
Copy link
Member Author

Thank you for reaching out @paddymul. This looks interesting!

will be much faster

I'm a tiny bit skeptical about this. The JSON message bqplot sends is minimal in the end.

I feel like we should go ahead with this PR once it's passing all tests. Then I'm 💯 to continue discussing on having a common place for having better binary serialization that we can use across widgets. I don't like depending on bqplot for this, but it was already a dependency for some reason (probably some legacy dependency due to removed code) so it's convenient to just use it for now.

@martinRenou martinRenou force-pushed the binary_buffers branch 10 times, most recently from 780434e to 18cd5c2 Compare March 20, 2024 09:06
@martinRenou martinRenou requested a review from ianthomas23 March 20, 2024 09:15
@martinRenou martinRenou marked this pull request as ready for review March 20, 2024 09:15
Signed-off-by: martinRenou <martin.renou@gmail.com>
Signed-off-by: martinRenou <martin.renou@gmail.com>
@@ -852,9 +896,10 @@ def _get_row_index_of_primary_key(self, value):
"as the primary key."
)

# TODO Is there a better way for this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative way of doing this is

df = self._data["data"]
row_indices = df.index[(df[key] == value).all(axis="columns")].to_list()

which supports key and value being lists of more than one item. But we need to be careful to support the dataframe index being something other than integers starting at zero, so I think a better approach is

df = self._data["data"]
row_indices = pd.RangeIndex(len(df))[(df[key] == value).all(axis="columns")].to_list() 

I expect this to be significantly faster for large problems as the iteration occurs in pandas (C/Cython) rather than Python here.

Do you know if there are tests of this function for dataframes with non-integer keys?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I added it

Do you know if there are tests of this function for dataframes with non-integer keys?

As far as I understand, the primary key will always be integers, we're always using integers for indexing under the hood even though the user would provide a dataframe using another type for indexing.

And indeed your suggestion seems to work with string indexes:
Screenshot from 2024-03-21 11-12-13

Signed-off-by: martinRenou <martin.renou@gmail.com>
Signed-off-by: martinRenou <martin.renou@gmail.com>
@martinRenou martinRenou requested review from ianthomas23, gaborbernat and kaiayoung and removed request for ianthomas23 March 21, 2024 10:22
Copy link
Contributor

@ianthomas23 ianthomas23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@gaborbernat gaborbernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaborbernat gaborbernat merged commit 825a02a into jupyter-widgets:main Mar 21, 2024
12 checks passed
@martinRenou martinRenou deleted the binary_buffers branch March 21, 2024 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants