Improve data serialization #483

martinRenou · 2024-02-28T12:23:10Z

Make ipydatagrid more performant, achieving two things:

Use binary buffers in data serialization. This improves the communication between the back-end and the front-end quite a lot (e.g. a datagrid with a million cells used to take 6 seconds to show up with the old approach with a local Jupyter server on my laptop; it now takes half a second).
Reduce the memory footprint in the front-end by improving the data structure.
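
To illustrate why binary buffers help, here is a minimal sketch comparing a JSON-encoded numeric column with the same column packed as raw bytes. It is not ipydatagrid's actual serializer: it uses only the Python standard library, and `pack_column` and `unpack_column` are hypothetical helper names introduced for illustration.

```python
import json
from array import array


def pack_column(values, typecode="d"):
    """Pack a homogeneous numeric column into raw bytes plus small metadata."""
    buf = array(typecode, values)
    meta = {"typecode": typecode, "length": len(buf)}
    return meta, buf.tobytes()


def unpack_column(meta, payload):
    """Rebuild the column from the metadata and the raw bytes."""
    buf = array(meta["typecode"])
    buf.frombytes(payload)
    return buf.tolist()


# Hypothetical column of 100,000 floats
column = [i * 0.123456789 for i in range(100_000)]

as_json = json.dumps(column).encode("utf-8")
meta, as_bytes = pack_column(column)

# The binary payload is a fixed 8 bytes per value; the JSON text is larger
print(len(as_json), len(as_bytes))

# The round trip is lossless
assert unpack_column(meta, as_bytes) == column
```

Besides the smaller payload, the receiving side can wrap the bytes in a typed array instead of parsing text, which is likely where much of the time is saved.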

What's remaining to make the PR ready to review:

update JS test code
update Python test code
cell editing seems broken, needs data serialization in the front-end and deserialization in the back-end
filtering transform is not completely done yet
Fix support for heterogeneous data types in columns (do not use binary buffers in that case)
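
For the heterogeneous-column item above, one possible fallback can be sketched as follows. This is an assumption about how such a check could look, not the code in this PR: `serialize_column` is a hypothetical helper that packs a column as bytes only when every value has the same numeric type, and otherwise leaves it as a plain list for JSON.

```python
from array import array


def serialize_column(values):
    """Pick an encoding for one column (hypothetical helper, for illustration)."""
    values = list(values)
    if values and all(type(v) is float for v in values):
        # Homogeneous floats: pack as 64-bit doubles
        return {"encoding": "binary", "typecode": "d", "data": array("d", values).tobytes()}
    if values and all(type(v) is int for v in values):
        # Homogeneous ints: pack as signed 64-bit (assumes every value fits)
        return {"encoding": "binary", "typecode": "q", "data": array("q", values).tobytes()}
    # Mixed or non-numeric values: fall back to a plain JSON-friendly list
    return {"encoding": "json", "data": values}


print(serialize_column([1.5, 2.5])["encoding"])        # binary
print(serialize_column([1, "two", None])["encoding"])  # json
```

With pandas, the equivalent check would probably be on the column dtype (an `object` dtype column being the heterogeneous case), but that is left out here to keep the sketch dependency-free.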

In follow-up PRs, the following items should be addressed:

use binary buffer for _visible_rows attribute
use binary buffer for schema and fields attribute?
improve the transforms/view logic to avoid copying the original data. A view should be a way to "view" the original data, and it should avoid copying it wherever possible.
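
The copy-free view idea in the last item can be sketched as follows. This is only an illustration under stated assumptions, not the planned design: `RowView` is a hypothetical class that keeps a shared reference to the column data and stores only a list of row indices, so filtering and sorting produce new index lists rather than new copies of the data.

```python
class RowView:
    """A read-only view over column data that stores row indices, not copies."""

    def __init__(self, columns, row_indices=None):
        self._columns = columns  # shared reference, never copied
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self._rows = list(range(n_rows)) if row_indices is None else row_indices

    def filter(self, column, predicate):
        """Return a new view keeping only rows whose value passes the predicate."""
        rows = [r for r in self._rows if predicate(self._columns[column][r])]
        return RowView(self._columns, rows)

    def sort(self, column, descending=False):
        """Return a new view with the same rows, reordered by one column."""
        rows = sorted(self._rows, key=lambda r: self._columns[column][r], reverse=descending)
        return RowView(self._columns, rows)

    def cell(self, view_row, column):
        """Look up a cell by its position in the view."""
        return self._columns[column][self._rows[view_row]]

    def __len__(self):
        return len(self._rows)


data = {"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 7]}
view = RowView(data).filter("score", lambda s: s > 2).sort("score", descending=True)
print([view.cell(i, "name") for i in range(len(view))])  # ['b', 'd', 'a']
```

Each transform costs one integer per visible row, while the cell values themselves stay in a single place.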

ianthomas23 · 2024-03-14T14:31:24Z

I have tried this locally and I see the same dramatic speed improvements. It would be good to continue with this as it will be a good basis for experiments in filtering and sorting on the backend that I'd like to look at.

paddymul · 2024-03-14T14:50:13Z

I have been working just this week to better understand binary serialization from pandas through ipywidgets to js. I think I'm going to use arrow-js. I'm hoping to publish a very rough early repo later today.

I'm currently fleshing out a simple IPYWidget library that lets me prototype simple examples, and it will be easier to collaborate with other people since it's a simple library.

Trevor Manz and Kyle Barron have been doing work in this space too.

I'd love to collaborate with others on this.

paddymul · 2024-03-14T19:24:19Z

FWIW I just pushed the first commits to the serialization playground df_cereal
https://github.com/paddymul/df_cereal

I have examples of arrow-js serialization working entirely in JS.
I currently can't get the Python side to send bytes or base64 to JS.

Benchmarks and more docs coming soon.

BTW I looked at what bqplot is doing. I suspect arrow based serialization will be much faster since it doesn't deal with json at all.

martinRenou · 2024-03-15T08:36:38Z

Thank you for reaching out @paddymul. This looks interesting!

> will be much faster

I'm a tiny bit skeptical about this. The JSON message bqplot sends is minimal in the end.

I feel like we should go ahead with this PR once it's passing all tests. Then I'm 💯 for continuing the discussion on a common place for better binary serialization that we can use across widgets. I don't like depending on bqplot for this, but it was already a dependency for some reason (probably a legacy dependency left over from removed code), so it's convenient to just use it for now.

ianthomas23 · 2024-03-20T16:41:13Z

ipydatagrid/datagrid.py

@@ -852,9 +896,10 @@ def _get_row_index_of_primary_key(self, value):
                "as the primary key."
            )

+        # TODO Is there a better way for this?


An alternative way of doing this is

    df = self._data["data"]
    row_indices = df.index[(df[key] == value).all(axis="columns")].to_list()

which supports key and value being lists of more than one item. But we need to be careful to support the dataframe index being something other than integers starting at zero, so I think a better approach is

    df = self._data["data"]
    row_indices = pd.RangeIndex(len(df))[(df[key] == value).all(axis="columns")].to_list()

I expect this to be significantly faster for large problems as the iteration occurs in pandas (C/Cython) rather than Python here.
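
The difference between the two approaches can be seen in a minimal sketch, assuming a small hypothetical dataframe with a string index and a two-column key (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical dataframe with a non-integer (string) index
df = pd.DataFrame(
    {
        "region_id": [1, 2, 1],
        "year": [2020, 2020, 2021],
        "value": [1.0, 2.0, 3.0],
    },
    index=["a", "b", "c"],
)

key = ["region_id", "year"]
value = [1, 2021]

# True for rows where every key column matches the corresponding value
mask = (df[key] == value).all(axis="columns")

# Label-based result: depends on the dataframe's own index
labels = df.index[mask].to_list()

# Position-based result: always integers starting at zero
positions = pd.RangeIndex(len(df))[mask].to_list()

print(labels)     # ['c']
print(positions)  # [2]
```

The first form returns index labels, so it only yields row numbers when the index happens to be the default integer range; the `RangeIndex` form yields positions regardless of the index type.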

Do you know if there are tests of this function for dataframes with non-integer keys?

martinRenou · 2024-03-21

Thanks for the suggestion! I added it.

> Do you know if there are tests of this function for dataframes with non-integer keys?

As far as I understand, the primary key will always be integers: we always use integers for indexing under the hood, even if the user provides a dataframe that uses another type for its index.

And indeed your suggestion seems to work with string indexes:

ianthomas23

LGTM

martinRenou force-pushed the binary_buffers branch 10 times, most recently from 780434e to 18cd5c2 Compare March 20, 2024 09:06

martinRenou requested a review from ianthomas23 March 20, 2024 09:15

martinRenou marked this pull request as ready for review March 20, 2024 09:15

Make use of binary buffers to speed up data transfer

8db8ca8

Signed-off-by: martinRenou <martin.renou@gmail.com>

martinRenou force-pushed the binary_buffers branch from 18cd5c2 to 8db8ca8 Compare March 20, 2024 09:17

Lumino v1 compat

0c40490

Signed-off-by: martinRenou <martin.renou@gmail.com>

martinRenou force-pushed the binary_buffers branch from f025709 to 0c40490 Compare March 20, 2024 09:40

ianthomas23 reviewed Mar 20, 2024

martinRenou added 2 commits March 21, 2024 11:14

Revert test removal

8e8b498

Signed-off-by: martinRenou <martin.renou@gmail.com>

Review comment

7dc5da5

Signed-off-by: martinRenou <martin.renou@gmail.com>

martinRenou force-pushed the binary_buffers branch from 0d72d90 to 7dc5da5 Compare March 21, 2024 10:15

martinRenou requested review from ianthomas23, gaborbernat and kaiayoung and removed request for ianthomas23 March 21, 2024 10:22

ianthomas23 approved these changes Mar 21, 2024

gaborbernat approved these changes Mar 21, 2024

Merge branch 'main' into binary_buffers

b6bdbd1

gaborbernat merged commit 825a02a into jupyter-widgets:main Mar 21, 2024
12 checks passed

martinRenou deleted the binary_buffers branch March 21, 2024 17:35
