Improve data serialization #483
I have tried this locally and I see the same dramatic speed improvements. It would be good to continue with this, as it will be a good basis for the experiments in filtering and sorting on the backend that I'd like to look at.
I have been working just this week to better understand binary serialization from pandas through ipywidgets to JS. I think I'm going to use arrow-js. I'm hoping to publish a very rough early repo later today. I'm currently fleshing out a simple IPYWidget library that lets me prototype simple examples; since it's a simple library, it will also be easier to collaborate with other people. Trevor Manz and Kyle Barron have been doing work in this space too. I'd love to collaborate with others on this.
FWIW, I just pushed the first commits to the serialization playground, df_cereal. I have examples of arrow-js serialization working entirely in JS. Benchmarks and more docs are coming soon. BTW, I looked at what bqplot is doing. I suspect Arrow-based serialization will be much faster since it doesn't deal with JSON at all.
Thank you for reaching out @paddymul. This looks interesting!
I'm a tiny bit skeptical about this. The JSON message bqplot sends is minimal in the end. I feel like we should go ahead with this PR once it's passing all tests. Then I'm 💯 up for continuing the discussion about a common place for better binary serialization that we can use across widgets. I don't like depending on bqplot for this, but it was already a dependency for some reason (probably a legacy dependency left over from removed code), so it's convenient to just use it for now.
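To make the trade-off in this thread concrete, here is a minimal sketch of the two styles of payload: a column encoded entirely as JSON text, versus a small JSON header plus a raw binary buffer. The helper names (`to_json_payload`, `to_binary_payload`, `from_binary_payload`) are hypothetical, and this is not the code that bqplot, ipydatagrid, or arrow-js use; it only illustrates why the JSON part can stay minimal when the bulk of the data travels as bytes.

```python
import json

import numpy as np


def to_json_payload(values):
    """Encode every element as JSON text (hypothetical helper)."""
    return json.dumps(values.tolist()).encode("utf-8")


def to_binary_payload(values):
    """Split an array into a tiny JSON header and a raw byte buffer (hypothetical helper)."""
    header = json.dumps({"dtype": str(values.dtype), "shape": list(values.shape)})
    return header, values.tobytes()


def from_binary_payload(header, buffer):
    """Rebuild the array from the header and the raw bytes."""
    meta = json.loads(header)
    return np.frombuffer(buffer, dtype=meta["dtype"]).reshape(meta["shape"])


values = np.arange(1000, dtype="float64") / 7
header, buffer = to_binary_payload(values)
restored = from_binary_payload(header, buffer)
json_bytes = to_json_payload(values)
```

In this sketch the header only carries the dtype and shape, so its size does not grow with the number of rows, while the buffer is a fixed 8 bytes per float64 value.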
780434e to 18cd5c2
Signed-off-by: martinRenou <martin.renou@gmail.com>
18cd5c2 to 8db8ca8
Signed-off-by: martinRenou <martin.renou@gmail.com>
f025709 to 0c40490
ipydatagrid/datagrid.py
Outdated
@@ -852,9 +896,10 @@ def _get_row_index_of_primary_key(self, value):
"as the primary key."
)

# TODO Is there a better way for this?
An alternative way of doing this is:
df = self._data["data"]
row_indices = df.index[(df[key] == value).all(axis="columns")].to_list()
which supports key and value being lists of more than one item. But we need to be careful to support the dataframe index being something other than integers starting at zero, so I think a better approach is:
df = self._data["data"]
row_indices = pd.RangeIndex(len(df))[(df[key] == value).all(axis="columns")].to_list()
I expect this to be significantly faster for large problems, as the iteration occurs in pandas (C/Cython) rather than in Python here.
Do you know if there are tests of this function for dataframes with non-integer keys?
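The suggestion above can be sketched as a standalone function. This is a minimal illustration under assumptions, not the code merged in the PR: `row_positions` is a hypothetical name, `key` and `value` are assumed to be equal-length lists, and the mask is converted with `to_numpy()` before indexing so that only its boolean values are used.

```python
import pandas as pd


def row_positions(df, key, value):
    """Return 0-based positions of rows matching `value` in all `key` columns.

    Hypothetical helper: `key` is a list of column names and `value` is a
    list with one entry per key column.
    """
    # Comparing a DataFrame with a list matches list entries to columns,
    # then .all(axis="columns") keeps rows where every key column matches.
    mask = (df[key] == value).all(axis="columns")
    # Index a RangeIndex rather than df.index, so the result is a list of
    # positions even when df.index holds non-integer labels.
    return pd.RangeIndex(len(df))[mask.to_numpy()].to_list()


df = pd.DataFrame(
    {"a": [1, 2, 1], "b": [10, 20, 10], "c": [0.1, 0.2, 0.3]},
    index=["x", "y", "z"],
)
two_column_match = row_positions(df, ["a", "b"], [1, 10])
one_column_match = row_positions(df, ["a"], [2])
```

The comparison and the reduction both run inside pandas, so no Python-level loop over rows is needed.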
Thanks for the suggestion! I added it.
> Do you know if there are tests of this function for dataframes with non-integer keys?
As far as I understand, the primary key will always be integers: we always use integers for indexing under the hood, even when the user provides a dataframe that uses another type for its index.
And indeed your suggestion seems to work with string indexes:
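A small check along those lines, as a sketch with a made-up dataframe (it is not the output attached to the original comment): with a string index, indexing `df.index` returns labels, while indexing a `pd.RangeIndex` returns integer positions.

```python
import pandas as pd

# Made-up data with a string index, for illustration only
df = pd.DataFrame(
    {"id": [7, 8, 9], "score": [1.0, 2.0, 3.0]},
    index=["a", "b", "c"],
)
key, value = ["id"], [8]

# Boolean mask of rows where every key column equals the requested value
mask = (df[key] == value).all(axis="columns").to_numpy()

labels = df.index[mask].to_list()  # index labels of the matching rows
positions = pd.RangeIndex(len(df))[mask].to_list()  # integer positions
```

Here `labels` holds the string label of the matching row and `positions` holds its 0-based row number, which is what an integer primary key needs.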
Signed-off-by: martinRenou <martin.renou@gmail.com>
Signed-off-by: martinRenou <martin.renou@gmail.com>
0d72d90 to 7dc5da5
LGTM
Make ipydatagrid more performant, achieving two things:
What's remaining to make the PR ready to review:
In follow-up PRs, the following items should be resolved:
_visible_rows attribute