Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a column to store categories, rather than a mapping #69

Merged
merged 5 commits into from
Jul 28, 2022

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Sep 29, 2021

This PR implements one of the changes mentioned in #41, i.e.,

Add get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.

Replacing the previous mapping with a child column now implies that the data buffer stores integer indices into the child column.

@vnlitvinov
Copy link
Contributor

@shwina @rgommers should we maybe not try to disguise the (explicit) categories as generic "children"?

I think just replacing the type of df.describe_categorical['mapping'] from dict to Column instead of introducing of a new method could be better - it's still a mapping after all 🙃

@rgommers
Copy link
Member

We briefly touched on this PR today, we'd like to move it forward since all interchange protocol implementers seem to be in favor of this. @vnlitvinov plans to revisit his proposal from above and add that here or open an alternative PR. Then we can try to finalize this, and then update implementations in the various libraries for it.

@shwina
Copy link
Contributor Author

shwina commented May 27, 2022

still a mapping after all

I'm not sure I quite follow; the Column is explicitly not a mapping, rather it is concretely the categories. Perhaps I'm missing something?

@vnlitvinov vnlitvinov force-pushed the store-categories-in-column branch from 91ed593 to 902ba7c Compare June 1, 2022 10:12
@vnlitvinov
Copy link
Contributor

@shwina I've missed your comment, so I went ahead with the original plan we discussed last meeting and I had added a commit doing so. If you don't like the idea feel free to remove my commit.

As for it not being a mapping... it somewhat is :) it maps an integer index to a category value.

@rgommers
Copy link
Member

After discussion on Thursday, @shwina and others were happy with the changes @vnlitvinov pushed. With one change to make: rename mapping to categories, since it's no longer a real mapping (only implicitly, the keys are 0 ... n_categories) and it would help with transitioning too.

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Co-authored-by: Keith Kraus <keith.j.kraus@gmail.com>
@rgommers
Copy link
Member

With one change to make: rename mapping to categories,

This change was made in the spec part, but not the rest of this PR. I'm inclined to push a change to pandas_implementation.py and then merge this.

@rgommers
Copy link
Member

With one change to make: rename mapping to categories,

This change was made in the spec part, but not the rest of this PR. I'm inclined to push a change to pandas_implementation.py and then merge this.

I tried this - since the tests no longer pass with this PR, and we now have an actual Pandas implementation, it may be better to completely remove this early prototype in pandas_implementation.py.

@rgommers
Copy link
Member

I tried this - since the tests no longer pass with this PR, and we now have an actual Pandas implementation, it may be better to completely remove this early prototype in pandas_implementation.py.

Discussed in today's call: everyone is happy with deleting this prototype code.

rgommers added a commit to rgommers/dataframe-api that referenced this pull request Jul 14, 2022
Now that we have four real-world implementations in cuDF, Vaex,
Modin and Pandas, we no longer need this prototype. Having it
here makes it harder to merge other PRs (see data-apisgh-69), so let's
remove it now.
rgommers added a commit that referenced this pull request Jul 28, 2022
Now that we have four real-world implementations in cuDF, Vaex,
Modin and Pandas, we no longer need this prototype. Having it
here makes it harder to merge other PRs (see gh-69), so let's
remove it now.
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone is happy and this is now a pretty small/straightforward diff. So merging. I'll follow up with all the implementers to ensure we will actually propagate these changes to all implementations.

Thanks @shwina, @vnlitvinov, @kkraus14

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bc-breaking A change that is not fully backwards compatible enhancement New feature or request interchange-protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants