Open
Labels
Enhancement, ExtensionArray (Extending pandas with custom dtypes or arrays.)
Description
Currently, Categorical serves two main purposes:
- A type for expressing data from a fixed set of categories
- A memory efficient storage format for low-cardinality objects
This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared, but the two types would have
different semantics for some operations:
- concat (the keys would be unioned by default)
- groupby (unobserved categories would be dropped by default)
- value_counts (unobserved categories would be dropped by default)
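To make the difference concrete, here is a rough sketch that contrasts today's Categorical behavior with an approximation of the proposed defaults. DictEncodedArray does not exist, so the `*_dict_encoded` variables are only stand-ins built from existing pandas calls (`remove_unused_categories`, `observed=True`, `union_categoricals`); the actual semantics of the new type may well differ.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# One category ("c") is declared but never observed in the data.
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# value_counts: Categorical reports the unobserved category with a count of 0.
counts_categorical = s.value_counts().to_dict()
# Stand-in for the proposed default: drop unobserved keys before counting.
counts_dict_encoded = s.cat.remove_unused_categories().value_counts().to_dict()

# groupby: `observed` is passed explicitly so the example does not rely on
# the library default.
df = pd.DataFrame({"key": s, "x": [1, 2, 3]})
sums_categorical = df.groupby("key", observed=False)["x"].sum().to_dict()
sums_dict_encoded = df.groupby("key", observed=True)["x"].sum().to_dict()

# concat: with mismatched categories, pd.concat gives up the categorical
# dtype, whereas union_categoricals keeps the encoding and unions the keys.
left = pd.Series(["a", "b"], dtype="category")
right = pd.Series(["b", "c"], dtype="category")
concat_categorical = pd.concat([left, right], ignore_index=True)
concat_dict_encoded = union_categoricals([left, right])

print(counts_categorical, counts_dict_encoded)
print(sums_categorical, sums_dict_encoded)
print(concat_categorical.dtype, list(concat_dict_encoded.categories))
```

In each pair, the first result follows the "fixed set of categories" reading and the second follows the "storage format" reading that the proposal targets.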
This is most useful for strings, but could even be useful for storing a large
array of 64-bit precision items (store the 64-bit items once, then use an int16
or int32 array for the codes).
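As a rough illustration of the memory argument, the sketch below uses the existing Categorical as a stand-in for the shared storage format (unique keys plus integer codes). The sizes, the seed, and the choice of 1,000 distinct values are arbitrary assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1,000 distinct float64 keys, repeated across one million rows.
keys = rng.random(1_000)
values = rng.choice(keys, size=1_000_000)

# Categorical stores the unique keys once plus a small integer code per row.
encoded = pd.Categorical(values)

dense_bytes = values.nbytes    # 8 bytes per row
encoded_bytes = encoded.nbytes  # codes plus the array of unique keys

print(encoded.codes.dtype)
print(dense_bytes, encoded_bytes)

# Decoding is a lookup of each code in the key array.
decoded = encoded.categories.to_numpy()[encoded.codes]
```

With these numbers the codes fit in int16, so the encoded form needs roughly a quarter of the dense array's memory, and the original values can be recovered exactly.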