feat: Add cat.len_chars and cat.len_bytes #20211

Merged: 4 commits, Dec 10, 2024

mcrumiller (Contributor) commented Dec 7, 2024

This adds a fast path for operations that can be performed on the categories of a categorical series. I've added len_bytes and len_chars for now. If desired, it would be trivial to add a few others such as starts_with and ends_with.

```python
import polars as pl

with pl.StringCache():
    pl.Series(["a", "b"], dtype=pl.Categorical)  # fill some cache
    df = pl.DataFrame({
        "a": pl.Series(["Café", "345", "Café", "東京", None], dtype=pl.Categorical)
    })

df.with_columns(
    pl.col("a").cat.len_bytes().alias("n_bytes"),
    pl.col("a").cat.len_chars().alias("n_chars"),
)
# shape: (5, 3)
# ┌──────┬─────────┬─────────┐
# │ a    ┆ n_bytes ┆ n_chars │
# │ ---  ┆ ---     ┆ ---     │
# │ cat  ┆ u32     ┆ u32     │
# ╞══════╪═════════╪═════════╡
# │ Café ┆ 5       ┆ 4       │
# │ 345  ┆ 3       ┆ 3       │
# │ Café ┆ 5       ┆ 4       │
# │ 東京 ┆ 6       ┆ 2       │
# │ null ┆ null    ┆ null    │
# └──────┴─────────┴─────────┘
```

These require the strings feature because they dispatch to str.len_bytes and str.len_chars.
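
To make the fast path concrete, here is a minimal pure-Python model of the idea, not the Polars implementation: apply the string operation once per distinct category, then gather the per-category results through each row's physical code. The names `apply_on_categories`, `len_bytes`, and `len_chars` are illustrative, and the category order assumes the string cache was pre-filled with "a" and "b" as in the example above.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_on_categories(
    categories: Sequence[str],
    codes: Sequence[Optional[int]],
    op: Callable[[str], T],
) -> list[Optional[T]]:
    """Evaluate `op` once per category, then gather results by code."""
    per_category = [op(c) for c in categories]  # one call per distinct string
    # A null row (code None) stays null; every other row is an index lookup.
    return [None if code is None else per_category[code] for code in codes]


def len_bytes(s: str) -> int:
    return len(s.encode("utf-8"))  # UTF-8 byte length


def len_chars(s: str) -> int:
    return len(s)  # number of Unicode code points


# Hypothetical dictionary and codes mirroring the example column.
categories = ["a", "b", "Café", "345", "東京"]
codes = [2, 3, 2, 4, None]

n_bytes = apply_on_categories(categories, codes, len_bytes)
n_chars = apply_on_categories(categories, codes, len_chars)
print(n_bytes)  # [5, 3, 5, 6, None]
print(n_chars)  # [4, 3, 4, 2, None]
```

A predicate such as `str.startswith` fits the same shape, since it also returns one value per category and leaves the dictionary untouched.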

github-actions bot added the enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), and rust (Related to Rust Polars) labels Dec 7, 2024

codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 96.55172% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.63%. Comparing base (d4bab3f) to head (2b9edc3).
Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| crates/polars-plan/src/dsl/function_expr/cat.rs | 94.73% | 2 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20211   +/-   ##
=======================================
  Coverage   79.62%   79.63%           
=======================================
  Files        1565     1565           
  Lines      218187   218243   +56     
  Branches     2475     2475           
=======================================
+ Hits       173734   173794   +60     
+ Misses      43886    43882    -4     
  Partials      567      567           

@mcrumiller mcrumiller marked this pull request as ready for review December 7, 2024 17:15
@mcrumiller mcrumiller marked this pull request as draft December 7, 2024 17:48
@mcrumiller mcrumiller marked this pull request as ready for review December 7, 2024 20:11
ritchie46 (Member) left a comment

Looks great @mcrumiller. I've left some comments.

T: PolarsDataType,
{
let ca = s.categorical()?;
let (categories, phys) = match &**ca.get_rev_map() {
Member

Can this move into its own function? That saves monomorphization bloat.

Contributor, Author

Does this resolve it? 2b9edc3

Member

Yes, great!

};

// Apply function to categories
let categories = StringChunked::with_chunk(PlSmallStr::EMPTY, categories.clone());
Member

This should take the name of s.

Contributor, Author

let categories = StringChunked::with_chunk(PlSmallStr::EMPTY, categories.clone());
let result = op(&categories).into_series();

let out = result.take(phys.idx()?)?;
Member

This can do a take_unchecked.

@@ -171,6 +171,21 @@ impl PyStringFunction {
}
}

#[pyclass(name = "CategoricalFunction", eq)]
Member

Can you remove these? I don't think we need them for now.

Contributor, Author

Reverted the whole file, 0a83dbe

@@ -793,8 +808,16 @@ pub(crate) fn into_py(py: Python<'_>, expr: &AExpr) -> PyResult<PyObject> {
FunctionExpr::BinaryExpr(_) => {
return Err(PyNotImplementedError::new_err("binary expr"))
},
FunctionExpr::Categorical(_) => {
return Err(PyNotImplementedError::new_err("categorical expr"))
FunctionExpr::Categorical(catfun) => match catfun {
Member

This can remain NotImplemented.

Contributor, Author

Reverted the whole file, 0a83dbe

Member

Yes. 👍

@ritchie46 ritchie46 changed the title from "feat: Add cat.str_len and cat.str_bytes" to "feat: Add cat.len_chars and cat.len_bytes" Dec 8, 2024
@mcrumiller mcrumiller marked this pull request as draft December 8, 2024 13:31
@mcrumiller mcrumiller marked this pull request as ready for review December 8, 2024 14:40
alexander-beedie (Collaborator) commented Dec 9, 2024

> If desired, it would be trivial to add a few others such as starts_with and ends_with.

Good stuff. Bringing cat to parity with str while also speeding it up is a worthy goal; I can say for sure that every single researcher at our place believes (rightly or wrongly) that cat is a pure string optimisation and that it should behave as a string in essentially all circumstances... (the fact that it doesn't sort as one by default is an ongoing pain point, but that's a whole other discussion 🤣)

I'd certainly be happy to see more fast-path str ops come to cat 👍

mcrumiller (Contributor, Author) commented Dec 9, 2024

> Good stuff. Bringing cat to parity with str while also speeding it up is a worthy goal; I can say for sure that every single researcher at our place believes (rightly or wrongly) that cat is a pure string optimisation and that it should behave as a string in essentially all circumstances... (the fact that it doesn't sort as one by default is an ongoing pain point, but that's a whole other discussion 🤣)

@alexander-beedie as per the #9773 discussion I don't think we will get most string ops in the cat namespace unfortunately. I'm with your researchers here in that I would have loved a .cat.slice() but...it was not meant to be. But a few of these easy ones are nice that don't require creating a new revmap.
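
As a rough illustration of why string-returning operations are harder, the sketch below models a categorical column as a list of categories plus integer codes (an assumption made for illustration, not Polars's actual revmap structure). A hypothetical helper, `remap_after_string_op`, shows that an operation like slicing can collapse distinct categories into one, so the dictionary and every code have to be rebuilt, whereas length-style operations only read the existing dictionary.

```python
from typing import Callable, Optional


def remap_after_string_op(
    categories: list[str],
    codes: list[Optional[int]],
    op: Callable[[str], str],
) -> tuple[list[str], list[Optional[int]]]:
    """Apply a str -> str op and rebuild the dictionary and codes."""
    new_categories: list[str] = []
    index: dict[str, int] = {}      # transformed string -> new code
    old_to_new: list[int] = []      # old code -> new code
    for category in categories:
        result = op(category)
        if result not in index:     # first time this output is seen
            index[result] = len(new_categories)
            new_categories.append(result)
        old_to_new.append(index[result])
    # Every row must be rewritten because old codes no longer line up.
    new_codes = [None if c is None else old_to_new[c] for c in codes]
    return new_categories, new_codes


# "apple" and "avocado" both slice to "a", so three categories become two.
new_categories, new_codes = remap_after_string_op(
    ["apple", "avocado", "banana"],
    [0, 1, 2, None, 0],
    lambda s: s[:1],
)
print(new_categories)  # ['a', 'b']
print(new_codes)       # [0, 0, 1, None, 0]
```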

ritchie46 (Member) commented

> @alexander-beedie as per the #9773 discussion I don't think we will get most string ops in the cat namespace unfortunately. I'm with your researchers here in that I would have loved a .cat.slice() but...it was not meant to be. But a few of these easy ones are nice that don't require creating a new revmap.

I think we should just map shrink_to_fit to a kernel that interns the strings. Then long strings will be deduplicated.
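
For readers unfamiliar with interning, here is a minimal pure-Python sketch of the general idea, not the kernel proposed above: each distinct string is stored once in a pool and every repeated value points at that single copy. The function name `intern_strings` is hypothetical.

```python
from typing import Optional


def intern_strings(values: list[Optional[str]]) -> list[Optional[str]]:
    """Return a list in which equal strings share one object."""
    pool: dict[str, str] = {}
    # setdefault stores the first copy seen and returns it for later equals.
    return [None if v is None else pool.setdefault(v, v) for v in values]


# Build equal strings at runtime so they start out as separate objects.
raw = ["-".join(["a long string", str(1)]) for _ in range(3)] + [None]
interned = intern_strings(raw)

# After interning, all non-null entries refer to the same object.
print(len({id(s) for s in interned if s is not None}))  # 1
```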

@ritchie46 ritchie46 merged commit a21ac2e into pola-rs:main Dec 10, 2024
26 checks passed
connor-elliott commented

+1 for starts_with/ends_with/contains

@mcrumiller mcrumiller deleted the cat-len branch December 10, 2024 15:50
mcrumiller (Contributor, Author) commented

@connor-elliott that should now be an easy add, I'll take a look tonight when I'm home from work.

mcrumiller (Contributor, Author) commented

> I think we should just map shrink_to_fit to a kernel that interns the strings. Then long strings will be deduplicated.

The reason we often prefer categoricals is because they take up about 1/4 the space (4 bytes per element versus 16), not necessarily for performance reasons. In many of these cases, it would be really nice to have a pl.Char datatype, or something similar to pl.Array(pl.UInt8, N) with fixed N.
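
The 4-versus-16 figure can be turned into a back-of-the-envelope estimate. The sketch below assumes 4 bytes per row for a u32 category code plus one copy of each distinct string, versus 16 bytes per row for a string view plus out-of-line storage for strings longer than 12 bytes (the Arrow string-view layout). It ignores validity bitmaps and buffer overhead, and `estimate_bytes` is a hypothetical helper.

```python
from typing import Optional

CODE_BYTES = 4     # assumed: one u32 physical code per row
VIEW_BYTES = 16    # assumed: one string view per row
INLINE_LIMIT = 12  # assumed: longer strings also need out-of-line bytes


def estimate_bytes(values: list[Optional[str]]) -> tuple[int, int]:
    """Return rough (categorical, string) sizes in bytes."""
    encoded = [v.encode("utf-8") for v in values if v is not None]
    n = len(values)
    # Codes for every row plus a single copy of each distinct string.
    categorical = CODE_BYTES * n + sum(len(b) for b in set(encoded))
    # A view for every row plus payload for strings too long to inline.
    string = VIEW_BYTES * n + sum(len(b) for b in encoded if len(b) > INLINE_LIMIT)
    return categorical, string


values = ["Café", "345", "Café", "東京", None] * 1000
cat_size, str_size = estimate_bytes(values)
print(cat_size, str_size)  # 20014 80000
```

With short strings and few distinct values, the ratio approaches 1/4 as the row count grows relative to the dictionary.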

ritchie46 (Member) commented

> > I think we should just map shrink_to_fit to a kernel that interns the strings. Then long strings will be deduplicated.
>
> The reason we often prefer categoricals is because they take up about 1/4 the space (4 bytes per element versus 16), not necessarily for performance reasons. In many of these cases, it would be really nice to have a pl.Char datatype, or something similar to pl.Array(pl.UInt8, N) with fixed N.

And that will be fixed by streaming, where categoricals are yet again problematic and complex. They seem like a good solution, but they are a band-aid. There are better solutions ;)

Labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)

4 participants