Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StringViewArray as output of substr when input was StringArray #12338

Open
alamb opened this issue Sep 5, 2024 · 4 comments
Open

Use StringViewArray as output of substr when input was StringArray #12338

alamb opened this issue Sep 5, 2024 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@alamb
Copy link
Contributor

alamb commented Sep 5, 2024

Is your feature request related to a problem or challenge?

Part of #11752

StringView is a new arrow array type that allows for more efficient string processing -- specifically it allows string data to be adjusted without copying the underlying data

See this blog post for more details: https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/

@Kev1n8 added support for StringView to the substr function in #12044

At the moment substr produces a StringArray output when the input is StringArray, but we could actually generate a StringViewArray as output which would be more efficient in most cases (avoids copying the string values)

However, in order to avoid errors when substr is used in an expression, we need to make sure that all the rest of the String functions support StringView as input as well. Aka we should wait for the "Required for enabling StringView by default" list on #11752 to be completed

Describe the solution you'd like

  1. change the output type of substr to be StringViewArray when the input is StringArray (note for LargeStringArray we will still need to copy the data I think as StringView is limited to 2^32 bytes)
  2. Change the implementation of substr to use StringView internally
  3. Add tests

Describe alternatives you've considered

No response

Additional context

Note that @kevin8 has already added support for StringView to the substr function in #12044

They also suggested this same optimization could be applied #12044 (comment)

@alamb alamb added the enhancement New feature or request label Sep 5, 2024
@Omega359
Copy link
Contributor

Omega359 commented Sep 6, 2024

I would propose that this change when made happens only after #12119 lands.

@alamb alamb added the help wanted Extra attention is needed label Nov 21, 2024
@alamb
Copy link
Contributor Author

alamb commented Nov 21, 2024

This is probably a good exercise now

@Omega359
Copy link
Contributor

Omega359 commented Nov 21, 2024

Should this be the default for all functions that handle strings now? Require they can accept Utf8View as input for all fields and produce Utf8View output ?

I ask because that should solve issues with chained functions causing a series of casts such as

<function 1 = utf8 -> Utf8View), <function 2 = utf8 -> Utf8) ...

@alamb
Copy link
Contributor Author

alamb commented Nov 21, 2024

Should this be the default for all functions that handle strings now? Require they can accept Utf8View as input for all fields and produce Utf8View output ?

I am not sure

I think each function should produce Utf8 / Utf8View depending on what makes sense (Utf8 is more efficient if you have to rewrite all the strings anyways)

Once string functions accept either Utf8/Utf8View which I think would avoid the chained casting you are describing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants