-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] Implement support for StringView
in DataFusion
#10918
Comments
I think we should aim for a first "milestone" of showing improvements for some clickbench queries |
Will we completely change StringArray to StringViewArray in Datafusion? A more concrete example is
We have the string column If not, if we somehow want to utilize StringViewArray, how do we minize the cost of conversion between String and StringView? It seems Polars completely refactor their String to StringView 🤔 |
I think since they are two separate types in Arrow we couldn't fully switch to StringView the way polars could as it controls the whole stack. Users could still feed DataFusion StringViewArray from custom TableProviders and would expect StringView at the output. However what I think we could do is internally to DataFusion (e.g. within the plan, before the final output) is use StringView in the batches that flow through intermediate nodes in the plan.
Indeed, As you point out, I don't think we can transparently switch to using StringView -- instead we would have to start encoding information in the plans about the new types. I wonder if we could have a new logical optimzier pass that tried to annotate operations that support it to use StringView in their schema rather than String. Then the ExecutionPlans would know if they were supposed to generate StringView as output or the more traditional StringArray 🤔 Here is an idea of one place to start: #9403 (comment) |
I think @XiangpengHao is looking into another place to use StringView which is #10921 -- where we have a similar idea to use StringView in some sub portion of the plan. Here is more info about the optimizer pass idea: #10921 (comment) |
I think this project is going pretty well We are at the point of starting to implement some basic functions using StringView. |
Now that we have upgraded to arrow 52.1.0, I think we could merge the |
In case anyone wants an overview of adding StringView to DataFusion, here is a presentation and slides from @XiangpengHao |
Is your feature request related to a problem or challenge?
StringView
/BinaryView
were added to the Arrow format that make it more suitable for certain types of operations on strings. Specifically when filtering with string data, creating the outputStringArray
requires copying the strings to a new, packed binary buffer.GenericByteViewArrary
was designed to solve this limitation and the arrow-rs implementation, tracked by apache/arrow-rs#5374, is now complete enough to start adding support into DataFusionI think we can improve performance in certain cases by using StringView (this is also described in more details in the Pola.rs blog post)
StringViewArray
/BinaryViewArray
rather thanStringArray
/BinaryArray
saves a copy (and @ariesdevil is quite close to having it integrated into the parquet reader ImplementStringViewArray
andBinaryViewArray
reading/writing in parquet arrow-rs#5530)substr(url, 4) = 'http'
) as the intermediate result ofsubstr
can be called without copying string valuesDescribe the solution you'd like
I would like to support StringView / BinaryView support in DataFusion.
While my primary usecase is for reading data from parquet, I think teaching DataFusion to use
StringView
(at least as intermediate values when evaluating expressions may help significantlyDevelopment branch:
string-view
Since this feature requires upstream arrow-rs support apache/arrow-rs#5374 that is not yet released we plan to do development on a
string-view
feature branch:https://github.com/apache/datafusion/tree/string-view
Task List
Here are some high level tasks (I can help flesh these out for anyone who is interested in helping)
arrow_cast
support forStringView
andBinaryView
#10920=
and inequality<>
support forStringView
#10919Utf8View
literals #10998LIKE
for StringView arrays #11024BinaryView
#10996LIKE
for StringView arrays #11024REGEXP_REPLACE
for StringView #11025LargeString
andLargeBinary
forStringView
andBinaryView
#11023like
benchmark for StringView arrow-rs#5936Utf8View
->Utf8
/BinaryView
->Binary
for compatibilityschema_force_string_view
) by default #11682StringView
support for CharacterLength #11677Describe alternatives you've considered
No response
Additional context
Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790
Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489
The text was updated successfully, but these errors were encountered: