-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Implement StringViewArray
and BinaryViewArray
#5374
Comments
@kallisti-dev I believe may do some work on this. I suggested breaking it down into a sequence of PRs (keeping notes of what is not yet implemented along the way) Specifically, I suggest the first PR should have:
That will likely be a sizeable PR on its own. |
Post from Velox on iimportance of stringview: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/ |
Moving an update from @sundy-li and @ariesdevil from #4585 (comment) to here for visibility I believe @ariesdevil is considering working on this relatively shortly
@XiangpengHao, who will be working with us at InfluxData this summer (starting in May) is also interested in this project, so when we get closer to May we'll engage more fully and perhaps we can figure out how to split up some of the work. |
Do we have a group similar to Discord or Slack? |
@ariesdevil There is both discord and slack -- links are here https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-rust-community The discord server doesn't require an invite and seems to be a bit more active recently |
@ariesdevil says they have time to begin work on this so I will begin filing subtasks Update: I filed these two tasks to get us started. |
Update here is that @ariesdevil has a PR #5481 that we plan to merge tomorrow with the initial array implementations and we'll continue iterating from there Once that is merged, I'll plan to write some more tickets up |
Is there an issue to make sure that parquet support for the new arrow view types are supported. Currently can not use with the parquet writer when they are specified in the schema. I am actually more familiar with Arrow than I am Parquet so excuse the question: is there an official Parquet spec or implementation plan for the view types? |
I'm working on read/write parquet for view types. |
Update here: thanks to @ariesdevil and @XiangpengHao and @RinChanNOWWW we have pretty good basic support for StringView in arrow-rs -- we have basic construction, cast and soon comparison functionality complete. Also, we can now read directly from parquet into StringView arrays #5530 We have also begun the work to integrate this into DataFusion, which is tracked in apache/datafusion#10918 |
Update here is that most of the basic StringView work for basic scanning / filtering is done. There will likely be an large tail of places to implement special StringView support as follw on. Over the next week or two I plan to file a new ticket for the remaining known work and close this ticket down |
I think we are close to finishing enough of this ticket to claim initial support is done -- I expect we can claim initial support is complete in the 52.2.0 release #5998, due to be released in a few weeks time. Once that is done, I'll open a epic to track remaining / additional work |
Given we have now released all the basic support in arrow 52.0.0 am going to claim we are done with the initial implementation and close this ticket 🎉 Huge kudos to everyone who helped including @RinChanNOWWW @ariesdevil @tustvold @XiangpengHao and @a10y (sorry if I forgot anyone) I am collecting additional potential work items in #6163 |
…#44175) ### Rationale for this change BinaryView and Utf8View are now supported in arrow-rs [as of the 52.0.0 release](apache/arrow-rs#5374 (comment)). ### What changes are included in this PR? Add two checkmarks for the Data Type support. ### Are these changes tested? N/A ### Are there any user-facing changes? N/A Authored-by: wiedld <dlw405@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently two new types were added to the Arrow format that make it more suitable for certain types of operations on strings
Specifically when doing filtering / take with string data, creating a new
Utf8Array
requires copying the strings to a new, packed binary buffer. The "VariableSizeBinaryView" was designed to solve this limitation and recently added to the Arrow spec.Describe the solution you'd like
I would like to implement
StringViewArray
andBinaryViewArray
following the spec:The spec: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
https://github.com/apache/arrow/blob/3fe598ae4dfd7805ab05452dd5ed4b0d6c97d8d5/format/Schema.fbs#L187-L205
Initially, I would suggest we get the basic types in place:
DataType::Utf8View
andDataType::BinaryView
#5468StringViewArray
implementation and layout and basic construction + tests #5469BinaryViewArray
implementation and layout and basic constructionThen as follow on PRs, add support for key features
StringViewArray
andBinaryViewArray
#5506StringViewArray
andBinaryViewArray
#5507cast
kernel support forStringViewArray
andBinaryViewArray
#5508StringViewArray
andBinaryViewArray
#5509filter
kernel support forStringViewArray
andBinaryViewArray
#5510take
kernel support forStringViewArray
andBinaryViewArray
#5511cast
kernel support forStringViewArray
andBinaryViewArray
<-->
DictionaryArray` #5861compare_op
forGenericBinaryView
#5897like
benchmark for StringView #5936compare_op
forGenericBinaryView
#5903gc
garbage collector support forStringViewArray
andBinaryViewArray
#5513StringViewArray
andBinaryViewArray
reading/writing in parquet #5530deduplicate
/intern
functionality for StringView #5910StringViewArray
#6094interleave
(used in Sort)Potential Follow ons / additional optimizations
ByteView
s for small strings #6034like
/ilike
kernels for StringView #5951Describe alternatives you've considered
I think a good plan would be to dust off the prototype on #4585 from @tustvold (linked from #4253).
Initially, the idea would be to dust off the PR and split it into a few smaller PRs with tests and docs.
Additional context
Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790
Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489
The text was updated successfully, but these errors were encountered: