Add arrow feature to re_chunk and conversions to RecordBatch #7355

Merged: 2 commits, Sep 9, 2024

@jleibs jleibs commented Sep 4, 2024

What

Basic type conversions from TransportChunk to RecordBatch and back.

Adding the round-trip test turned up an interesting issue.

TransportChunk <-> RecordBatch fails to round-trip successfully because we lose the ExtensionType encapsulation that used to be encoded by arrow2.

On the surface this isn't immediately problematic, since we don't care about ExtensionTypes. However, the discussion in apache/arrow-rs#4472 indicates there are going to be very real pain points when writing semantic data processing engines using arrow-rs. In arrow-rs the extension metadata is attached to the FIELD, not the DATATYPE, and there are many processing contexts where the field itself is lost and only the datatype remains.
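
To make the round-trip issue concrete, here is a toy model in plain Rust with no arrow crates. `Arrow2Type`, `ArrowRsType`, `ArrowRsField` and every function name below are hypothetical stand-ins, not the real arrow2, arrow-rs, or re_chunk APIs. The only thing taken from Arrow itself is the `ARROW:extension:name` field-metadata key. The sketch assumes a single level of extension wrapping.

```rust
use std::collections::BTreeMap;

/// Metadata key the Arrow format uses to record an extension type's name on a field.
const EXT_NAME_KEY: &str = "ARROW:extension:name";

/// Toy arrow2-style datatype: the extension wrapper is part of the type itself.
#[derive(Debug, Clone, PartialEq)]
pub enum Arrow2Type {
    Float32,
    Extension(String, Box<Arrow2Type>),
}

/// Toy arrow-rs-style datatype: purely physical, no extension wrapper.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrowRsType {
    Float32,
}

/// Toy arrow-rs-style field: the extension name lives in the field metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowRsField {
    pub name: String,
    pub data_type: ArrowRsType,
    pub metadata: BTreeMap<String, String>,
}

/// Peel the extension wrapper off the datatype and record the outermost
/// extension name on the field instead.
pub fn to_arrow_rs(name: &str, dt: &Arrow2Type) -> ArrowRsField {
    let mut metadata = BTreeMap::new();
    let mut current = dt;
    while let Arrow2Type::Extension(ext_name, inner) = current {
        metadata
            .entry(EXT_NAME_KEY.to_owned())
            .or_insert_with(|| ext_name.clone());
        current = &**inner;
    }
    let data_type = match current {
        Arrow2Type::Float32 => ArrowRsType::Float32,
        Arrow2Type::Extension(..) => unreachable!("extension wrappers were peeled above"),
    };
    ArrowRsField { name: name.to_owned(), data_type, metadata }
}

/// Converting back from a bare datatype: there is nothing to rebuild the wrapper from.
pub fn from_datatype(dt: &ArrowRsType) -> Arrow2Type {
    match dt {
        ArrowRsType::Float32 => Arrow2Type::Float32,
    }
}

/// Converting back from a full field: the wrapper can be rebuilt from the metadata.
pub fn from_field(field: &ArrowRsField) -> Arrow2Type {
    let physical = from_datatype(&field.data_type);
    match field.metadata.get(EXT_NAME_KEY) {
        Some(ext_name) => Arrow2Type::Extension(ext_name.clone(), Box::new(physical)),
        None => physical,
    }
}

/// Round trip in a context that keeps the field around.
pub fn field_roundtrip(dt: &Arrow2Type) -> Arrow2Type {
    from_field(&to_arrow_rs("col", dt))
}

/// Round trip in a context that only hands over the datatype.
pub fn datatype_only_roundtrip(dt: &Arrow2Type) -> Arrow2Type {
    from_datatype(&to_arrow_rs("col", dt).data_type)
}
```

In this model, `field_roundtrip` returns the original `Extension(..)` type, while `datatype_only_roundtrip` returns plain `Float32`: once the field is dropped, the extension name has nowhere to live.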

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@jleibs jleibs marked this pull request as ready for review September 4, 2024 18:27
@jleibs jleibs added the 🏹 arrow (concerning arrow), ⛃ re_datastore (affects the datastore itself), and exclude from changelog (PRs with this won't show up in CHANGELOG.md) labels Sep 4, 2024
@teh-cmc teh-cmc self-requested a review September 5, 2024 06:54
@teh-cmc teh-cmc left a comment


very nice


impl TransportChunk {
/// Create an arrow-rs [`RecordBatch`] containing the data from this [`TransportChunk`].
pub fn try_as_arrow_record_batch(&self) -> Result<RecordBatch, ArrowError> {

Hate to be that guy, but I'm gonna be that guy.

Suggested change
- pub fn try_as_arrow_record_batch(&self) -> Result<RecordBatch, ArrowError> {
+ pub fn try_to_arrow_record_batch(&self) -> Result<RecordBatch, ArrowError> {
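
The rename is consistent with the conversion naming convention in the Rust API Guidelines: `as_` for free borrowed-to-borrowed views, `to_` for expensive conversions that build a new value, and `into_` for conversions that consume `self`. Since the method takes `&self` and returns an owned `RecordBatch`, `to_` is the matching prefix. The reviewer's screenshot is not preserved here, so whether it showed this convention is an assumption. A minimal sketch of the three prefixes, using a hypothetical `Chunk` type that has nothing to do with the real `TransportChunk`:

```rust
/// Hypothetical stand-in type, for illustration only.
pub struct Chunk {
    pub rows: Vec<String>,
}

impl Chunk {
    /// `as_`: free, borrowed -> borrowed. No allocation, just a view.
    pub fn as_rows(&self) -> &[String] {
        &self.rows
    }

    /// `try_to_`: fallible and expensive, borrowed -> owned. Builds a new value.
    pub fn try_to_joined(&self) -> Result<String, String> {
        if self.rows.is_empty() {
            return Err("cannot join an empty chunk".to_owned());
        }
        Ok(self.rows.join(","))
    }

    /// `into_`: owned -> owned. Consumes `self` and hands back its contents.
    pub fn into_rows(self) -> Vec<String> {
        self.rows
    }
}
```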

@jleibs jleibs merged commit a3c3f55 into main Sep 9, 2024
34 checks passed
@jleibs jleibs deleted the jleibs/convert_to_record_batch branch September 9, 2024 19:32