Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for Arrow IPC stream format #4252

Merged
merged 12 commits into from
Dec 15, 2022
Merged

add support for Arrow IPC stream format #4252

merged 12 commits into from
Dec 15, 2022

Conversation

nwt
Copy link
Member

@nwt nwt commented Dec 7, 2022

Features:

  • Auto-detection via zio/anyio.NewReader*
  • Format specification via "-f arrows" and "-i arrows" (".arrows" being the file extension recommend by the Arrow project for this format)
  • HTTP API support in requests and responses via the application/vnd.apache.arrow.stream MIME type
  • Preservation on read of most Arrow types (excluding metadata) via Zed named types (with an "arrow_" prefix) for Arrow types without an equivalent Zed type
  • Ability to write most Arrow types given properly named Zed types (with an "arrow_" prefix)

Limitations:

  • No support for the Arrow IPC file format
  • Reader is slow due to repeated array.New*Data calls
  • Writer does not support big endianness, compression, control of record batch length, or dictionary or sparce union types
  • Tests aren't thorough

For #4226.

Features:

* Auto-detection via zio/anyio.NewReader*
* Format specification via "-f arrows" and "-i arrows" (".arrows" being
  the file extension recommend by the Arrow project for this format)
* HTTP API support in requests and responses via the
  application/vnd.apache.arrow.stream MIME type
* Preservation on read of most Arrow types (excluding metadata) via
  Zed named types (with an "arrow_" prefix) for Arrow types without an
  equivalent Zed type
* Ability to write most Arrow types given properly named Zed types
  (with an "arrow_" prefix)

Limitations:

* No support for the Arrow IPC file format
* Reader is slow due to repeated array.New*Data calls
* Writer does not support big endianness, compression, control of record
  batch length, or dictionary or sparce union types
* Tests aren't thorough
@nwt nwt requested a review from a team December 7, 2022 23:34
@nwt
Copy link
Member Author

nwt commented Dec 8, 2022

This increases the zq executable's size by 17 MB for darwin/amd64 and 12 MB for linux/amd64. Ouch.

$ git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
$ git rev-parse @
a31161a3404c7cb45ba2dd405b195c01e0123cf9
$ GOOS=darwin GOARCH=amd64 make build
$ ls -lh dist/zq
-rwxr-xr-x  1 noah  staff    18M Dec  8 07:34 dist/zq*
$ GOOS=linux GOARCH=amd64 make build
$ ls -lh dist/zq
-rwxr-xr-x  1 noah  staff    13M Dec  8 07:34 dist/zq*
$ git checkout arrow
Switched to branch 'arrow'
$ git rev-parse @
2f046e6473c9a85ffa5b4ce19daf8b11a634c364
$ GOOS=darwin GOARCH=amd64 make build
$ ls -lh dist/zq
-rwxr-xr-x  1 noah  staff    35M Dec  8 07:35 dist/zq*
$ GOOS=linux GOARCH=amd64 make build
$ ls -lh dist/zq
-rwxr-xr-x  1 noah  staff    25M Dec  8 07:35 dist/zq*

{Name: "days", Type: zed.TypeInt32},
{Name: "milliseconds", Type: zed.TypeUint32},
}
var decimal128Fields = []zed.Column{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make an issue for adding decimal128 support. The type code exists in the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mccanne: There's an existing issue #1522 that speaks to implementing all four of the decimal type variations. We could open separate issues for each if you'd like, or just keep that one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to separate them.

case arrow.INT64:
return zed.TypeInt64, nil
case arrow.FLOAT16:
return r.zctx.LookupTypeNamed("arrow_float16", zed.TypeFloat32)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an issue for float16.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already open as #1312.

return nil, err
}
return r.zctx.LookupTypeNamed("arrow_decimal128", typ)
case arrow.DECIMAL256:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add issue for decimal256


// Writer is a zio.Writer for the Arrow IPC stream format. Given Zed values
// with appropriately named types (see the newArrowDataType implementation), it
// can write all Arrow types except dictionaries and sparse unions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify this comment with the concept that dictionaries don't exist as a Zed type but we could add support to boomerang dictionary types with a named arrow type. Maybe create an issue for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 868965f.

Copy link
Collaborator

@mccanne mccanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@nwt nwt merged commit 49dcdd0 into main Dec 15, 2022
@nwt nwt deleted the arrow branch December 15, 2022 17:16
@philrz philrz linked an issue Dec 15, 2022 that may be closed by this pull request
@nwt nwt mentioned this pull request Dec 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apache Arrow support
3 participants