Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for Arrow IPC stream format #4252

Merged
merged 12 commits into from
Dec 15, 2022
25 changes: 15 additions & 10 deletions api/mime.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,16 +8,17 @@ import (
)

const (
MediaTypeAny = "*/*"
MediaTypeCSV = "text/csv"
MediaTypeJSON = "application/json"
MediaTypeLine = "application/x-line"
MediaTypeNDJSON = "application/x-ndjson"
MediaTypeParquet = "application/x-parquet"
MediaTypeZeek = "application/x-zeek"
MediaTypeZJSON = "application/x-zjson"
MediaTypeZNG = "application/x-zng"
MediaTypeZSON = "application/x-zson"
MediaTypeAny = "*/*"
MediaTypeArrowStream = "application/vnd.apache.arrow.stream"
MediaTypeCSV = "text/csv"
MediaTypeJSON = "application/json"
MediaTypeLine = "application/x-line"
MediaTypeNDJSON = "application/x-ndjson"
MediaTypeParquet = "application/x-parquet"
MediaTypeZeek = "application/x-zeek"
MediaTypeZJSON = "application/x-zjson"
MediaTypeZNG = "application/x-zng"
MediaTypeZSON = "application/x-zson"
)

type ErrUnsupportedMimeType struct {
Expand All @@ -41,6 +42,8 @@ func MediaTypeToFormat(s string, dflt string) (string, error) {
switch typ {
case MediaTypeAny, "":
return dflt, nil
case MediaTypeArrowStream:
return "arrows", nil
case MediaTypeCSV:
return "csv", nil
case MediaTypeJSON:
Expand All @@ -65,6 +68,8 @@ func MediaTypeToFormat(s string, dflt string) (string, error) {

func FormatToMediaType(format string) string {
switch format {
case "arrows":
return MediaTypeArrowStream
case "csv":
return MediaTypeCSV
case "json":
Expand Down
2 changes: 1 addition & 1 deletion cli/inputflags/flags.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ func (f *Flags) Options() anyio.ReaderOpts {
}

func (f *Flags) SetFlags(fs *flag.FlagSet, validate bool) {
fs.StringVar(&f.Format, "i", "auto", "format of input data [auto,zng,vng,json,zeek,zjson,csv,parquet,line]")
fs.StringVar(&f.Format, "i", "auto", "format of input data [arrows,auto,zng,vng,json,zeek,zjson,csv,parquet,line]")
fs.BoolVar(&f.ZNG.Validate, "validate", validate, "validate the input format when reading ZNG streams")
fs.IntVar(&f.ZNG.Threads, "threads", 0, "number of threads used for scanning ZNG input")
f.ReadMax = auto.NewBytes(zngio.MaxSize)
Expand Down
2 changes: 1 addition & 1 deletion cli/outputflags/flags.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ func (f *Flags) SetFormatFlags(fs *flag.FlagSet) {
if f.DefaultFormat == "" {
f.DefaultFormat = "zng"
}
fs.StringVar(&f.Format, "f", f.DefaultFormat, "format for output data [zng,vng,json,parquet,table,text,csv,lake,zeek,zjson,zson]")
fs.StringVar(&f.Format, "f", f.DefaultFormat, "format for output data [arrows,zng,vng,json,parquet,table,text,csv,lake,zeek,zjson,zson]")
fs.BoolVar(&f.jsonShortcut, "j", false, "use line-oriented JSON output independent of -f option")
fs.BoolVar(&f.zsonShortcut, "z", false, "use line-oriented ZSON output independent of -f option")
fs.BoolVar(&f.zsonPretty, "Z", false, "use formatted ZSON output independent of -f option")
Expand Down
13 changes: 7 additions & 6 deletions docs/commands/zq.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,7 @@ Note here that the query `1+1` [implies](../language/overview.md#26-implied-oper

| Option | Auto | Specification |
|-----------|------|------------------------------------------|
| `arrows` | yes | [Arrow IPC Stream Format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) |
| `json` | yes | [JSON RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.html) |
| `csv` | yes | [CSV RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html) |
| `parquet` | no | [Apache Parquet](https://github.com/apache/parquet-format) |
Expand Down Expand Up @@ -285,14 +286,14 @@ produces

### 3.4 Schema-rigid Outputs

Certain data formats like Parquet are "schema rigid" in the sense that they
require a schema to be defined before values can be written into the file and
all the values in the file must conform to this schema.
Certain data formats like Arrow and Parquet are "schema rigid" in the sense that
they require a schema to be defined before values can be written into the file
and all the values in the file must conform to this schema.

Zed, however, has a fine-grained type system instead of schemas and a sequence
of data values are completely self-describing and may be heterogeneous in nature.
This creates a challenge converting the type-flexible Zed formats to a schema-rigid
format like Parquet.
format like Arrow and Parquet.

For example, this seemingly simple conversion:
```mdtest-command fails
Expand All @@ -319,8 +320,8 @@ but the data was necessarily changed (by inserting nulls):

#### 3.4.2 Splitting Schemas

Another common approach to dealing with the schema-rigid limitation of Parquet
is to create a separate file for each schema.
Another common approach to dealing with the schema-rigid limitation of Arrow and
Parquet is to create a separate file for each schema.

`zq` can do this too with the `-split` option, which specifies a path
to a directory for the output files. If the path is `.`, then files
Expand Down
1 change: 1 addition & 0 deletions docs/lake/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -516,6 +516,7 @@ The supported MIME types are as follows:

| Format | MIME Type |
| ------ | --------- |
| Arrow IPC Stream | application/vnd.apache.arrow.stream |
| CSV | text/csv |
| JSON | application/json |
| NDJSON | application/x-ndjson |
Expand Down
4 changes: 2 additions & 2 deletions docs/tutorials/schools.md
Original file line number Diff line number Diff line change
Expand Up @@ -721,8 +721,8 @@ Geyserville New Tech Academy,Geyserville Unified,Geyserville,Sonoma,95441-9670,3
,,,,,,,,,,,,,,,,Sonoma,Geyserville Unified,Geyserville New Tech Academy
,,,,,,,,,,,,,,,,Sonoma,Geyserville Unified,
```
In addition to the `csv` format, the `parquet`, `table`, and `zeek` formats
also benefit from fused records.
In addition to the `csv` format, the `arrows`, `parquet`, `table`, and `zeek`
formats also benefit from fused records.

### 4.4 [put](../language/operators/drop.md)

Expand Down
31 changes: 22 additions & 9 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ go 1.19
require (
github.com/agnivade/levenshtein v1.1.1
github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4
github.com/apache/arrow/go/v11 v11.0.0-20221214174703-0dfec8e98f4f
github.com/araddon/dateparse v0.0.0-20210429162001-6b43995a97de
github.com/aws/aws-sdk-go v1.36.17
github.com/axiomhq/hyperloglog v0.0.0-20191112132149-a4c4c47bc57f
Expand All @@ -31,36 +32,48 @@ require (
go.uber.org/multierr v1.8.0
go.uber.org/zap v1.23.0
golang.org/x/exp v0.0.0-20220916125017-b168a2c6b86b
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4
golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f
golang.org/x/sync v0.0.0-20220819030929-7fc1605a5dde
golang.org/x/sys v0.0.0-20220829200755-d48e67d00261
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211
golang.org/x/text v0.3.7
gopkg.in/natefinch/lumberjack.v2 v2.0.0
gopkg.in/yaml.v3 v3.0.1
)

require (
github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c // indirect
github.com/andybalholm/brotli v1.0.4 // indirect
github.com/apache/thrift v0.16.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.1.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/dgryski/go-metro v0.0.0-20180109044635-280f6062b5bc // indirect
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
github.com/golang/protobuf v1.4.3 // indirect
github.com/goccy/go-json v0.9.11 // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/google/flatbuffers v2.0.8+incompatible // indirect
github.com/jmespath/go-jmespath v0.4.0 // indirect
github.com/mattn/go-isatty v0.0.12 // indirect
github.com/klauspost/asmfmt v1.3.2 // indirect
github.com/klauspost/compress v1.15.9 // indirect
github.com/klauspost/cpuid/v2 v2.0.9 // indirect
github.com/mattn/go-isatty v0.0.16 // indirect
github.com/mattn/go-runewidth v0.0.10 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.1 // indirect
github.com/niemeyer/pretty v0.0.0-20200227124842-a10e7caefd8e // indirect
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 // indirect
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 // indirect
github.com/prometheus/common v0.10.0 // indirect
github.com/prometheus/procfs v0.1.3 // indirect
github.com/rivo/uniseg v0.1.0 // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
go.opentelemetry.io/otel v0.16.0 // indirect
go.uber.org/atomic v1.10.0 // indirect
go.uber.org/atomic v1.7.0 // indirect
golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4 // indirect
golang.org/x/net v0.0.0-20220722155237-a158d28d115b // indirect
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
google.golang.org/protobuf v1.25.1-0.20201020201750-d3470999428b // indirect
gopkg.in/check.v1 v1.0.0-20200902074654-038fdea0a05b // indirect
golang.org/x/tools v0.1.12 // indirect
golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f // indirect
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 // indirect
google.golang.org/grpc v1.49.0 // indirect
google.golang.org/protobuf v1.28.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
)
Loading