
util/parquet: add support for arrays #101860

Merged (1 commit) on Apr 20, 2023

Conversation


@jayshrivastava commented on Apr 19, 2023

This change extends and refactors the util/parquet library to be able to read and write arrays.

Release note: None

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071


@jayshrivastava force-pushed the parquet-just-arrays branch 2 times, most recently from 3e218ba to 5ebf4cd on April 19, 2023 at 18:35
@jayshrivastava marked this pull request as ready for review on April 19, 2023 at 19:32

@miretskiy left a comment


Reviewed 4 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/util/parquet/write_functions.go line 97 at r1 (raw file):

//
// For more info on definition levels and repetition levels, refer to
// https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/

Awesome.
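As an aside for readers following the linked post: the level assignment it describes can be illustrated with a small, self-contained sketch. The names and the column shape here (an optional array of optional scalars, max definition level 3, max repetition level 1) are illustrative only, not taken from this PR:

```go
package main

import "fmt"

// levels holds the (repetition, definition) pair emitted for one value slot.
type levels struct{ rep, def int16 }

// arrayLevels computes levels for a single row of an optional array column
// whose elements are optional scalars. A nil slice means a NULL array; a nil
// element means a NULL inside the array.
func arrayLevels(row []*int) []levels {
	if row == nil {
		// NULL array: nothing above the root is defined.
		return []levels{{rep: 0, def: 0}}
	}
	if len(row) == 0 {
		// Empty array: the array itself is defined, but it has no elements.
		return []levels{{rep: 0, def: 1}}
	}
	out := make([]levels, 0, len(row))
	for i, e := range row {
		rep := int16(0)
		if i > 0 {
			rep = 1 // continuation of the same row's list
		}
		def := int16(3) // present, non-null element
		if e == nil {
			def = 2 // the element slot exists but the value is NULL
		}
		out = append(out, levels{rep: rep, def: def})
	}
	return out
}

func main() {
	one := 1
	fmt.Println(arrayLevels([]*int{&one, nil})) // [{0 3} {1 2}]
	fmt.Println(arrayLevels(nil))               // [{0 0}]
	fmt.Println(arrayLevels([]*int{}))          // [{0 1}]
}
```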


pkg/util/parquet/write_functions.go line 110 at r1 (raw file):

	d tree.Datum, w file.ColumnChunkWriter, a *batchAlloc, wFn writeFn, isArray bool,
) error {
	if isArray {

I feel like this function should have remained as two different functions: one for writing arrays,
and another one for writing regular datums. You can do the switch on isArray at the single call site.

Better yet, can't we change colWriter so that the right function is invoked regardless of
what kind of column it is? Basically, when you create the schema, you currently assign
colWriter to be equal to the element column's writer:

result.colWriter = elementCol.colWriter

But why do that? Why not do something like:

result.colWriter = func(d tree.Datum, w file.ColumnChunkWriter, a *batchAlloc) error {
    return writeArray(...., elementCol.colWriter)
}

(That is, just wrap the result column writer (which writes arrays) with a function that calls writeArray using the correct column-type writer, elementCol.colWriter.)
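The wrapping suggested above could be sketched roughly like this. The types are simplified stand-ins, and `makeArrayColWriter` is a hypothetical helper name, not the PR's actual API:

```go
package main

import "fmt"

// writeFn is a simplified stand-in for the PR's per-type write function.
type writeFn func(v any) error

// writeArray is a stand-in for a helper that writes each array element
// using the element column's write function.
func writeArray(vals []any, elem writeFn) error {
	for _, v := range vals {
		if err := elem(v); err != nil {
			return err
		}
	}
	return nil
}

// makeArrayColWriter wraps an element writer so the array path is chosen
// once, at schema-construction time, rather than per datum.
func makeArrayColWriter(elem writeFn) writeFn {
	return func(v any) error {
		arr, ok := v.([]any)
		if !ok {
			return fmt.Errorf("expected array, got %T", v)
		}
		return writeArray(arr, elem)
	}
}

func main() {
	elemWriter := writeFn(func(v any) error {
		fmt.Println("write scalar:", v)
		return nil
	})
	// Instead of result.colWriter = elementCol.colWriter, install the wrapper:
	colWriter := makeArrayColWriter(elemWriter)
	_ = colWriter([]any{1, 2, 3})
}
```

With this shape, the call site never needs an isArray branch: the decision is baked into the closure when the schema is built.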


@jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/util/parquet/write_functions.go line 110 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I feel like this function should have remained as two different functions: one for writing arrays, and another one for writing regular datums. […]

I think we run into a loop with your last point. writeArray needs to decide the repLevels and defLevels and call the writeFn. However, result.writeFn below calls writeArray.

result.writeFn = func(d tree.Datum, w file.ColumnChunkWriter, a *batchAlloc) error {
    return writeArray(...., elementCol.writeFn)
}

So then instead, we can keep the write function as is and only change writeDatumToColChunk.

result.writeFn =  elementCol.writeFn

func (w *Writer) writeDatumToColChunk(d tree.Datum ...) {
    if d.isArray() {
        writeArray(col.writeFn)
    } else {
        writeScalar(col.writeFn)
    }
}

But I ended up going with this instead, which I thought would be nicer:
the schema column now has a writeFn and a writeInvoker. The invokers are writeScalar and writeArray; they figure out the levels and call the writeFn, and the writeFn encodes the values and writes the bytes to the file.
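A rough, self-contained sketch of the writeFn/writeInvoker split described above. The names and signatures are simplified stand-ins for the PR's API, and the level constants are illustrative:

```go
package main

import "fmt"

// writeFn encodes a single value at the given levels (stand-in for the
// per-type encoders that write bytes to the file).
type writeFn func(v any, repLevel, defLevel int16)

// writeInvoker figures out repetition/definition levels, then calls writeFn.
type writeInvoker func(v any, fn writeFn)

// writeScalar: a present scalar is fully defined and never repeats.
func writeScalar(v any, fn writeFn) {
	fn(v, 0, 1)
}

// writeArray: the first element starts the record (rep 0); subsequent
// elements repeat within it (rep 1).
func writeArray(v any, fn writeFn) {
	for i, e := range v.([]any) {
		rep := int16(0)
		if i > 0 {
			rep = 1
		}
		fn(e, rep, 2)
	}
}

// column is a sketch of a schema column carrying both pieces.
type column struct {
	invoke writeInvoker
	write  writeFn
}

func main() {
	printFn := writeFn(func(v any, rep, def int16) {
		fmt.Printf("value=%v rep=%d def=%d\n", v, rep, def)
	})
	scalarCol := column{invoke: writeScalar, write: printFn}
	arrayCol := column{invoke: writeArray, write: printFn}
	scalarCol.invoke(7, scalarCol.write)
	arrayCol.invoke([]any{1, 2}, arrayCol.write)
}
```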


@miretskiy left a comment


Reviewed 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/util/parquet/write_functions.go line 110 at r1 (raw file):

Previously, jayshrivastava (Jayant) wrote…

I think we run into a loop with your last point. writeArray needs to decide the repLevels and defLevels and call the writeFn. However, result.writeFn below calls writeArray. […]

I like this much better. I'm not too thrilled about the split between writeInvoker and writeFn in the element, though. Do you think something like this would make things a bit cleaner? (It's okay if not and/or you disagree.)

// writer is responsible for writing datum into provided column.
// (basically, this is your write invoker)...
type writer interface {
    Write(d tree.Datum, w file.ColumnChunkWriter, a *batchAlloc) error
}

// arrayWriter -- responsible for writing array values.
// Note: it's just a typedef on writeFn -- meaning we can close over the type
// of the array value we are writing.
type arrayWriter writeFn

func (w arrayWriter) Write(d tree.Datum, cw file.ColumnChunkWriter, a *batchAlloc) error {
    return writeArray(d, cw, a, writeFn(w))
}

// Similarly, scalarWriter just forwards to your writeScalar function
type scalarWriter writeFn

func (w scalarWriter) Write(d tree.Datum, cw file.ColumnChunkWriter, a *batchAlloc) error {
    return writeScalar(d, cw, a, writeFn(w))
}

I think, with the above, you can just have a single "writer" in the schema struct, and just invoke it.
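For readers skimming the thread, the typedef-based design suggested above can be sketched end to end as follows. Here `tree.Datum`, `file.ColumnChunkWriter`, and `*batchAlloc` are replaced with simplified stand-ins so the sketch is self-contained:

```go
package main

import "fmt"

// writeFn is a stand-in for the per-type encoding function.
type writeFn func(v any) error

// writer is responsible for writing a datum into its column; it is the
// single entry point the schema struct needs to hold.
type writer interface {
	Write(v any) error
}

// scalarWriter is a typedef on writeFn: the concrete element encoder is
// carried by the value itself, so Write can forward to it directly.
type scalarWriter writeFn

func (w scalarWriter) Write(v any) error {
	return writeFn(w)(v)
}

// arrayWriter is also a typedef on writeFn; Write applies the element
// encoder to every element of the array.
type arrayWriter writeFn

func (w arrayWriter) Write(v any) error {
	for _, e := range v.([]any) {
		if err := writeFn(w)(e); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	enc := func(v any) error { fmt.Println("encode:", v); return nil }
	// The schema holds one writer per column; array vs. scalar is decided
	// once, when the writer is constructed.
	cols := []writer{scalarWriter(enc), arrayWriter(enc)}
	_ = cols[0].Write(42)
	_ = cols[1].Write([]any{1, 2})
}
```

The design choice here is that a defined function type can have methods in Go, so the encoder closure and the dispatch logic travel together without a separate struct.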

@jayshrivastava force-pushed the parquet-just-arrays branch 2 times, most recently from d3f2ad7 to 3cbca9e on April 20, 2023 at 15:04

@jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/util/parquet/write_functions.go line 110 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I like this much better; I'm not too thrilled about the split with writeInvoker and writeFn in the element. […]

Done.


@miretskiy left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@jayshrivastava

bors r=miretskiy


craig bot commented Apr 20, 2023

Build failed:

@jayshrivastava

bors retry

@jayshrivastava

bors ping


craig bot commented Apr 20, 2023

pong

@jayshrivastava

bors r=miretskiy


craig bot commented Apr 20, 2023

Already running a review


craig bot commented Apr 20, 2023

Build succeeded:

craig bot merged commit ccc9d02 into cockroachdb:master on Apr 20, 2023.
@jayshrivastava mentioned this pull request on May 10, 2023.