Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement StringViewArray and BinaryViewArray reading/writing in parquet #5530

Closed
Tracked by #5374
alamb opened this issue Mar 18, 2024 · 11 comments
Closed
Tracked by #5374

Implement StringViewArray and BinaryViewArray reading/writing in parquet #5530

alamb opened this issue Mar 18, 2024 · 11 comments
Labels
parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Mar 18, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

The parquet crate has a reader and writer for reading/writing parquet data to arrow:

Describe the solution you'd like
I would like to be able to read a StringViewArray and BinaryViewArray directly from the reader and writer with no data copies (so the raw byte values are not copied).

  1. Add functionality
  2. Add tests

Describe alternatives you've considered

For example, I think we need to add the support to the writer here

ArrowDataType::Dictionary(_, value_type) => match value_type.as_ref() {
ArrowDataType::Utf8 | ArrowDataType::LargeUtf8 | ArrowDataType::Binary | ArrowDataType::LargeBinary => {
out.push(bytes(leaves.next().unwrap()))
}
_ => {
out.push(col(leaves.next().unwrap()))
}
}
_ => return Err(ParquetError::NYI(
format!(
"Attempting to write an Arrow type {data_type:?} to parquet that is not yet implemented"
)
))
}

Additional context

The reader/writer already handles DictionaryArrays which I think could serve as a model for the view arrays.

@ariesdevil reports they are working on this feature #5374 (comment)

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2024

@mapleFU brought up a good point on #5557 (comment) that I wanted to record here. The observation is that that the special handling StringView / BinaryView will be substantially more code in the parquet decoder.

The reason for adding the special case is to avoid a copy

For example as I understand it, from the parquet encodings doc

...
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
...

So the data looks like this (length prefix, followed by the bytes):

\3\0\0\0foo\26\0\0\0abcdefghijklmnoprstuvwxyz

To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:

offets: [0, 3, 29]
data: fooabcdefghijklmnoprstuvwxyz

However, for a StringView array, the raw bytes can be used without copying

views: [(len: 3, data:"foo"), (len:26, offset8)] 
\3\0\0\0foo\26\0\0\0abcdefghijklmnoprstuvwxyz

@mapleFU
Copy link
Member

mapleFU commented Apr 12, 2024

In parquet, ARROW:SCHEMA is used to identify the schema and extended info in parquet file. I think the trick point is that, without key-value metadata, both view and string are stored as same thing. So I think we can add optimization for "read string as stringview", but I also think maybe storing a stringview as string can make some legacy reader not confused about it.

For some view, this might work for some encoding( like PLAIN and DELTA_LENGTH_BYTE_ARRAY), but might be tricky for DELTA_BYTE_ARRAY.

Besides, do we need to "copy" the buffer when we'd like to clone or concat the array?

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2024

In parquet, ARROW:SCHEMA is used to identify the schema and extended info in parquet file. I think the trick point is that, without key-value metadata, both view and string are stored as same thing. So I think we can add optimization for "read string as stringview", but I also think maybe storing a stringview as string can make some legacy reader not confused about it.

This is a good point. Maybe something we could consider is to support setting the desired type directly on the reader
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html

Something like

let reader = ArrowReaderBuilder::try_new(...)
  // override any schema declared in the file
  .with_schema(schema)
  .build()?;

If we supported a similar API for the writer than we could write StringViewArary without any extra copies but the metadata stored in the file could say "read this as a StringView" 🤔

@mapleFU
Copy link
Member

mapleFU commented Apr 12, 2024

Yes, my view is that:

  1. writer side: cast string view/binary view to string/binary, keeping the "large" and metadata. This also not confuse legacy reader
  2. reader side: user can choose read by string or string view

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2024

I think the current writer has the very nice property that RecordBatches round trp cleanly -- they can be written to parquet and then re-read and will be equal.

Thus, yy personal preference is that the writer should (by default) write the metadata that matches the data it was passed (so in this case specify StringView) . The reader should likewise read out the type that was written unless the user overrides the behavior for some reason.

I think it would be valuable to add an option on the writer to write in "compatiblity" mode (and store string/binary data with StringArray metadata, for example)

@mapleFU
Copy link
Member

mapleFU commented Apr 12, 2024

This may differ from how we treat "view"( For both list view and string view ). If we think it's a part of string or list, it might better be string/list in metadata. If we regard it as a completely different type, we should regard it as string-view/list-view

For ipc, they're completely different type. For parquet, they stored nothing different, however parquet uses ipc schema here. ::arrow::ipc::ReadSchema might work bad in this scenerio, which might regard "large-string-view-with-metadata" to another type. I'm ok with this but I think at least we can testing this

I believe a new file can be read well, but the nested info might be lost. But anyway this might not severe

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2024

however parquet uses ipc schema here. ::arrow::ipc::ReadSchema might work bad in this scenerio,

Thanks @mapleFU -- I don't fully understand the concern -- if the embedded metadata in a parquet file says it was written as a StringViewArray I would expect the parqet readers that support that type to read it out as a StringViewArray unless they don't support the type (old readers) or the user specifies they want to read the data from parquet as a different type

@mapleFU
Copy link
Member

mapleFU commented Apr 12, 2024

unless they don't support the type

IMO at least, a release without StringView/BinaryView's spec (rather than don't support read StringViewArray) they might meet a "parse failed" and cannot parse an ARROW:SCHEMA. This doesn't matter if we think it's ok.

@alamb
Copy link
Contributor Author

alamb commented Apr 13, 2024

unless they don't support the type

IMO at least, a release without StringView/BinaryView's spec (rather than don't support read StringViewArray) they might meet a "parse failed" and cannot parse an ARROW:SCHEMA. This doesn't matter if we think it's ok.

I see -- I think we are both in agreement that the user of the Rust parquet crate should be able to decide how they want to handle this situation. We may just have different opinions on what a better default is.

As long as both options are possible, I could be convinced either way on the default value

@mapleFU
Copy link
Member

mapleFU commented Apr 13, 2024

Yes, I'm also ok with writing a string view type, and maybe we can generate a test file first

@alamb
Copy link
Contributor Author

alamb commented Jun 17, 2024

Update here is that we have pretty fast support for reading into StringView and BinaryView arrays now thanks to @ariesdevil and @XiangpengHao so closing this ticket

There is follow on work tracked in #5904 in case anyone wants to try and make it faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants