
[C++][Parquet] Forced UTF8 encoding of BYTE_ARRAY on stream::read/write #42971

@asfimport

Description

From StreamReader (excerpt):

```cpp
StreamReader& StreamReader::operator>>(optional<std::string>& v) {
  CheckColumn(Type::BYTE_ARRAY, ConvertedType::UTF8);
  ByteArray ba;
  // ...
}
```

From StreamWriter (excerpt):

```cpp
StreamWriter& StreamWriter::WriteVariableLength(const char* data_ptr,
                                                std::size_t data_len) {
  CheckColumn(Type::BYTE_ARRAY, ConvertedType::UTF8);
  // ...
}
```

Although the C++ parquet::schema::Node allows a physical type of BYTE_ARRAY with ConvertedType::NONE, the stream reader/writer classes throw whenever ConvertedType != UTF8.

std::string is, unfortunately, the canonical byte buffer class in C++.
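
For illustration, here is a minimal sketch that trips the check when streaming raw bytes; the file name, column name, and sample bytes are arbitrary placeholders, not taken from a real workload:

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/file_writer.h>
#include <parquet/schema.h>
#include <parquet/stream_writer.h>

int main() {
  using parquet::ConvertedType;
  using parquet::Repetition;
  using parquet::Type;
  namespace schema = parquet::schema;

  // The schema layer happily accepts BYTE_ARRAY with ConvertedType::NONE.
  auto blob = schema::PrimitiveNode::Make("blob", Repetition::REQUIRED,
                                          Type::BYTE_ARRAY, ConvertedType::NONE);
  auto root = std::static_pointer_cast<schema::GroupNode>(
      schema::GroupNode::Make("schema", Repetition::REQUIRED, {blob}));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink, arrow::io::FileOutputStream::Open("blob.parquet"));

  parquet::StreamWriter os{parquet::ParquetFileWriter::Open(sink, root)};

  try {
    // WriteVariableLength() insists on ConvertedType::UTF8 and throws here,
    // even though the raw bytes are perfectly valid BYTE_ARRAY data.
    os << std::string("\x00\x01\x02", 3) << parquet::EndRow;
  } catch (const parquet::ParquetException& e) {
    std::cerr << "threw: " << e.what() << std::endl;
  }
  return 0;
}
```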

A simple approach might be to add operator>>/operator<< overloads for parquet::ByteArray that call CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE) and let the user take it from there; they could reuse the same machinery as the existing std::string overloads. Just an idea.
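
A rough sketch of what such overloads could look like, modeled on the std::string path quoted above; nothing below exists in the library, and the bodies only indicate intent rather than a tested patch:

```cpp
// Proposed additions (sketch only, not part of the library today).

// StreamWriter: accept a raw ByteArray for BYTE_ARRAY columns that carry no
// converted type, i.e. check for ConvertedType::NONE instead of UTF8.
StreamWriter& StreamWriter::operator<<(const ByteArray& v) {
  CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE);
  // ...then reuse the same ByteArrayWriter::WriteBatch() path that
  // WriteVariableLength() already uses for std::string values.
  return *this;
}

// StreamReader: the mirror image of the std::string extraction operators.
StreamReader& StreamReader::operator>>(ByteArray& v) {
  CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE);
  // ...then read the value with the existing BYTE_ARRAY column reader
  // plumbing, handing back the ByteArray instead of copying into a string.
  return *this;
}
```

That would keep std::string reserved for genuine UTF8 columns while still giving callers a way to stream raw binary data.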

I am new to this forum and have assigned MAJOR priority to this bug, but I gladly defer to those with a better grasp of classification.

Reporter: ian

Note: This issue was originally created as PARQUET-1958. Please see the migration documentation for further details.
