Parquet Reader Decoder Support Status #9767

Open
yingsu00 opened this issue May 10, 2024 · 1 comment
Labels
enhancement New feature or request parquet

Comments

@yingsu00
Collaborator

yingsu00 commented May 10, 2024

Description

Normal Data Page Types

| Velox Type | Parquet LogicalType | Parquet ConvertedType | Parquet Storage Type | Supported? |
| --- | --- | --- | --- | --- |
| BOOLEAN | | | BOOLEAN (1 bit) | Partial |
| Tinyint | INT(8, true) | INT_8 = 15 (deprecated) | INT32 | Y |
| Smallint | INT(16, true) | INT_16 = 16 (deprecated) | INT32 | Y |
| Integer | INT(32, true) | INT_32 = 17 (deprecated) | INT32 | Y |
| Bigint | INT(64, true) | INT_64 = 18 (deprecated) | INT64 | Y |
| Tinyint | INT(8, false) | UINT_8 = 11 (deprecated) | INT32 | Y |
| Smallint | INT(16, false) | UINT_16 = 12 (deprecated) | INT32 | Y |
| Integer | INT(32, false) | UINT_32 = 13 (deprecated) | INT32 | Y |
| Bigint | INT(64, false) | UINT_64 = 14 (deprecated) | INT64 | Y |
| Hugeint | | | FIXED_LEN_BYTE_ARRAY (len = 16) | Y |
| ShortDecimal | Decimal 1 <= precision <= 9 | DECIMAL = 5 | INT32 | Y |
| ShortDecimal | Decimal 1 <= precision <= 18 | DECIMAL = 5 | INT64 | Y |
| Short/LongDecimal | Decimal precision limited by len | DECIMAL = 5 | FIXED_LEN_BYTE_ARRAY | Y |
| Short/LongDecimal | Decimal precision unlimited | DECIMAL = 5 | BYTE_ARRAY | N |
| Real | FLOAT16 | | FIXED_LEN_BYTE_ARRAY (len = 2) | N |
| Real | | | FLOAT | Y |
| Double | | | DOUBLE | Y |
| DateType | DATE | DATE = 6 | INT32 | Y |
| | TIME(isAdjustedToUTC=True/False, unit=MILLIS) | TIME_MILLIS = 7 (deprecated) | INT32 | N |
| | TIME(isAdjustedToUTC=True/False, unit=MICROS) | TIME_MICROS = 8 (deprecated) | INT64 | N |
| | TIME(isAdjustedToUTC=True/False, unit=NANOS) | | INT64 | N |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=MILLIS) | TIMESTAMP_MILLIS = 9 (deprecated) | INT64 | #8325 |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=MICROS) | TIMESTAMP_MICROS = 10 (deprecated) | | #8325 |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=NANOS) | | | #8325 |
| Timestamp | | | INT96 (deprecated) | N |
| CustomType::TimeStampWithTimeZone | TIMESTAMP(isAdjustedToUTC=False) | | INT64 | N |
| IntervalDayTimeType | INTERVAL | INTERVAL = 21 | FIXED_LEN_BYTE_ARRAY (len=12) | N |
| IntervalYearMonthType | INTERVAL | INTERVAL = 21 | FIXED_LEN_BYTE_ARRAY (len=12) | N |
| VARCHAR | STRING | UTF8 = 0 | BYTE_ARRAY | Y |
| VARCHAR | ENUM | ENUM = 4 | BYTE_ARRAY | Y |
| VARCHAR | UUID | | FIXED_LEN_BYTE_ARRAY (len=16) | N |
| VARBINARY | STRING | | BYTE_ARRAY | N |
| CustomType::JSON | JSON | JSON = 19 | BYTE_ARRAY | N |
| | BSON | BSON = 20 | BYTE_ARRAY | N |
| Array | LIST | LIST = 3 | | Y |
| Row | LIST | LIST = 3 | | Y |
| Map | MAP | MAP_KEY_VALUE = 2 | | Y |
| Map | MAP | MAP = 1 | | Y |
| UnknownType | UNKNOWN (always null) | | | |
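
The decimal rows above bound precision by the physical type: up to 9 digits for INT32, up to 18 for INT64, and "limited by len" for FIXED_LEN_BYTE_ARRAY. As a rough illustration, the limit for an n-byte two's-complement value can be computed as floor(log10(2^(8n-1) - 1)), which is how the Parquet format documentation states it. The function name below is hypothetical and is not part of Velox or Parquet.

```python
def max_decimal_precision(num_bytes: int) -> int:
    """Largest decimal precision that always fits in a signed
    two's-complement integer stored in num_bytes bytes.

    Hypothetical helper for illustration only.
    """
    # Largest positive unscaled value representable in num_bytes bytes.
    max_unscaled = 2 ** (8 * num_bytes - 1) - 1
    # floor(log10(x)) equals the digit count of x minus one.
    return len(str(max_unscaled)) - 1


for width in (4, 8, 16):
    print(width, max_decimal_precision(width))
```

Under this formula a 4-byte value holds 9 digits, an 8-byte value 18, and a 16-byte FIXED_LEN_BYTE_ARRAY 38, which lines up with the INT32 and INT64 rows of the table.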

Normal Data Page Encodings

| Parquet Storage Type | Parquet Encoding | Version | Supported |
| --- | --- | --- | --- |
| BOOLEAN (1 bit) | Plain (0) | 1 | Y |
| BOOLEAN (1 bit) | RLE/BP (3) | 1 | N |
| INT32 | Plain (0) | 1 | Y |
| INT32 | DELTA_BINARY_PACKED (5) | 2 | Y |
| INT64 | DELTA_BINARY_PACKED (5) | 2 | Y |
| FLOAT | Plain (0) | 1 | Y |
| FLOAT | BYTE_STREAM_SPLIT (9) | 2 | N |
| DOUBLE | Plain (0) | 1 | Y |
| DOUBLE | BYTE_STREAM_SPLIT (9) | 2 | N |
| FIXED_LEN_BYTE_ARRAY | Plain (0) | 1 | Partial for certain types |
| FIXED_LEN_BYTE_ARRAY | DELTA_BYTE_ARRAY (7) | 2 | N |
| BYTE_ARRAY | Plain (0) | 1 | Y |
| BYTE_ARRAY | DELTA_BYTE_ARRAY (7) | 2 | N |
| BYTE_ARRAY | DELTA_LENGTH_BYTE_ARRAY (6) | 2 | N |
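
For readers unfamiliar with BYTE_STREAM_SPLIT, which the table lists as unsupported for FLOAT and DOUBLE, here is a minimal pure-Python sketch of what the encoding does as I understand it from the Parquet format documentation: byte k of every value goes into stream k, and the streams are concatenated. The function names are hypothetical; this is not the Velox or Arrow implementation.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter the bytes of each value into per-position streams.

    fmt is a struct format for one value: "<f" for FLOAT, "<d" for DOUBLE.
    """
    width = struct.calcsize(fmt)
    raw = [struct.pack(fmt, v) for v in values]
    # Stream k holds byte k of every value; streams are concatenated in order.
    return b"".join(bytes(r[k] for r in raw) for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Gather one byte from each stream to rebuild every value."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    return [
        struct.unpack(fmt, bytes(data[k * count + i] for k in range(width)))[0]
        for i in range(count)
    ]


encoded = byte_stream_split_encode([1.0, -2.5, 3.25])
print(encoded.hex())
print(byte_stream_split_decode(encoded))
```

The encoding does not shrink the data by itself; it groups bytes that tend to be similar so that a following compression codec has more to work with.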

Dictionary Page Encodings

| Parquet Type | Parquet Encoding | Supported |
| --- | --- | --- |
| BOOLEAN | Plain (0) | Y |
| INT32 | Plain (0) | Y |
| INT64 | Plain (0) | Y |
| INT96 (deprecated) | Plain (0) | N |
| FLOAT | Plain (0) | Y |
| DOUBLE | Plain (0) | Y |
| BYTE_ARRAY | Plain (0) | Y |
| FIXED_LEN_BYTE_ARRAY | Plain (0) | Y |

Repetition/Definition Levels

| Parquet Type | Parquet Encoding | Supported |
| --- | --- | --- |
| INT32 | RLE/BP (3) | Y (needs to be updated) |
| INT32 | BIT_PACKED (4) (deprecated) | N |
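
As background for the RLE/BP (3) row, the following is a minimal sketch of a decoder for the RLE/bit-packing hybrid format, written from my reading of the Parquet format documentation rather than from the Velox code. It assumes the 4-byte length prefix that precedes level data in V1 data pages has already been stripped, and the function names are hypothetical.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_rle_bp_hybrid(buf, bit_width, num_values):
    """Decode num_values integers from an RLE/bit-packed hybrid buffer."""
    values, pos = [], 0
    byte_width = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each,
            # packed starting from the least significant bit.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                values.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one little-endian value repeated (header >> 1) times.
            run_length = header >> 1
            value = int.from_bytes(buf[pos:pos + byte_width], "little")
            pos += byte_width
            values.extend([value] * run_length)
    # A bit-packed run may carry padding values past the requested count.
    return values[:num_values]


# An RLE run of four 5s followed by one bit-packed group holding 0..7.
data = bytes([0x08, 0x05, 0x03, 0x88, 0xC6, 0xFA])
print(decode_rle_bp_hybrid(data, bit_width=3, num_values=12))
```

The sketch does no bounds checking and will raise on truncated input, which a production reader would need to handle.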
@jkhaliqi
Contributor

jkhaliqi commented Sep 13, 2024

Parquet Version 2 Data Types

Parquet file created from Presto Java

The two encodings that did not show up when creating Parquet files from Presto Java were BYTE_STREAM_SPLIT (Float, Double) and DELTA_LENGTH_BYTE_ARRAY (varchar, string, binary).

We were also unable to produce these encodings with Spark. With Apache Arrow, however, we could create a Parquet file that uses them by adjusting the WriterProperties, as described in this doc: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties

The following table lists the Parquet type and encoding for each column type in a V2 Parquet table created from Presto Java:

| Presto Type | Parquet Type | Parquet Encodings |
| --- | --- | --- |
| Boolean | Boolean | RLE |
| TinyInt | INT32 | DELTA_BINARY_PACKED |
| smallint | INT32 | DELTA_BINARY_PACKED |
| Integer | INT32 | DELTA_BINARY_PACKED |
| Bigint | INT64 | DELTA_BINARY_PACKED |
| REAL | FLOAT | PLAIN |
| DOUBLE | DOUBLE | PLAIN |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | RLE_DICTIONARY |
| VARCHAR | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| Char | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| VarBinary | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| JSON | create table tmp(json json); Query 20240829_223247_00157_hpkyz failed: No default Hive type provided for unsupported Hive type: json | On docs for parquet: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
| Date | INT32 | DELTA_BINARY_PACKED |
| Time | create table tmp(time time); Query 20240829_223030_00153_hpkyz failed: No default Hive type provided for unsupported Hive type: time | |
| Time With Time Zone | | |
| Timestamp | INT64 | DELTA_BINARY_PACKED |
| Timestamp with timezone | | |
| Interval year to month | create table tmp(iym interval year to month); Query 20240829_223424_00163_hpkyz failed: No default Hive type provided for unsupported Hive type: interval year to month | |
| Interval day to second | create table tmp(iym interval); Query 20240829_223452_00165_hpkyz failed: line 1:18: Unknown type 'interval' for column 'iym' create table tmp(iym interval) | |
| array(integer) | INT32 | DELTA_BINARY_PACKED |
| array(boolean) | BOOLEAN | RLE |
| map(integer, integer) | INT32 | DELTA_BINARY_PACKED |
| row("f0" varbinary, "f1" timestamp) | Broke it down to what was inside -> | Broke it down to what was inside -> {"PathInSchema":["P0","F0"],"Type":"BYTE_ARRAY","Encodings":["RLE_DICTIONARY"],"CompressedSize":186422,"UncompressedSize":223747,"NumValues":1000,"CompressionCodec":"GZIP"},{"PathInSchema":["P0","F1"],"Type":"INT64","Encodings":["RLE_DICTIONARY"],"CompressedSize":2527,"UncompressedSize":2500,"NumValues":1000,"NullCount":234,"MaxValue":9197623049880936755,"MinValue":58472672228734950,"CompressionCodec":"GZIP"} |
| IPADDRESS | create table tmp(ipaddress ipaddress); Query 20240829_222932_00151_hpkyz failed: No default Hive type provided for unsupported Hive type: ipaddress | |
| IPPREFIX | create table tmp(ip ipprefix); Query 20240829_223604_00167_hpkyz failed: No default Hive type provided for unsupported Hive type: ipprefix | |
| UUID | create table tmp(u uuid); Query 20240829_223636_00168_hpkyz failed: No default Hive type provided for unsupported Hive type: uuid | On docs for parquet: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
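
Presto Java writes VARCHAR, Char and VarBinary columns with DELTA_BYTE_ARRAY, which the encoding table in the issue description marks as unsupported for BYTE_ARRAY. As a rough illustration of the idea behind that encoding (incremental or front coding), the sketch below splits each value into the length of the prefix it shares with the previous value plus the remaining suffix. In the real format the prefix lengths are then written with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY; that step is omitted here, and the function names are hypothetical.

```python
def front_code(values):
    """Split each value into (shared-prefix length, suffix) relative to
    the previous value. Conceptual step of DELTA_BYTE_ARRAY only."""
    prefix_lengths, suffixes = [], []
    prev = b""
    for value in values:
        n = 0
        limit = min(len(prev), len(value))
        while n < limit and prev[n] == value[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(value[n:])
        prev = value
    return prefix_lengths, suffixes


def front_decode(prefix_lengths, suffixes):
    """Rebuild the values by reusing the prefix of the previous value."""
    out, prev = [], b""
    for n, suffix in zip(prefix_lengths, suffixes):
        prev = prev[:n] + suffix
        out.append(prev)
    return out


words = [b"axis", b"axle", b"babble", b"babyhood"]
lengths, tails = front_code(words)
print(lengths, tails)
print(front_decode(lengths, tails) == words)
```

Because each value depends on the one before it, a reader has to decode values in order within a page, which is part of what a decoder for this encoding needs to account for.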
