diff --git a/data/README.md b/data/README.md
index 4bb59c2..b5d05a2 100644
--- a/data/README.md
+++ b/data/README.md
@@ -33,6 +33,7 @@
 | alltypes_tiny_pages_plain.parquet | small page sizes with plain encoding with page index [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
 | rle_boolean_encoding.parquet | option boolean columns with RLE encoding |
 | fixed_length_byte_array.parquet | optional FIXED_LENGTH_BYTE_ARRAY column with page index. See [fixed_length_byte_array.md](fixed_length_byte_array.md) for details. |
+| int32_with_null_pages.parquet | optional INT32 column with random null pages. See [int32_with_null_pages.md](int32_with_null_pages.md) for details. |
 | datapage_v1-uncompressed-checksum.parquet | uncompressed INT32 columns in v1 data pages with a matching CRC |
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC |
 | datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
diff --git a/data/int32_with_null_pages.md b/data/int32_with_null_pages.md
new file mode 100644
index 0000000..fe16340
--- /dev/null
+++ b/data/int32_with_null_pages.md
@@ -0,0 +1,73 @@
+
+
+`int32_with_null_pages.parquet` is generated by parquet-mr version 1.13.0-SNAPSHOT.
+
+It has a single column of int32 type with 1000 values and page index enabled.
+
+Both the integer values and the nulls are randomly generated. However, one null page (a page containing only nulls) is generated on purpose.
+
+# File Metadata (from parquet-cli meta command)
+```
+File path:  int32_with_null_pages.parquet
+Created by: parquet-mr version 1.13.0-SNAPSHOT (build 433de8df33fcf31927f7b51456be9f53e64d48b9)
+Properties:
+  writer.model.name: example
+Schema:
+message schema {
+  optional int32 int32_field;
+}
+
+
+Row group 0:  count: 1000  3.33 B records  start: 4  total(compressed): 3.250 kB total(uncompressed):3.250 kB
+--------------------------------------------------------------------------------
+             type      encodings count     avg size   nulls   min / max
+int32_field  INT32     _   _     1000      3.33 B     275     "-2136906554" / "2145722375"
+```
+
+# Column Index (from parquet-cli column-index command)
+```
+row-group 0:
+column index for column int32_field:
+Boundary order: UNORDERED
+                      null count  min                                       max
+page-0                         8  -2135807632                               2144701119
+page-1                        55  -2104090659                               1745329571
+page-2                       100
+page-3                        52  -2116849709                               2077105757
+page-4                        16  -2048691758                               2143189382
+page-5                        12  -2017923401                               2087827129
+page-6                         5  -2136906554                               2125689411
+page-7                         7  -2113313110                               2145722375
+page-8                         8  -2046900272                               2087168549
+page-9                        12  -1941944785                               2078586537
+
+offset index for column int32_field:
+                          offset   compressed size       first row index
+page-0                         4               415                     0
+page-1                       419               220                   100
+page-2                       639                31                   200
+page-3                       670               228                   300
+page-4                       898               382                   400
+page-5                      1280               402                   500
+page-6                      1682               422                   600
+page-7                      2104               411                   700
+page-8                      2515               417                   800
+page-9                      2932               400                   900
+```
diff --git a/data/int32_with_null_pages.parquet b/data/int32_with_null_pages.parquet
new file mode 100644
index 0000000..8263774
Binary files /dev/null and b/data/int32_with_null_pages.parquet differ
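
A reviewer may want to check that the index listings in `int32_with_null_pages.md` agree with each other. The Python sketch below is not part of the patch and does not read the Parquet file. It works only on values transcribed by hand from the column index and offset index listings. The names `PAGES`, `TOTAL_ROWS`, `rows_per_page`, `null_pages`, and `offsets_are_contiguous` are hypothetical helpers introduced for illustration. It assumes a page is a null page when its null count equals its row count, with row counts derived from the first row indexes.

```python
# Values transcribed from the column-index and offset-index listings
# in int32_with_null_pages.md; nothing here reads the Parquet file.
# Each tuple: (null_count, offset, compressed_size, first_row_index)
PAGES = [
    (8, 4, 415, 0),
    (55, 419, 220, 100),
    (100, 639, 31, 200),
    (52, 670, 228, 300),
    (16, 898, 382, 400),
    (12, 1280, 402, 500),
    (5, 1682, 422, 600),
    (7, 2104, 411, 700),
    (8, 2515, 417, 800),
    (12, 2932, 400, 900),
]
TOTAL_ROWS = 1000  # "count: 1000" in the row group summary


def rows_per_page(pages, total_rows):
    """Derive each page's row count from consecutive first row indexes."""
    firsts = [p[3] for p in pages] + [total_rows]
    return [nxt - cur for cur, nxt in zip(firsts, firsts[1:])]


def null_pages(pages, total_rows):
    """Return the ordinals of pages whose null count equals their row count."""
    counts = rows_per_page(pages, total_rows)
    return [i for i, (page, n) in enumerate(zip(pages, counts)) if page[0] == n]


def offsets_are_contiguous(pages):
    """Check that each page starts where the previous page ends."""
    return all(a[1] + a[2] == b[1] for a, b in zip(pages, pages[1:]))


if __name__ == "__main__":
    print("null pages:", null_pages(PAGES, TOTAL_ROWS))
    print("total nulls:", sum(p[0] for p in PAGES))
    print("total compressed bytes:", sum(p[2] for p in PAGES))
    print("offsets contiguous:", offsets_are_contiguous(PAGES))
```

On the transcribed values, page-2 is the only page flagged, which matches its empty min/max entries in the column index. The per-page null counts sum to the 275 nulls reported by the `meta` command.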