Is there any way to access parquet metadata? #347

llimllib · 2021-11-04T00:39:22Z

I'm writing my data file with pyarrow, and adding metadata with:

schema = pa.Schema.from_pandas(df).with_metadata(
    {"updated": datetime.utcnow().isoformat() + "Z"},
)
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, output)

Is there any API for accessing that metadata? If there is, I couldn't find it.

Edit: here's a Python script that creates a parquet file, reloads it, and prints out its schema:

from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = {'col1': [1, 2]}
df = pd.DataFrame(data=d)
schema = pa.Schema.from_pandas(df).with_metadata(
    {"updated": datetime.utcnow().isoformat() + "Z"},
)
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, "data/test.parquet")

t = pq.read_table("data/test.parquet")
print(t.schema)

When I query its metadata via parquet_metadata, I only get the column information back:

>>> await query(conn, `SELECT * FROM parquet_metadata('http://devd.io:8000/data/test.parquet')`)
[
  {
    "file_name": "http://devd.io:8000/data/test.parquet",
    "row_group_id": 0,
    "row_group_num_rows": 2,
    "row_group_num_columns": 1,
    "row_group_bytes": 100,
    "column_id": 0,
    "file_offset": 108,
    "num_values": 2,
    "path_in_schema": "col1",
    "type": "INT64",
    "stats_min": "1",
    "stats_max": "2",
    "stats_null_count": 0,
    "stats_distinct_count": null,
    "stats_min_value": "1",
    "stats_max_value": "2",
    "compression": "SNAPPY",
    "encodings": "PLAIN_DICTIONARY, PLAIN, RLE",
    "index_page_offset": 0,
    "dictionary_page_offset": 4,
    "data_page_offset": 36,
    "total_compressed_size": 104,
    "total_uncompressed_size": 100
  }
]

OK, it seems like this should be filed as a bug against duckdb itself. Here's the same result in the duckdb shell:

D SELECT * FROM parquet_metadata('test.parquet');
┌──────────────┬──────────────┬────────────────────┬───────────────────────┬─────────────────┬───────────┬─────────────┬────────────┬────────────────┬───────┬───────────┬───────────┬──────────────────┬──────────────────────┬─────────────────┬─────────────────┬─────────────┬──────────────────────────────┬───────────────────┬────────────────────────┬──────────────────┬───────────────────────┬─────────────────────────┐
│  file_name   │ row_group_id │ row_group_num_rows │ row_group_num_columns │ row_group_bytes │ column_id │ file_offset │ num_values │ path_in_schema │ type  │ stats_min │ stats_max │ stats_null_count │ stats_distinct_count │ stats_min_value │ stats_max_value │ compression │          encodings           │ index_page_offset │ dictionary_page_offset │ data_page_offset │ total_compressed_size │ total_uncompressed_size │
├──────────────┼──────────────┼────────────────────┼───────────────────────┼─────────────────┼───────────┼─────────────┼────────────┼────────────────┼───────┼───────────┼───────────┼──────────────────┼──────────────────────┼─────────────────┼─────────────────┼─────────────┼──────────────────────────────┼───────────────────┼────────────────────────┼──────────────────┼───────────────────────┼─────────────────────────┤
│ test.parquet │ 0            │ 2                  │ 1                     │ 100             │ 0         │ 108         │ 2          │ col1           │ INT64 │ 1         │ 2         │ 0                │                      │ 1               │ 2               │ SNAPPY      │ PLAIN_DICTIONARY, PLAIN, RLE │ 0                 │ 4                      │ 36               │ 104                   │ 100                     │
└──────────────┴──────────────┴────────────────────┴───────────────────────┴─────────────────┴───────────┴─────────────┴────────────┴────────────────┴───────┴───────────┴───────────┴──────────────────┴──────────────────────┴─────────────────┴─────────────────┴─────────────┴──────────────────────────────┴───────────────────┴────────────────────────┴──────────────────┴───────────────────────┴─────────────────────────┘
D SELECT * FROM parquet_schema('test.parquet');
┌──────────────┬────────┬─────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
│  file_name   │  name  │  type   │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
├──────────────┼────────┼─────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
│ test.parquet │ schema │ BOOLEAN │ 0           │ REQUIRED        │ 1            │ UTF8           │ 0     │ 0         │ 0        │              │
│ test.parquet │ col1   │ INT64   │ 0           │ OPTIONAL        │ 0            │ UTF8           │ 0     │ 0         │ 0        │              │
└──────────────┴────────┴─────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘

Filed as duckdb/duckdb#2534

Answered by ankoh

Nov 4, 2021

Yes, this is an issue for main duckdb. Thanks for posting a summary there, @llimllib.


bmschmidt · 2021-11-04T01:30:53Z

There's a parquet_metadata function built into duckdb, described here; it works for me in Observable with the following code.

client = DuckDBClient.of([
  await FileAttachment("characteristics@2.parquet"),
])

client.table(`SELECT * FROM parquet_metadata('characteristics@2.parquet')`)

2 replies

llimllib Nov 4, 2021
Author

Oh, neat! Thanks. I get the column metadata for my file back with that, but not the metadata I added above. Will keep digging around.

bmschmidt Nov 4, 2021

Oh hmm, I had assumed it was somewhere in there (and was planning on using it!). There's also parquet_schema, but that doesn't include the custom file-level metadata from the parquet footer either.
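
For context, Parquet keeps its file-level key/value metadata in the footer at the end of the file: the Thrift-encoded footer is followed by a 4-byte little-endian length and the `PAR1` magic. A rough sketch of locating that footer with only the standard library follows; `footer_span` is a hypothetical helper name, and the bytes below are synthetic framing, not a real Parquet file.

```python
import struct

MAGIC = b"PAR1"


def footer_span(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer bytes in an unencrypted Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length


# Synthetic bytes with Parquet-style framing, for illustration only.
fake_footer = b"footer-bytes"
fake_file = MAGIC + b"column-data" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

offset, length = footer_span(fake_file)
print(fake_file[offset:offset + length])
```

Decoding the footer itself needs a Thrift reader, which is what pyarrow and duckdb do internally; this only shows where the bytes live.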

ankoh · 2021-11-04T09:08:24Z

Collaborator

Yes, this is an issue for main duckdb. Thanks for posting a summary there, @llimllib.

1 reply

llimllib Nov 4, 2021
Author

duckdb/duckdb#2534 👍 thanks
