I'm writing my data file with pyarrow and adding metadata with:

schema = pa.Schema.from_pandas(df).with_metadata(
    {"updated": datetime.utcnow().isoformat() + "Z"},
)
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, output)

Is there any API for accessing that metadata? If there is, I couldn't find it.

Edit: here's a Python script to create a parquet file, reload it, and print out its schema:

from datetime import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
d = {'col1': [1, 2]}
df = pd.DataFrame(data=d)
schema = pa.Schema.from_pandas(df).with_metadata(
    {"updated": datetime.utcnow().isoformat() + "Z"},
)
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, "data/test.parquet")
t = pq.read_table("data/test.parquet")
print(t.schema)

When I query its metadata via

>>> await query(conn, `SELECT * FROM parquet_metadata('http://devd.io:8000/data/test.parquet')`)
[
{
"file_name": "http://devd.io:8000/data/test.parquet",
"row_group_id": 0,
"row_group_num_rows": 2,
"row_group_num_columns": 1,
"row_group_bytes": 100,
"column_id": 0,
"file_offset": 108,
"num_values": 2,
"path_in_schema": "col1",
"type": "INT64",
"stats_min": "1",
"stats_max": "2",
"stats_null_count": 0,
"stats_distinct_count": null,
"stats_min_value": "1",
"stats_max_value": "2",
"compression": "SNAPPY",
"encodings": "PLAIN_DICTIONARY, PLAIN, RLE",
"index_page_offset": 0,
"dictionary_page_offset": 4,
"data_page_offset": 36,
"total_compressed_size": 104,
"total_uncompressed_size": 100
}
]

OK, it seems like this might want to be a bug filed against duckdb itself? Here's the duckdb shell:

Filed as duckdb/duckdb#2534
There's a
Yes, this is an issue for main duckdb. Thanks for posting a summary there, @llimllib.