Nested fields not readable by python lib polars #468
Comments
Hello @oscar6echo, thanks for reporting! I looked at the issue you opened at pola-rs/polars#6428, it appears that this is due to the underlying pyarrow library not supporting the
This is somewhat of a double-edged sword with parquet: each column can have a different encoding, and clients can choose to support only a subset of the parquet spec, which can eventually result in incompatibilities like the one you describe. parquet-go actually follows the spec which says that
All parquet clients must support the
type Row struct {
	Idx int `parquet:"idx,plain"`
	Name string `parquet:"name,plain"`
	...
}
With this approach, you will be trading off storage space for a greater chance of compatibility between parquet clients.
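To make the tag syntax above concrete, here is a minimal sketch of how a `parquet:"idx,plain"` tag breaks down into a column name followed by comma-separated options. It uses only the standard library; `splitTag` is a hypothetical helper written for illustration and is not parquet-go's own parser.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Row mirrors the struct from the comment above. To the Go compiler the
// tags are plain strings; a library interprets them at run time.
type Row struct {
	Idx  int    `parquet:"idx,plain"`
	Name string `parquet:"name,plain"`
}

// splitTag is a hypothetical helper (not part of parquet-go). It returns
// the column name and the remaining comma-separated options of a tag.
func splitTag(tag string) (name string, options []string) {
	parts := strings.Split(tag, ",")
	return parts[0], parts[1:]
}

func main() {
	t := reflect.TypeOf(Row{})
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, opts := splitTag(f.Tag.Get("parquet"))
		fmt.Printf("%s -> column %q, options %v\n", f.Name, name, opts)
	}
}
```

The point is only that the encoding is requested per field, so it can be forced on the columns that cause trouble and left at the default elsewhere.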
This is a broad question as it depends on how the schema is constructed, but here are a few entry points:
I hope these answers are useful to you. Let me know if you have any other questions!
@achille-roussel thanks for the fast and comprehensive reply! I tried to use the
The files produced by parquet-go and polars in this example are indeed incompatible, due to the nested field. But you gave me tips that may have helped me pinpoint the difference.
1/ schema produced/read by parquet-go:
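package main

import (
	"fmt"
	"time"

	"github.com/segmentio/parquet-go"
)

type Row struct {
	Idx    int       `parquet:"idx"`
	Name   string    `parquet:"name"`
	Age    int       `parquet:"age"`
	Sex    bool      `parquet:"sex"`
	Weight float64   `parquet:"weight"`
	Time   time.Time `parquet:"time"`
	Arr    []int     `parquet:"array"`
}

func main() {
	// Derive the parquet schema from the Go struct and print it.
	schema := parquet.SchemaOf(new(Row))
	fmt.Println(schema)
}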
message Row {
required int64 idx (INT(64,true));
required binary name (STRING);
required int64 age (INT(64,true));
required boolean sex;
required double weight;
required int64 time (TIMESTAMP(isAdjustedToUTC=true,unit=NANOS));
repeated int64 array (INT(64,true));
}
2/ schema produced/read by polars:
import datetime as dt
from pathlib import Path
import polars as pl
import pyarrow.parquet as pq
now = dt.datetime.now()
# One Series per column; idx gets its own variable (s0) so that it is not
# overwritten by the name column.
s0 = pl.Series("idx", [0, 1], dtype=pl.Int64)
s1 = pl.Series("name", ["Masterfog", "Armspice"], dtype=pl.Utf8)
s2 = pl.Series("age", [22, 23], dtype=pl.Int64)
s3 = pl.Series("sex", [True, False], dtype=pl.Boolean)
s4 = pl.Series("weight", [51.2, 65.3], dtype=pl.Float64)
s5 = pl.Series("time", [now, now], dtype=pl.Datetime)
s6 = pl.Series("array", [[10, 20], [11, 22]], dtype=pl.List(pl.Int64))
df = pl.DataFrame([s0, s1, s2, s3, s4, s5, s6])
path = Path("sample3.pqt")
df.write_parquet(path)
h = pq.ParquetFile(path)
print(h.schema)
Inspecting the schemas produced by each lib shows the difference in parquet format. There is a
So if I could create the polars format with parquet-go (possibly using intermediate structs?), then it should achieve compatibility. I made an attempt to mirror the parquet structure of polars, but it does not seem to work immediately. See trial.go. Is that possible?
I would be grateful for any hint, as I think the ability to interact with a polars dataframe from outside its ecosystem is quite interesting, particularly from "Goland", as Go is quite complementary with Python.
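For readers wondering what "intermediate structs" could look like, here is a minimal sketch. It assumes the nested column follows the three-level LIST layout described in the Parquet format specification. The group and field names `list` and `element` come from that specification; the names polars actually writes are not shown in this thread, so they are an assumption here. All type and function names below are hypothetical. Nesting structs this way only mirrors the shape: it does not by itself attach the LIST logical-type annotation, so the parquet-go documentation should be checked for how to request that.

```go
package main

import "fmt"

// Hypothetical intermediate types mirroring the three-level LIST layout
// from the Parquet format specification:
//
//	optional group array (LIST) {
//	  repeated group list {
//	    optional int64 element;
//	  }
//	}
//
// Assumption: the names "list" and "element" are taken from the
// specification, not from a file written by polars.
type ListEntry struct {
	Element int64 `parquet:"element"`
}

type Int64List struct {
	List []ListEntry `parquet:"list"`
}

type NestedRow struct {
	Idx   int64     `parquet:"idx"`
	Array Int64List `parquet:"array"`
}

// toNested converts a flat slice into the nested representation.
func toNested(idx int64, values []int64) NestedRow {
	entries := make([]ListEntry, len(values))
	for i, v := range values {
		entries[i] = ListEntry{Element: v}
	}
	return NestedRow{Idx: idx, Array: Int64List{List: entries}}
}

func main() {
	row := toNested(0, []int64{10, 20})
	fmt.Printf("%+v\n", row)
}
```

Keeping the conversion in one helper means the rest of the program can keep working with plain `[]int64` values and only switch representation at the file boundary.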
I want to read/write parquet files in Go and exchange them with Python/polars.
It seems that the nested fields ([]int in my example) written by one lib cannot be read by the other: it returns empty lists instead.
I opened an issue with polars, but it concerns segmentio/parquet-go symmetrically.
See pola-rs/polars#6428
I am surprised that parquet compatibility can be partial.
So I am looking for a way, if possible, to create the nested field in such a way it is understood by polars.