This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Nested fields not readable by python lib polars #468

Open
oscar6echo opened this issue Jan 25, 2023 · 2 comments

@oscar6echo

I want to write parquet files in Go and read them in python/polars, and vice versa.
It seems that nested fields ([]int in my example) written by one lib cannot be read by the other; instead they come back as empty lists.

I opened an issue with polars, but it concerns segmentio/parquet-go symmetrically.
See pola-rs/polars#6428

I am surprised that parquet compatibility can be partial.
So I am looking for a way, if possible, to create the nested field in such a way that it is understood by polars.

  • Here is the parquet file creation
  • Is there some other way to create the nested field that would increase compatibility with other libs like polars?
  • Where in the source code is the nested field creation performed?
@achille-roussel achille-roussel self-assigned this Jan 25, 2023
@achille-roussel achille-roussel added the question Further information is requested label Jan 25, 2023
@achille-roussel
Contributor

Hello @oscar6echo, thanks for reporting!

I looked at the issue you opened at pola-rs/polars#6428; it appears that this is due to the underlying pyarrow library not supporting the DELTA_LENGTH_BYTE_ARRAY encoding, which is used by default in parquet-go.

...
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.

I am surprised that parquet compatibility can be partial.

This is somewhat of a double-edged sword with parquet: each column can have a different encoding, and clients can choose to support only a subset of the parquet spec, which can eventually result in incompatibilities like the one you describe. parquet-go actually follows the spec, which says that DELTA_LENGTH_BYTE_ARRAY should be the default encoding for byte array columns.

Is there some other way to create the nested field that would increase compatibility with other libs like polars?

All parquet clients must support the PLAIN encoding, which is less efficient but simpler to implement, so I would recommend in your case that you force the encoding in your schema, for example by adding the plain marker to the parquet tag of Go struct fields:

type Row struct {
	Idx    int       `parquet:"idx,plain"`
	Name   string    `parquet:"name,plain"`
	...
}

With this approach you will be trading off storage space for a greater chance of compatibility between parquet clients.
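
For example, here is a sketch of the full Row struct from your reproduction with the plain marker applied to every field (only a sketch; adjust to your actual field set):

// Sketch: force PLAIN encoding on every column for maximum reader compatibility.
type Row struct {
	Idx    int       `parquet:"idx,plain"`
	Name   string    `parquet:"name,plain"`
	Age    int       `parquet:"age,plain"`
	Sex    bool      `parquet:"sex,plain"`
	Weight float64   `parquet:"weight,plain"`
	Time   time.Time `parquet:"time,plain"`
	Arr    []int     `parquet:"array,plain"`
}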

Where in the source code is the nested field creation performed?

This is a broad question as it depends on how the schema is constructed, but here are a few entry points:

I hope these answers are useful to you. Let me know if you have any other questions!

@oscar6echo
Author

@achille-roussel thx for the fast and comprehensive reply!

I tried the plain option to see if it would achieve compatibility, but then I realised that I had in fact gotten mixed up in my latest comment.

The files produced by parquet-go and polars in this example are indeed incompatible, due to the nested field. But you gave me tips that helped me perhaps pinpoint the difference.

1/ schema produced/read by parquet-go:

  • code
type Row struct {
	Idx    int       `parquet:"idx"`
	Name   string    `parquet:"name"`
	Age    int       `parquet:"age"`
	Sex    bool      `parquet:"sex"`
	Weight float64   `parquet:"weight"`
	Time   time.Time `parquet:"time"`
	Arr    []int     `parquet:"array"`
}
schema := parquet.SchemaOf(new(Row))
fmt.Println(schema)
  • result:
message Row {
	required int64 idx (INT(64,true));
	required binary name (STRING);
	required int64 age (INT(64,true));
	required boolean sex;
	required double weight;
	required int64 time (TIMESTAMP(isAdjustedToUTC=true,unit=NANOS));
	repeated int64 array (INT(64,true));
}

2/ schema produced/read by polars:

  • code
import datetime as dt
from pathlib import Path
import polars as pl
import pyarrow.parquet as pq

now = dt.datetime.now()

s1 = pl.Series("idx", [0, 1], dtype=pl.Int64)
s1 = pl.Series("name", ["Masterfog", "Armspice"], dtype=pl.Utf8)
s2 = pl.Series("age", [22, 23], dtype=pl.Int64)
s3 = pl.Series("sex", [True, False], dtype=pl.Boolean)
s4 = pl.Series("weight", [51.2, 65.3], dtype=pl.Float64)
s5 = pl.Series("time", [now, now], dtype=pl.Datetime)
s6 = pl.Series("array", [[10, 20], [11, 22]], dtype=pl.List(pl.Int64))
df = pl.DataFrame([s1, s2, s3, s4, s5, s6])
path = Path("sample3.pqt")
df.write_parquet(path)

h = pq.ParquetFile(path)
print(h.schema)
  • result:
<pyarrow._parquet.ParquetSchema object at 0x7fd8fb70d540>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional int64 field_id=-1 age;
  optional boolean field_id=-1 sex;
  optional double field_id=-1 weight;
  optional int64 field_id=-1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
      optional int64 field_id=-1 item;
    }
  }
}

Inspecting the schemas produced by each lib shows the difference in parquet format. There is a group notion in the polars-generated parquet that does not exist in the parquet-go output:

// parquet-go
repeated int64 array (INT(64,true));

vs.

// polars
optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
      optional int64 field_id=-1 item;
    }
}

So if I could create the polars format with parquet-go (possibly using intermediate structs?) then it should achieve compatibility.
Does that make sense?
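
One idea I want to try, as a sketch only: if parquet-go supports a list marker in the struct tag (this is an assumption on my part, and RowList is just an illustrative name), it might map the slice to the standard three-level LIST group that polars/pyarrow produce:

// Sketch, not verified: the "list" tag, if supported by parquet-go, should
// encode the slice as the LIST logical type (optional group -> repeated
// group "list" -> optional element), matching the polars schema above.
type RowList struct {
	Idx int   `parquet:"idx"`
	Arr []int `parquet:"array,list"`
}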

I made an attempt to mirror the parquet structure of polars, but it does not seem to work immediately. See trial.go.

Is that possible?
I recognize that polars' nested structure seems a bit convoluted, but it probably makes sense in their context.
Otherwise I should probably try a lower-level lib. In that case, any recommendation?

I would be grateful for any hint, as I think the ability to interact with a polars dataframe from outside its ecosystem is quite interesting, particularly from "Go-land", since Go is quite complementary with Python.
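
For context, this is roughly how I read the file back on the Go side (a sketch using parquet-go's generic reader; the nested "array" column is what comes back empty):

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/segmentio/parquet-go"
)

// Row mirrors the struct from the schema example above.
type Row struct {
	Idx    int       `parquet:"idx"`
	Name   string    `parquet:"name"`
	Age    int       `parquet:"age"`
	Sex    bool      `parquet:"sex"`
	Weight float64   `parquet:"weight"`
	Time   time.Time `parquet:"time"`
	Arr    []int     `parquet:"array"`
}

// Sketch: read back the polars-written file ("sample3.pqt" from the python
// script above) with parquet-go's generic reader.
func main() {
	rows, err := parquet.ReadFile[Row]("sample3.pqt")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(rows) // the nested "array" field currently comes back empty
}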
