Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Class serializer writes map key as optional #396

Closed
crodwell opened this issue Sep 8, 2023 · 11 comments
Closed

[BUG]: Class serializer writes map key as optional #396

crodwell opened this issue Sep 8, 2023 · 11 comments

Comments

@crodwell
Copy link

crodwell commented Sep 8, 2023

Library Version

4.16.3

OS

Any

OS Architecture

64 bit

How to reproduce?

test.proto

message Test {
      message Dimensions {
        repeated string dimensions = 1; 
      }

      string configId = 1;
      int32 timeZone =2;
      map<string, Dimensions> dimensions = 3;
}

protoc -I=./protobuf --csharp_out=. test.proto

When writing to parquet the schema incorrectly makes all strings, including the map key as optional. Bool, float, int etc are all required by default.

- root:
| - ConfigId: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - TimeZone: INT32, INTEGER, INT_32, REQUIRED
| - Dimensions: MAP, MAP, OPTIONAL
|   - key_value: REPEATED
|   | - Key: BYTE_ARRAY, STRING, UTF8, OPTIONAL
|     - Value: OPTIONAL
|       - Dimensions_: LIST, LIST, OPTIONAL
|         - list: REPEATED
|           - element: BYTE_ARRAY, STRING, UTF8, OPTIONAL

I can manually add Parquet.Serialization.Attributes.ParquetRequired as an attribute to the proto class to fix strings, but this doesn't work for Map keys and map keys are not allowed to be nullable in parquet so my output does not work in the consumers. Any other workarounds?

Failing test

No response

@aloneguid
Copy link
Owner

aloneguid commented Sep 8, 2023

I'm failing to see what this library has to do with the issue. Seems like protobuf generator makes those class properties optional. Parquet is not protobuf, so you'll probably have to manually adjust output to make two incompatible formats work?

@crodwell
Copy link
Author

crodwell commented Sep 8, 2023

hi @aloneguid - I'm not sure protobuf generator does make them optional. this comes out OPTIONAL in parquet

 public const int ConfigIdFieldNumber = 1;
      private string configId_ = "";
      
      [global::System.Diagnostics.DebuggerNonUserCodeAttribute]
      public string ConfigId {
        get { return configId_; }
        set {
          configId_ = pb::ProtoPreconditions.CheckNotNull(value, "value");
        }
      }

but this is REQUIRED in parquet:

 public const int TimeZoneFieldNumber = 2;
  private int timeZone_;

  [global::System.Diagnostics.DebuggerNonUserCodeAttribute]
  public int TimeZone {
    get { return timeZone_; }
    set {
      timeZone_ = value;
    }
  }

only way I've found to get strings working is like so

 public const int ConfigIdFieldNumber = 1;
      private string configId_ = "";
      
      [global::System.Diagnostics.DebuggerNonUserCodeAttribute, Parquet.Serialization.Attributes.ParquetRequired]
      public string ConfigId {
        get { return configId_; }
        set {
          configId_ = pb::ProtoPreconditions.CheckNotNull(value, "value");
        }
      }

but I can't get Maps to work - this lib is definitely allowing a map key to be optional which breaks parquet:

 $ parquet-tools inspect test.parquet
Traceback (most recent call last):
  File "/home/crodwell/.local/bin/parquet-tools", line 8, in <module>
    sys.exit(main())
  File "/home/crodwell/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/home/crodwell/.local/lib/python3.10/site-packages/parquet_tools/commands/inspect.py", line 54, in _cli
    _execute_simple(
  File "/home/crodwell/.local/lib/python3.10/site-packages/parquet_tools/commands/inspect.py", line 62, in _execute_simple
    pq_file: pq.ParquetFile = pq.ParquetFile(filename)
  File "/home/crodwell/.local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 334, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1232, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.

@aloneguid aloneguid reopened this Sep 8, 2023
@aloneguid
Copy link
Owner

Fair enough, map keys should (but not must) be required.

@crodwell
Copy link
Author

according to the spec https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
The key field encodes the map's key type. This field must have repetition required and must always be present.

@aloneguid aloneguid changed the title [BUG]: Writing Parquet from protobuf object results in Nullable strings that shouldnt be [BUG]: Class serializer writes map key as optional Sep 11, 2023
@aloneguid
Copy link
Owner

Strange as parquet-mr (used in Spark) allows it. I might add integration tests with PyArrow.

aloneguid added a commit that referenced this issue Sep 11, 2023
@aloneguid
Copy link
Owner

I've just added PyArrow integration test, and it proves that pyArrow requires keys to be maked as "required", unlike identical test with parquet-mr (Java) does not. Will fix in a bit.

@aloneguid
Copy link
Owner

pyarrow now reads schema as

Id: int32 not null
Tags: map<string ('Key'), string ('Value') ('Tags')>
  child 0, Tags: struct<Key: string not null, Value: string> not null
      child 0, Key: string not null
      child 1, Value: string

@aloneguid
Copy link
Owner

seems to be OK now, will release a patch today

image

aloneguid added a commit that referenced this issue Sep 11, 2023
@aloneguid
Copy link
Owner

fixed in latest release

@crodwell
Copy link
Author

Thanks, map issue is definitely working now.

@aloneguid
Copy link
Owner

Thanks for confirming

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants