GH-468: Clarify MAP logical type #469

Merged on Nov 12, 2024 (4 commits)
Changes from 1 commit
18 changes: 12 additions & 6 deletions LogicalTypes.md
@@ -709,13 +709,18 @@ to values. `MAP` must annotate a 3-level structure:

* The outer-most level must be a group annotated with `MAP` that contains a
single field named `key_value`. The repetition of this level must be either
- `optional` or `required` and determines whether the list is nullable.
+ `optional` or `required` and determines whether the map is nullable.
* The middle level, named `key_value`, must be a repeated group with a `key`
- field for map keys and, optionally, a `value` field for map values.
+ field for map keys and, optionally, a `value` field for map values. It must
+ not contain any other values.
* The `key` field encodes the map's key type. This field must have
- repetition `required` and must always be present.
+ repetition `required` and must always be present. It must be placed at the 0th
+ position of the `key_value` group. It is suggested to use a primitive as the
+ type.
* The `value` field encodes the map's value type and repetition. This field can
- be `required`, `optional`, or omitted.
+ be `required`, `optional`, or omitted. It must be placed at the 1st position
+ of the `key_value` group if present. In case of not present, the map can be
+ either represented with all null values or as a set of keys.
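
To make the rules in this hunk concrete, here is a minimal Python sketch that checks a schema group against them. The `Field` class and `check_map` function are hypothetical names invented for this illustration; they are not part of any Parquet library, and primitive physical types are deliberately left out of the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, minimal model of a schema node (not a real library type)."""
    name: str
    repetition: str                           # "required", "optional" or "repeated"
    logical_type: Optional[str] = None        # e.g. "MAP", "STRING"
    children: Optional[List["Field"]] = None  # None for primitives, a list for groups

    @property
    def is_group(self) -> bool:
        return self.children is not None


def check_map(outer: Field) -> List[str]:
    """Return the rule violations found; an empty list means the group conforms."""
    errors: List[str] = []
    if not outer.is_group or outer.logical_type != "MAP":
        return ["outer level must be a group annotated with MAP"]
    if outer.repetition not in ("optional", "required"):
        errors.append("outer level must be optional or required")
    if len(outer.children) != 1 or outer.children[0].name != "key_value":
        errors.append("outer group must contain a single field named key_value")
        return errors
    key_value = outer.children[0]
    if not key_value.is_group or key_value.repetition != "repeated":
        errors.append("key_value must be a repeated group")
        return errors
    if not 1 <= len(key_value.children) <= 2:
        errors.append("key_value must contain key and, optionally, value; nothing else")
        return errors
    key = key_value.children[0]
    if key.name != "key" or key.repetition != "required":
        errors.append("position 0 must be a required field named key")
    if len(key_value.children) == 2:
        value = key_value.children[1]
        if value.name != "value" or value.repetition not in ("required", "optional"):
            errors.append("position 1 must be a required or optional field named value")
    return errors


# A non-null map from strings to nullable integers, as described in the spec text.
string_to_int = Field("my_map", "required", "MAP", [
    Field("key_value", "repeated", children=[
        Field("key", "required", "STRING"),
        Field("value", "optional"),
    ]),
])
print(check_map(string_to_int))  # prints []
```

This is a strict writer-side check. A reader would normally be more lenient about names, in line with the backward-compatibility rules later in the same file.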

Member

I just realized that the value can be omitted 🤔 When would a case be like that?

Contributor Author

Yeah, I've never seen an actual use case implemented for this one either. I am open to requiring a value, but since this has been in the spec for a while now, I am not sure we can do that. Maybe we could suggest always specifying a value and making it optional with all nulls if it is not used?

Member

Should we move the omitted `value` field to the backward compatibility section, which is all about cases that are not suggested?

Contributor Author

I'm not sure the backward compatibility section would be the best place for this. That section is mainly for supporting Parquet files that were written before the proper spec existed.
Anyway, it seems similar to the nested key question. @JFinis, WDYT?

Contributor

> I just realized that the value can be omitted 🤔 When would a case be like that?

Perhaps when using a MAP to encode a set? 🤷

Member

@wgtmac, Nov 6, 2024

Contributor Author

@wgtmac, @etseidl, WDYT, shall I simply remove this last sentence about handling the case of "missing" values? I don't think we can change the original spec because it clearly states that the value can be omitted. It is then up to each implementation whether and how to support this case.

Member

> In case of not present, the map can be either represented with all null values or as a set of keys.

What about the one below?

In case of not present, it can be represented as a map with all null values or as a set of keys.

Contributor

The original discussion backs up this interpretation. I think it's good to leave the explanation in.
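
The two readings discussed in this thread for a `MAP` whose `value` field is omitted can be sketched in a few lines of Python. Both function names are hypothetical, and the sketch assumes the keys of one row have already been decoded into a Python list; it only illustrates the difference between the two shapes a reader could hand back.

```python
from typing import Dict, FrozenSet, Hashable, Iterable


def as_null_valued_map(keys: Iterable[Hashable]) -> Dict[Hashable, None]:
    """Reading 1: a map in which every key is associated with null."""
    return {key: None for key in keys}


def as_key_set(keys: Iterable[Hashable]) -> FrozenSet[Hashable]:
    """Reading 2: the keys themselves form a set."""
    return frozenset(keys)


row_keys = ["a", "b"]  # assumed: keys decoded from one row of a key-only MAP column
print(as_null_valued_map(row_keys))        # prints {'a': None, 'b': None}
print(as_key_set(row_keys) == {"a", "b"})  # prints True
```

Which of the two a reader exposes is an implementation choice, as noted above; the spec text only says that both are acceptable.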

The following example demonstrates the type for a non-null map from strings to
nullable integers:
@@ -741,18 +746,19 @@ keys.
It is required that the repeated group of key-value pairs is named `key_value`
and that its fields are named `key` and `value`. However, these names may not
be used in existing data and should not be enforced as errors when reading.
+ (`key` and `value` can be identified by their position in case of misnaming.)

Some existing data incorrectly used `MAP_KEY_VALUE` in place of `MAP`. For
backward-compatibility, a group annotated with `MAP_KEY_VALUE` that is not
contained by a `MAP`-annotated group should be handled as a `MAP`-annotated
- group.
+ group. `MAP_KEY_VALUE` may be used for the `key_value` group.
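
As a rough illustration of how a lenient reader might apply these two backward-compatibility rules, consider the Python sketch below. `Node`, `is_map` and `resolve_key_value` are hypothetical names made up for this example, not a real Parquet API. The sketch also assumes that "contained by" refers to the direct parent group, which is one possible interpretation of the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical schema node for illustration; not a real Parquet library type."""
    name: str
    repetition: str
    logical_type: Optional[str] = None
    children: Optional[List["Node"]] = None


def is_map(group: Node, parent: Optional[Node] = None) -> bool:
    """Treat MAP groups, and MAP_KEY_VALUE groups outside a MAP group, as maps."""
    if group.logical_type == "MAP":
        return True
    if group.logical_type == "MAP_KEY_VALUE":
        # Assumption: only the direct parent is inspected.
        return parent is None or parent.logical_type != "MAP"
    return False


def resolve_key_value(map_group: Node) -> Tuple[Node, Optional[Node]]:
    """Pick key and value by position, ignoring the names used in the file."""
    repeated = map_group.children[0]
    fields = repeated.children
    key = fields[0]
    value = fields[1] if len(fields) > 1 else None
    return key, value


# A map whose inner names differ from key_value / key / value.
my_map = Node("my_map", "optional", "MAP", [
    Node("map", "repeated", "MAP_KEY_VALUE", [
        Node("str", "required", "STRING"),
        Node("num", "required"),
    ]),
])
key, value = resolve_key_value(my_map)
print(key.name, value.name)  # prints: str num
```

Note that the sketch never inspects the logical type of the repeated group once the outer group has been recognized as a map; only field positions matter at that level.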

Examples that can be interpreted using these rules:

```
// Map<String, Integer> (nullable map, non-null values)
optional group my_map (MAP) {
Member

@wgtmac, Nov 4, 2024

Is it possible to have the one below?

optional group my_map {
  repeated group map (MAP_KEY_VALUE) {
    required binary str (STRING);
    required int32 num;
  }
}

Contributor Author

No, I don't think so. At the topmost element we need either a MAP or a MAP_KEY_VALUE (for backward compatibility).

Member

What about this? I'm just deducing possible types from the new sentence.

optional group my_map (MAP_KEY_VALUE) {
  repeated group map (MAP_KEY_VALUE) {
    required binary str (STRING);
    required int32 num;
  }
}

Contributor Author

This new sentence should not change anything. I just wanted to highlight why it was originally invented. If it causes confusion, it is not worth it.
(At the repeated level we do not say anything about its potential logical type. So, a reader should ignore anything there anyway.)

Contributor Author

@wgtmac, do you think this sentence is misleading? I'm happy to remove it. Implementations do not need to check the logical type of the repeated field anyway.

Member

I think it is. Especially when we have said that the name of key_value is not required in the previous paragraph.

- repeated group map {
+ repeated group map (MAP_KEY_VALUE) {
required binary str (STRING);
required int32 num;
}