[PROTOCOL RFC] Variant data type #2867

gene-db · 2024-04-08T21:54:55Z

Which Delta project/connector is this regarding?

Description

Adds the protocol changes for the Variant data type (see #2864) to the RFC folder.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

N/A

protocol_rfcs/variant-type.md

bart-samwel · 2024-04-09T14:23:37Z

protocol_rfcs/variant-type.md

+Struct fields which start with `_` (underscore) can be safely ignored.
+The only non-ignorable fields must be `value` and `metadata`.


This seems to say (somewhat implicitly) that any field that is not `value` or `metadata` should start with an underscore? I don't see why that is necessary if we define that only `value` and `metadata` are non-ignorable.

Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?

The Delta spec already states that clients should ignore unrecognized fields. Does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then an `_` prefix seems reasonable.

Makes sense. I removed this unnecessary line.
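A minimal sketch of the check discussed in this thread, assuming a reader that requires `value` and `metadata` and treats every other struct field as ignorable. That is only one possible reading of the discussion; a future table feature that adds another non-ignorable field would need a different check. The function name and the list-of-names input are hypothetical.

```python
# Hypothetical helper: the RFC does not define this function.
REQUIRED_FIELDS = ("value", "metadata")


def split_variant_struct(field_names):
    """Return (required, extras) for a Variant column's physical struct.

    Raises ValueError if `value` or `metadata` is missing. Extra fields are
    returned separately so the caller can skip them.
    """
    names = list(field_names)
    missing = [name for name in REQUIRED_FIELDS if name not in names]
    if missing:
        raise ValueError(f"Variant struct is missing fields: {missing}")
    extras = [name for name in names if name not in REQUIRED_FIELDS]
    return list(REQUIRED_FIELDS), extras
```

Returning the extras rather than silently dropping them keeps the decision about unrecognized fields with the caller.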

protocol_rfcs/variant-type.md

gene-db

@bart-samwel Thanks! I updated the PR.

I also left a question about how the spec could define non-ignorable fields so that future enhancements remain possible.

protocol_rfcs/variant-type.md

gene-db · 2024-04-09T17:13:53Z

protocol_rfcs/variant-type.md

+Struct fields which start with `_` (underscore) can be safely ignored.
+The only non-ignorable fields must be `value` and `metadata`.


Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?

protocol_rfcs/variant-type.md

scovich · 2024-04-19T19:32:42Z

protocol_rfcs/variant-type.md

+
+## Variant data in Parquet
+
+The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).


Suggested change

-The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
+The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).

scovich · 2024-04-19T19:41:42Z

protocol_rfcs/variant-type.md

+Struct fields which start with `_` (underscore) can be safely ignored.
+The only non-ignorable fields must be `value` and `metadata`.


The Delta spec already states that clients should ignore unrecognized fields. Does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then an `_` prefix seems reasonable.

scovich · 2024-04-19T19:45:46Z

protocol_rfcs/variant-type.md

+## Reader Requirements for Variant Data Type
+
+When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
+- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).


As worded here, "must be able to read... and interpret... in accordance with the... specification" would seem to contradict the next line, which allows readers not to fully support the logical data type?

I guess the main requirements are really:

- Reader must recognize and tolerate a variant data type in a Delta schema
- Reader must use the correct physical schema (struct-of-binary) when reading a Variant column from file
- Reader must make the column available to queries:
    - [Recommended] Expose the logical Variant column, if the engine supports Variant.
    - [Alternate] Expose the raw physical column, e.g. if the engine does not support Variant.

Thanks! Updated.
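The first two requirements listed above can be sketched roughly as follows, assuming the table's `protocol` action and each schema field are available as plain Python dicts. The function names and the `"variant"` logical type name are assumptions made for illustration, not taken from the RFC text quoted here.

```python
# Hypothetical helpers; the dict shapes only loosely mirror Delta's JSON.

def supports_variant(protocol: dict) -> bool:
    """True if the protocol action lists `variantType` in `readerFeatures`."""
    return "variantType" in (protocol.get("readerFeatures") or [])


def physical_schema_for(field: dict) -> dict:
    """Map a logical field to the physical layout used when reading files.

    A Variant column is read as a struct of two binary fields, `value` and
    `metadata`; any other field is passed through unchanged.
    """
    if field["type"] == "variant":  # assumed logical type name
        return {
            "name": field["name"],
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "value", "type": "binary"},
                    {"name": "metadata", "type": "binary"},
                ],
            },
        }
    return field
```

A reader would call `supports_variant` once per table snapshot and `physical_schema_for` once per column before opening data files.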

gene-db

@scovich Thanks! Updated the PR.

gene-db · 2024-04-19T20:30:35Z

protocol_rfcs/variant-type.md

+
+## Variant data in Parquet
+
+The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).


gene-db · 2024-04-19T21:16:11Z

protocol_rfcs/variant-type.md

+## Reader Requirements for Variant Data Type
+
+When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
+- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).


Thanks! Updated.

gene-db · 2024-04-19T21:18:06Z

protocol_rfcs/variant-type.md

+Struct fields which start with `_` (underscore) can be safely ignored.
+The only non-ignorable fields must be `value` and `metadata`.


Makes sense. I removed this unnecessary line.

scovich

LGTM. One question.

scovich · 2024-04-22T14:35:10Z

protocol_rfcs/variant-type.md

+- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
+- must make the column available to the engine:
+    - [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
+    - [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.


Is it also an acceptable alternate to expose a string column by internally converting the struct-of-binary? That gives up all the benefits of the variant encoding, but maximizes compatibility with engines that don't allow users to load the library code that would interpret the struct-of-binary directly.

Yeah, that should work. I added that as another alternate.
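The three exposure options discussed here (logical Variant, raw struct-of-binary, converted string) could be selected along these lines. This is a sketch under stated assumptions: the enum, the function, and both capability flags are hypothetical, and preferring the string form over the raw struct when the reader can convert is a design choice of this example, not something the RFC prescribes.

```python
from enum import Enum


class Exposure(Enum):
    VARIANT = "variant"        # [Recommended] single logical Variant field
    RAW_STRUCT = "raw_struct"  # [Alternate] struct-of-binary as stored
    STRING = "string"          # [Alternate] converted to a string column


def choose_exposure(engine_has_variant: bool, reader_can_convert: bool) -> Exposure:
    """Pick how a Variant column is surfaced to the engine.

    engine_has_variant: the engine has a native Variant type.
    reader_can_convert: the reader can turn the binary encoding into a string
        itself, without the engine loading any extra library code.
    """
    if engine_has_variant:
        return Exposure.VARIANT
    if reader_can_convert:
        return Exposure.STRING
    return Exposure.RAW_STRUCT
```

For example, `choose_exposure(False, True)` selects `Exposure.STRING`, the case described in the question above.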

tdas · 2024-04-24T21:11:17Z

protocol_rfcs/README.md

@@ -22,6 +22,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
 | 2023-02-09    | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md)                                  | https://github.com/delta-io/delta/issues/2623 | Type Widening                 |
 | 2023-02-14    | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md)                              | https://github.com/delta-io/delta/issues/2598 | Managed Commits               |
 | 2023-02-26    | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md)) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
+| 2023-04-08    | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md)                                    | https://github.com/delta-io/delta/issues/2864 | Variant Data Type             |


The gap between this date and when we are ready to merge is large. Can you update this to today's date?

Updated the date.

…2923)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Adds the variant table feature to minimally implement the variant type as described in the RFC in #2867. Also disables using variant columns as partition columns.

## How was this patch tested?

Added some UTs. More UTs will be merged in followup PRs

tested against both spark 3.5 and 4.0 snapshot with

```
build/sbt -DsparkVersion=latest spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite'
build/sbt -DsparkVersion=master spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite'
```

## Does this PR introduce _any_ user-facing changes?

no

Variant data type RFC

be7fa1b

tdas requested review from ryan-johnson-databricks and bart-samwel April 8, 2024 22:01

tdas reviewed Apr 8, 2024

protocol_rfcs/variant-type.md (resolved)

tdas reviewed Apr 8, 2024

protocol_rfcs/variant-type.md (outdated, resolved)

update requirements

9ef7000

bart-samwel reviewed Apr 9, 2024

updates

3e48100

gene-db commented Apr 9, 2024

bart-samwel reviewed Apr 10, 2024

protocol_rfcs/variant-type.md (outdated, resolved)

protocol_rfcs/variant-type.md (outdated, resolved)

update/clarify text

b98277e

bart-samwel approved these changes Apr 11, 2024

richardc-db mentioned this pull request Apr 15, 2024

[SPARK][VARIANT] Add minimal support for variant type in delta-spark #2891

Closed

5 tasks

gene-db requested a review from tdas April 16, 2024 02:15

richardc-db mentioned this pull request Apr 19, 2024

[SPARK][VARIANT] Add minimal support for variant type in delta-spark #2923

Merged

5 tasks

scovich reviewed Apr 19, 2024

Update text

3c3b432

gene-db commented Apr 19, 2024

scovich approved these changes Apr 22, 2024

Update

d4eb6e1

tdas reviewed Apr 24, 2024

update date

b16b256

gene-db requested a review from tdas April 25, 2024 02:53

tdas approved these changes Apr 25, 2024

tdas merged commit 1e2c74f into delta-io:master Apr 25, 2024
7 of 8 checks passed
