[Feature Request] Widen the type of existing columns or fields without rewriting the table #2622

Open
3 of 8 tasks
johanl-db opened this issue Feb 9, 2024 · 0 comments

@johanl-db
Collaborator

johanl-db commented Feb 9, 2024

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Delta tables currently lack the ability to change the type of a column or nested field after the table has been created. Changing a type currently requires copying the whole table, either by actually creating a copy of the table or by copying the data to the new type in place using, for example, column mapping.

This feature request specifically targets widening type changes. With a widening change, all values present in files written before the type change are guaranteed to be promotable to the new, wider type without risk of overflow or precision loss.
In particular, the following type changes can be supported:

  • byte -> short -> int -> long
  • float -> double
  • decimal precision/scale increase as long as the precision increases by at least as much as the scale to avoid loss of precision/overflow
  • date -> timestamp_ntz - dates don't have a timezone and can only be promoted to timestamp without timezone unambiguously.
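
The rules above can be sketched as a small check. This is only an illustration, not Delta's implementation: the function name `is_widening` and the string encoding of types are hypothetical, and the sketch assumes the integer chain is transitive (for example, `byte` directly to `long`) and that a decimal's scale never decreases.

```python
import re

# Integer types ordered from narrowest to widest (assumed transitive).
INTEGRAL_ORDER = ["byte", "short", "int", "long"]

# Hypothetical string encoding for decimals: "decimal(precision,scale)".
DECIMAL_RE = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_widening(from_type: str, to_type: str) -> bool:
    """Return True if from_type -> to_type matches one of the widening rules listed above."""
    if from_type in INTEGRAL_ORDER and to_type in INTEGRAL_ORDER:
        return INTEGRAL_ORDER.index(from_type) < INTEGRAL_ORDER.index(to_type)
    if (from_type, to_type) in {("float", "double"), ("date", "timestamp_ntz")}:
        return True
    src = DECIMAL_RE.fullmatch(from_type)
    dst = DECIMAL_RE.fullmatch(to_type)
    if src and dst:
        p1, s1 = map(int, src.groups())
        p2, s2 = map(int, dst.groups())
        # Precision must grow by at least as much as the scale, so the number
        # of integer digits (precision - scale) never shrinks.
        grows_enough = (p2 - p1) >= (s2 - s1)
        return s2 >= s1 and grows_enough and (p2, s2) != (p1, s1)
    return False
```

For instance, `is_widening("decimal(10,2)", "decimal(12,4)")` is accepted because the precision grows by 2 and the scale by 2, while `decimal(10,2)` to `decimal(11,4)` is rejected because it would lose an integer digit.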

Motivation

The type of a column or field is mostly fixed once the table has been created: we only allow setting a column or field to nullable.
The type of a column can become too narrow to store the required values in the lifetime of a table, for example:

  • IDs stored in an integer column exceed the 31 bits that the type can hold, and the column type needs to be extended to long.
  • A decimal column was initially created with a given precision and new data with a higher precision needs to be ingested.

The only way to handle these situations today is to manually rewrite the table: add a new column with the desired type and copy the data to the new column. This can be expensive for large tables that must be rewritten, and it will conflict with every concurrent operation.

Further details

Design Doc: https://docs.google.com/document/d/1KIqf6o6JMD7e8aMrGlUROSwTfzYeW4NCIZVAUMW_-Tc

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@johanl-db johanl-db added the enhancement New feature or request label Feb 9, 2024
@johanl-db johanl-db self-assigned this Feb 9, 2024
@johanl-db johanl-db added this to the 3.2.0 milestone Feb 29, 2024
vkorukanti pushed a commit that referenced this issue Feb 29, 2024
## Description
This change introduces the `typeWidening` Delta table feature, which allows widening the type of existing columns and fields in a Delta table using the `ALTER TABLE CHANGE COLUMN TYPE` or `ALTER TABLE REPLACE COLUMNS` commands.

The table feature is introduced as `typeWidening-dev` during implementation and is available in testing only.

For now, only byte -> short -> int is supported. Other changes require support in the Spark Parquet reader, which will be introduced in Spark 4.0.

Type widening feature request: #2622
Type Widening protocol RFC: #2624

A new test suite `DeltaTypeWideningSuite` is created, containing:
- `DeltaTypeWideningAlterTableTests`: Covers applying supported and unsupported type changes on partitioned columns, non-partitioned columns and nested fields
- `DeltaTypeWideningTableFeatureTests`: Covers adding the `typeWidening` table feature

## This PR introduces the following *user-facing* changes

The table feature is available in testing only; there are no user-facing changes as of now.

The type widening table feature will introduce the following changes:
- Adding the `typeWidening` table feature via a table property:
```
ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true)
```
- Apply a widening type change:
```
ALTER TABLE t CHANGE COLUMN int_col TYPE long
```
or
```
ALTER TABLE t REPLACE COLUMNS (int_col long)
```

Note: both ALTER TABLE commands reuse the existing syntax for setting a table property and applying a type change, no new SQL syntax is being introduced by this feature.

Closes #2645

GitOrigin-RevId: 2ca0e6b22ec24b304241460553547d0d4c6026a2
allisonport-db pushed a commit that referenced this issue Mar 7, 2024
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

This change is part of the type widening table feature.
Type widening feature request: #2622
Type Widening protocol RFC: #2624

It introduces metadata to record information about type changes that were applied using `ALTER TABLE`. This metadata is stored in table schema, as specified in https://github.com/delta-io/delta/pull/2624/files#diff-114dec1ec600a6305fe7117bed7acb46e94180cdb1b8da63b47b12d6c40760b9R28

For example, changing a top-level column `a` from `short` to `int` and later from `int` to `long` will update the schema to include metadata:
```
{
  "name" : "a",
  "type" : "long",
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "tableVersion": 1,
        "fromType": "short",
        "toType": "integer"
      },
      {
        "tableVersion": 5,
        "fromType": "integer",
        "toType": "long"
      }
    ]
  }
}
```
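
As a rough sketch of how such metadata could be maintained, the helper below appends one entry to `delta.typeChanges` each time a top-level field's type changes. The function name `record_type_change` and the use of plain dictionaries are assumptions for illustration; nested fields and the actual Delta code path are not covered.

```python
import copy


def record_type_change(field: dict, to_type: str, table_version: int) -> dict:
    """Return a copy of a schema field with its type changed and the change recorded.

    `field` is a plain dict shaped like the JSON schema entry shown above.
    """
    updated = copy.deepcopy(field)  # leave the caller's field untouched
    changes = updated.setdefault("metadata", {}).setdefault("delta.typeChanges", [])
    changes.append(
        {"tableVersion": table_version, "fromType": field["type"], "toType": to_type}
    )
    updated["type"] = to_type
    return updated


# Usage: widen column `a` from integer to long at table version 5.
field = {"name": "a", "type": "integer", "nullable": True, "metadata": {}}
widened = record_type_change(field, "long", table_version=5)
```

Keeping the history as an append-only list means a reader can tell, for any file written at an earlier table version, which type that file was written with.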

- A new test suite `DeltaTypeWideningMetadataSuite` is created to cover methods handling type widening metadata.
- Tests covering adding metadata to the schema when running `ALTER TABLE CHANGE COLUMN` are added to `DeltaTypeWideningSuite`

Closes #2708

GitOrigin-RevId: cdbb7589f10a8355b66058e156bb7d1894268f4d
vkorukanti pushed a commit that referenced this issue Mar 15, 2024
This PR includes changes from
#2708 which isn't merged yet.
The changes related only to dropping the table feature are in commit
e2601a6


## Description
This change is part of the type widening table feature.
Type widening feature request:
#2622
Type Widening protocol RFC: #2624

It adds the ability to remove the type widening table feature by running
the `ALTER TABLE DROP FEATURE` command.
Before dropping the table feature, traces of it are removed from the
current version of the table:
- Files that were written before the latest type change and thus contain
types that differ from the current table schema are rewritten using an
internal `REORG TABLE` operation.
- Metadata in the table schema recording previous type changes is
removed.

## How was this patch tested?
- A new set of tests is added to `DeltaTypeWideningSuite` to cover
dropping the table feature with tables in various states: with/without
files to rewrite or metadata to remove.

## Does this PR introduce _any_ user-facing changes?
The table feature is available in testing only; there are no user-facing
changes as of now.

When the feature is available, this change enables the following user
action:
- Drop the type widening table feature:
```
ALTER TABLE t DROP FEATURE typeWidening
```
This succeeds immediately if no version of the table contains traces of
the table feature (that is, no type changes were applied in the available
history of the table).
Otherwise, if the current version contains traces of the feature, these
are removed: files are rewritten if needed and type widening metadata is
removed from the table schema. Then, an error
`DELTA_FEATURE_DROP_WAIT_FOR_RETENTION_PERIOD` is thrown, telling the
user to retry once the retention period expires.

If only previous versions contain traces of the feature, no action is
applied on the table, and an error
`DELTA_FEATURE_DROP_HISTORICAL_VERSIONS_EXIST` is thrown, telling the
user to retry once the retention period expires.
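
The three outcomes described above can be summarized as a small decision function. This is a sketch of the described behavior only: the function `drop_type_widening`, its boolean inputs, and the `DropOutcome` enum are hypothetical names, and the actual file rewrite and metadata removal are reduced to a flag.

```python
from enum import Enum


class DropOutcome(Enum):
    DROPPED = "feature dropped"
    WAIT_FOR_RETENTION = "DELTA_FEATURE_DROP_WAIT_FOR_RETENTION_PERIOD"
    HISTORICAL_VERSIONS_EXIST = "DELTA_FEATURE_DROP_HISTORICAL_VERSIONS_EXIST"


def drop_type_widening(current_has_traces: bool, history_has_traces: bool):
    """Return (outcome, cleanup_performed) for a DROP FEATURE attempt.

    current_has_traces: the current version has files to rewrite or
        type widening metadata in its schema.
    history_has_traces: some earlier version still within the retention
        period contains such traces.
    """
    if current_has_traces:
        # Files are rewritten and metadata is removed, then the user is
        # asked to retry after the retention period.
        return DropOutcome.WAIT_FOR_RETENTION, True
    if history_has_traces:
        # Nothing is changed on the table; the user has to wait.
        return DropOutcome.HISTORICAL_VERSIONS_EXIST, False
    # No traces anywhere in the available history: the drop succeeds.
    return DropOutcome.DROPPED, False
```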
tdas pushed a commit that referenced this issue Mar 22, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This change is part of the type widening table feature.
Type widening feature request:
#2622
Type Widening protocol RFC: #2624

It adds automatic type widening as part of schema evolution in MERGE
INTO:
- During resolution of the `DeltaMergeInto` plan, when merging the
target and source schema to compute the schema after evolution, we keep
the wider source type when type widening is enabled on the table.
- When updating the table schema at the beginning of MERGE execution,
metadata is added to the schema to record type changes.

## How was this patch tested?
- A new test suite `DeltaTypeWideningSchemaEvolutionSuite` is added to
cover type evolution in MERGE

## This PR introduces the following *user-facing* changes
The table feature is available in testing only, there are no user-facing
changes as of now.

When automatic schema evolution is enabled in MERGE and the source
schema contains a type that is wider than the target schema:

With type widening disabled: the type in the target schema is not
changed. The ingestion behavior follows the `storeAssignmentPolicy`
configuration:
- LEGACY: source values that overflow the target type are stored as
`null`
- ANSI: a runtime check is injected to fail on source values that
overflow the target type.
- STRICT: the MERGE operation fails during analysis.

With type widening enabled: the type in the target schema is updated to
the wider source type. The MERGE operation always succeeds:
```
-- target: key int, value short
-- source: key int, value int
MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET *
```
After the MERGE operation, the target schema is `key int, value int`.
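
To make the contrast with the disabled case concrete, the sketch below simulates writing `int` source values into a `short` target column under each `storeAssignmentPolicy`, following the behavior described above. It is a plain-Python illustration, not Spark code: the function name `write_int_to_short` and the exception types are assumptions.

```python
SHORT_MIN, SHORT_MAX = -(2**15), 2**15 - 1


def write_int_to_short(values, policy):
    """Simulate storing int values in a short column with type widening disabled."""
    policy = policy.upper()
    if policy not in {"LEGACY", "ANSI", "STRICT"}:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "STRICT":
        # Rejected during analysis, before any value is inspected.
        raise TypeError("cannot write int into short under STRICT")
    stored = []
    for v in values:
        if SHORT_MIN <= v <= SHORT_MAX:
            stored.append(v)
        elif policy == "ANSI":
            # Runtime check fails on the first overflowing value.
            raise OverflowError(f"{v} does not fit in short")
        else:
            # LEGACY: overflowing values are stored as null, as described above.
            stored.append(None)
    return stored
```

With type widening enabled none of these branches apply: the target column becomes `int` and every source value is stored unchanged.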
tdas pushed a commit that referenced this issue Mar 25, 2024
#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This change is part of the type widening table feature.
Type widening feature request:
#2622
Type Widening protocol RFC: #2624

It adds automatic type widening as part of schema evolution in INSERT.
During resolution, when schema evolution and type widening are enabled,
type differences between the input query and the target table are
handled as follows:
- If the type difference qualifies for automatic type evolution: the
input type is left as is, the data will be inserted with the new type
and the table schema will be updated in `ImplicitMetadataOperation`
(already implemented as part of MERGE support)
- If the type difference doesn't qualify for automatic type evolution:
the current behavior is preserved: a cast is added from the input type
to the existing target type.

## How was this patch tested?
- Tests are added to `DeltaTypeWideningAutomaticSuite` to cover type
evolution in INSERT

## This PR introduces the following *user-facing* changes
The table feature is available in testing only; there are no user-facing
changes as of now.

When automatic schema evolution is enabled in INSERT and the source
schema contains a type that is wider than the target schema:

With type widening disabled: the type in the target schema is not
changed. A cast is added to the input so that it matches the expected
target type.

With type widening enabled: the type in the target schema is updated to
the wider source type.
```
-- target: key int, value short
-- source: key int, value int
INSERT INTO target SELECT * FROM source
```
After the INSERT operation, the target schema is `key int, value int`.
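
The per-column decision described in this change can be sketched as follows. The function `resolve_insert_column`, its return shape, and the restriction to `byte`, `short` and `int` (the changes stated as supported so far) are assumptions for illustration, not the actual resolution rule in Delta.

```python
# Type changes described as supported so far, narrowest to widest.
SUPPORTED_ORDER = ["byte", "short", "int"]


def resolve_insert_column(input_type, target_type, schema_evolution, type_widening):
    """Decide how one input column is written to the target table."""
    if input_type == target_type:
        return {"action": "none", "write_type": target_type}
    qualifies = (
        input_type in SUPPORTED_ORDER
        and target_type in SUPPORTED_ORDER
        and SUPPORTED_ORDER.index(target_type) < SUPPORTED_ORDER.index(input_type)
    )
    if schema_evolution and type_widening and qualifies:
        # Input type is left as is; the table schema is updated to match it.
        return {"action": "evolve", "write_type": input_type}
    # Existing behavior: cast the input to the current target type.
    return {"action": "cast", "write_type": target_type}
```

For the example above, `resolve_insert_column("int", "short", True, True)` yields an `evolve` action, while the same call with type widening disabled falls back to a cast to `short`.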
tdas pushed a commit that referenced this issue Apr 24, 2024
## Description
Expose the type widening table feature outside of testing and set its
preview user-facing name: typeWidening-preview (instead of
typeWidening-dev used until now).

Feature description: #2622
The type changes that are supported for now are `byte` -> `short` ->
`int`. Other types depend on Spark changes which are going to land in
Spark 4.0 and will be available once Delta picks up that Spark version.

## How was this patch tested?
Extensive testing in `DeltaTypeWidening*Suite`.

## Does this PR introduce _any_ user-facing changes?
User facing changes were already covered in PRs implementing this
feature. In short, it allows:
- Adding the type widening table feature (using a table property)
```
ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true);
```
- Manual type changes:
```
ALTER TABLE t CHANGE COLUMN col TYPE INT;
```
- Automatic type changes via schema evolution:
```
CREATE TABLE target (id int, value short);
CREATE TABLE source (id int, value int);
SET spark.databricks.delta.schema.autoMerge.enabled = true;
INSERT INTO target SELECT * FROM source;
-- value now has type int in target
```
- Dropping the table feature, which rewrites data to make the table
readable by all readers:
```
ALTER TABLE t DROP FEATURE 'typeWidening'
```