Doc: Add a page explaining migration from other table formats to iceberg #6600

Merged
merged 22 commits into from
Apr 22, 2023
8e6b9a7
add doc for iceberg-delta-lake
JonasJ-ap Jan 16, 2023
65f5888
add compatibility notice for TimestampType
JonasJ-ap Jan 24, 2023
198f389
make the doc the for general table migration
JonasJ-ap Mar 23, 2023
d731a89
refactor table migration doc draft 1
JonasJ-ap Mar 24, 2023
bb9ee84
add terminal graphs
JonasJ-ap Mar 27, 2023
73cc129
adjust web pages alignment
JonasJ-ap Mar 27, 2023
954efda
add pictures link (pictures should be contributed to iceberg-docs ins…
JonasJ-ap Mar 27, 2023
e7f37b3
Re-organize pages and revise contents
JonasJ-ap Mar 31, 2023
0ecbefb
revise delta lake migration page
JonasJ-ap Apr 4, 2023
04c2e91
revise migration introduction and add hive migration page
JonasJ-ap Apr 5, 2023
fd631d2
correct links in these pages
JonasJ-ap Apr 6, 2023
732f306
refactor delta-lake-migration and simplify the description
JonasJ-ap Apr 6, 2023
8960c32
Add link to CTAS, INSERT for full data migration
JonasJ-ap Apr 6, 2023
da2a8c0
add space before each section headers
JonasJ-ap Apr 6, 2023
886930b
remove timestampType read warning since the vectorized support for in…
JonasJ-ap Apr 13, 2023
8b59815
polish delta lake migration part
JonasJ-ap Apr 22, 2023
c2ee4bd
add reader/writer switching description to snapshot/migration steps
JonasJ-ap Apr 22, 2023
7a99342
revise hive migration link
JonasJ-ap Apr 22, 2023
e2a1f37
fix code examples in delta lake migration
JonasJ-ap Apr 22, 2023
5c31585
address final comments
JonasJ-ap Apr 22, 2023
ac462fa
updates link
JonasJ-ap Apr 22, 2023
436dd2d
finalize code example
JonasJ-ap Apr 22, 2023
correct links in these pages
JonasJ-ap committed Apr 6, 2023
commit fd631d2e4bf544c5fcbe0d750d716ecae87df841
20 changes: 10 additions & 10 deletions docs/delta-lake-migration.md
@@ -29,7 +29,7 @@ it is common to migrate all snapshots to maintain the history of the data.
Currently, Iceberg only supports the Snapshot Table action for migrating from Delta Lake to Iceberg tables. It is done via the `iceberg-delta-lake` module
by using [Delta Standalone](https://docs.delta.io/latest/delta-standalone.html) to read logs of Delta Lake tables.
Since Delta Lake tables maintain snapshots, all available snapshots will be committed to the new Iceberg table as transactions in order.
- For Delta Lake tables, any additional data files added after the initial migration will be included in their corresponding snapshots and subsequently added to the new Iceberg table using the Add Snapshot action.
+ For Delta Lake tables, any additional data files added after the initial migration will be included in their corresponding snapshots and subsequently added to the new Iceberg table using the Add Snapshot action.
The Add Snapshot action, a variant of the Add File action, is still under development.
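
The "committed as transactions in order" behavior described above can be pictured with a small model. This is only a sketch in plain Python, not the `iceberg-delta-lake` implementation: `DeltaVersion`, `IcebergTableModel`, and `replay_in_order` are hypothetical names invented for illustration, and each Delta version is assumed to reduce to a set of added and removed data files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeltaVersion:
    """One committed version of the source table (hypothetical, simplified)."""
    version: int
    added: tuple = ()
    removed: tuple = ()


@dataclass
class IcebergTableModel:
    """Toy stand-in for an Iceberg table: a list of snapshots, each a set of file names."""
    snapshots: list = field(default_factory=list)

    def commit(self, added, removed):
        # Each commit derives a new snapshot from the previous one,
        # so earlier snapshots remain available as history.
        current = self.snapshots[-1] if self.snapshots else frozenset()
        self.snapshots.append((current - frozenset(removed)) | frozenset(added))


def replay_in_order(versions):
    """Apply every source version as one commit, oldest version first."""
    table = IcebergTableModel()
    for v in sorted(versions, key=lambda v: v.version):
        table.commit(v.added, v.removed)
    return table


# The input order does not matter; versions are replayed by version number.
versions = [
    DeltaVersion(1, added=("b.parquet",)),
    DeltaVersion(0, added=("a.parquet",)),
    DeltaVersion(2, added=("c.parquet",), removed=("a.parquet",)),
]
table = replay_in_order(versions)
```

In this model the resulting table holds one snapshot per source version, which is what lets the migrated table keep the history of the data.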

## Enabling Migration from Delta Lake to Iceberg
@@ -82,19 +82,19 @@ The delta table's location is required to be provided when initializing the acti

| Argument Name | Required? | Type | Description |
|---------------|-----------|------|-------------|
- |`sourceTableLocation` | Yes | String | The location of the source Delta Lake table |
+ |`sourceTableLocation` | ✔️ | String | The location of the source Delta Lake table |

#### Configurations
The configurations can be given via method chaining.

| Method Name | Arguments | Required? | Type | Description |
|---------------------------|----------------|-----------|--------------------------------------------|--------------------------------------------------------------------------------------------------------------|
- | `as` | `identifier` | Yes | org.apache.iceberg.catalog.TableIdentifier | The identifier of the Iceberg table to be created. |
- | `icebergCatalog` | `catalog` | Yes | org.apache.iceberg.catalog.Catalog | The Iceberg catalog for the Iceberg table to be created |
- | `deltaLakeConfiguration` | `conf` | Yes | org.apache.hadoop.conf.Configuration | The Hadoop Configuration to access Delta Lake Table's log and datafiles |
- | `tableLocation` | `location` | No | String | The location of the Iceberg table to be created. Defaults to the same location as the given Delta Lake table |
- | `tableProperty` | `name`,`value` | No | String, String | A property entry to add to the Iceberg table to be created |
- | `tableProperties` | `properties` | No | Map<String, String> | Properties to add to the Iceberg table to be created |
+ | `as` | `identifier` | ✔️ | org.apache.iceberg.catalog.TableIdentifier | The identifier of the Iceberg table to be created. |
+ | `icebergCatalog` | `catalog` | ✔️ | org.apache.iceberg.catalog.Catalog | The Iceberg catalog for the Iceberg table to be created |
+ | `deltaLakeConfiguration` | `conf` | ✔️ | org.apache.hadoop.conf.Configuration | The Hadoop Configuration to access Delta Lake Table's log and datafiles |
+ | `tableLocation` | `location` | | String | The location of the Iceberg table to be created. Defaults to the same location as the given Delta Lake table |
+ | `tableProperty` | `name`,`value` | | String, String | A property entry to add to the Iceberg table to be created |
+ | `tableProperties` | `properties` | | Map<String, String> | Properties to add to the Iceberg table to be created |

#### Output
| Output Name | Type | Description |
@@ -129,7 +129,7 @@ DeltaLakeToIcebergMigrationActionsProvider.defaultActions()
```

## Migrate Delta Lake Table To Iceberg
- Unsupported
+ **Not yet supported**. This action should read the Delta Lake table's most recent snapshot and convert it to a new Iceberg table with the same name, schema, and partitioning in one Iceberg transaction. The source Delta Lake table should be dropped upon completion of this action.

## Add Files From Delta Lake Table to Iceberg
- Unsupported
+ **Not yet supported**. This action should add files from a Delta version of a Delta Lake table to an existing Iceberg table.
6 changes: 3 additions & 3 deletions docs/hive-migration.md
@@ -40,14 +40,14 @@ To snapshot a Hive table, users can run the following Spark SQL:
```sql
CALL catalog_name.system.snapshot('db.source', 'db.dest')
```
- See [Spark Procedure: snapshot](../spark-procedures/#table-migration) for more details.
+ See [Spark Procedure: snapshot](../spark-procedures/#snapshot) for more details.

## Migrate Hive Table To Iceberg
To migrate a Hive table to Iceberg, users can run the following Spark SQL:
```sql
CALL catalog_name.system.migrate('db.sample')
```
- See [Spark Procedure: migrate](../spark-procedures/#table-migration) for more details.
+ See [Spark Procedure: migrate](../spark-procedures/#migrate) for more details.

## Add Files From Delta Lake Table to Iceberg
Member

Should say hive

Contributor Author

Oops. Thank you for catching this. Sorry for the typo.

I have opened a new PR, #7407, to address that.

To add data files from a Hive table to a given Iceberg table, users can run the following Spark SQL:
@@ -57,4 +57,4 @@ table => 'db.tbl',
source_table => 'db.src_tbl'
)
```
- See [Spark Procedure: add_files](../spark-procedures/#table-migration) for more details.
+ See [Spark Procedure: add_files](../spark-procedures/#add_files) for more details.
2 changes: 1 addition & 1 deletion docs/table-migration.md
@@ -26,7 +26,7 @@ menu:
Apache Iceberg supports converting existing tables in other formats to Iceberg tables. This section introduces the general concept of table migration, its approaches, and existing implementations in Iceberg.

## Migration Approaches
- There are two primary methods for executing table migration: full data migration and in-place metadata migration.
+ There are two methods for executing table migration: full data migration and in-place metadata migration.

Full data migration involves copying all data files from the source table to the new Iceberg table. This method fully isolates the new table from the source table, but it is slower and doubles the storage space required.
In practice, users can use operations like CTAS, INSERT, and Change-Data-Capture pipelines to perform the full data migration.
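
The storage trade-off between the two approaches can be illustrated with a toy model. The sketch below uses plain Python and a temporary directory rather than any Iceberg API. It assumes that in-place metadata migration leaves the existing data files where they are and only records references to them, and the helper names (`full_data_migration`, `in_place_metadata_migration`, `total_bytes`) are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def full_data_migration(source_dir: Path, dest_dir: Path) -> list:
    """Copy every data file so the new table owns independent copies."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(source_dir.glob("*.parquet")):
        copied.append(Path(shutil.copy2(f, dest_dir / f.name)))
    return copied


def in_place_metadata_migration(source_dir: Path) -> list:
    """Write no data: the new table's metadata only references the existing files."""
    return sorted(source_dir.glob("*.parquet"))


def total_bytes(root: Path) -> int:
    """Total size of all data files stored under root."""
    return sum(p.stat().st_size for p in root.rglob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "source_table"
    source.mkdir()
    for name in ("part-0.parquet", "part-1.parquet"):
        # Placeholder bytes, not real Parquet content.
        (source / name).write_bytes(b"x" * 1024)

    before = total_bytes(root)
    referenced = in_place_metadata_migration(source)
    after_in_place = total_bytes(root)  # unchanged: no data was copied
    copied = full_data_migration(source, root / "iceberg_table")
    after_full = total_bytes(root)      # doubled: every file now exists twice
```

The copied files live apart from the source directory, which is the isolation the full approach buys at the cost of time and space; the in-place result still points into the source table's directory.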