-
Notifications
You must be signed in to change notification settings - Fork 928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table #3843
base: master
Are you sure you want to change the base?
Changes from all commits
2c7c302
5219a94
0bb5889
b80667d
ec851e2
cd60d6c
c2226d4
fcafa9c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,15 +18,21 @@ | |
|
||
package org.apache.paimon.flink.clone; | ||
|
||
import org.apache.paimon.CoreOptions; | ||
import org.apache.paimon.catalog.Catalog; | ||
import org.apache.paimon.catalog.Identifier; | ||
import org.apache.paimon.flink.FlinkCatalogFactory; | ||
import org.apache.paimon.fs.Path; | ||
import org.apache.paimon.options.Options; | ||
import org.apache.paimon.schema.Schema; | ||
import org.apache.paimon.schema.TableSchema; | ||
import org.apache.paimon.table.FileStoreTable; | ||
import org.apache.paimon.utils.Preconditions; | ||
|
||
import org.apache.paimon.shade.guava30.com.google.common.collect.ImmutableList; | ||
import org.apache.paimon.shade.guava30.com.google.common.collect.ImmutableMap; | ||
import org.apache.paimon.shade.guava30.com.google.common.collect.Iterables; | ||
|
||
import org.apache.flink.api.java.tuple.Tuple2; | ||
import org.apache.flink.streaming.api.operators.AbstractStreamOperator; | ||
import org.apache.flink.streaming.api.operators.OneInputStreamOperator; | ||
|
@@ -37,6 +43,7 @@ | |
import java.util.ArrayList; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
|
||
/** | ||
* Pick the files to be cloned of a table based on the input record. The record type it produce is | ||
|
@@ -77,7 +84,7 @@ public void processElement(StreamRecord<Tuple2<String, String>> streamRecord) th | |
FileStoreTable sourceTable = (FileStoreTable) sourceCatalog.getTable(sourceIdentifier); | ||
targetCatalog.createDatabase(targetIdentifier.getDatabaseName(), true); | ||
targetCatalog.createTable( | ||
targetIdentifier, Schema.fromTableSchema(sourceTable.schema()), true); | ||
targetIdentifier, newSchemaFromTableSchema(sourceTable.schema()), true); | ||
|
||
List<CloneFileInfo> result = | ||
toCloneFileInfos( | ||
|
@@ -95,6 +102,18 @@ public void processElement(StreamRecord<Tuple2<String, String>> streamRecord) th | |
} | ||
} | ||
|
||
private static Schema newSchemaFromTableSchema(TableSchema tableSchema) { | ||
return new Schema( | ||
ImmutableList.copyOf(tableSchema.fields()), | ||
ImmutableList.copyOf(tableSchema.partitionKeys()), | ||
ImmutableList.copyOf(tableSchema.primaryKeys()), | ||
ImmutableMap.copyOf( | ||
Iterables.filter( | ||
tableSchema.options().entrySet(), | ||
entry -> !Objects.equals(entry.getKey(), CoreOptions.PATH.key()))), | ||
tableSchema.comment()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, when creating a new schema for the clone table, all table options from the original table are directly passed into the new schema, including the table path of the source table. This is then ignored during table creation, which seems incorrect. Additionally, after prohibiting the passing of custom table paths in the FileSystemCatalog, the clone table's unit tests would fail. Therefore, I have made a fix here. Moreover, the original fromTableSchema method, when constructing the new schema, directly used the partitionKeys and primaryKeys instances from the source schema, which is not safe. So, I have made a copy here. I have also moved this method from the Schema class to here, as this is the only place where it is needed. If you feel it is sufficiently general, I can put it back. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good to me. |
||
} | ||
|
||
private List<CloneFileInfo> toCloneFileInfos( | ||
List<Path> files, | ||
Path sourceTableRoot, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,7 @@ | |
import java.util.List; | ||
import java.util.Locale; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
import java.util.Optional; | ||
import java.util.Set; | ||
import java.util.function.Function; | ||
|
@@ -493,7 +494,18 @@ protected void createTableImpl(Identifier identifier, Schema schema) { | |
// if changes on Hive fails there is no harm to perform the same changes to files again | ||
TableSchema tableSchema; | ||
try { | ||
tableSchema = schemaManager(identifier).createTable(schema, usingExternalTable()); | ||
Path tableRoot; | ||
if (schema.options().containsKey(CoreOptions.PATH.key())) { | ||
checkArgument( | ||
Objects.equals(createTableType(), TableType.EXTERNAL), | ||
"The HiveCatalog only supports specifying location when creating an external table"); | ||
tableRoot = new Path(schema.options().get(CoreOptions.PATH.key())); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Location specification is only allowed when creating an external table. |
||
} else { | ||
tableRoot = getTableLocation(identifier); | ||
} | ||
|
||
tableSchema = | ||
schemaManager(identifier, tableRoot).createTable(schema, usingExternalTable()); | ||
} catch (Exception e) { | ||
throw new RuntimeException( | ||
"Failed to commit changes of table " | ||
|
@@ -707,10 +719,7 @@ public String warehouse() { | |
|
||
private Table newHmsTable(Identifier identifier, Map<String, String> tableParameters) { | ||
long currentTimeMillis = System.currentTimeMillis(); | ||
TableType tableType = | ||
OptionsUtils.convertToEnum( | ||
hiveConf.get(TABLE_TYPE.key(), TableType.MANAGED.toString()), | ||
TableType.class); | ||
TableType tableType = createTableType(); | ||
Table table = | ||
new Table( | ||
identifier.getTableName(), | ||
|
@@ -735,6 +744,11 @@ private Table newHmsTable(Identifier identifier, Map<String, String> tableParame | |
return table; | ||
} | ||
|
||
private TableType createTableType() { | ||
return OptionsUtils.convertToEnum( | ||
hiveConf.get(TABLE_TYPE.key(), TableType.MANAGED.toString()), TableType.class); | ||
} | ||
|
||
private void updateHmsTable(Table table, Identifier identifier, TableSchema schema) { | ||
StorageDescriptor sd = table.getSd() != null ? table.getSd() : new StorageDescriptor(); | ||
|
||
|
@@ -792,7 +806,11 @@ private void updateHmsTable(Table table, Identifier identifier, TableSchema sche | |
} | ||
|
||
// update location | ||
locationHelper.specifyTableLocation(table, getTableLocation(identifier).toString()); | ||
String location = | ||
schema.options().containsKey(CoreOptions.PATH.key()) | ||
? schema.options().get(CoreOptions.PATH.key()) | ||
: getTableLocation(identifier).toString(); | ||
locationHelper.specifyTableLocation(table, location); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If provided, use the table path provided by the user. |
||
} | ||
|
||
private void updateHmsTablePars(Table table, TableSchema schema) { | ||
|
@@ -816,8 +834,11 @@ private FieldSchema convertToFieldSchema(DataField dataField) { | |
} | ||
|
||
private SchemaManager schemaManager(Identifier identifier) { | ||
return new SchemaManager( | ||
fileIO, getTableLocation(identifier), identifier.getBranchNameOrDefault()) | ||
return schemaManager(identifier, getTableLocation(identifier)); | ||
} | ||
|
||
private SchemaManager schemaManager(Identifier identifier, Path path) { | ||
return new SchemaManager(fileIO, path, identifier.getBranchNameOrDefault()) | ||
.withLock(lock(identifier)); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just relax this check? For example, If it is the same path, we should allow it.
I am quite concerned that SHOW CREATE TABLE may not run properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing. I would like to share my thoughts on the prohibition of setting table location in
FileSystemCatalog
.I think the
SHOW CREATE TABLE
command is usually used to create a test table with the same schema as the source table. Users typically useSHOW CREATE TABLE
to get the DDL of a table and then create a test table with a different name for testing purposes. In this scenario, the location information in the DDL will inevitably mismatch the path assigned by theFileSystemCatalog
for the new table, requiring users to modify the DDL in order to successfully create the table.So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?
Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction in the documentation. We can declare the location management restrictions of
FileSystemCatalog
and the recommended usage of Spark DDL. What do you think?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think a relax check will be better. "create a test table with a different name for testing purposes", we can have very clear exception for this case.