
[Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table #3843

Open · wants to merge 8 commits into base: master

Conversation

@zhongyujiang (Contributor) commented Jul 30, 2024

Purpose

This relocates the 'path' property in the table options to 'location' for better presentation.

  • Adapt 'path' to Spark's 'location' in table props (the 'path' option is still preserved in table props)
  • Support customizing the table location when creating a table (only allowed when creating an external table with the Hive catalog)

Spark uses the reserved property location to indicate the location of the table, and in the output of DESC EXTENDED, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                            |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a                           |bigint                                                                                               |NULL   |
|b                           |varchar(10)                                                                                          |NULL   |
|c                           |char(10)                                                                                             |NULL   |
|                            |                                                                                                     |       |
|# Metadata Columns          |                                                                                                     |       |
|__paimon_file_path          |string                                                                                               |       |
|__paimon_row_index          |bigint                                                                                               |       |
|                            |                                                                                                     |       |
|# Detailed Table Information|                                                                                                     |       |
|Name                        |default.testTableAs                                                                                  |       |
|Type                        |MANAGED                                                                                              |       |
|Location                    |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs|       |
|Provider                    |paimon                                                                                               |       |
|Owner                       |zhongyujiang                                                                                         |       |
|Table Properties            |[file.format=parquet]                                                                                |       |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+

Tests

API and Format

Documentation

@zhongyujiang (Author) commented:

cc @Zouxxyy @YannByron Can you help review this? Thanks!

@JingsongLi (Contributor) left a comment:

-1. It is best for Spark to stay unified with the other engines; using a new option may harm interoperability between engines.

@zhongyujiang (Author) commented:

Hi @JingsongLi, this is not introducing a new option, but rather adapting better to the Spark engine, since Spark has always used location to represent the location of a table.

the interoperability between engines

Are you suggesting that users should retrieve the table location information from the path option within the options when using the DESC command across different engines?

However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely tied to each engine's functionality. For instance, the DESC command in Trino and Flink does not even display table options, only column information. Yet Flink can display the primary-key attribute of columns, because primary keys are part of the Flink specification (which is not the case in Spark).

Therefore, I believe we should also adapt this to Spark, which would better match the usage habits of Spark users.

BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED command displays the location through the Location field. For Paimon tables, however, the location is hidden within the options under the path field, which is less convenient for users. This is the motivation behind this PR.

@@ -60,6 +60,8 @@ case class SparkTable(table: Table)
properties.put(CoreOptions.PRIMARY_KEY.key, String.join(",", table.primaryKeys))
}
properties.put(TableCatalog.PROP_PROVIDER, SparkSource.NAME)
val location = properties.remove(CoreOptions.PATH.key())
@Zouxxyy (Contributor) commented:

This interface (default Map<String, String> properties()) is only used in show create or desc table; adding TableCatalog.PROP_LOCATION is fine, but I wonder whether we need to remove path.

Besides, another point worth noting: Paimon currently does not actually support create table x location(''). The new result of show create table may cause misunderstanding, so we may need to support that first.
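
The first point above (keep the path option while also exposing a Spark-visible location property) could look roughly like this. This is a minimal standalone sketch: TablePropsSketch and its key constants are hypothetical stand-ins for TableCatalog.PROP_LOCATION and CoreOptions.PATH, not Paimon's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class TablePropsSketch {
    // Hypothetical stand-ins for TableCatalog.PROP_LOCATION and CoreOptions.PATH
    static final String PROP_LOCATION = "location";
    static final String PATH_KEY = "path";

    // Mirror 'path' into 'location' without removing 'path' from the props.
    static Map<String, String> withLocation(Map<String, String> options) {
        Map<String, String> props = new HashMap<>(options);
        String path = props.get(PATH_KEY);
        if (path != null) {
            props.put(PROP_LOCATION, path);
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(PATH_KEY, "file:/warehouse/default.db/t");
        Map<String, String> props = withLocation(options);
        System.out.println(props.get(PROP_LOCATION)); // file:/warehouse/default.db/t
        System.out.println(props.get(PATH_KEY));      // file:/warehouse/default.db/t
    }
}
```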

Contributor:

Hi @zhongyujiang, we can discuss in this thread. @Zouxxyy raised two very good points:

  1. We could keep path option.
  2. We should also support execution for show create table.

@zhongyujiang (Author) replied:

We could keep path option.

Sure

We should also support execution for show create table

I've tested this, and the location info passed via Spark DDL is currently ignored.

IIUC, the table location is closely related to the warehouse address, and different catalogs have different implementations:

  • AbstractCatalog calculates the table location based on the warehouse, database, and table name when loading a table, and both FileSystemCatalog and JdbcCatalog follow this behavior
  • HiveCatalog retrieves the table location information from the HMS (Hive Metastore)

Therefore, if we want to support this, we need to differentiate based on the catalog being used. This is feasible with HiveCatalog, but not with the other catalogs. What do you think? @Zouxxyy @JingsongLi
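
To make the difference concrete: a filesystem-style catalog derives the location deterministically from the warehouse path, while a Hive-style catalog has to ask the metastore. A minimal sketch of the deterministic case follows; the class and method names are hypothetical, but the warehouse/db.db/table layout matches the Location shown in the DESC EXTENDED output above.

```java
public class LocationSketch {
    // How a filesystem-style catalog could derive a table location from the
    // warehouse, database, and table name (hypothetical helper, not Paimon's API).
    static String defaultTableLocation(String warehouse, String database, String table) {
        return warehouse + "/" + database + ".db/" + table;
    }

    public static void main(String[] args) {
        // Layout: <warehouse>/<database>.db/<table>
        System.out.println(defaultTableLocation("file:/warehouse", "default", "t1"));
        // → file:/warehouse/default.db/t1
    }
}
```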

Contributor:

For other catalogs, we can check the location and, if it is not equal, throw an exception.
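
That relaxed check could be sketched like this (illustrative names only, not Paimon's actual code): accept a user-specified location only when it equals the one the catalog would assign anyway, and reject everything else with a clear message.

```java
public class LocationCheckSketch {
    // Hypothetical relaxed check: a user-specified location is allowed only if
    // it matches the location the catalog would assign on its own.
    static void checkLocation(String specified, String catalogAssigned) {
        if (specified != null && !specified.equals(catalogAssigned)) {
            throw new IllegalArgumentException(
                    "Specified location '" + specified
                            + "' does not match the catalog-assigned location '"
                            + catalogAssigned + "'");
        }
    }

    public static void main(String[] args) {
        checkLocation("file:/w/db.db/t", "file:/w/db.db/t"); // same path: accepted
        try {
            checkLocation("file:/elsewhere/t", "file:/w/db.db/t");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```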

Contributor:

See #2619.

@zhongyujiang (Author) replied:

I have updated this to allow customizing the table location when using the Hive catalog and creating an external table. Please take a look when you have time. Thanks!

checkArgument(
    Objects.equals(createTableType(), TableType.EXTERNAL),
    "The HiveCatalog only supports specifying location when creating an external table");
tableRoot = new Path(schema.options().get(CoreOptions.PATH.key()));
@zhongyujiang (Author) commented:

Location specification is only allowed when creating an external table.

schema.options().containsKey(CoreOptions.PATH.key())
    ? schema.options().get(CoreOptions.PATH.key())
    : getDataTableLocation(identifier).toString();
locationHelper.specifyTableLocation(table, location);
@zhongyujiang (Author) commented:

If provided, use the table path provided by the user.

@zhongyujiang zhongyujiang changed the title [Spark]: Relocate the 'path' property in the table options to 'location' for better presentation [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table Aug 4, 2024
Iterables.filter(
    tableSchema.options().entrySet(),
    entry -> !Objects.equals(entry.getKey(), CoreOptions.PATH.key()))),
tableSchema.comment());
@zhongyujiang (Author) commented:

Currently, when creating a new schema for the clone table, all table options from the original table are passed directly into the new schema, including the source table's path. That path is then ignored during table creation, which seems incorrect. Additionally, after prohibiting custom table paths in FileSystemCatalog, the clone table's unit tests would fail, so I have made a fix here.

Moreover, when constructing the new schema, the original fromTableSchema method directly reused the partitionKeys and primaryKeys instances from the source schema, which is not safe, so I have made copies here.

I have also moved this method from the Schema class to here, as this is the only place where it is needed. If you feel it is sufficiently general, I can put it back.
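
The unsafe-sharing point can be illustrated in isolation (plain Java, no Paimon types): if the new schema reuses the source schema's list instance, a later mutation of one silently changes the other, whereas a copy stays independent.

```java
import java.util.ArrayList;
import java.util.List;

public class CopySketch {
    public static void main(String[] args) {
        List<String> sourceKeys = new ArrayList<>(List.of("dt"));
        List<String> sharedKeys = sourceKeys;                  // unsafe: same instance
        List<String> copiedKeys = new ArrayList<>(sourceKeys); // safe: independent copy

        sourceKeys.add("hh"); // mutate the source schema's list

        System.out.println(sharedKeys.size()); // 2 — affected by the mutation
        System.out.println(copiedKeys.size()); // 1 — unaffected
    }
}
```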

Contributor:

Sounds good to me.

@@ -117,6 +119,9 @@ protected void dropTableImpl(Identifier identifier) {

@Override
public void createTableImpl(Identifier identifier, Schema schema) {
checkArgument(
    !schema.options().containsKey(CoreOptions.PATH.key()),
Contributor:

Can we just relax this check? For example, if it is the same path, we should allow it.

I am quite concerned that SHOW CREATE TABLE may not run properly.

@zhongyujiang (Author) replied:

Thanks for reviewing. I would like to share my thoughts on the prohibition of setting table location in FileSystemCatalog.

I think SHOW CREATE TABLE is usually used to create a test table with the same schema as the source: users get the DDL of a table and then create a table with a different name for testing. In this scenario, the location in the DDL will inevitably mismatch the path assigned by the FileSystemCatalog for the new table, requiring users to modify the DDL before the table can be created successfully.

So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?

Therefore, instead of relaxing the checks here, I suggest we improve the documentation by clearly stating this restriction: we can document the location-management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

@zhongyujiang (Author) added:

This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.

Contributor:

I still think a relaxed check would be better. For the "create a test table with a different name for testing purposes" case, we can throw a very clear exception.

@JingsongLi (Contributor) commented:

Thanks @zhongyujiang for the update. Left some comments.

@Zouxxyy (Contributor) left a comment:

I see the new commits provide support for external tables, thanks. Can you add the following tests?

For the Hive catalog (can be added in DDLWithHiveCatalogTestBase):

  1. Create an internal table, drop it, and check whether the files are deleted
  2. Create an external table using location (check it), drop it, and check whether the files are deleted

For the filesystem catalog (can be added to DDLTestBase): all tables should be internal tables.

3 participants