
[Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table #3843

Open · wants to merge 8 commits into base: master

Conversation

@zhongyujiang (Contributor) commented Jul 30, 2024

Purpose

This relocates the 'path' property in the table options to 'location' for better presentation.

  • Adapt 'path' to Spark's 'location' in table props (the 'path' option is still preserved in table props)
  • Support customizing the table location when creating a table (only allowed when creating an external table with the Hive catalog)

Spark uses the reserved property location to indicate the location of the table, and in the output of DESC EXTENDED, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                            |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a                           |bigint                                                                                               |NULL   |
|b                           |varchar(10)                                                                                          |NULL   |
|c                           |char(10)                                                                                             |NULL   |
|                            |                                                                                                     |       |
|# Metadata Columns          |                                                                                                     |       |
|__paimon_file_path          |string                                                                                               |       |
|__paimon_row_index          |bigint                                                                                               |       |
|                            |                                                                                                     |       |
|# Detailed Table Information|                                                                                                     |       |
|Name                        |default.testTableAs                                                                                  |       |
|Type                        |MANAGED                                                                                              |       |
|Location                    |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs|       |
|Provider                    |paimon                                                                                               |       |
|Owner                       |zhongyujiang                                                                                         |       |
|Table Properties            |[file.format=parquet]                                                                                |       |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+

Tests

API and Format

Documentation

@zhongyujiang (Author) commented:

cc @Zouxxyy @YannByron Can you help review this? Thanks!

@JingsongLi (Contributor) left a comment:

-1. It is best for Spark to stay unified with the other engines; using a new option may harm interoperability between engines.

@zhongyujiang (Author) commented:

Hi @JingsongLi, this is not introducing a new option, but rather adapting better to the Spark engine, since Spark has always used location to represent the location of a table.

the interoperability between engines

Are you suggesting that users should retrieve the table location information from the path option within the options when using the DESC command across different engines?

However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely tied to each engine's functionality. For instance, the DESC command in Trino and Flink does not even display table options, only column information. Yet Flink can display the primary-key attribute of columns, because primary keys are part of the Flink specification (which is not the case in Spark).

Therefore, I believe we should also adapt this to Spark, which would better match the usage habits of Spark users.

BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED command displays the location through the Location field. For Paimon tables, however, the location is hidden within the options under the path field, which is less convenient for users. This is the motivation behind this PR.

@@ -60,6 +60,8 @@ case class SparkTable(table: Table)
properties.put(CoreOptions.PRIMARY_KEY.key, String.join(",", table.primaryKeys))
}
properties.put(TableCatalog.PROP_PROVIDER, SparkSource.NAME)
val location = properties.remove(CoreOptions.PATH.key())
@Zouxxyy (Contributor) commented:

This interface (default Map<String, String> properties()) is only used in show create or desc table; adding TableCatalog.PROP_LOCATION is fine, but I wonder whether we need to remove path.

Besides, another point worth noting: Paimon currently does not actually support create table x location(''). The new result of show create table may cause misunderstanding, so we may need to support that first.
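
The first point above (keep the path option while also exposing a Spark-visible location property) could look roughly like this. This is a minimal standalone sketch: TablePropsSketch and its key constants are hypothetical stand-ins for TableCatalog.PROP_LOCATION and CoreOptions.PATH, not Paimon's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class TablePropsSketch {
    // Hypothetical stand-ins for TableCatalog.PROP_LOCATION and CoreOptions.PATH
    static final String PROP_LOCATION = "location";
    static final String PATH_KEY = "path";

    // Mirror 'path' into 'location' without removing 'path' from the props.
    static Map<String, String> withLocation(Map<String, String> options) {
        Map<String, String> props = new HashMap<>(options);
        String path = props.get(PATH_KEY);
        if (path != null) {
            props.put(PROP_LOCATION, path);
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(PATH_KEY, "file:/warehouse/default.db/t");
        Map<String, String> props = withLocation(options);
        System.out.println(props.get(PROP_LOCATION)); // file:/warehouse/default.db/t
        System.out.println(props.get(PATH_KEY));      // file:/warehouse/default.db/t
    }
}
```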

Contributor:

Hi @zhongyujiang, we can discuss in this thread. @Zouxxyy raised two very good points:

  1. We could keep path option.
  2. We should also support execution for show create table.

@zhongyujiang (Author) replied:

We could keep path option.

Sure

We should also support execution for show create table

I've tested this, and the location info passed via Spark DDL is currently ignored.

IIUC, the table location is closely related to the warehouse address, and different catalogs have different implementations:

  • AbstractCatalog calculates the table location based on the warehouse, database, and table name when loading a table, and both FileSystemCatalog and JdbcCatalog follow this behavior
  • HiveCatalog retrieves the table location information from the HMS (Hive Metastore)

Therefore, if we want to support this, we need to differentiate based on the catalog being used. This is feasible with HiveCatalog, but not with the other catalogs. What do you think? @Zouxxyy @JingsongLi
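
To make the difference concrete: a filesystem-style catalog derives the location deterministically from the warehouse path, while a Hive-style catalog has to ask the metastore. A minimal sketch of the deterministic case follows; the class and method names are hypothetical, but the warehouse/db.db/table layout matches the Location shown in the DESC EXTENDED output above.

```java
public class LocationSketch {
    // How a filesystem-style catalog could derive a table location from the
    // warehouse, database, and table name (hypothetical helper, not Paimon's API).
    static String defaultTableLocation(String warehouse, String database, String table) {
        return warehouse + "/" + database + ".db/" + table;
    }

    public static void main(String[] args) {
        // Layout: <warehouse>/<database>.db/<table>
        System.out.println(defaultTableLocation("file:/warehouse", "default", "t1"));
        // → file:/warehouse/default.db/t1
    }
}
```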

Contributor:

For other catalogs, we can check the location and, if it is not equal, throw an exception.
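
That relaxed check could be sketched like this (illustrative names only, not Paimon's actual code): accept a user-specified location only when it equals the one the catalog would assign anyway, and reject everything else with a clear message.

```java
public class LocationCheckSketch {
    // Hypothetical relaxed check: a user-specified location is allowed only if
    // it matches the location the catalog would assign on its own.
    static void checkLocation(String specified, String catalogAssigned) {
        if (specified != null && !specified.equals(catalogAssigned)) {
            throw new IllegalArgumentException(
                    "Specified location '" + specified
                            + "' does not match the catalog-assigned location '"
                            + catalogAssigned + "'");
        }
    }

    public static void main(String[] args) {
        checkLocation("file:/w/db.db/t", "file:/w/db.db/t"); // same path: accepted
        try {
            checkLocation("file:/elsewhere/t", "file:/w/db.db/t");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```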

Contributor:

See #2619.

@zhongyujiang (Author) replied:

I have updated this to allow customizing the table location when using the Hive catalog and creating an external table. Please take a look when you have time. Thanks!

checkArgument(
    Objects.equals(createTableType(), TableType.EXTERNAL),
    "The HiveCatalog only supports specifying location when creating an external table");
tableRoot = new Path(schema.options().get(CoreOptions.PATH.key()));
@zhongyujiang (Author) commented:

Location specification is only allowed when creating an external table.

schema.options().containsKey(CoreOptions.PATH.key())
    ? schema.options().get(CoreOptions.PATH.key())
    : getDataTableLocation(identifier).toString();
locationHelper.specifyTableLocation(table, location);
@zhongyujiang (Author) commented:

If provided, use the table path provided by the user.

@zhongyujiang zhongyujiang changed the title [Spark]: Relocate the 'path' property in the table options to 'location' for better presentation [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table Aug 4, 2024
Iterables.filter(
    tableSchema.options().entrySet(),
    entry -> !Objects.equals(entry.getKey(), CoreOptions.PATH.key()))),
tableSchema.comment());
@zhongyujiang (Author) commented:

Currently, when creating a new schema for the clone table, all table options from the original table are passed directly into the new schema, including the source table's path. That path is then ignored during table creation, which seems incorrect. Additionally, after prohibiting custom table paths in FileSystemCatalog, the clone table's unit tests would fail, so I have made a fix here.

Moreover, when constructing the new schema, the original fromTableSchema method directly reused the partitionKeys and primaryKeys instances from the source schema, which is not safe, so I have made copies here.

I have also moved this method from the Schema class to here, as this is the only place where it is needed. If you feel it is sufficiently general, I can put it back.
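
The unsafe-sharing point can be illustrated in isolation (plain Java, no Paimon types): if the new schema reuses the source schema's list instance, a later mutation of one silently changes the other, whereas a copy stays independent.

```java
import java.util.ArrayList;
import java.util.List;

public class CopySketch {
    public static void main(String[] args) {
        List<String> sourceKeys = new ArrayList<>(List.of("dt"));
        List<String> sharedKeys = sourceKeys;                  // unsafe: same instance
        List<String> copiedKeys = new ArrayList<>(sourceKeys); // safe: independent copy

        sourceKeys.add("hh"); // mutate the source schema's list

        System.out.println(sharedKeys.size()); // 2 — affected by the mutation
        System.out.println(copiedKeys.size()); // 1 — unaffected
    }
}
```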

Contributor:

Sounds good to me.

@@ -117,6 +119,9 @@ protected void dropTableImpl(Identifier identifier) {

@Override
public void createTableImpl(Identifier identifier, Schema schema) {
checkArgument(
    !schema.options().containsKey(CoreOptions.PATH.key()),
Contributor:

Can we just relax this check? For example, if it is the same path, we should allow it.

I am quite concerned that SHOW CREATE TABLE may not run properly.

@zhongyujiang (Author) replied:

Thanks for reviewing. I would like to share my thoughts on the prohibition of setting table location in FileSystemCatalog.

I think SHOW CREATE TABLE is usually used to create a test table with the same schema as the source: users get the DDL of a table and then create a table with a different name for testing. In this scenario, the location in the DDL will inevitably mismatch the path assigned by the FileSystemCatalog for the new table, requiring users to modify the DDL before the table can be created successfully.

So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?

Therefore, instead of relaxing the checks here, I suggest we improve the documentation by clearly stating this restriction: we can document the location-management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

@zhongyujiang (Author) added:

This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.

Contributor:

I still think a relaxed check would be better. For the "create a test table with a different name for testing purposes" case, we can throw a very clear exception.

@JingsongLi (Contributor) commented:

Thanks @zhongyujiang for the update. Left some comments.

@Zouxxyy (Contributor) left a comment:

I see the new commits provide support for external tables, thanks. Can you add the following tests?

For the Hive catalog (can be added in DDLWithHiveCatalogTestBase):

  1. Create an internal table, drop it, and check whether the files are deleted
  2. Create an external table using location (check it), drop it, and check whether the files are deleted

For the filesystem catalog (can be added to DDLTestBase): all tables should be internal tables.

3 participants