Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#2541] feat(spark-connector): support basic DDL and DML operations to iceberg catalog #2544

Merged
merged 92 commits into from
Apr 2, 2024

Conversation

caican00
Copy link
Collaborator

@caican00 caican00 commented Mar 15, 2024

What changes were proposed in this pull request?

  1. support DDL operations to iceberg catalog.
  2. support read and write operations to iceberg Table.

Why are the changes needed?

support basic DDL and DML operations for iceberg table using sparksql.

Fix: #2541

Does this PR introduce any user-facing change?

Yes, users can use sparksql to do iceberg table ddl and read&write operations.

How was this patch tested?

New Iceberg ITs.

@caican00 caican00 marked this pull request as draft March 15, 2024 07:22
@FANNG1
Copy link
Contributor

FANNG1 commented Mar 15, 2024

Thanks for your work! I prefer to delay the Iceberg merge work until finishing basic hive features like partition bucket properties in recent weeks. To support Iceberg catalog integrate test, there is a prepose work that make SparkIT to SparkCommonIT which contains the common tests shared by all catalogs. add new SparkHiveCatalogIT to test Hive specific tests and SparkIcebergCatalogIT to test Iceberg specific tests, both SparkXXCatalogIT extends SparkCommonIT. do you like to work on this first?

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 15, 2024

Or you could do a POC first in your local environment, after the integrate test split work is done you could propose your PR.

@caican00
Copy link
Collaborator Author

caican00 commented Mar 15, 2024

Thanks for your work! I prefer to delay the Iceberg merge work until finishing basic hive features like partition bucket properties in recent weeks. To support Iceberg catalog integrate test, there is a prepose work that make SparkIT to SparkCommonIT which contains the common tests shared by all catalogs. add new SparkHiveCatalogIT to test Hive specific tests and SparkIcebergCatalogIT to test Iceberg specific tests, both SparkXXCatalogIT extends SparkCommonIT. do you like to work on this first?

sure, i'll try to separate the integrate test for different datasources first.

@caican00
Copy link
Collaborator Author

Thanks for your work! I prefer to delay the Iceberg merge work until finishing basic hive features like partition bucket properties in recent weeks. To support Iceberg catalog integrate test, there is a prepose work that make SparkIT to SparkCommonIT which contains the common tests shared by all catalogs. add new SparkHiveCatalogIT to test Hive specific tests and SparkIcebergCatalogIT to test Iceberg specific tests, both SparkXXCatalogIT extends SparkCommonIT. do you like to work on this first?

sure, i'll try to separate the integrate test for different datasources first.

By the way, is it better to use iceberg rest catalog for spark-connector to interact with iceberg in the first one PR? cc @FANNG1

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 15, 2024

Thanks for your work! I prefer to delay the Iceberg merge work until finishing basic hive features like partition bucket properties in recent weeks. To support Iceberg catalog integrate test, there is a prepose work that make SparkIT to SparkCommonIT which contains the common tests shared by all catalogs. add new SparkHiveCatalogIT to test Hive specific tests and SparkIcebergCatalogIT to test Iceberg specific tests, both SparkXXCatalogIT extends SparkCommonIT. do you like to work on this first?

sure, i'll try to separate the integrate test for different datasources first.

By the way, is it better to use iceberg rest catalog for spark-connector to interact with iceberg in the first one PR? cc @FANNG1

I prefer to support Iceberg Hive Catalog because it's most used, and env is setup.

@caican00
Copy link
Collaborator Author

Thanks for your work! I prefer to delay the Iceberg merge work until finishing basic hive features like partition bucket properties in recent weeks. To support Iceberg catalog integrate test, there is a prepose work that make SparkIT to SparkCommonIT which contains the common tests shared by all catalogs. add new SparkHiveCatalogIT to test Hive specific tests and SparkIcebergCatalogIT to test Iceberg specific tests, both SparkXXCatalogIT extends SparkCommonIT. do you like to work on this first?

sure, i'll try to separate the integrate test for different datasources first.

By the way, is it better to use iceberg rest catalog for spark-connector to interact with iceberg in the first one PR? cc @FANNG1

I prefer to support Iceberg Hive Catalog because it's most used, and env is setup.

If we do not use hms in the future and directly use gravitino backend storage, we should still need rest catalog, right?

@caican00
Copy link
Collaborator Author

caican00 commented Mar 15, 2024

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 15, 2024

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

@caican00
Copy link
Collaborator Author

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 16, 2024

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most work, we may change the plans, we will disscuse it on Monday.

@caican00
Copy link
Collaborator Author

caican00 commented Mar 18, 2024

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most work, we may change the plans, we will disscuse it on Monday.

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most work, we may change the plans, we will disscuse it on Monday.

Yes, I could specialize in this. cc @FANNG1

@caican00
Copy link
Collaborator Author

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most work, we may change the plans, we will disscuse it on Monday.

Hi @FANNG1 I would like to know if i can do something in this way:

  1. try to separate the integrate test for different datasources first
  2. and then help solve some hive related issues
  3. finally continue to work on iceberg related issues

I think it's ok, please note that Iceberg issues have the risk of not merged in 0.5

ok, i got it, thank you @FANNG1 . And may I ask why iceberg issues are not planned in 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most work, we may change the plans, we will disscuse it on Monday.

Yes, I could specialize in this. cc @FANNG1

Hi @FANNG1, kindly ask if there are any conclusions about supporting iceberg in 0.5?

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 19, 2024

Hi @FANNG1, kindly ask if there are any conclusions about supporting iceberg in 0.5?

I want to keep consistent about what will to do to support Iceberg:

  1. refactor SparkIT to support SparkIcebergCatalogIT, [#2566] Improvement(spark-connector): Refactoring integration tests for spark-connector #2578
  2. support basic DDL and DML operation for Iceberg in this PR, which should pass SparkIcebergCatalogIT
  3. Iceberg partition and distribution support, like month(timestamp)
  4. advanced features like low-level update or delete, [Subtask] support row-level operations to iceberg Table #2543 [Subtask] support delete operation to Iceberg Table #2542

Could you finish at least the first 3 features before 4.12 which is code freeze date?

@caican00
Copy link
Collaborator Author

Hi @FANNG1, kindly ask if there are any conclusions about supporting iceberg in 0.5?

I want to keep consistent about what will to do to support Iceberg:

  1. refactor SparkIT to support SparkIcebergCatalogIT, [#2566] Improvement(spark-connector): Refactoring integration tests for spark-connector #2578
  2. support basic DDL and DML operation for Iceberg in this PR, which should pass SparkIcebergCatalogIT
  3. Iceberg partition and distribution support, like month(timestamp)
  4. advanced features like low-level update or delete, [Subtask] support merge operation to iceberg Table #2543 [Subtask] support delete operation to Iceberg Table #2542

Could you finish at least the first 3 features before 4.12 which is code freeze date?

OK, I will give 100% effort to finish them.
And regarding the first point, I have a question, should we support SparkIcebergCatalogIT directly in this pr #2578 ? Currently iceberg's basic DDL and DML operations are not supported in spark-connector.

@FANNG1
Copy link
Contributor

FANNG1 commented Mar 19, 2024

Hi @FANNG1, kindly ask if there are any conclusions about supporting iceberg in 0.5?

I want to keep consistent about what will to do to support Iceberg:

  1. refactor SparkIT to support SparkIcebergCatalogIT, [#2566] Improvement(spark-connector): Refactoring integration tests for spark-connector #2578
  2. support basic DDL and DML operation for Iceberg in this PR, which should pass SparkIcebergCatalogIT
  3. Iceberg partition and distribution support, like month(timestamp)
  4. advanced features like low-level update or delete, [Subtask] support merge operation to iceberg Table #2543 [Subtask] support delete operation to Iceberg Table #2542

Could you finish at least the first 3 features before 4.12 which is code freeze date?

OK, I will give 100% effort to finish them. And regarding the first point, I have a question, should we support SparkIcebergCatalogIT directly in this pr #2578 ? Currently iceberg's basic DDL and DML operations are not supported in spark-connector.

We needn't to support SparkIcebergCatalogIT directly in #2578, #2578 aims to make add SparkIcebergCatalogIT easily like just extends SparkCommonIT.

@caican00
Copy link
Collaborator Author

Hi @FANNG1, kindly ask if there are any conclusions about supporting iceberg in 0.5?

I want to keep consistent about what will to do to support Iceberg:

  1. refactor SparkIT to support SparkIcebergCatalogIT, [#2566] Improvement(spark-connector): Refactoring integration tests for spark-connector #2578
  2. support basic DDL and DML operation for Iceberg in this PR, which should pass SparkIcebergCatalogIT
  3. Iceberg partition and distribution support, like month(timestamp)
  4. advanced features like low-level update or delete, [Subtask] support merge operation to iceberg Table #2543 [Subtask] support delete operation to Iceberg Table #2542

Could you finish at least the first 3 features before 4.12 which is code freeze date?

OK, I will give 100% effort to finish them. And regarding the first point, I have a question, should we support SparkIcebergCatalogIT directly in this pr #2578 ? Currently iceberg's basic DDL and DML operations are not supported in spark-connector.

We needn't to support SparkIcebergCatalogIT directly in #2578, #2578 aims to make add SparkIcebergCatalogIT easily like just extends SparkCommonIT.

got it.

@caican00 caican00 changed the title [#2541] feat(spark-connector): support DDL and readWrite operations to iceberg catalog [#2541] feat(spark-connector): support basic DDL and DML operations to iceberg catalog Mar 19, 2024
@caican00 caican00 force-pushed the iceberg-read-write branch from 16a4c98 to 69a7af6 Compare April 1, 2024 01:37
@jerryshao
Copy link
Contributor

Can you please check the CI exception here (https://github.com/datastrato/gravitino/actions/runs/8502545002/job/23286861918?pr=2544), I'm not sure is it related to your changes?

@jerryshao
Copy link
Contributor

@FANNG1 would you please help to review again?

@caican00 caican00 force-pushed the iceberg-read-write branch from ab43f2f to 3d494ac Compare April 1, 2024 12:54
@caican00
Copy link
Collaborator Author

caican00 commented Apr 1, 2024

Can you please check the CI exception here (https://github.com/datastrato/gravitino/actions/runs/8502545002/job/23286861918?pr=2544), I'm not sure is it related to your changes?

Fixed. Thanks for your review.

@caican00
Copy link
Collaborator Author

caican00 commented Apr 2, 2024

Hi @FANNG1 could you help review this again? Thanks

@FANNG1
Copy link
Contributor

FANNG1 commented Apr 2, 2024

Hi @FANNG1 could you help review this again? Thanks

just a few comment, could you fix it?

@caican00
Copy link
Collaborator Author

caican00 commented Apr 2, 2024

Hi @FANNG1 could you help review this again? Thanks

just a few comment, could you fix it?

all comments have been addressed.

@FANNG1
Copy link
Contributor

FANNG1 commented Apr 2, 2024

LGTM, let's wait CI finish

@FANNG1 FANNG1 merged commit 46ebaf6 into apache:main Apr 2, 2024
19 checks passed
@FANNG1
Copy link
Contributor

FANNG1 commented Apr 2, 2024

@caican00 thanks for your work, merge to main.

@caican00
Copy link
Collaborator Author

caican00 commented Apr 2, 2024

@caican00 thanks for your work, merge to main.

@FANNG1 @jerryshao @qqqttt123 Thank you for yours review.

yuqi1129 added a commit to yuqi1129/gravitino that referenced this pull request Apr 3, 2024
…L operations to iceberg catalog (apache#2544)"

This reverts commit 46ebaf6.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] support DDL, read and write operations to Iceberg catalog
4 participants