Fix:Tune DeltaTable.vacuum operation #529

Closed
wants to merge 2 commits into from

Conversation

mohamed-a-abdelaziz

  • Tune the DeltaTable.vacuum operation to delete files in parallel, with the parallelism depending on the number of partitions, instead of deleting them one by one.
  • Passes unit tests (org.apache.spark.sql.delta.DeltaVacuumSuite).
  • Used it to delete around 300K files on GCS in 10 minutes, on a cluster with 8 workers (n1-standard-4: 4 CPUs, 15 GB each),
    with --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=1000 set.
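
The idea of deleting per partition rather than one file at a time can be illustrated with a small local sketch. This is not the code from this PR or from Delta Lake: it uses only the Python standard library on a local filesystem, with a thread pool standing in for Spark tasks, and the helper names (`split_into_partitions`, `delete_partition`, `parallel_delete`) are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(paths, num_partitions):
    """Distribute paths round-robin across num_partitions lists."""
    num_partitions = max(1, num_partitions)
    return [paths[i::num_partitions] for i in range(num_partitions)]


def delete_partition(paths):
    """Delete every file in one partition; return how many were removed."""
    deleted = 0
    for path in paths:
        try:
            os.remove(path)
            deleted += 1
        except FileNotFoundError:
            pass  # already gone, e.g. removed by an earlier run
    return deleted


def parallel_delete(paths, num_partitions=8):
    """Delete files using one worker per partition; return the total deleted."""
    partitions = split_into_partitions(list(paths), num_partitions)
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # Each worker handles a whole partition, so the partitions are
        # processed concurrently instead of file by file.
        return sum(pool.map(delete_partition, partitions))
```

Calling `parallel_delete(file_paths, num_partitions=4)` on a list of existing files removes them and returns the number deleted. Because missing files are skipped, a second call on the same list returns 0.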

@mohamed-a-abdelaziz changed the title from "fix:Tune DeltaTable.vacuum operation" to "Fix:Tune DeltaTable.vacuum operation" on Oct 3, 2020
@brkyvz
Collaborator

brkyvz commented Oct 15, 2020

Thanks @mohamed-a-abdelaziz for submitting this and for your interest. We just merged an alternative PR to this one: #522. Therefore, I'm closing this.

@brkyvz brkyvz closed this Oct 15, 2020
tdas pushed a commit to tdas/delta that referenced this pull request Jun 6, 2023
* [FlinkSQL_PR_1] Flink Delta Sink - Table API UPDATED (delta-io#389)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>
Signed-off-by: Krzysztof Chmielewski <krzysztof.chmielewski@getindata.com>
Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>
Co-authored-by: Paweł Kubit <pawel.kubit@getindata.com>
Co-authored-by: Krzysztof Chmielewski <krzysztof.chmielewski@getindata.com>

* [FlinkSQL_PR_2] - SQL Support for Delta Source connector. (delta-io#487)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_3] - Delta catalog skeleton (delta-io#503)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_4] - Delta catalog - Interactions with DeltaLog. Create and get table. (delta-io#506)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_5] - Delta catalog - DDL option validation. (delta-io#509)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_6] - Delta catalog - alter table + tests. (delta-io#510)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_7] - Delta catalog - Restrict Delta Table factory to work only with Delta Catalog + tests. (delta-io#514)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_8] - Delta Catalog - DDL/Query hint validation + tests. (delta-io#520)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_9] - Delta Catalog - Adding Flink's Hive catalog as decorated catalog. (delta-io#524)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column. (delta-io#528)

* [FlinkSQL_PR_10] - Table API support SELECT with filter on partition column.

---------

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>
Co-authored-by: Scott Sandre <scott.sandre@databricks.com>

* [FlinkSQL_PR_11] - Delta Catalog - cache DeltaLog instances in DeltaCatalog. (delta-io#529)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_12] - UML diagrams. (delta-io#530)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_13] - Remove mergeSchema option from SQL API. (delta-io#531)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* [FlinkSQL_PR_14] - SQL examples. (delta-io#535)

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>

* remove duplicate function after rebasing against master

---------

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>
Signed-off-by: Krzysztof Chmielewski <krzysztof.chmielewski@getindata.com>
Co-authored-by: kristoffSC <krzysiek.chmielewski@gmail.com>
Co-authored-by: Paweł Kubit <pawel.kubit@getindata.com>
Co-authored-by: Krzysztof Chmielewski <krzysztof.chmielewski@getindata.com>
tdas pushed a commit to tdas/delta that referenced this pull request Jun 8, 2023
2 participants