Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added config option to enable parallel deletes for vacuum command #522

Closed

Conversation

Kimahriman
Copy link
Contributor

Resolves #395

#416 hasn't been updated in over four months, and it would be a verify useful feature for us to have, so I took my own stab at it.

  • A new config value is added vacuum.parallelDelete.enabled that defaults to false
  • I updated the default behavior to be coalesce to 1 instead of iterate on the driver so that you can see something being done by spark in the UI/console instead of it just sitting there. I'm not sure if there's a reason this would cause issues, so happy to revert this back if you think it should be.
  • If vacuum.parallelDelete.enabled is set to true, it maintains the existing partitions from the diff calculation. Because this is the result of a join, your partitions are then based off your spark.sql.shuffle.partitions. So your parallelism will be min(number of executors, shuffle partitions), and you can tweak your shuffle partitions if you want more/less parallelism

I removed the delete static method because the number of parameters that had to be passed to it made it seem like too much. Happy to move that back if that's not preferred.

Also happy to make any updates to the name or description of the new config.

@jose-torres
Copy link
Contributor

I think it's best to leave the default behavior to iterate on the driver and maintain the delete static method.

Other than that, looks good! Can you add a unit test to the vacuum test suite with the flag enabled?

@Kimahriman
Copy link
Contributor Author

I think it's best to leave the default behavior to iterate on the driver and maintain the delete static method.

Other than that, looks good! Can you add a unit test to the vacuum test suite with the flag enabled?

Oh duh, I couldn't think of a way to unit test the behavior but I guess just making sure it at least works with the flag enabled would be good.

@Kimahriman Kimahriman force-pushed the feature/vacuum-parallel-delete branch from acdf936 to 8d275af Compare September 18, 2020 14:14
@Kimahriman
Copy link
Contributor Author

I reverted back to the static method and delete on driver in the default case. Also added a basic test running a vacuum with the config enabled.

Signed-off-by: Adam Binford <adamq43@gmail.com>
@Kimahriman Kimahriman force-pushed the feature/vacuum-parallel-delete branch from 8d275af to 39719d8 Compare September 18, 2020 14:24
@itaydemri
Copy link

Hi, Are you going to merge this soon?
It should help me a lot!

@swapnil-chougule
Copy link

Hey Team,
It would be better if this change get merged soon
It is helpful & waiting for it.

@jose-torres
Copy link
Contributor

We're working on getting this merged. Sorry for the delay!

@itaydemri
Copy link

Thanks for closing this!
Any plans to merge it to 0.6?

@brkyvz brkyvz closed this in 40182f3 Oct 15, 2020
@zsxwing
Copy link
Member

zsxwing commented Nov 6, 2020

Any plans to merge it to 0.6?

In general, we don't backport new features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vacuum is very slow
5 participants