
Conversation

@alexkingli

Add a cleanTable method to HiveCatalog to solve the problem of being unable to delete table files when the table has already been deleted in the metastore. With it we no longer need to delete the table files with hadoop commands. For safety, the method will fail if the table still exists.
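
A minimal sketch of the kind of cleanup being proposed (the cleanTable signature and the Hadoop FileSystem usage below are assumptions for illustration, not the actual patch):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical helper on HiveCatalog: refuses to run while the metastore entry
// still exists, then removes the orphaned table location recursively.
public void cleanTable(TableIdentifier identifier, String location) {
  if (tableExists(identifier)) {
    throw new IllegalStateException(
        "Refusing to clean files: " + identifier + " still exists in the metastore");
  }

  try {
    Path tableLocation = new Path(location);
    FileSystem fs = tableLocation.getFileSystem(new Configuration());
    fs.delete(tableLocation, true /* recursive */);
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to clean table location: " + location, e);
  }
}
```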

@pvary
Contributor

pvary commented Dec 13, 2021

@alexkingli: Could you please explain exactly what happened?
What I understand is:

  1. Drop table (wrong user) - metadata dropped, deleting data failed
  2. Drop table (correct user) - drop table failed

What I do not understand is that at the 2nd try we do not have metadata for the table. How is it possible to drop a table in this case?

#2583 is somewhat related. Could it help you?

Thanks,
Peter

@kbendick
Contributor

kbendick commented Dec 13, 2021

For your case, where the wrong user was used, there's also a Spark action that can potentially help with this: RemoveReachableFiles.

It's already usable, but here's a PR to add it as a stored-procedure: #3719

Would that possibly help as well? My understanding has always been that situations like the 2nd were the reason for having that action.
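
For illustration, a hedged sketch of calling that kind of cleanup through Iceberg's Spark actions API, assuming the action referenced here corresponds to deleteReachableFiles in the Java API; the Spark session setup and the metadata path are illustrative:

```java
import org.apache.iceberg.actions.DeleteReachableFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.active();

// Point the action at the last known metadata.json of the dropped table and
// let it delete every data, manifest and metadata file reachable from it.
DeleteReachableFiles.Result result = SparkActions.get(spark)
    .deleteReachableFiles("hdfs:///warehouse/db.db/tbl/metadata/v3.metadata.json")
    .execute();

System.out.println("Deleted " + result.deletedDataFilesCount() + " data files");
```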

@alexkingli
Author

@alexkingli: Could you please explain exactly what happened? What I understand is:

  1. Drop table (wrong user) - metadata dropped, deleting data failed
  2. Drop table (correct user) - drop table failed

What I do not understand is that at the 2nd try we do not have metadata for the table. How is it possible to drop a table in this case?

#2583 is somewhat related. Could it help you?

Thanks, Peter

Yes, the situation you described is the same as mine. And I think this cleanup method is necessary in HiveCatalog, so we don't need to clean up these files directly with hadoop commands every time.

@pvary
Contributor

pvary commented Dec 15, 2021

I would be interested in:

  • How would you try to drop the table after the first failure?
  • Who would call the new API method?

I think if we know the location of the former table, then we can use it to remove the directory by hand or through a FileSystem API. I am not sure we really need a new method for this in the Iceberg API, especially since it does not do anything Iceberg-specific; it just removes the directory recursively (which could be wrong in the case of an Iceberg table, as it could contain data files from anywhere).
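
For example, a minimal sketch of that by-hand FileSystem route, with an illustrative location:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Recursively remove the former table directory once its location is known.
Path formerLocation = new Path("hdfs:///warehouse/db.db/dropped_table");
FileSystem fs = formerLocation.getFileSystem(new Configuration());
fs.delete(formerLocation, true /* recursive */);
```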

I still might miss some points, so correct me if I am wrong in my assumptions/logic above.

Thanks,
Peter

@alexkingli
Author

Ok, I submitted a new commit. I changed the logic of dropTable in HiveCatalog: I think it should remove the data files first and then clean up the metadata in the metastore. The problem at present is that dropTable can't guarantee a transaction spanning the metadata delete in the Hive metastore and the data delete in S3 or HDFS. So it's a good idea to put the data-deletion logic first; that ensures the data is deleted successfully as far as possible.
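
A hedged sketch of the ordering described above, not the actual commit; CatalogUtil.dropTableData and the metastore client's dropTable are existing APIs, but the wrapper method here is illustrative:

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableOperations;

// Illustrative ordering only: delete the table's data and metadata files first,
// then remove the entry from the Hive metastore.
static void dropDataThenMetadata(TableOperations ops, IMetaStoreClient client,
                                 String database, String tableName, boolean purge) throws Exception {
  TableMetadata lastMetadata = ops.current();
  if (purge && lastMetadata != null) {
    // existing Iceberg helper that deletes data, manifests and metadata files
    CatalogUtil.dropTableData(ops.io(), lastMetadata);
  }
  client.dropTable(database, tableName,
      false /* data already handled above */,
      false /* fail if the table is unknown */);
}
```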

@pvary
Contributor

pvary commented Dec 15, 2021

The problem at present is that dropTable can't guarantee a transaction spanning the metadata delete in the Hive metastore and the data delete in S3 or HDFS. So it's a good idea to put the data-deletion logic first; that ensures the data is deleted successfully as far as possible.

We had this discussion in Hive before, and the conclusion was that it is better to drop the metadata first. This way we might keep some unnecessary files, but we always have a consistent Hive table structure, and it is quite easy to delete the files by hand. If we go the other way around, then we might end up in a situation where we drop the files but keep the metadata. That leaves a corrupted Hive table structure, which the community decided is a much worse situation for Hive users than the former.

I think the Iceberg tables are not special in this way, and I think the argument above still holds true.

What do you think?

@alexkingli
Author

alexkingli commented Dec 15, 2021

I think I understand the design of dropTable now. Thanks a lot. Another problem is that dropTable can't delete directories and some metadata files in HDFS if we use HDFS as the underlying storage. This may cause failures when rebuilding a table with the same name. Please take a look at this problem, thanks again.

@alexkingli alexkingli changed the title from "Add the cleanTable in HiveCatalog to solve the problem of being unable to delete table files when the table is already deleted in metastore" to "Solve the problem of being unable to delete table files via HiveCatalog" Dec 15, 2021
@alexkingli alexkingli closed this Dec 15, 2021
@alexkingli alexkingli reopened this Dec 15, 2021
@pvary
Contributor

pvary commented Dec 15, 2021

Another problem is that dropTable can't delete directories and some metadata files in HDFS if we use HDFS as the underlying storage. This may cause failures when rebuilding a table with the same name. Please take a look at this problem, thanks again.

Does the delete run into errors, or does it just not try to remove the files?
Would #3622 help here?

@alexkingli
Author

alexkingli commented Dec 15, 2021

Does the delete run into errors, or does it just not try to remove the files? Would #3622 help here?

Even if this method is successfully executed, some directories and some metadata files are not cleaned up in HDFS. For example: /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data, /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json. Then if I create a table with the same name, I need to delete these directories and files with hadoop commands first.

@pvary
Contributor

pvary commented Dec 15, 2021


Does the delete run into errors, or does it just not try to remove the files? Would #3622 help here?

Even if this method is successfully executed, some directories and some metadata files are not cleaned up in HDFS. For example: /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data, /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json. Then if I create a table with the same name, I need to delete these directories and files with hadoop commands first.

I think #3622 should help by removing the /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json files. If I understand correctly, the reason for that change was that the old metadata files were kept, and after the PR they will be removed. I would guess that with #3622 the only remaining problems are the /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data/ and /home/$lakehouse_dir/$database.db/$table/metadata/ dirs.

It might be useful to remove them at the end of the dropTableData method, but we can do it only if they are empty (as other tables can use these dirs "legally" 😄). This would solve the issue for every engine.
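
A minimal sketch of that idea, assuming a Hadoop FileSystem-backed location; the helper name is made up:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: remove a leftover directory only when it is empty,
// so directories legitimately shared with other tables are left alone.
static void deleteIfEmpty(String location) throws IOException {
  Path dir = new Path(location);
  FileSystem fs = dir.getFileSystem(new Configuration());
  if (fs.exists(dir) && fs.listStatus(dir).length == 0) {
    fs.delete(dir, false /* non-recursive; the directory is already empty */);
  }
}
```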

If the drop happens from Hive, then we might get away with adding it to the HiveIcebergMetaHook.commitDropTable method, but this is a Hive-only solution, so the above one might be better, if the community agrees.
https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L222-L241

Thanks,
Peter

@alexkingli
Author

alexkingli commented Dec 15, 2021

I think #3622 should help by removing the /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json files. If I understand correctly, the reason for that change was that the old metadata files were kept, and after the PR they will be removed. I would guess that with #3622 the only remaining problems are the /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data/ and /home/$lakehouse_dir/$database.db/$table/metadata/ dirs.

It might be useful to remove them at the end of the dropTableData method, but we can do it only if they are empty (as other tables can use these dirs "legally" 😄). This would solve the issue for every engine.

If the drop happens from Hive, then we might get away with adding it to the HiveIcebergMetaHook.commitDropTable method, but this is a Hive-only solution, so the above one might be better, if the community agrees. https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L222-L241

Thanks, Peter

Thanks a lot. I think it is necessary to delete these directories once they no longer contain any files. In fact, there may also be some partition directories that need to be deleted. It is not convenient to delete these directories in HDFS or S3 by command every time. Or we could build a tool to solve this problem.

@pvary
Contributor

pvary commented Dec 15, 2021

The FileIO API is quite narrow by design. I would not think that the community would agree to add a new method to it.

I think the easiest way forward would be to convince Hive to drop the table data.
It should be possible to do it based on the value of ifPurge
https://github.com/apache/hive/blob/rel/release-3.1.2/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2548

@alexkingli
Author

The FileIO API is quite narrow by design. I would not think that the community would agree to add a new method to it.

I think the easiest way forward would be to convince Hive to drop the table data. It should be possible to do it based on the value of ifPurge https://github.com/apache/hive/blob/rel/release-3.1.2/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2548

I think we can solve this problem by merging PR #1839, but it was created a long time ago and wasn't merged into the release. Maybe you are right; it is not a big problem for the moment.

@pvary
Contributor

pvary commented Dec 16, 2021

So basically this is the same discussion as in #1839.
Were you successful in finding out why the HiveMetaStore does not delete the table directory in Hive 3.1.2?

@rdblue
Contributor

rdblue commented Jan 18, 2022

It looks like there's quite a bit of discussion about whether this is the right direction. For now, I'm going to close this PR until we figure out how to handle this case.

@rdblue rdblue closed this Jan 18, 2022