Solve the problem that table files cannot be deleted by HiveCatalog #3730
Conversation
@alexkingli: Could you please explain exactly what happened?
What I do not understand is that at the 2nd try we do not have metadata for the table. How is it possible to drop a table in this case? #2583 is somewhat related; could it help you? Thanks,
For your case, where the wrong user was used, there's also a Spark action that can potentially help with this: RemoveReachableFiles. It's already usable, but here's a PR to add it as a stored procedure: #3719. Would that possibly help as well? My understanding has always been that situations like the 2nd one were the reason for having that action.
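For reference, a minimal sketch of invoking such a reachable-files cleanup through the Spark actions API; the action and method names follow the newer Iceberg SparkActions interface and may differ between versions, and the metadata path is only a placeholder:

```java
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CleanupReachableFiles {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("cleanup").getOrCreate();

    // Sketch only: delete every file reachable from a given metadata.json,
    // useful when the metastore entry is already gone but the files remain.
    // The metadata location below is a placeholder, not a real path.
    SparkActions.get(spark)
        .deleteReachableFiles("hdfs:///home/lakehouse/db.db/tbl/metadata/v2.metadata.json")
        .io(new HadoopFileIO(spark.sparkContext().hadoopConfiguration()))
        .execute();
  }
}
```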
Yes, the situation you described is the same as mine. I think this cleanup method is necessary in HiveCatalog, so we don't need to clean up these files directly with Hadoop commands every time.
I would be interested in:
I think if we know the location of the former table, then we can use it to remove the directory by hand or through a FileSystem API. I am not sure we really need a new method for it in the Iceberg API, especially since it does not do anything Iceberg-specific: it just removes the directory recursively (which could be wrong in the case of an Iceberg table, as it could contain data files from anywhere). I still might be missing some points, so correct me if I am wrong in my assumptions/logic above. Thanks,
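For illustration, a sketch of removing a leftover table directory with the Hadoop FileSystem API (the location is hypothetical); as noted above, a blind recursive delete can be wrong for an Iceberg table whose data files live outside its own directory:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveFormerTableDir {
  public static void main(String[] args) throws Exception {
    // Hypothetical location of the dropped table; adjust to the real warehouse path.
    Path tableDir = new Path("hdfs:///home/lakehouse/db.db/tbl");

    FileSystem fs = tableDir.getFileSystem(new Configuration());
    if (fs.exists(tableDir)) {
      // Recursive delete: removes data/, metadata/ and the table directory itself.
      fs.delete(tableDir, true);
    }
  }
}
```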
Ok, I submitted a new commit. I changed the logic of dropTable in HiveCatalog: I think it should remove the data files first and then clean up the metadata in the metastore. The problem at present is that dropTable can't make deleting the metadata in the Hive Metastore and deleting the data in S3 or HDFS a single transaction. So it seems better to put the data-deletion logic first; that ensures the data is deleted successfully as far as possible.
We had this discussion in Hive before, and the conclusion was that it is better to drop the metadata first. This way we might keep some unnecessary files, but we always have a consistent Hive table structure, and it is quite easy to delete the files by hand. If we go the other way around, then we might end up in a situation where we drop the files but keep the metadata. That leaves a corrupted Hive table structure, which the community decided is a much worse situation for Hive users than the former. I think Iceberg tables are not special in this way, and the argument above still holds. What do you think?
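In code, the "metadata first, files second" ordering argued for above looks roughly like the sketch below. This is not the actual HiveCatalog implementation; the helper methods are hypothetical stubs that stand in for the metastore call and the file cleanup:

```java
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.catalog.TableIdentifier;

// Sketch of the drop ordering discussed above, not the real HiveCatalog code.
public abstract class DropOrderingSketch {

  public boolean dropTable(TableIdentifier identifier, boolean purge) {
    TableMetadata lastMetadata = loadMetadataIfPresent(identifier); // may be null

    // 1. Remove the entry from the Hive Metastore first, so the catalog stays
    //    consistent even if the file cleanup below fails. Worst case we keep
    //    orphaned files, which are easy to delete by hand later.
    dropHiveTable(identifier);

    // 2. Then best-effort deletion of data and metadata files.
    if (purge && lastMetadata != null) {
      deleteTableFiles(lastMetadata);
    }
    return true;
  }

  // Hypothetical helpers standing in for the metastore and FileIO operations.
  protected abstract TableMetadata loadMetadataIfPresent(TableIdentifier identifier);

  protected abstract void dropHiveTable(TableIdentifier identifier);

  protected abstract void deleteTableFiles(TableMetadata metadata);
}
```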
I think I understand the design of dropTable now. Thanks a lot. Another problem is that dropTable can't delete some directories and metadata files in HDFS if we use HDFS as the underlying storage. This can cause recreating a table with the same name to fail. Please take a look at this problem, thanks again.
Does the delete run into some errors, or does it just not try to remove the files?
Even if this method executes successfully, some directories and metadata files are not cleaned up in HDFS. For example: /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data, and /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json. Then if I create a table with the same name, I need to delete these directories and files with Hadoop commands first.
I think #3622 should help by removing the /home/$lakehouse_dir/$database.db/$table/metadata/*.metadata.json files. If I understand correctly, the reason for that change was that the old metadata files were kept, and after the PR they will be removed. I would guess that with #3622 the only remaining problems are the /home/$lakehouse_dir/$database.db/$table/, /home/$lakehouse_dir/$database.db/$table/data and /home/$lakehouse_dir/$database.db/$table/metadata/ dirs. It might be useful to remove them at the end of the drop; if the drop happens from Hive, then we might get away with adding it there. Thanks,
Thanks a lot. I think it is necessary to delete these empty directories as well. In fact, there may also be some partition directories that need to be deleted. It is not convenient to delete these directories in HDFS or S3 by command every time. Alternatively, we could build a tool to solve this problem.
I think the easiest way forward would be to convince Hive to drop the table data.
I think we could solve this problem by merging PR #1839, but it was created a long time ago and wasn't merged into a release. Maybe you are right; it is not a big problem for the moment.
So basically this is the same discussion as in #1839. |
It looks like there's quite a bit of discussion about whether this is the right direction. For now, I'm going to close this PR until we figure out how to handle the case.
Add a cleanTable method in HiveCatalog to solve the problem that table files cannot be deleted when the table has already been removed from the metastore. Then we don't need to delete the table files with Hadoop commands. For safety, it fails if the table still exists.
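A minimal sketch of what such a cleanTable could look like (the name, signature and location check are assumptions, not the actual PR code): it refuses to touch anything if the table still exists in the metastore, and otherwise recursively deletes the table's warehouse directory.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper in the spirit of the proposed cleanTable: clean up the
// files of a table whose metastore entry is already gone. Refuses to run if
// the table still exists, so a live table cannot be deleted by accident.
public class CleanTableSketch {

  public static void cleanTable(boolean tableExistsInMetastore, String tableLocation)
      throws IOException {
    if (tableExistsInMetastore) {
      throw new IllegalStateException(
          "Table still exists in the metastore; use dropTable instead of cleanTable");
    }

    Path tableDir = new Path(tableLocation); // e.g. hdfs:///home/lakehouse/db.db/tbl
    FileSystem fs = tableDir.getFileSystem(new Configuration());
    if (fs.exists(tableDir)) {
      fs.delete(tableDir, true); // removes data/, metadata/ and the directory itself
    }
  }
}
```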