[SPARK-28098][SQL] Supports reading hive tables when there are subdirectories under the table or partition location. #52694
+206
−56
What changes were proposed in this pull request?
Add a new variable partitionDirToChildrenFiles to InMemoryFileIndex to distinguish its use from that of cachedLeafDirToChildrenFiles.
Why are the changes needed?
When a table location or partition location contains multi-level, non-partitioned subdirectories, Spark cannot read their contents. In particular, Tez generates the HIVE_UNION_SUBDIR_1 directory.
The reason is that when there are subdirectories, the key stored in cachedLeafDirToChildrenFiles might be something like /xx/table/pt=1/sudir. Calling cachedLeafDirToChildrenFiles.get('/xx/table/pt=1') to retrieve the files for that path then returns nothing, resulting in data loss.
The root cause lies in how cachedLeafDirToChildrenFiles is used. Three places in the underlying code use it. Its use in inferPartitioning is correct, because partitions are easier to infer from the lower directory levels. The other two uses are incorrect: they should retrieve files from the partition or table location, not from the leaf directory. This PR adds partitionDirToChildrenFiles to solve this problem.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test and verified with a real job.
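The lookup mismatch described under "Why are the changes needed?" can be modelled in a few lines. This is only a toy sketch in Python using plain dictionaries, not Spark's InMemoryFileIndex code. The file name part-00000 and the helper names files_by_exact_leaf and build_partition_dir_to_children are hypothetical, introduced purely for illustration. The grouping rule shown (collect every file whose leaf directory is the partition location or lies beneath it) is an assumption about the intended behaviour, not a copy of the patch.

```python
from pathlib import PurePosixPath

# Toy stand-in for cachedLeafDirToChildrenFiles: keys are the deepest
# directories that directly contain files. The file name is hypothetical.
leaf_dir_to_children = {
    "/xx/table/pt=1/sudir": ["/xx/table/pt=1/sudir/part-00000"],
}


def files_by_exact_leaf(partition_dir):
    """Exact-key lookup, analogous to cachedLeafDirToChildrenFiles.get(dir)."""
    return leaf_dir_to_children.get(partition_dir, [])


def build_partition_dir_to_children(partition_dirs):
    """Group cached files under the partition (or table) location containing them.

    Assumed rule: a leaf directory belongs to a partition location if it is
    that location itself or one of its descendants.
    """
    grouped = {p: [] for p in partition_dirs}
    for leaf, files in leaf_dir_to_children.items():
        leaf_path = PurePosixPath(leaf)
        for p in partition_dirs:
            root = PurePosixPath(p)
            if leaf_path == root or root in leaf_path.parents:
                grouped[p].extend(files)
    return grouped


# The exact-key lookup misses the file because the key is the subdirectory.
missing = files_by_exact_leaf("/xx/table/pt=1")

# Grouping by partition location finds it.
found = build_partition_dir_to_children(["/xx/table/pt=1"])
```

Here missing is an empty list, which corresponds to the silent data loss described above, while found maps /xx/table/pt=1 to the file stored in its subdirectory.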