@zhengchenyu
Contributor

@zhengchenyu zhengchenyu commented Oct 22, 2025

What changes were proposed in this pull request?

Add a new variable, partitionDirToChildrenFiles, to InMemoryFileIndex so that lookups by partition or table location no longer go through cachedLeafDirToChildrenFiles.

Why are the changes needed?

When a table location or partition location contains multi-level non-partitioned paths, Spark cannot read the contents. In particular, Tez generates the HIVE_UNION_SUBDIR_1 directory under such locations.

The reason is that, when there are subdirectories, the key stored in cachedLeafDirToChildrenFiles may be something like /xx/table/pt=1/subdir. A lookup such as cachedLeafDirToChildrenFiles.get('/xx/table/pt=1') then returns nothing for the partition path, resulting in data loss.
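The lookup miss can be reproduced with a toy model. This is a minimal Python sketch, not Spark's Scala code: the dictionary only stands in for cachedLeafDirToChildrenFiles, and the file name is made up for illustration.

```python
# Toy stand-in for a map keyed by leaf directory (hypothetical file name).
leaf_dir_to_children_files = {
    "/xx/table/pt=1/subdir": ["/xx/table/pt=1/subdir/part-00000"],
}

partition_location = "/xx/table/pt=1"

# An exact-key lookup by partition location misses, because the only key
# present is the deeper leaf directory.
files = leaf_dir_to_children_files.get(partition_location, [])
print(files)
```

The file exists on disk under the partition, but a reader that trusts this lookup sees an empty listing for pt=1.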

The root cause lies in how cachedLeafDirToChildrenFiles is used. Three places in the code use it. The use in inferPartitioning is correct, because partitions are easier to infer from the lower directory levels. The other two uses are incorrect: they should retrieve files by partition or table location, not by leaf directory. This PR adds partitionDirToChildrenFiles to solve this problem.
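One way to picture a map keyed by partition or table location is to re-group the leaf-directory listings under the location that contains them. The following Python sketch is only an illustration under stated assumptions: group_by_partition_dir, the file names, and HIVE_UNION_SUBDIR_2 are hypothetical, and this is not the code in this PR.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_partition_dir(leaf_dir_to_files, partition_dirs):
    """Re-key leaf-directory listings by the partition (or table) location containing them."""
    roots = [PurePosixPath(p) for p in partition_dirs]
    grouped = defaultdict(list)
    for leaf_dir, files in leaf_dir_to_files.items():
        leaf = PurePosixPath(leaf_dir)
        # A root matches if it is the leaf directory itself or one of its ancestors.
        matches = [r for r in roots if r == leaf or r in leaf.parents]
        if matches:
            # Prefer the deepest matching root so nested locations stay separate.
            root = max(matches, key=lambda r: len(r.parts))
            grouped[str(root)].extend(files)
    return dict(grouped)


# Hypothetical listing: pt=1 was written through two union subdirectories.
leaf_listing = {
    "/xx/table/pt=1/HIVE_UNION_SUBDIR_1": ["/xx/table/pt=1/HIVE_UNION_SUBDIR_1/000000_0"],
    "/xx/table/pt=1/HIVE_UNION_SUBDIR_2": ["/xx/table/pt=1/HIVE_UNION_SUBDIR_2/000000_0"],
    "/xx/table/pt=2": ["/xx/table/pt=2/000000_0"],
}
by_partition = group_by_partition_dir(leaf_listing, ["/xx/table/pt=1", "/xx/table/pt=2"])
```

With this grouping, a lookup by /xx/table/pt=1 returns the files from both subdirectories, while the leaf-keyed map remains available for partition inference.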

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test and verified with a real job.

@zhengchenyu zhengchenyu changed the title from [SPARK-28098][SQL]Support read hive table when location have multi-level non partitioned paths" to [SPARK-28098][SQL]Support read hive table when location have multi-level non partitioned paths on Oct 22, 2025
@github-actions github-actions bot added the SQL label Oct 22, 2025
@zhengchenyu
Contributor Author

I know some earlier PRs have tried to solve this problem.

So I am submitting this new PR.

@zhengchenyu zhengchenyu marked this pull request as draft October 23, 2025 07:49
@zhengchenyu zhengchenyu deleted the SPARK-28098 branch October 23, 2025 12:40
@zhengchenyu zhengchenyu changed the title from [SPARK-28098][SQL]Support read hive table when location have multi-level non partitioned paths to [SPARK-28098][SQL] Supports reading hive tables when there are subdirectories under the table or partition directory. on Oct 24, 2025
@zhengchenyu zhengchenyu changed the title from [SPARK-28098][SQL] Supports reading hive tables when there are subdirectories under the table or partition directory. to [SPARK-28098][SQL] Supports reading hive tables when there are subdirectories under the table or partition location. on Oct 24, 2025
…ectories under the table or partition location.
@zhengchenyu zhengchenyu marked this pull request as ready for review October 25, 2025 09:24
@zhengchenyu
Contributor Author

@cloud-fan @dongjoon-hyun @HyukjinKwon Can you please review this PR? I think it is more reasonable to read files under subdirectories.
