
Conversation

@LantaoJin
Contributor

@LantaoJin LantaoJin commented Nov 14, 2019

What changes were proposed in this pull request?

SPARK-27990 (#24830) provides a way to recursively load data from a datasource. In SQL, when querying a Hive table, this property is passed via relation.tableMeta.properties, but it is currently filtered out, so we cannot look up files recursively for a Hive table.

This PR does not add a new property or feature. The property recursiveFileLookup in TBLPROPERTIES should already work with the current implementation, but it is filtered out due to a bug.

CREATE TABLE test1 (id bigint)
STORED AS PARQUET LOCATION '$baseDir'
TBLPROPERTIES (
'recursiveFileLookup'='true')
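
For comparison, the datasource-level usage that SPARK-27990 already supports looks roughly like this (a minimal sketch; the path is a placeholder):

spark.read
  .option("recursiveFileLookup", "true") // list data files in nested subdirectories
  .parquet("/path/to/baseDir")           // placeholder base directory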

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.
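
A minimal sketch of the kind of test (the suite helpers such as withTempDir and the exact assertions are assumptions, not the actual patch):

test("recursiveFileLookup in TBLPROPERTIES loads nested data files") {
  withTempDir { dir =>
    val baseDir = dir.getCanonicalPath
    // write data into a nested subdirectory under the table location
    spark.range(10).write.parquet(s"$baseDir/nested")
    sql(
      s"""
         |CREATE TABLE test1 (id bigint)
         |STORED AS PARQUET LOCATION '$baseDir'
         |TBLPROPERTIES (
         | 'recursiveFileLookup'='true')
       """.stripMargin)
    // nested files should be picked up when the table property is honored
    assert(sql("SELECT * FROM test1").count() === 10)
  }
}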

@SparkQA

SparkQA commented Nov 14, 2019

Test build #113796 has finished for PR 26525 at commit 217815e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|CREATE TABLE test1 (id bigint)
|STORED AS PARQUET LOCATION '$baseDir'
|TBLPROPERTIES (
| 'recursiveFileLookup'='true')
Contributor

Sorry to ask tangential questions, but I'm curious: Will the Metastore track this property somehow? i.e. If I create a table with 'recursiveFileLookup'='true' using Spark, can I query it from Presto and see the same data, provided that both are pointed at the same Metastore? Will the Metastore just track the table property, or will it also track the list of data paths that were detected when the table was created or refreshed?

Contributor Author

Thanks for pointing this out. Maybe 'spark.recursiveFileLookup' would be more meaningful for users.

@cloud-fan
Contributor

Can you describe the expected behavior? To me, the Hive metastore already tells us the directory structure: if the table is partitioned, data files are under each partition directory; otherwise, data files are under the table directory. Why do we need to look up files recursively?

@LantaoJin
Contributor Author

@cloud-fan The reason is very simple, but I am not sure it is correct for Hive:
We found that the data source paths of some Hive tables are nested, and I added a way to handle this for Spark datasources in #24830. Since the datasource API has reasons to load data recursively, I thought tables could take the same approach. If this patch looks unreasonable I can close it, since the issue can also be fixed by removing the nested paths.
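
For illustration, a hypothetical nested layout of the kind described above, where data files sit below extra subdirectories under the table location:

/warehouse/test1/                          <- table LOCATION
/warehouse/test1/batch_1/part-00000.parquet
/warehouse/test1/batch_2/part-00000.parquet

Without recursive lookup, only files directly under /warehouse/test1/ are read, so the nested files are missed.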

@cloud-fan
Contributor

cloud-fan commented Nov 19, 2019

Loading files recursively may make sense for some data sources, but not for tables. We have a clear policy about the file layout for tables. Please close this.

@LantaoJin
Contributor Author

@cloud-fan Thanks for pointing this out. Closing.

@LantaoJin LantaoJin closed this Nov 19, 2019