Conversation


@yanakad commented Dec 18, 2015

No description provided.

@yanakad (Author) commented Dec 18, 2015

@liancheng I think you added this code originally


SparkQA commented Dec 18, 2015

Test build #2233 has finished for PR 10379 at commit 25c0f41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor) commented:

@yanakad Thanks for your contribution! However, I'd argue that building a partial DataFrame is error-prone and dangerous, since nonexistent paths are silently ignored without any error or warning. For example, a trivial spelling error in one of the paths would go unnoticed, and the user may still believe that all the data were loaded correctly.

@liancheng (Contributor) commented:

Also, the PR title is ambiguous. "DataFrameReader fails on globbing parquet paths that contain nonexistent path(s)" might be more accurate.

@yanakad (Author) commented Dec 19, 2015

@liancheng Would logging the failed paths at WARN or ERROR level be an acceptable compromise? I'm not sure whether you're saying the fix isn't good enough, or disagreeing that there is an issue at all.
I think the original behavior is a problem. If you have paths like /root/account=number/date='yyyy-mo'/... , create a DataFrame at the root level, and execute 'select * where account=nonexistent', you get an empty DataFrame. If you execute a query with where date in (mo1, mo2, mo3) and there is no mo3 partition, you still get the data for months 1 and 2. On the other hand, if you try to create a DataFrame at /root/account=nonexistent, you get an exception. I have a very large, heavily partitioned space, which is why I create DataFrames as low in the hierarchy as possible, and I run into this problem whenever a partition path is missing.
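[Editor's note] The asymmetry described above (empty result when a nonexistent partition is reached via globbing or filtering, exception when it is addressed directly) mirrors how globbing behaves in general: a glob over a nonexistent path silently yields no matches. A minimal Python sketch with a hypothetical partition layout (not Spark code, just an illustration of the same contrast):

```python
import glob
import os
import tempfile

# Hypothetical layout: <tmp>/account=123/date=2015-01/
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "account=123", "date=2015-01"))

# Globbing a nonexistent partition silently yields no matches ...
matches = glob.glob(os.path.join(root, "account=nonexistent", "*"))
print(matches)  # -> []

# ... while touching the nonexistent path directly raises an error,
# analogous to creating a DataFrame at /root/account=nonexistent.
raised = False
try:
    os.listdir(os.path.join(root, "account=nonexistent"))
except FileNotFoundError:
    raised = True
print(raised)  # -> True
```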

@liancheng (Contributor) commented:

@yanakad Thanks for your explanation; now I understand your use case, and I agree that it is somewhat inconvenient. But I still tend to say this shouldn't be an issue, because:

  1. At the application level, this can be worked around by globbing the lowest-level directories first, and then passing the resulting path(s) to the DataFrameReader.parquet() method.
  2. The changes made in this PR have a negative impact on the public API:
    • As mentioned above, the behavior becomes more error-prone and dangerous.
    • The behavior becomes inconsistent with other data sources. For example, ORC, JSON, and JDBC all throw exceptions when the input path or JDBC URL is invalid or doesn't exist.
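[Editor's note] The application-level workaround in point 1 could be sketched as follows. The partition names are hypothetical, and the final `spark.read.parquet` call is assumed PySpark usage, shown commented out; the point is to glob first, warn about missing partitions explicitly, and hand only existing paths to the reader:

```python
import glob
import os
import tempfile

# Hypothetical partition layout: <tmp>/date=<month>/
root = tempfile.mkdtemp()
for month in ("2015-01", "2015-02"):
    os.makedirs(os.path.join(root, "date=" + month))

# Glob the lowest-level directories first; a nonexistent month simply
# produces no matches, which we can surface as an explicit warning.
wanted = ["2015-01", "2015-02", "2015-03"]
existing = []
for month in wanted:
    hits = glob.glob(os.path.join(root, "date=" + month))
    if hits:
        existing.extend(hits)
    else:
        print("warning: no partition for date=" + month)

# Then pass only the paths that actually exist to the reader:
# df = spark.read.parquet(*existing)   # assumed PySpark API
print(len(existing))  # -> 2
```

This keeps the "fail loudly" semantics at the application boundary while still allowing a partial read when that is what the caller explicitly wants.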
