Conversation


@yanakad commented Dec 18, 2015

No description provided.

@yanakad (Author) commented Dec 18, 2015

@liancheng I think you added this code originally


SparkQA commented Dec 18, 2015

Test build #2233 has finished for PR 10379 at commit 25c0f41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor) commented:

@yanakad Thanks for your contribution! However, I'd argue that building a partial DataFrame is error-prone and dangerous, since nonexistent paths are silently ignored without any error or warning. For example, a trivial spelling error in one of the paths would go unnoticed, and the user may still believe that all the data were loaded correctly.

@liancheng (Contributor) commented:

Also, the PR title is ambiguous. "DataFrameReader fails on globbing parquet paths that contain nonexistent path(s)" might be more accurate.

@yanakad (Author) commented Dec 19, 2015

@liancheng Would logging the failed paths at WARN or ERROR level be an acceptable compromise? I'm not sure whether you're saying the fix isn't good enough, or disagreeing that there is an issue at all.
I think the original behavior is a problem. If you have paths like /root/account=number/date='yyyy-mo'/... , create a DataFrame at the root level, and execute 'select * where account=nonexistent', you get an empty DataFrame. If you execute a query with where date in (mo1, mo2, mo3) and there is no mo3 partition, you still get the data for months 1 and 2. On the other hand, if you try to create a DataFrame at /root/account=nonexistent, you get an exception. I have a very large, heavily partitioned space, which is why I create DataFrames as low in the hierarchy as possible, and I run into this problem whenever a partition path is missing.
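[Editor's note] The asymmetry described above (empty result when a nonexistent partition is reached via globbing or filtering, exception when it is addressed directly) mirrors how globbing behaves in general: a glob over a nonexistent path silently yields no matches. A minimal Python sketch with a hypothetical partition layout (not Spark code, just an illustration of the same contrast):

```python
import glob
import os
import tempfile

# Hypothetical layout: <tmp>/account=123/date=2015-01/
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "account=123", "date=2015-01"))

# Globbing a nonexistent partition silently yields no matches ...
matches = glob.glob(os.path.join(root, "account=nonexistent", "*"))
print(matches)  # -> []

# ... while touching the nonexistent path directly raises an error,
# analogous to creating a DataFrame at /root/account=nonexistent.
raised = False
try:
    os.listdir(os.path.join(root, "account=nonexistent"))
except FileNotFoundError:
    raised = True
print(raised)  # -> True
```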

@liancheng (Contributor) commented:

@yanakad Thanks for your explanation; now I understand your use case, and I agree that it is somewhat inconvenient. But I still tend to say this shouldn't be an issue, because:

  1. At the application level, this can be worked around by globbing the lowest-level directories first, and then passing the resulting path(s) to the DataFrameReader.parquet() method.
  2. The changes made in this PR have a negative impact on the public API:
    • As mentioned above, the behavior becomes more error-prone and dangerous.
    • The behavior becomes inconsistent with other data sources. For example, ORC, JSON, and JDBC all throw exceptions when the input path or JDBC URL is invalid or doesn't exist.
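[Editor's note] The application-level workaround in point 1 could be sketched as follows. The partition names are hypothetical, and the final `spark.read.parquet` call is assumed PySpark usage, shown commented out; the point is to glob first, warn about missing partitions explicitly, and hand only existing paths to the reader:

```python
import glob
import os
import tempfile

# Hypothetical partition layout: <tmp>/date=<month>/
root = tempfile.mkdtemp()
for month in ("2015-01", "2015-02"):
    os.makedirs(os.path.join(root, "date=" + month))

# Glob the lowest-level directories first; a nonexistent month simply
# produces no matches, which we can surface as an explicit warning.
wanted = ["2015-01", "2015-02", "2015-03"]
existing = []
for month in wanted:
    hits = glob.glob(os.path.join(root, "date=" + month))
    if hits:
        existing.extend(hits)
    else:
        print("warning: no partition for date=" + month)

# Then pass only the paths that actually exist to the reader:
# df = spark.read.parquet(*existing)   # assumed PySpark API
print(len(existing))  # -> 2
```

This keeps the "fail loudly" semantics at the application boundary while still allowing a partial read when that is what the caller explicitly wants.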
