[SPARK-17969] I think it's user unfriendly to process standard json file with DataFrame #15511
Conversation
Can one of the admins verify this patch?
I don't quite understand this -- what does "standard" mean? This still doesn't load a 'standard JSON' file.
In a standard JSON file, a multi-line JSON object is legal, but currently we can only load single-line JSON objects directly.
```scala
val jsonRDD = sparkSession.sparkContext.wholeTextFiles(path)
  .map(line => line.toString().replaceAll("\\s+", ""))
  .map { jsonLine =>
    val index = jsonLine.indexOf(",")
```
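As the review comments below suggest, the `replaceAll("\\s+", "")` step is lossy: it also deletes whitespace inside JSON string values. A minimal plain-Python illustration of the problem, using only the stdlib `json` module and a made-up document:

```python
import json

# Hypothetical multi-line JSON document; the string value contains a space.
doc = """
{
  "name": "Ada Lovelace",
  "age": 36
}
"""

# Collapsing ALL whitespace, as replaceAll("\\s+", "") would, also removes
# the space inside the string value and silently corrupts the data.
stripped = "".join(doc.split())
print(json.loads(stripped)["name"])  # "AdaLovelace" -- the space is gone

# Parsing first and re-serializing compactly flattens the document safely.
compact = json.dumps(json.loads(doc), separators=(",", ":"))
print(json.loads(compact)["name"])  # "Ada Lovelace"
```

The safe route flattens a multi-line document onto one line by round-tripping it through a real JSON parser instead of a regex.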
Do you mind if I ask what this line means?
Maybe this code is bad; I just want to get the JSON contents, such as ("filename", json_contents).
I guess it'd be nicer if this PR resembles #14151. Also, as we have a
BTW, I guess per-line JSON also complies with a standard - https://tools.ietf.org/html/rfc7159#section-4. We should add a test, fix the title to summarise what the PR proposes, and fill in the PR description. I think we can also alternatively close this, wait until #14151 is merged, and then open it again when you are ready to start working on this.
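For reference, the per-line layout that Spark's JSON reader expects treats each line as a complete JSON text, which RFC 7159 itself permits. A small stdlib-`json` sketch with made-up records:

```python
import json

# Per-line JSON ("JSON Lines"): each line is one complete JSON text.
json_lines = '{"a": 1}\n{"a": 2}\n{"a": 3}\n'

# Parse line by line, skipping blanks; each line yields one record.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
print([r["a"] for r in records])  # [1, 3, 2] order preserved -> [1, 2, 3]
```

Because every record is confined to one line, the file can be split and parsed in parallel without ever scanning for matching braces.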
It compiles, but when we call show() we will get a _corrupt_record; besides, when we call select on this DataFrame, we will get an exception.
OK, I think in both cases "standard" JSON is read, and in both cases, each record is a JSON document. These aren't different cases. If you mean to read small JSON files as records, you just use wholeTextFiles, as you show. I do not think wrapping this up with an extra flag helps enough to justify this because callers can easily implement this. There are a hundred other variations on this, and the reason we don't implement them all is exactly because there are so many variations to bottle up like this.
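A plain-Python sketch of the caller-side approach described above: treat each small file as a single record, analogous to combining wholeTextFiles with a JSON parse in Spark. The directory, file names, and records here are made up for illustration, and only the stdlib is used (not the Spark API):

```python
import glob
import json
import os
import tempfile

# Write two pretty-printed (multi-line) JSON files into a scratch directory.
tmp = tempfile.mkdtemp()
for i, rec in enumerate([{"id": 1}, {"id": 2}]):
    with open(os.path.join(tmp, "rec%d.json" % i), "w") as f:
        json.dump(rec, f, indent=2)  # multi-line "standard" JSON file

# Read each whole file as one record, regardless of internal line breaks.
records = []
for path in sorted(glob.glob(os.path.join(tmp, "*.json"))):
    with open(path) as f:
        records.append(json.load(f))  # one whole file -> one record

print([r["id"] for r in records])  # [1, 2]
```

The point being made in the comment is that this composition is short enough that users can write it themselves, which is why a dedicated flag was judged unnecessary.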
@srowen, you are right! I proposed this method just to make it more user friendly. With this method, users can load a standard JSON file directly.
I think we should close this. I don't believe it's worth a new API method.
What changes were proposed in this pull request?
Currently, with the DataFrame API, we can't load a standard JSON file directly, so we can provide an overloaded method to handle this.
How was this patch tested?
manual tests