[SPARK-17969]I think it's user unfriendly to process standard json file with DataFrame #15511

codlife · 2016-10-17T08:39:35Z

What changes were proposed in this pull request?

Currently, the DataFrame API can't load a "standard" JSON file directly (one whose JSON object spans multiple lines), so this PR provides an overloaded method to handle that case.

How was this patch tested?

manual tests

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

AmplabJenkins · 2016-10-17T08:42:14Z

Can one of the admins verify this patch?

srowen · 2016-10-17T08:43:54Z

I don't quite understand this -- what does "standard" mean? This still doesn't load a 'standard JSON' file.

codlife · 2016-10-17T08:47:00Z

In a standard JSON file, a JSON object that spans multiple lines is legal, but currently we can only load single-line JSON objects directly.

HyukjinKwon · 2016-10-17T08:55:55Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

+      val jsonRDD = sparkSession.sparkContext.wholeTextFiles(path)
+        .map(line => line.toString().replaceAll("\\s+", ""))
+        .map { jsonLine =>
+          val index = jsonLine.indexOf(",")


Do you mind if I ask what this line means?

codlife · 2016-10-17

Maybe this code is bad; I just want to extract the JSON contents from each pair, such as ("filename", json_contents).

HyukjinKwon · 2016-10-17T08:58:40Z

I guess it'd be nicer if this PR resembled #14151.
The change suggested in the JIRA is to read one JSON object per file, so I guess we can share some code with that PR.

Also, as we have a JSONOptions and DataFrameReader.option(...) API, I think it'd be nicer if this one is added as an option rather than introducing another API.
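
For illustration, an option-based read might look like the sketch below. It assumes a Spark version (2.2 or later, which postdates this thread) whose JSON data source accepts the `multiLine` option; the object and method names are hypothetical and are not part of this PR.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiLineJsonOption {
  // Reads files in which a single JSON document may span several lines.
  // Assumption: Spark 2.2+, where the JSON source supports "multiLine".
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", "true").json(path)
}
```

The call shape stays `DataFrameReader.option(...).json(...)`, so no new reader method is needed.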

HyukjinKwon · 2016-10-17T09:13:00Z

BTW, I guess per-line JSON also complies with a standard: https://tools.ietf.org/html/rfc7159#section-4. We should add a test, fix the title to summarise what the PR proposes, and fill in the PR description. Alternatively, I think we can close this, wait until #14151 is merged, and then open it again when you are ready to start working on this.

codlife · 2016-10-17T09:28:23Z

It compiles fine, but when we call show() we get a _corrupt_record column, and when we call select on this DataFrame we get an exception.

srowen · 2016-10-17T10:09:25Z

OK, I think in both cases "standard" JSON is read, and in both cases, each record is a JSON document. These aren't different cases. If you mean to read small JSON files as records, you just use wholeTextFiles, as you show. I do not think wrapping this up with an extra flag helps enough to justify this because callers can easily implement this. There are a hundred other variations on this, and the reason we don't implement them all is exactly because there are so many variations to bottle up like this.
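
A minimal sketch of the caller-side approach described above, assuming the Spark 2.0-era `DataFrameReader.json(RDD[String])` overload (deprecated in later releases in favor of `Dataset[String]`). The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WholeFileJson {
  // Treats every file under `path` as one JSON document.
  // wholeTextFiles yields (fileName, fileContents) pairs; only the contents
  // are kept, and each string is then parsed as a single JSON record.
  def read(spark: SparkSession, path: String): DataFrame = {
    val contents = spark.sparkContext
      .wholeTextFiles(path)
      .map { case (_, content) => content }
    spark.read.json(contents)
  }
}
```

Each file is held in memory as a single string, so this suits the small-file case mentioned above rather than very large documents.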

codlife · 2016-10-17T10:25:12Z

@srowen, you are right! I proposed this method just to make things more user-friendly: with it, users can load a standard JSON file directly.
You can have a look at https://issues.apache.org/jira/browse/SPARK-17969 for details.

srowen · 2016-11-04T18:03:16Z

I think we should close this. I don't believe it's worth a new API method.

codlife and others added 15 commits September 10, 2016 10:02

solve spark-17447

673c29b

Update Partitioner.scala

a460905

solve spark-17447

7829bd0

fix code style

8ddc442

solve spark-17447

81c0eb9

Update Partitioner.scala

f5d1e24

Merge branch 'master' of https://github.com/codlife/spark

e717f65

solve SPARK-17521

e426ccf

Merge pull request #2 from apache/master

af1a102

fix

8bfcd6b

Update SparkContext.scala

379cd5a

Merge branch 'master' of https://github.com/codlife/spark

f454668

support stand json file

1d0d4fc

Merge pull request #3 from apache/master

9639a14

Update DataFrameReader.scala

2084079

Update DataFrameReader.scala

43bf4e5

HyukjinKwon reviewed Oct 17, 2016

codlife closed this Nov 7, 2016
