Skip to content

Conversation

@codlife
Copy link
Contributor

@codlife codlife commented Oct 17, 2016

What changes were proposed in this pull request?

Currently, with DataFrame API, we can't load standard json file directly, so we can provide an override method to process this.

How was this patch tested?

manual tests

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Oct 17, 2016

I don't quite understand this -- what does "standard" mean? This still doesn't load a 'standard JSON' file.

@codlife
Copy link
Contributor Author

codlife commented Oct 17, 2016

In standard json file, multi lines json object is legal, but currently, we can just load single-line json obejct directly.

val jsonRDD = sparkSession.sparkContext.wholeTextFiles(path)
.map(line => line.toString().replaceAll("\\s+", ""))
.map { jsonLine =>
val index = jsonLine.indexOf(",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if I ask what this line means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this code is bad, I just want to get the json contents
such as: ("filename",json_contents)

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Oct 17, 2016

I guess it'd be nicer if this PR resembles #14151
The change suggested in the JIRA is to read each JSON object per file which I guess we can share some codes in the PR.

Also, as we have a JSONOptions and DataFrameReader.option(...) API, I think it'd be nicer if this one is added as an option rather than introducing another API.

@HyukjinKwon
Copy link
Member

BTW, I guess per-line JSON also complies a standard - https://tools.ietf.org/html/rfc7159#section-4. We should add a test, fix the title to summarise what the PR proposes and fill the PR description. I think also we can also alternatively close this, wait until 14151 is merged and then open again whan you are ready to start working on this..

@codlife
Copy link
Contributor Author

codlife commented Oct 17, 2016

Compile is ok, but when we call show(), we will get a _corrupt_record, besides when we call select on this df, we will get an exception.

@srowen
Copy link
Member

srowen commented Oct 17, 2016

OK, I think in both cases "standard" JSON is read, and in both cases, each record is a JSON document. These aren't different cases. If you mean to read small JSON files as records, you just use wholeTextFiles, as you show. I do not think wrapping this up with an extra flag helps enough to justify this because callers can easily implement this. There are a hundred other variations on this, and the reason we don't implement them all is exactly because there are so many variations to bottle up like this.

@codlife
Copy link
Contributor Author

codlife commented Oct 17, 2016

@srowen , you are right! I propose this method just to make it more user friendly, With this method, user can load a standard json file directly.
You can have a look about this https://issues.apache.org/jira/browse/SPARK-17969

@srowen
Copy link
Member

srowen commented Nov 4, 2016

I think we should close this. I don't believe it's worth a new API method.

@codlife codlife closed this Nov 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants