[SPARK-15463][SQL] support creating dataframe out of Dataset[String] for csv data #13300
Conversation
|
@HyukjinKwon @falaki Could you review the PR? Thanks! |
|
Takeshi Yamamuro suggested on https://issues.apache.org/jira/browse/SPARK-15463 that the new API should take a Dataset[String] as input instead of an RDD[String] |
|
Yes. I am adding the Dataset[String] API also. Will push soon. |
|
Do we still need the interface for RDD[String]? |
|
@maropu The API that converts Dataset[String] to DataFrame is built on top of the one for RDD[String]. So I am thinking it could be beneficial to provide both? |
|
@HyukjinKwon Let me try to understand your question. Right now, we have |
|
@xwu0226 Ah, for example, it seems a new method, |
|
Let's not add so many new APIs. You can just add the Dataset[String] one, since RDD[String] can be easily converted into Dataset. |
|
@rxin OK. Thanks! will do. |
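As a hedged illustration of the point above (not code from this PR; the sample data and options are made up), converting an existing RDD[String] into a Dataset[String] before handing it to the proposed csv API could look roughly like this:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("csv-from-dataset").getOrCreate()
import spark.implicits._

// A hypothetical RDD[String] holding raw csv lines.
val csvRdd = spark.sparkContext.parallelize(Seq("name,age", "alice,29", "bob,31"))

// RDD[String] -> Dataset[String] is a one-liner, which is why a separate RDD-based csv API is not needed.
val csvDs: Dataset[String] = csvRdd.toDS()

// The proposed API would then consume the Dataset[String] directly.
val df: DataFrame = spark.read.option("header", "true").csv(csvDs)
df.show()
```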
This method is used in both csv.DefaultSource and DataFrameReader.csv(ds: Dataset[String]). So I refactored it here to take care of both the default schema type and the inferSchemaFlag=true cases.
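For illustration only (the helper name below is hypothetical, not the PR's actual refactored method), the "default schema type" case mentioned above amounts to typing every csv column as a string when schema inference is off:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical sketch: when inferSchema is disabled, csv columns default to StringType.
def defaultStringSchema(header: Array[String]): StructType =
  StructType(header.map(name => StructField(name, StringType, nullable = true)))

// Example: a three-column header yields three nullable string columns.
val schema = defaultStringSchema(Array("name", "age", "city"))
```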
|
@rxin Please help double check! Many thanks!! |
|
test this please |
|
@xwu0226 the unit tests you added seem sufficient |
|
@HyukjinKwon @rxin @falaki Would it be feasible to get this merged for Spark 2.0 release? |
|
@pjfanning we are now focusing on bug fixes and stability fixes rather than adding new features. |
|
@rxin Do you think we can revisit this feature and have it in 2.1? Thanks! |
|
Would it be feasible to get this merged for Spark 2.1.0? |
Force-pushed 7d5728e to 35d1180
|
What's the status of this PR? It seems to be more natural that we implement |
|
I checked the feasibility to implement |
|
@maropu Thanks for the comments! It seems like adding such a new datasource API to DataFrameReader is not a priority now. That is why this PR has been in a relatively idle state. What you are proposing can solve a similar problem; it is just that the user cannot use the datasource API to do it. |
|
This PR seems stale and inactive. I know this kind of API change has lower priority now. So, how about closing this PR for now and setting |
|
Actually, this feature might not be urgent as said above, but IMO I like this feature, to be honest. I guess the reason it was put on hold is that, IMHO, it does not look like a clean fix. I recently refactored this code path and I have one PR left, #16680. After that is hopefully merged, there can be an easy, clean fix, consistent with the json one, within roughly 15 additional lines, for example, something like the one below:

```scala
def csv(csv: Dataset[String]): DataFrame = {
  val parsedOptions: CSVOptions = new CSVOptions(extraOptions.toMap)
  val caseSensitive = sparkSession.sessionState.conf.caseSensitive
  val schema = userSpecifiedSchema.getOrElse {
    InferSchema.infer(csv, caseSensitive, parsedOptions)
  }
  val parsed = csv.mapPartitions { iter =>
    val parser = new UnivocityParser(schema, caseSensitive, parsedOptions)
    iter.flatMap(parser.parse)
  }
  Dataset.ofRows(
    sparkSession,
    LogicalRDD(schema.toAttributes, parsed)(sparkSession))
}
```

I remember there have been quite a few questions about this feature in spark-csv as a third-party package (and also spark-xml, too). |
|
@HyukjinKwon Thanks! After your #16680 is merged, please submit a PR with the code you showed above. |
|
Yea, I also think |
|
Oh, I remember the answer to my previous similar question, which was that we should not add some APIs just for consistency. I have some references about the requests for this feature but don't have ones for the others. So, I am less sure. |
|
Aha, I see. Anyway, we need to keep the discussion not here but on the JIRA! (because this PR is closed.) |
What changes were proposed in this pull request?
Currently, DataFrameReader.csv(...): DataFrame does not support converting a Dataset[String] to a DataFrame. This PR adds the API DataFrameReader.csv(rdd: Dataset[String]): DataFrame. Also, in order to easily invoke the helper functions that are already implemented for csv parsing, I moved some of the private methods from csv.DefaultSource to CSVRelation.
How was this patch tested?
A test case is added to load csv files into a Dataset[String], convert it to a DataFrame, and check the results.
Regression tests were run.
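A minimal sketch of the kind of test described above (the file path, options, and data are hypothetical, not the PR's actual test case), assuming the new Dataset[String]-based csv API is available:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("csv-dataset-test").getOrCreate()

// Read the raw csv file as a Dataset[String], one element per line.
val lines: Dataset[String] = spark.read.textFile("src/test/resources/cars.csv")

// Convert with the Dataset[String]-based API and compare against the file-based reader.
val fromDataset = spark.read.option("header", "true").csv(lines)
val fromFile = spark.read.option("header", "true").csv("src/test/resources/cars.csv")
assert(fromDataset.collect().toSet == fromFile.collect().toSet)
```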