@RaghavendraS RaghavendraS commented Jun 20, 2016

Replace NullType with StringType in a DataFrame schema so that the DataFrame can be written in Parquet format.

@AmplabJenkins

Can one of the admins verify this patch?

@hvanhovell
Contributor

@RaghavendraS I don't think we should do this. Having a variable with a null type indicates that something should be fixed in the application code. Changing the meaning of the column can lead to surprises. Can you give an example of when this is useful?

@RaghavendraS
Author

Thanks @AmplabJenkins, @hvanhovell, @marmbrus, @akatz.

In my case, when we fetch incremental data from MongoDB and store it in a Parquet file, we get a NullType error, because Parquet has no NullType data type. So I came up with the solutions below.

Case-1: Convert NullType to StringType.
This lets us union the last n days of incremental Parquet data without errors. We only need to compare the schemas from bottom to top, change the data types accordingly, apply the resulting schema to the DataFrames, and union them from bottom to top.

Case-2: Drop the NullType field.
In this case we need to transform each RDD according to the final schema. In Case-1 we only transform the schema, not the RDD, so Case-1 is better than Case-2.
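
To make the difference between the two cases concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes the schema is available in the JSON-style dict form that Spark uses to serialize a StructType, and that NullType appears there under the type name "null" (an assumption; some newer Spark releases may report "void" instead). The helper names `null_to_string` and `drop_null_fields`, the sample schema, and the rows-as-dicts representation are hypothetical and only for illustration.

```python
# Type names assumed to denote NullType in a serialized schema (assumption).
NULL_TYPE_NAMES = {"null", "void"}


def is_null_type(data_type):
    # Primitive types are serialized as plain strings; complex types as dicts.
    return isinstance(data_type, str) and data_type in NULL_TYPE_NAMES


def null_to_string(data_type):
    """Case-1: return a copy of the schema with every NullType replaced by "string"."""
    if isinstance(data_type, str):
        return "string" if is_null_type(data_type) else data_type
    kind = data_type.get("type")
    if kind == "struct":
        fields = [{**f, "type": null_to_string(f["type"])} for f in data_type["fields"]]
        return {**data_type, "fields": fields}
    if kind == "array":
        return {**data_type, "elementType": null_to_string(data_type["elementType"])}
    if kind == "map":
        return {**data_type, "valueType": null_to_string(data_type["valueType"])}
    return data_type  # anything else is left unchanged


def drop_null_fields(schema, rows):
    """Case-2: drop top-level NullType fields from the schema AND from every row."""
    kept = [f for f in schema["fields"] if not is_null_type(f["type"])]
    names = [f["name"] for f in kept]
    new_rows = [{name: row.get(name) for name in names} for row in rows]
    return {**schema, "fields": kept}, new_rows


# Hypothetical example data.
schema = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False, "metadata": {}},
    {"name": "note", "type": "null", "nullable": True, "metadata": {}},
]}
rows = [{"id": 1, "note": None}, {"id": 2, "note": None}]

case1_schema = null_to_string(schema)                       # rows stay as they are
case2_schema, case2_rows = drop_null_fields(schema, rows)   # rows are rewritten too
```

In this sketch Case-1 touches only the schema, while Case-2 has to rebuild every row, which is the extra per-record work described above.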

@hvanhovell Please let me know if you know of any other solution.

@hvanhovell
Contributor

My main point is that you are trying to solve an application problem (having null types in your data) in the SQL engine. This might be what you want, but this is very likely to cause a lot of confusion and bugs for other users.

IIUC you have the following dataflow: MongoDB -> RDD -> DataFrame. You can check and enforce a schema when you convert from an RDD to a DataFrame.
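
As a rough illustration of such a check, the sketch below walks a schema and reports every field that still has a null type, so the application can decide what type it should really be before the conversion. It is plain Python over the JSON-style dict form of a schema, not Spark API. The names `find_null_fields` and `require_no_null_types` are hypothetical, the sample schema is made up, and treating "null" and "void" as the serialized names of NullType is an assumption.

```python
# Type names assumed to denote NullType in a serialized schema (assumption).
NULL_TYPE_NAMES = {"null", "void"}


def find_null_fields(data_type, path=""):
    """Return the dotted paths of all fields whose type is NullType."""
    if isinstance(data_type, str):
        return [path] if data_type in NULL_TYPE_NAMES else []
    kind = data_type.get("type")
    found = []
    if kind == "struct":
        for field in data_type["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            found.extend(find_null_fields(field["type"], child))
    elif kind == "array":
        found.extend(find_null_fields(data_type["elementType"], path + "[]"))
    elif kind == "map":
        found.extend(find_null_fields(data_type["valueType"], path + "{}"))
    return found


def require_no_null_types(schema):
    """Fail early, in application code, if any column has no usable type."""
    bad = find_null_fields(schema)
    if bad:
        raise ValueError("columns with NullType: " + ", ".join(bad))
    return schema


# Hypothetical schema inferred from a batch in which some values were always null.
inferred = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False, "metadata": {}},
    {"name": "note", "type": "null", "nullable": True, "metadata": {}},
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "null", "containsNull": True}},
    {"name": "address", "nullable": True, "metadata": {},
     "type": {"type": "struct", "fields": [
         {"name": "zip", "type": "null", "nullable": True, "metadata": {}},
     ]}},
]}

null_paths = find_null_fields(inferred)
```

With the offending paths in hand, the application can supply an explicit type for each of them instead of relying on the engine to pick one.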
