Schema DDL: add Spark Dataframe Schema AST #253

Closed
chuwy opened this issue May 15, 2017 · 7 comments
@chuwy
Contributor

chuwy commented May 15, 2017

In the Analytics SDK we need to provide a precise Dataframe schema for stored data.

Generally, this transformation is similar to what happens in JSON Schema -> SQL DDL, but it should happen at job runtime, where we immediately hit a binary compatibility problem: we'll have to publish Schema DDL linked against every minor Spark version.

I see the API as follows:

val rdd = ???
val resolver = ???  // Iglu resolver
val jsonSchema = resolver.lookup("com.acme/event/jsonschema/1-0-0")  // Actually this is a Validated[JValue], but for simplicity let's assume this is a ready AST
val unstructEventSchema: StructType = generateDataframeSchema(jsonSchema)
val dataframeSchema: StructType = getDataframeSchema(unstructEventSchema) // This merges derived unstruct event schema with predefined POJO schema
val df = spark.createDataFrame(rdd, dataframeSchema)
  • generateDataframeSchema is a function performing transformations like the one below (a rough sketch of a possible implementation follows at the end of this comment):
generateDataframeSchema("""{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": ["null", "integer"] }
  }
}""")

// Result:
StructType(List(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))
  • getDataframeSchema merges the Dataframe schema for the POJO part (app_id, etl_tstamp etc.) with the schemas derived for contexts and unstruct events.

Schemas that are not loaded should not affect schema application.
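
For illustration, here's a rough sketch of how generateDataframeSchema could work, assuming json4s for parsing the JSON Schema; it only handles a few primitive types and derives nullability from a "null" entry in the type union, so it's a starting point rather than a final implementation:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import org.apache.spark.sql.types._

// Sketch only: maps a handful of JSON Schema primitive types onto Spark types.
// A "null" entry in the type union marks the column as nullable.
def generateDataframeSchema(jsonSchema: String): StructType = {
  def typeSet(v: JValue): Set[String] = v match {
    case JString(t) => Set(t)
    case JArray(ts) => ts.collect { case JString(t) => t }.toSet
    case _          => Set.empty
  }

  def toSparkType(types: Set[String]): DataType = (types - "null").toList match {
    case List("string")  => StringType
    case List("integer") => IntegerType
    case List("number")  => DoubleType
    case List("boolean") => BooleanType
    case _               => StringType // fallback; objects, arrays and unions need real handling
  }

  val fields = (parse(jsonSchema) \ "properties") match {
    case JObject(props) => props.map { case (name, prop) =>
      val ts = typeSet(prop \ "type")
      StructField(name, toSparkType(ts), nullable = ts.contains("null"))
    }
    case _ => Nil
  }

  StructType(fields)
}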

@alexanderdean
Member

Any feedback on this @rolanddb?

@rolanddb

This looks good.
The starting point here is an RDD; would it work to just use spark.read.csv on the raw TSV and then apply the APIs above to arrive at a DataFrame with the correct schema?

getDataframeSchema merges Dataframe schema for POJO (app_id, etl_tstamp etc) with schemas derived for contexts and unstruct events.

How will the method know which contexts to include in the schema? Do we specify the custom contexts somewhere (in a config file)? Or does it include all contexts in the git repository?

There may be situations where the Snowplow core schema changes (I've seen a field renamed between two releases). In this case it would be nice if the API allowed the user to specify the schema version, so both old and new data can be processed. (The user would be responsible for keeping track of the schema version of the raw data.)

@chuwy
Contributor Author

chuwy commented May 16, 2017

Hey @rolanddb,

I believe there's no way to automatically determine all possible contexts in a dataframe, except by deriving them through Spark, but I also don't think we need to do this automatically - a manual approach would work even better.

We can add contexts through another argument to getDataframeSchema: its first argument would accept the unstruct event schema (actually multiple schemas, as we have many unstruct events in a dataset) and its second the schemas for contexts. So the analyst defines a list of schemas that she/he is aware of in the dataset and builds queries against those specific schemas.

And I agree with you - we need a way to specify the Snowplow POJO schema. Currently we support only the latest R73 format of events, but this will change, and getDataframeSchema should be able to build schemas for both older and newer Snowplow releases, so I think it could be a third argument to the function:

// SnowplowTsv is a simple enum of all Snowplow enriched TSV formats known at the moment of release
def getDataframeSchema(unstructEventSchemas: List[StructType], contextSchemas: List[StructType], tsvFormat: SnowplowTsv): StructType = ???
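
And to make it concrete, a minimal sketch of how that merging could look - the field names, the SnowplowTsv values and the column naming for derived schemas are all placeholders here, not a final design:

import org.apache.spark.sql.types._

// Placeholder enum; the real SnowplowTsv would list every known enriched TSV format
sealed trait SnowplowTsv
case object R73Tsv extends SnowplowTsv

def getDataframeSchema(unstructEventSchemas: List[StructType],
                       contextSchemas: List[StructType],
                       tsvFormat: SnowplowTsv): StructType = {
  // Fixed POJO columns for the chosen TSV format (only two shown for brevity)
  val pojoFields = tsvFormat match {
    case R73Tsv => List(
      StructField("app_id", StringType, nullable = true),
      StructField("etl_tstamp", TimestampType, nullable = true)
      // ... remaining POJO fields for this format
    )
  }

  // Derived schemas become nested nullable columns, so schemas not present
  // in the dataset simply stay null and don't affect schema application
  val unstructFields = unstructEventSchemas.zipWithIndex.map { case (s, i) =>
    StructField(s"unstruct_event_$i", s, nullable = true)
  }
  val contextFields = contextSchemas.zipWithIndex.map { case (s, i) =>
    StructField(s"contexts_$i", s, nullable = true)
  }

  StructType(pojoFields ++ unstructFields ++ contextFields)
}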

@alexanderdean
Member

I don't think a function with this signature ^^ belongs in snowplow/iglu? It's clearly Snowplow-specific...

@chuwy
Contributor Author

chuwy commented May 16, 2017

@alexanderdean yes, you're definitely right. I was puzzled about how to split functionality between Iglu and the Analytics SDK. Now I think that generateDataframeSchema(schema: JsonSchema): StructType should belong to Schema DDL and the above function should belong to the Analytics SDK.

@alexanderdean
Member

SGTM!

@aldemirenes

Migrated to snowplow/schema-ddl#23
