Schema DDL: add Spark Dataframe Schema AST #253

Closed
chuwy opened this issue May 15, 2017 · 7 comments
@chuwy
Contributor

chuwy commented May 15, 2017

In the Analytics SDK we need to provide a precise Dataframe schema for stored data.

Generally, this transformation is similar to what happens in JSON Schema -> SQL DDL, but it should happen at job runtime, where we immediately hit a binary compatibility problem: we'll have to publish Schema DDL linked against every minor Spark version.

I see the API as follows:

val rdd = ???
val resolver = ???  // Iglu resolver
val jsonSchema = resolver.lookup("com.acme/event/jsonschema/1-0-0")  // Actually this is a Validated[JValue], but for simplicity let's assume this is a ready AST
val unstructEventSchema: StructType = generateDataframeSchema(jsonSchema)
val dataframeSchema: StructType = getDataframeSchema(unstructEventSchema) // This merges derived unstruct event schema with predefined POJO schema
val df = spark.createDataFrame(rdd, dataframeSchema)
  • generateDataframeSchema is a function performing transformations like the one below (a rough sketch of a possible implementation follows at the end of this comment):
generateDataframeSchema("""{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": ["null", "integer"] }
  }
}""")

// Result:
StructType(List(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))
  • getDataframeSchema merges the Dataframe schema for the POJO part (app_id, etl_tstamp etc.) with the schemas derived for contexts and unstruct events.

Schemas that are not loaded should not affect schema application.
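
For illustration, here's a rough sketch of how generateDataframeSchema could work, assuming json4s for parsing the JSON Schema; it only handles a few primitive types and derives nullability from a "null" entry in the type union, so it's a starting point rather than a final implementation:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import org.apache.spark.sql.types._

// Sketch only: maps a handful of JSON Schema primitive types onto Spark types.
// A "null" entry in the type union marks the column as nullable.
def generateDataframeSchema(jsonSchema: String): StructType = {
  def typeSet(v: JValue): Set[String] = v match {
    case JString(t) => Set(t)
    case JArray(ts) => ts.collect { case JString(t) => t }.toSet
    case _          => Set.empty
  }

  def toSparkType(types: Set[String]): DataType = (types - "null").toList match {
    case List("string")  => StringType
    case List("integer") => IntegerType
    case List("number")  => DoubleType
    case List("boolean") => BooleanType
    case _               => StringType // fallback; objects, arrays and unions need real handling
  }

  val fields = (parse(jsonSchema) \ "properties") match {
    case JObject(props) => props.map { case (name, prop) =>
      val ts = typeSet(prop \ "type")
      StructField(name, toSparkType(ts), nullable = ts.contains("null"))
    }
    case _ => Nil
  }

  StructType(fields)
}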

@alexanderdean
Member

Any feedback on this @rolanddb?

@rolanddb

This looks good.
The starting point here is an RDD; would it work to just use spark.read.csv on the raw TSV and then apply the APIs above to arrive at a DataFrame with the correct schema?

getDataframeSchema merges Dataframe schema for POJO (app_id, etl_tstamp etc) with schemas derived for contexts and unstruct events.

How will the method know which contexts to include in the schema? Do we specify the custom contexts somewhere (in a config file)? Or does it include all contexts in the git repository?

There may be situations where the Snowplow core schema changes (I've seen a field renamed between two releases). In this case it would be nice if the API allowed the user to specify the schema version, so both old and new data can be processed. (The user would be responsible for keeping track of the schema version of the raw data.)

@chuwy
Contributor Author

chuwy commented May 16, 2017

Hey @rolanddb,

I believe there's no way to automatically determine all possible contexts in a dataframe, except by deriving them through Spark, but I also don't think we need to do this automatically - a manual approach would work even better.

We can add contexts through another argument to getDataframeSchema: its first argument would accept the unstruct event schema (actually multiple schemas, as we have many unstruct events in a dataset) and its second the schemas for contexts. So the analyst defines a list of schemas that she/he is aware of in the dataset and builds queries against those specific schemas.

And I agree with you - we need a way to specify the Snowplow POJO schema. Currently we support only the latest R73 format of events, but this will change, and getDataframeSchema should be able to build schemas for both older and newer Snowplow releases, so I think it could be a third argument to the function:

// SnowplowTsv is a simple enum of all Snowplow enriched TSV formats known at the moment of release
def getDataframeSchema(unstructEventSchemas: List[StructType], contextSchemas: List[StructType], tsvFormat: SnowplowTsv): StructType = ???
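
And to make it concrete, a minimal sketch of how that merging could look - the field names, the SnowplowTsv values and the column naming for derived schemas are all placeholders here, not a final design:

import org.apache.spark.sql.types._

// Placeholder enum; the real SnowplowTsv would list every known enriched TSV format
sealed trait SnowplowTsv
case object R73Tsv extends SnowplowTsv

def getDataframeSchema(unstructEventSchemas: List[StructType],
                       contextSchemas: List[StructType],
                       tsvFormat: SnowplowTsv): StructType = {
  // Fixed POJO columns for the chosen TSV format (only two shown for brevity)
  val pojoFields = tsvFormat match {
    case R73Tsv => List(
      StructField("app_id", StringType, nullable = true),
      StructField("etl_tstamp", TimestampType, nullable = true)
      // ... remaining POJO fields for this format
    )
  }

  // Derived schemas become nested nullable columns, so schemas not present
  // in the dataset simply stay null and don't affect schema application
  val unstructFields = unstructEventSchemas.zipWithIndex.map { case (s, i) =>
    StructField(s"unstruct_event_$i", s, nullable = true)
  }
  val contextFields = contextSchemas.zipWithIndex.map { case (s, i) =>
    StructField(s"contexts_$i", s, nullable = true)
  }

  StructType(pojoFields ++ unstructFields ++ contextFields)
}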

@alexanderdean
Member

I don't think a function with this signature ^^ belongs in snowplow/iglu? It's clearly Snowplow-specific...

@chuwy
Contributor Author

chuwy commented May 16, 2017

@alexanderdean yes, you're definitely right. I was puzzled about how to split functionality between Iglu and the Analytics SDK. Now I think that generateDataframeSchema(schema: JsonSchema): StructType should belong to Schema DDL and the above function should belong to the Analytics SDK.

@alexanderdean
Member

SGTM!

@aldemirenes

Migrated to snowplow/schema-ddl#23
