Schema DDL: add Spark Dataframe Schema AST #253
Comments
Any feedback on this @rolanddb?
This looks good.
How will the method know which contexts to include in the schema? Do we specify the custom contexts somewhere (in a config file)? Or does it include all contexts in the git repository? There may be situations where the Snowplow core schema is changed (I've seen a field renamed between two releases). In this case it would be nice if the API allowed the user to specify the schema version, so you can process both old and new data. (The user would be responsible for keeping track of the schema version of the raw data.)
Hey @rolanddb, I believe there's no way to automatically determine all possible contexts in a dataframe, except deriving them through Spark, but I also don't think we need to do this automatically - a manual way would work even better. We can add contexts through another argument in `getDataframeSchema`. And then I agree with you - we need a way to specify the Snowplow POJO schema. Currently we support only the latest R73 format of events, but this will be changed:

```scala
// SnowplowTsv is a simple enum with all Snowplow enriched TSV formats known at the moment of release
def getDataframeSchema(unstructEventSchemas: List[StructType], contextSchemas: List[StructType], tsvFormat: SnowplowTsv): StructType = ???
```
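For illustration, here is a minimal sketch of how such a merge could look with Spark's `StructType`/`StructField` API. The `SnowplowTsv` values, the POJO field list, and the column naming are placeholders I'm assuming for the example, not the actual implementation:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Hypothetical sketch only: the SnowplowTsv enum, field list and column
// naming below are assumptions, not the final Schema DDL / Analytics SDK API.
sealed trait SnowplowTsv
case object R73Tsv extends SnowplowTsv

def getDataframeSchema(unstructEventSchemas: List[StructType],
                       contextSchemas: List[StructType],
                       tsvFormat: SnowplowTsv): StructType = {
  // Fixed POJO columns shared by every enriched event (subset shown for brevity);
  // tsvFormat would select the full column list for the given TSV release.
  val pojoFields = List(
    StructField("app_id", StringType, nullable = true),
    StructField("etl_tstamp", TimestampType, nullable = true)
  )
  // Every context / unstruct event schema becomes a nullable nested struct column,
  // so events that lack it still conform to the merged schema.
  val contextFields = contextSchemas.zipWithIndex.map { case (schema, i) =>
    StructField(s"contexts_$i", schema, nullable = true) // real names would come from the Iglu schema key
  }
  val unstructFields = unstructEventSchemas.zipWithIndex.map { case (schema, i) =>
    StructField(s"unstruct_event_$i", schema, nullable = true)
  }
  StructType(pojoFields ++ contextFields ++ unstructFields)
}
```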
I don't think a function with this signature ^^ belongs in snowplow/iglu? It's clearly Snowplow-specific...
@alexanderdean yes, you're definitely right. I was puzzled about how to split functionality between Iglu and the Analytics SDK. Now I think that
SGTM!
Migrated to snowplow/schema-ddl#23 |
In the Analytics SDK we need to provide a precise Dataframe schema for stored data.
Generally, this transformation is similar to what happens in JSON Schema -> SQL DDL, but it needs to happen at job runtime, where we immediately hit a binary compatibility problem: we would have to publish Schema DDL linked against every minor Spark version.
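One common mitigation (a sketch of standard sbt practice, not necessarily how Schema DDL would be published) is to declare the Spark dependency as `Provided`, so the published artifact doesn't bundle a concrete Spark version and the job supplies it at runtime; this does not fully remove binary-compatibility concerns between Spark minor versions, though.

```scala
// build.sbt sketch (version number is an assumption): compile against spark-sql
// but let the runtime cluster provide it, instead of pinning a Spark minor version.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % Provided
```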
I see the API as follows:
- `generateDataframeSchema` is a function performing transformations like JSON Schema -> Dataframe schema (see the sketch after this list).
- `getDataframeSchema` merges the Dataframe schema for POJO fields (`app_id`, `etl_tstamp`, etc.) with the schemas derived for contexts and unstruct events. Schemas that are not loaded should not affect schema application.
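For illustration, a minimal sketch of the kind of transformation `generateDataframeSchema` could perform, assuming a simplified flat JSON Schema represented as a property-to-type map; the signature and body here are assumptions for the example, not the proposed API:

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch of the generateDataframeSchema side: map a flat JSON Schema
// (represented here as property name -> JSON Schema "type" string) onto a Spark StructType.
// A real implementation would walk the full JSON Schema AST and handle nested objects,
// arrays, formats and enums.
def generateDataframeSchema(properties: Map[String, String]): StructType =
  StructType(properties.toList.map { case (name, jsonType) =>
    val dataType = jsonType match {
      case "string"  => StringType
      case "integer" => LongType
      case "number"  => DoubleType
      case "boolean" => BooleanType
      case _         => StringType // fall back to string for unsupported types
    }
    // nullable = true because JSON Schema properties are optional unless required
    StructField(name, dataType, nullable = true)
  })

// e.g. generateDataframeSchema(Map("latitude" -> "number", "longitude" -> "number"))
```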