Filtering SCHEMA Messages Via Field Selection? #11

dmosorast · 2021-09-15T19:13:59Z

We've gotten a PR on tap-stripe about filtering the SCHEMA messages based on field selection so that targets don't have to create null columns if the columns are not selected.

This makes sense, but I'm not sure it's a standard (the singer-io/getting-started repository doesn't say anything about this when it discusses field selection). Given the general lack of guidance on how to build targets, this doesn't really help that space either.

As part of a little quest I've started to aggregate "Standards" (things that live in the library/best-practices space above the strict "Singer Messaging Specification"), I've added this topic to that initiative's issue-tracking comment.

Since I'm less of a target developer, I thought it'd be good to break open this space and start up a discussion thread here to get everyone else's perspective on the practice. Guiding questions:

- Should a target have a recommended best practice to handle schema messages that don't have record values associated?
- Should the tap be responsible for emitting schema messages with only selected fields?
- Did I forget a really important edge case here?

aaronsteers · 2021-09-15T23:12:33Z

@dmosorast - Thanks for starting this discussion! I agree we should have standard or best-practice recommendations here. I have always found it confusing to have columns created in the target when those columns are deselected. In some cases, this can even raise security red flags: for instance, if you explicitly want to remove PII and you end up with a target table that nevertheless contains those excluded columns. An auditor could of course query the table to see there are no values there, but this introduces trust questions and severely limits the confidence gained from a cursory review of table columns. (Speaking from actual past experience here.)

> Should a target have a recommended best practice to handle schema messages that don't have record values associated?

I don't think this is feasible, since it requires prior knowledge of all the records' values at the point when the table is created, which is before any records have been emitted.

> Should the tap be responsible for emitting schema messages with only selected fields?

I vote yes - at least as a suggested best practice. (The SDK actually does this by default now, which means the developer and user can expect this filtering automatically if the tap is built on the SDK.)
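To make the tap-side option concrete, here is a minimal sketch of filtering a stream's schema before emitting the SCHEMA message. It assumes the catalog metadata uses the usual Singer shape (entries with a `breadcrumb` and a `metadata` dict carrying `selected` and `inclusion`), and it only handles top-level properties. The function names `selected_fields`, `filter_schema`, and `schema_message` are made up for illustration; this is not how the SDK or any particular tap implements it.

```python
def selected_fields(metadata):
    """Return the names of top-level properties that should be emitted.

    Assumption: a field is kept if its inclusion is "automatic" or it is
    explicitly selected; "unsupported" fields are always dropped.
    """
    keep = set()
    for entry in metadata:
        breadcrumb = list(entry.get("breadcrumb", []))
        # Only look at top-level fields: ["properties", "<field name>"]
        if len(breadcrumb) != 2 or breadcrumb[0] != "properties":
            continue
        md = entry.get("metadata", {})
        if md.get("inclusion") == "unsupported":
            continue
        if md.get("inclusion") == "automatic" or md.get("selected") is True:
            keep.add(breadcrumb[1])
    return keep


def filter_schema(schema, metadata):
    """Return a copy of the schema that contains only the selected properties."""
    keep = selected_fields(metadata)
    filtered = dict(schema)
    filtered["properties"] = {
        name: prop
        for name, prop in schema.get("properties", {}).items()
        if name in keep
    }
    if "required" in schema:
        filtered["required"] = [name for name in schema["required"] if name in keep]
    return filtered


def schema_message(stream, schema, metadata, key_properties):
    """Build a SCHEMA message dict whose schema reflects field selection."""
    return {
        "type": "SCHEMA",
        "stream": stream,
        "schema": filter_schema(schema, metadata),
        "key_properties": key_properties,
    }


# Hypothetical stream: "email" is deselected, so it is left out of the schema.
example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": ["null", "string"]},
        "name": {"type": ["null", "string"]},
    },
}
example_metadata = [
    {"breadcrumb": ["properties", "id"], "metadata": {"inclusion": "automatic"}},
    {"breadcrumb": ["properties", "email"], "metadata": {"inclusion": "available", "selected": False}},
    {"breadcrumb": ["properties", "name"], "metadata": {"inclusion": "available", "selected": True}},
]
example_message = schema_message("customers", example_schema, example_metadata, ["id"])
```

With this approach a target that builds its table directly from the SCHEMA message never sees the deselected `email` field, which addresses the PII concern above without any change on the target side.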

dmosorast · 2021-09-17T13:54:18Z

Thanks AJ, that's a good point on the PII and trust issue. I'm also leaning yes as a Best Practice. I think the thing that threw me for a loop is that I didn't realize the targets were building the DDL right out of the schema message; I thought they'd be waiting until they got data and translating it into DDL at that time, with the matching schema as guidance.

However, as far as code complexity goes, it does make sense to directly translate the schema to a DDL statement in the target of choice, and filtering the schema wouldn't inhibit the other, more reactive approach to DDL generation.
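For anyone else who hadn't pictured the direct translation, here is a rough sketch of a target turning a SCHEMA message straight into a `CREATE TABLE` statement. The type mapping and the generic SQL type names are assumptions for illustration, and `column_type` and `schema_to_ddl` are hypothetical helpers, not taken from any existing target.

```python
# Assumed mapping from JSON Schema types to generic SQL types.
JSON_TO_SQL = {
    "integer": "BIGINT",
    "number": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "string": "TEXT",
}


def column_type(prop):
    """Pick a SQL type for one JSON Schema property."""
    types = prop.get("type", "string")
    if isinstance(types, str):
        types = [types]
    non_null = [t for t in types if t != "null"]
    if "string" in non_null and prop.get("format") == "date-time":
        return "TIMESTAMP"
    if not non_null:
        return "TEXT"
    # Anything unmapped (objects, arrays) falls back to TEXT in this sketch.
    return JSON_TO_SQL.get(non_null[0], "TEXT")


def schema_to_ddl(message):
    """Translate a SCHEMA message dict into a CREATE TABLE statement."""
    table = message["stream"]
    columns = [
        '"{}" {}'.format(name, column_type(prop))
        for name, prop in message["schema"]["properties"].items()
    ]
    keys = message.get("key_properties") or []
    if keys:
        quoted = ", ".join('"{}"'.format(k) for k in keys)
        columns.append("PRIMARY KEY ({})".format(quoted))
    return 'CREATE TABLE "{}" ({})'.format(table, ", ".join(columns))


# A SCHEMA message that the tap has already filtered by field selection.
filtered_message = {
    "type": "SCHEMA",
    "stream": "customers",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "created": {"type": ["null", "string"], "format": "date-time"},
        },
    },
    "key_properties": ["id"],
}
ddl = schema_to_ddl(filtered_message)
print(ddl)
```

Every property in the message becomes a column here, so whatever the tap leaves in the schema ends up in the table, whether or not any record ever carries a value for it.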
