Selection of (nested) subset of fields of streams to sync (e.g. avoid PII) #1815
Comments
Yes, this might be a duplicate of #886, as we don't handle nested streams very well yet.
I'd disagree that this is a dupe. While that ticket is about post-processing (e.g., likely about generically extracting nested data into separate tables), this ticket is about [...]
That is, not: [...]
It might make sense to focus this ticket on the underlying PII processing issue, as for non-sensitive data, filtering normalized data is likely easier and preferable.
Yes, we have multiple issues at the moment. This hasn't been fixed yet because it requires changes in multiple places in Airbyte...
The catalog is the metadata fetched by the source so the UI can do the [...] Once the data is filtered into subsets and sent by the source as a raw JSON blob, which the destination persists in raw tables, we'll have to adapt the post-processing normalization that extracts it into different tables.
Right, #1315 could be seen as the general parent of this issue, which would then focus on the PII question.
100% on board with this issue. The guarantee that we want to offer is that: the unsafe data NEVER makes it to the destination. There are two changes that will make it possible:
The transformation piece would likely happen at the worker level, while the selection piece should be at the source level (with the worker as a safety net). WDYT?
Yeah. The worker may need to be more than a safety net though, as a source may not let you pull data of interest without also returning data you want to avoid. As a super-specific example, in the Shopify REST API, you can configure send/no-send for top-level fields, but not for deeper ones. So e.g. to get [...] In other words, the worker may need to pull more than necessary and filter before persisting to disk. Perhaps this is an edgy corner case, if there are on-disk caches :)
Issue was linked to Harvestr Discovery: Hashing PII fields
Apologies if this is a dupe or already possible; would close!
Tell us about the problem you're trying to solve
Some sources have APIs that (by default) include PII in their response, which one might want to avoid or minimize pulling at all. Either just to keep the data pull minimal (this goes against EL-T, I know I know 😅, but PII and legal are different...), or to respect data processor agreements that may be in place.
The API responses can be quite nested, for instance a "customer" object may have an array of "addresses". One might want a numerical "customer ID", but not "email", and the "country" in each address but not "street", etc. etc.
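To make the request concrete, here is a minimal sketch of what such a nested selection could look like. The selection format (a nested dict where `True` keeps a field and a dict descends into an object or into each element of an array) and the `select_fields` helper are hypothetical illustrations, not an existing Airbyte or connector API.

```python
from typing import Any

# Hypothetical selection spec: True keeps a field as-is, a dict descends
# into a nested object (or into every element of an array of objects).
SELECTION = {
    "id": True,
    "addresses": {"country": True},
}


def select_fields(record: Any, selection: Any) -> Any:
    """Return a copy of `record` that contains only the selected fields."""
    if selection is True:
        return record
    if isinstance(record, list):
        # Apply the same selection to every element of the array.
        return [select_fields(item, selection) for item in record]
    if isinstance(record, dict):
        return {
            key: select_fields(record[key], sub)
            for key, sub in selection.items()
            if key in record
        }
    # A scalar where an object was expected: replace it with None
    # rather than risk passing unselected data through.
    return None


customer = {
    "id": 42,
    "email": "jane@example.com",
    "addresses": [
        {"street": "1 Main St", "country": "DE"},
        {"street": "2 Side Rd", "country": "FR"},
    ],
}

filtered = select_fields(customer, SELECTION)
print(filtered)
# {'id': 42, 'addresses': [{'country': 'DE'}, {'country': 'FR'}]}
```

The same function could run in the source (if the API cannot filter) or in the worker before anything is persisted; where it runs is exactly the open design question.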
Related but orthogonal: #1758 (pull sensitive fields, but transform them in-flight, so they do not land on disk -- even though hashed PII can still be PII, data processor agreements can make exceptions for such cases if they are unavoidable).
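For contrast with plain deselection, a rough sketch of the in-flight transformation idea follows. Using a keyed hash (HMAC-SHA-256) is an assumption of this sketch, not something #1758 specifies, and `HASH_KEY`, `hash_value`, and `hash_fields` are hypothetical names. As noted above, the output may still count as PII.

```python
import hashlib
import hmac

# Hypothetical key; a real deployment would load this from a secret store.
HASH_KEY = b"replace-with-a-real-secret"


def hash_value(value: str, key: bytes = HASH_KEY) -> str:
    """Return the hex HMAC-SHA-256 digest of `value`."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_fields(record: dict, fields: set) -> dict:
    """Return a copy of `record` with the named top-level fields hashed."""
    return {
        key: hash_value(str(value)) if key in fields and value is not None else value
        for key, value in record.items()
    }


hashed = hash_fields({"id": 42, "email": "jane@example.com"}, {"email"})
print(hashed["id"], len(hashed["email"]))
```

Because the hash is deterministic for a given key, the hashed column can still be used for joins and deduplication downstream without the raw value ever being written to disk.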
Describe the solution you’d like
If I understand correctly, this can already be configured e.g. in Singer tap catalogs; the task is to visualize and sync those settings.
Likely GraphQL APIs would tend to allow full configuration, while REST APIs may only in some cases allow selecting which fields to return.
Potentially, connectors themselves would classify each field as "is/contains PII" yes/no; this would allow a global tick box "deselect all PII".
Your enterprise edition might go beyond selection and add enforcement (so only specific users can change settings).
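The "deselect all PII" tick box could be sketched roughly as follows, assuming connectors annotate their stream schema with a per-field flag. The `pii` key and the `non_pii_selection` helper are hypothetical; nothing like this is claimed to exist in current connector schemas. The result is a selection tree in which `True` keeps a field and a nested dict descends into an object or array of objects.

```python
# Hypothetical JSON-Schema-like stream schema with an assumed "pii" flag.
SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string", "pii": True},
        "addresses": {
            "type": "array",
            "items": {
                "properties": {
                    "street": {"type": "string", "pii": True},
                    "country": {"type": "string"},
                }
            },
        },
    }
}


def non_pii_selection(schema: dict) -> dict:
    """Build a selection tree that keeps every field not flagged as PII."""
    selection = {}
    for name, spec in schema.get("properties", {}).items():
        if spec.get("pii"):
            continue  # flagged field: leave it out of the selection
        # For arrays, look at the element schema; otherwise at the field itself.
        nested = spec.get("items", spec)
        if "properties" in nested:
            selection[name] = non_pii_selection(nested)
        else:
            selection[name] = True
    return selection


default_selection = non_pii_selection(SCHEMA)
print(default_selection)
# {'id': True, 'addresses': {'country': True}}
```

A UI could use such a tree as the default state of its field checkboxes, and an enforcement layer could reject any saved selection that re-enables a flagged field.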
Describe the alternative you’ve considered or used