
Selection of (nested) subset of fields of streams to sync (e.g. avoid PII) #1815

Open

ghost opened this issue Jan 25, 2021 · 8 comments

ghost commented Jan 25, 2021
Apologies if this is a dupe or already possible; if so, feel free to close!

Tell us about the problem you're trying to solve

Some sources have APIs that (by default) include PII in their responses, which one might want to avoid pulling at all, or at least minimize. This could be just to keep the data pull minimal (this goes against EL-T, I know I know 😅, but PII and legal are different...), or to respect data processor agreements that may be in place.

The API responses can be quite nested; for instance, a "customer" object may have an array of "addresses". One might want a numerical "customer ID" but not "email", and the "country" in each address but not "street", and so on.
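To make the desired selection semantics concrete, here is a minimal sketch (not an actual Airbyte or Singer API; the function and field names are illustrative) of keeping only an allow-listed subset of a nested record:

```python
# Illustrative sketch: keep only allow-listed fields of a nested record,
# recursing into dicts and lists of dicts. `True` in the spec means
# "keep the whole value"; a nested dict means "recurse and sub-select".

def select_fields(record, allowed):
    """Return a copy of `record` containing only the fields in `allowed`."""
    if isinstance(record, list):
        return [select_fields(item, allowed) for item in record]
    out = {}
    for key, spec in allowed.items():
        if key not in record:
            continue
        value = record[key]
        out[key] = value if spec is True else select_fields(value, spec)
    return out

customer = {
    "id": 42,
    "email": "jane@example.com",                      # PII: to be excluded
    "addresses": [
        {"country": "DE", "street": "Hauptstr. 1"},   # street is PII
        {"country": "FR", "street": "Rue X 2"},
    ],
}

allowed = {"id": True, "addresses": {"country": True}}
print(select_fields(customer, allowed))
# {'id': 42, 'addresses': [{'country': 'DE'}, {'country': 'FR'}]}
```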

Related but orthogonal: #1758 (pull sensitive fields, but transform them in-flight, so they do not land on disk -- even though hashed PII can still be PII, data processor agreements can make exceptions for such cases if they are unavoidable).

Describe the solution you’d like

  • The data catalog gains knowledge (if it's not already there) of which fields are technically mandatory or optional to sync for each API
  • The data source connection settings page has tick boxes for each stream field, to include it or not
  • Nested stream fields (arrays/maps) can be unfolded to reveal inner fields, which can again be selected or not via tick boxes

If I understand correctly, this can already be configured, e.g., in Singer tap catalogs; the task is then to visualize these settings and apply them during the sync.
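For reference, Singer taps express per-field selection via catalog metadata entries keyed by breadcrumb. A hedged excerpt (stream and field names illustrative; how far individual taps honor selection on deeply nested breadcrumbs varies) might look like:

```json
{
  "streams": [
    {
      "stream": "customers",
      "tap_stream_id": "customers",
      "metadata": [
        { "breadcrumb": [], "metadata": { "selected": true } },
        { "breadcrumb": ["properties", "id"],
          "metadata": { "inclusion": "automatic" } },
        { "breadcrumb": ["properties", "email"],
          "metadata": { "inclusion": "available", "selected": false } }
      ]
    }
  ]
}
```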

GraphQL APIs would likely tend to allow full configuration, while REST APIs may allow selecting which fields to return only in some cases.

Potentially, connectors themselves could classify each field as "is/contains PII" yes/no; this would allow a global "deselect all PII" tick box.

Your enterprise edition might go beyond selection and add enforcement (so that only specific users can change these settings).

Describe the alternative you’ve considered or used

  • Pull all the data and filter in a dbt step: this would go against some data processor agreements and open the door to abuse or mistakes
  • Configure the source manually in the container / work with forked source connectors: a workaround for self-hosted only; needs to be repeated for every data source

Issue is synchronized with this Asana task by Unito

ghost added the type/enhancement (New feature or request) label Jan 25, 2021
ChristopheDuong (Contributor) commented Jan 25, 2021

Yes, this might be a duplicate of #886, as we don't handle nested streams very well yet.

ghost (Author) commented Jan 25, 2021

I'd disagree that this is a dupe. While that ticket is about post-processing (e.g., likely about generically extracting nested data into separate tables), this ticket is about:

  • sources declaring their data fully
  • fetching only subsets of the data

That is, not: fetch > normalize > filter, but fetch-subset directly.

It might make sense to focus this ticket on the underlying PII processing issue, as for non-sensitive data, filtering normalized data is likely easier and preferable.

ChristopheDuong (Contributor) commented Jan 25, 2021

Yes, we have multiple open issues at the moment. This hasn't been fixed yet because it requires changes in multiple places in Airbyte:

  • The frontend truncates the nesting in the catalog, so the correct catalog is not persisted or sent to the worker. (The catalog is the metadata fetched from the source that would let the UI offer the fetch-subset selection you describe.)
  • Once the data is filtered into subsets and sent by the source as a raw JSON blob, which the destination persists in raw tables, we'll have to adapt the post-processing normalization that extracts it into separate tables.

ChristopheDuong (Contributor) commented
This is also reported in #1315. For the moment, I've been trying to link all these issues to a common one, #886, to gather the different use cases.

ghost (Author) commented Jan 25, 2021

Right, #1315 could be seen as a general parent of this issue, which would then focus on the PII question.

michel-tricot (Contributor) commented

100% on board with this issue.

The guarantee that we want to offer is: unsafe data NEVER makes it to the destination.

There are two changes that will make it possible:

  1. Allow hashing or dropping PII from source connectors (#1758), in case you want to keep a pseudo-anonymized version of the value
  2. Allow the selection of fields in the connection

The transformation piece would likely happen at the worker level, while the selection piece should be at the source level (using the worker as a safety net).

WDYT?
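The hashing half (point 1) could look roughly like the following sketch. This is not Airbyte's actual implementation; the function name and the idea of a per-connection salt are illustrative assumptions. Note also, as raised earlier in the thread, that hashed PII can still legally be personal data:

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Replace a PII value with a salted SHA-256 digest.

    The salt (illustratively per-connection) only hinders trivial
    reversal via rainbow tables; hashed PII can still count as PII.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

record = {"id": 42, "email": "jane@example.com"}
record["email"] = pseudonymize(record["email"], salt="connection-123")
```

The same email always hashes to the same digest for a given salt, so joins across tables within one connection keep working.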

ghost (Author) commented Jan 27, 2021

Yeah. The worker may need to be more than a safety net, though, as a source may not let you pull the data of interest without also including data you want to avoid. As a super-specific example, in the Shopify REST API you can configure top-level fields to be sent or not, but nothing deeper. So, e.g., to get order.customer.id, you need to pull all of order.customer, which also contains order.customer.email.

In other words, the worker may need to pull more than necessary, and filter before persisting to disk.
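A hedged sketch of that worker-side pruning for the Shopify example (the fetch itself is elided; `prune_order` is an illustrative helper, not an existing Airbyte function). The request could restrict top-level keys via the `fields` query parameter, e.g. `?fields=id,customer`, but `customer` still arrives whole, so the worker prunes it before anything touches disk:

```python
# Illustrative worker-side pruning: the Shopify REST API's `fields`
# parameter drops whole top-level keys only, so nested PII inside
# `customer` must be removed client-side before persisting.

def prune_order(order: dict) -> dict:
    """Keep order.id and customer.id; drop nested PII like customer.email."""
    customer = order.get("customer") or {}
    return {"id": order.get("id"), "customer": {"id": customer.get("id")}}

fetched = {  # what the API returns despite fields=id,customer
    "id": 1001,
    "customer": {"id": 42, "email": "jane@example.com"},
}
print(prune_order(fetched))
# {'id': 1001, 'customer': {'id': 42}}
```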

Perhaps this is an edge case, if there are on-disk caches :)

misteryeo (Contributor) commented
Issue was linked to Harvestr Discovery: Hashing PII fields

Projects: none yet
Development: no branches or pull requests
7 participants