
Selection of (nested) subset of fields of streams to sync (e.g. avoid PII) #1815

Open

ghost opened this issue Jan 25, 2021 · 8 comments

ghost commented Jan 25, 2021
Apologies if this is a dupe or already possible; if so, feel free to close!

Tell us about the problem you're trying to solve

Some sources have APIs that (by default) include PII in their responses, which one might want to avoid pulling at all, or at least minimize. This could be just to keep the data pull minimal (this goes against EL-T, I know I know 😅, but PII and legal are different...), or to respect data processor agreements that may be in place.

The API responses can be quite nested; for instance, a "customer" object may have an array of "addresses". One might want a numerical "customer ID" but not "email", and the "country" in each address but not "street", and so on.
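To make the desired selection semantics concrete, here is a minimal sketch (not an actual Airbyte or Singer API; the function and field names are illustrative) of keeping only an allow-listed subset of a nested record:

```python
# Illustrative sketch: keep only allow-listed fields of a nested record,
# recursing into dicts and lists of dicts. `True` in the spec means
# "keep the whole value"; a nested dict means "recurse and sub-select".

def select_fields(record, allowed):
    """Return a copy of `record` containing only the fields in `allowed`."""
    if isinstance(record, list):
        return [select_fields(item, allowed) for item in record]
    out = {}
    for key, spec in allowed.items():
        if key not in record:
            continue
        value = record[key]
        out[key] = value if spec is True else select_fields(value, spec)
    return out

customer = {
    "id": 42,
    "email": "jane@example.com",                      # PII: to be excluded
    "addresses": [
        {"country": "DE", "street": "Hauptstr. 1"},   # street is PII
        {"country": "FR", "street": "Rue X 2"},
    ],
}

allowed = {"id": True, "addresses": {"country": True}}
print(select_fields(customer, allowed))
# {'id': 42, 'addresses': [{'country': 'DE'}, {'country': 'FR'}]}
```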

Related but orthogonal: #1758 (pull sensitive fields, but transform them in-flight, so they do not land on disk -- even though hashed PII can still be PII, data processor agreements can make exceptions for such cases if they are unavoidable).

Describe the solution you’d like

  • The data catalog gains knowledge (if it's not already there) of which fields are technically mandatory or optional to sync for each API
  • The data source connection settings page has tick boxes for each stream field, to include it or not
  • Nested stream fields (arrays/maps) can be unfolded to reveal inner fields, which can again be selected or not via tick boxes

If I understand correctly, this can already be configured, e.g., in Singer tap catalogs; the task is then to visualize these settings and apply them during the sync.
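For reference, Singer taps express per-field selection via catalog metadata entries keyed by breadcrumb. A hedged excerpt (stream and field names illustrative; how far individual taps honor selection on deeply nested breadcrumbs varies) might look like:

```json
{
  "streams": [
    {
      "stream": "customers",
      "tap_stream_id": "customers",
      "metadata": [
        { "breadcrumb": [], "metadata": { "selected": true } },
        { "breadcrumb": ["properties", "id"],
          "metadata": { "inclusion": "automatic" } },
        { "breadcrumb": ["properties", "email"],
          "metadata": { "inclusion": "available", "selected": false } }
      ]
    }
  ]
}
```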

GraphQL APIs would likely tend to allow full configuration, while REST APIs may allow selecting which fields to return only in some cases.

Potentially, connectors themselves could classify each field as "is/contains PII" yes/no; this would allow a global "deselect all PII" tick box.

Your enterprise edition might go beyond selection and add enforcement (so that only specific users can change these settings).

Describe the alternative you’ve considered or used

  • Pull all the data and filter in a dbt step: this would go against some data processor agreements and open the door to abuse or mistakes
  • Configure the source manually in the container / work with forked source connectors: a workaround for self-hosted only; needs to be repeated for every data source

Issue is synchronized with this Asana task by Unito

ghost added the type/enhancement (New feature or request) label Jan 25, 2021
ChristopheDuong (Contributor) commented Jan 25, 2021

Yes, this might be a duplicate of #886, as we don't handle nested streams very well yet.

ghost (Author) commented Jan 25, 2021

I'd disagree that this is a dupe. While that ticket is about post-processing (e.g., likely about generically extracting nested data into separate tables), this ticket is about:

  • sources declaring their data fully
  • fetching only subsets of the data

That is, not: fetch > normalize > filter, but fetch-subset directly.

It might make sense to focus this ticket on the underlying PII processing issue, as for non-sensitive data, filtering normalized data is likely easier and preferable.

ChristopheDuong (Contributor) commented Jan 25, 2021

Yes, we have multiple open issues at the moment. This hasn't been fixed yet because it requires changes in multiple places in Airbyte:

  • The frontend truncates the nesting in the catalog, so the correct catalog is not persisted or sent to the worker. (The catalog is the metadata fetched from the source that would let the UI offer the fetch-subset selection you describe.)
  • Once the data is filtered into subsets and sent by the source as a raw JSON blob, which the destination persists in raw tables, we'll have to adapt the post-processing normalization that extracts it into separate tables.

ChristopheDuong (Contributor) commented
This is also reported in #1315. For the moment, I've been trying to link all these issues to a common one, #886, to gather the different use cases.

ghost (Author) commented Jan 25, 2021

Right, #1315 could be seen as a general parent of this issue, which would then focus on the PII question.

michel-tricot (Contributor) commented

100% on board with this issue.

The guarantee that we want to offer is: unsafe data NEVER makes it to the destination.

There are two changes that will make it possible:

  1. Allow hashing or dropping PII from source connectors (#1758), in case you want to keep a pseudo-anonymized version of the value
  2. Allow the selection of fields in the connection

The transformation piece would likely happen at the worker level, while the selection piece should be at the source level (using the worker as a safety net).

WDYT?
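The hashing half (point 1) could look roughly like the following sketch. This is not Airbyte's actual implementation; the function name and the idea of a per-connection salt are illustrative assumptions. Note also, as raised earlier in the thread, that hashed PII can still legally be personal data:

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Replace a PII value with a salted SHA-256 digest.

    The salt (illustratively per-connection) only hinders trivial
    reversal via rainbow tables; hashed PII can still count as PII.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

record = {"id": 42, "email": "jane@example.com"}
record["email"] = pseudonymize(record["email"], salt="connection-123")
```

The same email always hashes to the same digest for a given salt, so joins across tables within one connection keep working.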

ghost (Author) commented Jan 27, 2021

Yeah. The worker may need to be more than a safety net, though, as a source may not let you pull the data of interest without also including data you want to avoid. As a super-specific example, in the Shopify REST API you can configure top-level fields to be sent or not, but nothing deeper. So, e.g., to get order.customer.id, you need to pull all of order.customer, which also contains order.customer.email.

In other words, the worker may need to pull more than necessary, and filter before persisting to disk.
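A hedged sketch of that worker-side pruning for the Shopify example (the fetch itself is elided; `prune_order` is an illustrative helper, not an existing Airbyte function). The request could restrict top-level keys via the `fields` query parameter, e.g. `?fields=id,customer`, but `customer` still arrives whole, so the worker prunes it before anything touches disk:

```python
# Illustrative worker-side pruning: the Shopify REST API's `fields`
# parameter drops whole top-level keys only, so nested PII inside
# `customer` must be removed client-side before persisting.

def prune_order(order: dict) -> dict:
    """Keep order.id and customer.id; drop nested PII like customer.email."""
    customer = order.get("customer") or {}
    return {"id": order.get("id"), "customer": {"id": customer.get("id")}}

fetched = {  # what the API returns despite fields=id,customer
    "id": 1001,
    "customer": {"id": 42, "email": "jane@example.com"},
}
print(prune_order(fetched))
# {'id': 1001, 'customer': {'id': 42}}
```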

Perhaps this is an edge case, if there are on-disk caches :)

misteryeo (Contributor) commented
Issue was linked to Harvestr Discovery: Hashing PII fields

Projects: none yet
Development: no branches or pull requests
7 participants