From 8d14c72d7b8ef356b4c8ca67381146d43dbda388 Mon Sep 17 00:00:00 2001
From: michael_calvo
Date: Tue, 30 Jul 2024 15:41:05 +1200
Subject: [PATCH 1/4] faker library, rather than just faker instance, is now
 available to mapper expressions (provided a faker config is found).

---
 singer_sdk/mapper.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/singer_sdk/mapper.py b/singer_sdk/mapper.py
index fce1277fb..a2e7bc956 100644
--- a/singer_sdk/mapper.py
+++ b/singer_sdk/mapper.py
@@ -337,7 +337,10 @@ def _eval(
         names["config"] = self.map_config  # Allow map config access within transform
 
         if self.fake:
+            from faker import Faker  # noqa: PLC0415
+
             names["fake"] = self.fake
+            names["Faker"] = Faker
 
         if property_name and property_name in record:
             # Allow access to original property value if applicable

From e86573ebaab5a1190aa12a56ee7009657323dde6 Mon Sep 17 00:00:00 2001
From: michael_calvo
Date: Wed, 31 Jul 2024 10:39:26 +1200
Subject: [PATCH 2/4] added section on using faker for data masking, including
 how to use the faker library to re-seed

---
 docs/stream_maps.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/docs/stream_maps.md b/docs/stream_maps.md
index 66d647685..c193c6cee 100644
--- a/docs/stream_maps.md
+++ b/docs/stream_maps.md
@@ -249,6 +249,8 @@ can be referenced directly by mapping expressions.
 - `fake` - a [`Faker`](inv:faker:std:doc#index) instance, configurable via `faker_config`
   (see previous example) - see the built-in [standard providers](inv:faker:std:doc#providers)
   for available methods
+- `Faker` - the `Faker` library itself. This was made available to enable consistent data
+  masking by allowing users to call `Faker.seed()`.
 
 ```{tip}
 The `fake` object is only available if the plugin specifies `faker` as an additional dependency (through the `singer-sdk` `faker` extra, or directly).
@@ -435,6 +437,46 @@ stream_maps:
 ```
 ````
 
+### Masking data with Faker
+
+It is best practice (or even a legal requirement) to mask PII/PHI in lower environments. Stream mappers have access to the `Faker` library, which can be used to generate random data in various forms/formats.
+
+```yaml
+stream_maps:
+  customers:
+    # IMPORTANT: the `fake` variable name will only be available if faker_config is defined
+    first_name: fake.first_name() # generates a new random name each time
+faker_config:
+  # set specific seed
+  seed: 0
+  # set specific locales
+  locale:
+  - en_US
+  - en_GB
+```
+
+Be sure to check out the [`faker` documentation](https://faker.readthedocs.io/en/master/) for all the fake data generation possibilities.
+
+Note that in the example above, `faker` will generate a new random value each time the `first_name()` function is invoked. This means if 3 records have a `first_name` value of `Mike`, then they will each have a different name after being mapped (for example, `Alistair`, `Debra`, `Scooby`). This can actually lead to issues when developing in the lower environments.
+
+Some users require consistent masking (for example, the first name `Mike` is always masked as `Debra`). Consistent masking preserves the relationship between tables and rows, while still hiding the real value. When a random mask is generated every time, relationships between tables/rows are effectively lost, making it impossible to test things like SQL `JOIN`s. This can cause highly unpredictable behavior when running the same code in lower environments vs production.
+
+To generate consistent masked values, you must provide the **same seed each time** before invoking the faker function.
+
+```yaml
+stream_maps:
+  customers:
+    # will always generate the same value for the same seed
+    first_name: Faker.seed(_['first_name']) or fake.first_name()
+faker_config:
+  # IMPORTANT: `fake` and `Faker` names are only available if faker_config is defined.
+  locale: en_US
+```
+
+Remember, these expressions are evaluated by the [`simpleeval`](https://github.com/danthedeckie/simpleeval) expression library, which only allows a single Python expression (which is the reason for the `or` syntax above).
+
+This means if you require more advanced masking logic, which cannot be defined in a single Python expression, you may need to consider a custom stream mapper.
+
 #### Q: What is the difference between `primary_keys` and `key_properties`?
 
 **A:** These two are _generally_ identical - and will only differ in cases like the above where `key_properties` is manually

From 7c2ef28f491c394bbc90f9b12fcc06036496149d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edgar=20Ram=C3=ADrez=20Mondrag=C3=B3n?= <16805946+edgarrmondragon@users.noreply.github.com>
Date: Wed, 31 Jul 2024 13:14:01 -0600
Subject: [PATCH 3/4] Update docs/stream_maps.md

---
 docs/stream_maps.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/stream_maps.md b/docs/stream_maps.md
index c193c6cee..239ba8f41 100644
--- a/docs/stream_maps.md
+++ b/docs/stream_maps.md
@@ -249,7 +249,7 @@ can be referenced directly by mapping expressions.
 - `fake` - a [`Faker`](inv:faker:std:doc#index) instance, configurable via `faker_config`
   (see previous example) - see the built-in [standard providers](inv:faker:std:doc#providers)
   for available methods
-- `Faker` - the `Faker` library itself. This was made available to enable consistent data
+- `Faker` - the [`Faker`](nv:faker:std:doc#fakerclass) class. This was made available to enable consistent data
   masking by allowing users to call `Faker.seed()`.
 ```{tip}

From afe538de8c323d9e80e43695f7fb80de740a2420 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edgar=20Ram=C3=ADrez=20Mondrag=C3=B3n?= <16805946+edgarrmondragon@users.noreply.github.com>
Date: Wed, 31 Jul 2024 13:21:48 -0600
Subject: [PATCH 4/4] Update docs/stream_maps.md

---
 docs/stream_maps.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/stream_maps.md b/docs/stream_maps.md
index 239ba8f41..fffe92853 100644
--- a/docs/stream_maps.md
+++ b/docs/stream_maps.md
@@ -249,7 +249,7 @@ can be referenced directly by mapping expressions.
 - `fake` - a [`Faker`](inv:faker:std:doc#index) instance, configurable via `faker_config`
   (see previous example) - see the built-in [standard providers](inv:faker:std:doc#providers)
   for available methods
-- `Faker` - the [`Faker`](nv:faker:std:doc#fakerclass) class. This was made available to enable consistent data
+- `Faker` - the [`Faker`](inv:faker:std:doc#fakerclass) class. This was made available to enable consistent data
   masking by allowing users to call `Faker.seed()`.
 
 ```{tip}
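
Note on the `Faker.seed(...) or fake.first_name()` idiom documented in patch 2: the sketch below illustrates why a single expression is enough for consistent masking. It is a minimal stand-in, not the SDK's or Faker's implementation. It uses a stdlib `random.Random` in place of Faker's shared generator so it runs without the `faker` extra, and `NAMES`, `seed`, `first_name`, and `mask` are hypothetical names introduced only for illustration. The one assumption carried over from the docs is that the seeding call returns a falsy value, so `or` falls through to the generator call.

```python
import random

# Stand-in for Faker's shared random generator (assumption for illustration).
_rng = random.Random()

# Hypothetical pool of replacement values.
NAMES = ["Alistair", "Debra", "Scooby", "Priya", "Tomas", "Wen"]


def seed(value):
    """Re-seed the shared generator. Returns None, which is falsy."""
    _rng.seed(value)


def first_name():
    """Draw one name from the pool using the shared generator."""
    return _rng.choice(NAMES)


def mask(value):
    # `seed(value)` returns None, so `or` evaluates and returns the
    # right-hand side. Seeding and drawing fit in one expression.
    return seed(value) or first_name()


records = [{"first_name": "Mike"}, {"first_name": "Ana"}, {"first_name": "Mike"}]
masked = [mask(r["first_name"]) for r in records]
```

Because the generator is re-seeded from the original value before every draw, equal inputs always produce equal outputs, which is what keeps joins on masked columns aligned. Distinct inputs can still collide on the same masked value, so the mapping is consistent but not guaranteed to be unique.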