Sensitive data masking utility #1173
Comments
@heitorlessa - would you be able to help add more detail on how this could be implemented? It might be something I can look into. |
My idea was something like this:

from aws_lambda_powertools import Logger
from aws_lambda_powertools import obfuscation_filter

logger = Logger(service="payment")


@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.info(event)
    # {
    #   Records: [
    #     { firstName: "personal", secondName: "identifiable", email: "inform@ti.on", groupID: "123" },
    #     { firstName: "second", secondName: "personal", email: "inform@ti.on", some_other_value: "abc" }
    #   ]
    # }

    # pass object and list of json paths to obfuscate
    obfuscation_filter(
        object=event,
        obfuscation_paths=["Records.*.firstName", "Records.*.secondName", "Records.*.email"]
    )

    logger.info(event)
    # {
    #   Records: [
    #     { firstName: "********", secondName: "************", email: "******@**.**", groupID: "123" },
    #     { firstName: "******", secondName: "********", email: "******@**.**", some_other_value: "abc" }
    #   ]
    # }

Thoughts? |
I couldn't block time to speak to a few Security experts on the best way to do this. One thing I know for sure is that you don't only want to mask data; there are times you want to encrypt it for one party and let only another party decrypt it -- this could be optional. Recursion works, AST also works, but JMESPath wouldn't work, as you can't do an in-place edit or know which nodes you travelled. Let me ping some Security experts to get their opinion on it. |
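A minimal sketch of the recursive option, assuming dot-separated paths with a `*` wildcard as in the proposal above. `mask_paths` and `mask_value` are hypothetical names, not Powertools APIs, and the mask is a same-length run of asterisks (it does not preserve the `@` and `.` of an email the way the proposed output does).

```python
from typing import Any, List


def mask_value(value: Any) -> str:
    """Replace a value with asterisks matching the length of its string form."""
    return "*" * len(str(value))


def _mask(node: Any, parts: List[str]) -> None:
    """Walk `node` along `parts` and mask the final key in place."""
    head, rest = parts[0], parts[1:]

    if isinstance(node, list):
        if head != "*":
            return  # in this sketch, lists are only addressed through the wildcard
        keys = list(range(len(node)))
    elif isinstance(node, dict):
        if head == "*":
            keys = list(node)
        else:
            keys = [head] if head in node else []
    else:
        return  # scalar reached before the path ended: nothing to mask

    for key in keys:
        if rest:
            _mask(node[key], rest)  # keep descending
        else:
            node[key] = mask_value(node[key])  # in-place edit at the parent


def mask_paths(data: Any, paths: List[str]) -> None:
    """Mask every location matched by the given paths, mutating `data`."""
    for path in paths:
        _mask(data, path.split("."))


event = {
    "Records": [
        {"firstName": "personal", "email": "inform@ti.on", "groupID": "123"},
        {"firstName": "second", "email": "inform@ti.on", "some_other_value": "abc"},
    ]
}
mask_paths(event, ["Records.*.firstName", "Records.*.email"])
```

Because the recursion always holds a reference to the parent container, it can overwrite the value in place, which is the part a query-only language does not offer.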
That's a similar UX we would be looking at. I've pinged two security experts I trust at AWS to help brainstorm this, but here's essentially the problem space we're looking at.
We can handle 1 without any tokenization process. However, if we ever want to support 2, we would need tokenization to obfuscate and de-obfuscate data in separate operations, with the latter done only by authorised personnel. There could be simpler ways to do it without bringing in a heavy dependency, but given that tokenisation is a serious thing and tokens must be uniquely random, we need to be careful, hence getting some Security experts' opinion first... or else we might end up with 1 only, which is similar to what you did @keithrozario |
True, my understanding of tokenization is that it requires a store of data (unlike encryption). It'll be quite difficult to execute either tokenization or encryption without some external service providing that functionality. |
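To illustrate why tokenization needs a store, here is a rough sketch with an in-memory dict standing in for that store. `TokenVault` is a hypothetical name; a real deployment would presumably need a durable, access-controlled service in its place.

```python
import secrets
from typing import Dict


class TokenVault:
    """In-memory stand-in for the external store that tokenization depends on."""

    def __init__(self) -> None:
        self._token_to_value: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # The token is random, so it reveals nothing about the value;
        # the only way back is a lookup in the store.
        token = secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Raises KeyError for a token this vault never issued.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("inform@ti.on")
original = vault.detokenize(token)
```

With encryption the ciphertext itself carries the data and only a key is needed; here the mapping must live somewhere, which is the extra moving part mentioned above.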
I might look a little more into this, especially once we feel like the UX is there for this. |
Any update on this issue? |
Hi @faboulaye, please see #1076 for info on the current status of AWS Lambda Powertools for Python. For the TypeScript version we have a similar feature request (#728) but at the moment we are focused on refining the experience for the core utilities (Logger, Tracer, Metrics) before considering new feature requests. |
I have implemented this before in Java, and my suggestion is not to specify paths, because that gets very messy for larger projects. Just list the fields that need masking. My idea is:
|
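One way to read the field-name idea in Python, offered as a sketch rather than the commenter's own code: a standard-library `logging.Filter` that masks any key whose name is in a configured set, wherever it appears. `FieldMaskingFilter` is a hypothetical name.

```python
import logging
from typing import Any, Iterable


class FieldMaskingFilter(logging.Filter):
    """Mask the values of named fields when a dict or list is logged as the message."""

    def __init__(self, fields: Iterable[str]) -> None:
        super().__init__()
        self.fields = frozenset(fields)

    def _mask(self, node: Any) -> Any:
        # Build a masked copy so the caller's data is left untouched.
        if isinstance(node, dict):
            return {
                key: "*****" if key in self.fields else self._mask(value)
                for key, value in node.items()
            }
        if isinstance(node, list):
            return [self._mask(item) for item in node]
        return node

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.msg, (dict, list)):
            record.msg = self._mask(record.msg)
        return True  # never drop the record, only rewrite its message


logger = logging.getLogger("payment")
logger.addFilter(FieldMaskingFilter(["firstName", "secondName", "email"]))
```

Matching on names alone avoids spelling out paths, at the cost of masking every field with that name regardless of where it sits.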
Thanks a lot for the suggestion @ammar-khan-cultureamp! That however handles Logging only; we're looking for something generic that could be reused by a Logger Filter (like the example above). The most recent POC I've done for a customer used Parser (Pydantic) with the AWS Encryption SDK, using caching to reduce the perf hit -- that was roughly ~1s at cold start and ~40ms warm start for a deeply nested structure. I've also tried ItsDangerous as an alternative to do it in memory. The POC looked like this, where you could easily encrypt one or more fields of any data structure, and decrypting was only possible for people with access to the key for a given data classification:

from typing import Optional

from aws_lambda_powertools.utilities.parser import BaseModel, validator

from .constants import ENCRYPTION_KEYS
from .data_masking import EncryptionManager


def encrypt_pii(value: Optional[str]) -> Optional[str]:
    # NOTE: Need a separate one to account for Decimals
    if value is None:
        return value

    enc_client = EncryptionManager(keys=[ENCRYPTION_KEYS])
    ctx = {"policy_number": "test"}
    return enc_client.encrypt(plaintext=value, context=ctx)


class Policy(BaseModel):
    oldest_driver_dob: Optional[str]
    youngest_driver_dob: Optional[str]
    # many other fields

    # validators
    _scrub_oldest_driver_dob = validator("oldest_driver_dob", allow_reuse=True)(encrypt_pii)
    _scrub_youngest_driver_dob = validator("youngest_driver_dob", allow_reuse=True)(encrypt_pii)


# usage
# data = Policy(**event)

Redacting sensitive data can be easily solved; the challenge here is masking and unmasking data by authorised personnel with an acceptable performance hit. For the latter, we still need an RFC to think through the following questions:
|
Any ETA for this feature, considering it was opened over 1 year ago? |
We'll be looking at it in Q1 next year. We need some foundational work to be complete and stable first (V2). Once that's done, and past re:Invent, we can more easily recommend that customers bring optional dependencies to make this feature easier to maintain -- this will depend on the AWS Encryption SDK (got a working POC already). We'd welcome any RFC on this topic too.

The POC looks more or less like this; it relies on the AWS Encryption SDK and Pydantic for optimal performance.

# data_masking.py
"""Example showing basic encryption and decryption of a value already in memory."""
import base64
from typing import Any, Optional, Union

import botocore.session
from aws_encryption_sdk import (
    CachingCryptoMaterialsManager,
    EncryptionSDKClient,
    LocalCryptoMaterialsCache,
    StrictAwsKmsMasterKeyProvider,
)


class SingletonMeta(type):
    """Metaclass to cache class instances to optimize encryption"""

    _instances: dict[type, Any] = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            instance = super().__call__(*args, **kwargs)
            cls._instances[cls] = instance
        return cls._instances[cls]


class EncryptionManager(metaclass=SingletonMeta):
    CACHE_CAPACITY: int = 100
    MAX_ENTRY_AGE_SECONDS: float = 300.0
    MAX_MESSAGES: int = 200
    # NOTE: You can also set max messages/bytes per data key

    cache = LocalCryptoMaterialsCache(CACHE_CAPACITY)
    session = botocore.session.Session()

    def __init__(self, keys: list[str], client: Optional[EncryptionSDKClient] = None) -> None:
        self.client = client or EncryptionSDKClient()
        self.keys = keys
        self.key_provider = StrictAwsKmsMasterKeyProvider(key_ids=keys, botocore_session=self.session)
        self.cache_cmm = CachingCryptoMaterialsManager(
            master_key_provider=self.key_provider,
            cache=self.cache,
            max_age=self.MAX_ENTRY_AGE_SECONDS,
            max_messages_encrypted=self.MAX_MESSAGES,
        )

    def encrypt(self, plaintext: Union[bytes, str], context: dict) -> str:
        ciphertext, header = self.client.encrypt(
            source=plaintext, encryption_context=context, materials_manager=self.cache_cmm
        )
        # base64-encode so the ciphertext can travel as a plain string
        return base64.b64encode(ciphertext).decode()

    def decrypt(self, encoded_ciphertext: str, context: dict) -> str:
        ciphertext = base64.b64decode(encoded_ciphertext)
        plaintext, header = self.client.decrypt(source=ciphertext, key_provider=self.key_provider)

        policy_number = context.get("policy_number")
        if policy_number != header.encryption_context.get("policy_number"):
            raise ValueError("Encryption context mismatch")

        # decrypt() already returns the original plaintext bytes, so no second base64 decode
        return plaintext.decode()

Usage

from typing import Optional
from aws_lambda_powertools.utilities.parser import BaseModel, validator

from .constants import ENCRYPTION_KEYS
from .data_masking import EncryptionManager


def encrypt_pii(value: Optional[str]) -> Optional[str]:
    # NOTE: Need a separate one to account for Decimals
    if value is None:
        return value

    enc_client = EncryptionManager(keys=[ENCRYPTION_KEYS])
    ctx = {"policy_number": "test"}
    return enc_client.encrypt(plaintext=value, context=ctx)


class Policy(BaseModel):
    oldest_driver_dob: Optional[str]
    youngest_driver_dob: Optional[str]
    # many other fields

    # Pydantic validators call `encrypt_pii`, which fetches the KMS key only once
    # uses envelope encryption for faster in-memory operations, caching, and respects encryption thresholds
    _scrub_oldest_driver_dob = validator("oldest_driver_dob", allow_reuse=True)(encrypt_pii)
    _scrub_youngest_driver_dob = validator("youngest_driver_dob", allow_reuse=True)(encrypt_pii)


# usage
# data = Policy(**event) |
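The POC constructs `EncryptionManager(...)` inside every validator call and relies on the `SingletonMeta` metaclass to make that cheap. The standalone sketch below shows that behaviour with a hypothetical `FakeManager` that makes no AWS calls; the metaclass body mirrors the POC.

```python
from typing import Any, Dict, List


class SingletonMeta(type):
    """Cache one instance per class, as in the POC."""

    _instances: Dict[type, Any] = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class FakeManager(metaclass=SingletonMeta):
    """Hypothetical stand-in for EncryptionManager; counts how often __init__ runs."""

    init_calls = 0

    def __init__(self, keys: List[str]) -> None:
        FakeManager.init_calls += 1
        self.keys = keys


first = FakeManager(keys=["key-a"])
# The cached instance is returned; __init__ does not run again,
# so the new `keys` argument is silently ignored.
second = FakeManager(keys=["key-b"])
```

One consequence worth keeping in mind: after the first construction, later calls with different keys still return the original instance, so this pattern only suits a single key set per process.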
Question: how does this stand now that CloudWatch gained native data masking functionality? https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/ |
RFC is up: #1858. We'd appreciate everyone's comments here, as this will be shipped this year. |
@heitorlessa is this still set to be added? |
Yea @justinhauer - I had to prioritise supply chain security work so it slipped. I spoke with Seshu (author) last week, and I'm reviewing this week. |
|
Runtime (e.g. Python, Java, all of them): Python
Is your feature request related to a problem? Please describe.
As a customer, I'd like to obfuscate incoming data for known fields that contain PII, so that they're not passed downstream or accidentally logged.
Describe the solution you'd like
It could be any of these ideas or something better if anyone wants to chime in:
***, ###)
A more complex operation would be data depersonalization, where I'd want to encrypt and store the correct data somewhere, and a separate actor would have permission to decrypt it.
Describe alternatives you've considered
str, and run a string replacement to mask any sensitive data encountered
Is this something you'd like to contribute if you had guidance?
Additional context