
Sensitive data masking utility #1173

Closed · heitorlessa opened this issue May 13, 2021 · 17 comments · Fixed by #2197
Labels: feature-request, help wanted

@heitorlessa
Contributor

Runtime: Python

Is your feature request related to a problem? Please describe.

As a customer, I'd like to obfuscate incoming data for known fields that contain PII, so that they're not passed downstream or accidentally logged.

Describe the solution you'd like

It could be any of these ideas or something better if anyone wants to chime in:

A more complex operation would be data depersonalization, where I'd want to encrypt and store the original data somewhere, so that only a separate actor would have permission to decrypt it.

Describe alternatives you've considered

  1. Mask data on my own with a function that walks through a graph and recursively changes it (a rough sketch follows this list).
  2. Use Amazon Comprehend Medical to detect PHI data, convert it to a str, and run a string replacement to mask any sensitive data encountered.
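
For illustration, alternative 1 could look roughly like the sketch below. `SENSITIVE_FIELDS`, `MASK`, and `mask_in_place` are made-up names for this example, not an existing Powertools API.

from typing import Any

SENSITIVE_FIELDS = {"firstName", "secondName", "email"}
MASK = "********"


def mask_in_place(data: Any) -> Any:
    """Recursively walk dicts/lists and replace values of known sensitive fields."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key in SENSITIVE_FIELDS:
                data[key] = MASK
            else:
                mask_in_place(value)
    elif isinstance(data, list):
        for item in data:
            mask_in_place(item)
    return data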

Is this something you'd like to contribute if you had guidance?

Additional context

@michaelbrewer
Contributor

@heitorlessa - would you be able to add more details on how this could be implemented? It might be something I can look into.

@keithrozario

My idea was something like this:

from aws_lambda_powertools import Logger
from aws_lambda_powertools import obfuscation_filter

logger = Logger(service="payment")

@logger.inject_lambda_context
def lambda_handler(event, context):

    logger.info(event)
    #   {
    #     Records: [
    #         { firstName: "personal", secondName: "identifiable", email: "inform@ti.on", groupID: "123" },
    #         { firstName: "second", secondName: "personal", email: "inform@ti.on", some_other_value: "abc" }
    #       ]
    #   }

    # pass object and list of json paths to obfuscate
    obfuscation_filter(
        object=event,
        obfuscation_paths=["Records.*.firstName", "Records.*.secondName", "Records.*.email"]
    )

    logger.info(event)
    # {
    #   Records: [
    #       { firstName: "********", secondName: "************", email: "******@**.**", groupID: "123" },
    #       { firstName: "******", secondName: "********", email: "******@**.**", some_other_value: "abc" }
    #     ]
    # }

thoughts?
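
For illustration, such an obfuscation_filter could resolve the wildcard paths in place along these lines; the function name and path syntax come from the proposal above, while the implementation itself is only a sketch, not an existing API:

def obfuscation_filter(object: dict, obfuscation_paths: list[str], mask: str = "********") -> None:
    """Mask values at the given dotted paths in place; '*' matches every list item."""

    def _mask(node, parts: list[str]) -> None:
        if not parts:
            return
        key, rest = parts[0], parts[1:]
        if key == "*":
            if isinstance(node, list):
                for item in node:
                    _mask(item, rest)
        elif isinstance(node, dict) and key in node:
            if rest:
                _mask(node[key], rest)
            else:
                node[key] = mask  # leaf reached: replace the value

    for path in obfuscation_paths:
        _mask(object, path.split("."))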

@heitorlessa
Contributor Author

I haven't been able to block time to speak to a few Security experts on the best way to do this. One thing I know for sure is that you don't only want to mask data; there are times you want one party to encrypt it and only another party to be able to decrypt it -- this could be optional.

Recursion works, AST also works; JMESPath wouldn't work, as you can't do an in-place edit nor know the nodes you traversed.

Let me ping some Security experts to get their opinion on it.

@heitorlessa
Contributor Author

> My idea was something like this: [obfuscation_filter example above]
>
> thoughts?

That's similar to the UX we'd be looking for. I've pinged two security experts I trust at AWS to help brainstorm this, but here's essentially the problem space:

  1. As a customer, I'd like to mask sensitive data my function is receiving so that downstream systems don't need any additional work to handle PII data.
  2. As a customer, I'd like to recover masked sensitive data into its original form so that I can handle sensitive requests around that data on an as-needed basis.

We can handle 1 without any tokenization process; however, if we ever want to support 2, we'd need tokenization to obfuscate and de-obfuscate data in separate operations, with the latter done only by authorised personnel.

There could be simpler ways to do it without bringing in a heavy dependency, but given tokenization is a serious matter and tokens must be uniquely random, we need to be careful, hence the Security experts' opinion first.... or else we might end up with 1 only, similar to what you did @keithrozario

@keithrozario

True, my understanding of tokenization is that it requires a store of data (unlike encryption). It'll be quite difficult to execute either tokenization or encryption without some external service providing that functionality.
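
To make that concrete, here is a rough sketch of tokenization backed by a store; the dict below stands in for an external store such as a DynamoDB table, and all names are hypothetical:

import secrets

# Stand-in for an external token store (e.g. a DynamoDB table); real tokenization
# persists the mapping so that only an authorised actor can detokenize later.
_TOKEN_STORE: dict[str, str] = {}


def tokenize(value: str) -> str:
    """Replace a sensitive value with a unique random token and remember the mapping."""
    token = f"tok_{secrets.token_urlsafe(16)}"
    _TOKEN_STORE[token] = value
    return token


def detokenize(token: str) -> str:
    """Recover the original value; access to this operation should be restricted."""
    return _TOKEN_STORE[token]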

@michaelbrewer
Contributor

I might look a little more into this, especially once we feel like the UX is there for it.

@faboulaye

Any update on this issue?

@dreamorosi
Contributor

> Any update on this issue?

Hi @faboulaye, please see #1076 for info on the current status of AWS Lambda Powertools for Python.

For the TypeScript version we have a similar feature request (#728) but at the moment we are focused on refining the experience for the core utilities (Logger, Tracer, Metrics) before considering new feature requests.

@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda Apr 28, 2022
@heitorlessa heitorlessa removed the python label Jun 1, 2022
@heitorlessa heitorlessa added the feature-request label Jul 4, 2022
@ammar-khan-cultureamp commented Jul 6, 2022

I have implemented this before in Java. My suggestion is not to specify paths, because that becomes very messy for larger projects; just specify the fields that need masking. My idea is:

# example log event:
#   {
#     Records: [
#         { firstName: "personal", secondName: "identifiable", email: "inform@ti.on", groupID: "123" },
#         { firstName: "second", secondName: "personal", email: "inform@ti.on", some_other_value: "abc" }
#     ]
#   }

from aws_lambda_powertools import Logger

logger = Logger(service="payment")

@logger.inject_lambda_context
@logger.obfuscation({"fields": ["firstName", "secondName", "email"]})  # This could be moved to a config file
def lambda_handler(event, context):
    logger.info(event)
    
# Logs pushed to CloudWatch as:
# {
#   Records: [
#       { firstName: "********", secondName: "************", email: "******@**.**", groupID: "123" },
#       { firstName: "******", secondName: "********", email: "******@**.**", some_other_value: "abc" }
#     ]
# }
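
A hypothetical obfuscation decorator along those lines could be sketched as a standalone function (rather than a Logger method) that masks the named fields on the incoming event before the handler and its logging run:

import functools
from typing import Any, Callable


def obfuscation(fields: list[str], mask: str = "********") -> Callable:
    """Hypothetical decorator: recursively mask the given field names on the event."""

    def _mask(data: Any) -> None:
        if isinstance(data, dict):
            for key, value in data.items():
                if key in fields:
                    data[key] = mask
                else:
                    _mask(value)
        elif isinstance(data, list):
            for item in data:
                _mask(item)

    def decorator(handler: Callable) -> Callable:
        @functools.wraps(handler)
        def wrapper(event, context):
            _mask(event)  # mutate the event before any logging happens
            return handler(event, context)

        return wrapper

    return decorator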

@heitorlessa heitorlessa added the help wanted label Aug 1, 2022
@heitorlessa
Contributor Author

Thanks a lot for the suggestion @ammar-khan-cultureamp! That however handles Logging only; we're looking for something generic that could be reused by a Logger Filter (like the example above).

The most recent POC I've done for a customer used Parser (Pydantic) with the AWS Encryption SDK, with caching to reduce the performance hit - that was roughly ~1s at cold start and ~40ms at warm start for a deeply nested structure. I've also tried ItsDangerous as an alternative to do it in memory.

The POC looked like this, where you could easily encrypt one or more fields of any data structure, and decrypting was only possible for people with access to the key for a given data classification:

from typing import Optional
from aws_lambda_powertools.utilities.parser import BaseModel, validator

from .constants import ENCRYPTION_KEYS
from .data_masking import EncryptionManager


def encrypt_pii(value: str) -> str:
    # NOTE: Need a separate one to account for Decimals
    if value is None:
        return value

    enc_client = EncryptionManager(keys=[ENCRYPTION_KEYS])
    ctx = {"policy_number": "test"}
    return enc_client.encrypt(plaintext=value, context=ctx)

class Policy(BaseModel):
    oldest_driver_dob: Optional[str]
    youngest_driver_dob: Optional[str]

    # many other fields

    # validators
    _scrub_oldest_driver_dob = validator("oldest_driver_dob", allow_reuse=True)(encrypt_pii)
    _scrub_youngest_driver_dob = validator("youngest_driver_dob", allow_reuse=True)(encrypt_pii)


# usage
# data = Policy(**event)

Redacting sensitive data can be easily solved; the challenge here is masking and unmasking data by authorised personnel with an acceptable performance hit. For the latter, we still need an RFC to think through the following questions:

  • What's the most performant, maintainable, and potentially cross-language approach for traversing and redacting data? e.g., AST, Pydantic, recursion, etc.
  • If instructed, what ideal data model should we use to persist masked data to ease retrieval?
  • What mechanism should we use to mask and unmask data?
  • Should we make the original data irretrievable unless instructed otherwise?
  • How can we ensure least privilege to separate masking and unmasking concerns?
  • If we use encryption or any managed service, is envelope encryption + caching supported for key materials?

@polamayster

Any ETA for this feature, considering it was opened over a year ago?

@heitorlessa
Contributor Author

> Any ETA for this feature, considering it was opened over a year ago?

We'll be looking at it in Q1 next year. We need foundational work to be complete and stable first (V2). Once that's complete, and we're past re:Invent, we can more easily recommend that customers bring optional dependencies to make this feature easier to maintain -- it will depend on the AWS Encryption SDK (we've got a working POC already).

We'd welcome any RFC on this topic, too.


The POC looks more or less like this - it relies on the AWS Encryption SDK and Pydantic for optimal performance.

# data_masking.py

"""Example showing basic encryption and decryption of a value already in memory."""
import base64
from typing import Any, Optional, Union

import botocore.session
from aws_encryption_sdk import (
    CachingCryptoMaterialsManager,
    EncryptionSDKClient,
    LocalCryptoMaterialsCache,
    StrictAwsKmsMasterKeyProvider,
)


class SingletonMeta(type):
    """Metaclass to cache class instances to optimize encryption"""

    _instances: dict[type, "EncryptionManager"] = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            instance = super().__call__(*args, **kwargs)
            cls._instances[cls] = instance
        return cls._instances[cls]


class EncryptionManager(metaclass=SingletonMeta):
    CACHE_CAPACITY: int = 100
    MAX_ENTRY_AGE_SECONDS: float = 300.0
    MAX_MESSAGES: int = 200
    # NOTE: You can also set max messages/bytes per data key

    cache = LocalCryptoMaterialsCache(CACHE_CAPACITY)
    session = botocore.session.Session()

    def __init__(self, keys: list[str], client: Optional[EncryptionSDKClient] = None) -> None:
        self.client = client or EncryptionSDKClient()
        self.keys = keys
        self.key_provider = StrictAwsKmsMasterKeyProvider(key_ids=keys, botocore_session=self.session)
        self.cache_cmm = CachingCryptoMaterialsManager(
            master_key_provider=self.key_provider,
            cache=self.cache,
            max_age=self.MAX_ENTRY_AGE_SECONDS,
            max_messages_encrypted=self.MAX_MESSAGES,
        )

    def encrypt(self, plaintext: Union[bytes, str], context: dict) -> str:
        ciphertext, header = self.client.encrypt(
            source=plaintext, encryption_context=context, materials_manager=self.cache_cmm
        )
        return base64.b64encode(ciphertext).decode()

    def decrypt(self, encoded_ciphertext: str, context: dict) -> str:
        ciphertext = base64.b64decode(encoded_ciphertext)
        plaintext, header = self.client.decrypt(source=ciphertext, key_provider=self.key_provider)
        policy_number = context.get("policy_number")

        if policy_number != header.encryption_context.get("policy_number"):
            raise ValueError("Encryption context mismatch")

        return plaintext.decode()

Usage

from typing import Optional
from aws_lambda_powertools.utilities.parser import BaseModel, validator

from .constants import ENCRYPTION_KEYS
from .data_masking import EncryptionManager


def encrypt_pii(value: str) -> str:
    # NOTE: Need a separate one to account for Decimals
    if value is None:
        return value

    enc_client = EncryptionManager(keys=[ENCRYPTION_KEYS])
    ctx = {"policy_number": "test"}
    return enc_client.encrypt(plaintext=value, context=ctx)

class Policy(BaseModel):
    oldest_driver_dob: Optional[str]
    youngest_driver_dob: Optional[str]

    # many other fields

    # Pydantic validators reuse `encrypt_pii`; the cached materials manager fetches the KMS data key only once
    # and uses envelope encryption for faster in-memory operations, caching, and encryption thresholds
    _scrub_oldest_driver_dob = validator("oldest_driver_dob", allow_reuse=True)(encrypt_pii)
    _scrub_youngest_driver_dob = validator("youngest_driver_dob", allow_reuse=True)(encrypt_pii)


# usage
# data = Policy(**event)
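
For completeness, a rough round trip with the EncryptionManager above; the KMS key ARN and encryption context below are placeholders:

# Placeholder KMS key ARN; decrypt requires kms:Decrypt on this key,
# which is one way to separate masking from unmasking permissions.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

manager = EncryptionManager(keys=[KEY_ARN])
ctx = {"policy_number": "test"}

ciphertext = manager.encrypt(plaintext="1990-01-01", context=ctx)
original = manager.decrypt(encoded_ciphertext=ciphertext, context=ctx)
assert original == "1990-01-01"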

@rubenfonseca
Contributor

Question: how does this stand now that CloudWatch gained native data masking functionality? https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/

@heitorlessa
Contributor Author

RFC is up: #1858

We'd appreciate everyone's comments here as this will be shipped this year.

@justinhauer

@heitorlessa is this still set to be added?

@heitorlessa
Contributor Author

Yea @justinhauer - I had to prioritise supply chain security work so it slipped. I spoke with Seshu (author) last week, and I'm reviewing this week.

@github-actions
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue, feel free to do so.

@leandrodamascena leandrodamascena moved this from Coming soon to Shipped in Powertools for AWS Lambda (Python) Oct 15, 2023
@heitorlessa heitorlessa added this to the Sensitive Data Masking milestone Nov 13, 2023