feat(data-masking): add support for Pydantic models, dataclasses, and standard classes #6413

Open
wants to merge 10 commits into base: develop

Conversation

VatsalGoel3
Contributor

Issue number: #3473

Summary

Changes

This PR adds support to the DataMasking utility to handle complex Python input types such as:

  • Pydantic models
  • Dataclasses
  • Standard Python classes with a .dict() method

To support this, a new prepare_data function was introduced, which performs type introspection and converts the input data into a dictionary before processing.

This function is now invoked at the beginning of the erase, encrypt, and decrypt methods, allowing these methods to seamlessly accept structured objects in addition to primitive types like dict, str, list, etc.
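
As a rough illustration of the convert-first, mask-afterwards flow described here, consider the minimal sketch below. `to_plain_dict`, `erase_fields`, and `Account` are hypothetical names used only for this example; they are not the PR's `prepare_data` or the `DataMasking` implementation, and the `"*****"` mask simply mirrors the output quoted in this description. Pydantic models are duck-typed through `model_dump()` so the sketch runs without Pydantic installed.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any


def to_plain_dict(data: Any) -> Any:
    """Convert a structured object into a dict; leave other inputs untouched."""
    if is_dataclass(data) and not isinstance(data, type):
        return asdict(data)
    # Pydantic v2 models expose model_dump(); duck-typed here to avoid the import.
    if callable(getattr(data, "model_dump", None)):
        return data.model_dump()
    if callable(getattr(data, "dict", None)) and not isinstance(data, dict):
        return data.dict()
    return data


def erase_fields(data: Any, fields: list[str]) -> Any:
    """Hypothetical stand-in for erase(): convert first, then mask top-level keys."""
    plain = to_plain_dict(data)
    if not isinstance(plain, dict):
        return plain
    return {key: "*****" if key in fields else value for key, value in plain.items()}


@dataclass
class Account:
    name: str
    age: int


masked = erase_fields(Account(name="powertools", age=5), fields=["age"])
print(masked)  # {'name': 'powertools', 'age': '*****'}
```

Because the conversion happens once at the entry point, the masking step only ever sees plain dictionaries, which is why existing inputs such as dict or str are unaffected.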

User experience

Before:

from aws_lambda_powertools.utilities.data_masking import DataMasking
from pydantic import BaseModel

class MyModel(BaseModel):
    name: str
    age: int

data = MyModel(name="powertools", age=5)
masker = DataMasking()
masked = masker.erase(data, fields=["age"])  # ❌ This raised errors or did not work

After:

# ✅ Now works correctly and returns: {'name': 'powertools', 'age': '*****'}
masked = masker.erase(data, fields=["age"])

This allows customers to use the utility directly with modern application architectures that use type-safe data structures.

Checklist

Is this a breaking change?

RFC issue number: N/A

Checklist:

  • Migration process documented
  • Implement warnings (if it can live side by side)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@VatsalGoel3 VatsalGoel3 requested a review from a team as a code owner April 6, 2025 09:23
@VatsalGoel3 VatsalGoel3 requested a review from anafalcao April 6, 2025 09:23
@boring-cyborg boring-cyborg bot added the tests label Apr 6, 2025
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 6, 2025
@VatsalGoel3
Contributor Author

@leandrodamascena, for now I applied the prepare_data() function as you suggested in the issue, but I have an idea for making it more robust and covering more edge cases. Its docstring would look like this:

"""
Recursively convert complex objects into dictionaries (or simple types) so that they can be
processed by the data masking utility. This function handles:

- Dataclasses (using dataclasses.asdict)
- Pydantic models (using model_dump)
- Custom classes with a dict() method
- Fallback to using __dict__ if available
- Recursively traverses dicts, lists, tuples, and sets
- Guards against circular references

Parameters
----------
data : Any
    The input data which may be a complex type.
_visited : set, optional
    Internal set of visited object IDs to prevent infinite recursion on cyclic references.

Returns
-------
Any
    A primitive type, or a recursively converted structure (dict, list, etc.)
"""

If that is more relevant, I will implement this one.

@leandrodamascena
Contributor

Hey @VatsalGoel3, can you show me an example with some pseudocode? Is the idea here something like a raw dict whose values can be dicts, Pydantic models, and dataclass instances? If so, I like the idea and would like to see an example.

@leandrodamascena
Contributor

Hi @VatsalGoel3, thanks a lot for another great contribution addressing complex issues in Powertools that will help customers. I'll review this tomorrow.

I was wondering if you are aware of the AWS Community Builder program. It sounds like you might want to check out this program and maybe apply. This program is for people who are helping the entire AWS ecosystem grow, creating content, making contributions, and for sure your contributions have actually helped customers using Powertools in TypeScript and Python.

Please note that I don't run this program, so I'm not saying whether you'll get accepted or not, but it's definitely worth checking out.

@VatsalGoel3
Contributor Author

@leandrodamascena Yes, exactly – the idea is to take an input that might be a raw dictionary with keys whose values could be dictionaries, Pydantic models, dataclass instances, or even custom objects with a dict() method, and recursively convert all of them into plain dictionaries or simple types.

Here’s some pseudocode to illustrate the concept:

from dataclasses import asdict, is_dataclass


def prepare_data(data, _visited=None):
    # Initialize the _visited set, which tracks ids of objects already seen
    # so that circular references do not cause infinite recursion.
    if _visited is None:
        _visited = set()

    # If data is a simple type (str, int, float, bool, None), return it immediately.
    if isinstance(data, (str, int, float, bool, type(None))):
        return data

    # If we've seen this object already (by id), return it as is to avoid infinite recursion.
    if id(data) in _visited:
        return data
    _visited.add(id(data))

    # If data is a dataclass instance (not the class itself), use asdict() and recurse.
    if is_dataclass(data) and not isinstance(data, type):
        return prepare_data(asdict(data), _visited=_visited)

    # If data is a Pydantic model, call model_dump() and process recursively.
    if callable(getattr(data, "model_dump", None)):
        return prepare_data(data.model_dump(), _visited=_visited)

    # If data has a dict() method and isn’t already a dict, use that.
    if callable(getattr(data, "dict", None)) and not isinstance(data, dict):
        return prepare_data(data.dict(), _visited=_visited)

    # If data is a dict, recursively process keys and values.
    if isinstance(data, dict):
        return {prepare_data(key, _visited=_visited): prepare_data(value, _visited=_visited)
                for key, value in data.items()}

    # If data is a list, tuple, or set, process each element recursively.
    if isinstance(data, (list, tuple, set)):
        return type(data)(prepare_data(item, _visited=_visited) for item in data)

    # If data has __dict__, use that as a fallback.
    if hasattr(data, "__dict__"):
        return prepare_data(vars(data), _visited=_visited)

    # If none of the above, return data as is.
    return data
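
To make the behavior of the id-based `_visited` guard concrete, here is a condensed restatement of the pseudocode above, limited to dataclasses, dicts, and lists. `convert` and `Card` are hypothetical names for this sketch, and it is not the code that landed in the PR. It shows two things: a self-referencing dict terminates, and an instance that is referenced twice (without any cycle) is converted only on its first occurrence.

```python
from dataclasses import asdict, dataclass, is_dataclass


def convert(data, _visited=None):
    """Condensed version of the pseudocode: dataclasses, dicts, and lists only."""
    if _visited is None:
        _visited = set()
    if isinstance(data, (str, int, float, bool, type(None))):
        return data
    if id(data) in _visited:
        return data  # already seen: returned unconverted
    _visited.add(id(data))
    if is_dataclass(data) and not isinstance(data, type):
        return convert(asdict(data), _visited)
    if isinstance(data, dict):
        return {key: convert(value, _visited) for key, value in data.items()}
    if isinstance(data, list):
        return [convert(item, _visited) for item in data]
    return data


@dataclass
class Card:
    number: str


# 1. A self-referencing dict terminates instead of recursing forever.
loop = {"name": "root"}
loop["self"] = loop
converted = convert(loop)
print(converted["self"] is loop)  # True

# 2. The same instance referenced twice is converted only the first time.
card = Card(number="1234")
shared = convert({"primary": card, "backup": card})
print(shared["primary"])  # {'number': '1234'}
print(type(shared["backup"]).__name__)  # Card
```

The second case means a field nested under `backup` would not be reachable as a dictionary key. One possible remedy, offered here only as a suggestion, is to track just the objects on the current recursion path and discard an id once its object has been processed.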

@VatsalGoel3
Contributor Author

@leandrodamascena, thank you for letting me know. I was not aware of this program. I have just applied, although I believe applications for this year are closed, so I would love to be part of the program next year. Also, is there any way I can DM you for some advice?

Thank you

@leandrodamascena
Contributor

Thanks for sharing this! I really like this idea! We have something like this in this method: https://github.com/aws-powertools/powertools-lambda-python/blob/develop/aws_lambda_powertools/event_handler/openapi/encoders.py#L29. In that case, we call the function recursively for each item in the JSON. I don't know if it makes sense here; what do you think?

@leandrodamascena
Contributor

Sure, send me an email at aws-powertools-maintainers@amazon.com and I’ll be more than happy to share my calendar with you. We can then schedule a meeting to talk about your contributions to Powertools, how we build community at Powertools, your challenges building workloads on AWS, and any other topics you’d like to share and we can help with.

@VatsalGoel3
Contributor Author

VatsalGoel3 commented Apr 6, 2025

@leandrodamascena Yes, I think a recursive approach is exactly the right idea. It ensures that every nested element is processed and converted into a plain type that the data masking logic can handle.

@leandrodamascena
Contributor

Super nice, please go ahead! Just please try to comment the code of this function to make it easier to understand for future changes.

@leandrodamascena leandrodamascena changed the title feat(data-masking): support masking of Pydantic models, dataclasses, and standard classes (#3473) feat(data-masking): add support for Pydantic models, dataclasses, and standard classes Apr 6, 2025
@leandrodamascena leandrodamascena requested review from leandrodamascena and removed request for anafalcao April 6, 2025 19:33
@github-actions github-actions bot added the feature New feature or functionality label Apr 6, 2025

codecov bot commented Apr 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.33%. Comparing base (3cb392e) to head (69c5ade).
Report is 2 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #6413   +/-   ##
========================================
  Coverage    96.33%   96.33%           
========================================
  Files          243      243           
  Lines        11758    11770   +12     
  Branches       871      874    +3     
========================================
+ Hits         11327    11339   +12     
  Misses         337      337           
  Partials        94       94           


@pull-request-size pull-request-size bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 7, 2025
@VatsalGoel3
Contributor Author

@leandrodamascena, I have pushed the function as we discussed and also updated the tests to provide more robust checks. I am unclear whether I need to update any documentation for this; please let me know if I can help with that.

@leandrodamascena
Contributor

In this section we say that we don't support Pydantic/Dataclass and other data types, so it would be nice if we updated this with examples using Pydantic, Dataclass and other things. You can submit a first version of the modification and then I can review it to refine it.

Thanks again for this fantastic work.

@VatsalGoel3
Contributor Author

📄 Documentation Update

Update the "Current limitations" section under ### Choosing parts of your data

Replace:

  1. We support JSON data types only - see data serialization for more details

With:

  1. We support JSON data types and common Python objects such as Pydantic models, Dataclasses, and custom classes with dict() or __dict__.

✅ Add a dedicated ### Supported input types section

Supported input types

You can now use the erase operation on a variety of common Python object types. These are recursively converted into dictionaries so their fields can be masked appropriately.

Supported input types:

  • ✅ Dictionaries & JSON strings
  • ✅ Dataclasses
  • ✅ Pydantic models (v2+ via .model_dump())
  • ✅ Custom classes implementing a dict() method
  • ✅ Custom classes with __dict__ attribute

Pydantic Example

from pydantic import BaseModel
from aws_lambda_powertools.utilities.data_masking import DataMasking

class User(BaseModel):
    username: str
    password: str

masked = DataMasking().erase(User(username="test", password="123"), fields=["password"])
# Output: {'username': 'test', 'password': '*****'}

Dataclass Example

from dataclasses import dataclass
from aws_lambda_powertools.utilities.data_masking import DataMasking

@dataclass
class Customer:
    name: str
    ssn: str

masked = DataMasking().erase(Customer(name="Jane", ssn="123-45-6789"), fields=["ssn"])
# Output: {'name': 'Jane', 'ssn': '*****'}

Custom Class with dict()

from aws_lambda_powertools.utilities.data_masking import DataMasking

class MyClass:
    def __init__(self):
        self.secret = "top"
        self.name = "public"

    def dict(self):
        return {"secret": self.secret, "name": self.name}

masked = DataMasking().erase(MyClass(), fields=["secret"])
# Output: {'secret': '*****', 'name': 'public'}
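
The last item in the list, custom classes with only a `__dict__` attribute, has no example above. The sketch below shows just the conversion step such a fallback relies on, using the built-in `vars()`; it does not call `DataMasking`, and `Session` and `object_to_dict` are hypothetical names for illustration.

```python
class Session:
    """Hypothetical plain class with no dict() method; attributes live in __dict__."""

    def __init__(self) -> None:
        self.token = "abc123"
        self.user = "jane"


def object_to_dict(obj):
    """Return a shallow copy of the instance attributes exposed through __dict__."""
    return dict(vars(obj))


session = Session()
plain = object_to_dict(session)
masked = {key: "*****" if key == "token" else value for key, value in plain.items()}
print(masked)  # {'token': '*****', 'user': 'jane'}
print(session.token)  # abc123
```

`vars(obj)` returns the live `__dict__` of the instance, so the copy matters: masking the returned mapping in place without it would also overwrite the attributes on the original object.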

@leandrodamascena, I think these would be good. I did not want to make the changes directly in the repo. Let me know what you think of this as a first version.

@leandrodamascena
Contributor

There is room to improve this, but just send the commit and we can work together, ok?

@boring-cyborg boring-cyborg bot added the documentation Improvements or additions to documentation label Apr 7, 2025
@VatsalGoel3
Contributor Author

@leandrodamascena, I have updated the docs

Contributor

@leandrodamascena leandrodamascena left a comment

Hey @VatsalGoel3! I left some comments before we have another round of review.

@@ -26,7 +26,72 @@

logger = logging.getLogger(__name__)


def prepare_data(data: Any, _visited: set[int] | None = None) -> Any:
Contributor

Ruff is complaining that this function is too complex, with too many return statements (https://docs.astral.sh/ruff/rules/too-many-return-statements/). I understand that returning early avoids nested if checks and similar, but do you see room to improve this function? If you want, I can try to optimize this code.
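
One possible way to bring the return count down, sketched under assumptions and not taken from the PR: move the "which conversion applies?" decision into a small helper, then let the recursive function reassign a single `result` variable and return once. `_unwrap`, `prepare_data_sketch`, and `Legacy` are hypothetical names, and the circular-reference guard is omitted for brevity.

```python
from dataclasses import asdict, is_dataclass
from typing import Any


def _unwrap(data: Any) -> Any:
    """Turn a structured object into a plain container; return other inputs unchanged."""
    if is_dataclass(data) and not isinstance(data, type):
        return asdict(data)
    # Try the conversion methods in order of preference (duck-typed, no Pydantic import).
    for method_name in ("model_dump", "dict"):
        method = getattr(data, method_name, None)
        if callable(method) and not isinstance(data, dict):
            return method()
    return data


def prepare_data_sketch(data: Any) -> Any:
    """Recursive conversion with a single return statement."""
    result = _unwrap(data)
    if isinstance(result, dict):
        result = {key: prepare_data_sketch(value) for key, value in result.items()}
    elif isinstance(result, (list, tuple, set)):
        result = type(result)(prepare_data_sketch(item) for item in result)
    elif hasattr(result, "__dict__") and not isinstance(result, type):
        result = prepare_data_sketch(vars(result))
    return result


class Legacy:
    """Hypothetical class exposing a dict() method."""

    def dict(self):
        return {"secret": "top", "tags": ["a", "b"]}


converted = prepare_data_sketch({"item": Legacy(), "count": 2})
print(converted)  # {'item': {'secret': 'top', 'tags': ['a', 'b']}, 'count': 2}
```

The trade-off is one extra function call per node in exchange for a flatter control flow; the early-return version in the PR remains easier to extend with a guard clause such as the `_visited` check.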

Contributor Author

@VatsalGoel3 VatsalGoel3 Apr 7, 2025

I have resolved the other issues. Would you help me with this one? I would love your input on it.

Contributor

Thanks for resolving the comments @VatsalGoel3! Let me work on this function to see if I can improve it.

@boring-cyborg boring-cyborg bot added commons dependencies Pull requests that update a dependency file labels Apr 7, 2025
sonarqubecloud bot commented Apr 7, 2025
Labels
commons dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation feature New feature or functionality size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tests
Projects
None yet
2 participants