Unable to assign transformers for specific sdtypes #1372

Closed
npatki opened this issue Apr 13, 2023 · 1 comment
Labels
bug Something isn't working resolution:duplicate This issue or pull request already exists

Comments


npatki commented Apr 13, 2023

Environment Details

  • SDV version: 1.0.0
  • RDT version: 1.3.0 (latest), also applies to 1.4.0.dev0

Error Description

The SDV is unable to assign transformers for certain sdtypes. It appears that RDT cannot locate the corresponding Faker function.

As an example, consider the sdtype "postcode". The intended Faker function is from the standard address provider: providers.address.Provider.postcode
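
To confirm that the function exists in Faker itself, independent of SDV and RDT, a quick check along these lines may help. This is a minimal sketch: the helper name `provider_has_function` is hypothetical, and it assumes Faker's standard providers are importable as `faker.providers.<name>` and expose a `Provider` class. It returns `None` when Faker is not installed or the provider module is unknown.

```python
import importlib


def provider_has_function(provider_name, function_name):
    """Check whether a standard Faker provider defines the given function.

    Hypothetical helper, not part of SDV or RDT.
    Returns True/False, or None if the provider module cannot be imported.
    """
    try:
        module = importlib.import_module(f"faker.providers.{provider_name}")
    except ImportError:
        # Faker is not installed, or there is no provider with this name.
        return None
    return hasattr(module.Provider, function_name)


# With Faker installed, this should report that the address provider
# defines a postcode function.
print(provider_has_function("address", "postcode"))
```

If this reports that the function exists, the problem is more likely in how the provider is looked up than in Faker.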

Steps to reproduce

Adapted from the code in #1370:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from rdt.transformers.pii import AnonymizedFaker
from sdv.single_table import GaussianCopulaSynthesizer

DATA = pd.DataFrame(
    data={
        "name": ["simon ross", "eliot lee"],
        "age": [22, 23],
        "sex": ["M", "F"],
        "postcode": ["xc12 3bq", "gd1 9ja"],
    }
)

METADATA = {
    "columns": {
        "name": {"sdtype": "name"},
        "age": {"sdtype": "numerical"},
        "sex": {"sdtype": "categorical"},
        "postcode": {"sdtype": "postcode"},
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
}

metadata = SingleTableMetadata.load_from_dict(METADATA)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(DATA) # <---- Error happens here

Stack Trace (last few frames):

/usr/local/lib/python3.9/dist-packages/sdv/data_processing/data_processor.py in _create_config(self, data, columns_created_by_constraints)
    465             elif pii:
    466                 enforce_uniqueness = bool(column in self._keys)
--> 467                 transformers[column] = self.create_anonymized_transformer(
    468                     sdtype,
    469                     column_metadata,

/usr/local/lib/python3.9/dist-packages/sdv/data_processing/data_processor.py in create_anonymized_transformer(sdtype, column_metadata, enforce_uniqueness)
    368             kwargs['enforce_uniqueness'] = True
    369 
--> 370         return get_anonymized_transformer(sdtype, kwargs)
    371 
    372     def create_regex_generator(self, column_name, sdtype, column_metadata, is_numeric):

/usr/local/lib/python3.9/dist-packages/sdv/metadata/anonymization.py in get_anonymized_transformer(function_name, function_kwargs)
     97     })
     98 
---> 99     return AnonymizedFaker(**function_kwargs)

/usr/local/lib/python3.9/dist-packages/rdt/transformers/pii/anonymizer.py in __init__(self, provider_name, function_name, function_kwargs, locales, enforce_uniqueness)
     98         self.function_name = function_name if function_name else 'lexify'
     99         self.function_kwargs = deepcopy(function_kwargs) if function_kwargs else {}
--> 100         self.check_provider_function(self.provider_name, self.function_name)
    101         self.output_properties = {None: {'next_transformer': None}}
    102 

/usr/local/lib/python3.9/dist-packages/rdt/transformers/pii/anonymizer.py in check_provider_function(provider_name, function_name)
     61 
     62         except AttributeError as exception:
---> 63             raise TransformerProcessingError(
     64                 f"The '{provider_name}' module does not contain a function named "
     65                 f"'{function_name}'.\nRefer to the Faker docs to find the correct function: "

TransformerProcessingError: The 'en_US' module does not contain a function named 'postcode'.
Refer to the Faker docs to find the correct function: https://faker.readthedocs.io/en/master/providers.html

npatki commented Apr 13, 2023

Update: This issue appears to be a duplicate of #1346. It will be fixed in the upcoming releases of RDT and SDV.

Workaround

In the meantime, you can work around this by changing the sdtype to one that works and then assigning the transformer yourself. Example from #1370:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from rdt.transformers.pii import AnonymizedFaker
from sdv.single_table import GaussianCopulaSynthesizer

DATA = pd.DataFrame(
    data={
        "name": ["simon ross", "eliot lee"],
        "age": [22, 23],
        "sex": ["M", "F"],
        "postcode": ["xc12 3bq", "gd1 9ja"],
    }
)

METADATA = {
    "columns": {
        "name": {"sdtype": "name"},
        "age": {"sdtype": "numerical"},
        "sex": {"sdtype": "categorical"},
        "postcode": {"sdtype": "address"}, # WORKAROUND: Change sdtype in metadata
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
}

metadata = SingleTableMetadata.load_from_dict(METADATA)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(DATA)

# WORKAROUND: Update transformers manually
synthesizer.update_transformers(
    column_name_to_transformer={
        "postcode": AnonymizedFaker(
            provider_name="address",
            function_name="postcode",
            locales=["en_GB"],
        )
    }
)
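
If several columns are affected, the metadata part of this workaround could be applied programmatically before loading the metadata. The sketch below works on plain dictionaries only. `swap_sdtypes` and its `replacements` argument are hypothetical names, not part of SDV. It returns a patched copy of the metadata plus the columns whose sdtype was changed, which are the columns that would still need a manual `update_transformers` call.

```python
from copy import deepcopy


def swap_sdtypes(metadata_dict, replacements):
    """Return a patched copy of a metadata dict and the columns that changed.

    Hypothetical helper. `replacements` maps a problematic sdtype to the
    sdtype that should be used instead. The input dict is left untouched.
    """
    patched = deepcopy(metadata_dict)
    changed = {}  # column name -> original sdtype
    for column, spec in patched["columns"].items():
        original = spec.get("sdtype")
        if original in replacements:
            spec["sdtype"] = replacements[original]
            changed[column] = original
    return patched, changed


METADATA = {
    "columns": {
        "name": {"sdtype": "name"},
        "age": {"sdtype": "numerical"},
        "sex": {"sdtype": "categorical"},
        "postcode": {"sdtype": "postcode"},
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
}

patched, changed = swap_sdtypes(METADATA, {"postcode": "address"})
print(patched["columns"]["postcode"])  # {'sdtype': 'address'}
print(changed)                         # {'postcode': 'postcode'}
```

The patched dictionary can then be passed to `SingleTableMetadata.load_from_dict`, and each column listed in `changed` assigned an `AnonymizedFaker` as in the workaround.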
