This guide describes in detail the main technical components of RDT as well as how to develop them.
The goal of RDT is to be able to transform data that is not machine learning ready into data that is. By machine learning ready, we mean that the data should consist of data types that most machine learning models can process. Usually this means outputting numeric data with no nulls.
The data types used by RDT are called sdtypes. You can think of them as representing the semantic or statistical meaning of a datatype. On top of this, RDT also enforces that those transformations can be reversed, so that data of the original form can be obtained again.
RDT accomplishes this with the use of two main classes:
BaseTransformer
HyperTransformer
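To make the idea of a reversible, machine learning ready transformation concrete, here is a minimal sketch in plain pandas. It is not RDT code: the helper names `to_ml_ready` and `from_ml_ready` and the mean-fill strategy are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd


def to_ml_ready(column):
    """Replace nulls with the column mean and record where they were."""
    fill_value = column.mean()
    return pd.DataFrame({
        'value': column.fillna(fill_value),
        'is_null': column.isna().astype(float),
    })


def from_ml_ready(transformed):
    """Undo to_ml_ready by putting the nulls back where they were."""
    return transformed['value'].where(transformed['is_null'] == 0, np.nan)


ages = pd.Series([31.0, np.nan, 45.0], name='age')
numeric = to_ml_ready(ages)          # numeric columns only, no nulls
recovered = from_ml_ready(numeric)   # nulls restored in their original rows
```

The extra `is_null` column is what makes the step reversible: without it, a filled value could not be told apart from a real one.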
Every Transformer in RDT inherits from the BaseTransformer. The goal of this class or any of its subclasses is to take data of a certain sdtype, and convert it into machine learning ready data. To enable transformers to do this, the BaseTransformer has the following attributes:
INPUT_SDTYPE (str) - The input sdtype for the transformer.
output_properties (dict) - Dictionary mapping transformed column names to a dictionary describing its sdtype and next_transformer.
columns (list) - List of column names that the transformer will transform. Set during fit.
output_columns (list) - List of column names in the output from calling transform. Set during fit.
column_prefix (str) - Prefix that will be added to the output columns to make them unique on the table. For transformers that act on a single column this is the column name, and for transformers that act on multiple columns this is their names joined by #. Set during fit.
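The naming rules for column_prefix and the output columns can be sketched in a few lines of plain Python. The helper names `build_column_prefix` and `build_output_columns` are hypothetical and are not part of RDT. They only mirror the rules described in this guide: input names joined by #, then a dot, then the output name.

```python
def build_column_prefix(columns):
    """Join the input column names with '#'."""
    return '#'.join(columns)


def build_output_columns(columns, output_names):
    """Prefix every output name with the column_prefix and a dot."""
    prefix = build_column_prefix(columns)
    return [f'{prefix}.{name}' for name in output_names]


# A transformer acting on a single column.
single = build_output_columns(['phone_numbers'], ['area_code', 'exchange'])

# A transformer acting on two columns at once.
multi = build_output_columns(['first', 'last'], ['value'])
```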
It also has the following default methods, which the Transformer subclasses may override when necessary:
get_output_sdtypes() - Returns a dictionary mapping the names of the columns that the transform method creates to their sdtypes. By default this will be the sdtype property of the output_properties dictionary with the column names prepended with the column_prefix with a dot (.) as the separator.
get_next_transformers() - Returns a dictionary mapping the names of the columns that the transform method creates to the next transformer to use for those columns. By default this will be the NEXT_TRANSFORMERS dictionary with the keys prepended with the column_prefix with a dot (.) as a separator.
fit(data, columns) - Takes in a pandas.DataFrame and a list of column names and stores the information needed to run transform.
transform(data, drop) - Takes in a pandas.DataFrame and a bool saying whether or not to drop the original columns in the output. Returns a pandas.DataFrame containing the transformed data.
reverse_transform(data, drop) - Takes in a pandas.DataFrame and a bool saying whether or not to drop the transformed columns in the output. Returns a pandas.DataFrame containing the reverse transformed data.
Any subclass of the BaseTransformer can add extra methods or attributes that it needs.
The HyperTransformer class is used to transform an entire table. Under the hood, the HyperTransformer figures out which Transformer classes to use on each column in order to get a machine learning ready output. It does so using the following methods:
fit(data) - Takes in a pandas.DataFrame. For every column or group of columns in the data, it finds a transformer to use on it and calls that transformer's fit method with those columns. If the output of the transformer is not machine learning ready, it will recursively find transformers to use on that sdtype until it is. A sequence of transformers to use is created.
transform(data) - Takes in a pandas.DataFrame. Goes through the sequence of transformers created during fit and calls their underlying transform method.
reverse_transform(data) - Takes in a pandas.DataFrame. Goes through the sequence of transformers created during fit in reverse and calls their underlying reverse_transform method.
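The way a sequence of transformers gets built up can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: the `REGISTRY` mapping, the `ML_READY_SDTYPES` set and the `build_sequence` helper are hypothetical, and each transformer is treated as producing a single output sdtype, whereas a real transformer can create several columns with different sdtypes.

```python
# Hypothetical registry: sdtype -> (transformer name, sdtype it outputs).
REGISTRY = {
    'phone_number': ('USPhoneNumberTransformer', 'categorical'),
    'categorical': ('FrequencyEncoder', 'float'),
}

# Assumed set of sdtypes that need no further transformation.
ML_READY_SDTYPES = {'float', 'integer'}


def build_sequence(sdtype):
    """Keep picking transformers until the output sdtype is ML ready."""
    sequence = []
    while sdtype not in ML_READY_SDTYPES:
        transformer_name, sdtype = REGISTRY[sdtype]
        sequence.append(transformer_name)
    return sequence


# Order used when transforming.
forward = build_sequence('phone_number')

# Reverse transforming walks the same sequence backwards.
backward = list(reversed(forward))
```

Walking the sequence backwards matters because each transformer can only undo the output it produced itself, so the last transformation applied has to be the first one reversed.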
In order to create a new Transformer class, the class should inherit from the BaseTransformer. It should also set the values for the attributes defined above.
Note: Some attributes might not be able to be determined until after fit is called. In this case, those attributes should be set in the _fit method.
The only methods that need to be implemented for a new Transformer class are:
_fit(columns_data)
_transform(columns_data)
_reverse_transform(columns_data)
Take note of the _ preceding each method. The BaseTransformer will call these methods when fit, transform and reverse_transform are called. This is because the BaseTransformer figures out which columns to pass down behind the scenes. All of the _ methods take in a pandas.Series or pandas.DataFrame containing only the columns that will be used by the transformer.
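The relationship between the public methods and their _ counterparts can be sketched with a small stand-in class. `SketchBaseTransformer` and `MaxScaler` are hypothetical and much simpler than RDT's BaseTransformer. They only show the pattern in which the base class selects the columns and the subclass implements the underscore methods.

```python
import pandas as pd


class SketchBaseTransformer:
    """Hypothetical stand-in for illustration, not RDT's BaseTransformer."""

    columns = None

    def fit(self, data, columns):
        self.columns = columns
        self._fit(self._select(data))

    def transform(self, data):
        return self._transform(self._select(data))

    def _select(self, data):
        # A single column is passed down as a Series, several as a DataFrame.
        if len(self.columns) == 1:
            return data[self.columns[0]]
        return data[self.columns]


class MaxScaler(SketchBaseTransformer):
    """Toy subclass that only implements the underscore methods."""

    def _fit(self, columns_data):
        self.maximum = columns_data.max()

    def _transform(self, columns_data):
        return columns_data / self.maximum


table = pd.DataFrame({'price': [5.0, 10.0], 'name': ['a', 'b']})
scaler = MaxScaler()
scaler.fit(table, ['price'])      # _fit only ever sees the 'price' column
scaled = scaler.transform(table)
```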
If, for some reason, the new transformer requires access to all of the data, then the fit, transform and reverse_transform methods can be overridden.
Now that we have some background information on how Transformers work in RDT, let's create a new one. For this example, we will create a simple USPhoneNumberTransformer. The goal of this transformer is to convert strings containing phone numbers into numeric data. For the sake of simplicity, we will assume all phone numbers are of the format ###-###-#### or #-###-###-####.
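The two assumed formats can be written down as a regular expression, which is a convenient way to check input before relying on the assumption. This is a sketch only: `PHONE_PATTERN` and `is_supported_format` are hypothetical names and are not used by the transformer built in this guide.

```python
import re

# ###-###-#### with an optional single-digit country code in front.
PHONE_PATTERN = re.compile(r'(?:\d-)?\d{3}-\d{3}-\d{4}')


def is_supported_format(phone_number):
    """Return True if the whole string matches one of the two formats."""
    return PHONE_PATTERN.fullmatch(phone_number) is not None
```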
Let's start by setting the necessary attributes and writing the __init__ method.
from rdt.transformers import BaseTransformer


class USPhoneNumberTransformer(BaseTransformer):
    INPUT_SDTYPE = 'phone_number'

    def __init__(self):
        self.has_country_code = None
Now we can write the _fit method.
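    def _fit(self, columns_data):
        # Use the first row by position so this works for any index labels.
        number = ''.join(columns_data.iloc[0].split('-'))
        # 11 digits means a country code precedes the 10-digit number.
        self.has_country_code = len(number) == 11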
Since the country_code may or may not be present, we can override the get_next_transformers and get_output_sdtypes methods accordingly.
    def get_output_sdtypes(self):
        output_sdtypes = {
            'area_code': 'categorical',
            'exchange': 'integer',
            'line': 'integer'
        }
        if self.has_country_code:
            output_sdtypes['country_code'] = 'categorical'

        return self._get_output_to_property(output_sdtypes)

    def get_next_transformers(self):
        # Only the categorical outputs need a further transformer.
        next_transformers = {
            'area_code': 'FrequencyEncoder'
        }
        if self.has_country_code:
            next_transformers['country_code'] = 'FrequencyEncoder'

        return self._get_output_to_property(next_transformers)
_get_output_to_property is a private method that prepends the column_prefix attribute to every key in a dictionary. Now that we have this information, we can write the _transform and _reverse_transform methods.
    def _transform(self, data):
        return data.str.split('-', expand=True)

    def _reverse_transform(self, data):
        if self.has_country_code:
            country_code = data.iloc[:, 0].astype('str')
            area_code = data.iloc[:, 1].astype('str')
            exchange = data.iloc[:, 2].astype('str')
            line = data.iloc[:, 3].astype('str')
            return country_code + '-' + area_code + '-' + exchange + '-' + line

        area_code = data.iloc[:, 0].astype('str')
        exchange = data.iloc[:, 1].astype('str')
        line = data.iloc[:, 2].astype('str')
        return area_code + '-' + exchange + '-' + line
We don't have to worry about the naming of the output columns because the BaseTransformer handles that for us. Let's view the complete class below.
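The following sketch assembles the class from the pieces discussed so far. To keep it runnable without RDT installed, it inherits from `StubBaseTransformer`, a hypothetical stand-in that only provides column_prefix and a `_get_output_to_property` helper behaving as described above. In an RDT project the class would inherit from the real BaseTransformer instead. The underscore methods are called directly at the end only because the stub has no fit or transform of its own.

```python
import pandas as pd


class StubBaseTransformer:
    """Hypothetical minimal stand-in, not RDT's BaseTransformer."""

    column_prefix = None

    def _get_output_to_property(self, values):
        # Prepend the column_prefix and a dot to every key.
        return {f'{self.column_prefix}.{key}': value for key, value in values.items()}


class USPhoneNumberTransformer(StubBaseTransformer):
    INPUT_SDTYPE = 'phone_number'

    def __init__(self):
        self.has_country_code = None

    def _fit(self, columns_data):
        # iloc picks the first row regardless of the index labels.
        number = ''.join(columns_data.iloc[0].split('-'))
        self.has_country_code = len(number) == 11

    def get_output_sdtypes(self):
        output_sdtypes = {'area_code': 'categorical', 'exchange': 'integer', 'line': 'integer'}
        if self.has_country_code:
            output_sdtypes['country_code'] = 'categorical'
        return self._get_output_to_property(output_sdtypes)

    def get_next_transformers(self):
        next_transformers = {'area_code': 'FrequencyEncoder'}
        if self.has_country_code:
            next_transformers['country_code'] = 'FrequencyEncoder'
        return self._get_output_to_property(next_transformers)

    def _transform(self, data):
        return data.str.split('-', expand=True)

    def _reverse_transform(self, data):
        if self.has_country_code:
            country_code = data.iloc[:, 0].astype('str')
            area_code = data.iloc[:, 1].astype('str')
            exchange = data.iloc[:, 2].astype('str')
            line = data.iloc[:, 3].astype('str')
            return country_code + '-' + area_code + '-' + exchange + '-' + line

        area_code = data.iloc[:, 0].astype('str')
        exchange = data.iloc[:, 1].astype('str')
        line = data.iloc[:, 2].astype('str')
        return area_code + '-' + exchange + '-' + line


phone_numbers = pd.Series(['1-202-555-0191', '1-202-555-0151'], name='phone_numbers')
transformer = USPhoneNumberTransformer()
transformer.column_prefix = 'phone_numbers'  # set by hand because the stub has no fit
transformer._fit(phone_numbers)
parts = transformer._transform(phone_numbers)
restored = transformer._reverse_transform(parts)
```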
Now we can see our USPhoneNumberTransformer in action.
In [1]: transformer = USPhoneNumberTransformer()
   ...: data = pd.DataFrame({
   ...:     'phone_numbers': ['1-202-555-0191', '1-202-555-0151', '1-202-867-5309']
   ...: })
   ...: transformer.fit(data, ['phone_numbers'])
   ...: transformed = transformer.transform(data)

In [2]: transformed
Out [2]:
   phone_numbers.area_code  phone_numbers.exchange  phone_numbers.line  phone_numbers.country_code
0                        1                     202                 555                        0191
1                        1                     202                 555                        0151
2                        1                     202                 867                        5309

In [3]: reverse_transformed = transformer.reverse_transform(transformed)

In [4]: reverse_transformed
Out [4]:
    phone_numbers
0  1-202-555-0191
1  1-202-555-0151
2  1-202-867-5309
We can also run it using the HyperTransformer.
In [1]: ht = HyperTransformer(
   ...:     default_sdtype_transformers={'phone_number': USPhoneNumberTransformer},
   ...:     field_sdtypes={'phone_numbers': 'phone_number'}
   ...: )
   ...: ht.fit(data)
   ...: transformed = ht.transform(data)

In [2]: transformed
Out [2]:
   phone_numbers.area_code.value  phone_numbers.exchange  phone_numbers.line  phone_numbers.country_code.value
0                            0.5                     202                 555                          0.500000
1                            0.5                     202                 555                          0.166667
2                            0.5                     202                 867                          0.833333

In [3]: reverse_transformed = ht.reverse_transform(transformed)

In [4]: reverse_transformed
Out [4]:
    phone_numbers
0  1-202-555-0191
1  1-202-555-0151
2  1-202-867-5309
In RDT, performance tests are run to ensure that each transformer is efficient. In order to run these tests, we have classes that generate datasets of a certain sdtype. If a new transformer introduces a new sdtype, then a DatasetGenerator class will need to be added for it.
All dataset generators inherit from the BaseDatasetGenerator class. It has the following class attribute:
SDTYPE (str) - The sdtype for the class to generate.
They must implement the following methods:
generate(num_rows) - Takes in an int representing the number of rows to generate. Returns a numpy.ndarray of size num_rows where each value is of the class' SDTYPE.
get_performance_thresholds() - Returns a dict mapping each of the main methods for a transformer (fit, transform, reverse_transform) to the expected time and memory it takes for those methods to run on 1 row.
To create a new DatasetGenerator, the methods described above need to be implemented. The class should be placed in a new file in the following location: tests/datasets/{SDTYPE}.py. Each generator must inherit from the base class as well as abc.ABC.
Let's create a DatasetGenerator for the phone_number sdtype that we introduced earlier. We can start by implementing the generate method and setting the SDTYPE.
from abc import ABC

import numpy as np

from tests.datasets.base import BaseDatasetGenerator


class USPhoneNumberGenerator(BaseDatasetGenerator, ABC):
    SDTYPE = 'phone_number'

    @staticmethod
    def generate(num_rows):
        # randint excludes the upper bound, so these cover 100-999 and 1000-9999.
        area_codes = np.random.randint(low=100, high=1000, size=num_rows).astype(str)
        exchange = np.random.randint(low=100, high=1000, size=num_rows).astype(str)
        line = np.random.randint(low=1000, high=10000, size=num_rows).astype(str)
        # Join the three parts of each row with dashes.
        return np.apply_along_axis('-'.join, 0, [area_codes, exchange, line])
In order for the tests to run, the generator must also implement the get_performance_thresholds method. The times are specified in seconds and the memory in bytes.
    @staticmethod
    def get_performance_thresholds():
        """Return the expected thresholds."""
        return {
            'fit': {
                'time': 1,
                'memory': 100.0
            },
            'transform': {
                'time': 1,
                'memory': 1000.0
            },
            'reverse_transform': {
                'time': 1,
                'memory': 1000.0,
            }
        }
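To get a feel for what such thresholds mean, the sketch below measures the time and peak memory of a single call and compares them with per-row thresholds. It is an assumption about how a check of this kind could be written, not RDT's actual test harness. The `measure` and `within_thresholds` helpers are hypothetical, and scaling the thresholds linearly by the number of rows is a simplification.

```python
import time
import tracemalloc


def measure(function, *args):
    """Return (seconds, peak_bytes) used by one call to function."""
    tracemalloc.start()
    start = time.perf_counter()
    function(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


def within_thresholds(measured, thresholds, num_rows):
    """Compare a measurement with per-row thresholds scaled by num_rows."""
    elapsed, peak = measured
    time_ok = elapsed <= thresholds['time'] * num_rows
    memory_ok = peak <= thresholds['memory'] * num_rows
    return time_ok and memory_ok


# Example: measure sorting a list of 1000 integers.
measured = measure(sorted, list(range(1000)))
```

A measurement can then be checked with, for example, `within_thresholds(measured, {'time': 1, 'memory': 100.0}, 1000)`.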
To view the result of the generator we can run the following:
In [1]: USPhoneNumberGenerator.generate(100)
Out [1]:
array(['160-919-7653', '347-212-8425', '717-820-4356', '483-675-6853',
'656-141-2176', '681-981-5310', '314-989-4289', '138-343-6582',
'406-683-8597', '639-156-5496', '625-600-1649', '110-477-8992',
'770-731-6200', '166-491-9881', '418-682-9540', '889-169-1878',
'660-213-4713', '270-506-9422', '323-691-2507', '189-158-5409',
'605-218-6776', '944-980-8854', '773-290-6675', '969-724-8712',
'617-979-3609', '145-828-6455', '570-923-8982', '260-800-5404',
'301-453-3972', '454-629-5258', '298-394-6958', '700-285-1703',
'439-683-2711', '935-387-1178', '151-643-7354', '549-741-6070',
'617-142-6518', '759-653-4626', '482-778-1256', '909-538-2919',
'772-617-8616', '691-559-2419', '274-200-5514', '744-163-6255',
'760-709-7880', '909-782-6044', '826-607-6956', '902-609-2589',
'345-796-8422', '818-867-9468', '430-906-3757', '143-788-5794',
'340-705-3813', '211-447-7218', '912-799-7431', '840-211-5830',
'752-600-1938', '236-659-2646', '591-946-1546', '903-564-4356',
'928-847-8630', '315-775-9896', '384-323-8186', '192-282-8873',
'861-497-3333', '839-304-2029', '674-261-5948', '721-642-9755',
'761-787-2193', '429-720-9832', '126-876-2681', '327-533-3443',
'170-210-5689', '916-945-8487', '619-332-6223', '515-453-5862',
'509-666-4074', '231-687-8172', '489-862-2525', '602-456-5236',
'549-936-9406', '471-989-5828', '424-436-1012', '405-996-8833',
'786-811-5453', '851-897-7043', '462-381-9671', '328-267-1474',
'482-171-7564', '245-353-7712', '589-535-6689', '864-252-5314',
'990-737-8649', '112-189-9047', '126-316-8627', '985-724-3452',
'119-612-8449', '456-529-1190', '344-956-1910', '125-962-2067'],
dtype='<U12')