Create standard Merlin dtypes in the merlin.dtypes module #170

Merged
merged 48 commits into from
Jan 13, 2023
Commits
982ed40
Prototype: `merlin.dtypes` module
karlhigley Nov 20, 2022
eb4f3b0
Incorporate Merlin dtypes into `ColumnSchema`
karlhigley Nov 21, 2022
39f0b9a
Extract dtypes and mappings from `merlin.dtype` module's init
karlhigley Nov 22, 2022
7dcdc56
Minor fixes
karlhigley Nov 22, 2022
ac0d0f7
Add `ElementUnit` enum and `Dtype.elemunit` field
karlhigley Nov 28, 2022
998f1f3
Add better error messages to `DType.to()` when mappings not found
karlhigley Nov 28, 2022
56f35af
Add convenience methods `to_python`/`to_numpy` to `DType` class
karlhigley Nov 28, 2022
abe2036
Extend the available `datetime64` types to include units
karlhigley Nov 28, 2022
7f84f70
Add the `datetime` types to the Numpy dtype mapping
karlhigley Nov 28, 2022
5fbf5c6
Prefer `np.dtype(...)` objects over dtype classes in Numpy mapping
karlhigley Nov 28, 2022
b6317be
Add mapping for Triton dtypes (if `tritonclient` is available)
karlhigley Nov 28, 2022
7686daa
Remove stray `tritonclient` import
karlhigley Nov 28, 2022
81b6252
Add matching and translation methods to `DTypeMapping`
karlhigley Nov 28, 2022
e4bee54
Add dtype mappings for Tensorflow and PyTorch
karlhigley Nov 28, 2022
ceeff89
Apply new matching and translation methods
karlhigley Nov 28, 2022
be4fca0
Add `is_integer`/`is_float`
karlhigley Nov 28, 2022
c4519da
Add a partial dtype mapping for Pandas
karlhigley Nov 29, 2022
9b74bb5
Clean up the Tensorflow/PyTorch dtype mappings
karlhigley Nov 29, 2022
81f4d4d
Migrate dtype normalization code out of `ColumnSchema`
karlhigley Nov 30, 2022
7a33336
Remove stray preprocessing lambda (for now)
karlhigley Nov 30, 2022
4b1d4bc
Remove unused `shape`-related properties (for now)
karlhigley Nov 30, 2022
71d0e31
Remove unused imports
karlhigley Nov 30, 2022
1135296
Remove `shape`-related tests
karlhigley Nov 30, 2022
da49e34
Normalize `ColumnSchema` bool fields to avoid having `None`
karlhigley Nov 30, 2022
3112636
Rename some method parameters for clarity
karlhigley Nov 30, 2022
c9e0620
Improve the organization of the dtype mapping code
karlhigley Nov 30, 2022
a979cbd
Move the fallback Numpy dtype conversion to the `numpy` mapping
karlhigley Nov 30, 2022
57fecce
Fix a typo
karlhigley Nov 30, 2022
448f2ea
Flesh out the Numpy dtype translation fallback function
karlhigley Nov 30, 2022
920623f
Remove the unknown dtype exception test (no longer feasible)
karlhigley Nov 30, 2022
7dc0123
Clean up the main `dtype` module `__call__()` method
karlhigley Nov 30, 2022
d05f5d6
Slight comment tweak
karlhigley Nov 30, 2022
f77aff8
Make the TF dtype mapping base class param explicit
karlhigley Nov 30, 2022
b68af65
Add docstrings and fix linter errors
karlhigley Nov 30, 2022
a4fba37
Remove stray whitespace
karlhigley Nov 30, 2022
6049ace
Apply the `is_integer`/`is_float` helpers
karlhigley Nov 30, 2022
a5aa4f9
Improve the docstrings and comments in the `dtype` module
karlhigley Nov 30, 2022
2e830b5
Clean up `dtype` test organization
karlhigley Nov 30, 2022
c334a45
Placate pylint about modules not being callable
karlhigley Nov 30, 2022
5eebc79
Fix import formatting
karlhigley Nov 30, 2022
ff0d88f
Merge branch 'main' into feature/merlin-dtypes
karlhigley Nov 30, 2022
c2c1f96
Adjust the syntax for using dtypes
karlhigley Dec 1, 2022
d961c5c
Merge branch 'main' into feature/merlin-dtypes
karlhigley Dec 1, 2022
4049ab0
Adjust syntax to `md.dtype()`, add unknown dtype, clean up preprocessing
karlhigley Dec 5, 2022
85cc882
Merge branch 'main' into feature/merlin-dtypes
karlhigley Jan 12, 2023
6020563
Merge branch 'main' into feature/merlin-dtypes
karlhigley Jan 13, 2023
90c1dbf
Merge branch 'main' into feature/merlin-dtypes
karlhigley Jan 13, 2023
1c3a572
Use more explicit names for `DType` fields
karlhigley Jan 13, 2023
3 changes: 2 additions & 1 deletion merlin/dag/dictarray.py
@@ -17,6 +17,7 @@

 import numpy as np

+import merlin.dtypes as md
 from merlin.core.protocols import SeriesLike


@@ -29,7 +30,7 @@ def __init__(self, values, dtype=None):
         super().__init__()

         self.values = values
-        self.dtype = dtype or values.dtype
+        self.dtype = md.dtype(dtype or values.dtype)

     def __getitem__(self, index):
         return self.values[index]
28 changes: 22 additions & 6 deletions merlin/dag/executors.py
@@ -19,6 +19,7 @@
 import pandas as pd
 from dask.core import flatten

+import merlin.dtypes as md
 from merlin.core.dispatch import concat_columns, is_list_dtype, list_val_dtype
 from merlin.core.utils import (
     ensure_optimize_dataframe_graph,
@@ -177,17 +178,27 @@ def _transform_data(self, node, input_data, capture_dtypes=False):

                 if is_list:
                     col_dtype = list_val_dtype(col_series)
-                    if hasattr(col_dtype, "as_numpy_dtype"):
-                        col_dtype = col_dtype.as_numpy_dtype()
-                    elif hasattr(col_series, "numpy"):
+
+                    # TODO: Add a utility that condenses the known methods of fetching dtypes
+                    # from series/arrays into a single function, so that Tensorflow specific
+                    # code doesn't leak into the executors
Review comment (Contributor Author):

The need for this new utility was already latent in the existing code, but it becomes a lot clearer once you have a cross-framework way to translate dtypes. Chose not to tackle it here since that would expand the scope of this already somewhat large PR.

+                    if not hasattr(col_dtype, "as_numpy_dtype") and hasattr(col_series, "numpy"):
                         col_dtype = col_series[0].cpu().numpy().dtype

                 output_data_schema = output_col_schema.with_dtype(col_dtype, is_list=is_list)

                 if capture_dtypes:
                     node.output_schema.column_schemas[col_name] = output_data_schema
                 elif len(output_data):
-                    if output_col_schema.dtype != output_data_schema.dtype:
+                    # Validate that the dtypes match but only if they both exist
+                    # (since schemas may not have all dtypes specified, especially
+                    # in the tests)
+                    if (
+                        output_col_schema.dtype
+                        and output_data_schema.dtype
+                        and output_col_schema.dtype != md.string
+                        and output_col_schema.dtype != output_data_schema.dtype
+                    ):
                         raise TypeError(
                             f"Dtype discrepancy detected for column {col_name}: "
                             f"operator {node.op.label} reported dtype "
@@ -269,11 +280,16 @@ def transform(
         columns = list(flatten(wfn.output_columns.names for wfn in nodes))
         columns += additional_columns if additional_columns else []

+        if isinstance(output_dtypes, dict):
+            for col_name, col_dtype in output_dtypes.items():
+                if col_dtype:
+                    output_dtypes[col_name] = md.dtype(col_dtype).to_numpy
+
         if isinstance(output_dtypes, dict) and isinstance(ddf._meta, pd.DataFrame):
             dtypes = output_dtypes
             output_dtypes = type(ddf._meta)({k: [] for k in columns})
-            for column, dtype in dtypes.items():
-                output_dtypes[column] = output_dtypes[column].astype(dtype)
+            for col_name, col_dtype in dtypes.items():
+                output_dtypes[col_name] = output_dtypes[col_name].astype(col_dtype)

         elif not output_dtypes:
             # TODO: constructing meta like this loses dtype information on the ddf
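The TODO in `_transform_data` asks for a single helper that hides the per-framework ways of reading a dtype from a series or array. A minimal sketch of what such a helper might look like follows. `fetch_dtype`, `FakeSeries`, and `FakeFrameworkDType` are hypothetical names used only for illustration and are not part of Merlin; the attribute probe simply mirrors the `hasattr(col_dtype, "as_numpy_dtype")` check in the executor code.

```python
# Hypothetical sketch: none of these names exist in Merlin.
class FakeFrameworkDType:
    """Stand-in for a framework dtype that exposes `as_numpy_dtype`."""

    def __init__(self, numpy_name):
        self.as_numpy_dtype = numpy_name


class FakeSeries:
    """Stand-in for a series/array that exposes a `dtype` attribute."""

    def __init__(self, dtype):
        self.dtype = dtype


def fetch_dtype(series):
    """Return a framework-neutral dtype for `series`, probing known attributes."""
    dtype = getattr(series, "dtype", None)
    if dtype is None:
        raise TypeError(
            f"Can't determine a dtype for object of type {type(series).__name__}"
        )
    # Dtypes that know their Numpy equivalent are unwrapped here, so callers
    # such as an executor need no framework-specific branches.
    if hasattr(dtype, "as_numpy_dtype"):
        return dtype.as_numpy_dtype
    return dtype


plain = fetch_dtype(FakeSeries("int64"))
wrapped = fetch_dtype(FakeSeries(FakeFrameworkDType("float32")))
print(plain, wrapped)
```

With a helper along these lines, the executor would call one function and pass the result to `md.dtype()`, keeping the framework probing in a single place.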
61 changes: 61 additions & 0 deletions merlin/dtypes/__init__.py
@@ -0,0 +1,61 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# flake8: noqa
from merlin.dtypes import mappings
from merlin.dtypes.aliases import *
from merlin.dtypes.base import DType
from merlin.dtypes.registry import _dtype_registry

# Convenience alias for registering dtypes
register = _dtype_registry.register


def dtype(external_dtype):
    # If the supplied dtype is None, then there's not a default dtype we can
    # universally translate to across frameworks, so raise an error and help
    # the downstream developer figure out how to handle that case explicitly
    if external_dtype is None:
        raise TypeError(
            "Merlin doesn't provide a default dtype mapping for `None`. "
            "This differs from the Numpy behavior you may be expecting, "
            "which treats `None` as an alias for `np.float64`. If you're "
            "expecting this dtype to be non-`None`, there may be an issue "
            "in upstream code. If you'd like to allow this dtype to be `None`, "
            "you can use a `try/except` to catch this error."
        )

    # If the supplied dtype is already a Merlin dtype, then there's
    # nothing for us to do and we can exit early
    if isinstance(external_dtype, DType):
        return external_dtype

    # If not, attempt to apply all the registered Merlin dtype mappings.
    # If we don't find a match with those, fall back on converting to
    # a numpy dtype and trying to match that instead.
    try:
        return _dtype_registry.to_merlin(external_dtype)
    except TypeError as base_exc:

        try:
            return _dtype_registry.to_merlin_via_numpy(external_dtype)
        except TypeError as exc:
            # If we fail to find a match even after we try converting to
            # numpy, re-raise the original exception because it has more
            # information about the original external dtype that's causing
            # the problem. (We want to highlight that dtype, not whatever
            # numpy dtype it was converted to in the interim.)
            raise base_exc from exc
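The two-stage lookup at the end of `dtype()` relies on `raise base_exc from exc` to surface the first error while keeping the fallback error attached as its cause. The sketch below isolates that pattern with plain dictionaries. `DIRECT`, `VIA_NUMPY`, `convert`, and the two lookup functions are hypothetical stand-ins, not the real registry methods, and the string values are placeholders for dtype objects.

```python
# Hypothetical stand-ins for the registry lookups; not Merlin's implementation.
DIRECT = {"int64": "merlin.int64"}
VIA_NUMPY = {"<i8": "merlin.int64"}


def to_merlin(external):
    try:
        return DIRECT[external]
    except KeyError as exc:
        raise TypeError(f"No direct mapping for {external!r}") from exc


def to_merlin_via_numpy(external):
    try:
        return VIA_NUMPY[external]
    except KeyError as exc:
        raise TypeError(f"No Numpy-based mapping for {external!r}") from exc


def convert(external):
    try:
        return to_merlin(external)
    except TypeError as base_exc:
        try:
            return to_merlin_via_numpy(external)
        except TypeError as exc:
            # Re-raise the first error, since it names the original input;
            # the fallback error stays reachable through `__cause__`.
            raise base_exc from exc


print(convert("int64"))
print(convert("<i8"))

try:
    convert("bogus")
except TypeError as err:
    print("error:", err)
    print("cause:", err.__cause__)
```

The caller sees the message from the first lookup, and the traceback still shows the fallback failure as the direct cause.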
52 changes: 52 additions & 0 deletions merlin/dtypes/aliases.py
@@ -0,0 +1,52 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from merlin.dtypes.base import DType, ElementType, ElementUnit

# Unsigned Integer
uint8 = DType("uint8", ElementType.UInt, 8)
uint16 = DType("uint16", ElementType.UInt, 16)
uint32 = DType("uint32", ElementType.UInt, 32)
uint64 = DType("uint64", ElementType.UInt, 64)

# Signed Integer
int8 = DType("int8", ElementType.Int, 8, signed=True)
int16 = DType("int16", ElementType.Int, 16, signed=True)
int32 = DType("int32", ElementType.Int, 32, signed=True)
int64 = DType("int64", ElementType.Int, 64, signed=True)

# Float
float16 = DType("float16", ElementType.Float, 16, signed=True)
float32 = DType("float32", ElementType.Float, 32, signed=True)
float64 = DType("float64", ElementType.Float, 64, signed=True)

# Date/Time
datetime64 = DType("datetime64", ElementType.DateTime, 64)
datetime64Y = DType("datetime64[Y]", ElementType.DateTime, 64, ElementUnit.Year)
datetime64M = DType("datetime64[M]", ElementType.DateTime, 64, ElementUnit.Month)
datetime64D = DType("datetime64[D]", ElementType.DateTime, 64, ElementUnit.Day)
datetime64h = DType("datetime64[h]", ElementType.DateTime, 64, ElementUnit.Hour)
datetime64m = DType("datetime64[m]", ElementType.DateTime, 64, ElementUnit.Minute)
datetime64s = DType("datetime64[s]", ElementType.DateTime, 64, ElementUnit.Second)
datetime64ms = DType("datetime64[ms]", ElementType.DateTime, 64, ElementUnit.Millisecond)
datetime64us = DType("datetime64[us]", ElementType.DateTime, 64, ElementUnit.Microsecond)
datetime64ns = DType("datetime64[ns]", ElementType.DateTime, 64, ElementUnit.Nanosecond)

# Miscellaneous
string = DType("str", ElementType.String)
boolean = DType("bool", ElementType.Bool)
object_ = DType("object", ElementType.Object)
unknown = DType("unknown", ElementType.Unknown)
Review comment (Collaborator):

This dtype is essential. It replaces `None` with our own logic, so we can catch that case explicitly.
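One way a caller might use the `unknown` alias in place of `None` is sketched below. This is an assumption about usage, not code from the PR: `dtype_or_unknown` is a hypothetical wrapper, and the `DType` and `dtype()` definitions are pared-down stand-ins so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DType:
    """Hypothetical, pared-down stand-in for the Merlin `DType` dataclass."""

    name: str


unknown = DType("unknown")
int32 = DType("int32")


def dtype(external_dtype):
    """Simplified stand-in for `md.dtype()`: rejects `None`, passes DTypes through."""
    if external_dtype is None:
        raise TypeError("No default dtype mapping for `None`")
    if isinstance(external_dtype, DType):
        return external_dtype
    raise TypeError(f"No mapping found for {external_dtype!r}")


def dtype_or_unknown(external_dtype):
    """Hypothetical wrapper: treat a missing dtype as `unknown`, nothing else."""
    if external_dtype is None:
        return unknown
    return dtype(external_dtype)


print(dtype_or_unknown(None).name)
print(dtype_or_unknown(int32).name)
```

The explicit `None` check is used here instead of a broad `try/except TypeError` so that a genuinely unmatched dtype still raises rather than being silently mapped to `unknown`.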

127 changes: 127 additions & 0 deletions merlin/dtypes/base.py
@@ -0,0 +1,127 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from dataclasses import dataclass
from enum import Enum
from typing import Optional

from merlin.dtypes.registry import _dtype_registry


class ElementType(Enum):
    """
    Merlin DType base types

    Since a Merlin DType may describe a list, these are either the types of
    scalars or the types of list elements.
    """

    Bool = "bool"
    Int = "int"
    UInt = "uint"
    Float = "float"
    String = "string"
    DateTime = "datetime"
    Object = "object"
    Unknown = "unknown"


class ElementUnit(Enum):
    """
    Dtype units, used only for datetime types

    Since a Merlin DType may describe a list, these are either the units of
    scalars or the units of list elements.
    """

    Year = "year"
    Month = "month"
    Day = "day"
    Hour = "hour"
    Minute = "minute"
    Second = "second"
    Millisecond = "millisecond"
    Microsecond = "microsecond"
    Nanosecond = "nanosecond"


@dataclass(eq=True, frozen=True)
class DType:
    """
    Merlin dtypes are objects of this dataclass
    """

    name: str
    element_type: ElementType
    element_size: Optional[int] = None
    element_unit: Optional[ElementUnit] = None
    signed: Optional[bool] = None

    def to(self, mapping_name: str):
        """
        Convert this Merlin dtype to another framework's dtypes

        Parameters
        ----------
        mapping_name : str
            Name of the framework dtype mapping to apply

        Returns
        -------
        Any
            An external framework dtype object

        Raises
        ------
        ValueError
            If there is no registered mapping for the given framework name
        ValueError
            The registered mapping for the given framework name doesn't map
            this Merlin dtype to a framework dtype
        """
        try:
            mapping = _dtype_registry.mappings[mapping_name]
        except KeyError as exc:
            raise ValueError(
                f"Merlin doesn't have a registered dtype mapping for '{mapping_name}'. "
                "If you'd like to register a new dtype mapping, use `merlin.dtypes.register()`. "
                "If you're expecting this mapping to already exist, has the library or package "
                "that defines the mapping been imported successfully?"
            ) from exc

        try:
            return mapping.from_merlin(self)
        except KeyError as exc:
            raise ValueError(
                f"The registered dtype mapping for {mapping_name} doesn't contain type {self.name}."
            ) from exc

    @property
    def to_numpy(self):
        return self.to("numpy")

    @property
    def to_python(self):
        return self.to("python")

    # These properties refer to a single scalar (potentially a list element)
    @property
    def is_integer(self):
        return self.element_type.value == "int"

    @property
    def is_float(self):
        return self.element_type.value == "float"
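The sketch below reproduces a cut-down copy of the `DType` dataclass to illustrate two consequences of the definition as written in this diff: `frozen=True` together with `eq=True` makes instances comparable by field and hashable, and `is_integer` compares against `"int"` only, so unsigned types such as `uint8` report `False`. The trimmed enum and field list are simplifications for illustration, and the `lookup` dictionary is a hypothetical example, not a Merlin mapping.

```python
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import Optional


class ElementType(Enum):
    Int = "int"
    UInt = "uint"
    Float = "float"


@dataclass(eq=True, frozen=True)
class DType:
    name: str
    element_type: ElementType
    element_size: Optional[int] = None

    @property
    def is_integer(self):
        # Same comparison as in the diff: only ElementType.Int matches
        return self.element_type.value == "int"


int32 = DType("int32", ElementType.Int, 32)
uint8 = DType("uint8", ElementType.UInt, 8)

# Equality is field-wise, so a separately built instance compares equal
same = int32 == DType("int32", ElementType.Int, 32)

# Frozen dataclasses with eq=True are hashable and can be used as dict keys
lookup = {int32: "signed 32-bit"}

# Attempting to mutate a frozen instance raises FrozenInstanceError
try:
    int32.name = "renamed"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(same, lookup[int32], mutated)
print(int32.is_integer, uint8.is_integer)
```

Callers that want to treat signed and unsigned integers alike would need to check `element_type` against both `ElementType.Int` and `ElementType.UInt` themselves.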