Create standard Merlin dtypes in the `merlin.dtypes` module #170
Merged: karlhigley merged 48 commits into NVIDIA-Merlin:main from karlhigley:feature/merlin-dtypes on Jan 13, 2023
982ed40 Prototype: `merlin.dtypes` module (karlhigley)
eb4f3b0 Incorporate Merlin dtypes into `ColumnSchema` (karlhigley)
39f0b9a Extract dtypes and mappings from `merlin.dtype` module's init (karlhigley)
7dcdc56 Minor fixes (karlhigley)
ac0d0f7 Add `ElementUnit` enum and `Dtype.elemunit` field (karlhigley)
998f1f3 Add better error messages to `DType.to()` when mappings not found (karlhigley)
56f35af Add convenience methods `to_python`/`to_numpy` to `DType` class (karlhigley)
abe2036 Extend the available `datetime64` types to include units (karlhigley)
7f84f70 Add the `datetime` types to the Numpy dtype mapping (karlhigley)
5fbf5c6 Prefer `np.dtype(...)` objects over dtype classes in Numpy mapping (karlhigley)
b6317be Add mapping for Triton dtypes (if `tritonclient` is available) (karlhigley)
7686daa Remove stray `tritonclient` import (karlhigley)
81b6252 Add matching and translation methods to `DTypeMapping` (karlhigley)
e4bee54 Add dtype mappings for Tensorflow and PyTorch (karlhigley)
ceeff89 Apply new matching and translation methods (karlhigley)
be4fca0 Add `is_integer`/`is_float` (karlhigley)
c4519da Add a partial dtype mapping for Pandas (karlhigley)
9b74bb5 Clean up the Tensorflow/PyTorch dtype mappings (karlhigley)
81f4d4d Migrate dtype normalization code out of `ColumnSchema` (karlhigley)
7a33336 Remove stray preprocessing lambda (for now) (karlhigley)
4b1d4bc Remove unused `shape`-related properties (for now) (karlhigley)
71d0e31 Remove unused imports (karlhigley)
1135296 Remove `shape`-related tests (karlhigley)
da49e34 Normalize `ColumnSchema` bool fields to avoid having `None` (karlhigley)
3112636 Rename some method parameters for clarity (karlhigley)
c9e0620 Improve the organization of the dtype mapping code (karlhigley)
a979cbd Move the fallback Numpy dtype conversion to the `numpy` mapping (karlhigley)
57fecce Fix a typo (karlhigley)
448f2ea Flesh out the Numpy dtype translation fallback function (karlhigley)
920623f Remove the unknown dtype exception test (no longer feasible) (karlhigley)
7dc0123 Clean up the main `dtype` module `__call__()` method (karlhigley)
d05f5d6 Slight comment tweak (karlhigley)
f77aff8 Make the TF dtype mapping base class param explicit (karlhigley)
b68af65 Add docstrings and fix linter errors (karlhigley)
a4fba37 Remove stray whitespace (karlhigley)
6049ace Apply the `is_integer`/`is_float` helpers (karlhigley)
a5aa4f9 Improve the docstrings and comments in the `dtype` module (karlhigley)
2e830b5 Clean up `dtype` test organization (karlhigley)
c334a45 Placate pylint about modules not being callable (karlhigley)
5eebc79 Fix import formatting (karlhigley)
ff0d88f Merge branch 'main' into feature/merlin-dtypes (karlhigley)
c2c1f96 Adjust the syntax for using dtypes (karlhigley)
d961c5c Merge branch 'main' into feature/merlin-dtypes (karlhigley)
4049ab0 Adjust syntax to `md.dtype()`, add unknown dtype, clean up preprocessing (karlhigley)
85cc882 Merge branch 'main' into feature/merlin-dtypes (karlhigley)
6020563 Merge branch 'main' into feature/merlin-dtypes (karlhigley)
90c1dbf Merge branch 'main' into feature/merlin-dtypes (karlhigley)
1c3a572 Use more explicit names for `DType` fields (karlhigley)
@@ -0,0 +1,61 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# flake8: noqa
from merlin.dtypes import mappings
from merlin.dtypes.aliases import *
from merlin.dtypes.base import DType
from merlin.dtypes.registry import _dtype_registry

# Convenience alias for registering dtypes
register = _dtype_registry.register


def dtype(external_dtype):
    # If the supplied dtype is None, then there's not a default dtype we can
    # universally translate to across frameworks, so raise an error and help
    # the downstream developer figure out how to handle that case explicitly
    if external_dtype is None:
        raise TypeError(
            "Merlin doesn't provide a default dtype mapping for `None`. "
            "This differs from the Numpy behavior you may be expecting, "
            "which treats `None` as an alias for `np.float64`. If you're "
            "expecting this dtype to be non-`None`, there may be an issue "
            "in upstream code. If you'd like to allow this dtype to be `None`, "
            "you can use a `try/except` to catch this error."
        )

    # If the supplied dtype is already a Merlin dtype, then there's
    # nothing for us to do and we can exit early
    if isinstance(external_dtype, DType):
        return external_dtype

    # If not, attempt to apply all the registered Merlin dtype mappings.
    # If we don't find a match with those, fall back on converting to
    # a numpy dtype and trying to match that instead.
    try:
        return _dtype_registry.to_merlin(external_dtype)
    except TypeError as base_exc:

        try:
            return _dtype_registry.to_merlin_via_numpy(external_dtype)
        except TypeError as exc:
            # If we fail to find a match even after we try converting to
            # numpy, re-raise the original exception because it has more
            # information about the original external dtype that's causing
            # the problem. (We want to highlight that dtype, not whatever
            # numpy dtype it was converted to in the interim.)
            raise base_exc from exc
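The lookup order in `dtype()` (reject `None`, try the direct mappings, fall back to a normalized form, and re-raise the first error if both fail) can be sketched without Merlin installed. Everything below is a hypothetical stand-in: `_DIRECT`, `to_merlin`, `to_merlin_via_normalized`, and `lookup` are illustrative names, the string normalization replaces the NumPy conversion, and none of it reflects how `_dtype_registry` is actually implemented, since the registry is not part of this diff.

```python
# Hypothetical stand-ins: none of these names come from merlin.dtypes.
_DIRECT = {"int32": "merlin.int32", "float32": "merlin.float32"}


def to_merlin(external):
    """Direct lookup; raises TypeError naming the original input."""
    try:
        return _DIRECT[external]
    except (KeyError, TypeError) as exc:
        raise TypeError(f"no direct mapping for {external!r}") from exc


def to_merlin_via_normalized(external):
    """Fallback: normalize the input to a lowercase name, then retry."""
    normalized = str(getattr(external, "__name__", external)).lower()
    return to_merlin(normalized)


def lookup(external):
    if external is None:
        raise TypeError("no default dtype mapping for None")
    try:
        return to_merlin(external)
    except TypeError as base_exc:
        try:
            return to_merlin_via_normalized(external)
        except TypeError as exc:
            # Re-raise the first error so the message names the caller's
            # input rather than the normalized intermediate value.
            raise base_exc from exc
```

With this sketch, `lookup("FLOAT32")` succeeds through the fallback, while `lookup(float)` raises a `TypeError` whose message mentions the original `float` class and whose `__cause__` is the fallback's error.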
@@ -0,0 +1,52 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from merlin.dtypes.base import DType, ElementType, ElementUnit

# Unsigned Integer
uint8 = DType("uint8", ElementType.UInt, 8)
uint16 = DType("uint16", ElementType.UInt, 16)
uint32 = DType("uint32", ElementType.UInt, 32)
uint64 = DType("uint64", ElementType.UInt, 64)

# Signed Integer
int8 = DType("int8", ElementType.Int, 8, signed=True)
int16 = DType("int16", ElementType.Int, 16, signed=True)
int32 = DType("int32", ElementType.Int, 32, signed=True)
int64 = DType("int64", ElementType.Int, 64, signed=True)

# Float
float16 = DType("float16", ElementType.Float, 16, signed=True)
float32 = DType("float32", ElementType.Float, 32, signed=True)
float64 = DType("float64", ElementType.Float, 64, signed=True)

# Date/Time
datetime64 = DType("datetime64", ElementType.DateTime, 64)
datetime64Y = DType("datetime64[Y]", ElementType.DateTime, 64, ElementUnit.Year)
datetime64M = DType("datetime64[M]", ElementType.DateTime, 64, ElementUnit.Month)
datetime64D = DType("datetime64[D]", ElementType.DateTime, 64, ElementUnit.Day)
datetime64h = DType("datetime64[h]", ElementType.DateTime, 64, ElementUnit.Hour)
datetime64m = DType("datetime64[m]", ElementType.DateTime, 64, ElementUnit.Minute)
datetime64s = DType("datetime64[s]", ElementType.DateTime, 64, ElementUnit.Second)
datetime64ms = DType("datetime64[ms]", ElementType.DateTime, 64, ElementUnit.Millisecond)
datetime64us = DType("datetime64[us]", ElementType.DateTime, 64, ElementUnit.Microsecond)
datetime64ns = DType("datetime64[ns]", ElementType.DateTime, 64, ElementUnit.Nanosecond)

# Miscellaneous
string = DType("str", ElementType.String)
boolean = DType("bool", ElementType.Bool)
object_ = DType("object", ElementType.Object)
unknown = DType("unknown", ElementType.Unknown)
Review comment: This dtype is essential. It replaces `None` with our own logic, so we can catch it.
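Since `DType` is declared with `@dataclass(eq=True, frozen=True)`, the aliases compare by value, are hashable, and cannot be mutated, which is what makes an explicit `unknown` sentinel practical. The sketch below re-declares a trimmed-down `DType` and `ElementType` so that it runs on its own; `sizes_in_bytes` and `describe` are hypothetical helpers for illustration, not part of the PR.

```python
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import Optional


class ElementType(Enum):
    Int = "int"
    Unknown = "unknown"


@dataclass(eq=True, frozen=True)
class DType:
    # Trimmed-down copy of the dataclass in the diff, for illustration only.
    name: str
    element_type: ElementType
    element_size: Optional[int] = None


int32 = DType("int32", ElementType.Int, 32)
unknown = DType("unknown", ElementType.Unknown)

# Value equality: a freshly built instance equals the module-level alias.
assert DType("int32", ElementType.Int, 32) == int32

# Hashable, so aliases can serve as dictionary keys in a mapping table.
sizes_in_bytes = {int32: 4}


def describe(dtype: DType) -> str:
    # An explicit sentinel can be tested for directly, unlike an ambiguous None.
    return "dtype not yet known" if dtype == unknown else dtype.name
```

Attempting `int32.name = "x"` raises `FrozenInstanceError`, so an alias shared across modules cannot be changed by accident.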
@@ -0,0 +1,127 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from dataclasses import dataclass
from enum import Enum
from typing import Optional

from merlin.dtypes.registry import _dtype_registry


class ElementType(Enum):
    """
    Merlin DType base types

    Since a Merlin DType may describe a list, these are either the types of
    scalars or the types of list elements.
    """

    Bool = "bool"
    Int = "int"
    UInt = "uint"
    Float = "float"
    String = "string"
    DateTime = "datetime"
    Object = "object"
    Unknown = "unknown"


class ElementUnit(Enum):
    """
    Dtype units, used only for datetime types

    Since a Merlin DType may describe a list, these are either the units of
    scalars or the units of list elements.
    """

    Year = "year"
    Month = "month"
    Day = "day"
    Hour = "hour"
    Minute = "minute"
    Second = "second"
    Millisecond = "millisecond"
    Microsecond = "microsecond"
    Nanosecond = "nanosecond"


@dataclass(eq=True, frozen=True)
class DType:
    """
    Merlin dtypes are objects of this dataclass
    """

    name: str
    element_type: ElementType
    element_size: Optional[int] = None
    element_unit: Optional[ElementUnit] = None
    signed: Optional[bool] = None

    def to(self, mapping_name: str):
        """
        Convert this Merlin dtype to another framework's dtype

        Parameters
        ----------
        mapping_name : str
            Name of the framework dtype mapping to apply

        Returns
        -------
        Any
            An external framework dtype object

        Raises
        ------
        ValueError
            If there is no registered mapping for the given framework name
        ValueError
            If the registered mapping for the given framework name doesn't map
            this Merlin dtype to a framework dtype
        """
        try:
            mapping = _dtype_registry.mappings[mapping_name]
        except KeyError as exc:
            raise ValueError(
                f"Merlin doesn't have a registered dtype mapping for '{mapping_name}'. "
                "If you'd like to register a new dtype mapping, use `merlin.dtypes.register()`. "
                "If you're expecting this mapping to already exist, has the library or package "
                "that defines the mapping been imported successfully?"
            ) from exc

        try:
            return mapping.from_merlin(self)
        except KeyError as exc:
            raise ValueError(
                f"The registered dtype mapping for {mapping_name} doesn't contain type {self.name}."
            ) from exc

    @property
    def to_numpy(self):
        return self.to("numpy")

    @property
    def to_python(self):
        return self.to("python")

    # These properties refer to a single scalar (potentially a list element)
    @property
    def is_integer(self):
        return self.element_type.value == "int"

    @property
    def is_float(self):
        return self.element_type.value == "float"
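The two-step error handling in `DType.to()` and the behavior of `is_integer` can be tried out with a self-contained sketch. The registry here is an assumption made for illustration: `_MAPPINGS` is a hypothetical plain dict of dicts, whereas the real code looks up mapping objects in `_dtype_registry.mappings` and calls their `from_merlin()` method. The `is_integer` comparison is copied from the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ElementType(Enum):
    Int = "int"
    UInt = "uint"
    Float = "float"


# Hypothetical registry: framework name -> {DType: framework dtype}.
_MAPPINGS: Dict[str, dict] = {}


@dataclass(eq=True, frozen=True)
class DType:
    name: str
    element_type: ElementType
    element_size: Optional[int] = None

    def to(self, mapping_name: str) -> Any:
        # Step 1: an unknown framework name becomes a ValueError.
        try:
            mapping = _MAPPINGS[mapping_name]
        except KeyError as exc:
            raise ValueError(f"no registered dtype mapping for '{mapping_name}'") from exc
        # Step 2: a known framework that lacks this dtype is also a ValueError.
        try:
            return mapping[self]
        except KeyError as exc:
            raise ValueError(f"mapping '{mapping_name}' has no entry for {self.name}") from exc

    @property
    def is_integer(self) -> bool:
        # Same comparison as in the diff: only ElementType.Int matches.
        return self.element_type.value == "int"


int64 = DType("int64", ElementType.Int, 64)
uint8 = DType("uint8", ElementType.UInt, 8)
float64 = DType("float64", ElementType.Float, 64)

_MAPPINGS["python"] = {int64: int, float64: float}
```

In this sketch `int64.to("python")` returns the built-in `int`, while `int64.to("torch")` and `uint8.to("python")` both raise `ValueError` with different messages. Because the comparison is against the string `"int"`, `uint8.is_integer` evaluates to `False`; callers that want to treat unsigned types as integers would need to check `ElementType.UInt` as well.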
The need for this new utility was already latent in the existing code, but it becomes a lot clearer once you have a cross-framework way to translate dtypes. Chose not to tackle it here since that would expand the scope of this already somewhat large PR.