-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
named-types: 1st pass on klay_beam dataclass primitives #41 #68
Draft
CharlesHolbrow
wants to merge
2
commits into
main
Choose a base branch
from
named-properties
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
from typing import Optional, Tuple, List | ||
from dataclasses import dataclass | ||
import numpy as np | ||
|
||
from .torch_utils import TORCH_AVAILABLE | ||
|
||
if TORCH_AVAILABLE: | ||
import torch | ||
|
||
|
||
"""Basic klay_beam io types used when reading and writing data. | ||
|
||
Why are these types frozen? This is to avoid accidentally modifying the data. In | ||
distributed runners, the types passed between PTransforms should be immutable. | ||
See: https://beam.apache.org/documentation/programming-guide/#immutability | ||
""" | ||
|
||
|
||
@dataclass(frozen=True) | ||
class Datum: | ||
"""Base type for data that is read and written via klay_beam transforms. | ||
|
||
Why are these properties optional? At different stages of a pipeline, | ||
different properties will be available. For example LoadWithLibrosa will | ||
populate the source_path, but not the target_path. A downstream transform | ||
that creates multiple files may not populate source_path for each file. | ||
|
||
Using a single reusable type for the majority of klay_beam transforms keeps | ||
things simple, but also means that Transform authors should be careful to | ||
assert that the properties they need are populated. | ||
|
||
What is the difference between source_PATH and source_DIR? Paths for for | ||
full file names. Dirs are the root of a copy operation. We've found in the | ||
past the past it is helpful to be able to write transforms that COPY data | ||
from one path to another while recursively preserving the directory | ||
structure from in some source_dir. | ||
""" | ||
|
||
source_dir: Optional[str] = None | ||
target_dir: Optional[str] = None | ||
source_path: Optional[str] = None | ||
target_path: Optional[str] = None | ||
|
||
binary_data: Optional[bytes] = None | ||
"""When a klay_beam pipeline is ready to WRITE a file, it should populate | ||
`binary_data` AND `target_path`. Note that transforms that LOAD time-series | ||
data do not need to populate `binary_data`. Instead transforms that load | ||
data will generally just populate the .datum property of a NumpyDatum or | ||
TorchDatum instance, and leave `binary_data` empty.""" | ||
|
||
|
||
@dataclass(frozen=True) | ||
class NumpyDatum(Datum): | ||
"""General data storage returned by transforms that load time-series numpy | ||
data such as LoadWithLibrosa""" | ||
|
||
datum: Tuple[np.ndarray, int] | ||
|
||
|
||
@dataclass(frozen=True) | ||
class TorchDatum(Datum): | ||
"""General data storage returned by transforms that load time-series torch | ||
data such as LoadWithPytorch""" | ||
|
||
datum: Tuple["torch.Tensor", int] | ||
|
||
|
||
@dataclass(frozen=True) | ||
class NumpyData: | ||
"""General data storage returned by transforms that do not have a one-to-one | ||
mapping between input and output files""" | ||
|
||
data: List[NumpyDatum] = [] | ||
|
||
|
||
@dataclass(frozen=True) | ||
class TorchData: | ||
"""General data storage returned by transforms that do not have a one-to-one | ||
mapping between input and output files""" | ||
|
||
data: List[TorchDatum] = [] |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this conflicting with the name,
NumpyDatum
is suggesting that this is a genericnumpy
dataclass. However the fact that this is aTuple
means you're expecting an audio array and a sample rate value here. If that's the case this should be anAudioDatum
. Alternatively I would suggest adding sample rate as a separate field, or I would just leave thedatum
field and allow another child class calledAudioNumpyDatum
.This applies to the
TorchDatum
also and is important because there's scenarios where we're not passing aroundnumpy
arrays that are audio e.g. feature arrays.