named-types: 1st pass on klay_beam dataclass primitives #41 #68

Draft: wants to merge 2 commits into main
Conversation

CharlesHolbrow (Contributor)
After thinking about it some more, it seems like including #41 in v1.0 is worthwhile. Here's a proposal for the data types that could be passed between core klay_beam transforms.

The next step is to try re-writing some of the transforms to use these classes, but before I do that it would be really helpful to have a quick review of the choices here, which are explained in the docstrings.

CharlesHolbrow added this to the v1.0 milestone on Dec 8, 2023
mxkrn (Collaborator) left a comment:

I see where you're going with this design pattern; however, personally I had something slightly different in mind. The base Datum class is powerful in that it avoids duplication and creates a structured interface for passing around data. However, I don't think it really helps us achieve (what I thought to be) the main objective of using these data structures to validate the types and values that are passed between transforms.

This suggests a much less generic setup with Transform-specific data types instead, i.e. TorchAudioTuple, TorchFeatureTuple, NumpyAudioTuple, with type and potentially even value validation. There's definitely room to use a base class in the way you suggested, but I would want to see how this would trickle down into child classes that both provide an interface and validate data being passed between transforms.
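
For illustration, here is a minimal sketch of what one of these validating, Transform-specific types might look like (the exact fields and checks are placeholders, not a final design):

from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class TorchAudioTuple:
    """Audio tensor plus sample rate, validated on construction."""

    audio: torch.Tensor
    sample_rate: int

    def __post_init__(self):
        # Type validation
        if not isinstance(self.audio, torch.Tensor):
            raise TypeError("audio must be a torch.Tensor")
        # Value validation, assuming a (channels, samples) layout
        if self.audio.dim() != 2:
            raise ValueError("expected a 2D (channels, samples) tensor")
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be a positive integer")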

"""General data storage returned by transforms that load time-series numpy
data such as LoadWithLibrosa"""

datum: Tuple[np.ndarray, int]
mxkrn (Collaborator):

I find this conflicting with the name: NumpyDatum suggests that this is a generic numpy dataclass. However, the fact that this is a Tuple means you're expecting an audio array and a sample-rate value here. If that's the case, this should be an AudioDatum. Alternatively, I would suggest adding sample rate as a separate field, or I would just leave the datum field and allow another child class called AudioNumpyDatum.

This applies to TorchDatum as well, and it's important because there are scenarios where we're not passing around numpy arrays that are audio, e.g. feature arrays.
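
As a rough sketch of that child-class option (assuming datum becomes a bare array and the sample rate moves to the audio-specific child):

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class NumpyDatum:
    """Generic numpy payload passed between transforms."""

    datum: np.ndarray


@dataclass(frozen=True)
class AudioNumpyDatum(NumpyDatum):
    """Audio-specific child: datum holds audio samples."""

    sample_rate: int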

CharlesHolbrow (Contributor, Author) commented Dec 8, 2023:

Got it, thanks Max. If I understand correctly, you are proposing that each Transform in Klay Beam could have its own dataclass with exactly the properties relevant to that transform. For example, LoadWithTorchaudio could output data classes that look like this:

@dataclass(frozen=True)
class TimeSeriesTorchData:
    source_dir: str
    source_path: str
    sample_rate: int
    tensor: torch.Tensor

This is still somewhat reusable: we could write a .npy loader that outputs this type. Also, downstream PTransforms don't have to assert element.source_dir etc. Is that what you had in mind?
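
For example, a hypothetical loader could look like this (since .npy files do not store a sample rate, it is supplied at construction; TimeSeriesTorchData is the dataclass above):

import apache_beam as beam
import numpy as np
import torch


class LoadNpy(beam.DoFn):
    """Hypothetical: load .npy files and emit TimeSeriesTorchData."""

    def __init__(self, source_dir: str, sample_rate: int):
        self.source_dir = source_dir
        self.sample_rate = sample_rate

    def process(self, source_path: str):
        # For simplicity this reads a local path; a real transform would
        # go through Beam's filesystem abstractions.
        array = np.load(source_path)
        yield TimeSeriesTorchData(
            source_dir=self.source_dir,
            source_path=source_path,
            sample_rate=self.sample_rate,
            tensor=torch.from_numpy(array),
        )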

One of the problems that I want to solve is that Transforms should be able to forward little bits of metadata from their input elements to subsequent transforms in the pipeline. One example is SkipCompleted, which really benefits from knowing the expected shape of the output file(s).

I would like Transforms (like SkipCompleted and LoadWithX) that sit between the main file-matching transform and the final "persistence" transform to be able to forward these little bits of metadata when they are available. This means we can set up just one function that prescribes the composition of the expected input and output file types and extensions. Then all core transforms can forward any available metadata to their outputs, even when they do not necessarily use that data.
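
Something like this hypothetical helper is what I mean by "one function" (the names and extensions are made up for the sake of the example):

def prescribe_output_path(source_path: str) -> str:
    """One place that prescribes how an expected output path is composed
    from an input path. File matching, SkipCompleted, and the persistence
    transform would all consult this instead of hard-coding extensions."""
    assert source_path.endswith(".wav")
    return source_path[: -len(".wav")] + ".npy"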

If every Transform in the pipeline has rigid input and output types, I'm worried that:

  1. We have to maintain many slightly different dataclasses, making it a chore to find the right one when writing transforms.
  2. It becomes difficult to write Transforms that forward little bits of metadata that they do not necessarily use, but that MAY be useful for subsequent transforms. One solution is to add a .metadata dict to all Transform IO dataclasses (sketched below), but then we have to shuffle these bits of data between properties and the dictionary, and that feels error-prone.
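
To make point 2 concrete, the .metadata variant would look roughly like this (a hypothetical sketch, not a proposal):

from dataclasses import dataclass, field
from typing import Any, Dict

import torch


@dataclass(frozen=True)
class TimeSeriesTorchData:
    source_dir: str
    source_path: str
    sample_rate: int
    tensor: torch.Tensor
    # Catch-all for bits of metadata this transform does not use, but a
    # downstream transform (e.g. SkipCompleted) might. Every transform has
    # to remember to copy it forward and to shuffle values between typed
    # properties and this dict, which is the error-prone part.
    metadata: Dict[str, Any] = field(default_factory=dict)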

Let's think together about whether there is a better way to accomplish these goals while also maintaining more type safety than the current proposal. Maybe the answer is to just start writing the updated transforms and pipelines and see which approach goes more smoothly.
