named-types: 1st pass on klay_beam dataclass primitives #41 #68

Draft: wants to merge 2 commits into main
Conversation

CharlesHolbrow (Contributor)
After thinking about it some more, it seems like including #41 in v1.0 is worthwhile. Here's a proposal for the data types that could be passed between core klay_beam transforms.

The next step is to try re-writing some of the transforms to use these classes, but before I do that it would be really helpful to have a quick review of the choices here, which are explained in the docstrings.

CharlesHolbrow added this to the v1.0 milestone on Dec 8, 2023
mxkrn (Collaborator) left a comment:

I see where you're going with this design pattern; however, personally I had something slightly different in mind. The base Datum class is powerful in that it avoids duplication and creates a structured interface for passing around data. However, I don't think it really helps us achieve (what I thought to be) the main objective of using these data structures to validate the types and values that are passed between transforms.

This suggests a much less generic setup with Transform-specific data types instead, i.e. TorchAudioTuple, TorchFeatureTuple, NumpyAudioTuple, with type and potentially even value validation. There's definitely room to use a base class in the way you suggested, but I would want to see how this would trickle down into child classes that both provide an interface and validate data being passed between transforms.
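
For illustration, here is a minimal sketch of what one of these validating, Transform-specific types might look like (the exact fields and checks are placeholders, not a final design):

from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class TorchAudioTuple:
    """Audio tensor plus sample rate, validated on construction."""

    audio: torch.Tensor
    sample_rate: int

    def __post_init__(self):
        # Type validation
        if not isinstance(self.audio, torch.Tensor):
            raise TypeError("audio must be a torch.Tensor")
        # Value validation, assuming a (channels, samples) layout
        if self.audio.dim() != 2:
            raise ValueError("expected a 2D (channels, samples) tensor")
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be a positive integer")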

"""General data storage returned by transforms that load time-series numpy
data such as LoadWithLibrosa"""

datum: Tuple[np.ndarray, int]
mxkrn (Collaborator):

I find this conflicting with the name: NumpyDatum suggests that this is a generic numpy dataclass. However, the fact that this is a Tuple means you're expecting an audio array and a sample-rate value here. If that's the case, this should be an AudioDatum. Alternatively, I would suggest adding sample rate as a separate field, or I would just leave the datum field and allow another child class called AudioNumpyDatum.

This applies to TorchDatum as well, and it's important because there are scenarios where we're not passing around numpy arrays that are audio, e.g. feature arrays.
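
As a rough sketch of that child-class option (assuming datum becomes a bare array and the sample rate moves to the audio-specific child):

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class NumpyDatum:
    """Generic numpy payload passed between transforms."""

    datum: np.ndarray


@dataclass(frozen=True)
class AudioNumpyDatum(NumpyDatum):
    """Audio-specific child: datum holds audio samples."""

    sample_rate: int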

CharlesHolbrow (Contributor, Author) commented Dec 8, 2023:

Got it, thanks Max. If I understand correctly, you are proposing that each Transform in Klay Beam could have its own dataclass with exactly the properties relevant to that transform. For example, LoadWithTorchaudio could output data classes that look like this:

@dataclass(frozen=True)
class TimeSeriesTorchData:
    source_dir: str
    source_path: str
    sample_rate: int
    tensor: torch.Tensor

This is still somewhat reusable: we could write a .npy loader that outputs this type. Also, downstream PTransforms don't have to assert element.source_dir etc. Is that what you had in mind?
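
For example, a hypothetical loader could look like this (since .npy files do not store a sample rate, it is supplied at construction; TimeSeriesTorchData is the dataclass above):

import apache_beam as beam
import numpy as np
import torch


class LoadNpy(beam.DoFn):
    """Hypothetical: load .npy files and emit TimeSeriesTorchData."""

    def __init__(self, source_dir: str, sample_rate: int):
        self.source_dir = source_dir
        self.sample_rate = sample_rate

    def process(self, source_path: str):
        # For simplicity this reads a local path; a real transform would
        # go through Beam's filesystem abstractions.
        array = np.load(source_path)
        yield TimeSeriesTorchData(
            source_dir=self.source_dir,
            source_path=source_path,
            sample_rate=self.sample_rate,
            tensor=torch.from_numpy(array),
        )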

One of the problems that I want to solve is that Transforms should be able to forward little bits of metadata from their input elements to subsequent transforms in the pipeline. One example is SkipCompleted, which really benefits from knowing the expected shape of the output file(s).

I would like Transforms (like SkipCompleted and LoadWithX) that sit between the main file-matching transform and the final "persistence" transform to be able to forward these little bits of metadata when they are available. This means we can set up just one function that prescribes the composition of the expected input and output file types and extensions. Then all core transforms can forward any available metadata to their outputs, even when they do not necessarily use that data.
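
Something like this hypothetical helper is what I mean by "one function" (the names and extensions are made up for the sake of the example):

def prescribe_output_path(source_path: str) -> str:
    """One place that prescribes how an expected output path is composed
    from an input path. File matching, SkipCompleted, and the persistence
    transform would all consult this instead of hard-coding extensions."""
    assert source_path.endswith(".wav")
    return source_path[: -len(".wav")] + ".npy"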

If every Transform in the pipeline has rigid input and output types, I'm worried that:

  1. We have to maintain many slightly different dataclasses, making it a chore to find the right one when writing transforms.
  2. It becomes difficult to write Transforms that forward little bits of metadata that they do not necessarily use, but that MAY be useful for subsequent transforms. One solution is to add a .metadata dict to all Transform IO dataclasses (sketched below), but then we have to shuffle these bits of data between properties and the dictionary, and that feels error-prone.
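
To make point 2 concrete, the .metadata variant would look roughly like this (a hypothetical sketch, not a proposal):

from dataclasses import dataclass, field
from typing import Any, Dict

import torch


@dataclass(frozen=True)
class TimeSeriesTorchData:
    source_dir: str
    source_path: str
    sample_rate: int
    tensor: torch.Tensor
    # Catch-all for bits of metadata this transform does not use, but a
    # downstream transform (e.g. SkipCompleted) might. Every transform has
    # to remember to copy it forward and to shuffle values between typed
    # properties and this dict, which is the error-prone part.
    metadata: Dict[str, Any] = field(default_factory=dict)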

Let's think together about whether there is a better way to accomplish these goals while also maintaining more type safety than the current proposal. Maybe the answer is to just start writing the updated transforms and pipelines and see which approach goes more smoothly.
