Moving to LitData: Refactoring data pipeline #80

gitttt-1234 · 2024-09-05T22:31:08Z

The primary bottleneck in our training pipeline is dataloader performance: training time per epoch is currently very high because the IterDataPipe-based dataloader is 2-3 times slower than the TensorFlow pipeline used in SLEAP.

We evaluated LitData, an API designed to optimize data pipelines through multi-machine and distributed processing, with built-in cloud-scalable solutions. Its support for efficient parallelization and memory handling speeds up data-intensive training. We benchmarked LitData across all data pipelines (single-instance, centroid, centered-instance and bottom-up) and achieved nearly on-par performance with TensorFlow in all cases except single-instance, which remains 1.5x slower than TensorFlow.

This issue details the plan for refactoring our current data pipeline.

MVP:

PR1:

Break down all operations in the __iter__ methods across the data modules into individual, well-defined functions.
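
As a rough illustration of this split, the sketch below pulls the steps of a monolithic __iter__ into standalone functions that can be reused outside the iterator. The names (normalize_image, scale_points, process_example, ExamplePipe) and the list-based image are hypothetical stand-ins, not the project's actual functions.

```python
from typing import Dict, Iterator, List

Image = List[List[float]]  # stand-in for a 2D image tensor


def normalize_image(image: Image) -> Image:
    """Scale 8-bit pixel values into [0, 1]."""
    return [[px / 255.0 for px in row] for row in image]


def scale_points(points: List[List[float]], scale: float) -> List[List[float]]:
    """Scale (x, y) keypoints by the same factor used to resize the image."""
    return [[x * scale, y * scale] for x, y in points]


def process_example(example: Dict, scale: float = 1.0) -> Dict:
    """Compose the individual steps; callable from __iter__ or anywhere else."""
    return {
        "image": normalize_image(example["image"]),
        "instances": scale_points(example["instances"], scale),
    }


class ExamplePipe:
    """Iterator whose __iter__ only delegates to the functions above."""

    def __init__(self, examples: List[Dict], scale: float = 1.0):
        self.examples = examples
        self.scale = scale

    def __iter__(self) -> Iterator[Dict]:
        for example in self.examples:
            yield process_example(example, self.scale)
```

Once the steps are plain functions, the same process_example can be called from a chunking function or a __getitem__ method without going through the iterator.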

PR2:

Implement the get_chunks() method for each model pipeline. This method runs all the data preprocessing functions (except augmentation, resizing/pad_to_stride, and confidence map or part affinity field generation) to extract dictionaries from the .slp file and save them as .bin files.
- For the centroid model, the centroids are computed inside get_chunks().
- For the centered-instance model, the crops are generated larger than crop_hw (padded by crop_hw * (np.sqrt(2) - 1), for a total of about crop_hw * np.sqrt(2)) to account for the black edges that rotation augmentation introduces. The images are re-cropped to crop_hw in the litdata.StreamingDataset.__getitem__() method.
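
The oversizing follows from geometry: a crop_hw x crop_hw window rotated by an angle theta needs a source square of side crop_hw * (|cos theta| + |sin theta|), which peaks at crop_hw * sqrt(2) for 45 degrees. The sketch below reads the expression above as the extra margin added on top of crop_hw, and shows the size computation and the centered re-crop. It assumes square crops; the helper names are hypothetical.

```python
import math
from typing import List, Tuple


def oversized_crop_size(crop_hw: int) -> int:
    """Side length that still covers a crop_hw window after any rotation."""
    margin = crop_hw * (math.sqrt(2) - 1)  # extra pixels needed at 45 degrees
    return crop_hw + math.ceil(margin)


def center_crop_box(big: int, crop_hw: int) -> Tuple[int, int]:
    """(start, stop) indices of the centered crop_hw window inside a big crop."""
    start = (big - crop_hw) // 2
    return start, start + crop_hw


def center_crop(image: List[List[float]], crop_hw: int) -> List[List[float]]:
    """Re-crop a square image (list of rows) to crop_hw x crop_hw around its center."""
    start, stop = center_crop_box(len(image), crop_hw)
    return [row[start:stop] for row in image[start:stop]]
```

For example, a crop_hw of 100 would be stored as a 142-pixel crop and cut back to rows and columns 21 to 120 after augmentation.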

PR3:

Implement a custom litdata.StreamingDataset class for each model type. Apply augmentation, resizing and pad_to_stride, and generate confidence maps (plus part affinity fields for the bottom-up model) in the litdata.StreamingDataset.__getitem__() method.
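
For reference, pad_to_stride makes the image height and width divisible by max_stride. A minimal sketch of that computation follows, assuming zero padding is added on the bottom and right (an assumption, not confirmed by this issue) and using nested lists in place of tensors.

```python
import math
from typing import List, Tuple


def padded_size(height: int, width: int, max_stride: int) -> Tuple[int, int]:
    """Smallest (height, width) >= the input that is divisible by max_stride."""
    pad_h = math.ceil(height / max_stride) * max_stride
    pad_w = math.ceil(width / max_stride) * max_stride
    return pad_h, pad_w


def pad_to_stride(image: List[List[float]], max_stride: int) -> List[List[float]]:
    """Zero-pad a 2D image on the bottom and right to the padded size."""
    height, width = len(image), len(image[0])
    pad_h, pad_w = padded_size(height, width, max_stride)
    rows = [row + [0.0] * (pad_w - width) for row in image]
    rows += [[0.0] * pad_w for _ in range(pad_h - height)]
    return rows
```

Padding only on the bottom and right leaves keypoint coordinates unchanged, so the instances need no adjustment after this step.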

PR4:

Integrate with the training.model_trainer.ModelTrainer class. In _create_data_loaders(), call ld.optimize(fn=get_chunks) to generate the .bin files, then pass the .bin directory path to the litdata.StreamingDataset class. Ensure the .bin files are deleted after training.
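
One way to guarantee the cleanup is to scope the chunk directory with a context manager, so the .bin files are removed even if training raises. This is a sketch only: chunk_dir and run_training are hypothetical helpers, and the ld.optimize and StreamingDataset calls appear as comments because their arguments depend on the trainer.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def chunk_dir(prefix: str = "litdata_chunks_") -> Iterator[Path]:
    """Create a temporary directory for .bin chunks and delete it on exit."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions raised during training.
        shutil.rmtree(path, ignore_errors=True)


def run_training() -> Path:
    """Hypothetical outline of _create_data_loaders() plus the training loop."""
    with chunk_dir() as out_dir:
        # ld.optimize(fn=get_chunks, inputs=..., output_dir=str(out_dir), ...)
        # dataset = SingleInstanceDataset(str(out_dir))
        (out_dir / "chunk-0-0.bin").write_bytes(b"placeholder")  # stand-in for real chunks
        # ... train here ...
    return out_dir
```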

Example

get_chunks() function

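import litdata as ld
import sleap_io as sio

# get_img_inst_from_lf, normalize, resize and pad_to_stride are the project's own helpers.
# `orig_size`, `video_idx` and `max_stride` are not defined in this sketch;
# they are assumed to be supplied by the surrounding pipeline.
def single_instance_get_chunks(lf: sio.LabeledFrame):
    # Extract image and instances from the labeled frame and convert them to `torch.Tensor`s.
    image, instances = get_img_inst_from_lf(lf)

    # Normalize; includes converting to/from RGB from/to grayscale.
    image = normalize(image)

    image, instances = resize(image, instances)

    image = pad_to_stride(image, max_stride)

    ex = {
        "image": image,
        "instances": instances,
        "orig_size": orig_size,
        "frame_idx": lf.frame_idx,
        "video_idx": video_idx,
    }

    return ex

# Guard the entry point: ld.optimize starts worker processes.
if __name__ == "__main__":
    labels = sio.load_slp("test.pkg.slp")
    ld.optimize(
        fn=single_instance_get_chunks,
        inputs=list(labels),
        output_dir="./single_instance_chunks/",
        num_workers=2,
        chunk_size=100,
    )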
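
The example above leaves video_idx undefined. One possible way to supply it, shown here purely as an assumption, is to pair each labeled frame with the index of its video before calling ld.optimize, and have the chunk function unpack that pair. Frame is a stand-in for sleap_io.LabeledFrame, and pair_with_video_idx and get_chunks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Frame:
    """Stand-in for sleap_io.LabeledFrame (only the fields used here)."""

    video: Any
    frame_idx: int


def pair_with_video_idx(frames: List[Frame], videos: List[Any]) -> List[Tuple[Frame, int]]:
    """Attach each frame's video index so workers do not need the full labels object."""
    return [(lf, videos.index(lf.video)) for lf in frames]


def get_chunks(item: Tuple[Frame, int]) -> dict:
    """Chunk function taking a (frame, video_idx) pair as its single input."""
    lf, video_idx = item
    return {"frame_idx": lf.frame_idx, "video_idx": video_idx}
```

The resulting list of pairs would be passed as inputs, since ld.optimize calls fn once per input element.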

Custom StreamingDataset

import litdata as ld

# augmentation and get_confmaps are the project's own helpers.
class SingleInstanceDataset(ld.StreamingDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, index):
        # Load the preprocessed example written by get_chunks().
        ex = super().__getitem__(index)

        image, instances = ex["image"], ex["instances"]

        # Augment on the fly so every epoch sees different transforms.
        image, instances = augmentation(image, instances)

        # Generate confidence maps at the image's (height, width).
        confidence_maps = get_confmaps(instances, image.shape[-2:])

        sample = {
            "image": image,
            "video_idx": ex["video_idx"],
            "frame_idx": ex["frame_idx"],
            "orig_size": ex["orig_size"],
            "instances": instances,
            "confidence_maps": confidence_maps,
        }

        return sample
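
For context, a confidence map is typically an unnormalized 2D Gaussian centered on each keypoint. The sketch below shows that computation for a single keypoint. The function name, the sigma default and the list-based output are assumptions for illustration; the project's get_confmaps may differ, for example by working on tensors and an output stride.

```python
import math
from typing import List, Tuple


def gaussian_confmap(
    point: Tuple[float, float], height: int, width: int, sigma: float = 1.5
) -> List[List[float]]:
    """Return a height x width map of exp(-d^2 / (2 sigma^2)) around point (x, y)."""
    px, py = point
    return [
        [
            math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma**2))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

The map peaks at 1.0 on the keypoint and decays symmetrically with distance, at a rate set by sigma.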

Next steps:

- Re-implement Cycler.
- Speed up the torch.exp function?
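
If the torch.exp call in question is the Gaussian used for confidence maps (an assumption; the issue does not say), one standard option is to exploit separability: exp(-(dx^2 + dy^2) / (2 sigma^2)) equals the product of a 1D Gaussian in x and one in y, so exp is evaluated H + W times instead of H * W. A plain-Python sketch with a hypothetical function name follows; the same idea carries over to tensors as an outer product of two 1D vectors.

```python
import math
from typing import List, Tuple


def separable_confmap(
    point: Tuple[float, float], height: int, width: int, sigma: float = 1.5
) -> List[List[float]]:
    """Build a 2D Gaussian map from two 1D Gaussians (outer product)."""
    px, py = point
    # exp is called once per column and once per row, not once per pixel.
    gx = [math.exp(-((x - px) ** 2) / (2 * sigma**2)) for x in range(width)]
    gy = [math.exp(-((y - py) ** 2) / (2 * sigma**2)) for y in range(height)]
    return [[gy_val * gx_val for gx_val in gx] for gy_val in gy]
```

Whether this helps in practice would need to be benchmarked against the current implementation.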

Linked pull requests:

- LitData Refactor PR1: Get individual functions for data pipelines #90 (linked Sep 12, 2024; merged)
- LitData Refactor PR2: Implement a function to get the data chunks for all model types #91 (mentioned and linked Sep 12, 2024; merged)
- LitData Refactor PR3: Add custom StreamingDataset #92 (mentioned and linked Sep 12, 2024; merged)
- LitData Refactor PR4: Integrate LitData with ModelTrainer class #93 (referenced Sep 19, 2024; closed)
- LitData Refactor PR4: Integrate LitData with ModelTrainer class #94 (referenced Sep 19, 2024; linked Sep 26, 2024; merged)
- Remove IterDataPipe from Inference pipeline #96 (mentioned Sep 27, 2024; merged)

gitttt-1234 closed this as completed in #90 Oct 3, 2024
