Releases: Lightning-AI/litData
LitData v0.2.51
Lightning AI ⚡ is excited to announce the release of LitData v0.2.51
Highlights
Stream Raw Datasets from Cloud Storage (Beta)
Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.
```python
from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass
```
Benchmarks
Streaming speed for raw ImageNet (1.2M images) from cloud storage:
| Storage | Images/s (No Transform) | Images/s (With Transform) |
|---|---|---|
| AWS S3 | ~6,400 ± 100 | ~3,200 ± 100 |
| Google Cloud Storage | ~5,650 ± 100 | ~3,100 ± 100 |
Note: Use `StreamingRawDataset` for direct data streaming. Opt for `StreamingDataset` for maximum speed with pre-optimized data.
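Because `StreamingRawDataset` returns files in their original format, decoding happens on the consumer side. A minimal sketch, assuming the bucket holds JPEG images and that each item arrives as raw `bytes` (the `decode_jpeg` helper is illustrative, not part of the litdata API):

```python
import io

from PIL import Image
from torch.utils.data import DataLoader

from litdata.streaming.raw_dataset import StreamingRawDataset


def decode_jpeg(raw: bytes) -> Image.Image:
    """Hypothetical helper: decode one raw JPEG payload into a PIL image."""
    return Image.open(io.BytesIO(raw)).convert("RGB")


dataset = StreamingRawDataset("s3://bucket/files/")
# collate_fn=list keeps the raw payloads as a plain Python list per batch
loader = DataLoader(dataset, batch_size=32, collate_fn=list)

for batch in loader:
    images = [decode_jpeg(item) for item in batch]  # decode each raw payload
```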
Resume ParallelStreamingDataset
The `ParallelStreamingDataset` now supports a `resume` option, allowing you to seamlessly continue training from the previous epoch's state when cycling through datasets. Enable it with `resume=True` to avoid restarting at index 0 each epoch, ensuring consistent sample progression across epochs.
```python
from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)

loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass
```
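The effect of `resume=True` shows up across epochs: the second pass over the cycled dataset continues from where the first stopped instead of restarting at index 0. A minimal two-epoch sketch, assuming `dataset1` and `dataset2` are existing `StreamingDataset` instances:

```python
from litdata.streaming.parallel import ParallelStreamingDataset

# Cycle through both datasets, drawing 100 samples per epoch.
dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)

for epoch in range(2):
    for sample in dataset:
        # With resume=True, the second epoch picks up from the samples
        # that follow the ones consumed in the first epoch.
        pass
```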
Per-Dataset Batch Sizes in CombinedStreamingDataset
The `CombinedStreamingDataset` now supports per-dataset batch sizes when using `batching_method="per_stream"`. Specify unique batch sizes for each dataset using `set_batch_size()` with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.
```python
from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123,
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass
```
Changes
Added
- Added support for setting cache directory via `LITDATA_CACHE_DIR` environment variable (#639 by @deependujha)
- Added CLI option to clear default cache (#627 by @deependujha)
- Added resume support to `ParallelStreamingDataset` (#650 by @philgzl)
- Added `verbose` option to `optimize_fn` (#654 by @deependujha)
- Added support for multiple `transform_fn` in `StreamingDataset` (#655 by @deependujha)
- Enabled per-dataset batch size support in `CombinedStreamingDataset` (#635 by @MagellaX)
- Added support for `StreamingRawDataset` to stream raw datasets from cloud storage (#652 by @bhimrazy)
- Added GCP support for directory resolution in `resolve_dir` (#659 by @bhimrazy)
Changed
- Cleaned up logic in `_loop` by removing hacky index assignment (#640 by @deependujha)
- Updated CODEOWNERS (#646 by @Borda)
- Switched to `astral-sh/setup-uv` for Python setup and used `uv pip` for package installation (#656 by @bhimrazy)
- Replaced PIL with torchvision's `decode_image` for more robust JPEG deserialization (#660 by @bhimrazy)
Fixed
Chores
- Bumped `cryptography` from 42.0.8 to 45.0.4 (#644 by @dependabot[bot])
- Updated `numpy` requirement from <2.0 to <3.0 (#645 by @dependabot[bot])
- Bumped `pytest-timeout` from 2.3.1 to 2.4.0 (#643 by @dependabot[bot])
- Applied pre-commit suggestions & bumped Python to 3.9 (#653 by @pre-commit-ci[bot])
- Bumped `actions/first-interaction` from 1 to 2 in GitHub Actions updates (#657 by @dependabot[bot])
- Bumped version to 0.2.51 (#664 by @bhimrazy)
Full Changelog: v0.2.50...v0.2.51
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy, @philgzl
New Contributors
- @lukemerrick made their first contribution in #647
- @MagellaX made their first contribution in #635
Thank you ❤️ and we hope you'll keep them coming!
litData v0.2.50: Fast Random Access & S3 Improvements 🧪⚡️
Lightning AI is excited to announce the release of litData
v0.2.50, a lightweight and powerful streaming data library designed for fast AI model training.
This release focuses on improving the developer experience and performance for streamed datasets, with a particular focus on:
- Faster random access support
- Transform hooks for datasets
- Better S3 interoperability
- CI stability and performance improvements
👉 Check out the full changelog here: Compare v0.2.49...v0.2.50
🚀 Highlights
🔄 Fast Random Access (No Chunk Download Needed)
You can now access samples randomly from remote datasets without downloading entire chunks, dramatically reducing IO overhead during sparse reads.
This is especially useful for visualization tools or quickly inspecting your dataset without requiring full downloads.
🚀 Benchmark (on Lightning Studio, chunk size: 64MB)

10 random accesses:
- 🔹 `v0.2.49`: 20–22 seconds
- 🔹 `v0.2.50`: 5–6 seconds

The benchmark was designed to ensure enough separation between accesses, avoiding repeated reads from the same chunk.

Single item access:
- 🔹 `v0.2.49`: ~2 seconds
- 🔹 `v0.2.50`: ~0.83 seconds
Sample code

```python
import litdata as ld

uri = "gs://litdata-gcp-bucket/optimized_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# When accessing random indices, check `my_cache`: it shouldn't download chunks
for i in range(0, 1000, 100):
    print(i, ds[i])

# It should download chunks now
for data in ds:
    print(data)
```
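For reference, a timing loop in the spirit of the benchmark above might look like this sketch, reusing the `ds` object from the sample code (the stride is assumed large enough that consecutive indices land in different chunks):

```python
import time

start = time.perf_counter()
for i in range(0, 1000, 100):  # 10 accesses, spaced apart to avoid re-reading the same chunk
    _ = ds[i]
print(f"10 random accesses took {time.perf_counter() - start:.2f}s")
```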
🧩 Transform Support in StreamingDataset
You can now apply transforms to samples in `StreamingDataset` and `CombinedStreamingDataset`.
There are two supported ways to use it:
- Pass a transform function when initializing the dataset:
```python
from torchvision import transforms

from litdata import StreamingDataset

# Define a simple transform function
torch_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # Resize to 256x256
    transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
    transforms.Normalize(            # Normalize using ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])


def transform_fn(x, *args, **kwargs):
    """Define your transform function."""
    return torch_transform(x)  # Apply the transform to the input image


# Create dataset with appropriate configuration
dataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=transform_fn)
```
- Subclass and override the `transform` method:
```python
class StreamingDatasetWithTransform(StreamingDataset):
    """A custom dataset class that inherits from StreamingDataset and applies a transform."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.torch_transform = transforms.Compose([
            transforms.Resize((256, 256)),   # Resize to 256x256
            transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
            transforms.Normalize(            # Normalize using ImageNet stats
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    # Define your transform method
    def transform(self, x, *args, **kwargs):
        """A simple transform function."""
        return self.torch_transform(x)


dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)
```
This makes it easier to insert preprocessing logic directly into the streaming pipeline.
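Either variant plugs straight into a dataloader. A minimal usage sketch, assuming the stored samples are images that the transform above converts to fixed-size tensors:

```python
from torch.utils.data import DataLoader

# `dataset` is either of the transformed datasets defined above.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    # Each batch stacks into a tensor of shape (32, 3, 256, 256),
    # since every sample is resized and converted by the transform.
    pass
```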
📖 AWS S3 Streaming Docs (with `boto3` & unsigned requests example)
The documentation now includes a clear example of how to stream datasets from AWS S3 using `boto3`, including support for unsigned requests. It also prioritizes `boto3` in the list of options for better clarity.
```python
import botocore
from litdata import StreamingDataset

storage_options = {
    "config": botocore.config.Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=botocore.UNSIGNED,
    )
}
dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)
```
📖 Batching Methods in CombinedStreamingDataset
The `CombinedStreamingDataset` supports two different batching methods through the `batching_method` parameter:
Stratified Batching (Default):
With `batching_method="stratified"` (the default), each batch contains samples from multiple datasets according to the specified weights:

```python
# Default stratified batching - batches mix samples from all datasets
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="stratified",  # This is the default
)
```
Per-Stream Batching:
With `batching_method="per_stream"`, each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:

```python
# Per-stream batching - each batch contains samples from only one dataset
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="per_stream",
)
```
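A sketch of how per-stream batching plays out at the dataloader level, assuming `dataset1` and `dataset2` are existing `StreamingDataset` instances with different sample shapes, and using litdata's `StreamingDataLoader`:

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader

combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="per_stream",
)

# Each yielded batch comes from a single underlying dataset, so samples
# within a batch share the same structure and collate without padding.
loader = StreamingDataLoader(combined_dataset, batch_size=16)
for batch in loader:
    pass
```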
🐛 Bug Fixes
- Fixed breaking `tqdm` progress bar in optimizing dataset
- Suppressed multiple `lightning-sdk` warnings
🧪 Testing & CI
- Python 3.12 and 3.13 now supported in CI matrix (#589)
- Test durations now logged for debugging (#614)
- Added missing CI dependencies (#634)
- Refactored large, slow tests to reduce CI runtime (#629, #632)
📎 Minor Improvements
- Updated bug report template for easier Lightning Studio reproduction (#611)
📦 Dependency Updates
- `mosaicml-streaming`: 0.8.1 → 0.11.0 (#624)
- `transformers`: <4.50.0 → <4.53.0 (#623)
- `pytest`: 8.3.* → 8.4.* (#625)
🧑💻 Contributors
Thanks to everyone who contributed to this release!
Special thanks to @bhimrazy, @deependujha, @Borda, and @dependabot.
What's Changed
- 🕒 Add Test Duration Reporting to Pytest in CI by @bhimrazy in #614
- Update bug report template with Lightning Studio sharing instructions by @bhimrazy in #611
- docs: Add documentation for batching methods in CombinedStreamingDataset by @bhimrazy in #609
- fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #615
- chore: suppress FileNotFoundError for locks in downloader classes by @bhimrazy in #617
- Add Dependabot for Pip & GitHub Actions by @Borda in #621
- chore(deps): update pytest requirement from ==8.3.* to ==8.4.* by @dependabot in #625
- chore(deps): bump mosaicml-streaming from 0.8.1 to 0.11.0 by @dependabot in #624
- chore(deps): update transformers requirement from <4.50.0 to <4.53.0 by @dependabot in #623
- chore(deps): bump the gha-updates group with 2 updates by @dependabot in #622
- Feat: add transform support for StreamingDataset by @deependujha in #618
- fix: breaking tqdm progress bar in optimizing dataset by @deependujha in #619
- upd: Optimize test (`test_dataset_for_text_tokens_with_large_num_chunks`) to reduce time consumption by @bhimrazy in #629
- docs: Update documentation for AWS S3 dataset st...
v0.2.49
What's Changed
- Add `ParallelStreamingDataset` by @philgzl in #576
- feat: add support for shared queue for data processing by @deependujha in #602
- Add custom collate function for Getting Started example (resolves the `collate_fn` TypeError) by @bhimrazy in #607
- feat: support Queue-based streaming inputs for optimize via new recipe by @deependujha in #606
- fix: Mark flaky tests to rerun on failure by @bhimrazy in #610
- bump version to 0.2.49 by @bhimrazy in #613
Full Changelog: v0.2.48...v0.2.49
v0.2.48
What's Changed
- readme: update Maintainers by @Borda in #594
- chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet by @bhimrazy in #572
- fix: Move cache warning under debug by @bhimrazy in #598
- Add support for torch.uint16 data type by @bhimrazy in #597
- fix: Add error handling for empty Parquet files while indexing and corresponding tests by @bhimrazy in #601
- fix: boto3 session options by @deependujha in #604
- bump version 0.2.48 by @deependujha in #605
Full Changelog: v0.2.47...v0.2.48
v0.2.47
What's Changed
- feat: Add support for path in map fn by @deependujha in #582
- ci: Add Python 3.11 to CI testing matrix by @bhimrazy in #585
- fix: docs failing in ci by @deependujha in #586
- fix: multi-node parquet indexing by @deependujha in #583
- bump version 0.2.47 by @deependujha in #587
Full Changelog: v0.2.46...v0.2.47
Release v0.2.46
What's Changed
- Feat: Add `per_stream` batching method to CombinedStreamingDataset by @schopra8 in #438
- Fix parquet cache by @philgzl in #560
- refactor: StreamingDataset variable names for better readability by @deependujha in #557
- feat: Add GitHub Actions workflow for `@benchmark` bot by @deependujha in #561
- fix: `@benchmark` bot fixes by @deependujha in #565
- Fix `IndexError` when resuming after some workers are done by @philgzl in #567
- ref: simplify cache dir creation and remove repeated parts by @bhimrazy in #568
- fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #570
- fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets by @bhimrazy in #569
- update readme to include best practices for image data optimization by @bhimrazy in #577
New Contributors
Full Changelog: v0.2.45...v0.2.46
v0.2.45
What's Changed
- Fixes the logic for `is_last_index` by @bhimrazy in #531
- Fix: redundant chunk index download request in BinaryReader, when dataset in iter mode by @bhimrazy in #535
- Update `JPEGSerializer` (deserialize) to return as a tensor and also make `torchvision` a required dependency by @bhimrazy in #541
- nitpick: readme incorrect `transfer` spelling by @deependujha in #543
- Update macos version to 14 in CI by @bhimrazy in #545
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #542
- Fix/last chunk deletion by @bhimrazy in #536
- Add file filtering support to `StreamingDataset` for Parquet datasets by @philgzl in #546
- Add papers with litdata and citation by @tchaton in #547
- Feat/add jpeg array serializer by @bhimrazy in #537
- feat: better debug & profile with logs & Litracer by @deependujha in #528
- add Github reticular repo by @tchaton in #548
- feat: add Litracer docs in readme by @deependujha in #549
- docs: add benchmark speed for r2 by @bhimrazy in #551
- bump: version 0.2.45 by @deependujha in #555
New Contributors
Full Changelog: v0.2.44...v0.2.45
Release v0.2.44
What's Changed
- Remove `.lock` download skipping, skip locks on force download by @JackUrb in #519
- pre-release bump 0.2.44 by @tchaton in #530
Full Changelog: v0.2.43...v0.2.44
v0.2.43
What's Changed
- Fix: resume issues with resuming in combined streaming dataset in dataloader by @bhimrazy in #507
- fix: s3 error by @deependujha in #510
- Fix: unsigned s5cmd requests and also add option to disable s5cmd by @bhimrazy in #513
- Turn on DEBUG logging based on DEBUG_LITDATA environment variable by @ouj in #518
- Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets by @bhimrazy in #505
- feat: correctly propagate storage_options by @deependujha in #514
- fix: remove warnings for Streaming Dataset with hf dataset and shuffle enabled by @bhimrazy in #520
- Revert '#506 Add s5cmd' – as boto3 Outperforms s5cmd in Latest Benchmarks by @bhimrazy in #521
- Upd/hf-dataset-get-format by @bhimrazy in #522
- Update documentation on Streaming Parquet Datasets from Huggingface and other cloud providers by @bhimrazy in #523
- Bump version to 0.2.43 by @bhimrazy in #525
- fix package config by @Borda in #526
- example: sine function model prediction with litdata & pytorch-lightning by @deependujha in #517
- fixing package & releasing by @Borda in #529
Full Changelog: v0.2.42...v0.2.43
Release v0.2.42
What's Changed
- Add register function for downloader by @ouj in #496
- Allow for more lenient state resume. by @JackUrb in #497
- Slightly faster speed by @tchaton in #503
- Add s5cmd by @tchaton in #506
- Feat: add support for gcp by @deependujha in #504
- Bump version 0.2.42 by @tchaton in #508
New Contributors
Full Changelog: v0.2.41...v0.2.42