Releases: Lightning-AI/litData
LitData v0.2.51
Lightning AI ⚡ is excited to announce the release of LitData v0.2.51
Highlights
Stream Raw Datasets from Cloud Storage (Beta)
Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.
```python
from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass
```
Benchmarks
Streaming speed for raw ImageNet (1.2M images) from cloud storage:
| Storage | Images/s (No Transform) | Images/s (With Transform) |
|---|---|---|
| AWS S3 | ~6,400 ± 100 | ~3,200 ± 100 |
| Google Cloud Storage | ~5,650 ± 100 | ~3,100 ± 100 |
Note: Use `StreamingRawDataset` for direct data streaming. Opt for `StreamingDataset` for maximum speed with pre-optimized data.
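Because `StreamingRawDataset` returns files in their original format, decoding happens on the consumer side. A minimal sketch, assuming the bucket holds JPEG images and that each item arrives as raw `bytes` (the `decode_jpeg` helper is illustrative, not part of the litdata API):

```python
import io

from PIL import Image
from torch.utils.data import DataLoader

from litdata.streaming.raw_dataset import StreamingRawDataset


def decode_jpeg(raw: bytes) -> Image.Image:
    """Hypothetical helper: decode one raw JPEG payload into a PIL image."""
    return Image.open(io.BytesIO(raw)).convert("RGB")


dataset = StreamingRawDataset("s3://bucket/files/")
# collate_fn=list keeps the raw payloads as a plain Python list per batch
loader = DataLoader(dataset, batch_size=32, collate_fn=list)

for batch in loader:
    images = [decode_jpeg(item) for item in batch]  # decode each raw payload
```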
Resume ParallelStreamingDataset
The `ParallelStreamingDataset` now supports a `resume` option, allowing you to seamlessly continue training from the previous epoch's state when cycling through datasets. Enable it with `resume=True` to avoid restarting at index 0 each epoch, ensuring consistent sample progression across epochs.
```python
from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)

loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass
```
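The effect of `resume=True` shows up across epochs: the second pass over the cycled dataset continues from where the first stopped instead of restarting at index 0. A minimal two-epoch sketch, assuming `dataset1` and `dataset2` are existing `StreamingDataset` instances:

```python
from litdata.streaming.parallel import ParallelStreamingDataset

# Cycle through both datasets, drawing 100 samples per epoch.
dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)

for epoch in range(2):
    for sample in dataset:
        # With resume=True, the second epoch picks up from the samples
        # that follow the ones consumed in the first epoch.
        pass
```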
Per-Dataset Batch Sizes in CombinedStreamingDataset
The `CombinedStreamingDataset` now supports per-dataset batch sizes when using `batching_method="per_stream"`. Specify unique batch sizes for each dataset using `set_batch_size()` with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.
```python
from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123,
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass
```
Changes
Added
- Added support for setting cache directory via `LITDATA_CACHE_DIR` environment variable (#639 by @deependujha)
- Added CLI option to clear default cache (#627 by @deependujha)
- Added resume support to `ParallelStreamingDataset` (#650 by @philgzl)
- Added `verbose` option to `optimize_fn` (#654 by @deependujha)
- Added support for multiple `transform_fn` in `StreamingDataset` (#655 by @deependujha)
- Enabled per-dataset batch size support in `CombinedStreamingDataset` (#635 by @MagellaX)
- Added support for `StreamingRawDataset` to stream raw datasets from cloud storage (#652 by @bhimrazy)
- Added GCP support for directory resolution in `resolve_dir` (#659 by @bhimrazy)
Changed
- Cleaned up logic in `_loop` by removing hacky index assignment (#640 by @deependujha)
- Updated CODEOWNERS (#646 by @Borda)
- Switched to `astral-sh/setup-uv` for Python setup and used `uv pip` for package installation (#656 by @bhimrazy)
- Replaced PIL with torchvision's `decode_image` for more robust JPEG deserialization (#660 by @bhimrazy)
Fixed
Chores
- Bumped `cryptography` from 42.0.8 to 45.0.4 (#644 by @dependabot[bot])
- Updated `numpy` requirement from <2.0 to <3.0 (#645 by @dependabot[bot])
- Bumped `pytest-timeout` from 2.3.1 to 2.4.0 (#643 by @dependabot[bot])
- Applied pre-commit suggestions & bumped Python to 3.9 (#653 by @pre-commit-ci[bot])
- Bumped `actions/first-interaction` from 1 to 2 in GitHub Actions updates (#657 by @dependabot[bot])
- Bumped version to 0.2.51 (#664 by @bhimrazy)
Full Changelog: v0.2.50...v0.2.51
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy, @philgzl
New Contributors
- @lukemerrick made their first contribution in #647
- @MagellaX made their first contribution in #635
Thank you ❤️ and we hope you'll keep them coming!
litData v0.2.50: Fast Random Access & S3 Improvements 🧪⚡️
Lightning AI is excited to announce the release of litData
v0.2.50, a lightweight and powerful streaming data library designed for fast AI model training.
This release focuses on improving the developer experience and performance for streamed datasets, with a particular focus on:
- Faster random access support
- Transform hooks for datasets
- Better S3 interoperability
- CI stability and performance improvements
👉 Check out the full changelog here: Compare v0.2.49...v0.2.50
🚀 Highlights
🔄 Fast Random Access (No Chunk Download Needed)
You can now access samples randomly from remote datasets without downloading entire chunks, dramatically reducing IO overhead during sparse reads.
This is especially useful for visualization tools or quickly inspecting your dataset without requiring full downloads.
🚀 Benchmark (on Lightning Studio, chunk size: 64MB)

10 random accesses:
- 🔹 `v0.2.49`: 20–22 seconds
- 🔹 `v0.2.50`: 5–6 seconds

The benchmark was designed to ensure enough separation between accesses, avoiding repeated reads from the same chunk.

Single item access:
- 🔹 `v0.2.49`: ~2 seconds
- 🔹 `v0.2.50`: ~0.83 seconds
Sample code

```python
import litdata as ld

uri = "gs://litdata-gcp-bucket/optimized_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# When accessing random indices, check `my_cache`: it shouldn't download chunks
for i in range(0, 1000, 100):
    print(i, ds[i])

# It should download chunks now
for data in ds:
    print(data)
```
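For reference, a timing loop in the spirit of the benchmark above might look like this sketch, reusing the `ds` object from the sample code (the stride is assumed large enough that consecutive indices land in different chunks):

```python
import time

start = time.perf_counter()
for i in range(0, 1000, 100):  # 10 accesses, spaced apart to avoid re-reading the same chunk
    _ = ds[i]
print(f"10 random accesses took {time.perf_counter() - start:.2f}s")
```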
🧩 Transform Support in StreamingDataset
You can now apply transforms to samples in `StreamingDataset` and `CombinedStreamingDataset`.
There are two supported ways to use it:
- Pass a transform function when initializing the dataset:
```python
from torchvision import transforms

from litdata import StreamingDataset

# Define a simple transform function
torch_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # Resize to 256x256
    transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
    transforms.Normalize(            # Normalize using ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])


def transform_fn(x, *args, **kwargs):
    """Define your transform function."""
    return torch_transform(x)  # Apply the transform to the input image


# Create dataset with appropriate configuration
dataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=transform_fn)
```
- Subclass and override the `transform` method:
```python
class StreamingDatasetWithTransform(StreamingDataset):
    """A custom dataset class that inherits from StreamingDataset and applies a transform."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.torch_transform = transforms.Compose([
            transforms.Resize((256, 256)),   # Resize to 256x256
            transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
            transforms.Normalize(            # Normalize using ImageNet stats
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    # Define your transform method
    def transform(self, x, *args, **kwargs):
        """A simple transform function."""
        return self.torch_transform(x)


dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)
```
This makes it easier to insert preprocessing logic directly into the streaming pipeline.
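Either variant plugs straight into a dataloader. A minimal usage sketch, assuming the stored samples are images that the transform above converts to fixed-size tensors:

```python
from torch.utils.data import DataLoader

# `dataset` is either of the transformed datasets defined above.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    # Each batch stacks into a tensor of shape (32, 3, 256, 256),
    # since every sample is resized and converted by the transform.
    pass
```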
📖 AWS S3 Streaming Docs (with `boto3` & unsigned requests example)
The documentation now includes a clear example of how to stream datasets from AWS S3 using `boto3`, including support for unsigned requests. It also prioritizes `boto3` in the list of options for better clarity.
```python
import botocore
from litdata import StreamingDataset

storage_options = {
    "config": botocore.config.Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=botocore.UNSIGNED,
    )
}
dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)
```
📖 Batching Methods in CombinedStreamingDataset
The `CombinedStreamingDataset` supports two different batching methods through the `batching_method` parameter:
Stratified Batching (Default):
With `batching_method="stratified"` (the default), each batch contains samples from multiple datasets according to the specified weights:

```python
# Default stratified batching - batches mix samples from all datasets
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="stratified",  # This is the default
)
```
Per-Stream Batching:
With `batching_method="per_stream"`, each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:

```python
# Per-stream batching - each batch contains samples from only one dataset
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="per_stream",
)
```
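A sketch of how per-stream batching plays out at the dataloader level, assuming `dataset1` and `dataset2` are existing `StreamingDataset` instances with different sample shapes, and using litdata's `StreamingDataLoader`:

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader

combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="per_stream",
)

# Each yielded batch comes from a single underlying dataset, so samples
# within a batch share the same structure and collate without padding.
loader = StreamingDataLoader(combined_dataset, batch_size=16)
for batch in loader:
    pass
```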
🐛 Bug Fixes
- Fixed breaking `tqdm` progress bar in optimizing dataset
- Suppressed multiple `lightning-sdk` warnings
🧪 Testing & CI
- Python 3.12 and 3.13 now supported in CI matrix (#589)
- Test durations now logged for debugging (#614)
- Added missing CI dependencies (#634)
- Refactored large, slow tests to reduce CI runtime (#629, #632)
📎 Minor Improvements
- Updated bug report template for easier Lightning Studio reproduction (#611)
📦 Dependency Updates
- `mosaicml-streaming`: 0.8.1 → 0.11.0 (#624)
- `transformers`: <4.50.0 → <4.53.0 (#623)
- `pytest`: 8.3.* → 8.4.* (#625)
🧑💻 Contributors
Thanks to everyone who contributed to this release!
Special thanks to @bhimrazy, @deependujha, @Borda, and @dependabot.
What's Changed
- 🕒 Add Test Duration Reporting to Pytest in CI by @bhimrazy in #614
- Update bug report template with Lightning Studio sharing instructions by @bhimrazy in #611
- docs: Add documentation for batching methods in CombinedStreamingDataset by @bhimrazy in #609
- fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #615
- chore: suppress FileNotFoundError for locks in downloader classes by @bhimrazy in #617
- Add Dependabot for Pip & GitHub Actions by @Borda in #621
- chore(deps): update pytest requirement from ==8.3.* to ==8.4.* by @dependabot in #625
- chore(deps): bump mosaicml-streaming from 0.8.1 to 0.11.0 by @dependabot in #624
- chore(deps): update transformers requirement from <4.50.0 to <4.53.0 by @dependabot in #623
- chore(deps): bump the gha-updates group with 2 updates by @dependabot in #622
- Feat: add transform support for StreamingDataset by @deependujha in #618
- fix: breaking tqdm progress bar in optimizing dataset by @deependujha in #619
- upd: Optimize test (`test_dataset_for_text_tokens_with_large_num_chunks`) to reduce time consumption by @bhimrazy in #629
- docs: Update documentation for AWS S3 dataset st...
v0.2.49
What's Changed
- Add `ParallelStreamingDataset` by @philgzl in #576
- feat: add support for shared queue for data processing by @deependujha in #602
- Add custom collate function for Getting Started example (resolves the `collate_fn` TypeError) by @bhimrazy in #607
- feat: support Queue-based streaming inputs for optimize via new recipe by @deependujha in #606
- fix: Mark flaky tests to rerun on failure by @bhimrazy in #610
- bump version to 0.2.49 by @bhimrazy in #613
Full Changelog: v0.2.48...v0.2.49
v0.2.48
What's Changed
- readme: update Maintainers by @Borda in #594
- chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet by @bhimrazy in #572
- fix: Move cache warning under debug by @bhimrazy in #598
- Add support for torch.uint16 data type by @bhimrazy in #597
- fix: Add error handling for empty Parquet files while indexing and corresponding tests by @bhimrazy in #601
- fix: boto3 session options by @deependujha in #604
- bump version 0.2.48 by @deependujha in #605
Full Changelog: v0.2.47...v0.2.48
v0.2.47
What's Changed
- feat: Add support for path in map fn by @deependujha in #582
- ci: Add Python 3.11 to CI testing matrix by @bhimrazy in #585
- fix: docs failing in ci by @deependujha in #586
- fix: multi-node parquet indexing by @deependujha in #583
- bump version 0.2.47 by @deependujha in #587
Full Changelog: v0.2.46...v0.2.47
Release v0.2.46
What's Changed
- Feat: Add `per_stream` batching method to CombinedStreamingDataset by @schopra8 in #438
- Fix parquet cache by @philgzl in #560
- refactor: StreamingDataset variable names for better readability by @deependujha in #557
- feat: Add GitHub Actions workflow for `@benchmark` bot by @deependujha in #561
- fix: `@benchmark` bot fixes by @deependujha in #565
- Fix `IndexError` when resuming after some workers are done by @philgzl in #567
- ref: simplify cache dir creation and remove repeated parts by @bhimrazy in #568
- fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #570
- fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets by @bhimrazy in #569
- update readme to include best practices for image data optimization by @bhimrazy in #577
New Contributors
Full Changelog: v0.2.45...v0.2.46
v0.2.45
What's Changed
- Fixes the logic for `is_last_index` by @bhimrazy in #531
- Fix: redundant chunk index download request in BinaryReader, when dataset in iter mode by @bhimrazy in #535
- Update `JPEGSerializer` (deserialize) to return as a tensor and also make `torchvision` a required dependency by @bhimrazy in #541
- nitpick: readme incorrect `transfer` spelling by @deependujha in #543
- Update macos version to 14 in CI by @bhimrazy in #545
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #542
- Fix/last chunk deletion by @bhimrazy in #536
- Add file filtering support to `StreamingDataset` for Parquet datasets by @philgzl in #546
- Add papers with litdata and citation by @tchaton in #547
- Feat/add jpeg array serializer by @bhimrazy in #537
- feat: better debug & profile with logs & Litracer by @deependujha in #528
- add Github reticular repo by @tchaton in #548
- feat: add Litracer docs in readme by @deependujha in #549
- docs: add benchmark speed for r2 by @bhimrazy in #551
- bump: version 0.2.45 by @deependujha in #555
New Contributors
Full Changelog: v0.2.44...v0.2.45
Release v0.2.44
What's Changed
- Remove `.lock` download skipping, skip locks on force download by @JackUrb in #519
- pre-release bump 0.2.44 by @tchaton in #530
Full Changelog: v0.2.43...v0.2.44
v0.2.43
What's Changed
- Fix: resume issues with resuming in combined streaming dataset in dataloader by @bhimrazy in #507
- fix: s3 error by @deependujha in #510
- Fix: unsigned s5cmd requests and also add option to disable s5cmd by @bhimrazy in #513
- Turn on DEBUG logging based on DEBUG_LITDATA environment variable by @ouj in #518
- Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets by @bhimrazy in #505
- feat: correctly propagate storage_options by @deependujha in #514
- fix: remove warnings for Streaming Dataset with hf dataset and shuffle enabled by @bhimrazy in #520
- Revert '#506 Add s5cmd' – as boto3 Outperforms s5cmd in Latest Benchmarks by @bhimrazy in #521
- Upd/hf-dataset-get-format by @bhimrazy in #522
- Update documentation on Streaming Parquet Datasets from Huggingface and other cloud providers by @bhimrazy in #523
- Bump version to 0.2.43 by @bhimrazy in #525
- fix package config by @Borda in #526
- example: sine function model prediction with litdata & pytorch-lightning by @deependujha in #517
- fixing package & releasing by @Borda in #529
Full Changelog: v0.2.42...v0.2.43
Release v0.2.42
What's Changed
- Add register function for downloader by @ouj in #496
- Allow for more lenient state resume. by @JackUrb in #497
- Slightly faster speed by @tchaton in #503
- Add s5cmd by @tchaton in #506
- Feat: add support for gcp by @deependujha in #504
- Bump version 0.2.42 by @tchaton in #508
New Contributors
Full Changelog: v0.2.41...v0.2.42