Releases: huggingface/datasets
3.2.0
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
  - Up to +100% streaming speed
  - Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.

        from datasets import load_dataset

        filters = [('date', '>=', '2023')]
        ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
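To make the idea of predicate pushdown concrete, here is a toy, pure-Python sketch. It is not the library's implementation: the `RowGroup` class, its `min_date`/`max_date` statistics and the `read_filtered` helper are hypothetical names, used only to show how per-row-group min/max statistics let a reader skip data without loading it.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    min_date: str
    max_date: str
    rows: list = field(default_factory=list)


def read_filtered(row_groups, min_wanted):
    """Return rows with date >= min_wanted and the number of row groups actually read."""
    matches, scanned = [], 0
    for group in row_groups:
        # Pushdown step: if even the largest date in the group is below the
        # bound, no row can match, so the group is skipped without reading it.
        if group.max_date < min_wanted:
            continue
        scanned += 1
        matches.extend(row for row in group.rows if row["date"] >= min_wanted)
    return matches, scanned


groups = [
    RowGroup("2021-01-01", "2021-12-31", [{"date": "2021-06-01"}]),
    RowGroup("2022-07-01", "2023-03-31", [{"date": "2022-09-15"}, {"date": "2023-02-01"}]),
    RowGroup("2023-04-01", "2023-12-31", [{"date": "2023-08-20"}]),
]
matches, scanned = read_filtered(groups, "2023")
print(matches, scanned)
```

Only two of the three toy row groups are read here; the first is ruled out by its statistics alone, which is where the saving comes from when the data lives on a remote server.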
Other improvements and bug fixes
- fix conda release workflow by @lhoestq in #7272
- Add link to video dataset by @NielsRogge in #7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in #7273
- support for custom feature encoding/decoding by @alex-hh in #7284
- update load_dataset docstring by @lhoestq in #7301
- Let server decide default repo visibility by @Wauplin in #7302
- fix: update elasticsearch version by @ruidazeng in #7300
- Fix typing in iterable_dataset.py by @lhoestq in #7304
- Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in #7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in #7316
- Release: 3.2.0 by @lhoestq in #7317
New Contributors
- @ruidazeng made their first contribution in #7300
- @sergiopaniego made their first contribution in #7293
Full Changelog: 3.1.0...3.2.0
3.1.0
Dataset Features
- Video support by @lhoestq in #7230

      >>> from datasets import Dataset, Video, load_dataset
      >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
      >>> # or from the hub
      >>> ds = load_dataset("username/dataset_name", split="train")
      >>> ds[0]["video"]
      <decord.video_reader.VideoReader at 0x105525c70>
- Add IterableDataset.shard() by @lhoestq in #7252

      >>> from datasets import load_dataset
      >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
      >>> full_ds.num_shards
      2360
      >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
      >>> ds.num_shards
      1
      >>> ds = full_ds.shard(num_shards=8, index=0)
      >>> ds.num_shards
      295
- Basic XML support by @lhoestq in #7250
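The shard counts in the IterableDataset.shard() example above follow from simple arithmetic: 2360 source shards split 8 ways leaves 295 per shard. A minimal sketch of that counting, assuming source shards are handed out in contiguous blocks; the helper `assign_source_shards` is hypothetical and models only the bookkeeping, not the library's internals.

```python
def assign_source_shards(num_source_shards, num_shards, index):
    """Return the source-shard indices that fall into shard `index` (contiguous split)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    base, extra = divmod(num_source_shards, num_shards)
    # The first `extra` shards each receive one additional source shard.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return list(range(start, stop))


one_of_many = assign_source_shards(2360, num_shards=2360, index=0)
one_of_eight = assign_source_shards(2360, num_shards=8, index=0)
print(len(one_of_many), len(one_of_eight))
```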
What's Changed
- (Super tiny doc update) Mention to_polars by @fzyzcjy in #7232
- [MINOR:TYPO] Update arrow_dataset.py by @cakiki in #7236
- Missing video docs by @lhoestq in #7251
- fix decord import by @lhoestq in #7255
- fix ci for pyarrow 18 by @lhoestq in #7257
- Retry all requests timeouts by @lhoestq in #7256
- Always set non-null writer batch size by @lhoestq in #7258
- Don't embed videos by @lhoestq in #7259
- Allow video with disabled decoding without decord by @lhoestq in #7262
- Small addition to video docs by @lhoestq in #7263
- fix docs relative links by @lhoestq in #7264
- Disallow video push_to_hub by @lhoestq in #7265
New Contributors
Full Changelog: 3.0.2...3.1.0
3.0.2
Main bug fixes
- fix unbatched arrow map for iterable datasets by @alex-hh in #7204
- Support features in metadata configs by @albertvillanova in #7182
- Preserve features in iterable dataset.filter by @alex-hh in #7209
- Pin dill<0.3.9 to fix CI by @albertvillanova in #7184
  - this should also fix cache issues
What's Changed
- Fix release instructions by @albertvillanova in #7177
- Pin multiprocess<0.70.1 to align with dill<0.3.9 by @albertvillanova in #7188
- with_format docstring by @lhoestq in #7203
- fix ci benchmark by @lhoestq in #7205
- Fix the environment variable for huggingface cache by @torotoki in #7200
- Support Python 3.11 by @albertvillanova in #7179
- bump fsspec by @lhoestq in #7219
- Fix typo in image dataset docs by @albertvillanova in #7231
- No need for dataset_info by @lhoestq in #7234
- use huggingface_hub offline mode by @lhoestq in #7244
New Contributors
Full Changelog: 3.0.1...3.0.2
3.0.1
What's Changed
- Modify add_column() to optionally accept a FeatureType as param by @varadhbhatnagar in #7143
- Align filename prefix splitting with WebDataset library by @albertvillanova in #7151
- Support ndjson data files by @albertvillanova in #7154
- Support JSON lines with missing struct fields by @albertvillanova in #7160
- Support JSON lines with empty struct by @albertvillanova in #7162
- fix increase_load_count by @lhoestq in #7165
- fix docstring code example for distributed shuffle by @lhoestq in #7166
- Support JSON lines with missing columns by @albertvillanova in #7170
- Add torchdata as a regular test dependency by @albertvillanova in #7172
New Contributors
- @varadhbhatnagar made their first contribution in #7143
Full Changelog: 3.0.0...3.0.1
3.0.0
Dataset Features
- Use Polars functions in .map()
  - Example:

        >>> import polars as pl
        >>> from datasets import load_dataset
        >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
        >>> cols = [pl.col("content").str.len_bytes().alias("length")]
        >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
        >>> ds_with_length[:5]
        shape: (5, 5)
        ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
        │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
        │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
        │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
        ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
        │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
        │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
        │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
        │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
        │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
        └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
- Support NumPy 2
  - Allow numpy-2.1 and test it without audio extra by @albertvillanova in #7118
Cache Changes
- Use huggingface_hub cache by @lhoestq in #7105
  - use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
  - cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
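The two default locations listed above can be resolved with a few lines of standard-library Python. This is a simplified sketch, not the library's own logic: the helper `default_cache_dirs` is hypothetical, and the assumption that the `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables override the defaults should be checked against the Hugging Face documentation.

```python
import os


def default_cache_dirs(env=None):
    """Resolve the hub cache and the datasets cache directories."""
    env = os.environ if env is None else env
    # Assumption: these three environment variables take precedence over the
    # defaults quoted in the release note (~/.cache/huggingface/...).
    hf_home = env.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    hub = env.get("HF_HUB_CACHE", os.path.join(hf_home, "hub"))
    datasets_cache = env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))
    return os.path.expanduser(hub), os.path.expanduser(datasets_cache)


# Passing an empty mapping ignores the real environment and shows the defaults.
hub_dir, datasets_dir = default_cache_dirs(env={})
print(hub_dir)
print(datasets_dir)
```

Downloaded files and prepared Arrow files end up in different directories, so clearing one does not clear the other.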
Breaking changes
- Remove deprecated code by @albertvillanova in #6996
  - removed deprecated arguments like use_auth_token, fs or ignore_verifications
- Remove beam by @albertvillanova in #6987
  - removed deprecated apache beam datasets support
- Remove metrics by @albertvillanova in #6983
  - remove deprecated load_metric, please use the evaluate library instead
- Remove tasks by @albertvillanova in #6999
  - remove deprecated task argument in load_dataset(), .prepare_for_task() method, datasets.tasks module
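Code that still passes the removed arguments will fail after upgrading. The sketch below shows one way call sites could be audited. The helper `migrate_load_kwargs` is hypothetical and not part of the library. It renames only `use_auth_token` to `token` (the replacement argument in recent 2.x releases) and reports the other removed arguments for manual review.

```python
def migrate_load_kwargs(kwargs):
    """Return (cleaned kwargs, names needing manual review) for a load_dataset() call.

    Hypothetical helper: only use_auth_token -> token is renamed automatically.
    """
    removed = {"use_auth_token", "fs", "ignore_verifications", "task"}
    migrated = dict(kwargs)
    needs_review = []
    for name in sorted(removed.intersection(migrated)):
        value = migrated.pop(name)
        if name == "use_auth_token":
            # Keep an explicit `token` if the caller already passed one.
            migrated.setdefault("token", value)
        else:
            needs_review.append(name)
    return migrated, needs_review


new_kwargs, review = migrate_load_kwargs(
    {"split": "train", "use_auth_token": True, "task": "text-classification"}
)
print(new_kwargs, review)
```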
General improvements and bug fixes
- Improved the tutorial by adding a link for loading datasets by @AmboThom in #7042
- Automatically create cache_dir from cache_file_name by @ringohoffman in #7096
- remove more script docs by @lhoestq in #7104
- Fix args of feature docstrings by @albertvillanova in #7103
- Temporarily pin numpy<2.1 to fix CI by @albertvillanova in #7114
- Fix ConnectionError for gated datasets and unauthenticated users by @albertvillanova in #7110
- Install transformers with numpy-2 CI by @albertvillanova in #7119
- don't mention the script if trust_remote_code=False by @severo in #7120
- Fix typed examples iterable state dict by @lhoestq in #7121
- Rename LargeList.dtype to LargeList.feature by @albertvillanova in #7106
- Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport by @albertvillanova in #7125
- Disable implicit token in CI by @albertvillanova in #7126
- Test get_dataset_config_info with non-existing/gated/private dataset by @albertvillanova in #7124
- fix streaming from arrow files by @fschlatt in #7083
New Contributors
Full Changelog: 2.21.0...3.0.0
2.21.0
Features
- Support pyarrow large_list by @albertvillanova in #7019
  - Support Polars round trip:

        import polars as pl
        from datasets import Dataset

        df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
        df2 = Dataset.from_polars(df1).to_polars()
        assert df1.equals(df2)
What's Changed
- Use HF_HUB_OFFLINE instead of HF_DATASETS_OFFLINE by @Wauplin in #6968
- packaging: Remove useless dependencies by @daskol in #6971
- Fix resuming arrow format by @lhoestq in #6964
- Fix webdataset pickling by @lhoestq in #6972
- Set temporary numpy upper version < 2.0.0 to fix CI by @albertvillanova in #6975
- Fix regression for pandas < 2.0.0 in JSON loader by @albertvillanova in #6978
- Ensure compatibility with numpy 2.0.0 by @KennethEnevoldsen in #6976
- Remove underlines between badges by @novialriptide in #6966
- Update docs on trust_remote_code defaults to False by @albertvillanova in #6981
- Improve skip take shuffling and distributed by @lhoestq in #6965
- Fix tests using hf-internal-testing/librispeech_asr_dummy by @albertvillanova in #6998
- Fix dump of bfloat16 torch tensor by @lhoestq in #7002
- minor fix for bfloat16 by @lhoestq in #7003
- Fix incorrect rank value in data splitting by @yzhangcs in #6994
- less script docs by @lhoestq in #6993
- Fix CI by temporarily pinning ruff < 0.5.0 by @albertvillanova in #7007
- Support ruff 0.5.0 in CI by @albertvillanova in #7009
- Fix WebDatasets KeyError for user-defined Features when a field is missing in an example by @ProGamerGov in #7004
- [Streaming] retry on requests errors by @lhoestq in #6963
- Re-enable raising error from huggingface-hub FutureWarning in CI by @albertvillanova in #7011
- Skip faiss tests on Windows to avoid running CI for 360 minutes by @albertvillanova in #7014
- Support fsspec 2024.6.1 by @albertvillanova in #7017
- Persist IterableDataset epoch in workers by @lhoestq in #6710
- Fix casting list array to fixed size list by @albertvillanova in #7021
- Remove dead code for pyarrow < 15.0.0 by @albertvillanova in #7023
- Fix check_library_imports by @lhoestq in #7026
- Missing line from previous pr by @lhoestq in #7027
- Fix ci by @lhoestq in #7028
- Add decorator as explicit test dependency by @albertvillanova in #7043
- Mark tests that require librosa by @albertvillanova in #7044
- Unblock NumPy 2.0 by @NeilGirdhar in #6991
- Fix tensorflow min version depending on Python version by @albertvillanova in #7045
- Support librosa and numpy 2.0 for Python 3.10 by @albertvillanova in #7046
- add checkpoint and resume title in docs by @lhoestq in #7050
- Update load_hub.mdx by @severo in #7057
- Add batching to IterableDataset by @lappemic in #7054
- Avoid calling http_head for non-HTTP URLs by @albertvillanova in #7062
- Fix load_dataset for data_files with protocols other than HF by @matstrand in #6862
- Add batch method to Dataset class by @lappemic in #7064
- Fix doc generation when NamedSplit is used as parameter default value by @albertvillanova in #7036
- Fix CI by temporarily marking test_convert_to_parquet as expected to fail by @albertvillanova in #7074
- add split argument to Generator by @piercus in #7015
- Update required soxr version from pre-release to release by @albertvillanova in #7075
- Fix CI test_convert_to_parquet by @albertvillanova in #7078
- Fix prepare_single_hop_path_and_storage_options by @albertvillanova in #7068
- Set load_from_disk path type as PathLike by @albertvillanova in #7081
- Fix push_to_hub by not calling create_branch if branch exists by @albertvillanova in #7069
- feat: support non streamable arrow file binary format by @kmehant in #7025
- Support HTTP authentication in non-streaming mode by @albertvillanova in #7082
- chore: fix typos in docs by @hattizai in #7034
- Fix CI for metrics by @albertvillanova in 83e5c05
New Contributors
- @novialriptide made their first contribution in #6966
- @yzhangcs made their first contribution in #6994
- @ProGamerGov made their first contribution in #7004
- @NeilGirdhar made their first contribution in #6991
- @matstrand made their first contribution in #6862
- @lappemic made their first contribution in #7054
- @piercus made their first contribution in #7015
- @kmehant made their first contribution in #7025
- @hattizai made their first contribution in #7034
Full Changelog: 2.20.0...2.21.0
2.20.0
Important
- Remove default trust_remote_code=True by @lhoestq in #6954
  - datasets with a python loading script now require passing trust_remote_code=True to be used
Datasets features
- [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in #6658
  - checkpoint and resume an iterable dataset (e.g. when streaming):

        >>> from datasets import Dataset
        >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
        >>> for idx, example in enumerate(iterable_dataset):
        ...     print(example)
        ...     if idx == 2:
        ...         state_dict = iterable_dataset.state_dict()
        ...         print("checkpoint")
        ...         break
        >>> iterable_dataset.load_state_dict(state_dict)
        >>> print("restart from checkpoint")
        >>> for example in iterable_dataset:
        ...     print(example)

    Returns:

        {'a': 0}
        {'a': 1}
        {'a': 2}
        checkpoint
        restart from checkpoint
        {'a': 3}
        {'a': 4}
        {'a': 5}
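The checkpoint-and-resume pattern above can be illustrated without the library. The following toy class is a sketch only: `ResumableRange` is a hypothetical name, and its single-integer state is far simpler than whatever the real IterableDataset tracks across shards. It shows the contract: `state_dict()` captures the position, and `load_state_dict()` makes the next pass start from there.

```python
class ResumableRange:
    """Toy iterable over {"a": 0..n-1} that can checkpoint and resume its position."""

    def __init__(self, n):
        self.n = n
        self._position = 0     # index of the next example not yet yielded
        self._resume_from = 0  # where the next pass should start

    def __iter__(self):
        # In this toy, a loaded checkpoint applies to the next pass only.
        start, self._resume_from = self._resume_from, 0
        for i in range(start, self.n):
            self._position = i + 1
            yield {"a": i}

    def state_dict(self):
        return {"position": self._position}

    def load_state_dict(self, state):
        self._resume_from = state["position"]


ds = ResumableRange(6)
seen_before = []
for idx, example in enumerate(ds):
    seen_before.append(example)
    if idx == 2:
        state = ds.state_dict()  # checkpoint after three examples
        break

ds.load_state_dict(state)
seen_after = list(ds)  # continues with the examples not yet seen
print(seen_before, seen_after)
```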
General improvements and bug fixes
- Add docs about the CLI by @albertvillanova in #6831
- Remove token arg from CLI examples by @albertvillanova in #6839
- Allow deleting a subset/config from a no-script dataset by @albertvillanova in #6820
- Fix line-endings in tests on Windows by @albertvillanova in #6857
- Fix CI by temporarily pinning huggingface-hub < 0.23.0 by @albertvillanova in #6861
- Fix dataset name for community Hub script-datasets by @albertvillanova in #6855
- Update tqdm >= 4.66.3 to fix vulnerability by @albertvillanova in #6870
- Fix download for dict of dicts of URLs by @albertvillanova in #6871
- Set dev version by @albertvillanova in #6873
- Shorten long logs by @lhoestq in #6875
- Support jax 0.4.27 in CI tests by @albertvillanova in #6885
- Close gzipped files properly by @lhoestq in #6893
- Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in #6902
- Fix YAML error in README files appearing on GitHub by @albertvillanova in #6898
- Document that to_json defaults to JSON Lines by @albertvillanova in #6895
- Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in #6883
- Create function to convert to parquet by @albertvillanova in #6878
- Update features.py to avoid bfloat16 unsupported error by @skaulintel in #6607
- Fix decoding multi part extension by @lhoestq in #6904
- Use pandas ujson in JSON loader to improve performance by @albertvillanova in #6874
- Update requests >=2.32.1 to fix vulnerability by @albertvillanova in #6909
- Fix wrong type hints in data_files by @albertvillanova in #6910
- Remove dead code for non-dict data_files from packaged modules by @albertvillanova in #6911
- Support fsspec 2024.5.0 by @albertvillanova in #6921
- Remove torchaudio remnants from code by @albertvillanova in #6922
- [WebDataset] Add .pth support for torch tensors by @lhoestq in #6920
- Unpin hfh by @lhoestq in #6876
- Preserve JSON column order and support list of strings field by @albertvillanova in #6914
- [WebDataset] Support compressed files by @lhoestq in #6931
- update ci user by @lhoestq in #6933
- Revert ci user by @lhoestq in #6934
- Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in #6925
- Set dev version by @albertvillanova in #6944
- Update yanked version of minimum requests requirement by @albertvillanova in #6945
- Re-enable import sorting disabled by flake8:noqa directive when using ruff linter by @albertvillanova in #6946
- Update dataset_dict.py by @Arunprakash-A in #6932
- Update process.mdx: Code Listings Fixes by @FadyMorris in #6928
- Fix small typo by @marcenacp in #6955
- update docs on N-dim arrays by @lhoestq in #6956
- Fix typos in docs by @albertvillanova in #6957
- Validate config name and data_files in packaged modules by @albertvillanova in #6915
- Add support for categorical/dictionary types by @EthanSteinberg in #6892
- feat(ci): add trufflehog secrets detection by @McPatate in #6960
- Better error handling in dataset_module_factory by @Wauplin in #6959
- Move info_utils errors to exceptions module by @albertvillanova in #6952
- fix(ci): remove unnecessary permissions by @McPatate in #6962
New Contributors
- @skaulintel made their first contribution in #6607
- @Arunprakash-A made their first contribution in #6932
- @FadyMorris made their first contribution in #6928
- @marcenacp made their first contribution in #6955
- @EthanSteinberg made their first contribution in #6892
- @McPatate made their first contribution in #6960
Full Changelog: 2.19.0...2.20.0
2.19.2
Bug fixes
- Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in #6902
- Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in #6883
- Update requests >=2.32.1 to fix vulnerability by @albertvillanova in #6909
- Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in #6925
Full Changelog: 2.19.1...2.19.2
2.19.1
Bug fixes
- Fix download for dict of dicts of URLs by @albertvillanova in #6871
Full Changelog: 2.19.0...2.19.1
2.19.0
Dataset Features
- Add Polars compatibility by @psmyth94 in #6531
  - convert to a Polars dataframe using .to_polars():

        import polars as pl
        from datasets import load_dataset

        ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
        ds.to_polars() \
            .group_by("topic") \
            .agg(pl.len(), pl.first()) \
            .sort("len", descending=True)

  - Use Polars formatting to return Polars objects when accessing a dataset:

        ds = ds.with_format("polars")
        ds[:10].group_by("kind").len()
- Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096
  - Save on HF in any file format:

        ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
        ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
        ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
- Add mode parameter to Image feature by @mariosasko in #6735
  - Set images to be read in a certain mode like "RGB"

        dataset = dataset.cast_column("image", Image(mode="RGB"))
- Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
  - run command to open a PR in script-based dataset to convert it to Parquet:

        datasets-cli convert_to_parquet <dataset_id>
- Add Dataset.take and Dataset.skip by @lhoestq in #6813
  - same as IterableDataset.take and IterableDataset.skip

        ds = ds.take(10)  # take only the first 10 examples
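The take/skip semantics mentioned above can be sketched with the standard library alone. The `take` and `skip` functions below are plain-Python analogues written for illustration, not the library's methods: one keeps the first n examples, the other drops them and keeps the rest.

```python
from itertools import islice


def take(examples, n):
    """Keep only the first n examples."""
    return list(islice(examples, n))


def skip(examples, n):
    """Drop the first n examples and keep the rest."""
    return list(islice(examples, n, None))


rows = [{"id": i} for i in range(5)]
first_two = take(rows, 2)
rest = skip(rows, 2)
print(first_two, rest)
```

Used with the same n, the two calls partition the data, which makes them a quick way to carve a small held-out slice off the front of a dataset.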
General improvements and bug fixes
- Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in #6713
- fix CastError pickling by @lhoestq in #6712
- Expand no-code dataset info with datasets-server info by @mariosasko in #6714
- Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in #6715
- Fix concurrent script loading with force_redownload by @lhoestq in #6718
- get_dataset_default_config_name docstring by @lhoestq in #6723
- Deprecate Beam API and download from HF GCS bucket by @mariosasko in #6474
- Deprecate Pandas builder by @mariosasko in #6730
- Using a registry instead of calling globals for fetching feature types by @psmyth94 in #6727
- Update torch_formatter.py by @VarunNSrivastava in #6402
- Improve default patterns resolution by @mariosasko in #6704
- Transpose images with EXIF Orientation tag by @mariosasko in #6739
- Fix missing download_config in get_data_patterns by @lhoestq in #6742
- Allow null values in dict columns by @mariosasko in #6743
- Fix fsspec tqdm callback by @lhoestq in #6749
- chore(deps): bump fsspec by @shcheklein in #6747
- Fix offline mode with single config by @lhoestq in #6741
- Remove deprecated code by @Wauplin in #6761
- fixing the issue 6755(small typo) by @JINO-ROHIT in #6767
- remove_columns/rename_columns doc fixes by @mariosasko in #6772
- Fix CI by @mariosasko in #6780
- rename datasets-server to dataset-viewer by @severo in #6785
- Install dependencies with uv in CI by @mariosasko in #6779
- Fix cache conflict in _check_legacy_cache2 by @lhoestq in #6792
- Fix typo in docs (upload CLI) by @Wauplin in #6802
- fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in #6799
- #6791 Improve type checking around FAISS by @Dref360 in #6803
- Fix --repo-type order in cli upload docs by @lhoestq in #6804
- Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in #6806
- Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in #6754
- Multithreaded downloads by @lhoestq in #6794
- Remove os.path.relpath in resolve_patterns by @mariosasko in #6815
- Extract data on the fly in packaged builders by @mariosasko in #6784
- add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in #6811
- Support indexable objects in Dataset.__getitem__ by @mariosasko in #6817
- Make convert_to_parquet CLI command create script branch by @albertvillanova in #6809
- Fix parquet export infos by @lhoestq in #6822
New Contributors
- @VarunNSrivastava made their first contribution in #6402
- @shcheklein made their first contribution in #6747
- @JINO-ROHIT made their first contribution in #6767
- @JonasLoos made their first contribution in #6799
- @izhx made their first contribution in #6754
- @Modexus made their first contribution in #6811
Full Changelog: 2.18.0...2.19.0