Releases: huggingface/datasets

3.2.0

10 Dec 17:00
fba4758

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
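Predicate pushdown works because each Parquet row group stores per-column min/max statistics, so whole row groups can be skipped without downloading them when their range cannot match the filter. A pure-Python sketch of that decision, with illustrative names and data (not the datasets internals):

```python
# Each Parquet row group carries min/max stats per column; a reader can
# discard groups whose [min, max] range cannot satisfy the predicate.
row_groups = [
    {"date_min": "2021", "date_max": "2022", "rows": 1000},
    {"date_min": "2022", "date_max": "2023", "rows": 1000},
    {"date_min": "2023", "date_max": "2024", "rows": 1000},
]

def matching_groups(groups, column, op, value):
    """Keep only row groups whose stats range can satisfy `column op value`."""
    assert op == ">="  # only the operator used in the example above
    return [g for g in groups if g[f"{column}_max"] >= value]

kept = matching_groups(row_groups, "date", ">=", "2023")
print(len(kept))  # 2 of 3 row groups survive; the first is never downloaded
```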

Other improvements and bug fixes

New Contributors

Full Changelog: 3.1.0...3.2.0

3.1.0

31 Oct 15:21
dfb52e2

Dataset Features

  • Video support by @lhoestq in #7230
    >>> from datasets import Dataset, Video, load_dataset
    >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
    >>> # or from the hub
    >>> ds = load_dataset("username/dataset_name", split="train")
    >>> ds[0]["video"]
    <decord.video_reader.VideoReader at 0x105525c70>
  • Add IterableDataset.shard() by @lhoestq in #7252
    >>> from datasets import load_dataset
    >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
    >>> full_ds.num_shards
    2360
    >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
    >>> ds.num_shards
    1
    >>> ds = full_ds.shard(num_shards=8, index=0)
    >>> ds.num_shards
    295
  • Basic XML support by @lhoestq in #7250
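The `IterableDataset.shard()` numbers above come from splitting the dataset's 2360 underlying data shards (files) into contiguous groups, without reading any data. A stdlib sketch of such a balanced contiguous split (illustrative, not the library's internal code):

```python
def shard_indices(total_shards: int, num_shards: int, index: int) -> range:
    """Contiguous slice of data-shard indices assigned to shard `index`."""
    # spread the remainder over the first shards so the split stays balanced
    div, mod = divmod(total_shards, num_shards)
    start = index * div + min(index, mod)
    stop = start + div + (1 if index < mod else 0)
    return range(start, stop)

print(len(shard_indices(2360, 8, 0)))     # 295, matching ds.num_shards above
print(len(shard_indices(2360, 2360, 0)))  # 1
```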

What's Changed

New Contributors

Full Changelog: 3.0.2...3.1.0

3.0.2

22 Oct 15:03
97e5e17

Main bug fixes

What's Changed

New Contributors

Full Changelog: 3.0.1...3.0.2

3.0.1

26 Sep 08:27
679562d

What's Changed

New Contributors

Full Changelog: 3.0.0...3.0.1

3.0.0

11 Sep 13:50
3505ed9

Dataset Features

  • Use Polars functions in .map()
    • Allow Polars as valid output type by @psmyth94 in #6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
  • Support NumPy 2
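For context on the batched `.map()` with Polars output above: the mapped function receives a whole batch and returns a whole batch, there as a Polars DataFrame. The same length-in-bytes transformation on a plain dict-of-lists batch, as a dependency-free stand-in (`add_length` is an illustrative helper, not the Polars code path):

```python
def add_length(batch: dict) -> dict:
    """Batched map: takes a batch (dict of lists), returns a batch with a new column."""
    batch["length"] = [len(text.encode("utf-8")) for text in batch["content"]]
    return batch

batch = {"content": ["Bulbasaur embarked on a sunny quest", "Pikachu"]}
print(add_length(batch)["length"])  # [35, 7]
```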

Cache Changes

  • Use huggingface_hub cache by @lhoestq in #7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
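Assuming the standard `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables that override these locations, the two defaults can be resolved like this (a minimal sketch; `cache_dir` is an illustrative helper, not a library function):

```python
import os

def cache_dir(env_var: str, default_subdir: str) -> str:
    """Resolve a cache location: env var override, else the default under ~/.cache/huggingface."""
    default = os.path.join(os.path.expanduser("~/.cache/huggingface"), default_subdir)
    return os.environ.get(env_var, default)

print(cache_dir("HF_HUB_CACHE", "hub"))           # downloaded files from the Hub
print(cache_dir("HF_DATASETS_CACHE", "datasets")) # prepared Arrow datasets
```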

Breaking changes

  • Remove deprecated code by @albertvillanova in #6996
    • removed deprecated arguments like use_auth_token, fs or ignore_verifications
  • Remove beam by @albertvillanova in #6987
    • removed deprecated Apache Beam datasets support
  • Remove metrics by @albertvillanova in #6983
    • removed deprecated load_metric; use the evaluate library instead
  • Remove tasks by @albertvillanova in #6999
    • removed the deprecated task argument in load_dataset(), the .prepare_for_task() method, and the datasets.tasks module

General improvements and bug fixes

New Contributors

Full Changelog: 2.21.0...3.0.0

2.21.0

14 Aug 08:08
a1b5a32

Features

  • Support pyarrow large_list by @albertvillanova in #7019
    • Support Polars round trip:
      import polars as pl
      from datasets import Dataset
      
      df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
      df2 = Dataset.from_polars(df1).to_polars()
      assert df1.equals(df2)

What's Changed

New Contributors

Full Changelog: 2.20.0...2.21.0

2.20.0

13 Jun 14:57
98fdc9e

Important

  • Remove default trust_remote_code=True by @lhoestq in #6954
    • Datasets with a Python loading script now require passing trust_remote_code=True to be used

Datasets features

  • [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in #6658
    • checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> from datasets import Dataset
      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print("restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)

      Output:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
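The pattern behind `state_dict()`/`load_state_dict()` can be sketched in plain Python: record how far iteration has progressed, then resume from that position instead of from the start (an illustrative class, not the library implementation, which tracks progress per shard):

```python
class ResumableRange:
    """Toy resumable iterable over {"a": 0..n-1} with state_dict/load_state_dict."""

    def __init__(self, n: int):
        self.n = n
        self.position = 0  # how many examples have been yielded

    def __iter__(self):
        # resume from the saved position instead of 0
        for i in range(self.position, self.n):
            self.position = i + 1
            yield {"a": i}

    def state_dict(self) -> dict:
        return {"position": self.position}

    def load_state_dict(self, state: dict) -> None:
        self.position = state["position"]

ds = ResumableRange(6)
seen = []
for idx, example in enumerate(ds):
    seen.append(example["a"])
    if idx == 2:
        state = ds.state_dict()  # checkpoint after 3 examples
        break
ds.load_state_dict(state)        # restart from checkpoint
seen += [example["a"] for example in ds]
print(seen)  # [0, 1, 2, 3, 4, 5]
```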
      

General improvements and bug fixes

New Contributors

Full Changelog: 2.19.0...2.20.0

2.19.2

03 Jun 05:26

Bug fixes

  • Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in #6902
  • Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in #6883
  • Update requests >=2.32.1 to fix vulnerability by @albertvillanova in #6909
  • Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in #6925

Full Changelog: 2.19.1...2.19.2

2.19.1

06 May 09:40
bb2664c

Bug fixes

Full Changelog: 2.19.0...2.19.1

2.19.0

19 Apr 08:46
0d3c746

Dataset Features

  • Add Polars compatibility by @psmyth94 in #6531
    • Convert to a Polars DataFrame using .to_polars():
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
  • Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096
    • Save on HF in any file format:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
  • Add mode parameter to Image feature by @mariosasko in #6735
    • Set images to be read in a certain mode like "RGB"
      dataset = dataset.cast_column("image", Image(mode="RGB"))
  • Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
    • run this command to open a PR on a script-based dataset to convert it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
      
  • Add Dataset.take and Dataset.skip by @lhoestq in #6813
    • same as IterableDataset.take and IterableDataset.skip
      ds = ds.take(10)  # take only the first 10 examples
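`take` and `skip` select a contiguous prefix or remainder of examples, much like `itertools.islice` on an iterable. A stdlib sketch of the semantics (hypothetical helpers, not the library code):

```python
from itertools import islice

data = list(range(100))  # stand-in for a dataset's examples

def take(examples, n):
    """First n examples, like ds.take(n)."""
    return list(islice(examples, n))

def skip(examples, n):
    """Everything after the first n examples, like ds.skip(n)."""
    return list(islice(examples, n, None))

print(take(data, 10))       # first 10 examples
print(len(skip(data, 10)))  # 90 remaining
```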

General improvements and bug fixes

New Contributors

Full Changelog: 2.18.0...2.19.0