datasets

v5.0.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 8h RAG & Retrieval
✓ No known CVEs patched

Topics

ai artificial-intelligence computer-vision dataset-hub datasets machine-learning huggingface llm natural-language-processing nlp numpy pandas pytorch speech tensorflow

ReleasePort's take

Moderate signal
editorial:auto 6h

IterableDataset.shuffle() now shuffles across multiple input shards, altering prior behavior.

Why it matters: The change in shuffle semantics can break pipelines that rely on the previous ordering; test any code using IterableDataset.shuffle().

Summary

AI summary

Adds new supported formats and agent trace parsing, changes the default streaming shuffle (breaking), and ships assorted bug fixes.

Changes in this release

Breaking High

Default shuffling now uses multiple input shards, breaking previous behavior.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Medium

Adds `batch(by_column=...)` method for grouping rows by a column.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Medium

Adds Apache Iceberg format support.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Medium

Adds 3D mesh support and MeshFolder builder.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Medium

Adds .conll and .conllu dataset format loaders.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Medium

Adds `num_proc` argument to `Dataset.to_sql`.

Source: llm_adapter@2026-06-05

Confidence: high

Feature Low

Parses Agent traces messages for SFT using the optional `teich` library, enabling loading and training on agent trace datasets.

Source: granite4.1:30b@2026-06-05-audit

Confidence: low

Dependency Low

Supports fsspec version 2026.4.0.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Fixes Parquet streaming hangs at script end.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Fixes storage_options lookup for streaming Lance datasets.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Fixes iterable skip over full Arrow blocks.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Low

Fixes Parquet columns argument handling.

Source: granite4.1:30b@2026-06-05-audit

Confidence: low

Full changelog

Datasets Features

Agent traces

  • Parse Agent traces messages for SFT using teich by @lhoestq in https://github.com/huggingface/datasets/pull/8232

    • Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
    • Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
    • Load the data:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
    >>> ds[0]["messages"]
    [{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...'
     ...]
    
    • Train on agent traces:
    trl sft --dataset-name lhoestq/agent-traces-example ...
    
    • Find all the agent trace datasets on HF here: https://huggingface.co/datasets?format=format:agent-traces&sort=trending

Next-level shuffling in streaming mode

  • Use multiple input shards for shuffle buffer by @lhoestq in https://github.com/huggingface/datasets/pull/8194

    ds = load_dataset(..., streaming=True)
    ds = ds.shuffle(seed=42)
    # or configure local buffer shuffling manually, default is:
    ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)
    

    [Figure: shuffling before vs. after this change]

    Toy example comparison:

    from datasets import IterableDataset
    
    ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
    ds = ds.shuffle(seed=42)
    
    print("Cold start ids:")
    print(list(ds.take(10)["i"]))
    print("Nominal regime ids:")
    print(list(ds.skip(10_000).take(10)["i"]))
    

    before👎:

    Cold start ids:
    [6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
    Nominal regime ids:
    [6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]
    

    after✨:

    Cold start ids:
    [7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
    Nominal regime ids:
    [9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]
    

    Note: ds.state_dict() and ds.load_state_dict() are still supported for this improved shuffling :) enabling dataset checkpointing

    Note 2: it uses threads to fetch the first examples in parallel from the input shards

    Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()
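
To build intuition for why reading from several shards changes the output, here is a simplified pure-Python model of a shuffle buffer. It is not the library's implementation: `make_shards`, `interleave`, and `buffer_shuffle` are hypothetical helpers written for this illustration, and the round-robin reading order is an assumption.

```python
import random
from itertools import islice

def make_shards(n_items, num_shards):
    """Split range(n_items) into contiguous shards, like files on disk."""
    size = n_items // num_shards
    return [range(k * size, (k + 1) * size) for k in range(num_shards)]

def interleave(shards, max_input_shards):
    """Yield items round-robin from up to `max_input_shards` shards at a time."""
    pending = list(shards)
    active = [iter(s) for s in pending[:max_input_shards]]
    pending = pending[max_input_shards:]
    while active:
        still_active = []
        for it in active:
            try:
                yield next(it)
                still_active.append(it)
            except StopIteration:
                # Replace an exhausted shard with the next unread one, if any.
                if pending:
                    still_active.append(iter(pending.pop(0)))
        active = still_active

def buffer_shuffle(stream, buffer_size, seed):
    """Classic shuffle buffer: emit a random buffered item, refill from the stream."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shards = make_shards(1000, 10)  # shard k holds ids [100*k, 100*k + 100)

# One input shard: the buffer only ever sees neighbouring ids from shard 0.
single = list(islice(buffer_shuffle(interleave(shards, 1), 20, seed=42), 10))
# Ten input shards: the buffer mixes ids drawn from all shards.
multi = list(islice(buffer_shuffle(interleave(shards, 10), 20, seed=42), 10))

print("one input shard: ", single)
print("ten input shards:", multi)
```

In this model the single-shard run can only emit ids from the first shard at the start of iteration, which mirrors the clustered "before" ids in the toy comparison above. The real feature also fetches the first examples with threads, which this sketch does not attempt to model.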

New batching features for robotics datasets

  • Add batch(by_column=...) by @lhoestq in https://github.com/huggingface/datasets/pull/8172

    from datasets import Dataset
    
    ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
    # ds = ds.to_iterable_dataset()
    ds = ds.batch(by_column="episode")
    for x in ds:
        print(x)
    # {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
    # {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
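
For readers who want to see the grouping idea without the library, the sketch below reproduces the output above in plain Python. It assumes rows with the same column value are contiguous, as in the example; the release notes do not say how `batch(by_column=...)` treats non-contiguous values, and `batch_by_column` is a hypothetical helper, not the library's code.

```python
from itertools import groupby

def batch_by_column(rows, column):
    """Group consecutive rows sharing `column` into one columnar batch."""
    for _, group in groupby(rows, key=lambda row: row[column]):
        group = list(group)
        # Turn a list of row dicts into a dict of column lists.
        yield {name: [row[name] for row in group] for name in group[0]}

# Same toy data as above: two episodes of ten frames each.
rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(10)]
batches = list(batch_by_column(rows, "episode"))

for batch in batches:
    print(batch)
```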
    

New supported formats

  • Add Apache Iceberg format support by @frankliee in https://github.com/huggingface/datasets/pull/8148
  • feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format by @JackieTien97 in https://github.com/huggingface/datasets/pull/8160
  • feat: add 3D mesh support and MeshFolder builder by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8055
  • Add .conll / .conllu dataset format loader (CoNLL-2003 / 2000 / U) by @CrypticCortex in https://github.com/huggingface/datasets/pull/8219

Other improvements and bug fixes

  • Pass library_name/version to HfApi in dataset push and delete paths by @davanstrien in https://github.com/huggingface/datasets/pull/8161
  • Fix storage_options lookup for streaming Lance datasets by @ericjaebeom in https://github.com/huggingface/datasets/pull/8166
  • add agent trace prompt, sent_at, count fields by @cfahlgren1 in https://github.com/huggingface/datasets/pull/8163
  • fix: add num_proc argument to Dataset.to_sql by @EricSaikali in https://github.com/huggingface/datasets/pull/7791
  • Support fsspec 2026.4.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8175
  • Fix Parquet streaming hangs at the end of script by @lhoestq in https://github.com/huggingface/datasets/pull/8176
  • ClassLabel docs: Correct value for unknown labels by @l-uuz in https://github.com/huggingface/datasets/pull/7645
  • fix parquet reshard by @lhoestq in https://github.com/huggingface/datasets/pull/8193
  • Fix parquet columns arg by @lhoestq in https://github.com/huggingface/datasets/pull/8210
  • update readme by @lhoestq in https://github.com/huggingface/datasets/pull/8208
  • update single seg repos in ci by @lhoestq in https://github.com/huggingface/datasets/pull/8213
  • Fix single lance file form pylance 7.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8225
  • fix(map): fix progress bar exceeding total when load_from_cache_file=False by @Nitin-Rajasekar in https://github.com/huggingface/datasets/pull/8170
  • fix: embed_external_files=True for mesh support by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8224
  • Fix iterable skip over full Arrow blocks by @my17th2 in https://github.com/huggingface/datasets/pull/8236
  • Keep None as a real null in Json() columns instead of the string "null" by @adityasingh2400 in https://github.com/huggingface/datasets/pull/8231
  • Support composed splits in streaming datasets by @lanarkite99 in https://github.com/huggingface/datasets/pull/8220

New Contributors

  • @ericjaebeom made their first contribution in https://github.com/huggingface/datasets/pull/8166
  • @EricSaikali made their first contribution in https://github.com/huggingface/datasets/pull/7791
  • @l-uuz made their first contribution in https://github.com/huggingface/datasets/pull/7645
  • @CrypticCortex made their first contribution in https://github.com/huggingface/datasets/pull/8219
  • @frankliee made their first contribution in https://github.com/huggingface/datasets/pull/8148
  • @Vinay-Umrethe made their first contribution in https://github.com/huggingface/datasets/pull/8055
  • @Nitin-Rajasekar made their first contribution in https://github.com/huggingface/datasets/pull/8170
  • @JackieTien97 made their first contribution in https://github.com/huggingface/datasets/pull/8160
  • @my17th2 made their first contribution in https://github.com/huggingface/datasets/pull/8236
  • @adityasingh2400 made their first contribution in https://github.com/huggingface/datasets/pull/8231
  • @lanarkite99 made their first contribution in https://github.com/huggingface/datasets/pull/8220

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.5...5.0.0

Breaking Changes

  • Default shuffling mechanism now uses multiple input shards; use max_buffer_input_shards=1 to revert to the old single‑shard behavior.
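
Pipelines that must run on both 4.x and 5.x may want to pass `max_buffer_input_shards=1` only where it is needed. The helper below is a minimal sketch: `legacy_shuffle_kwargs` is a hypothetical name, and it assumes the argument is not accepted before 5.0.0. The notes say this restores the old mechanism; they do not state whether the resulting order is identical to 4.x output.

```python
def legacy_shuffle_kwargs(datasets_version):
    """Return extra shuffle() kwargs that pin the pre-5.0.0 mechanism."""
    major = int(datasets_version.split(".")[0])
    # Assumption: max_buffer_input_shards is only accepted from 5.0.0 onward.
    return {"max_buffer_input_shards": 1} if major >= 5 else {}

# Intended use (needs the datasets package, so it is left as a comment):
#   import datasets
#   ds = ds.shuffle(seed=42, **legacy_shuffle_kwargs(datasets.__version__))
print(legacy_shuffle_kwargs("5.0.0"))
print(legacy_shuffle_kwargs("4.8.5"))
```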

About datasets

The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
