datasets

v5.0.0 Breaking

This release includes 1 breaking change that platform teams should review before upgrading.

Published 1mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai artificial-intelligence computer-vision dataset-hub datasets machine-learning huggingface llm natural-language-processing nlp numpy pandas pytorch speech tensorflow

ReleasePort's take

Moderate signal

IterableDataset.shuffle() now shuffles across multiple input shards, altering prior behavior.

Why it matters: The change in shuffle semantics can break pipelines that rely on the previous ordering; test any code using IterableDataset.shuffle().

Summary

AI summary

This mixed release adds new supported formats, agent-trace loading, multi-shard streaming shuffle, and batch(by_column=...), alongside bug fixes.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Breaking	High	Default shuffling now uses multiple input shards, breaking previous behavior.	llm_adapter@2026-06-05	high	—
Feature	Medium	Adds `batch(by_column=...)` method for grouping rows by a column.	llm_adapter@2026-06-05	high	—
Feature	Medium	Adds Apache Iceberg format support.	llm_adapter@2026-06-05	high	—
Feature	Medium	Adds 3D mesh support and MeshFolder builder.	llm_adapter@2026-06-05	high	—
Feature	Medium	Adds .conll and .conllu dataset format loaders.	llm_adapter@2026-06-05	high	—
Feature	Medium	Adds `num_proc` argument to `Dataset.to_sql`.	llm_adapter@2026-06-05	high	—
Feature	Low	Parses Agent traces messages for SFT using the optional `teich` library, enabling loading and training on agent trace datasets.	granite4.1:30b@2026-06-05-audit	low	—
Dependency	Low	Supports fsspec version 2026.4.0.	llm_adapter@2026-06-05	high	—
Bugfix	Medium	Fixes Parquet streaming hangs at script end.	llm_adapter@2026-06-05	high	—
Bugfix	Medium	Fixes storage_options lookup for streaming Lance datasets.	llm_adapter@2026-06-05	high	—
Bugfix	Medium	Fixes iterable skip over full Arrow blocks.	llm_adapter@2026-06-05	high	—
Bugfix	Low	Fixes Parquet columns argument handling.	granite4.1:30b@2026-06-05-audit	low	—

Full changelog

Datasets Features

Agent traces

Parse Agent traces messages for SFT using teich by @lhoestq in https://github.com/huggingface/datasets/pull/8232
- Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
- Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
- Load the data:
```
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
>>> ds[0]["messages"]
[{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...'
 ...]
```
- Train on agent traces:
```
trl sft --dataset-name lhoestq/agent-traces-example ...
```
- Find all the Agent traces datasets on HF here: https://huggingface.co/datasets?format=format:agent-traces&sort=trending

Next-level shuffling in streaming mode

Use multiple input shards for shuffle buffer by @lhoestq in https://github.com/huggingface/datasets/pull/8194

```
ds = load_dataset(..., streaming=True)
ds = ds.shuffle(seed=42)
# or configure local buffer shuffling manually, default is:
ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)
```

Toy example comparison:

```
from datasets import IterableDataset

ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
ds = ds.shuffle(seed=42)

print("Cold start ids:")
print(list(ds.take(10)["i"]))
print("Nominal regime ids:")
print(list(ds.skip(10_000).take(10)["i"]))
```

before👎:

```
Cold start ids:
[6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
Nominal regime ids:
[6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]
```

after✨:

```
Cold start ids:
[7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
Nominal regime ids:
[9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]
```

Note: ds.state_dict() and ds.load_state_dict() are still supported with this improved shuffling :) so dataset checkpointing keeps working

Note 2: it uses threads to fetch the first examples in parallel from the input shards

Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()
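
To see why feeding the shuffle buffer from several shards changes the output so much, here is a conceptual pure-Python sketch. It is not the library's implementation: `buffer_shuffle`, `round_robin`, and the ten toy shards are hypothetical names and data chosen for illustration, and the real code may interleave shards differently. It only contrasts a buffer filled from one shard at a time with a buffer filled from all shards in turn.

```python
import random
from itertools import chain, islice


def buffer_shuffle(stream, buffer_size, rng):
    """Yield items from `stream` in locally shuffled order using a fixed-size buffer."""
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]        # emit a random buffered item...
        buf[idx] = item       # ...and replace it with the incoming one
    rng.shuffle(buf)          # drain what is left at end of stream
    yield from buf


def round_robin(shards):
    """Interleave several shards, taking one item from each in turn."""
    iterators = [iter(s) for s in shards]
    while iterators:
        still_open = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                continue
            still_open.append(it)
            yield item
        iterators = still_open


# Ten toy shards of 1,000 consecutive ids each (hypothetical data).
shards = [range(k * 1000, (k + 1) * 1000) for k in range(10)]

# Buffer fed by one shard at a time vs. by all shards in turn.
single = list(islice(buffer_shuffle(chain(*shards), 100, random.Random(42)), 10))
multi = list(islice(buffer_shuffle(round_robin(shards), 100, random.Random(42)), 10))

print("shards seen, single-shard feed:", sorted({i // 1000 for i in single}))
print("shards seen, multi-shard feed: ", sorted({i // 1000 for i in multi}))
```

With a single-shard feed the first outputs can only come from the first shard, which mirrors the tightly clustered "before" ids in the toy example; the multi-shard feed mixes ids from several shards from the first example onward.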

New batching features for robotics datasets

Add batch(by_column=...) by @lhoestq in https://github.com/huggingface/datasets/pull/8172

```
from datasets import Dataset

ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
# ds = ds.to_iterable_dataset()
ds = ds.batch(by_column="episode")
for x in ds:
    print(x)
# {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
# {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
```
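
For readers who want to reason about the shape of these batches without installing the library, the grouping can be mimicked in plain Python. This is a rough sketch, not how `batch(by_column=...)` is implemented: `batch_by_column` is a hypothetical helper, and it assumes that consecutive rows sharing the same column value form one batch. The example above only shows that case, so behavior for non-consecutive values is not covered here.

```python
from itertools import groupby


def batch_by_column(rows, column):
    """Group consecutive rows sharing `column`'s value into one columnar batch."""
    for _, group in groupby(rows, key=lambda row: row[column]):
        group = list(group)
        # Turn the list of row dicts into a dict of column lists.
        yield {name: [row[name] for row in group] for name in group[0]}


# Same toy data as the example above: two episodes of ten frames each.
rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(10)]
batches = list(batch_by_column(rows, "episode"))

for batch in batches:
    print(batch)
```

Each yielded batch holds one whole episode, which is the unit robotics pipelines usually want to process together.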

New supported formats

Add Apache Iceberg format support by @frankliee in https://github.com/huggingface/datasets/pull/8148
feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format by @JackieTien97 in https://github.com/huggingface/datasets/pull/8160
feat: add 3D mesh support and MeshFolder builder by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8055
Add .conll / .conllu dataset format loader (CoNLL-2003 / 2000 / U) by @CrypticCortex in https://github.com/huggingface/datasets/pull/8219

Other improvements and bug fixes

Pass library_name/version to HfApi in dataset push and delete paths by @davanstrien in https://github.com/huggingface/datasets/pull/8161
Fix storage_options lookup for streaming Lance datasets by @ericjaebeom in https://github.com/huggingface/datasets/pull/8166
add agent trace prompt, sent_at, count fields by @cfahlgren1 in https://github.com/huggingface/datasets/pull/8163
fix: add num_proc argument to Dataset.to_sql by @EricSaikali in https://github.com/huggingface/datasets/pull/7791
Support fsspec 2026.4.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8175
Fix Parquet streaming hangs at the end of script by @lhoestq in https://github.com/huggingface/datasets/pull/8176
ClassLabel docs: Correct value for unknown labels by @l-uuz in https://github.com/huggingface/datasets/pull/7645
fix parquet reshard by @lhoestq in https://github.com/huggingface/datasets/pull/8193
Fix parquet columns arg by @lhoestq in https://github.com/huggingface/datasets/pull/8210
update readme by @lhoestq in https://github.com/huggingface/datasets/pull/8208
update single seg repos in ci by @lhoestq in https://github.com/huggingface/datasets/pull/8213
Fix single lance file form pylance 7.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8225
fix(map): fix progress bar exceeding total when load_from_cache_file=False by @Nitin-Rajasekar in https://github.com/huggingface/datasets/pull/8170
fix: embed_external_files=True for mesh support by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8224
Fix iterable skip over full Arrow blocks by @my17th2 in https://github.com/huggingface/datasets/pull/8236
Keep None as a real null in Json() columns instead of the string "null" by @adityasingh2400 in https://github.com/huggingface/datasets/pull/8231
Support composed splits in streaming datasets by @lanarkite99 in https://github.com/huggingface/datasets/pull/8220

New Contributors

@ericjaebeom made their first contribution in https://github.com/huggingface/datasets/pull/8166
@EricSaikali made their first contribution in https://github.com/huggingface/datasets/pull/7791
@l-uuz made their first contribution in https://github.com/huggingface/datasets/pull/7645
@CrypticCortex made their first contribution in https://github.com/huggingface/datasets/pull/8219
@frankliee made their first contribution in https://github.com/huggingface/datasets/pull/8148
@Vinay-Umrethe made their first contribution in https://github.com/huggingface/datasets/pull/8055
@Nitin-Rajasekar made their first contribution in https://github.com/huggingface/datasets/pull/8170
@JackieTien97 made their first contribution in https://github.com/huggingface/datasets/pull/8160
@my17th2 made their first contribution in https://github.com/huggingface/datasets/pull/8236
@adityasingh2400 made their first contribution in https://github.com/huggingface/datasets/pull/8231
@lanarkite99 made their first contribution in https://github.com/huggingface/datasets/pull/8220

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.5...5.0.0

Breaking Changes

Default shuffling mechanism now uses multiple input shards; use max_buffer_input_shards=1 to revert to the old single‑shard behavior.
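
Teams that run the same pipeline on both 4.x and 5.x during a staged upgrade may want to pin the old mechanism only where the new argument exists. The helper below is a hypothetical sketch, not part of the library: `legacy_shuffle_kwargs` is an invented name, and it assumes `max_buffer_input_shards` is accepted only from 5.0.0 onward (the release notes introduce it in this version but do not say how older versions react to it). The notes also do not state whether the exact 4.x ordering is reproduced, so re-check any order-sensitive tests.

```python
def legacy_shuffle_kwargs(datasets_version: str) -> dict:
    """Return the extra shuffle() kwargs that keep single-shard buffering."""
    major = int(datasets_version.split(".")[0])
    # Assumption: the argument exists only from 5.0.0 onward.
    return {"max_buffer_input_shards": 1} if major >= 5 else {}


print(legacy_shuffle_kwargs("4.8.5"))  # {}
print(legacy_shuffle_kwargs("5.0.0"))  # {'max_buffer_input_shards': 1}

# Intended use (requires the `datasets` package, so left as a comment):
#   import datasets
#   ds = ds.shuffle(seed=42, **legacy_shuffle_kwargs(datasets.__version__))
```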

About datasets

The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools