This release includes one breaking change that platform teams should review before upgrading.
✓ No known CVEs patched in this version
ReleasePort's take
Moderate signal: `IterableDataset.shuffle()` now fills its shuffle buffer from multiple input shards by default, which changes the example order produced by earlier versions.
Why it matters: The change in shuffle semantics can break pipelines that rely on the previous ordering; test any code using IterableDataset.shuffle().
Summary
AI summary: A mixed release that adds new supported formats and agent trace parsing, changes the default streaming shuffle, and fixes several bugs.
Changes in this release
| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Breaking | High | Default shuffling now uses multiple input shards, breaking previous behavior. | llm_adapter@2026-06-05 | high | None |
| Feature | Medium | Adds `batch(by_column=...)` method for grouping rows by a column. | llm_adapter@2026-06-05 | high | None |
| Feature | Medium | Adds Apache Iceberg format support. | llm_adapter@2026-06-05 | high | None |
| Feature | Medium | Adds 3D mesh support and MeshFolder builder. | llm_adapter@2026-06-05 | high | None |
| Feature | Medium | Adds .conll and .conllu dataset format loaders. | llm_adapter@2026-06-05 | high | None |
| Feature | Medium | Adds `num_proc` argument to `Dataset.to_sql`. | llm_adapter@2026-06-05 | high | None |
| Feature | Low | Parses Agent traces messages for SFT using the optional `teich` library, enabling loading and training on agent trace datasets. | granite4.1:30b@2026-06-05-audit | low | None |
| Dependency | Low | Supports fsspec version 2026.4.0. | llm_adapter@2026-06-05 | high | None |
| Bugfix | Medium | Fixes Parquet streaming hangs at script end. | llm_adapter@2026-06-05 | high | None |
| Bugfix | Medium | Fixes storage_options lookup for streaming Lance datasets. | llm_adapter@2026-06-05 | high | None |
| Bugfix | Medium | Fixes iterable skip over full Arrow blocks. | llm_adapter@2026-06-05 | high | None |
| Bugfix | Low | Fixes Parquet columns argument handling. | granite4.1:30b@2026-06-05-audit | low | None |
Full changelog
Datasets Features
Agent traces
- Parse Agent traces messages for SFT using `teich` by @lhoestq in https://github.com/huggingface/datasets/pull/8232
  - Agent traces from claude_code/pi/codex and others can now be loaded with `load_dataset`
  - Using the `teich` library (new optional dependency), traces are parsed to `messages` to enable training on traces using e.g. `trl`
  - Load the data:

        >>> from datasets import load_dataset
        >>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
        >>> ds[0]["messages"]
        [{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...' ...]

  - Train on agent traces:

        trl sft --dataset-name lhoestq/agent-traces-example ...

  - Find all the Agent traces datasets on HF here: https://huggingface.co/datasets?format=format:agent-traces&sort=trending
Next-level shuffling in streaming mode
- Use multiple input shards for shuffle buffer by @lhoestq in https://github.com/huggingface/datasets/pull/8194

      ds = load_dataset(..., streaming=True)
      ds = ds.shuffle(seed=42)
      # or configure local buffer shuffling manually, default is:
      ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)

  Toy example comparison:

      from datasets import IterableDataset

      ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
      ds = ds.shuffle(seed=42)
      print("Cold start ids:")
      print(list(ds.take(10)["i"]))
      print("Nominal regime ids:")
      print(list(ds.skip(10_000).take(10)["i"]))

  before👎:

      Cold start ids: [6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
      Nominal regime ids: [6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]

  after✨:

      Cold start ids: [7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
      Nominal regime ids: [9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]

  Note: `ds.state_dict()` and `ds.load_state_dict()` are still supported for this improved shuffling :) enabling dataset checkpointing

  Note 2: it uses threads to fetch the first examples in parallel from the input shards

  Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing `max_buffer_input_shards=1` to `IterableDataset.shuffle()`
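To see why reading from several shards changes the output, here is a toy, pure-Python sketch of a shuffle buffer fed round-robin from a limited number of open shards. It is not the library's implementation: the function `buffer_shuffle`, its parameter `max_input_shards`, and the round-robin reading order are all assumptions made for illustration.

```python
import random
from itertools import islice


def buffer_shuffle(shards, seed, buffer_size, max_input_shards):
    """Toy shuffle buffer reading round-robin from up to `max_input_shards` shards."""
    rng = random.Random(seed)
    pending = [iter(s) for s in shards]  # shards not opened yet
    active = []                          # shards currently being read
    buffer = []

    def source():
        # Yield examples round-robin from the active shards and
        # open a new shard whenever one of them is exhausted.
        while pending or active:
            while pending and len(active) < max_input_shards:
                active.append(pending.pop(0))
            for it in list(active):
                try:
                    yield next(it)
                except StopIteration:
                    active.remove(it)

    for example in source():
        if len(buffer) < buffer_size:
            buffer.append(example)       # fill phase
            continue
        j = rng.randrange(buffer_size)   # pick a random slot
        yield buffer[j]                  # emit what it holds
        buffer[j] = example              # refill the slot with the new example
    rng.shuffle(buffer)
    yield from buffer                    # drain the remaining examples


# Eight hypothetical shards of 100 consecutive ids each.
shards = [range(k * 100, (k + 1) * 100) for k in range(8)]
single = list(islice(buffer_shuffle(shards, 42, 20, 1), 10))
multi = list(islice(buffer_shuffle(shards, 42, 20, 4), 10))
print("one input shard:  ", single)
print("four input shards:", multi)
```

With a single input shard, the first outputs can only come from the first shard, which mirrors the clustered ids in the "before" output above. With four input shards, the buffer holds examples from four different id ranges from the start.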
New batching features for robotics datasets
- Add batch(by_column=...) by @lhoestq in https://github.com/huggingface/datasets/pull/8172

      from datasets import Dataset

      ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
      # ds = ds.to_iterable_dataset()
      ds = ds.batch(by_column="episode")
      for x in ds:
          print(x)
      # {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
      # {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
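For readers who want a mental model of the output shape, the following pure-Python sketch groups rows into one columnar batch per value of a column. It is only an illustration: `batch_by_column` is a hypothetical helper, and it assumes rows with the same value are contiguous, which the release notes do not state for the library itself.

```python
from itertools import groupby


def batch_by_column(rows, column):
    """Group consecutive rows sharing the same `column` value into one columnar batch."""
    for _, group in groupby(rows, key=lambda row: row[column]):
        group = list(group)
        # Turn a list of row dicts into a dict of lists, one list per column.
        yield {name: [row[name] for row in group] for name in group[0]}


rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(3)]
batches = list(batch_by_column(rows, "episode"))
for batch in batches:
    print(batch)
# {'episode': [0, 0, 0], 'frame': [0, 1, 2]}
# {'episode': [1, 1, 1], 'frame': [0, 1, 2]}
```

Each batch holds all frames of one episode, so batch sizes follow the data rather than a fixed `batch_size`.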
New supported formats
- Add Apache Iceberg format support by @frankliee in https://github.com/huggingface/datasets/pull/8148
- feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format by @JackieTien97 in https://github.com/huggingface/datasets/pull/8160
- feat: add 3D mesh support and MeshFolder builder by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8055
- Add `.conll`/`.conllu` dataset format loader (CoNLL-2003 / 2000 / U) by @CrypticCortex in https://github.com/huggingface/datasets/pull/8219
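As a reminder of what a CoNLL-2003 style file looks like, here is a minimal parsing sketch: one token per line with whitespace-separated columns, and a blank line between sentences. The helper `parse_conll` is hypothetical and is not the new loader; the columns and features the loader actually produces are not described in these notes, and CoNLL-U (tab-separated, with `#` comment lines) is not handled here.

```python
def parse_conll(text):
    """Parse whitespace-separated CoNLL-style text into sentences of column tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or a document marker) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split()))
    if current:
        sentences.append(current)
    return sentences


sample = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
"""
sentences = parse_conll(sample)
print(len(sentences), "sentences;", "first token:", sentences[0][0])
```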
Other improvements and bug fixes
- Pass library_name/version to HfApi in dataset push and delete paths by @davanstrien in https://github.com/huggingface/datasets/pull/8161
- Fix storage_options lookup for streaming Lance datasets by @ericjaebeom in https://github.com/huggingface/datasets/pull/8166
- add agent trace prompt, sent_at, count fields by @cfahlgren1 in https://github.com/huggingface/datasets/pull/8163
- fix: add `num_proc` argument to `Dataset.to_sql` by @EricSaikali in https://github.com/huggingface/datasets/pull/7791
- Support fsspec 2026.4.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8175
- Fix Parquet streaming hangs at the end of script by @lhoestq in https://github.com/huggingface/datasets/pull/8176
- `ClassLabel` docs: Correct value for unknown labels by @l-uuz in https://github.com/huggingface/datasets/pull/7645
- fix parquet reshard by @lhoestq in https://github.com/huggingface/datasets/pull/8193
- Fix parquet columns arg by @lhoestq in https://github.com/huggingface/datasets/pull/8210
- update readme by @lhoestq in https://github.com/huggingface/datasets/pull/8208
- update single seg repos in ci by @lhoestq in https://github.com/huggingface/datasets/pull/8213
- Fix single lance file form pylance 7.0 by @lhoestq in https://github.com/huggingface/datasets/pull/8225
- fix(map): fix progress bar exceeding total when load_from_cache_file=False by @Nitin-Rajasekar in https://github.com/huggingface/datasets/pull/8170
- fix: embed_external_files=True for mesh support by @Vinay-Umrethe in https://github.com/huggingface/datasets/pull/8224
- Fix iterable skip over full Arrow blocks by @my17th2 in https://github.com/huggingface/datasets/pull/8236
- Keep None as a real null in Json() columns instead of the string "null" by @adityasingh2400 in https://github.com/huggingface/datasets/pull/8231
- Support composed splits in streaming datasets by @lanarkite99 in https://github.com/huggingface/datasets/pull/8220
New Contributors
- @ericjaebeom made their first contribution in https://github.com/huggingface/datasets/pull/8166
- @EricSaikali made their first contribution in https://github.com/huggingface/datasets/pull/7791
- @l-uuz made their first contribution in https://github.com/huggingface/datasets/pull/7645
- @CrypticCortex made their first contribution in https://github.com/huggingface/datasets/pull/8219
- @frankliee made their first contribution in https://github.com/huggingface/datasets/pull/8148
- @Vinay-Umrethe made their first contribution in https://github.com/huggingface/datasets/pull/8055
- @Nitin-Rajasekar made their first contribution in https://github.com/huggingface/datasets/pull/8170
- @JackieTien97 made their first contribution in https://github.com/huggingface/datasets/pull/8160
- @my17th2 made their first contribution in https://github.com/huggingface/datasets/pull/8236
- @adityasingh2400 made their first contribution in https://github.com/huggingface/datasets/pull/8231
- @lanarkite99 made their first contribution in https://github.com/huggingface/datasets/pull/8220
Full Changelog: https://github.com/huggingface/datasets/compare/4.8.5...5.0.0
Breaking Changes
- Default shuffling mechanism now uses multiple input shards; use max_buffer_input_shards=1 to revert to the old single‑shard behavior.
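Code that must run against both older and newer versions could guard the new argument, as in the sketch below. The helper `shuffle_like_before` is hypothetical; it assumes `shuffle` exposes `max_buffer_input_shards` as a named parameter in versions that support it, and it only inspects the signature rather than the installed version number.

```python
import inspect


def shuffle_like_before(ds, seed, buffer_size=1000):
    """Shuffle `ds`, requesting single-shard buffering when the installed version supports it."""
    kwargs = {"seed": seed, "buffer_size": buffer_size}
    params = inspect.signature(ds.shuffle).parameters
    if "max_buffer_input_shards" in params:
        # Newer versions: opt back into the old single-shard mechanism.
        kwargs["max_buffer_input_shards"] = 1
    return ds.shuffle(**kwargs)
```

A pipeline would call `shuffle_like_before(ds, seed=42)` in place of `ds.shuffle(seed=42)`. Whether the resulting order matches an earlier version exactly is worth verifying on a small sample before relying on it.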
About datasets
The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools