datasets releases - releaseport


5.0.0 Breaking risk 1mo

Agent traces + shuffling + new formats


4.8.5 Bug fix 3mo

Fixed JSON decoding before DataFrame.to_json and related conversions.

Full changelog

Main bug fixes

fix: decode Json() values before calling DataFrame.to_json() (#8116) by @Brianzhengca in https://github.com/huggingface/datasets/pull/8122
Fix: decode JSON type before to_list or to_dict is called by @ItsTania in https://github.com/huggingface/datasets/pull/8137
Fix batching for table-formatted datasets by @bluehyena in https://github.com/huggingface/datasets/pull/8126
Fix iterable map resume state by @Brianzhengca in https://github.com/huggingface/datasets/pull/8147
don't embed remote files in download_and_prepare to parquet by @lhoestq in https://github.com/huggingface/datasets/pull/8150

Other improvements and bug fixes

Parse agent traces by @lhoestq in https://github.com/huggingface/datasets/pull/8113
🔒 Pin GitHub Actions to commit SHAs by @paulinebm in https://github.com/huggingface/datasets/pull/8114
chore: bump doc-builder SHA for PR upload workflow by @rtrompier in https://github.com/huggingface/datasets/pull/8134
Remove print statement in JSON processing by @lhoestq in https://github.com/huggingface/datasets/pull/8136
Don't include files list DatasetInfo (and remove old stuff) by @lhoestq in https://github.com/huggingface/datasets/pull/8128
update ci uer by @lhoestq in https://github.com/huggingface/datasets/pull/8139
fix warning in ci by @lhoestq in https://github.com/huggingface/datasets/pull/8140
fix mask in embed_storage for remote files by @lhoestq in https://github.com/huggingface/datasets/pull/8151
fix original_files missing in ci json test by @lhoestq in https://github.com/huggingface/datasets/pull/8152
Fix null in embed storage by @lhoestq in https://github.com/huggingface/datasets/pull/8154
Fix base_path in integration tests by @lhoestq in https://github.com/huggingface/datasets/pull/8155

New Contributors

@paulinebm made their first contribution in https://github.com/huggingface/datasets/pull/8114
@Brianzhengca made their first contribution in https://github.com/huggingface/datasets/pull/8122
@bluehyena made their first contribution in https://github.com/huggingface/datasets/pull/8126
@rtrompier made their first contribution in https://github.com/huggingface/datasets/pull/8134
@ItsTania made their first contribution in https://github.com/huggingface/datasets/pull/8137

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.4...4.8.5


4.8.4 Bug fix 4mo

Notable features

Support for the latest torchvision version

Full changelog

What's Changed

Support latest torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/8087
fix regression when loading JSON with one file = one object by @lhoestq in https://github.com/huggingface/datasets/pull/8086

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.3...4.8.4


4.8.3 Bug fix 4mo

Fixed the split_dataset_by_node step and corrected the Json.cast_storage docstring.

Full changelog

What's Changed

Fix split_dataset_by_node step by @lhoestq in https://github.com/huggingface/datasets/pull/8081
Fix docstring of Json.cast_storage by @albertvillanova in https://github.com/huggingface/datasets/pull/8080

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.2...4.8.3


4.8.2 Maintenance 4mo

Uses the Json type for empty structs.

Full changelog

What's Changed

Json type for empty struct by @lhoestq in https://github.com/huggingface/datasets/pull/8074

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.1...4.8.2


4.8.1 Bug fix 4mo

Fixed a double yield in formatted Arrow iteration.

Full changelog

What's Changed

Fix formatted iter arrow double yield by @HaukurPall in https://github.com/huggingface/datasets/pull/8063

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.0...4.8.1


4.8.0 Mixed 4mo

⚠ Upgrade required

Bumped dependencies: `dill` and `multiprocess` versions to add Python 3.14 support.
On macOS, push_to_hub now uses the `spawn` start method instead of `fork` to avoid segmentation faults.

Notable features

Read and write datasets from HF Storage Buckets via `load_dataset` with paths like `buckets/username/data-bucket` or `hf://buckets/...`.
IterableDataset.push_to_hub now accepts a `max_shard_size` argument (requires two dataset iterations).
Added more Arrow‑native operations and improved glob pattern support in archives for IterableDataset.

Full changelog

Dataset Features

Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064

from datasets import load_dataset
# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
# process, filter
ds = ds.map(...).filter(...)
# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")

This also fixes a segfault in multiprocessed push_to_hub on macOS (it now uses spawn instead of fork).
It also bumps the dill and multiprocess versions to support Python 3.14.

Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068
- added max_shard_size to IterableDataset.push_to_hub (this requires iterating over the dataset twice to know its full size; improvements are welcome)
- more arrow-native iterable operations for IterableDataset
- better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
- fixes for to_pandas, videofolder, load_dataset_builder kwargs
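To see why a size-based shard limit needs two passes over a stream, here is a minimal sketch in plain Python. It is not the library's implementation: `estimate_size`, `plan_num_shards`, `write_shards` and `make_stream` are hypothetical helpers, and example sizes are approximated by the length of their JSON encoding.

```python
import json
import math
from typing import Iterable, Iterator


def estimate_size(example: dict) -> int:
    """Approximate an example's size as the byte length of its JSON encoding."""
    return len(json.dumps(example).encode("utf-8"))


def plan_num_shards(examples: Iterable[dict], max_shard_size: int) -> int:
    """Pass 1: a stream's total size is unknown until it has been consumed."""
    total = sum(estimate_size(ex) for ex in examples)
    return max(1, math.ceil(total / max_shard_size))


def write_shards(examples: Iterable[dict], num_shards: int) -> list[list[dict]]:
    """Pass 2: distribute examples round-robin over the planned shards."""
    shards: list[list[dict]] = [[] for _ in range(num_shards)]
    for i, ex in enumerate(examples):
        shards[i % num_shards].append(ex)
    return shards


def make_stream() -> Iterator[dict]:
    """A stream that can be re-created for the second pass (made-up data)."""
    return ({"id": i, "text": "x" * 10} for i in range(100))


num_shards = plan_num_shards(make_stream(), max_shard_size=1000)  # first iteration
shards = write_shards(make_stream(), num_shards)                   # second iteration
```

The stream is created twice because a generator cannot be rewound, which is the cost the note above refers to.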

What's Changed

fix reshard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/8061
Improve error message for invalid data_files pattern format by @kushalkkb in https://github.com/huggingface/datasets/pull/8060
fix null filling in missing jsonl columns by @lhoestq in https://github.com/huggingface/datasets/pull/8069

New Contributors

@kushalkkb made their first contribution in https://github.com/huggingface/datasets/pull/8060
@Michael-RDev made their first contribution in https://github.com/huggingface/datasets/pull/8068

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0


4.7.0 New feature 4mo

Notable features

Introduced `Json()` type for storing fields with mixed data types (str/int/float/dict/list).
`Features({"a": Json()})` can be specified in `Dataset.from_dict`, `.map`, `.cast`, etc.
`on_mixed_types="use_json"` automatically applies `Json()` on mixed‑type columns.

Full changelog

Datasets Features

Add Json() type by @lhoestq in https://github.com/huggingface/datasets/pull/8027
- JSON Lines files that contain arbitrary JSON objects, such as tool calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type is used to store data that Arrow/Parquet would not normally support
- Use the Json() type in Features() for any dataset; it is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
- Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()
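As a conceptual sketch of why a JSON-backed type can hold mixed values where a typed column cannot: if every value is serialized to a string, the column has a single storage type, and values are decoded again on read. This only illustrates the idea and makes no claim about how Json() stores data internally; `encode_column` and `decode_column` are hypothetical names.

```python
import json


def encode_column(values: list) -> list[str]:
    """Serialize each value so the whole column shares one type (str)."""
    return [json.dumps(v) for v in values]


def decode_column(encoded: list[str]) -> list:
    """Restore the original Python values when the column is read."""
    return [json.loads(s) for s in encoded]


mixed = [0, "foo", {"subfield": "bar"}]  # int, str and dict in one column
stored = encode_column(mixed)            # uniform list of strings
restored = decode_column(stored)         # round-trips to the mixed values
```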

Examples:

You can use on_mixed_types="use_json" or specify features= with a Json() type:

>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK

Another example with tool calling data and the on_mixed_types="use_json" argument (useful to not have to specify features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}

What's Changed

Fix typos in iterable_dataset.py by @omkar-334 in https://github.com/huggingface/datasets/pull/8049
Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in https://github.com/huggingface/datasets/pull/8039
Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in https://github.com/huggingface/datasets/pull/8041
Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in https://github.com/huggingface/datasets/pull/8044
Don't extract bad files by @lhoestq in https://github.com/huggingface/datasets/pull/8056
fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in https://github.com/huggingface/datasets/pull/8053
fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in https://github.com/huggingface/datasets/pull/8047
Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in https://github.com/huggingface/datasets/pull/7982
Limit dataset listing to first 20 entries in readme by @lhoestq in https://github.com/huggingface/datasets/pull/8057

New Contributors

@omkar-334 made their first contribution in https://github.com/huggingface/datasets/pull/8049
@Nexround made their first contribution in https://github.com/huggingface/datasets/pull/8039
@HaukurPall made their first contribution in https://github.com/huggingface/datasets/pull/8041
@s-zx made their first contribution in https://github.com/huggingface/datasets/pull/8053
@ain-soph made their first contribution in https://github.com/huggingface/datasets/pull/8047
@KOKOSde made their first contribution in https://github.com/huggingface/datasets/pull/7982

Full Changelog: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0


4.6.1 Bug fix 4mo

Fixed temporary file cleanup after pushing to the hub.

Full changelog

Bug fix

Remove tmp file in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/8030

Full Changelog: https://github.com/huggingface/datasets/compare/4.6.0...4.6.1


4.6.0 Breaking risk 5mo

Breaking changes

Minimum supported Python version increased to 3.10 (Python 3.9 removed).

Notable features

Support Image, Video, and Audio types in Lance datasets with automatic type inference.
push_to_hub now supports uploading Video types directly as blobs.
IterableDataset.reshard() adds ability to split Parquet shards into finer row‑group shards.

Full changelog

Dataset Features

Support Image, Video and Audio types in Lance datasets

Infer types from lance blobs by @lhoestq in https://github.com/huggingface/datasets/pull/7966

>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(),
 'video_path': Value('string'),
 'caption': Value('string'),
 'aesthetic_score': Value('float64'),
 'motion_score': Value('float64'),
 'temporal_consistency_score': Value('float64'),
 'camera_motion': Value('string'),
 'frame': Value('int64'),
 'fps': Value('float64'),
 'seconds': Value('float64'),
 'embedding': List(Value('float32'), length=1024)}

Push to hub now supports Video types

push_to_hub() for videos by @lhoestq in https://github.com/huggingface/datasets/pull/7971

>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")

Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976
- this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicate videos between Lance, WebDataset, Parquet files and plain video files and make downloads and uploads faster to Hugging Face
- E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
- See more info here: https://huggingface.co/docs/hub/en/xet/deduplication

Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992

Reshard the dataset if possible, i.e. split the current shards further into more shards.
This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
Equality may happen if no shard can be split further.

The resharding mechanism depends on the dataset file format:
- Parquet: shard per row group instead of per file
- Other: not implemented yet (contributions are welcome !)
```
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> ds.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```
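The "shard per row group instead of per file" rule can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `Shard` and `reshard` here are hypothetical, and the row-group counts are made-up inputs rather than values read from Parquet metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Shard:
    path: str
    row_group: Optional[int] = None  # None means "the whole file"


def reshard(shards: list[Shard], row_groups_per_file: dict[str, int]) -> list[Shard]:
    """Split whole-file shards into one shard per row group when possible."""
    out: list[Shard] = []
    for shard in shards:
        n = row_groups_per_file.get(shard.path, 0)
        if shard.row_group is None and n > 1:
            out.extend(Shard(shard.path, i) for i in range(n))
        else:
            out.append(shard)  # cannot be split further
    return out


files = [Shard("a.parquet"), Shard("b.parquet")]
counts = {"a.parquet": 3, "b.parquet": 1}  # hypothetical row-group counts
finer = reshard(files, counts)
```

The result never has fewer shards than the input, and resharding an already split list returns it unchanged, which matches the equality case described above.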

What's Changed

Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in https://github.com/huggingface/datasets/pull/7919
Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in https://github.com/huggingface/datasets/pull/7961
docs: fix grammar and add type hints in splits.py by @Edge-Explorer in https://github.com/huggingface/datasets/pull/7960
Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in https://github.com/huggingface/datasets/pull/7955
Add examples for Lance datasets by @prrao87 in https://github.com/huggingface/datasets/pull/7950
Support null in json string cols by @lhoestq in https://github.com/huggingface/datasets/pull/7963
handle blob lance by @lhoestq in https://github.com/huggingface/datasets/pull/7964
Count examples in lance by @lhoestq in https://github.com/huggingface/datasets/pull/7969
Use temp files in push_to_hub to save memory by @lhoestq in https://github.com/huggingface/datasets/pull/7979
Drop python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/7980
Support pandas 3 by @lhoestq in https://github.com/huggingface/datasets/pull/7981
Remove unused data files optims by @lhoestq in https://github.com/huggingface/datasets/pull/7985
Remove pre-release workaround in CI for transformers v5 and huggingface_hub v1 by @hanouticelina in https://github.com/huggingface/datasets/pull/7989
very basic support for more hf urls by @lhoestq in https://github.com/huggingface/datasets/pull/8003
Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in https://github.com/huggingface/datasets/pull/7995
Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in https://github.com/huggingface/datasets/pull/8000
More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in https://github.com/huggingface/datasets/pull/8009
Support empty shard in from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/8023
Allow import polars in map() by @lhoestq in https://github.com/huggingface/datasets/pull/8024

New Contributors

@omarfarhoud made their first contribution in https://github.com/huggingface/datasets/pull/7919
@Edge-Explorer made their first contribution in https://github.com/huggingface/datasets/pull/7960
@prathamk-tw made their first contribution in https://github.com/huggingface/datasets/pull/7955
@prrao87 made their first contribution in https://github.com/huggingface/datasets/pull/7950
@hanouticelina made their first contribution in https://github.com/huggingface/datasets/pull/7989
@jayzuccarelli made their first contribution in https://github.com/huggingface/datasets/pull/7995
@AnkitAhlawat7742 made their first contribution in https://github.com/huggingface/datasets/pull/8000

Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0


4.5.0 New feature 6mo

Notable features

Support for loading Lance dataset formats (both full datasets and standalone .lance files)
Early exception raised when `load_dataset` receives an invalid revision

Full changelog

Dataset Features

Add lance format support by @eddyxu in https://github.com/huggingface/datasets/pull/7913
- Support for both Lance dataset (including metadata / manifests) and standalone .lance files
- e.g. with lance-format/fineweb-edu
```
from datasets import load_dataset

ds = load_dataset("lance-format/fineweb-edu", streaming=True)
for example in ds["train"]:
    ...
```

What's Changed

Raise early for invalid revision in load_dataset by @Scott-Simmons in https://github.com/huggingface/datasets/pull/7929
fix low but large example indexerror by @CloseChoice in https://github.com/huggingface/datasets/pull/7912
Fix method to retrieve attributes from file object by @lhoestq in https://github.com/huggingface/datasets/pull/7938
add _OverridableIOWrapper by @lhoestq in https://github.com/huggingface/datasets/pull/7942
Add _generate_shards by @lhoestq in https://github.com/huggingface/datasets/pull/7943

New Contributors

@eddyxu made their first contribution in https://github.com/huggingface/datasets/pull/7913
@Scott-Simmons made their first contribution in https://github.com/huggingface/datasets/pull/7929

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.2...4.5.0


4.4.2 Bug fix 7mo

Notable features

Type overloads for load_dataset to improve static type inference
Inspect AI eval logs support

Full changelog

Bug fixes

Fix embed storage nifti by @CloseChoice in https://github.com/huggingface/datasets/pull/7853
ArXiv -> HF Papers by @qgallouedec in https://github.com/huggingface/datasets/pull/7855
fix some broken links by @julien-c in https://github.com/huggingface/datasets/pull/7859
Nifti visualization support by @CloseChoice in https://github.com/huggingface/datasets/pull/7874
Replace papaya with niivue by @CloseChoice in https://github.com/huggingface/datasets/pull/7878
Fix 7846: add_column and add_item erroneously(?) require new_fingerprint parameter by @sajmaru in https://github.com/huggingface/datasets/pull/7884
fix(fingerprint): treat TMPDIR as strict API and fail (Issue #7877) by @ada-ggf25 in https://github.com/huggingface/datasets/pull/7891
encode nifti correctly when uploading lazily by @CloseChoice in https://github.com/huggingface/datasets/pull/7892
fix(nifti): enable lazy loading for Nifti1ImageWrapper by @The-Obstacle-Is-The-Way in https://github.com/huggingface/datasets/pull/7887

Minor additions

Add type overloads to load_dataset for better static type inference by @Aditya2755 in https://github.com/huggingface/datasets/pull/7888
Add inspect_ai eval logs support by @lhoestq in https://github.com/huggingface/datasets/pull/7899
Save input shard lengths by @lhoestq in https://github.com/huggingface/datasets/pull/7897
Don't save original_shard_lengths by default for backward compat by @lhoestq in https://github.com/huggingface/datasets/pull/7906

New Contributors

@sajmaru made their first contribution in https://github.com/huggingface/datasets/pull/7884
@Aditya2755 made their first contribution in https://github.com/huggingface/datasets/pull/7888
@ada-ggf25 made their first contribution in https://github.com/huggingface/datasets/pull/7891
@The-Obstacle-Is-The-Way made their first contribution in https://github.com/huggingface/datasets/pull/7887

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.1...4.4.2


4.4.1 Bug fix 8mo

Fixed streaming retry handling for HTTP 504 and 429 responses.

Full changelog

Bug fixes and improvements

Better streaming retries (504 and 429) by @lhoestq in https://github.com/huggingface/datasets/pull/7847
DOC: remove mode parameter in docstring of pdf and video feature by @CloseChoice in https://github.com/huggingface/datasets/pull/7848
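For readers unfamiliar with retrying on 504 (gateway timeout) and 429 (too many requests), the general pattern can be sketched as follows. The release notes do not describe the library's actual retry policy, so everything here is an assumption for illustration: `HttpError`, `with_retries`, the retry count and the exponential backoff are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 504}  # transient errors worth retrying


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(
    fetch: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient HTTP errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except HttpError as err:
            # Re-raise non-transient errors, or give up after the last attempt.
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.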

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.0...4.4.1


4.4.0 New feature 8mo

Notable features

Support for loading and handling NIfTI (`.nii.gz`) medical imaging files via `Nifti()` column type.
Audio column now accepts a `num_channels` argument to select mono or stereo output.

Full changelog

Dataset Features

Add nifti support by @CloseChoice in https://github.com/huggingface/datasets/pull/7815

Load medical imaging datasets from Hugging Face:

from datasets import load_dataset

ds = load_dataset("username/my_nifti_dataset")
ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}

Load medical imaging datasets from your disk:

from datasets import Dataset, Nifti

files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}  (a Dataset built with from_dict has no splits)

Documentation: https://huggingface.co/docs/datasets/nifti_dataset

Add num channels to audio by @CloseChoice in https://github.com/huggingface/datasets/pull/7840

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

Python 3.14 support by @lhoestq in https://github.com/huggingface/datasets/pull/7836

What's Changed

Fix random seed on shuffle and interleave_datasets by @CloseChoice in https://github.com/huggingface/datasets/pull/7823
fix ci compressionfs by @lhoestq in https://github.com/huggingface/datasets/pull/7830
fix: better args passthrough for _batch_setitems() by @sghng in https://github.com/huggingface/datasets/pull/7817
Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in https://github.com/huggingface/datasets/pull/7833
resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7831
fix column with transform by @lhoestq in https://github.com/huggingface/datasets/pull/7843
support fsspec 2025.10.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7844

New Contributors

@sghng made their first contribution in https://github.com/huggingface/datasets/pull/7817
@art-test-stack made their first contribution in https://github.com/huggingface/datasets/pull/7833

Full Changelog: https://github.com/huggingface/datasets/compare/4.3.0...4.4.0


4.3.0 New feature 9mo

⚠ Upgrade required

Requires huggingface_hub version >= 1.1.0 for full streaming improvements.

Notable features

Add custom fingerprint support to `from_generator`

Full changelog

Dataset Features

Enable large scale distributed dataset streaming:

Keep hffs cache in workers when streaming by @lhoestq in https://github.com/huggingface/datasets/pull/7820
Retry open hf file by @lhoestq in https://github.com/huggingface/datasets/pull/7822

These improvements require huggingface_hub>=1.1.0 to take full effect

What's Changed

fix conda deps by @lhoestq in https://github.com/huggingface/datasets/pull/7810
Add pyarrow's binary view to features by @delta003 in https://github.com/huggingface/datasets/pull/7795
Fix polars cast column image by @CloseChoice in https://github.com/huggingface/datasets/pull/7800
Allow streaming hdf5 files by @lhoestq in https://github.com/huggingface/datasets/pull/7814
Fix batch_size default description in to_polars docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/7824
docs: document_dataset PDFs & OCR by @ethanknights in https://github.com/huggingface/datasets/pull/7812
Add custom fingerprint support to from_generator by @simonreise in https://github.com/huggingface/datasets/pull/7533
picklable batch_fn by @lhoestq in https://github.com/huggingface/datasets/pull/7826

New Contributors

@delta003 made their first contribution in https://github.com/huggingface/datasets/pull/7795
@CloseChoice made their first contribution in https://github.com/huggingface/datasets/pull/7800
@ethanknights made their first contribution in https://github.com/huggingface/datasets/pull/7812
@simonreise made their first contribution in https://github.com/huggingface/datasets/pull/7533

Full Changelog: https://github.com/huggingface/datasets/compare/4.2.0...4.3.0


4.2.0 New feature 9mo

Notable features

Sample without replacement option when interleaving datasets (`stopping_strategy="all_exhausted_without_replacement"`)
Parquet `load_dataset` gains `on_bad_files` argument (error/warn/skip) and column/filter selection support
Fragment scan options for Parquet streaming control (caching, prefetch limits)

Full changelog

Dataset Features

Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786
```
ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
```
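A rough sketch of what sampling without replacement means when interleaving, assuming the strategy visits the sources in turn, yields each example exactly once, and drops a source once it is exhausted instead of restarting it. `interleave_without_replacement` is a hypothetical pure-Python stand-in, not the library function.

```python
from typing import Iterable, Iterator


def interleave_without_replacement(sources: list[Iterable]) -> Iterator:
    """Round-robin over sources; exhausted sources are dropped, never restarted."""
    iterators = [iter(s) for s in sources]
    active = list(range(len(iterators)))
    while active:
        for idx in list(active):  # copy, since we remove while looping
            try:
                yield next(iterators[idx])
            except StopIteration:
                active.remove(idx)


mixed = list(interleave_without_replacement([[1, 2, 3], ["a"], [10, 20]]))
```

Every input example appears once in `mixed`, and iteration stops only when all sources are exhausted.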
Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806
```
ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
```
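The three policies can be pictured with a small stand-alone sketch. It assumes "warn" logs a warning and then skips the file, while "skip" drops it silently; `read_all` and `read_one` are hypothetical names, not part of the library.

```python
import warnings
from typing import Callable


def read_all(paths: list[str], read_one: Callable[[str], object], on_bad_files: str = "error") -> list:
    """Read every path, handling unreadable files according to on_bad_files."""
    if on_bad_files not in {"error", "warn", "skip"}:
        raise ValueError(f"Unknown on_bad_files value: {on_bad_files!r}")
    results = []
    for path in paths:
        try:
            results.append(read_one(path))
        except Exception as err:
            if on_bad_files == "error":
                raise  # fail fast on the first bad file
            if on_bad_files == "warn":
                warnings.warn(f"Skipping bad file {path!r}: {err}")
            # "skip" (and "warn" after warning): continue with the next file
    return results
```

With "error" the first unreadable file aborts the whole read, which is the safest default when silent data loss would be worse than a failure.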

Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801

docs to select columns and filter data efficiently

ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
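To make the meaning of the `columns` and `filters` arguments concrete, here is a sketch that applies the same kind of `(column, operator, value)` tuples to in-memory rows. It assumes a list of tuples is combined with AND; `select` and `matches` are hypothetical helpers. A real Parquet reader can apply such filters while reading to avoid loading unneeded data, which this sketch does not attempt.

```python
import operator
from typing import Iterable, Iterator, Optional

OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row: dict, filters: list[tuple]) -> bool:
    """True if the row satisfies every (column, op, value) filter."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


def select(
    rows: Iterable[dict],
    columns: Optional[list[str]] = None,
    filters: Optional[list[tuple]] = None,
) -> Iterator[dict]:
    """Keep rows passing the filters, projected onto the requested columns."""
    for row in rows:
        if filters and not matches(row, filters):
            continue
        yield {c: row[c] for c in columns} if columns else dict(row)


rows = [
    {"col_0": 0, "col_1": "a", "col_2": 1.5},
    {"col_0": 1, "col_1": "b", "col_2": 2.5},
]
kept = list(select(rows, columns=["col_0", "col_1"], filters=[("col_0", "==", 0)]))
```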

new argument to control buffering and caching when streaming

import pyarrow.dataset  # also makes the top-level pyarrow module available
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

What's Changed

Document HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7740
update tips in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7790
feat: avoid some copies in torch formatter by @drbh in https://github.com/huggingface/datasets/pull/7787
Support huggingface_hub v0.x and v1.x by @Wauplin in https://github.com/huggingface/datasets/pull/7783
Define CI future by @lhoestq in https://github.com/huggingface/datasets/pull/7799
More Parquet streaming docs by @lhoestq in https://github.com/huggingface/datasets/pull/7803
Less api calls when resolving data_files by @lhoestq in https://github.com/huggingface/datasets/pull/7805
typo by @lhoestq in https://github.com/huggingface/datasets/pull/7807

New Contributors

@drbh made their first contribution in https://github.com/huggingface/datasets/pull/7787

Full Changelog: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0


4.1.1 Bug fix 10mo

Notable features

Support arrow iterable when concatenating or interleaving

Full changelog

What's Changed

fix iterate nested field by @lhoestq in https://github.com/huggingface/datasets/pull/7775
Add support for arrow iterable when concatenating or interleaving by @radulescupetru in https://github.com/huggingface/datasets/pull/7771
fix empty dataset to_parquet by @lhoestq in https://github.com/huggingface/datasets/pull/7779

New Contributors

@radulescupetru made their first contribution in https://github.com/huggingface/datasets/pull/7771

Full Changelog: https://github.com/huggingface/datasets/compare/4.1.0...4.1.1


4.1.0 New feature 10mo

Notable features

Parquet files are now Optimized Parquet with content‑defined chunking enabled by default
Default row group size for Parquet set to 100 MB
Concurrent push_to_hub and IterableDataset push_to_hub support added

Full changelog

Dataset Features

feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589
- Parquet datasets are now Optimized Parquet!
- internally uses use_content_defined_chunking=True when writing Parquet files
- this enables fast deduped uploads to Hugging Face!
```
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
```
- this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, which makes duplicate data easy to detect.
- with this change, the new default row group size for Parquet is set to 100MB
- write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708
Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710
HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690
- load HDF5 datasets in one line of code
```
ds = load_dataset("username/dataset-with-hdf5-files")
```
- each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
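The idea behind content defined chunking, mentioned in the Parquet notes above, can be shown with a toy example: boundaries are placed where a hash of the last few bytes matches a pattern, so they depend on local content rather than on absolute position. This is a deliberately simplified sketch and not the algorithm used by Parquet or Xet; `cdc_chunks`, the 3-byte window and the sum-based hash are all assumptions chosen for readability.

```python
def cdc_chunks(data: bytes, window: int = 3, mask: int = 0x0F) -> list[bytes]:
    """Cut data wherever a toy hash of the trailing window matches the mask."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        # Toy hash: sum of the last `window` bytes (real systems use rolling hashes).
        if sum(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


original = bytes(range(256))
edited = b"NEW" + original  # insert a few bytes at the front

old_chunks = cdc_chunks(original)
new_chunks = cdc_chunks(edited)
# Chunks after the edit point are byte-identical, so they could be deduplicated.
shared = [c for c in old_chunks if c in new_chunks]
```

With fixed-size chunks the inserted bytes would shift every later chunk; here only the chunk around the edit changes, which is what allows deduplicated storage to skip re-uploading the rest.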

Other improvements and bug fixes

Convert to string when needed + faster .zstd by @lhoestq in https://github.com/huggingface/datasets/pull/7683
fix audio cast storage from array + sampling_rate by @lhoestq in https://github.com/huggingface/datasets/pull/7684
Fix misleading add_column() usage example in docstring by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7648
Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in https://github.com/huggingface/datasets/pull/7438
Update fsspec max version to current release 2025.7.0 by @rootAvish in https://github.com/huggingface/datasets/pull/7701
Update dataset_dict push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7711
Retry intermediate commits too by @lhoestq in https://github.com/huggingface/datasets/pull/7712
num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in https://github.com/huggingface/datasets/pull/7702
Update cli.mdx to refer to the new "hf" CLI by @evalstate in https://github.com/huggingface/datasets/pull/7713
fix num_proc=1 ci test by @lhoestq in https://github.com/huggingface/datasets/pull/7714
Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in https://github.com/huggingface/datasets/pull/7715
typo by @lhoestq in https://github.com/huggingface/datasets/pull/7716
fix largelist repr by @lhoestq in https://github.com/huggingface/datasets/pull/7735
Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in https://github.com/huggingface/datasets/pull/7730
Fix type hint train_test_split by @qgallouedec in https://github.com/huggingface/datasets/pull/7736
fix(webdataset): don't .lower() field_name by @YassineYousfi in https://github.com/huggingface/datasets/pull/7726
Refactor HDF5 and preserve tree structure by @klamike in https://github.com/huggingface/datasets/pull/7743
docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in https://github.com/huggingface/datasets/pull/7737
Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in https://github.com/huggingface/datasets/pull/7761
Support pathlib.Path for feature input by @Joshua-Chin in https://github.com/huggingface/datasets/pull/7755
add support for pyarrow string view in features by @onursatici in https://github.com/huggingface/datasets/pull/7718
Fix typo in error message for cache directory deletion by @brchristian in https://github.com/huggingface/datasets/pull/7749
update torchcodec in ci by @lhoestq in https://github.com/huggingface/datasets/pull/7764
Bump dill to 0.4.0 by @Bomme in https://github.com/huggingface/datasets/pull/7763

New Contributors

@DavidRConnell made their first contribution in https://github.com/huggingface/datasets/pull/7438
@rootAvish made their first contribution in https://github.com/huggingface/datasets/pull/7701
@tanuj-rai made their first contribution in https://github.com/huggingface/datasets/pull/7702
@evalstate made their first contribution in https://github.com/huggingface/datasets/pull/7713
@brchristian made their first contribution in https://github.com/huggingface/datasets/pull/7730
@klamike made their first contribution in https://github.com/huggingface/datasets/pull/7690
@YassineYousfi made their first contribution in https://github.com/huggingface/datasets/pull/7726
@Sanjaykumar030 made their first contribution in https://github.com/huggingface/datasets/pull/7737
@kszucs made their first contribution in https://github.com/huggingface/datasets/pull/7589
@Joshua-Chin made their first contribution in https://github.com/huggingface/datasets/pull/7755
@onursatici made their first contribution in https://github.com/huggingface/datasets/pull/7718
@Bomme made their first contribution in https://github.com/huggingface/datasets/pull/7763

Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0


4.0.0 Breaking risk 1y

Breaking changes

Remove scripts altogether; `trust_remote_code` is no longer supported.
Replace `Sequence` type with `List`; `Sequence` becomes a utility returning `List` or dict based on subfeature.
Torchcodec decoding replaces soundfile for audio and decord for video, requiring torch>=2.7.0 and FFmpeg >=4 (not yet available on Windows).

Notable features

Add `IterableDataset.push_to_hub()` for streaming data pipeline uploads.
Introduce `num_proc=` parameter to `.push_to_hub()` for Dataset and IterableDataset to enable parallel pushing.
New `Column` object enabling lazy iteration over column values in an IterableDataset.

Full changelog

New Features

Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595

```
# Build streaming data pipelines in a few lines of code !
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```

Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606
```
# Faster push to Hub ! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

New Column object

Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

```
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```
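The idea behind a lazy column can be sketched in plain Python. The `LazyColumn` class below is a hypothetical stand-in, not the `datasets.Column` implementation: it keeps a reference to the rows and reads the requested field only when a cell is accessed or iterated over.

```python
from collections.abc import Iterator, Sequence
from typing import Any


class LazyColumn:
    """Hypothetical lazy view over one field of a row-oriented table."""

    def __init__(self, rows: Sequence[dict[str, Any]], name: str) -> None:
        self._rows = rows
        self._name = name

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Any:
        # Only the requested row is touched; no column list is built.
        return self._rows[index][self._name]

    def __iter__(self) -> Iterator[Any]:
        # Values are produced one at a time while iterating.
        for row in self._rows:
            yield row[self._name]


rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
texts = LazyColumn(rows, "text")

first_text = texts[0]    # same value as rows[0]["text"]
all_texts = list(texts)  # materialize only when explicitly asked
```

Here `texts[0]` and `rows[0]["text"]` return the same value, mirroring the equivalence between `ds["text"][0]` and `ds[0]["text"]` noted above.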

Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616

Enables streaming only the ranges you need !

```
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

Requires torch>=2.7.0 and FFmpeg >= 4
Not available for Windows yet but it is coming soon - in the meantime please use datasets<4.0
Load audio data with AudioDecoder:

```
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

Load video data with VideoDecoder:

```
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)  # frames 0 to 5 (stop is exclusive)
frames.data.shape  # torch.Size([6, 3, 240, 320])
```

Breaking changes

Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
- trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- torchcodec replaces soundfile for audio decoding
- torchcodec replaces decord for video decoding

Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634

Introduction of the List type

```
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:

```
from datasets import Sequence

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
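To make the legacy behavior concrete, the plain-Python sketch below shows the "list of dicts to dict of lists" reshaping described above. The helper name `list_of_dicts_to_dict_of_lists` is made up for this illustration and is not part of the `datasets` API; it also assumes every dict has the same keys.

```python
from typing import Any


def list_of_dicts_to_dict_of_lists(items: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical helper: transpose a list of dicts into a dict of lists."""
    result: dict[str, list[Any]] = {}
    for item in items:
        for key, value in item.items():
            result.setdefault(key, []).append(value)
    return result


answers = [
    {"text": "Paris", "start": 10},
    {"text": "France", "start": 42},
]
transposed = list_of_dicts_to_dict_of_lists(answers)
print(transposed)  # {'text': ['Paris', 'France'], 'start': [10, 42]}
```

This is why `Sequence` with a dict subfeature now returns a dict whose values are `List` features rather than a single list-like type.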

Other improvements and bug fixes

Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
[MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
Add missing property on RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672

New Contributors

@TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
@francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
@emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
@SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
@ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
@ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
@TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
@HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659

Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0

