datasets

RAG & Retrieval

A lightweight Python library for loading and processing thousands of public datasets (text, audio, image, video, etc.) with one‑line commands.

Python · Latest 4.8.5 · 1mo ago

Features

  • One‑line dataloader for any public dataset via `load_dataset()`
  • Native support for CSV, JSON, Parquet, HDF5, audio, image, video, PDF and NIfTI formats
  • Streaming mode for on‑the‑fly iteration without full download
  • Zero‑copy Apache Arrow backend with smart caching
  • Multi‑framework interoperability (NumPy, Pandas, PyTorch, TensorFlow, JAX)
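
As a rough illustration of the streaming feature listed above, the sketch below peeks at the first few rows of a streamed split without downloading the whole dataset. `load_dataset(..., streaming=True)` is the library's documented entry point; the helper names `first_rows` and `stream_split` and the repository id in the usage note are placeholders invented for this example.

```python
from itertools import islice


def first_rows(rows, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_split(repo_id, split="train"):
    """Open one split in streaming mode (hypothetical helper, not part of the library)."""
    # Imported lazily so first_rows stays usable where `datasets` is not installed.
    from datasets import load_dataset

    # With streaming=True the result is an iterable dataset that fetches rows on demand.
    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    # Stand-in for a streamed dataset: a generator far too large to materialize.
    fake_stream = ({"id": i} for i in range(10**9))
    print(first_rows(fake_stream, 3))
```

With network access and the library installed, `first_rows(stream_split("username/dataset-name"), 5)` would follow the same pattern; the repository id here is a placeholder.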

Recent releases

View all 18 releases →
4.8.5 Bug fix

JSON values are now decoded before DataFrame.to_json() and related conversions (to_list, to_dict) are called.

Full changelog

Main bug fixes

  • fix: decode Json() values before calling DataFrame.to_json() (#8116) by @Brianzhengca in https://github.com/huggingface/datasets/pull/8122
  • Fix: decode JSON type before to_list or to_dict is called by @ItsTania in https://github.com/huggingface/datasets/pull/8137
  • Fix batching for table-formatted datasets by @bluehyena in https://github.com/huggingface/datasets/pull/8126
  • Fix iterable map resume state by @Brianzhengca in https://github.com/huggingface/datasets/pull/8147
  • don't embed remote files in download_and_prepare to parquet by @lhoestq in https://github.com/huggingface/datasets/pull/8150

Other improvements and bug fixes

  • Parse agent traces by @lhoestq in https://github.com/huggingface/datasets/pull/8113
  • 🔒 Pin GitHub Actions to commit SHAs by @paulinebm in https://github.com/huggingface/datasets/pull/8114
  • chore: bump doc-builder SHA for PR upload workflow by @rtrompier in https://github.com/huggingface/datasets/pull/8134
  • Remove print statement in JSON processing by @lhoestq in https://github.com/huggingface/datasets/pull/8136
  • Don't include files list DatasetInfo (and remove old stuff) by @lhoestq in https://github.com/huggingface/datasets/pull/8128
  • update ci uer by @lhoestq in https://github.com/huggingface/datasets/pull/8139
  • fix warning in ci by @lhoestq in https://github.com/huggingface/datasets/pull/8140
  • fix mask in embed_storage for remote files by @lhoestq in https://github.com/huggingface/datasets/pull/8151
  • fix original_files missing in ci json test by @lhoestq in https://github.com/huggingface/datasets/pull/8152
  • Fix null in embed storage by @lhoestq in https://github.com/huggingface/datasets/pull/8154
  • Fix base_path in integration tests by @lhoestq in https://github.com/huggingface/datasets/pull/8155

New Contributors

  • @paulinebm made their first contribution in https://github.com/huggingface/datasets/pull/8114
  • @Brianzhengca made their first contribution in https://github.com/huggingface/datasets/pull/8122
  • @bluehyena made their first contribution in https://github.com/huggingface/datasets/pull/8126
  • @rtrompier made their first contribution in https://github.com/huggingface/datasets/pull/8134
  • @ItsTania made their first contribution in https://github.com/huggingface/datasets/pull/8137

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.4...4.8.5

4.8.4 Bug fix
Notable features
  • Support for the latest torchvision version
Full changelog

What's Changed

  • Support latest torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/8087
  • fix regression when loading JSON with one file = one object by @lhoestq in https://github.com/huggingface/datasets/pull/8086

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.3...4.8.4

4.8.3 Bug fix

Fixed the split_dataset_by_node step and corrected the Json.cast_storage docstring.

Full changelog

What's Changed

  • Fix split_dataset_by_node step by @lhoestq in https://github.com/huggingface/datasets/pull/8081
  • Fix docstring of Json.cast_storage by @albertvillanova in https://github.com/huggingface/datasets/pull/8080

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.2...4.8.3

4.8.1 Bug fix

Fixed a double yield in formatted Arrow iteration.

Full changelog

What's Changed

  • Fix formatted iter arrow double yield by @HaukurPall in https://github.com/huggingface/datasets/pull/8063

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.0...4.8.1

4.8.0 Mixed
⚠ Upgrade required
  • Bumped the `dill` and `multiprocess` dependency versions to add Python 3.14 support.
  • On macOS, push_to_hub now uses the `spawn` start method instead of `fork` to avoid segmentation faults.
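
For readers unfamiliar with the `spawn` and `fork` start methods mentioned above, here is a minimal standard-library sketch of selecting `spawn` explicitly. It is not the code used by `push_to_hub`; the helper names `spawn_context` and `map_with_spawn` are invented for illustration.

```python
import multiprocessing as mp


def square(x):
    """Toy worker; it must live at module level so spawned children can import it."""
    return x * x


def spawn_context():
    """Return a context bound to the 'spawn' start method.

    get_context() leaves the process-wide default start method untouched.
    """
    return mp.get_context("spawn")


def map_with_spawn(func, items, processes=2):
    """Run func over items in a pool whose workers are started with 'spawn'."""
    with spawn_context().Pool(processes) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    print(spawn_context().get_start_method())
```

Because `spawn` starts each worker in a fresh interpreter that re-imports the main module, a call such as `map_with_spawn(square, [1, 2, 3])` belongs in a script file under the `if __name__ == "__main__":` guard.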
Notable features
  • Read and write datasets from HF Storage Buckets via `load_dataset` with paths like `buckets/username/data-bucket` or `hf://buckets/...`.
  • IterableDataset.push_to_hub now accepts a `max_shard_size` argument (this requires iterating over the dataset twice).
  • Added more Arrow‑native operations and improved glob pattern support in archives for IterableDataset.
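
To show why a size-based shard limit on a stream can need two iterations, the following is a simplified sketch and not the library's implementation. It assumes row size can be approximated by the JSON-encoded length, and `row_size`, `shard_two_pass` and `make_rows` are hypothetical names.

```python
import json
import math


def row_size(row):
    """Approximate a row's size as the byte length of its JSON encoding (an assumption)."""
    return len(json.dumps(row).encode("utf-8"))


def shard_two_pass(make_rows, max_shard_size):
    """Split a re-iterable stream into shards of roughly max_shard_size bytes.

    make_rows is a zero-argument callable returning a fresh iterator each time,
    because a stream cannot be rewound after the first pass.
    """
    # Pass 1: measure the total size, which fixes the number of shards.
    total = sum(row_size(row) for row in make_rows())
    num_shards = max(1, math.ceil(total / max_shard_size))
    target = total / num_shards

    # Pass 2: fill shards up to the per-shard target.
    shards = [[]]
    filled = 0
    for row in make_rows():
        size = row_size(row)
        if shards[-1] and filled + size > target and len(shards) < num_shards:
            shards.append([])
            filled = 0
        shards[-1].append(row)
        filled += size
    return shards


if __name__ == "__main__":
    rows = [{"id": i} for i in range(10)]
    for index, shard in enumerate(shard_two_pass(lambda: iter(rows), 30)):
        print(index, len(shard))
```

The shard count is only known once the first pass has finished, which is the cost the release note refers to.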
Full changelog

Dataset Features

  • Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064

    from datasets import load_dataset
    # load raw data from a Storage Bucket on HF
    ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
    # or manually, using hf:// paths
    ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
    # process, filter
    ds = ds.map(...).filter(...)
    # publish the AI-ready dataset
    ds.push_to_hub("username/my-dataset-ready-for-training")
    

    This also fixes multiprocessed push_to_hub on macOS, which was causing a segfault (it now uses spawn instead of fork).
    It also bumps the dill and multiprocess versions to support Python 3.14.

  • Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068

    • added max_shard_size to IterableDataset.push_to_hub (this requires iterating twice, since the first pass is needed to know the full dataset size; improvements are welcome)
    • more arrow-native iterable operations for IterableDataset
    • better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
    • fixes for to_pandas, videofolder, load_dataset_builder kwargs
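
The chained URL in the glob bullet above reads right to left: the part after `::` locates the archive, and the part before it is a pattern applied to the archive's members. The sketch below mimics that idea with the standard library only. It does not use the library's own resolver, and `split_chained_url` and `match_members` are hypothetical helpers.

```python
import fnmatch
import io
import zipfile


def split_chained_url(url):
    """Split 'zip://<pattern>::<archive url>' into its inner and outer parts."""
    inner, _, outer = url.partition("::")
    return inner, outer


def match_members(archive_bytes, inner_url):
    """List the archive members whose names match the pattern of the inner URL."""
    pattern = inner_url.removeprefix("zip://")
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # fnmatch lets '*' cross '/' boundaries; real glob resolvers may be stricter.
        return sorted(
            name for name in archive.namelist() if fnmatch.fnmatchcase(name, pattern)
        )


if __name__ == "__main__":
    # Build a small in-memory archive instead of fetching a remote one.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.jsonl", "{}\n")
        archive.writestr("b.jsonl", "{}\n")
        archive.writestr("readme.txt", "notes")

    inner, outer = split_chained_url(
        "zip://*.jsonl::hf://datasets/username/dataset-name/data.zip"
    )
    print(outer)
    print(match_members(buffer.getvalue(), inner))
```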

What's Changed

  • fix reshard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/8061
  • Improve error message for invalid data_files pattern format by @kushalkkb in https://github.com/huggingface/datasets/pull/8060
  • fix null filling in missing jsonl columns by @lhoestq in https://github.com/huggingface/datasets/pull/8069

New Contributors

  • @kushalkkb made their first contribution in https://github.com/huggingface/datasets/pull/8060
  • @Michael-RDev made their first contribution in https://github.com/huggingface/datasets/pull/8068

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0

About

Stars: 21,574
Forks: 3,232
Languages: Python, Makefile

Install & Platforms

Install via: pip
