ludwig

v0.17.0 Breaking

This release is flagged as breaking. Platform teams should review the changes below before upgrading.

Published 2mo LLM Frameworks

✓ No known CVEs patched

Topics

computer-vision data-centric data-science deep machine-learning deeplearning

fine-tuning learning llama llama2 llm llm-training machinelearning mistral natural-language natural-language-processing pytorch

ReleasePort's take

Moderate signal

Ludwig v0.17.0 introduces lazy preprocessing that decodes audio and image features on‑the‑fly, caches decoded tensors as memory‑mapped files for subsequent epochs, and uses a background prefetch thread to feed GPUs without blocking the training loop.

Why it matters: These changes reduce training startup latency and improve GPU utilization by overlapping decoding with computation, directly benefiting developers and SREs managing large‑scale ML workloads.

Summary

AI summary

Lazy preprocessing decodes audio and image features on‑the‑fly, so training can start without an upfront preprocessing pass.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	Lazy preprocessing for audio and image features allows on‑the‑fly decoding during training.	llm_adapter@2026-05-21	high	-
Feature	Medium	`preprocessing_mode: lazy_cached` caches decoded tensors as memory‑mapped files for subsequent epochs.	llm_adapter@2026-05-21	high	-
Feature	Medium	Background prefetch thread decodes audio/image ahead of training loop, feeding GPU without blocking.	llm_adapter@2026-05-21	high	-
Feature	Medium	Ray distributed training decodes lazy features inside Ray data pipeline, parallelizing decode work.	llm_adapter@2026-05-21	high	-
Feature	Medium	YAML‑based search spaces are now declaratively defined and loaded via `SearchSpace._from_specs()`.	llm_adapter@2026-05-21	high	-
Feature	Medium	Dataset quality analysis profiles dataset size, class balance, modality distribution before building search space.	llm_adapter@2026-05-21	high	-
Feature	Medium	Hyperparameter caps (epoch counts, batch sizes) adapt automatically based on dataset size.	llm_adapter@2026-05-21	high	-
Feature	Medium	Transformer combiners receive capped learning rate upper bound to prevent NaN loss during hyperopt.	llm_adapter@2026-05-21	high	-
Feature	Medium	`configs_from_dataframe` now correctly propagates `default_epochs` to `TrainerSpec` and imports default search space builder.	llm_adapter@2026-05-21	high	-
Feature	Medium	Configurable `prefetch_size` tunes background decoder queue depth for CPU/GPU overlap.	granite4.1:30b@2026-05-22-audit	low	-
Feature	Low	Ludwig now ships a built‑in library of 145 HuggingFace datasets via `ludwig.datasets`, including 94 ready‑to‑use configs across multiple modalities.	granite4.1:30b@2026-05-22-audit	low	-
Bugfix	Low	LLM extra (`pip install ludwig[llm]`) pins `torch>=2.7` for quantization and Flash Attention 2 support.	granite4.1:30b@2026-05-22-audit	low	-
Bugfix	Low	Config validation now raises a clear error if `tabpfn_v2` combiner is used without the `tabpfn` package installed.	granite4.1:30b@2026-05-22-audit	low	-
Bugfix	Low	Ray lazy decode placement ensures decoding occurs inside the Ray data pipeline before reaching training actors, avoiding serialization overhead.	granite4.1:30b@2026-05-22-audit	low	-
Refactor	Low	The monolithic `visualize.py` module has been split into a scoped `visualize/` package with submodules for different visualization types.	granite4.1:30b@2026-05-22-audit	low	-
Refactor	Low	Stale duplicate `text/encoders.py` removed and encoder classes flattened for readability.	granite4.1:30b@2026-05-22-audit	low	-

Full changelog

New Features

Lazy Preprocessing for Audio and Image

Ludwig 0.17 introduces lazy preprocessing — the most significant change to the training pipeline in several releases. Previously, audio and image features required a full preprocessing pass before training could begin: decode every file, resize/resample, write to disk. For large multimodal datasets, this meant waiting hours before a single training step ran.

Now you can start training immediately.

preprocessing_mode: lazy — audio and image features are decoded on the fly during training, directly from raw file paths. No upfront pass. Training starts in seconds.
preprocessing_mode: lazy_cached — decoded tensors are cached as memory-mapped files on the first pass through the data. Subsequent epochs hit the cache directly, with zero decode overhead after the first.
preprocessing_mode: eager — the previous default, preserved for full backwards compatibility.
prefetch_size — configurable prefetch queue depth for the background decoder thread, letting you tune the CPU/GPU overlap for your hardware.
A background prefetch thread decodes ahead of the training loop, keeping the GPU fed without blocking the forward pass.
Ray distributed training decodes lazy features inside the Ray data pipeline (not inside training actors), so decode work is properly parallelized across the cluster.
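
The background prefetch idea described above can be sketched with a bounded queue and one worker thread. This is a generic illustration, not Ludwig's implementation: the `prefetch` function and the `decode` callable are hypothetical, and only the `prefetch_size` name is taken from the release notes, where it is described as the queue depth.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")

_DONE = object()  # sentinel marking the end of the stream


def prefetch(items: Iterable[T], decode: Callable[[T], R], prefetch_size: int = 4) -> Iterator[R]:
    """Yield decode(item) in order, decoding ahead on a background thread.

    Hypothetical helper: the queue depth plays the role that the release
    notes assign to `prefetch_size`.
    """
    buffer: queue.Queue = queue.Queue(maxsize=prefetch_size)

    def worker() -> None:
        try:
            for item in items:
                buffer.put(decode(item))  # blocks while the queue is full
        except Exception as exc:  # forward decode errors to the consumer
            buffer.put(exc)
        finally:
            buffer.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        result = buffer.get()  # the training loop waits only if decode falls behind
        if result is _DONE:
            return
        if isinstance(result, Exception):
            raise result
        yield result
```

A larger queue lets the decoder run further ahead at the cost of holding more decoded samples in memory. If the consumer stops early, the daemon worker in this sketch simply stays blocked on the full queue until the process exits.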

This matters most for audio and image datasets too large to preprocess in full — but it also makes iteration faster for any multimodal workload. (#4171, #4173)
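
The `lazy_cached` behaviour (decode once, then read from a memory-mapped file) can be illustrated with the standard-library `mmap` module. This is a minimal sketch under stated assumptions, not Ludwig's cache: the `MmapCache` class, its file naming, and the `decode` callable are all hypothetical, and it stores raw bytes rather than tensors.

```python
import mmap
import os
import tempfile
from typing import Callable


class MmapCache:
    """Hypothetical cache: decode on first access, read via mmap afterwards."""

    def __init__(self, cache_dir: str, decode: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.decode = decode
        self.decode_calls = 0  # counts how often the expensive decode ran

    def _cache_path(self, key: str) -> str:
        return os.path.join(self.cache_dir, key.replace(os.sep, "_") + ".bin")

    def get(self, key: str) -> bytes:
        path = self._cache_path(key)
        if not os.path.exists(path):
            # First pass: decode and persist the result.
            self.decode_calls += 1
            data = self.decode(key)
            with open(path, "wb") as f:
                f.write(data)
            return data
        # Later passes: map the cached file (must be non-empty) and read it.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                return view[:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = MmapCache(tmp, lambda key: key.encode() * 3)
        cache.get("clip.wav")  # decodes and writes the cache file
        cache.get("clip.wav")  # served from the memory-mapped file
        print(cache.decode_calls)
```

The sketch copies the mapped bytes out for simplicity. A real tensor cache would more likely hand the mapped buffer to the array library directly so the operating system pages data in on demand.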

Mega-AutoML: Rebuilt from the Ground Up

The AutoML infrastructure in Ludwig has been completely overhauled. (#4168, #4169)

YAML search space — search spaces are now declared in YAML and loaded via SearchSpace._from_specs(). This makes it straightforward to define custom search spaces, share them across experiments, and version-control them alongside your configs.
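
To make the declarative idea concrete, here is a minimal sketch of turning a list of spec entries (the shape a YAML file would typically parse into) into sampled hyperparameters. The release notes do not show the signature of `SearchSpace._from_specs()` or the spec schema, so everything here is an assumption: the `type`, `low`, `high`, and `values` keys, the parameter names, and both helper functions are hypothetical.

```python
import math
import random
from typing import Any, Dict, List


def sample_from_spec(spec: Dict[str, Any], rng: random.Random) -> Any:
    """Draw one value for a single (hypothetical) spec entry."""
    kind = spec["type"]
    if kind == "choice":
        return rng.choice(spec["values"])
    if kind == "uniform":
        return rng.uniform(spec["low"], spec["high"])
    if kind == "loguniform":
        # Sample uniformly in log space, then map back.
        return math.exp(rng.uniform(math.log(spec["low"]), math.log(spec["high"])))
    raise ValueError(f"unknown spec type: {kind!r}")


def sample_config(specs: List[Dict[str, Any]], seed: int = 0) -> Dict[str, Any]:
    """Sample one value per spec, reproducibly for a given seed."""
    rng = random.Random(seed)
    return {spec["name"]: sample_from_spec(spec, rng) for spec in specs}


# Illustrative specs only; not Ludwig's actual schema or defaults.
SPECS = [
    {"name": "trainer.learning_rate", "type": "loguniform", "low": 1e-5, "high": 1e-2},
    {"name": "trainer.batch_size", "type": "choice", "values": [32, 64, 128]},
    {"name": "combiner.dropout", "type": "uniform", "low": 0.0, "high": 0.5},
]
```

Keeping the specs as plain data is what makes them easy to share and version-control: the sampling code stays fixed while the file changes.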

Dataset quality analysis — before building a search space, Ludwig now profiles your dataset: size, class balance, modality distribution, output cardinality. The search space is then constructed with awareness of what the data actually looks like.
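
A rough sketch of what such a profile might compute for a single classification label, using only the standard library. The `profile_dataset` function and the keys it returns are hypothetical and are not Ludwig's profiler. The sketch covers size, class balance, and output cardinality, and leaves out modality distribution.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_dataset(rows: List[Dict[str, Any]], label: str) -> Dict[str, Any]:
    """Summarize dataset size and class balance for one label column."""
    counts = Counter(row[label] for row in rows)
    majority = max(counts.values()) if counts else 0
    minority = min(counts.values()) if counts else 0
    return {
        "num_rows": len(rows),
        "num_classes": len(counts),  # output cardinality
        "class_counts": dict(counts),
        # Ratio of the largest class to the smallest; None for an empty dataset.
        "imbalance_ratio": majority / minority if minority else None,
    }
```

A search-space builder could then branch on values such as `num_rows` or `imbalance_ratio` when choosing ranges.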

Dataset-size-aware hyperparameter caps — epoch counts and batch sizes are now automatically capped based on dataset size, preventing both under-training on large datasets and OOM on tiny ones. Transformer-based combiners (cross_attention, perceiver) get additional batch size caps to prevent GPU memory exhaustion.
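
The capping logic can be pictured as a lookup from dataset size to upper bounds, followed by filtering the candidate values. The release notes give no thresholds, so every number below is invented for illustration, and `Caps`, `caps_for_dataset`, and `apply_caps` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Caps:
    max_epochs: int
    max_batch_size: int


def caps_for_dataset(num_rows: int, transformer_combiner: bool = False) -> Caps:
    """Pick upper bounds from dataset size (all thresholds are made up)."""
    if num_rows < 1_000:
        caps = Caps(max_epochs=100, max_batch_size=32)
    elif num_rows < 100_000:
        caps = Caps(max_epochs=50, max_batch_size=256)
    else:
        caps = Caps(max_epochs=10, max_batch_size=1024)
    if transformer_combiner:
        # Extra batch size cap for memory-hungry combiners.
        caps = Caps(caps.max_epochs, min(caps.max_batch_size, 64))
    return caps


def apply_caps(values: List[int], cap: int) -> List[int]:
    """Drop candidates above the cap; fall back to the cap if none remain."""
    kept = [v for v in values if v <= cap]
    return kept or [cap]
```

For example, `apply_caps([16, 64, 256], caps_for_dataset(500).max_batch_size)` keeps only the candidates at or below the small-dataset batch size cap.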

Learning rate and stability fixes — transformer-based combiners now have a capped learning rate upper bound to prevent NaN loss during hyperopt on sensitive architectures.
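
Capping the upper bound amounts to clamping the search range before sampling. The sketch below assumes a log-uniform learning rate search and a cap of `1e-3`. Neither the distribution nor the cap value is stated in the release notes, and both function names are hypothetical.

```python
import math
import random
from typing import Tuple


def capped_lr_range(low: float, high: float, cap: float) -> Tuple[float, float]:
    """Clamp the upper bound of a learning rate range to `cap`."""
    if low > cap:
        raise ValueError("lower bound exceeds the cap")
    return low, min(high, cap)


def sample_lr(low: float, high: float, rng: random.Random) -> float:
    """Sample log-uniformly between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))
```

With `capped_lr_range(1e-5, 1e-1, cap=1e-3)`, no trial can draw a learning rate above the cap, which is the property that keeps sensitive architectures away from divergent settings.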

configs_from_dataframe improvements — default_epochs is now correctly threaded through to TrainerSpec, and the default search space builder is properly imported and called.

HuggingFace Dataset Library — 145 Datasets

Ludwig now ships with a built-in library of HuggingFace datasets, usable directly from ludwig.datasets. This release adds:

94 datasets spanning text classification, NER, question answering, summarization, audio classification, image classification, multimodal tasks, and more.
51 datasets with custom loaders — for complex HF datasets that require non-trivial loading logic (custom splits, column renaming, label normalization, multi-config handling).
Examples include ESC-50 (environmental audio classification), WikiANN (multilingual NER), GoEmotions (fine-grained emotion classification), and New Yorker Caption Contest (multimodal humor).

These aren't just dataset references — each ships with a Ludwig config that maps columns to features, sets appropriate output types, and is smoke-tested end-to-end.

Refactoring

Visualize Package

The visualize.py module had grown to 4,144 lines. It's now split into a domain-scoped visualize/ package, with submodules organized by visualization type (learning curves, confusion matrices, calibration, hyperopt, etc.). The CLI entrypoint and all public APIs are unchanged. (#4154)

Text Encoders

Removed a stale duplicate text/encoders.py that had diverged from the canonical encoder implementations. Deep nesting in several encoder classes has been flattened for readability. (#4159)

Bug Fixes

LLM extra now requires torch>=2.7 — the LLM extra (pip install ludwig[llm]) now pins torch>=2.7, which is required for quantization and Flash Attention 2 support in the current transformers stack. Non-OSError exceptions during pretrained model loading no longer trigger the retry loop. (fix)
TabPFN v2 guard — Ludwig now raises a clear error at config validation time when a tabpfn_v2 combiner is configured but the tabpfn package is not installed, instead of failing at model construction. (fix)
Ray lazy decode placement: lazy audio/image features are now decoded inside the Ray data pipeline, before data reaches training actors. This keeps decode work off the critical path and avoids serializing decoded tensors across the Ray object store. Missing lazy_audio_params / lazy_image_params now emit a warning rather than being a silent no-op. (fix, fix)
Smoke test stability — diversity retry logic for sorted classification datasets, media-aware shuffle buffer sizing, and per-modality buffer tuning to eliminate label collapse in small evaluation splits.
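
The TabPFN v2 guard in the list above is an instance of a general pattern: check for an optional dependency when the config is validated, not when the model is built. A minimal sketch using `importlib.util.find_spec` follows. The function, the exception class, and the error message are hypothetical, and only the `tabpfn_v2` to `tabpfn` pairing comes from the release notes.

```python
import importlib.util
from typing import Any, Dict, Mapping


class ConfigValidationError(ValueError):
    """Hypothetical error type raised at config validation time."""


# Combiner type -> package it needs (pairing taken from the release notes).
REQUIRED_PACKAGES: Mapping[str, str] = {"tabpfn_v2": "tabpfn"}


def validate_combiner_dependencies(
    config: Dict[str, Any],
    required: Mapping[str, str] = REQUIRED_PACKAGES,
) -> None:
    """Fail fast if the configured combiner needs a package that is missing."""
    combiner_type = config.get("combiner", {}).get("type")
    package = required.get(combiner_type)
    # find_spec returns None for a top-level package that is not installed.
    if package is not None and importlib.util.find_spec(package) is None:
        raise ConfigValidationError(
            f"combiner '{combiner_type}' requires the '{package}' package, "
            "which is not installed"
        )
```

Failing at validation time means the error surfaces before any data is loaded or preprocessed.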

CI

Distributed integration tests now run in 6 parallel groups (up from 1), cutting distributed test wall time by ~5×. Integration test groups are renamed to sequential letters for clarity. (#4172)

Installation

pip install ludwig==0.17.0

GitHub: https://github.com/ludwig-ai/ludwig
Docs: https://ludwig.ai

About ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
