This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Moderate signal
Ludwig v0.17.0 introduces lazy preprocessing that decodes audio and image features on‑the‑fly, caches decoded tensors as memory‑mapped files for subsequent epochs, and uses a background prefetch thread to feed GPUs without blocking the training loop.
Why it matters: These changes reduce training startup latency and improve GPU utilization by overlapping decoding with computation, directly benefiting developers and SREs managing large‑scale ML workloads.
Summary
AI summary: Lazy preprocessing enables on‑the‑fly decoding of audio and image features, starting training instantly.
Changes in this release
| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Feature | Medium | Lazy preprocessing for audio and image features allows on‑the‑fly decoding during training. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | `preprocessing_mode: lazy_cached` caches decoded tensors as memory‑mapped files for subsequent epochs. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Background prefetch thread decodes audio/image ahead of training loop, feeding GPU without blocking. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Ray distributed training decodes lazy features inside Ray data pipeline, parallelizing decode work. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | YAML‑based search spaces are now declaratively defined and loaded via `SearchSpace._from_specs()`. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Dataset quality analysis profiles dataset size, class balance, modality distribution before building search space. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Hyperparameter caps (epoch counts, batch sizes) adapt automatically based on dataset size. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Transformer combiners receive capped learning rate upper bound to prevent NaN loss during hyperopt. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | `configs_from_dataframe` now correctly propagates `default_epochs` to `TrainerSpec` and imports default search space builder. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Configurable `prefetch_size` tunes background decoder queue depth for CPU/GPU overlap. | granite4.1:30b@2026-05-22-audit | low | None |
| Feature | Low | Ludwig now ships a built‑in library of 145 HuggingFace datasets via `ludwig.datasets`, including 94 ready‑to‑use configs across multiple modalities. | granite4.1:30b@2026-05-22-audit | low | None |
| Bugfix | Low | LLM extra (`pip install ludwig[llm]`) pins `torch>=2.7` for quantization and Flash Attention 2 support. | granite4.1:30b@2026-05-22-audit | low | None |
| Bugfix | Low | Config validation now raises a clear error if `tabpfn_v2` combiner is used without the `tabpfn` package installed. | granite4.1:30b@2026-05-22-audit | low | None |
| Bugfix | Low | Ray lazy decode placement ensures decoding occurs inside the Ray data pipeline before reaching training actors, avoiding serialization overhead. | granite4.1:30b@2026-05-22-audit | low | None |
| Refactor | Low | The monolithic `visualize.py` module has been split into a scoped `visualize/` package with submodules for different visualization types. | granite4.1:30b@2026-05-22-audit | low | None |
| Refactor | Low | Stale duplicate `text/encoders.py` removed and encoder classes flattened for readability. | granite4.1:30b@2026-05-22-audit | low | None |
Full changelog
New Features
Lazy Preprocessing for Audio and Image
Ludwig 0.17 introduces lazy preprocessing — the most significant change to the training pipeline in several releases. Previously, audio and image features required a full preprocessing pass before training could begin: decode every file, resize/resample, write to disk. For large multimodal datasets, this meant waiting hours before a single training step ran.
Now you can start training immediately.
- `preprocessing_mode: lazy`: audio and image features are decoded on the fly during training, directly from raw file paths. No upfront pass. Training starts in seconds.
- `preprocessing_mode: lazy_cached`: decoded tensors are cached as memory-mapped files on the first pass through the data. Subsequent epochs hit the cache directly, with zero decode overhead after the first.
- `preprocessing_mode: eager`: the previous default, preserved for full backwards compatibility.
- `prefetch_size`: configurable prefetch queue depth for the background decoder thread, letting you tune the CPU/GPU overlap for your hardware.
- A background prefetch thread decodes ahead of the training loop, keeping the GPU fed without blocking the forward pass.
- Ray distributed training decodes lazy features inside the Ray data pipeline (not inside training actors), so decode work is properly parallelized across the cluster.
This matters most for audio and image datasets too large to preprocess in full — but it also makes iteration faster for any multimodal workload. (#4171, #4173)
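The background prefetch idea can be illustrated with a small, generic sketch. This is not Ludwig's implementation: the `prefetch` function, its message tuples, and the use of `str.upper` as a stand-in decoder are all assumptions made for illustration. It only shows how a bounded queue lets a decoder thread run ahead of the consumer by at most `prefetch_size` items.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def prefetch(items: Iterable[T], decode: Callable[[T], R], prefetch_size: int = 4) -> Iterator[R]:
    """Decode items on a background thread (hypothetical helper, not a Ludwig API).

    At most `prefetch_size` decoded results wait in the queue, so the decoder
    runs ahead of the consumer without unbounded memory growth.
    Use prefetch_size >= 1; queue.Queue treats 0 as "unbounded".
    """
    buffer: queue.Queue = queue.Queue(maxsize=prefetch_size)

    def worker() -> None:
        try:
            for item in items:
                # put() blocks while the queue is full, which throttles decoding.
                buffer.put(("item", decode(item)))
        except Exception as exc:
            # Forward decode failures to the consumer instead of dying silently.
            buffer.put(("error", exc))
        finally:
            buffer.put(("done", None))

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = buffer.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


# Stand-in "decoder": a real one would load and resample audio or resize images.
for decoded in prefetch(["a.wav", "b.wav", "c.wav"], str.upper, prefetch_size=2):
    print(decoded)
```

A single worker and a FIFO queue preserve input order. A larger `prefetch_size` trades memory for more slack when decode times vary.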
Mega-AutoML: Rebuilt from the Ground Up
The AutoML infrastructure in Ludwig has been completely overhauled. (#4168, #4169)
YAML search space: search spaces are now declared in YAML and loaded via `SearchSpace._from_specs()`. This makes it straightforward to define custom search spaces, share them across experiments, and version-control them alongside your configs.
Dataset quality analysis: before building a search space, Ludwig now profiles your dataset: size, class balance, modality distribution, output cardinality. The search space is then constructed with awareness of what the data actually looks like.
Dataset-size-aware hyperparameter caps: epoch counts and batch sizes are now automatically capped based on dataset size, preventing both under-training on large datasets and OOM on tiny ones. Transformer-based combiners (`cross_attention`, `perceiver`) get additional batch size caps to prevent GPU memory exhaustion.
Learning rate and stability fixes: transformer-based combiners now have a capped learning rate upper bound to prevent NaN loss during hyperopt on sensitive architectures.
`configs_from_dataframe` improvements: `default_epochs` is now correctly threaded through to `TrainerSpec`, and the default search space builder is properly imported and called.
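To make the profiling and capping ideas concrete, here is a minimal sketch of how a dataset profile could drive search-space limits. Everything in it is hypothetical: `DatasetProfile`, `profile_labels`, `cap_search_space`, and every threshold and cap value are illustrative assumptions, not Ludwig's actual names, logic, or numbers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DatasetProfile:
    """Hypothetical summary of a classification dataset."""
    num_rows: int
    num_classes: int
    imbalance_ratio: float  # count of most frequent class / count of least frequent class


def profile_labels(labels: Sequence[str]) -> DatasetProfile:
    """Profile dataset size, output cardinality, and class balance from the labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("cannot profile an empty dataset")
    return DatasetProfile(
        num_rows=len(labels),
        num_classes=len(counts),
        imbalance_ratio=max(counts.values()) / min(counts.values()),
    )


def cap_search_space(profile: DatasetProfile, uses_transformer_combiner: bool = False) -> dict:
    """Derive upper bounds for the search space from the profile.

    All tiers and values below are made up for illustration.
    """
    if profile.num_rows < 1_000:
        max_epochs, max_batch_size = 100, 32
    elif profile.num_rows < 100_000:
        max_epochs, max_batch_size = 50, 256
    else:
        max_epochs, max_batch_size = 10, 1024

    # A batch can never usefully exceed the dataset itself.
    max_batch_size = min(max_batch_size, profile.num_rows)
    max_learning_rate = 1e-2

    if uses_transformer_combiner:
        # Tighter limits for memory-hungry, LR-sensitive architectures.
        max_batch_size = max(1, max_batch_size // 2)
        max_learning_rate = 1e-3

    return {
        "max_epochs": max_epochs,
        "max_batch_size": max_batch_size,
        "max_learning_rate": max_learning_rate,
    }


profile = profile_labels(["cat"] * 6 + ["dog"] * 2)
print(profile)
print(cap_search_space(profile, uses_transformer_combiner=True))
```

The point of the design is that the caps are computed once from the data and then constrain every sampled trial, so hyperopt does not spend budget on configurations that are known to be wasteful or unstable for that dataset.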
HuggingFace Dataset Library — 145 Datasets
Ludwig now ships with a built-in library of HuggingFace datasets, usable directly from `ludwig.datasets`. This release adds:
- 94 datasets spanning text classification, NER, question answering, summarization, audio classification, image classification, multimodal tasks, and more.
- 51 datasets with custom loaders — for complex HF datasets that require non-trivial loading logic (custom splits, column renaming, label normalization, multi-config handling).
- ESC-50 (environmental audio classification), WikiANN (multilingual NER), GoEmotions (fine-grained emotion classification), New Yorker Caption Contest (multimodal humor).
These aren't just dataset references — each ships with a Ludwig config that maps columns to features, sets appropriate output types, and is smoke-tested end-to-end.
Refactoring
Visualize Package
The `visualize.py` module had grown to 4,144 lines. It's now split into a domain-scoped `visualize/` package, with submodules organized by visualization type (learning curves, confusion matrices, calibration, hyperopt, etc.). The CLI entrypoint and all public APIs are unchanged. (#4154)
Text Encoders
Removed a stale duplicate `text/encoders.py` that had diverged from the canonical encoder implementations. Deep nesting in several encoder classes has been flattened for readability. (#4159)
Bug Fixes
- LLM extra now requires `torch>=2.7`: the LLM extra (`pip install ludwig[llm]`) now pins `torch>=2.7`, which is required for quantization and Flash Attention 2 support in the current transformers stack. Non-OSError exceptions during pretrained model loading no longer trigger the retry loop. (fix)
- TabPFN v2 guard: Ludwig now raises a clear error at config validation time when a `tabpfn_v2` combiner is configured but the `tabpfn` package is not installed, instead of failing at model construction. (fix)
- Ray lazy decode placement: lazy audio/image features are now decoded inside the Ray data pipeline, before data reaches training actors. This keeps decode work off the critical path and avoids serializing decoded tensors across the Ray object store. A missing `lazy_audio_params` or `lazy_image_params` now emits a warning instead of being a silent no-op. (fix, fix)
- Smoke test stability: diversity retry logic for sorted classification datasets, media-aware shuffle buffer sizing, and per-modality buffer tuning to eliminate label collapse in small evaluation splits.
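The validation-time guard for optional dependencies follows a common pattern, sketched below. This is an illustration under assumptions, not Ludwig's code: `ConfigValidationError`, `REQUIRED_PACKAGES`, and `validate_combiner` are hypothetical names. The only real API used is the standard library's `importlib.util.find_spec`, which returns `None` when a top-level package is not installed.

```python
import importlib.util
from typing import Mapping, Optional


class ConfigValidationError(ValueError):
    """Hypothetical error type used only in this sketch."""


# Hypothetical mapping from combiner type to the optional package it needs.
REQUIRED_PACKAGES = {"tabpfn_v2": "tabpfn"}


def validate_combiner(config: Mapping, required_packages: Optional[Mapping[str, str]] = None) -> None:
    """Fail at validation time if the configured combiner needs a missing package."""
    if required_packages is None:
        required_packages = REQUIRED_PACKAGES

    combiner_type = config.get("combiner", {}).get("type")
    package = required_packages.get(combiner_type)

    # find_spec looks the package up without importing it.
    if package is not None and importlib.util.find_spec(package) is None:
        raise ConfigValidationError(
            f"Combiner '{combiner_type}' requires the '{package}' package, "
            f"which is not installed. Try: pip install {package}"
        )


# A combiner with no optional dependency passes validation.
validate_combiner({"combiner": {"type": "concat"}})
```

Checking before model construction means the user sees one actionable message up front rather than an import error after data loading and preprocessing have already run.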
CI
Distributed integration tests now run in 6 parallel groups (up from 1), cutting distributed test wall time by ~5×. Integration test groups are renamed to sequential letters for clarity. (#4172)
Installation
pip install ludwig==0.17.0
GitHub: https://github.com/ludwig-ai/ludwig
Docs: https://ludwig.ai
About ludwig
Low-code framework for building custom LLMs, neural networks, and other AI models