This release includes four breaking changes that platform teams should review before upgrading.
✓ No known CVEs patched in this version
ReleasePort's take
Moderate signal: `TaskManager.load` now returns a flat dictionary; `SteeredHF` is renamed to `SteeredModel`; `enable_thinking` cannot be used with `multiple_choice` or loglikelihood tasks.
Why it matters: Update imports for SteeredModel, adjust code expecting nested dicts from TaskManager.load, and remove enable_thinking flags in affected task configurations before upgrading to v0.4.12.
Summary
AI summary: `SteeredHF` renamed to `SteeredModel`, vLLM minimum bumped to `>=0.18`, and `enable_thinking` disallowed for `multiple_choice`/loglikelihood tasks.
Changes in this release
| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Breaking | Medium | `TaskManager.load()` returns flat dict instead of nested structure | llm_adapter@2026-05-21 | high | None |
| Breaking | Medium | `SteeredHF` backend renamed to `SteeredModel`, update imports | llm_adapter@2026-05-21 | high | None |
| Breaking | Medium | `enable_thinking` now disallowed for `multiple_choice` and loglikelihood tasks | llm_adapter@2026-05-21 | high | None |
| Breaking | Medium | vLLM minimum version requirement bumped to 0.18 | llm_adapter@2026-05-21 | low | None |
| Feature | High | Added Megatron-LM (`megatron-lm`) backend with TP/EP/DP support | granite4.1:30b@2026-05-24-audit | low | None |
| Feature | Medium | Native Tensor Parallelism for transformers models via `tp_plan` | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | InfiniteBench long-context evaluation tasks beyond 100K tokens | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Intel Gaudi support via `optimum-habana` backend | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Trackio logger enables per-sample Trace logging | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | JFinQA Japanese Financial Numerical Reasoning QA benchmark added | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | CRUXEval Python code reasoning benchmark with multiple variants | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Toksuite multilingual tokenization robustness benchmark added | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | NEREL-bench Russian named-entity and relation-extraction benchmark added | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | TensorRT-LLM NVIDIA backend for optimized GPU inference | llm_adapter@2026-05-21 | low | None |
| Feature | Medium | Megatron-LM backend with tensor/expert/data parallelism support | llm_adapter@2026-05-21 | low | None |
| Feature | Medium | LiteLLM AI gateway backend supports 100+ providers | llm_adapter@2026-05-21 | low | None |
| Deprecation | Medium | `ConfigurableGroup` deprecated, use new `Group` class instead | llm_adapter@2026-05-21 | low | None |
| Bugfix | Medium | RACE `doc_to_text` keeps blank marker, drops question body | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | MMLU-Pro fewshot answers leak into user role in chat templates | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | GPQA preprocessing regex corrupts answer text with brackets | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | HeadQA `doc_to_decontamination_query` references nonexistent `query` field | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | BigBench multiple-choice tasks crash on mixed-format examples | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | Arabic normalization and prompt loading correctness improvements | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | `cache_requests` always fails due to argparse type conflict | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | Median aggregation returns arbitrary element instead of median | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | TruthfulQA-gen `dataset_path` corrected to valid location | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | NorEval/NorIdiom `!function` imports use absolute module paths | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | Async generation skips caching None responses | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | vLLM Mistral tokenizer error fixed and improved | llm_adapter@2026-05-21 | high | None |
| Bugfix | Medium | Fixed french_bench_topic_based_nli `doc_to_decontamination_query` referencing nonexistent `texte` field | granite4.1:30b@2026-05-24-audit | high | None |
| Bugfix | Medium | Fixed IFEval `RephraseChecker` greedy-regex bug in `strip_changes` method | granite4.1:30b@2026-05-24-audit | high | None |
| Bugfix | Medium | Resolved vLLM data-parallel with Ray issues; pinned vLLM >=0.18 and removed MP distribution | granite4.1:30b@2026-05-24-audit | high | None |
| Bugfix | Medium | IFEval `RephraseChecker` greedy-regex bug causes incorrect matching | llm_adapter@2026-05-21 | low | None |
| Bugfix | Medium | vLLM data-parallel with Ray fixes and improvements | llm_adapter@2026-05-21 | low | None |
| Refactor | Low | `ConfigurableGroup` deprecated; new `Group` class directly holds child tasks | granite4.1:30b@2026-05-24-audit | low | None |
Full changelog
New release with four new model backends, tensor-parallel support for transformers-based models (`hf`), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.
Highlights
New Model Backends
- TensorRT-LLM (`trt-llm`): NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
- Megatron-LM (`megatron-lm`): Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
- Intel Gaudi: Gaudi support via `optimum-habana` by @12010486 in #3550
- LiteLLM AI gateway (`litellm`): use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
- Native Tensor Parallelism for HF backend: multi-GPU TP for `transformers` models via `tp_plan` by @YangKai0616 in #3692
TaskManager Refactor (#3549)
- `TaskManager.load(...)` returns a flat `{tasks, groups}` dict instead of the legacy nested `{ConfigurableGroup: {name: Task}}`. `evaluate()` accepts both shapes; `load_task_or_group(...)` and `get_task_dict(...)` are deprecated shims that return the old shape.
- New `Group` class directly holds its child tasks; `ConfigurableGroup` is now a deprecated wrapper around it.
- Duplicate task/group configs within the same root are skipped with a log message instead of being silently overwritten. (Custom `include_path` entries still override defaults.)
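Code that still walks the legacy nested mapping can normalize both shapes up front. The sketch below works on plain dictionaries only: the helper name `to_flat` is hypothetical, strings stand in for real task objects, and the assumption that the flat result carries exactly the keys `"tasks"` and `"groups"` is read off the `{tasks, groups}` notation in these notes, not from the library source. It also does not handle groups nested inside groups.

```python
from typing import Any, Dict


def to_flat(loaded: Dict[Any, Any]) -> Dict[str, Dict[Any, Any]]:
    """Return a {"tasks": ..., "groups": ...} dict for either input shape.

    Hypothetical helper for illustration; not part of lm-evaluation-harness.
    """
    # Assumption: the flat shape is recognizable by its two top-level keys.
    if set(loaded) == {"tasks", "groups"}:
        return {"tasks": dict(loaded["tasks"]), "groups": dict(loaded["groups"])}

    tasks: Dict[Any, Any] = {}
    groups: Dict[Any, Any] = {}
    for key, value in loaded.items():
        if isinstance(value, dict):
            # Legacy nested entry: a group key mapping to {name: task}.
            groups[key] = list(value)  # record child names (a choice made for this sketch)
            tasks.update(value)
        else:
            # Assumption: an ungrouped entry maps a name straight to a task.
            tasks[key] = value
    return {"tasks": tasks, "groups": groups}


legacy = {"my_group": {"task_a": "TaskA", "task_b": "TaskB"}, "task_c": "TaskC"}
flat = to_flat(legacy)
print(flat["groups"])  # {'my_group': ['task_a', 'task_b']}
```

Passing an already-flat dict through `to_flat` returns an equal copy, so callers can apply it unconditionally while migrating.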
Breaking Changes
- `SteeredHF` renamed to `SteeredModel`: update imports if you're using the steering backend, by @adrian-sauter in #3592
- vLLM minimum bumped to `>=0.18` as part of the data-parallel-with-Ray fixes by @baberabb in #3725
- `enable_thinking` is now disallowed for `multiple_choice` / loglikelihood tasks, and `think_end_token` is now required when `enable_thinking=True`. Configurations that combined these previously failed silently. By @fxmarty-amd in #3675
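The two `enable_thinking` rules can be checked against your own configurations before upgrading. This is a minimal sketch, not the library's validation code: the function name and its arguments are hypothetical, and the notes do not say where these settings live in a real configuration or whether other loglikelihood-style output types are also covered.

```python
from typing import List, Optional

# Output types the release notes name as incompatible with enable_thinking.
BLOCKED_OUTPUT_TYPES = {"multiple_choice", "loglikelihood"}


def thinking_config_problems(
    output_type: str,
    enable_thinking: bool,
    think_end_token: Optional[str] = None,
) -> List[str]:
    """List the reasons a configuration would be rejected (empty if none)."""
    problems: List[str] = []
    if not enable_thinking:
        # Neither rule applies when thinking is off.
        return problems
    if output_type in BLOCKED_OUTPUT_TYPES:
        problems.append(
            f"enable_thinking is not allowed with output_type={output_type!r}"
        )
    if not think_end_token:
        problems.append("think_end_token is required when enable_thinking=True")
    return problems


# "</think>" is only an example token; use whatever your model emits.
print(thinking_config_problems("multiple_choice", True))
print(thinking_config_problems("generate_until", True, "</think>"))
```

The first call reports both problems; the second returns an empty list.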
New Logger
- Trackio logger with per-sample `Trace` logging by @abidlabs in #3733
New Benchmarks & Tasks
- InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
- CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
- Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
- NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
- JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570
Fixes & Improvements
Task Fixes
- Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
- Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
- Fixed RACE `doc_to_text` keeping a blank marker and dropping the question body by @Chessing234 in #3716
- Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
- Fixed HeadQA `doc_to_decontamination_query` pointing at a nonexistent `query` field by @Chessing234 in #3718
- Fixed french_bench_topic_based_nli `doc_to_decontamination_query` pointing at a nonexistent `texte` field by @Chessing234 in #3719
- Fixed TruthfulQA-gen `dataset_path` by @zhngstl in #3723
- Fixed NorEval/NorIdiom `!function` imports to use absolute module paths by @Anai-Guo in #3731
- Fixed IFEval `RephraseChecker.strip_changes` greedy-regex bug by @Chessing234 in #3737
- Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
- Updated BLiMP dataset path by @jmichaelov in #3596
- Replaced all references to the `CohereForAI` org with `CohereLabs` by @juliafalcao in #3631
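The greedy-regex class of bug mentioned for IFEval is easy to reproduce in isolation. The snippet below is a generic illustration, not the pattern used in `RephraseChecker.strip_changes`: the asterisk markers and the sample text are assumptions chosen for the demo.

```python
import re

text = "keep *one* and *two* end"

# Greedy: `.*` runs to the LAST asterisk, swallowing the text in between.
greedy = re.sub(r"\*.*\*", "", text)

# Non-greedy: `.*?` stops at the NEXT asterisk, so each marked span
# is removed on its own.
non_greedy = re.sub(r"\*.*?\*", "", text)

print(repr(greedy))      # 'keep  end'
print(repr(non_greedy))  # 'keep  and  end'
```

With more than one marked span, the greedy pattern also deletes the unmarked text between them, which is the kind of incorrect matching a non-greedy quantifier avoids.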
What's Changed
- refactor(Taskmanager)! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3549
- fix(cli): `--cache_requests` always fails due to argparse `type`/`choices` conflict by @maxidl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
- feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
- Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in https://github.com/EleutherAI/lm-evaluation-harness/pull/3601
- [fix] Add missing tokenization progress bar by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3605
- fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
- fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3607
- Update vLLM import of `resolve_hf_chat_template` by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3595
- Add docstring for HFLM init keyword arguments by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
- Update all mentions of the `CohereForAI` organization to `CohereLabs` by @juliafalcao in https://github.com/EleutherAI/lm-evaluation-harness/pull/3631
- Skip caching None responses in async generation path by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3633
- Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
- fix(evaluate tests) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3634
- fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
- chore(ci-updates) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3635
- Update BLiMP dataset path by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3596
- Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
- Rename SteeredHF to SteeredModel in `lm_eval/models/__init__.py` by @adrian-sauter in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
- fix: Update `WatsonxLLM` class mapping and errors by @Rafal-Chrzanowski-IBM in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
- Add Intel Gaudi support by @12010486 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
- [fix] Disallow `enable_thinking` with `output_type: multiple_choice` tasks / loglikelihood tasks; raise error in case `think_end_token` is not provided with `enable_thinking=True` by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3675
- fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3725
- refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3728
- fix(vllm): fix vllm tokenizer for Mistral; rm default `gpu_memory_utilization=0.9` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3732
- Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
- Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3724
- fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3737
- fix: bound request cache filename length by @princepal9120 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
- fix codeowners by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3738
- Fix dataset_path for truthfulqa_gen by @zhngstl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
- fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
- Add Trackio logger with per-sample Trace logging by @abidlabs in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
- Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3718
- Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3719
- fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
- Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3711
- Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3708
- fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3742
- Fix zeno_visualize discarding tasks intersection result by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3739
- fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
- feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
- Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3716
- Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3702
- Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
- Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
- fix(mmlu_pro_plus): sync fixes from `mmlu_pro` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3747
- chore: cleap up deps; fix ci lint by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3748
- Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3714
- Fix median aggregation returning arbitrary element instead of median by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3696
- fix(api): chat payload leaking top-level text type by @felixmr1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
- [BUGFIX] Consistent handling of None answers and cache by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3656
- Adding Cruxeval by @ThomasHeap in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
- [Task] NEREL-bench by @bond005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3650
- Added Toksuite Benchmark by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3669
- Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
- fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
- feat: add TRT-LLM backend. by @Tracin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
- [Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
- feat(release): 0.4.12 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3763
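The median aggregation fix in the list above matters for any metric reported as a median. The changelog does not say how the old code picked its element, so the "middle index of an unsorted list" below is only one illustrative failure mode, compared against Python's standard `statistics.median`.

```python
import statistics

scores = [9, 1, 5, 7]

# Taking the middle index of an unsorted list returns whatever happens to sit there.
middle_of_unsorted = scores[len(scores) // 2]

# A median sorts first; for an even count it averages the two central values.
true_median = statistics.median(scores)

print(middle_of_unsorted)  # 5
print(true_median)         # 6.0
```

If you compare median-aggregated scores across harness versions, rerun the older evaluation before treating a difference as a model change.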
New Contributors
- @maxidl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
- @shangxiaokang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
- @ManasVardhan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
- @joshuaswanson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
- @RinZ27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
- @s-zx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
- @ajtgjmdjp made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
- @adrian-sauter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
- @Rafal-Chrzanowski-IBM made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
- @12010486 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
- @Chessing234 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
- @princepal9120 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
- @zhngstl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
- @FazeelUsmani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
- @abidlabs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
- @Anai-Guo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
- @jwmacd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
- @RheagalFire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
- @Robby955 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
- @kiwaku made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
- @felixmr1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
- @ThomasHeap made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
- @siddhant-rajhans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
- @nevertmr made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
- @Tracin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
- @YangKai0616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.11...v0.4.12
Breaking Changes
- SteeredHF renamed to SteeredModel — update imports accordingly
- Minimum vLLM version bumped to >=0.18 due to data‑parallel with Ray fixes
- `enable_thinking` is now disallowed for `multiple_choice` and loglikelihood tasks; `think_end_token` becomes required when `enable_thinking=True`
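A pre-upgrade check for the vLLM minimum might look like the sketch below. It is a hand-rolled comparison under stated assumptions: the helper names are hypothetical, only the leading numeric components are compared (so a pre-release such as `0.18.0rc1` is treated as `0.18.0`), and `packaging.version` is the more robust choice where it is available. The point of parsing to integers is that a plain string comparison would wrongly rank `0.9` above `0.18`.

```python
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '0.18.1' -> (0, 18, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.18") -> bool:
    """Compare numerically, padding with zeros so '0.18' equals '0.18.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_vllm_version() -> Optional[str]:
    """Return the installed vLLM version, or None if vLLM is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


found = installed_vllm_version()
if found is None:
    print("vLLM not installed; the minimum only matters for the vLLM backend")
else:
    print(found, "ok" if meets_minimum(found) else "too old for >=0.18")
```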