This release includes 3 breaking changes for platform teams planning a safe upgrade.

Published 23d AI Coding Tools
✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

ReleasePort's take

Moderate signal
editorial:auto 13d

TaskManager.load now returns a flat dictionary; SteeredHF is renamed to SteeredModel; enable_thinking cannot be used with multiple_choice or loglikelihood tasks.

Why it matters: Update imports for SteeredModel, adjust code expecting nested dicts from TaskManager.load, and remove enable_thinking flags in affected task configurations before upgrading to v0.4.12.

Summary

AI summary

SteeredHF renamed to SteeredModel, vLLM minimum bumped to >=0.18, and enable_thinking disallowed for multiple_choice/loglikelihood tasks.

Changes in this release

Breaking Medium

TaskManager.load() returns flat dict instead of nested structure

Source: llm_adapter@2026-05-21

Confidence: high

Breaking Medium

SteeredHF backend renamed to SteeredModel, update imports

Source: llm_adapter@2026-05-21

Confidence: high
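
Code that has to run against both the old and the new class name can resolve the class by name with a fallback. The following is a minimal sketch, not the harness's own mechanism: `resolve_steered_class` and the stand-in namespaces are hypothetical, and the module is passed in as a parameter because the notes do not state the import path.

```python
from types import SimpleNamespace


def resolve_steered_class(module):
    """Return SteeredModel if the module exposes it, else the legacy SteeredHF."""
    for name in ("SteeredModel", "SteeredHF"):
        cls = getattr(module, name, None)
        if cls is not None:
            return cls
    raise ImportError("neither SteeredModel nor SteeredHF found on the given module")


# Stand-ins for an old and a new install (hypothetical, for illustration only).
old_module = SimpleNamespace(SteeredHF=type("SteeredHF", (), {}))
new_module = SimpleNamespace(SteeredModel=type("SteeredModel", (), {}))

print(resolve_steered_class(old_module).__name__)
print(resolve_steered_class(new_module).__name__)
```

Once every environment is on v0.4.12 or later, a plain import of SteeredModel is simpler and the fallback can be dropped.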

Breaking Medium

enable_thinking now disallowed for multiple_choice and loglikelihood tasks

Source: llm_adapter@2026-05-21

Confidence: high

Breaking Medium

vLLM minimum version requirement bumped to 0.18

Source: llm_adapter@2026-05-21

Confidence: low

Feature High

Added Megatron-LM (megatron-lm) backend with TP/EP/DP support

Source: granite4.1:30b@2026-05-24-audit

Confidence: low

Feature Medium

Native Tensor Parallelism for transformers models via tp_plan

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

InfiniteBench long-context evaluation tasks beyond 100K tokens

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Intel Gaudi support via optimum-habana backend

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Trackio logger enables per-sample Trace logging

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

JFinQA Japanese Financial Numerical Reasoning QA benchmark added

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

CRUXEval Python code reasoning benchmark with multiple variants

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Toksuite multilingual tokenization robustness benchmark added

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

NEREL-bench Russian named-entity and relation-extraction benchmark added

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

TensorRT-LLM NVIDIA backend for optimized GPU inference

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

Megatron-LM backend with tensor/expert/data parallelism support

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

LiteLLM AI gateway backend supports 100+ providers

Source: llm_adapter@2026-05-21

Confidence: low

Deprecation Medium

ConfigurableGroup deprecated, use new Group class instead

Source: llm_adapter@2026-05-21

Confidence: low

Bugfix Medium

Fixed RACE doc_to_text keeping a blank marker and dropping the question body

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Fixed MMLU-Pro few-shot answers leaking into the user role under chat templates

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Fixed GPQA preprocessing regex that corrupted answer text containing brackets

Source: llm_adapter@2026-05-21

Confidence: high
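
The failure mode behind this fix is easy to reproduce in isolation. The sketch below assumes a bracket-stripping pattern of the form `\[.*?\]`; the notes do not quote the harness's actual regex, so the pattern and the `strip_brackets` helper are illustrative only.

```python
import re


def strip_brackets(text):
    # Assumed pattern: removes every "[...]" span, which is fine for
    # markup-like tags but destructive for mathematical notation.
    return re.sub(r"\[.*?\]", "", text)


answer = "The function is increasing on the interval [0, 1] only"
cleaned = strip_brackets(answer)

print(cleaned)
print("[0, 1]" in cleaned)
```

The interval disappears from the answer text, which changes what the model is scored against even though no error is raised.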

Bugfix Medium

Fixed HeadQA doc_to_decontamination_query referencing a nonexistent query field

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Fixed BigBench multiple-choice tasks crashing on mixed-format examples

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Arabic normalization and prompt loading correctness improvements

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Fixed --cache_requests always failing due to an argparse type/choices conflict

Source: llm_adapter@2026-05-21

Confidence: high
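
A type/choices conflict of this kind follows from how argparse works: it applies `type` first and then checks the converted value against `choices`. The sketch below shows the general pattern rather than the harness's actual code; the `to_flag` converter and the choice values are assumptions made for illustration.

```python
import argparse
import contextlib
import io


def accepts(parser, argv):
    """Return True if argv parses, False if argparse rejects it."""
    try:
        with contextlib.redirect_stderr(io.StringIO()):
            parser.parse_args(argv)
        return True
    except SystemExit:
        return False


def to_flag(value):
    # Hypothetical converter: turns the string "true" into the boolean True.
    return value.lower() == "true"


# The converted value (a bool) can never equal one of the string choices,
# so every invocation is rejected.
broken = argparse.ArgumentParser()
broken.add_argument("--cache_requests", type=to_flag, choices=["true", "refresh", "delete"])

# Keeping the value a string makes the choices check meaningful again.
fixed = argparse.ArgumentParser()
fixed.add_argument("--cache_requests", type=str, choices=["true", "refresh", "delete"])

print(accepts(broken, ["--cache_requests", "true"]))  # False
print(accepts(fixed, ["--cache_requests", "true"]))   # True
```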

Bugfix Medium

Fixed median aggregation returning an arbitrary element instead of the median

Source: llm_adapter@2026-05-21

Confidence: high
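
Results that used median aggregation before this fix may be worth re-running. The sketch below shows one way such a bug can arise, by indexing the middle of an unsorted list; this is an assumed illustration, not the harness's previous implementation.

```python
import statistics


def middle_element(values):
    # Assumed buggy pattern: picks the middle position without sorting,
    # so the result depends on input order.
    return values[len(values) // 2]


scores = [0.9, 0.1, 0.3, 0.5, 0.7]

print(middle_element(scores))      # 0.3
print(statistics.median(scores))   # 0.5
```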

Bugfix Medium

TruthfulQA-gen dataset_path corrected to valid location

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

NorEval/NorIdiom !function imports use absolute module paths

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Async generation skips caching None responses

Source: llm_adapter@2026-05-21

Confidence: high
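
The idea behind this change is that a failed request, represented as None, should not be stored as if it were a valid answer. A minimal sketch of that behavior, using a plain dict as the cache and a hypothetical `cache_responses` helper rather than the harness's own cache classes:

```python
def cache_responses(cache, keys, responses):
    """Store each response under its key, skipping failed (None) responses."""
    for key, response in zip(keys, responses):
        if response is None:
            # Leave the key uncached so a later run retries the request.
            continue
        cache[key] = response
    return cache


cache = cache_responses({}, ["req-1", "req-2", "req-3"], ["answer A", None, "answer C"])
print(sorted(cache))  # ['req-1', 'req-3']
```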

Bugfix Medium

Fixed vLLM tokenizer error for Mistral models

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Fixed french_bench_topic_based_nli doc_to_decontamination_query referencing nonexistent 'texte' field

Source: granite4.1:30b@2026-05-24-audit

Confidence: high

Bugfix Medium

Fixed IFEval RephraseChecker greedy-regex bug in strip_changes method

Source: granite4.1:30b@2026-05-24-audit

Confidence: high
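
The difference between a greedy and a non-greedy pattern is easiest to see on a string with two marked spans. The sketch below assumes asterisk-delimited change markers; the notes do not quote the actual pattern used in strip_changes, so both regexes are illustrative.

```python
import re

text = "keep *first edit* this part *second edit* too"

# Greedy: ".*" runs from the first asterisk to the last one,
# swallowing the unmarked text in between.
greedy = re.sub(r"\*.*\*", "", text)

# Non-greedy: ".*?" stops at the nearest closing asterisk,
# so each marked span is removed on its own.
non_greedy = re.sub(r"\*.*?\*", "", text)

print(greedy)
print(non_greedy)
```

With the greedy pattern the words "this part" are lost along with the two marked spans; the non-greedy pattern keeps them.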

Bugfix Medium

Resolved vLLM data-parallel with Ray issues; pinned vLLM >=0.18 and removed MP distribution

Source: granite4.1:30b@2026-05-24-audit

Confidence: high

Bugfix Medium

Fixed IFEval RephraseChecker greedy-regex bug that caused incorrect matching

Source: llm_adapter@2026-05-21

Confidence: low

Bugfix Medium

vLLM data-parallel with Ray fixes and improvements

Source: llm_adapter@2026-05-21

Confidence: low

Refactor Low

ConfigurableGroup deprecated; new Group class directly holds child tasks

Source: granite4.1:30b@2026-05-24-audit

Confidence: low

Full changelog

New release with four new model backends, tensor-parallel support for transformers-based models (hf), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.

Highlights

New Model Backends

  • TensorRT-LLM (trt-llm) — NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
  • Megatron-LM (megatron-lm) — Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
  • Intel Gaudi — Gaudi support via optimum-habana by @12010486 in #3550
  • LiteLLM AI gateway (litellm) — Use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
  • Native Tensor Parallelism for HF backend — multi-GPU TP for transformers models via tp_plan by @YangKai0616 in #3692

TaskManager Refactor (#3549)

  • TaskManager.load(...) returns a flat {tasks, groups} dict instead of the legacy nested {ConfigurableGroup: {name: Task}}. evaluate() accepts both shapes; load_task_or_group(...) and get_task_dict(...) are deprecated shims that return the old shape.
  • New Group class directly holds its child tasks; ConfigurableGroup is now a deprecated wrapper around it.
  • Duplicate task/group configs within the same root are skipped with a log message instead of silently overwritten. (Custom include_path entries still override defaults.)
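
Downstream code that walks the loaded result can be made tolerant of both shapes during the migration. The following is a rough sketch under stated assumptions: the flat shape is taken to be `{"tasks": {name: task}, "groups": {name: group}}`, the legacy shape is taken to key children by a group object, and `normalize_loaded`, `LegacyGroup`, and its `group_name` attribute are hypothetical stand-ins rather than harness APIs.

```python
def normalize_loaded(result):
    """Return (tasks, groups) dicts from either the flat or the legacy nested shape."""
    if isinstance(result.get("tasks"), dict) and isinstance(result.get("groups"), dict):
        # Flat shape (assumed layout): already split into tasks and groups.
        return dict(result["tasks"]), dict(result["groups"])

    tasks, groups = {}, {}

    def walk(node):
        for key, value in node.items():
            if isinstance(value, dict):
                # Legacy shape: the key is a group object, the value its children.
                groups[getattr(key, "group_name", str(key))] = key
                walk(value)
            else:
                tasks[str(key)] = value

    walk(result)
    return tasks, groups


class LegacyGroup:
    """Hypothetical stand-in for a legacy group key."""

    def __init__(self, group_name):
        self.group_name = group_name


mmlu = LegacyGroup("mmlu")
nested = {mmlu: {"mmlu_abstract_algebra": "task-a"}, "arc_easy": "task-b"}
flat = {
    "tasks": {"mmlu_abstract_algebra": "task-a", "arc_easy": "task-b"},
    "groups": {"mmlu": mmlu},
}

print(normalize_loaded(nested) == normalize_loaded(flat))  # True
```

Since evaluate() accepts both shapes, a helper like this is only needed where your own code inspects the returned structure directly.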

Breaking Changes

  • SteeredHF renamed to SteeredModel — update imports if you're using the steering backend by @adrian-sauter in #3592
  • vLLM minimum bumped to >=0.18 as part of the data-parallel-with-Ray fixes by @baberabb in #3725
  • enable_thinking is now disallowed for multiple_choice / loglikelihood tasks, and think_end_token is now required when enable_thinking=True. Configurations that combined these previously failed silently. By @fxmarty-amd in #3675
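
Before upgrading, existing run configurations can be scanned for the combinations that now raise. The sketch below encodes only the two rules stated above; the `thinking_config_errors` helper and the assumption that both options arrive as entries in a model-args dict are illustrative, not the harness's validation code.

```python
# Output types for which enable_thinking is rejected, per the release notes.
LOGPROB_OUTPUT_TYPES = {"multiple_choice", "loglikelihood"}


def thinking_config_errors(output_type, model_args):
    """Return a list of problems for one task/model-args combination."""
    if not model_args.get("enable_thinking"):
        return []
    errors = []
    if output_type in LOGPROB_OUTPUT_TYPES:
        errors.append(f"enable_thinking is not allowed for output_type={output_type!r}")
    if not model_args.get("think_end_token"):
        errors.append("think_end_token is required when enable_thinking=True")
    return errors


print(thinking_config_errors("multiple_choice", {"enable_thinking": True}))
print(thinking_config_errors(
    "generate_until", {"enable_thinking": True, "think_end_token": "</think>"}
))
```

The first call reports both problems and the second returns an empty list. The `</think>` value is only an example token; use whatever end-of-thinking marker your model emits.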

New Logger

  • Trackio logger with per-sample Trace logging by @abidlabs in #3733

New Benchmarks & Tasks

  • InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
  • CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
  • Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
  • NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
  • JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570

Fixes & Improvements

Task Fixes

  • Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
  • Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
  • Fixed RACE doc_to_text keeping a blank marker and dropping the question body by @Chessing234 in #3716
  • Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
  • Fixed HeadQA doc_to_decontamination_query pointing at a nonexistent query field by @Chessing234 in #3718
  • Fixed french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent texte field by @Chessing234 in #3719
  • Fixed TruthfulQA-gen dataset_path by @zhngstl in #3723
  • Fixed NorEval/NorIdiom !function imports to use absolute module paths by @Anai-Guo in #3731
  • Fixed IFEval RephraseChecker.strip_changes greedy-regex bug by @Chessing234 in #3737
  • Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
  • Updated BLiMP dataset path by @jmichaelov in #3596
  • Replaced all references to the CohereForAI org with CohereLabs by @juliafalcao in #3631

What's Changed

  • refactor(Taskmanager)! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3549
  • fix(cli): --cache_requests always fails due to argparse type/choices conflict by @maxidl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
  • feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
  • Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in https://github.com/EleutherAI/lm-evaluation-harness/pull/3601
  • [fix] Add missing tokenization progress bar by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3605
  • fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
  • fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3607
  • Update vLLM import of resolve_hf_chat_template by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3595
  • Add docstring for HFLM init keyword arguments by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
  • Update all mentions of the CohereForAI organization to CohereLabs by @juliafalcao in https://github.com/EleutherAI/lm-evaluation-harness/pull/3631
  • Skip caching None responses in async generation path by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3633
  • Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
  • fix(evaluate tests) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3634
  • fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
  • chore(ci-updates) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3635
  • Update BLiMP dataset path by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3596
  • Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
  • Rename SteeredHF to SteeredModel in lm_eval/models/__init__.py by @adrian-sauter in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
  • fix: Update WatsonxLLM class mapping and errors by @Rafal-Chrzanowski-IBM in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
  • Add Intel Gaudi support by @12010486 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
  • [fix] Disallow enable_thinking with output_type: multiple_choice tasks / loglikelihood tasks; raise error in case think_end_token is not provided with enable_thinking=True by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3675
  • fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3725
  • refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3728
  • fix(vllm): fix vllm tokenizer for Mistral; rm default gpu_memory_utilization=0.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3732
  • Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
  • Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3724
  • fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3737
  • fix: bound request cache filename length by @princepal9120 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
  • fix codeowners by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3738
  • Fix dataset_path for truthfulqa_gen by @zhngstl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
  • fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
  • Add Trackio logger with per-sample Trace logging by @abidlabs in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
  • Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3718
  • Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3719
  • fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
  • Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3711
  • Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3708
  • fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3742
  • Fix zeno_visualize discarding tasks intersection result by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3739
  • fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
  • feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
  • Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3716
  • Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3702
  • Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
  • Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
  • fix(mmlu_pro_plus): sync fixes from mmlu_pro by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3747
  • chore: clean up deps; fix ci lint by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3748
  • Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3714
  • Fix median aggregation returning arbitrary element instead of median by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3696
  • fix(api): chat payload leaking top-level text type by @felixmr1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
  • [BUGFIX] Consistent handling of None answers and cache by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3656
  • Adding Cruxeval by @ThomasHeap in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
  • [Task] NEREL-bench by @bond005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3650
  • Added Toksuite Benchmark by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3669
  • Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
  • fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
  • feat: add TRT-LLM backend. by @Tracin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
  • [Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
  • feat(release): 0.4.12 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3763

New Contributors

  • @maxidl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
  • @shangxiaokang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
  • @ManasVardhan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
  • @joshuaswanson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
  • @RinZ27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
  • @s-zx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
  • @ajtgjmdjp made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
  • @adrian-sauter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
  • @Rafal-Chrzanowski-IBM made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
  • @12010486 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
  • @Chessing234 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
  • @princepal9120 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
  • @zhngstl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
  • @FazeelUsmani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
  • @abidlabs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
  • @Anai-Guo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
  • @jwmacd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
  • @RheagalFire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
  • @Robby955 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
  • @kiwaku made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
  • @felixmr1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
  • @ThomasHeap made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
  • @siddhant-rajhans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
  • @nevertmr made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
  • @Tracin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
  • @YangKai0616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.11...v0.4.12

Breaking Changes

  • SteeredHF renamed to SteeredModel — update imports accordingly
  • Minimum vLLM version bumped to >=0.18 due to data-parallel with Ray fixes
  • `enable_thinking` is now disallowed for `multiple_choice` and `loglikelihood` tasks; `think_end_token` becomes required when `enable_thinking=True`
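
A quick environment check for the vLLM minimum can run before the upgrade. This is a minimal sketch using only the standard library: `release_tuple`, `meets_minimum`, and `installed_vllm_ok` are hypothetical helpers, and the parser looks only at leading numeric components, so a pre-release such as `0.18.0rc1` is treated as meeting the minimum.

```python
from importlib import metadata


def release_tuple(version):
    """Parse the leading numeric components, e.g. '0.18.1rc1' -> (0, 18, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.18"):
    """Compare versions numerically, so that '0.9' sorts below '0.18'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_vllm_ok():
    """Return True/False for the installed vLLM, or None if it is not installed."""
    try:
        return meets_minimum(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return None


print(meets_minimum("0.18.0"))  # True
print(meets_minimum("0.9.2"))   # False
print(installed_vllm_ok())
```

Comparing parsed integers rather than raw strings matters here, because a plain string comparison would rank "0.9.2" above "0.18".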
