This release includes 2 breaking changes that platform teams should review before upgrading.

Published 4mo AI Coding Tools
✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

Affected surfaces

breaking_upgrade

Summary

AI summary

Base package no longer installs model backends by default, requiring explicit installation.

Full changelog

Highlights

The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.

Breaking Change: Lightweight Core with Optional Backends

pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)

The core package no longer includes backends. Install them explicitly:

pip install lm_eval          # core only, no model backends
pip install lm_eval[hf]      # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm]    # vLLM backend
pip install lm_eval[api]     # API backends (OpenAI, Anthropic, etc.)
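
Because the core package can now be installed without any backend, a script may want to check for a backend before selecting it. The following is a minimal sketch, not part of lm_eval: the helper names and the mapping from extras to importable modules are assumptions for illustration (the hf extra is listed above as pulling in transformers, and the vllm extra is assumed to provide a module named vllm).

```python
import importlib.util

# Assumed mapping from an extra name to one top-level module it installs.
# These pairs are illustrative, not taken from the package metadata.
EXTRA_TO_MODULE = {
    "hf": "transformers",
    "vllm": "vllm",
}


def backend_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def install_hint(extra: str) -> str:
    """Build the install command shown in the release notes for a given extra."""
    return f"pip install lm_eval[{extra}]"


def missing_extras(extra_to_module=EXTRA_TO_MODULE):
    """List the extras whose assumed module is not installed."""
    return [
        extra
        for extra, module_name in extra_to_module.items()
        if not backend_available(module_name)
    ]


if __name__ == "__main__":
    for extra in missing_extras():
        print(f"Backend '{extra}' not found. Try: {install_hint(extra)}")
```

Using find_spec avoids importing heavy packages such as torch just to test for their presence.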

Additional breaking change: accessing model classes through attribute access on the lm_eval.models package no longer works. Import the submodule explicitly instead:

# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
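
For code that looked up model classes dynamically through attribute chains, one possible migration is to import the submodule by name and then fetch the class from it. This is a sketch under that assumption; load_class is a hypothetical helper, not an lm_eval function.

```python
import importlib


def load_class(module_path: str, class_name: str):
    """Import a module explicitly by its dotted path, then return a class from it."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# With the hf extra installed, the path from the release notes would be used as:
# HFLM = load_class("lm_eval.models.huggingface", "HFLM")
```

The explicit import_module call is what makes this work: it loads the submodule directly rather than relying on the parent package to expose it as an attribute.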

CLI Refactor

The CLI now uses explicit subcommands and supports YAML config files (#3440):

lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from YAML config
lm-eval ls tasks                               # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs

Backward compatibility: omitting run still works, for example lm-eval --model hf --tasks hellaswag

See lm-eval --help or the CLI documentation for details.
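
Automation that shells out to the CLI can build the new subcommand form explicitly. The sketch below only assembles the argument list from the flags shown above; build_run_command is a hypothetical helper, and passing several tasks as one comma-separated value to run is an assumption based on the validate example.

```python
import shlex


def build_run_command(model: str, tasks: list[str]) -> list[str]:
    """Assemble an argument list for the explicit 'run' subcommand."""
    return ["lm-eval", "run", "--model", model, "--tasks", ",".join(tasks)]


if __name__ == "__main__":
    cmd = build_run_command("hf", ["hellaswag"])
    # Print the shell-quoted command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when it is later handed to subprocess.run.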

Other Improvements

  • Decoupled ContextSampler with new build_qa_turn helper (#3429)
  • Normalized gen_kwargs with truncation_side support for vLLM (#3509)

New Benchmarks & Tasks

  • PISA task by @HallerPatrick in #3412
  • SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
  • OpenAI Multilingual MMLU by @Helw150 in #3473
  • ULQA benchmark by @keramjan in #3340
  • IFEval in Spanish and Catalan by @juliafalcao in #3467
  • TruthfulQA-VA for Catalan by @sgs97ua in #3469
  • Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
  • NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444

Model Support

  • Ministral-3 adapter (hf-mistral3) by @medhakimbedhief in #3487

Fixes & Improvements

Task Fixes

  • Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
  • Fixed gen_prefix delimiter handling in multiple-choice tasks by @baberabb in #3508
  • Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
  • Fixed build_qa_turn to accept a=0 as a valid answer index by @ezylopx5 in #3488
  • Fixed fewshot_config not being applied to fewshot docs by @baberabb in #3461
  • Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
  • Fixed gsm8k_cot_llama target_delimiter issue by @baberabb in #3526
  • Updated LIBRA task utils by @bond005 in #3520

Backend Fixes

  • Fixed vLLM off-by-one max_length error by @baberabb in #3503
  • Resolved deprecated vllm.transformers_utils.get_tokenizer import by @DarkLight1337 in #3482
  • Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
  • Removed deprecated AutoModelForVision2Seq by @baberabb in #3522
  • Fixed Anthropic chat model mapping by @lucafossen in #3453
  • Fixed bug preventing = sign in checkpoint names by @mrinaldi97 in #3517
  • Fixed pretty_print_task for external custom configs by @safikhanSoofiyani in #3436
  • Fixed CLI regressions by @fxmarty-amd in #3449

New Contributors

  • @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
  • @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
  • @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
  • @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
  • @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
  • @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
  • @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
  • @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
  • @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
  • @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
  • @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10

Breaking Changes

  • pip install lm_eval no longer installs the HuggingFace/torch stack (and other model backends) by default; they must be installed explicitly via extras like [hf], [vllm], or [api].
  • Accessing model classes via attribute (e.g., `lm_eval.models.huggingface.HFLM`) now raises AttributeError and is no longer supported.

Related context

Earlier breaking changes

  • v0.4.12 vLLM minimum version requirement bumped to 0.18
  • v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
  • v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
  • v0.4.12 TaskManager.load() returns flat dict instead of nested structure
