This release includes 2 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Base package no longer installs model backends by default, requiring explicit installation.
Full changelog
Highlights
The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.
Breaking Change: Lightweight Core with Optional Backends
pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)
The core package no longer includes backends. Install them explicitly:
pip install lm_eval # core only, no model backends
pip install lm_eval[hf] # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm] # vLLM backend
pip install lm_eval[api] # API backends (OpenAI, Anthropic, etc.)
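Because a core-only install can now succeed without any backend, a quick pre-flight check can help catch a missing extra before an evaluation job starts. The following is a minimal sketch, not part of lm_eval: the helper names are hypothetical, the `hf` module list is taken from the comment above, and the `vllm` entry assumes that extra provides a top-level module named `vllm`. The `api` extra is left out because these notes do not say which modules it installs.

```python
from importlib.util import find_spec

# Extra name -> modules it is expected to provide. The hf entry follows the
# release notes; the vllm entry assumes the extra installs a module named "vllm".
EXTRA_MODULES = {
    "hf": ("transformers", "torch", "accelerate"),
    "vllm": ("vllm",),
}


def missing_modules(modules):
    """Return the top-level module names from `modules` that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


def backend_report(extras=EXTRA_MODULES):
    """Map each extra to the list of its modules that are not installed."""
    return {extra: missing_modules(mods) for extra, mods in extras.items()}


if __name__ == "__main__":
    for extra, missing in backend_report().items():
        if missing:
            print(f"[{extra}] missing {missing}: pip install 'lm_eval[{extra}]'")
        else:
            print(f"[{extra}] ok")
```

The check only looks modules up without importing them, so it stays fast even when heavy packages such as torch are present.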
Additional breaking change: Accessing model classes via attribute no longer works:
# This still works:
from lm_eval.models.huggingface import HFLM
# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
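In Python, a submodule only becomes an attribute of its parent package once something has imported it. A plausible reading of this change (an assumption, not stated in these notes) is that `lm_eval.models` no longer imports its backend submodules eagerly. Code that resolves model classes from strings can be migrated by importing the submodule explicitly first. A minimal sketch, where `load_class` is a hypothetical helper and not an lm_eval API:

```python
import importlib


def load_class(module_path, class_name):
    """Import `module_path` explicitly, then look up `class_name` on it."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Usage with lm_eval (needs the HuggingFace backend: pip install 'lm_eval[hf]'):
#   HFLM = load_class("lm_eval.models.huggingface", "HFLM")
```

This is equivalent to the `from lm_eval.models.huggingface import HFLM` form that still works, but it is convenient when the module path comes from configuration.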
CLI Refactor
The CLI now uses explicit subcommands and supports YAML config files (#3440):
lm-eval run --model hf --tasks hellaswag # run evaluations
lm-eval run --config my_config.yaml # load args from YAML config
lm-eval ls tasks # list available tasks
lm-eval validate --tasks hellaswag,arc_easy # validate task configs
Backward compatible: omitting `run` still works, e.g. `lm-eval --model hf --tasks hellaswag`.
See lm-eval --help or the CLI documentation for details.
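For scripts that launch evaluations programmatically, the subcommand form can be assembled as an argument list. This is a rough sketch using only the flags shown above; `build_eval_command` is a hypothetical helper, and joining several tasks with commas for `run` is an assumption based on the `validate` example.

```python
import shlex


def build_eval_command(model, tasks, subcommand=True):
    """Build an argv list for lm-eval.

    With subcommand=True the explicit `run` form is used; with False the
    legacy form without `run`, which the notes say still works.
    """
    argv = ["lm-eval"]
    if subcommand:
        argv.append("run")
    # Assumption: multiple tasks are passed as one comma-separated value.
    argv += ["--model", model, "--tasks", ",".join(tasks)]
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it; pass the list to
    # subprocess.run(argv, check=True) to actually launch the evaluation.
    print(shlex.join(build_eval_command("hf", ["hellaswag"])))
```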
Other Improvements
- Decoupled `ContextSampler` with new `build_qa_turn` helper (#3429)
- Normalized `gen_kwargs` with `truncation_side` support for vLLM (#3509)
New Benchmarks & Tasks
- PISA task by @HallerPatrick in #3412
- SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
- OpenAI Multilingual MMLU by @Helw150 in #3473
- ULQA benchmark by @keramjan in #3340
- IFEval in Spanish and Catalan by @juliafalcao in #3467
- TruthfulQA-VA for Catalan by @sgs97ua in #3469
- Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
- NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444
Model Support
- Ministral-3 adapter (`hf-mistral3`) by @medhakimbedhief in #3487
Fixes & Improvements
Task Fixes
- Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
- Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508
- Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
- Fixed `a=0` as valid answer index in `build_qa_turn` by @ezylopx5 in #3488
- Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461
- Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
- Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526
- Updated LIBRA task utils by @bond005 in #3520
Backend Fixes
- Fixed vLLM off-by-one `max_length` error by @baberabb in #3503
- Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482
- Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
- Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522
- Fixed Anthropic chat model mapping by @lucafossen in #3453
- Fixed bug preventing `=` sign in checkpoint names by @mrinaldi97 in #3517
- Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436
- Fixed CLI regressions by @fxmarty-amd in #3449
New Contributors
- @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
- @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
- @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
- @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
- @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
- @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
- @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
- @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
- @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
- @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
- @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10
Breaking Changes
- pip install lm_eval no longer installs the HuggingFace/torch stack (and other model backends) by default; they must be installed explicitly via extras like [hf], [vllm], or [api].
- Accessing model classes via attribute (e.g., `lm_eval.models.huggingface.HFLM`) now raises AttributeError and is no longer supported.