EleutherAI / Lm-Evaluation-Harness

v0.4.10 Breaking

This release includes 2 breaking changes that platform teams should review before upgrading.

Published 6mo AI Coding Tools

✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

Affected surfaces

breaking_upgrade

Summary

AI summary

Base package no longer installs model backends by default, requiring explicit installation.

Full changelog

Highlights

The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.

Breaking Change: Lightweight Core with Optional Backends

pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)

The core package no longer includes backends. Install them explicitly:

pip install lm_eval          # core only, no model backends
pip install lm_eval[hf]      # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm]    # vLLM backend
pip install lm_eval[api]     # API backends (OpenAI, Anthropic, etc.)
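
Because a bare install now ships without model backends, a script may only fail once it reaches the first backend import. The following is a minimal pre-flight sketch, assuming the [hf] extra provides the importable modules transformers, torch, and accelerate as listed above. The helper name missing_modules is hypothetical and is not part of lm_eval.

```python
from importlib.util import find_spec

# Modules the [hf] extra is described as installing (taken from the list above).
HF_MODULES = ["transformers", "torch", "accelerate"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment.

    Hypothetical helper for illustration; not part of lm_eval.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(HF_MODULES)
    if missing:
        print("Missing:", ", ".join(missing), "- try: pip install lm_eval[hf]")
    else:
        print("HuggingFace backend dependencies found.")
```

Running this check in CI before an evaluation job gives a clearer failure message than an ImportError raised partway through a run.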

Additional breaking change: Accessing model classes via attribute no longer works:

# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
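
For code that resolves a model class from a string, such as a plugin registry or a config loader, one way to stay on the supported path is to import the submodule explicitly before looking up the class. This is a sketch under that assumption. The helper load_class is hypothetical and not part of lm_eval; only the module path and class name come from the example above.

```python
import importlib


def load_class(module_path, class_name):
    """Import module_path explicitly, then look up class_name on that module.

    Hypothetical helper for illustration; not part of lm_eval.
    """
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


if __name__ == "__main__":
    try:
        # Equivalent to: from lm_eval.models.huggingface import HFLM
        HFLM = load_class("lm_eval.models.huggingface", "HFLM")
        print("Loaded", HFLM.__name__)
    except ImportError as exc:
        # Raised when lm_eval or the backend's dependencies are not installed.
        print("Backend unavailable:", exc)
```

The explicit import_module call plays the same role as the from-import that still works, so the lookup does not depend on lm_eval.models exposing its submodules as attributes.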

CLI Refactor

The CLI now uses explicit subcommands and supports YAML config files (#3440):

lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from YAML config
lm-eval ls tasks                               # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs

For backward compatibility, omitting run still works: lm-eval --model hf --tasks hellaswag

See lm-eval --help or the CLI documentation for details.
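
Automation that shells out to the CLI can move to the explicit subcommand form now rather than rely on the backward-compatible spelling. The sketch below builds the argument list using only the flags shown above. The helper build_run_command is hypothetical, and joining several tasks with commas for run is an assumption based on the validate example.

```python
import shlex


def build_run_command(model, tasks):
    """Build an argv list for `lm-eval run` from a model name and task names.

    Hypothetical helper; comma-joining tasks is assumed from the validate example.
    """
    return ["lm-eval", "run", "--model", model, "--tasks", ",".join(tasks)]


if __name__ == "__main__":
    cmd = build_run_command("hf", ["hellaswag", "arc_easy"])
    # Print the shell-safe form; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when it is later passed to subprocess.run.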

Other Improvements

Decoupled ContextSampler with new build_qa_turn helper (#3429)
Normalized gen_kwargs with truncation_side support for vLLM (#3509)

New Benchmarks & Tasks

PISA task by @HallerPatrick in #3412
SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
OpenAI Multilingual MMLU by @Helw150 in #3473
ULQA benchmark by @keramjan in #3340
IFEval in Spanish and Catalan by @juliafalcao in #3467
TruthfulQA-VA for Catalan by @sgs97ua in #3469
Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444

Model Support

Ministral-3 adapter (hf-mistral3) by @medhakimbedhief in #3487

Fixes & Improvements

Task Fixes

Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
Fixed gen_prefix delimiter handling in multiple-choice tasks by @baberabb in #3508
Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
Fixed a=0 as valid answer index in build_qa_turn by @ezylopx5 in #3488
Fixed fewshot_config not being applied to fewshot docs by @baberabb in #3461
Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
Fixed gsm8k_cot_llama target_delimiter issue by @baberabb in #3526
Updated LIBRA task utils by @bond005 in #3520
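
The a=0 fix listed above points at a class of bug worth recognizing in custom task code: index 0 is falsy in Python, so a truthiness check can silently treat the first answer choice as missing. The sketch below is a generic illustration of that pattern, not the harness's actual build_qa_turn code, and both function names are hypothetical.

```python
def answer_text_buggy(choices, a):
    """Truthiness check: wrongly treats answer index 0 as 'no answer'."""
    return choices[a] if a else None


def answer_text_fixed(choices, a):
    """Explicit None check: answer index 0 is accepted as valid."""
    return choices[a] if a is not None else None


if __name__ == "__main__":
    choices = ["Paris", "Rome", "Oslo"]
    print(answer_text_buggy(choices, 0))  # prints None
    print(answer_text_fixed(choices, 0))  # prints Paris
```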

Backend Fixes

Fixed vLLM off-by-one max_length error by @baberabb in #3503
Resolved deprecated vllm.transformers_utils.get_tokenizer import by @DarkLight1337 in #3482
Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
Removed deprecated AutoModelForVision2Seq by @baberabb in #3522
Fixed Anthropic chat model mapping by @lucafossen in #3453
Fixed bug preventing = sign in checkpoint names by @mrinaldi97 in #3517
Fixed pretty_print_task for external custom configs by @safikhanSoofiyani in #3436
Fixed CLI regressions by @fxmarty-amd in #3449
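
The checkpoint-name fix above is a reminder of how key=value argument strings can break when a value itself contains an equals sign. The sketch below shows one common remedy, splitting each pair on the first = only. It assumes arguments arrive as a comma-separated key=value string; the function parse_args_string is hypothetical, and the actual change in #3517 may differ.

```python
def parse_args_string(args_string):
    """Parse 'k1=v1,k2=v2' into a dict, splitting each pair on the first '=' only.

    Hypothetical helper for illustration; not the harness's own parser.
    """
    result = {}
    for pair in args_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing comma
        # maxsplit=1 keeps any later '=' characters inside the value
        key, value = pair.split("=", 1)
        result[key] = value
    return result


if __name__ == "__main__":
    print(parse_args_string("pretrained=ckpt/step=100,dtype=float16"))
```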

New Contributors

@safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
@lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
@Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
@ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
@juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
@medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
@ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
@keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
@bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
@mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
@wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10

Breaking Changes

pip install lm_eval no longer installs the HuggingFace/torch stack (and other model backends) by default; they must be installed explicitly via extras like [hf], [vllm], or [api].
Accessing model classes via attribute (e.g., `lm_eval.models.huggingface.HFLM`) now raises AttributeError and is no longer supported.

Breaking changes in other releases

v0.4.12 vLLM minimum version requirement bumped to 0.18
v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
v0.4.12 TaskManager.load() returns flat dict instead of nested structure