EleutherAI / Lm-Evaluation-Harness

v0.4.9.2 Breaking

This release includes one breaking change that platform teams should review before upgrading.

Published 8mo AI Coding Tools

✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

Affected surfaces

breaking_upgrade

Summary

AI summary

Python 3.10 is now the minimum required runtime version.

Full changelog

This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.

New Benchmarks & Tasks

A big wave of new evaluation tasks this release:

AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
GraphWalks by @jannalulu in #3377
ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
Icelandic WinoGrande by @jmichaelov in #3277
CLIcK Korean benchmark by @shing100 in #3173
MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
EQBench in Spanish and Catalan by @priverabsc in #3168
Anthropic discrim-eval by @Helw150 in #3091
XNLI-VA by @FranValero97 in #3194
Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
HumanEval infilling by @its-alpesh in #3299
CNN-DailyMail 3.0.0 by @preordinary in #3426
Global PIQA and new acc_norm_bytes metric by @baberabb in #3368
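
The acc_norm_bytes metric listed above can be illustrated with a small sketch. It assumes, as the name suggests, that each choice's log-likelihood is divided by the choice's UTF-8 byte length before the best choice is picked; the function name `pick_choice` and the sample numbers are hypothetical and not taken from the harness.

```python
def pick_choice(loglikelihoods, choices, unit="bytes"):
    """Return the index of the choice with the best length-normalized score.

    Assumption: "bytes" normalizes by UTF-8 byte length, "chars" by
    Python string length. Choices are assumed to be non-empty strings.
    """
    def length(text):
        return len(text.encode("utf-8")) if unit == "bytes" else len(text)

    scores = [ll / length(choice) for ll, choice in zip(loglikelihoods, choices)]
    # argmax over the normalized scores
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative numbers only: "já" is 2 characters but 3 UTF-8 bytes.
choices = ["já", "nei"]
loglikelihoods = [-3.0, -3.3]

by_bytes = pick_choice(loglikelihoods, choices, unit="bytes")
by_chars = pick_choice(loglikelihoods, choices, unit="chars")
```

With non-ASCII text the two normalizations can disagree, which is one plausible reason to offer a byte-based variant for multilingual tasks such as Global PIQA.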

Fixes & Improvements

Core Changes:

Python 3.10 minimum by @jannalulu in #3337
Unpinned datasets library by @baberabb in #3316
BOS token handling: Delegate to tokenizer; add_bos_token now defaults to None by @baberabb in #3347
Renamed LOGLEVEL env var to LMEVAL_LOG_LEVEL to avoid conflicts by @fxmarty-amd in #3418
Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
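
For the LOGLEVEL rename above, a launcher script could carry an existing setting over to the new name. This is a minimal sketch that assumes the old variable is no longer read by the harness after the upgrade; the helper `migrate_log_level` is hypothetical and only the two variable names come from the changelog.

```python
import os


def migrate_log_level(env=None):
    """Copy LOGLEVEL to LMEVAL_LOG_LEVEL when only the old name is set.

    Hypothetical helper: pass a dict to test it, or nothing to act on
    the real process environment. An explicit LMEVAL_LOG_LEVEL wins.
    """
    env = os.environ if env is None else env
    old_value = env.get("LOGLEVEL")
    if old_value is not None and "LMEVAL_LOG_LEVEL" not in env:
        env["LMEVAL_LOG_LEVEL"] = old_value
    return env.get("LMEVAL_LOG_LEVEL")


# Example with a plain dict standing in for the environment.
example_env = {"LOGLEVEL": "DEBUG"}
resolved = migrate_log_level(example_env)
```

Calling `migrate_log_level()` with no argument before starting an evaluation would apply the same logic to `os.environ`.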

Task Fixes:

Fixed MMLU-Redux to exclude samples without error_type="ok" and display summary table by @fxmarty-amd in #3410, #3406
Fixed AIME answer extraction by @jannalulu in #3353
Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
Fixed crows_pairs dataset by @jannalulu in #3378
Fixed Gemma tokenizer add_bos_token not updating by @DarkLight1337 in #3206
Fixed lambada_multilingual_stablelm by @jmichaelov, @HallerPatrick in #3294, #3222
Fixed CodeXGLUE by @gsaltintas in #3238
Pinned correct MMLUSR version by @christinaexyou in #3350
Updated minerva_math by @baberabb in #3259

Backend Fixes:

Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
Fixed vLLM data_parallel_size>1 issue by @Dornavineeth in #3303
Resolved deprecated vllm.utils.get_open_port by @DarkLight1337 in #3398
Fixed GPT series model bugs by @zinccat in #3348
Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
Fixed additional_config parsing by @brian-dellabetta in #3393
Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
Fixed no-output error handling by @Oseltamivir in #3395
Replaced deprecated torch_dtype with dtype by @AbdulmalikDS in #3415
Fixed custom task config reading by @SkyR0ver in #3425
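
The PIL image hashing fix above addresses a general pitfall: an object's default repr usually embeds its memory address, so two images with identical pixels hash differently. The sketch below shows the idea with a stand-in class rather than Pillow itself; `FakeImage` and both hash helpers are hypothetical, and the harness's actual hashing code may differ.

```python
import hashlib


class FakeImage:
    """Stand-in for an image object (hypothetical, not a Pillow class)."""

    def __init__(self, size, pixels):
        self.size = size
        self._pixels = pixels

    def tobytes(self):
        # Pillow images expose a method of the same name.
        return self._pixels


def hash_by_repr(img):
    # Fragile: the default repr includes the object's memory address.
    return hashlib.sha256(repr(img).encode("utf-8")).hexdigest()


def hash_by_bytes(img):
    # Stable: depends only on the image size and its raw pixel bytes.
    digest = hashlib.sha256()
    digest.update(repr(img.size).encode("utf-8"))
    digest.update(img.tobytes())
    return digest.hexdigest()


first = FakeImage((2, 1), b"\x00\xff")
second = FakeImage((2, 1), b"\x00\xff")  # same content, different object
```

Hashing the content keeps cache keys stable across runs and across copies of the same image, which the repr-based hash cannot guarantee.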

Model & Backend Support

OpenAI GPT-5 support by @babyplutokurt in #3247
Azure OpenAI support by @zinccat in #3349
Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
OpenVINO text2text models by @nikita-savelyevv in #3101
Intel XPU support for HFLM by @kaixuanliu in #3211
Attention head steering support by @luciaquirke in #3279
Leverage vLLM's tokenizer_info endpoint to avoid manual duplication by @m-misiura in #3185

What's Changed

Remove trust_remote_code: True from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213
Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
Fix add_bos_token not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206
remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218
Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221
Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219
Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184
Fix unknown group key to tag in yaml config for lambada_multilingual_stablelm by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
update minerva_math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259
feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173
Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
Add support for OpenVINO text2text generation models by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268
Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265
Add acc_norm metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272
Add acc_norm metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271
Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279
Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277
Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047
[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292
Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705
Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287
Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296
Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
[fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321
Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316
bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337
Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338
Leverage vllm's tokenizer_info endpoint to avoid manual duplication by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343
Align humaneval_64_instruct task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344
Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
[fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353
add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368
[fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378
Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349
Delegate BOS to the tokenizer; add_bos_token defaults to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347
fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361
[feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377
Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359
Resolve deprecation of vllm.utils.get_open_port by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398
Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
[MMLU redux] Do not use samples which do not have error_type="ok" by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410
fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394
Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311
[bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
Fix lambada_multilingual_stablelm by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294
Fix descriptions in the Moral Stories and Histoires Morales tasks. by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374
Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406
Rename the conflicting environment variable LOGLEVEL to LMEVAL_LOG_LEVEL by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418
Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426

New Contributors

@LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
@ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
@Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
@babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
@FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
@HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
@Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
@nikita-savelyevv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
@weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
@jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
@slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
@gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
@valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
@TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
@kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
@its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
@priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
@Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
@m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
@Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
@zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
@marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
@neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
@ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
@Oseltamivir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
@tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
@brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
@christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
@AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
@Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
@SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
@preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2

Breaking Changes

Minimum required runtime version changed to Python 3.10
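
A deployment script can check the interpreter before attempting the upgrade. The sketch below uses only the standard library; the name `meets_minimum` is hypothetical, and the (3, 10) floor is the one stated in this release.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum stated in this release


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`.

    Defaults to the running interpreter; pass a tuple to check another
    environment's version.
    """
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= minimum


if not meets_minimum():
    print(
        "Python %d.%d or newer is required; found %d.%d"
        % (MIN_PYTHON + tuple(sys.version_info[:2]))
    )
```

Running this in CI ahead of the package upgrade surfaces an outdated runtime as a clear message rather than an installation failure.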

Earlier breaking changes

v0.4.12 vLLM minimum version requirement bumped to 0.18
v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
v0.4.12 TaskManager.load() returns flat dict instead of nested structure