This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Python 3.10 is now the minimum required runtime version.
Full changelog
This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.
New Benchmarks & Tasks
A big wave of new evaluation tasks this release:
- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and new `acc_norm_bytes` metric by @baberabb in #3368
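The release notes do not spell out how `acc_norm_bytes` is computed. The sketch below assumes, from the name alone, that each choice's log-likelihood is divided by the UTF-8 byte length of the choice text before taking the argmax. The function names and the sample scores are hypothetical and are not taken from the harness.

```python
def pick_by_bytes(loglikelihoods, choices):
    """Index of the choice with the highest log-likelihood per UTF-8 byte (assumed definition)."""
    scores = [
        ll / max(len(choice.encode("utf-8")), 1)
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def pick_by_chars(loglikelihoods, choices):
    """Same idea, but normalized by character count, shown for comparison."""
    scores = [ll / max(len(choice), 1) for ll, choice in zip(loglikelihoods, choices)]
    return max(range(len(scores)), key=scores.__getitem__)


# "да" is 2 characters but 4 UTF-8 bytes, so the two normalizations disagree.
choices = ["yes", "да"]
loglikelihoods = [-3.0, -3.6]  # made-up scores for illustration

best_by_bytes = pick_by_bytes(loglikelihoods, choices)
best_by_chars = pick_by_chars(loglikelihoods, choices)
```

Under this assumption, the difference only shows up for text where characters and bytes diverge, which is typical of non-Latin scripts such as those in multilingual benchmarks.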
Fixes & Improvements
Core Changes:
- Python 3.10 minimum by @jannalulu in #3337
- Unpinned `datasets` library by @baberabb in #3316
- BOS token handling: Delegate to tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347
- Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418
- Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
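For scripts that still export the old `LOGLEVEL` variable, a small shim can carry the value over to `LMEVAL_LOG_LEVEL` before the harness starts. This is a minimal sketch: the helper name `migrate_log_level_env` is hypothetical, and only the two variable names come from the release notes.

```python
import os


def migrate_log_level_env(env):
    """Copy LOGLEVEL to LMEVAL_LOG_LEVEL when only the old name is set.

    `env` is any mutable mapping, for example os.environ. An existing
    LMEVAL_LOG_LEVEL value is never overwritten.
    """
    old_name, new_name = "LOGLEVEL", "LMEVAL_LOG_LEVEL"
    if new_name not in env and old_name in env:
        env[new_name] = env[old_name]
    return env.get(new_name)


# Demonstration on plain dicts so the real environment is left untouched.
legacy_env = {"LOGLEVEL": "DEBUG"}
migrated_level = migrate_log_level_env(legacy_env)

explicit_env = {"LOGLEVEL": "DEBUG", "LMEVAL_LOG_LEVEL": "WARNING"}
explicit_level = migrate_log_level_env(explicit_env)
```

In a launcher script, calling `migrate_log_level_env(os.environ)` before importing the harness would apply the same logic to the live environment.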
Task Fixes:
- Fixed MMLU-Redux to exclude samples without `error_type="ok"` and display summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
- Fixed `crows_pairs` dataset by @jannalulu in #3378
- Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206
- Fixed `lambada_multilingual_stablelm` by @jmichaelov, @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned correct MMLUSR version by @christinaexyou in #3350
- Updated `minerva_math` by @baberabb in #3259
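The MMLU-Redux fix above amounts to a filter on the `error_type` field. A minimal sketch of that idea on plain dictionaries follows. The helper name, the sample records, and the non-"ok" label are placeholders for illustration and do not reproduce the task's actual preprocessing code.

```python
def keep_ok_samples(samples):
    """Keep only samples whose error_type field is exactly "ok"."""
    return [sample for sample in samples if sample.get("error_type") == "ok"]


# Placeholder records; only the field name and the "ok" value come from the notes.
samples = [
    {"question": "Q1", "error_type": "ok"},
    {"question": "Q2", "error_type": "some_annotation_error"},
    {"question": "Q3"},  # a missing label is treated as not ok
    {"question": "Q4", "error_type": "ok"},
]

clean_samples = keep_ok_samples(samples)
kept_questions = [sample["question"] for sample in clean_samples]
```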
Backend Fixes:
- Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
- Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303
- Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Fixed GPT series model bugs by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed `additional_config` parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425
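The PIL hashing fix matters because an object's default `repr` typically embeds a memory address, so two images with identical pixels can hash differently. The sketch below illustrates content-based hashing with a hypothetical stand-in class, `FakeImage`, that mimics the `mode`, `size`, and `tobytes()` members of a PIL image so that it runs without Pillow. It is not the harness's implementation.

```python
import hashlib


class FakeImage:
    """Hypothetical stand-in exposing mode, size, and tobytes() like a PIL image."""

    def __init__(self, mode, size, data):
        self.mode = mode
        self.size = size
        self._data = data

    def tobytes(self):
        return self._data


def hash_image(image):
    """Hash the image mode, dimensions, and raw pixel bytes with SHA-256."""
    digest = hashlib.sha256()
    digest.update(image.mode.encode("utf-8"))
    digest.update(repr(tuple(image.size)).encode("utf-8"))
    digest.update(image.tobytes())
    return digest.hexdigest()


# Two distinct objects holding the same 2x1 RGB pixels.
first = FakeImage("RGB", (2, 1), bytes([255, 0, 0, 0, 255, 0]))
second = FakeImage("RGB", (2, 1), bytes([255, 0, 0, 0, 255, 0]))

same_content_hash = hash_image(first) == hash_image(second)
same_repr = repr(first) == repr(second)
```

Including the mode and size in the digest keeps images with identical raw bytes but different shapes from colliding.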
Model & Backend Support
- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
What's Changed
- Remove `trust_remote_code: True` from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
- Fix `add_bos_token` not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
- Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
- Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
- Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218
- Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221
- Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219
- Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184
- Fix unknown group key to tag in yaml config for `lambada_multilingual_stablelm` by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
- update `minerva_math` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259
- feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
- Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
- Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
- pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268
- Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
- Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265
- Add `acc_norm` metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272
- Add `acc_norm` metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
- Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292
- Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
- add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705
- Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287
- Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296
- Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
- [fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
- unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316
- bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337
- Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338
- Leverage vllm's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
- remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343
- Align `humaneval_64_instruct` task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344
- Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
- [fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368
- [fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378
- Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
- added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349
- Delegate BOS to the tokenizer; `add_bos_token` defaults to `None` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347
- fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361
- [feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377
- Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359
- Resolve deprecation of `vllm.utils.get_open_port` by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398
- Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
- Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
- Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
- [MMLU redux] Do not use samples which do not have `error_type="ok"` by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410
- fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394
- Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311
- [bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
- fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
- Fix `lambada_multilingual_stablelm` by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294
- Fix descriptions in the Moral Stories and Histoires Morales tasks. by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374
- Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
- [fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406
- Rename the conflicting environment variable `LOGLEVEL` to `LMEVAL_LOG_LEVEL` by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418
- Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
- Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
- New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426
New Contributors
- @LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
- @ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
- @Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
- @babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
- @FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
- @HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
- @Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
- @nikita-savelyevv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
- @weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
- @jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
- @slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
- @gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
- @valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
- @TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
- @kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
- @its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
- @priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
- @Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
- @m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
- @Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
- @zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
- @marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
- @neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
- @ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
- @Oseltamivir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
- @tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
- @brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
- @christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
- @AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
- @Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
- @SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
- @preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2
Breaking Changes
- Minimum required runtime version changed to Python 3.10
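Teams planning the upgrade may want to verify the interpreter before installing. The following is a minimal preflight sketch: the `meets_minimum` helper and `MIN_PYTHON` constant are hypothetical names, and only the 3.10 floor comes from the release notes.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum runtime stated in this release


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter running this script.
if meets_minimum():
    status = "ok"
else:
    status = "upgrade required"

print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Running a check like this in CI before bumping the harness version can surface images that are still pinned to Python 3.9 or earlier.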