EleutherAI / Lm-Evaluation-Harness

v0.4.11 Feature

This release adds 1 notable feature for engineering teams evaluating rollout.

Published 5mo AI Coding Tools

✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

Affected surfaces

rce_ssrf deps

Summary

AI summary

Minor fixes and improvements.

Full changelog

v0.4.11 Release Notes

Minor release. Stay tuned for bigger changes next release.

New Platform Support

Windows ML Backend — Native Windows ML inference support by @chapsiru and @chemwolf6922 in #3470, #3564, #3565

New Benchmarks & Tasks

BEAR knowledge probe by @plonerma in #3496

Task Version Changes

The following tasks have updated versions. Results from previous task versions may not be directly comparable. See the linked PRs or individual task READMEs for changelogs.

afrobench_belebele (all variants): 2 → 3 in #3551
evalita_llm: 0.0 → 0.1 in #3551
include (all 90 language variants): 0.0 → 0.1 in #3551
mgsm_direct (all 11 language variants): 3.0 → 4.0 by @LakshyaChaudhry in #3574
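
One way to avoid comparing scores across incompatible task versions is to diff the per-task versions of two runs before comparing their results. The sketch below assumes you already have a task-name-to-version mapping for each run (for example, extracted from each run's saved results). The function name, the task names, and the `unchanged_task` entry are hypothetical; only the version numbers for the changed tasks come from the list above.

```python
from typing import Dict, Tuple, Union

Version = Union[int, float, str]


def changed_task_versions(
    old: Dict[str, Version], new: Dict[str, Version]
) -> Dict[str, Tuple[Version, Version]]:
    """Return {task: (old_version, new_version)} for tasks that appear in
    both mappings and whose versions differ."""
    return {
        task: (old[task], new[task])
        for task in sorted(old.keys() & new.keys())
        if old[task] != new[task]
    }


# Hypothetical mappings for two runs. Task names are illustrative; the
# version numbers for the changed tasks mirror the list above.
versions_before = {"mgsm_direct_en": 3.0, "evalita_llm": 0.0, "unchanged_task": 1}
versions_after = {"mgsm_direct_en": 4.0, "evalita_llm": 0.1, "unchanged_task": 1}

for task, (before, after) in changed_task_versions(
    versions_before, versions_after
).items():
    print(f"{task}: {before} -> {after} (scores may not be directly comparable)")
```

Tasks reported by this check are the ones to re-run on both sides before drawing conclusions from a score difference.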

Fixes & Improvements

Fixed SQuAD v2 evaluation by @HydrogenSulfate in #3535
Fixed MasakhaNEWS tasks — replaced non-existent headline_text field with headline by @Mr-Neutr0n in #3567
Fixed incorrect task configs by @baberabb in #3552
Replaced eval() with ast.literal_eval in task configs for safer parsing by @baberabb in #3577
Fixed SGLang duplicate registration error by @enpimashin in #3543
Restored hf_transfer import check by @baberabb in #3563
Fixed modify_gen_kwargs call in vLLM VLMs by @hmellor in #3573
Refactored vLLM gen_kwargs normalization inline to modify_gen_kwargs; fixed cached gen_kwargs mutation by @baberabb in #3582
Fixed README for task-listing CLI command by @UltimateJupiter in #3545
Updated dependencies by @baberabb in #3546
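
The move from eval() to ast.literal_eval noted above matters because ast.literal_eval only accepts Python literals and never executes code. The following is a minimal illustration of that difference using the standard library only. It is not the harness's actual parsing code; the helper name and the sample strings are hypothetical.

```python
import ast


def parse_literal(text: str):
    """Parse a string holding a Python literal without executing any code."""
    return ast.literal_eval(text)


# Plain literals (lists, dicts, numbers, strings) parse as expected.
print(parse_literal("['A', 'B', 'C', 'D']"))
print(parse_literal("{'temperature': 0.0, 'max_gen_toks': 256}"))

# Anything that is not a literal, such as a function call, is rejected
# rather than run. eval() would have executed this expression.
try:
    parse_literal("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    print("rejected:", type(err).__name__)
```

The trade-off is that config values relying on arbitrary expressions no longer parse, so such values would need to be written as plain literals.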

New Contributors

@HydrogenSulfate made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3535
@UltimateJupiter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3545
@enpimashin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3543
@chapsiru made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3470
@chemwolf6922 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3565
@plonerma made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3496
@hmellor made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3573
@Mr-Neutr0n made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3567
@LakshyaChaudhry made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3574

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.10...v0.4.11

Related context

Breaking changes in v0.4.12

v0.4.12 vLLM minimum version requirement bumped to 0.18
v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
v0.4.12 TaskManager.load() returns flat dict instead of nested structure