This release adds 1 notable feature for engineering teams evaluating rollout.

Published 3mo AI Coding Tools
✓ No known CVEs patched

Topics

evaluation-framework language-model transformer

Affected surfaces

rce_ssrf deps

Summary

AI summary

Minor fixes and improvements.

Full changelog

v0.4.11 Release Notes

Minor release. Stay tuned for bigger changes next release.

New Platform Support

  • Windows ML Backend — Native Windows ML inference support by @chapsiru and @chemwolf6922 in #3470, #3564, #3565

New Benchmarks & Tasks

  • BEAR knowledge probe by @plonerma in #3496

Task Version Changes

The following tasks have updated versions. Results from previous task versions may not be directly comparable. See the linked PRs or individual task READMEs for changelogs.

  • afrobench_belebele (all variants): 2 → 3 in #3551
  • evalita_llm: 0.0 → 0.1 in #3551
  • include (all 90 language variants): 0.0 → 0.1 in #3551
  • mgsm_direct (all 11 language variants): 3.0 → 4.0 by @LakshyaChaudhry in #3574
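
One way to avoid comparing scores across task versions is to check the version recorded for each task in both runs before diffing results. The sketch below assumes each run exposes a mapping from task name to version. The `changed_versions` helper, the task names, and the sample dictionaries are hypothetical and only illustrate the check.

```python
def changed_versions(old_versions, new_versions):
    """Return {task: (old, new)} for tasks in both runs whose version differs."""
    changed = {}
    for task, old in old_versions.items():
        if task in new_versions and new_versions[task] != old:
            changed[task] = (old, new_versions[task])
    return changed


# Hypothetical version maps from two runs; task names are illustrative.
old_run = {"evalita_llm": 0.0, "mgsm_direct_en": 3.0, "unchanged_task": 1}
new_run = {"evalita_llm": 0.1, "mgsm_direct_en": 4.0, "unchanged_task": 1}

for task, (old, new) in changed_versions(old_run, new_run).items():
    print(f"{task}: version {old} -> {new}; scores may not be directly comparable")
```

Tasks flagged this way are candidates for re-running on the new version rather than being compared against stored results.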

Fixes & Improvements

  • Fixed SQuAD v2 evaluation by @HydrogenSulfate in #3535
  • Fixed MasakhaNEWS tasks — replaced non-existent headline_text field with headline by @Mr-Neutr0n in #3567
  • Fixed incorrect task configs by @baberabb in #3552
  • Replaced eval() with ast.literal_eval in task configs for safer parsing by @baberabb in #3577
  • Fixed SGLang duplicate registration error by @enpimashin in #3543
  • Restored hf_transfer import check by @baberabb in #3563
  • Fixed modify_gen_kwargs call in vLLM VLMs by @hmellor in #3573
  • Refactored vLLM gen_kwargs normalization inline to modify_gen_kwargs; fixed cached gen_kwargs mutation by @baberabb in #3582
  • Fixed README for task-listing CLI command by @UltimateJupiter in #3545
  • Updated dependencies by @baberabb in #3546
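
The switch from eval() to ast.literal_eval matters because literal_eval accepts only Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None) and raises an error on anything that would execute code. A minimal sketch of the difference, assuming config values arrive as strings; the `parse_config_value` helper and the sample inputs are hypothetical and not taken from the harness:

```python
import ast


def parse_config_value(text):
    """Parse a Python literal from a config string without executing code."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a plain literal: keep the raw string instead of evaluating it.
        return text


# A literal dict is parsed into a real dict.
print(parse_config_value("{'temperature': 0.0, 'max_gen_toks': 256}"))

# A function call is rejected by literal_eval and returned as an inert string;
# eval() would have executed it.
print(parse_config_value("__import__('os').getcwd()"))
```

Falling back to the raw string is one design choice among several; a stricter parser could re-raise the error so that malformed configs fail loudly.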

New Contributors

  • @HydrogenSulfate made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3535
  • @UltimateJupiter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3545
  • @enpimashin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3543
  • @chapsiru made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3470
  • @chemwolf6922 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3565
  • @plonerma made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3496
  • @hmellor made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3573
  • @Mr-Neutr0n made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3567
  • @LakshyaChaudhry made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3574

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.10...v0.4.11

Related context

Earlier breaking changes

  • v0.4.12 vLLM minimum version requirement bumped to 0.18
  • v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
  • v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
  • v0.4.12 TaskManager.load() returns flat dict instead of nested structure
