EleutherAI / lm-evaluation-harness

AI Coding Tools

A unified framework for evaluating generative language models across dozens of academic benchmarks, custom prompts, and multiple inference back‑ends.

Python · Latest v0.4.12 · 23d ago

Features

  • Supports >60 standard LLM benchmarks with hundreds of subtasks
  • Flexible model loading via Transformers, vLLM, GPT‑NeoX, Megatron‑DeepSpeed, and commercial APIs (OpenAI, TextSynth)
  • Configurable prompt design with Jinja2 and import from Promptsource
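
As a rough illustration of how a back-end and a set of benchmarks are selected, the sketch below assembles an `lm_eval` command line. The flag names (`--model`, `--model_args`, `--tasks`, `--batch_size`), the `hf` back-end name, and the example model and task names are assumptions based on the project's commonly documented 0.4.x command line and may differ in the installed release; `build_lm_eval_command` is a hypothetical helper, not part of the harness.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, batch_size=8):
    """Assemble an lm_eval command line as a list of arguments.

    Hypothetical helper: the flag names are assumed from the 0.4.x CLI
    and should be checked against the installed version.
    """
    if not tasks:
        raise ValueError("at least one task name is required")
    # Model arguments are passed as comma-separated key=value pairs.
    arg_string = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", arg_string,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# Example values are illustrative; substitute your own model and tasks.
command = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},
    tasks=["hellaswag", "truthfulqa_mc2"],
)
print(shlex.join(command))
```

The command is only printed here, not executed. Passing the list to `subprocess.run(command, check=True)` would launch an evaluation if the package and the chosen back-end are installed.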

Recent releases

View all 6 releases →
  • v0.4.12 · Breaking risk · No immediate action: SteeredModel rename + vLLM bump + thinking flag change
  • v0.4.11 · Maintenance · Review required · Tags: RCE / SSRF, Dependencies: Routine maintenance and dependency updates.
  • v0.4.10 · Breaking risk · Breaking upgrade · Review required: Optional backend installation
  • v0.4.9.2 · Breaking risk · Breaking upgrade · Review required: Python 3.10 minimum
  • v0.4.9.1 · Breaking risk · No immediate action: New benchmarks + TruthfulQA
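
The v0.4.9.2 entry above lists a Python 3.10 minimum. A minimal sketch of a pre-upgrade check for that requirement follows; `meets_minimum` and `MINIMUM_PYTHON` are hypothetical names introduced for illustration, and the only fact taken from the listing is the 3.10 floor.

```python
import sys

# Minimum interpreter version, taken from the v0.4.9.2 release entry.
MINIMUM_PYTHON = (3, 10)


def meets_minimum(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    found = sys.version.split()[0]
    print(f"Python {MINIMUM_PYTHON[0]}.{MINIMUM_PYTHON[1]}+ is required; found {found}")
```

Running a check like this in CI before bumping the dependency can surface an interpreter mismatch earlier than a failed install would.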

About

Stars: 12,802
Forks: 3,320
Languages: Python, Shell, C++

Install & Platforms

Install via pip
