Verdict

AI Coding Tools

Benchmark and compare language models on your own data to select the best model and measure improvements


Python · Latest v0.2.0 · 3mo ago

Features

Run prompts across multiple LLMs (OpenAI, Anthropic, Google Gemini, etc.)
Score responses with pluggable reference‑based or LLM‑as‑judge metrics
Side‑by‑side comparison UI for picking the optimal model
Track how prompt engineering or fine‑tuning changes affect scores
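The comparison workflow described above can be sketched in plain Python. This is a minimal illustration and not Verdict's API: the `compare_models` and `exact_match` names are hypothetical, and the "models" are stub functions standing in for real provider clients.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a real model client: maps a prompt to a response.
Model = Callable[[str], str]


def exact_match(response: str, ideal: str) -> float:
    """Reference-based metric: 1.0 if the normalized strings match, else 0.0."""
    return float(response.strip().lower() == ideal.strip().lower())


def compare_models(
    models: Dict[str, Model],
    examples: List[Tuple[str, str]],
    metric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Return the mean metric score per model over (input, ideal) pairs."""
    return {
        name: mean(metric(model(prompt), ideal) for prompt, ideal in examples)
        for name, model in models.items()
    }


examples = [("2+2", "4"), ("capital of France", "Paris")]
# Stub models (assumptions for illustration only).
models = {
    "model_a": lambda prompt: {"2+2": "4", "capital of France": "paris"}.get(prompt, ""),
    "model_b": lambda prompt: "4",
}
scores = compare_models(models, examples)
best = max(scores, key=scores.get)  # model with the highest mean score
print(scores, best)
```

Because the metric is passed in as a function, swapping exact match for another reference-based or judge-based scorer does not change the comparison loop.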

Recent releases

View all 2 releases →

v0.2.0 New feature 3mo

Notable features

CSV support via Dataset.from_csv() with default column names `input` and `ideal` and overrides `input_field`/`output_field`
Arbitrary JSONL field mapping through CLI flags `--input-field` / `--output-field` and Python API
Label‑free evaluation allowing datasets without reference answers; reference‑based metrics emit a clear upfront error

Full changelog

What's new in 0.2.0

Dataset

CSV support via Dataset.from_csv() — default column names input and ideal, with input_field/output_field overrides for custom schemas
Arbitrary JSONL field mapping via --input-field / --output-field CLI flags and Python API
Label-free evaluation — datasets without reference answers work end-to-end; reference-based metrics raise a clear error upfront
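The field mapping and the upfront label check can be illustrated with the standard library alone. The sketch below does not call Verdict: `load_csv_rows` and `require_references` are hypothetical helpers that mimic the behavior described in the changelog, reusing only the `input_field`/`output_field` names and the `input`/`ideal` defaults mentioned there.

```python
import csv
import io
from typing import Dict, List, Optional


def load_csv_rows(
    text: str,
    input_field: str = "input",
    output_field: str = "ideal",
) -> List[Dict[str, Optional[str]]]:
    """Map CSV columns onto a common {input, ideal} shape; ideal may be absent."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if input_field not in record:
            raise KeyError(f"missing input column {input_field!r}")
        rows.append({"input": record[input_field], "ideal": record.get(output_field)})
    return rows


def require_references(rows: List[Dict[str, Optional[str]]], metric_name: str) -> None:
    """Fail before any model call if a reference-based metric lacks labels."""
    missing = [i for i, row in enumerate(rows) if not row["ideal"]]
    if missing:
        raise ValueError(
            f"{metric_name} needs reference answers, but rows {missing} have none"
        )


# Custom schema: columns named "question" and "answer".
labeled = load_csv_rows(
    "question,answer\n2+2,4\n", input_field="question", output_field="answer"
)
# Label-free dataset: only an input column, no reference answers.
unlabeled = load_csv_rows("input\nSummarize this text\n")

require_references(labeled, "exact_match")  # passes silently
```

Calling `require_references(unlabeled, "exact_match")` raises `ValueError` immediately, which is the kind of early failure the changelog describes for reference-based metrics on label-free data.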

Metrics

Multi-dimensional LLM-as-judge via the dimensions parameter — score multiple criteria (e.g. fluency, accuracy, safety) in a single judge call
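One way to score several criteria in a single judge call is to request a JSON object with one key per dimension and validate the reply. The sketch below is an assumption about how such a feature could work, not Verdict's implementation: `build_judge_prompt` and `parse_judge_reply` are hypothetical helpers, the 1 to 5 scale is invented for illustration, and the judge reply is stubbed instead of coming from a model.

```python
import json
from typing import Dict, List


def build_judge_prompt(question: str, answer: str, dimensions: List[str]) -> str:
    """Ask for every dimension in one request, answered as a JSON object."""
    keys = ", ".join(f'"{d}"' for d in dimensions)
    return (
        "Rate the answer from 1 to 5 on each criterion.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with only a JSON object with the keys: {keys}."
    )


def parse_judge_reply(reply: str, dimensions: List[str]) -> Dict[str, int]:
    """Validate that the judge scored every requested dimension."""
    data = json.loads(reply)
    missing = [d for d in dimensions if d not in data]
    if missing:
        raise ValueError(f"judge reply is missing dimensions: {missing}")
    return {d: int(data[d]) for d in dimensions}


dimensions = ["fluency", "accuracy", "safety"]
prompt = build_judge_prompt("What is 2+2?", "4", dimensions)
# Stubbed reply; a real run would send `prompt` to a judge model instead.
reply = '{"fluency": 5, "accuracy": 5, "safety": 5}'
scores = parse_judge_reply(reply, dimensions)
print(scores)
```

Batching the dimensions into one request trades some per-criterion isolation for fewer judge calls, so the validation step matters: a reply that skips a dimension is rejected and never silently scored.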


Releases


[Chart: releases per month; axis labels M, A, M, J, J]

Cadence 0.0 / wk

Last release 103d

Tracked 2

Security


Security score 7.8/10

OpenSSF —

Open CVEs 0

Active maintainer

Community

GitHub stars 4

Contributors 90d 0

Open issues 3

Open PRs 3

Stars/wk velocity 0.0

About

Stars 4

Forks 0

Language Python


Install & Platforms

Install via pip

Similar tools

AI/ML benchmark for local LLM inference and XGBoost training on GPU/CPU

EleutherAI / Lm-Evaluation-Harness

Find the best local LLM for your hardware, ranked by benchmarks

BlazeUp-AI / Observal

