
Verdict

AI Coding Tools
Python · Latest v0.2.0 · 1mo ago

Features

  • Benchmark LLMs across custom prompts and datasets
  • Score responses with pluggable reference‑based or LLM‑judge metrics
  • Side‑by‑side comparison to pick the best model for a task
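The three features above can be pictured with a small, self-contained sketch. It does not use Verdict's API: `Metric`, `exact_match`, `compare_models`, and the stand-in "models" are all hypothetical names chosen for illustration, and the models are plain functions rather than real LLM calls.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical metric signature: (response, reference) -> score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(response: str, reference: str) -> float:
    """Reference-based metric: 1.0 when the normalized strings are equal."""
    return float(response.strip().lower() == reference.strip().lower())


def compare_models(
    models: Dict[str, Callable[[str], str]],
    dataset: List[Dict[str, str]],
    metric: Metric,
) -> Dict[str, float]:
    """Run every model over the dataset and return its mean metric score."""
    return {
        name: mean(metric(generate(row["input"]), row["ideal"]) for row in dataset)
        for name, generate in models.items()
    }


dataset = [
    {"input": "2 + 2", "ideal": "4"},
    {"input": "Capital of France", "ideal": "Paris"},
]

# Stand-in "models": lookup tables instead of real LLM calls (assumption).
models = {
    "model_a": lambda prompt: {"2 + 2": "4", "Capital of France": "paris"}[prompt],
    "model_b": lambda prompt: {"2 + 2": "5", "Capital of France": "Paris"}[prompt],
}

scores = compare_models(models, dataset, exact_match)
best = max(scores, key=scores.get)  # model with the highest mean score
```

Because the metric is just a callable, swapping in a different scorer does not change the comparison loop.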

Recent releases

v0.2.0 New feature
Notable features
  • CSV support via Dataset.from_csv(), with default column names `input` and `ideal` and `input_field`/`output_field` overrides for custom schemas
  • Arbitrary JSONL field mapping through the CLI flags `--input-field` / `--output-field` and the Python API
  • Label‑free evaluation allowing datasets without reference answers; reference‑based metrics emit a clear upfront error
Full changelog

What's new in 0.2.0

Dataset

  • CSV support via Dataset.from_csv() — default column names input and ideal, with input_field/output_field overrides for custom schemas
  • Arbitrary JSONL field mapping via the --input-field / --output-field CLI flags and the Python API
  • Label-free evaluation — datasets without reference answers work end-to-end; reference-based metrics raise a clear error upfront
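The dataset behavior described above (field mapping for CSV and JSONL, plus an upfront error when a reference-based metric meets unlabeled data) can be sketched with the standard library alone. This is an illustration under stated assumptions, not Verdict's implementation: `map_rows` and `require_references` are hypothetical helpers, and only the default column names `input` and `ideal` come from the release notes.

```python
import csv
import io
import json
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def map_rows(
    rows: Iterable[dict],
    input_field: str = "input",
    output_field: str = "ideal",
) -> List[Record]:
    """Normalize records to {"input", "ideal"}; "ideal" is None when absent."""
    mapped = []
    for row in rows:
        if input_field not in row:
            raise KeyError(f"row has no {input_field!r} field")
        ideal = row.get(output_field)
        mapped.append(
            {"input": row[input_field], "ideal": None if ideal in (None, "") else ideal}
        )
    return mapped


def require_references(rows: List[Record], metric_name: str) -> None:
    """Fail before any model call if a reference-based metric lacks labels."""
    missing = sum(1 for row in rows if row["ideal"] is None)
    if missing:
        raise ValueError(
            f"{metric_name} needs reference answers, but {missing} row(s) have none"
        )


# CSV with a custom schema: "question"/"answer" instead of "input"/"ideal".
csv_text = "question,answer\nWhat is 2 + 2?,4\n"
csv_rows = map_rows(csv.DictReader(io.StringIO(csv_text)), "question", "answer")

# Label-free JSONL: only a prompt field, no reference answers.
jsonl_text = '{"prompt": "Summarize this article."}\n{"prompt": "Translate to French."}\n'
jsonl_rows = map_rows(
    (json.loads(line) for line in jsonl_text.splitlines() if line.strip()),
    input_field="prompt",
)

require_references(csv_rows, "exact_match")  # labeled data: passes silently

try:
    require_references(jsonl_rows, "exact_match")  # unlabeled data: rejected
except ValueError as err:
    error_message = str(err)
```

Checking for references before evaluation starts means an unlabeled dataset fails immediately rather than after a batch of model calls has already run.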

Metrics

  • Multi-dimensional LLM-as-judge via the dimensions parameter — score multiple criteria (e.g. fluency, accuracy, safety) in a single judge call
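One way to score several criteria in a single judge call is to ask the judge for a JSON object keyed by criterion and validate the reply. The sketch below assumes that approach; the prompt wording, the 1 to 5 scale, `build_judge_prompt`, `judge`, and the `fake_llm` stub are all hypothetical and do not reflect how Verdict's `dimensions` parameter is implemented.

```python
import json
from typing import Callable, Dict, List


def build_judge_prompt(question: str, response: str, dimensions: List[str]) -> str:
    """Compose one prompt that asks for a score on every criterion."""
    return (
        "Rate the response to the question on each criterion from 1 to 5.\n"
        f"Criteria: {', '.join(dimensions)}\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Reply with a JSON object mapping each criterion to an integer."
    )


def judge(
    question: str,
    response: str,
    dimensions: List[str],
    call_llm: Callable[[str], str],
) -> Dict[str, int]:
    """Score all dimensions with a single judge call and validate the reply."""
    raw = call_llm(build_judge_prompt(question, response, dimensions))
    scores = json.loads(raw)
    missing = [d for d in dimensions if d not in scores]
    if missing:
        raise ValueError(f"judge reply is missing dimensions: {missing}")
    return {d: int(scores[d]) for d in dimensions}


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real judge model; records each call it receives."""
    calls.append(prompt)
    return '{"fluency": 5, "accuracy": 4, "safety": 5}'


result = judge(
    "What is the capital of France?",
    "The capital of France is Paris.",
    ["fluency", "accuracy", "safety"],
    fake_llm,
)
```

Passing the model call in as a function keeps the sketch runnable offline and makes it easy to confirm that only one judge call is made per response.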

About

Stars: 4
Forks: 0
Language: Python

Install & Platforms

Install via: pip
