Features
- Benchmark LLMs across custom prompts and datasets
- Score responses with pluggable reference-based or LLM-judge metrics
- Compare models side by side to pick the best one for a task
v0.2.0
New feature
Notable features
- CSV support via `Dataset.from_csv()`, with default column names `input` and `ideal` and `input_field`/`output_field` overrides
- Arbitrary JSONL field mapping through the CLI flags `--input-field` / `--output-field` and the Python API
- Label‑free evaluation allowing datasets without reference answers; reference‑based metrics emit a clear upfront error
Full changelog
What's new in 0.2.0
Dataset
- CSV support via `Dataset.from_csv()`: default column names `input` and `ideal`, with `input_field`/`output_field` overrides for custom schemas
- Arbitrary JSONL field mapping via the `--input-field` / `--output-field` CLI flags and the Python API
- Label-free evaluation: datasets without reference answers work end-to-end; reference-based metrics raise a clear error upfront
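The column mapping and the upfront check for missing references can be illustrated with a small standard-library sketch. This is not the project's code: `load_csv`, `Example`, and `require_references` are hypothetical stand-ins that mimic the behavior described above, and only the default column names and the `input_field`/`output_field` parameter names come from the changelog.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    input: str
    ideal: Optional[str] = None  # None when the row has no reference answer


def load_csv(text: str, input_field: str = "input",
             output_field: str = "ideal") -> List[Example]:
    """Hypothetical stand-in for Dataset.from_csv(); not the library's implementation."""
    examples = []
    for row in csv.DictReader(io.StringIO(text)):
        if input_field not in row:
            raise KeyError(f"missing input column {input_field!r}")
        # A missing or empty reference column is treated as "no label".
        examples.append(Example(input=row[input_field],
                                ideal=row.get(output_field) or None))
    return examples


def require_references(examples: List[Example], metric_name: str) -> None:
    """Fail before any model call if a reference-based metric has nothing to compare against."""
    missing = sum(1 for ex in examples if ex.ideal is None)
    if missing:
        raise ValueError(
            f"{metric_name} needs reference answers, "
            f"but {missing} example(s) have none"
        )


# Custom schema: columns named "question" and "answer" instead of the defaults.
labeled = load_csv("question,answer\nWhat is 2+2?,4\n",
                   input_field="question", output_field="answer")

# Label-free dataset: only an input column.
unlabeled = load_csv("input\nSummarize this.\n")
```

Running the reference check once, before evaluation starts, is what makes the error "upfront": a label-free dataset paired with a reference-based metric fails immediately instead of partway through a run.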
Metrics
- Multi-dimensional LLM-as-judge via the `dimensions` parameter: score multiple criteria (e.g. fluency, accuracy, safety) in a single judge call
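One way a single judge call can cover several criteria is to ask for one JSON object keyed by dimension and validate the reply. The sketch below only illustrates that idea; `build_judge_prompt`, `parse_judge_reply`, the prompt wording, and the 1 to 5 scale are assumptions, not the project's actual prompt or API.

```python
import json
from typing import Dict, List


def build_judge_prompt(question: str, answer: str, dimensions: List[str]) -> str:
    """Hypothetical prompt builder: one request covering every dimension."""
    keys = ", ".join(f'"{d}"' for d in dimensions)
    return (
        "Rate the answer on each criterion from 1 to 5.\n"  # scale is an assumption
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reply with a JSON object with exactly these keys: {keys}."
    )


def parse_judge_reply(reply: str, dimensions: List[str]) -> Dict[str, float]:
    """Parse the judge's JSON reply and check that every dimension was scored."""
    scores = json.loads(reply)
    missing = [d for d in dimensions if d not in scores]
    if missing:
        raise ValueError(f"judge reply is missing dimensions: {missing}")
    return {d: float(scores[d]) for d in dimensions}


dims = ["fluency", "accuracy", "safety"]
prompt = build_judge_prompt("What is 2+2?", "4", dims)

# A reply such as a judge model might return for the prompt above.
scores = parse_judge_reply('{"fluency": 5, "accuracy": 3, "safety": 4}', dims)
```

Scoring all dimensions in one call costs a single judge request per response rather than one per criterion, at the price of having to validate that the reply covers every requested key.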