This release adds 5 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Topics
+12 more
Summary
AI summaryAdded multi-turn conversation testing with the turns: YAML field.
Full changelog
What's new in 0.4.0
Multi-turn conversation testing
Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.
name: flight-booking-conversation
turns:
- query: "I want to fly from NYC to Paris next Friday"
expected:
tools: [search_flights]
- query: "Book the cheapest economy option"
expected:
tools: [book_flight]
output:
contains: ["confirmed", "Paris"]
- query: "Send me a confirmation email"
expected:
tools: [send_email]
expected:
tools: [search_flights, book_flight, send_email]
thresholds:
min_score: 80
A/B endpoint comparison
Run the same test suite against two endpoints and get a per-test verdict table.
evalview compare \
--v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
--v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
--tests tests/
Cloud baseline sync
evalview login # OAuth sign-in
evalview snapshot # baselines auto-sync to cloud
evalview check # teammates pull your baselines automatically
Other highlights
evalview capture— HTTP proxy records real agent traffic as test YAMLsevalview install-hooks— inject regression checks into git pre-push- Silent model update detection — alerts when provider swaps model behind same API name
- Gradual drift detection — OLS regression over 10-check window
- Semantic diff —
--semantic-diffscores by meaning, not character similarity - Auto-open HTML report after every
evalview run - evalview init now auto-detects your agent endpoint and generates starter tests
- Test quality gating — low-quality generated tests are skipped, not silently polluting scores
- mypy clean — 0 errors across 109 source files
Community contributions
- Pydantic field validation for
TestCase(#54 by @illbeurs) - Edge tests for
CostEvaluatorandLatencyEvaluator(#55 by @illbeurs) health_check()onOllamaAdapter(#57 by @gauravxthakur)ConsoleReporterdocstrings (#56 by @gauravxthakur)
Full changelog: CHANGELOG.md
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About hidai25/eval-view
Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.
Related context
Related tools
Beta — feedback welcome: [email protected]