hidai25/eval-view

v0.4.0 Feature

This release adds 5 notable features for engineering teams evaluating rollout.

Published 4mo Developer Productivity

View tool

✓ No known CVEs patched

Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen

+12 more

cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing

Summary

AI summary

Added multi-turn conversation testing with the turns: YAML field.

Full changelog

What's new in 0.4.0

Multi-turn conversation testing

Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.

name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80

A/B endpoint comparison

Run the same test suite against two endpoints and get a per-test verdict table.

evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/

Cloud baseline sync

evalview login      # OAuth sign-in
evalview snapshot   # baselines auto-sync to cloud
evalview check      # teammates pull your baselines automatically

Other highlights

evalview capture — HTTP proxy records real agent traffic as test YAMLs
evalview install-hooks — inject regression checks into git pre-push
Silent model update detection — alerts when provider swaps model behind same API name
Gradual drift detection — OLS regression over 10-check window
Semantic diff — --semantic-diff scores by meaning, not character similarity
Auto-open HTML report after every evalview run
evalview init now auto-detects your agent endpoint and generates starter tests
Test quality gating — low-quality generated tests are skipped, not silently polluting scores
mypy clean — 0 errors across 109 source files

Community contributions

Pydantic field validation for TestCase (#54 by @illbeurs)
Edge tests for CostEvaluator and LatencyEvaluator (#55 by @illbeurs)
health_check() on OllamaAdapter (#57 by @gauravxthakur)
ConsoleReporter docstrings (#56 by @gauravxthakur)

Full changelog: CHANGELOG.md

View diff on GitHub

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Share on X Share on Bluesky

Track hidai25/eval-view

Get notified when new releases ship.

About hidai25/eval-view

Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.

All releases →