Skip to content

hidai25/eval-view

v0.4.0 Feature

This release adds 5 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen
+12 more
cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing

Summary

AI summary

Added multi-turn conversation testing with the turns: YAML field.

Full changelog

What's new in 0.4.0

Multi-turn conversation testing

Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.

name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80

A/B endpoint comparison

Run the same test suite against two endpoints and get a per-test verdict table.

evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/

Cloud baseline sync

evalview login      # OAuth sign-in
evalview snapshot   # baselines auto-sync to cloud
evalview check      # teammates pull your baselines automatically

Other highlights

  • evalview capture — HTTP proxy records real agent traffic as test YAMLs
  • evalview install-hooks — inject regression checks into git pre-push
  • Silent model update detection — alerts when provider swaps model behind same API name
  • Gradual drift detection — OLS regression over 10-check window
  • Semantic diff--semantic-diff scores by meaning, not character similarity
  • Auto-open HTML report after every evalview run
  • evalview init now auto-detects your agent endpoint and generates starter tests
  • Test quality gating — low-quality generated tests are skipped, not silently polluting scores
  • mypy clean — 0 errors across 109 source files

Community contributions

  • Pydantic field validation for TestCase (#54 by @illbeurs)
  • Edge tests for CostEvaluator and LatencyEvaluator (#55 by @illbeurs)
  • health_check() on OllamaAdapter (#57 by @gauravxthakur)
  • ConsoleReporter docstrings (#56 by @gauravxthakur)

Full changelog: CHANGELOG.md

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track hidai25/eval-view

Get notified when new releases ship.

Sign up free

About hidai25/eval-view

Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.

All releases →

Related context

Beta — feedback welcome: [email protected]