
hidai25/eval-view

v0.2.0 Feature

This release adds three notable features for engineering teams evaluating a rollout.

✓ No known CVEs patched in this version

Topics

agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing

Summary

AI summary

Sequence evaluator defaults to subsequence matching and adds pass@k/pass^k reliability metrics with suite type tagging.

Full changelog

What's New

This release introduces flexible sequence evaluation modes, industry-standard reliability metrics, and test suite categorization - inspired by Anthropic's agent evaluation best practices.

Flexible Sequence Evaluation

The sequence evaluator was previously too strict: exact matching penalized agents for finding valid alternative paths. It now defaults to subsequence matching, which verifies that the critical tools appear in order without failing on extra tool calls.

Three modes available:

  • subsequence (default): Expected tools in order, extras allowed
  • exact: Legacy strict matching
  • unordered: Only checks that the expected tools were called, in any order

# Per-test override
adapter_config:
  sequence_mode: unordered
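
The three modes can be sketched in a few lines of Python. This is an illustration of the matching semantics described above, not eval-view's implementation: the function name `matches_sequence` is hypothetical, and the handling of repeated tool names in `unordered` mode (treated as a set) is an assumption.

```python
from typing import Sequence


def matches_sequence(
    expected: Sequence[str],
    actual: Sequence[str],
    mode: str = "subsequence",
) -> bool:
    """Hypothetical helper illustrating the three modes (not eval-view's API)."""
    if mode == "exact":
        # Same tools, same order, nothing extra.
        return list(expected) == list(actual)
    if mode == "unordered":
        # Assumption: every expected tool must appear at least once; order
        # and repeat counts are ignored.
        return set(expected) <= set(actual)
    if mode == "subsequence":
        # `tool in remaining` consumes the iterator up to the match, so each
        # expected tool must be found after the previous one.
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    raise ValueError(f"unknown sequence mode: {mode!r}")


expected = ["search", "summarize"]
actual = ["search", "fetch", "summarize"]  # one extra tool call in the middle

print(matches_sequence(expected, actual, "subsequence"))
print(matches_sequence(expected, actual, "exact"))
print(matches_sequence(expected, actual, "unordered"))
```

With the extra `fetch` call, the run passes in `subsequence` and `unordered` modes but fails in `exact` mode, which is the behavior change this release makes the default.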

Reliability Metrics: pass@k & pass^k

Statistical summaries now include two industry-standard reliability metrics:

  • pass@k: "Will it work if I give it a few tries?" (probability of at least 1 success)
  • pass^k: "Will it work reliably every time?" (probability of ALL trials succeeding)

Reliability Metrics:
  pass@10:       99.9% (usually finds a solution)
  pass^10:       2.8% (unreliable)
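
As a rough worked example, both metrics can be computed from a per-trial success rate if trials are assumed independent. The function names below are hypothetical, and this release note does not say which estimator eval-view uses, so treat this as a sketch of the definitions rather than the tool's exact arithmetic.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Assumed observation: 7 of 10 trials passed, so the per-trial rate is 0.7.
p = 7 / 10

print(f"pass@10: {pass_at_k(p, 10):.4%}")
print(f"pass^10: {pass_hat_k(p, 10):.1%}")
```

A per-trial success rate of 0.7 gives a pass^10 of about 2.8%, matching the sample output above, while pass@10 is very close to 100%. The gap between the two is the point: an agent that almost always succeeds within ten tries can still be far from succeeding on every try.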

Suite Types: Capability vs Regression

Tag tests to distinguish expected failures from critical regressions:

name: complex-reasoning
suite_type: capability  # Expected to fail sometimes (hill climbing)

name: login-flow  
suite_type: regression  # Must pass (safety net)

Console output now shows:

  • 🚨 REGRESSION for regression test failures (red alert)
  • ⚡ CLIMBING for capability test failures (expected, yellow)
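
The mapping from suite type to console label can be pictured as below. The function name `failure_label` and the fallback for untagged tests are assumptions made for illustration; only the two labels and the two `suite_type` values come from this release.

```python
from typing import Optional

# Labels from the release notes, keyed by suite_type.
FAILURE_LABELS = {
    "regression": "🚨 REGRESSION",  # must-pass test failed
    "capability": "⚡ CLIMBING",     # expected, still hill climbing
}


def failure_label(suite_type: Optional[str], passed: bool) -> Optional[str]:
    """Hypothetical helper: pick the label to show for a failed test."""
    if passed:
        return None
    # Assumption: tests without a recognized suite_type get a generic label.
    return FAILURE_LABELS.get(suite_type or "", "FAILED")


print(failure_label("regression", passed=False))
print(failure_label("capability", passed=False))
print(failure_label("regression", passed=True))
```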

Full Changelog: https://github.com/hidai25/eval-view/compare/v0.1.5...v0.2.0


About hidai25/eval-view

Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.
