hidai25/eval-view

v0.2.0 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 6mo Developer Productivity


✓ No known CVEs patched


Topics

agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing

Summary

AI summary

Sequence evaluator defaults to subsequence matching and adds pass@k/pass^k reliability metrics with suite type tagging.

Full changelog

What's New

This release introduces flexible sequence evaluation modes, industry-standard reliability metrics, and test suite categorization, all inspired by Anthropic's agent evaluation best practices.

Flexible Sequence Evaluation

The sequence evaluator was too strict: exact matching penalized agents for finding valid alternative paths. It now defaults to subsequence matching, which verifies that the critical tools appear in order without failing on extra tool calls.

Three modes available:

subsequence (default): Expected tools must appear in order; extra calls are allowed
exact: Legacy strict matching
unordered: Expected tools must be called, in any order

# Per-test override
adapter_config:
  sequence_mode: unordered
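To make the difference between the three modes concrete, here is a minimal Python sketch of how such a check could work. It is not eval-view's actual implementation: the function name `matches_sequence` and the choice to treat `unordered` as simple set containment are assumptions made for illustration.

```python
from typing import Sequence


def matches_sequence(
    expected: Sequence[str],
    actual: Sequence[str],
    mode: str = "subsequence",
) -> bool:
    """Hypothetical helper: compare expected tool calls against actual ones."""
    if mode == "exact":
        # Same tools, same order, nothing extra.
        return list(expected) == list(actual)
    if mode == "subsequence":
        # `tool in remaining` consumes the iterator up to the match,
        # so each expected tool must appear after the previous one.
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    if mode == "unordered":
        # Assumption: only presence matters, not order or call count.
        return set(expected) <= set(actual)
    raise ValueError(f"unknown sequence mode: {mode!r}")


expected = ["search", "summarize"]
actual = ["search", "fetch", "summarize"]  # one extra call in the middle

print(matches_sequence(expected, actual, "subsequence"))  # extras tolerated
print(matches_sequence(expected, actual, "exact"))        # extras rejected
print(matches_sequence(expected, ["summarize", "search"], "unordered"))
```

With an extra `fetch` call in the middle, the subsequence check passes while the exact check fails, which is the behavior change this release describes.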

Reliability Metrics: pass@k & pass^k

Statistical summaries now include two industry-standard metrics:

pass@k: "Will it work if I give it a few tries?" (probability of at least 1 success)
pass^k: "Will it work reliably every time?" (probability of ALL trials succeeding)

Reliability Metrics:
  pass@10:       99.9% (usually finds a solution)
  pass^10:       2.8% (unreliable)
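The two definitions above can be sketched numerically. This is a simplified model that assumes independent trials with a fixed per-trial success rate `p`; the release notes do not say which estimator eval-view uses, and the function names below are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


p = 0.7  # assumed per-trial success rate, for illustration only
print(f"pass@10: {pass_at_k(p, 10):.4%}")
print(f"pass^10: {pass_hat_k(p, 10):.1%}")
```

Under this model, a per-trial success rate of about 0.7 gives a pass^10 of roughly 2.8%, in line with the second line of the sample output: an agent that succeeds most of the time on a single try can still be very unlikely to succeed ten times in a row.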

Suite Types: Capability vs Regression

Tag tests to distinguish expected failures from critical regressions:

name: complex-reasoning
suite_type: capability  # Expected to fail sometimes (hill climbing)

name: login-flow
suite_type: regression  # Must pass (safety net)

Console output now shows:

🚨 REGRESSION for regression test failures (red alert)
⚡ CLIMBING for capability test failures (expected, yellow)
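One way to think about the tagging is as a mapping from `suite_type` to a console label plus a decision about whether a failure should block CI. The sketch below is an illustration only: `failure_label` and `blocks_ci` are hypothetical names, and the assumption that only regression failures block a build is inferred from the "must pass" comment above, not confirmed by the release notes.

```python
# Labels taken from the console output described in this release.
FAILURE_LABELS = {
    "regression": "🚨 REGRESSION",
    "capability": "⚡ CLIMBING",
}


def failure_label(suite_type: str) -> str:
    """Hypothetical helper: console label for a failed test."""
    try:
        return FAILURE_LABELS[suite_type]
    except KeyError:
        raise ValueError(f"unknown suite_type: {suite_type!r}") from None


def blocks_ci(failed_suite_types: list[str]) -> bool:
    """Assumption: only regression failures should fail the build."""
    return any(s == "regression" for s in failed_suite_types)


failures = ["capability", "regression"]
for suite_type in failures:
    print(failure_label(suite_type))
print("block CI:", blocks_ci(failures))
```

Separating the label from the gating decision keeps capability failures visible in the output without turning expected hill-climbing misses into broken builds.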

Full Changelog: https://github.com/hidai25/eval-view/compare/v0.1.5...v0.2.0


About hidai25/eval-view

Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.
