This release adds 5 notable features for engineering teams evaluating rollout.
Published 4mo
Developer Productivity
✓ No known CVEs patched
Topics
agent-benchmark
agent-evaluation
agentic-ai
ai-agents
anthropic
autogen
cli
crewai
evaluation
langchain-agent
langgraph
llm
mcp
openai-assistants
pytest
python
regression-testing
testing
Summary
AI summary: Sequence evaluation now supports partial credit scoring.
Full changelog
What's New
CLI Statistical Mode Flags
- --runs N flag: Run each test N times for statistical evaluation (pass@k metrics)
- --pass-rate flag: Set the required pass rate for --runs mode (default: 0.8)
- --difficulty filter: Filter tests by difficulty level
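The statistical mode described above can be pictured as repeating a test and comparing the observed pass rate against a threshold. The following is a minimal sketch of that idea only, not the tool's implementation: `run_repeated` and `meets_pass_rate` are hypothetical names, and the only value taken from the changelog is the 0.8 default.

```python
from typing import Callable, List


def run_repeated(test: Callable[[], bool], runs: int) -> List[bool]:
    """Call a test `runs` times and collect its pass/fail outcomes.

    Hypothetical helper: stands in for what a --runs N mode might do.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return [bool(test()) for _ in range(runs)]


def meets_pass_rate(outcomes: List[bool], required: float = 0.8) -> bool:
    """Return True when the fraction of passing runs reaches `required`.

    The 0.8 default mirrors the documented --pass-rate default; the
    comparison rule (>=) is an assumption.
    """
    if not outcomes:
        raise ValueError("no outcomes to evaluate")
    return sum(outcomes) / len(outcomes) >= required


if __name__ == "__main__":
    # A deterministic stand-in for a flaky agent test: fails on the third run.
    scripted = iter([True, True, False, True, True])
    outcomes = run_repeated(lambda: next(scripted), runs=5)
    print(outcomes, meets_pass_rate(outcomes))
```

Under this reading, a test that passes 4 of 5 runs clears the default threshold, while one that passes 3 of 5 does not.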
Difficulty Levels for Test Cases
- New difficulty field: trivial, easy, medium, hard, expert
- Console reporter shows difficulty column and breakdown
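To illustrate how a difficulty field could drive both filtering and a per-level breakdown, here is a small sketch. The five level names come from the changelog; `CaseResult`, `filter_by_difficulty`, and `breakdown` are hypothetical names chosen for this example and are not the project's API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Level names as listed in the release notes.
DIFFICULTY_LEVELS = ("trivial", "easy", "medium", "hard", "expert")


@dataclass
class CaseResult:
    """Hypothetical record of one test case outcome."""
    name: str
    difficulty: str
    passed: bool


def filter_by_difficulty(results: List[CaseResult], level: str) -> List[CaseResult]:
    """Keep only the results tagged with the given difficulty level."""
    if level not in DIFFICULTY_LEVELS:
        raise ValueError(f"unknown difficulty: {level!r}")
    return [r for r in results if r.difficulty == level]


def breakdown(results: List[CaseResult]) -> Dict[str, Tuple[int, int]]:
    """Map each level that appears to a (passed, total) pair."""
    totals = Counter(r.difficulty for r in results)
    passed = Counter(r.difficulty for r in results if r.passed)
    return {
        level: (passed[level], totals[level])
        for level in DIFFICULTY_LEVELS
        if totals[level]
    }


if __name__ == "__main__":
    results = [
        CaseResult("greets user", "easy", True),
        CaseResult("books flight", "hard", False),
        CaseResult("cancels order", "hard", True),
    ]
    for level, (ok, total) in breakdown(results).items():
        print(f"{level}: {ok}/{total} passed")
```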
Partial Credit for Sequence Evaluation
- Sequence scoring now uses partial credit instead of binary pass/fail
- Example: 3/5 expected steps completed = 60% score (not 0%)
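The 3/5 example above can be reproduced with a short sketch. The changelog does not say how steps are matched, so this version assumes a greedy in-order match: each expected step is searched for in the actual sequence after the previous match, and the score is the matched fraction. `sequence_score` and the step names are hypothetical.

```python
from typing import List


def sequence_score(expected: List[str], actual: List[str]) -> float:
    """Fraction of expected steps found in `actual`, in order.

    Greedy in-order matching is an assumption made for illustration;
    the tool's real matching rule may differ.
    """
    if not expected:
        return 1.0  # Nothing was required, so treat it as fully satisfied.
    matched = 0
    position = 0  # Search for the next step only after the last match.
    for step in expected:
        try:
            found_at = actual.index(step, position)
        except ValueError:
            continue  # Step missing: no credit, keep checking later steps.
        matched += 1
        position = found_at + 1
    return matched / len(expected)


if __name__ == "__main__":
    expected = ["search", "fetch", "parse", "summarize", "respond"]
    actual = ["search", "fetch", "parse"]
    print(sequence_score(expected, actual))  # 0.6
```

A binary pass/fail rule would score the same run as 0, which is the behavior the release replaces.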
Fixed
--runs CLI flag is now properly implemented (was documented but missing)
About hidai25/eval-view
Regression testing framework for AI agents. Save golden baselines, detect behavioral drift, and block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.