- Update any code referencing `record_observation` to use `think`
- Migrate Skillbook v1 usage to the new v2 schema; legacy aliases are no longer available
- Skillbook v1 legacy aliases removed — only Skillbook v2 schema remains
- `record_observation` renamed to `think`
- `RecursiveAgent` core abstraction extracted: a generic recursive PydanticAI agent with sandbox and microcompaction
- RR collapsed into a single `RRStep`, making it a true recursive loop
- Agentic SkillManager: initial tool-calling loop with atomic mutation tools (`add_skill`, `update_skill`, `remove_skill`, `tag_skill`) and read-only tools
Full changelog
This release merges two release lines that had not yet shipped to PyPI: the 0.11.0 architectural rewrite and the 0.12.0 SkillManager hardening. There is no separate v0.11.0 tag because v0.12.0 is a superset of it.
0.11.0 — Architectural rewrite
- `RecursiveAgent` core abstraction extracted from RR (`ace/core/recursive_agent.py`). Generic recursive PydanticAI agent with sandbox, microcompaction, default tool set, depth-aware sub-agent registration.
- RR collapsed into a single `RRStep`. Orchestrator/worker split, batch machinery, and `AttachInsightSourcesStep` removed. RR is now a true recursive loop.
- Skillbook v2: full schema rewrite, section-grouped storage (`context`/`harness`), richer `InsightSource` provenance, BM25-backed retrieval (`rank-bm25` runtime dep). `Skillbook.as_prompt()` now returns markdown; `python-toon` dropped.
- Agentic SkillManager (first cut): tool-calling loop (`ace/implementations/sm_tools.py`) with atomic mutation tools (`add_skill`, `update_skill`, `remove_skill`, `tag_skill`) and read-only tools (`search_skills`, `read_skill`).
- Reflector skillbook tools: Reflector can introspect / propose updates from inside the recursive loop.
- Anthropic prompt caching enabled by default for RR; `cache_read_tokens`/`cache_write_tokens` forwarded in run metadata.
- Logfire spans around recursive agent sessions.
- Online / offline mode in the ACE runner.
- `record_observation` renamed to `think`.
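To illustrate what BM25-backed retrieval over skill text does, here is a minimal stdlib-only sketch of Okapi BM25 scoring. It is not the project's code and does not use `rank-bm25`; the idf smoothing shown is one common variant and may differ from that library's. The function name `bm25_scores`, the parameters `k1` and `b`, and the sample skill texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf (the "+1" variant), which is always positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical skill texts, not taken from a real skillbook.
skills = [
    "confirm order id before issuing a refund",
    "ask for the shipping address when it is missing",
    "verify user identity before changing payment method",
]
tokenized = [s.split() for s in skills]
scores = bm25_scores("refund order".split(), tokenized)
best = max(range(len(skills)), key=scores.__getitem__)
print(skills[best])
```

Only the first skill shares terms with the query, so it ranks first and the other two score zero.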
0.12.0 — SM hardening
- Cross-trace generalization gate (four-criterion: ≥3 instances across ≥2 domains, named slot, no API-specific params in action, verifiable runtime trigger). Backed by skill_generalization.md (14 cited sources).
- Action-equivalence rule: splits on action, not trigger surface.
- Atomicity rule for `insight`: one trigger + one action; explicit good/bad shape examples.
- ICL-grounded insight format drawn from icl_skill_formatting.md: 15-50 word cap, imperative voice, positive framing default.
- Evidence-only tagging: SM no longer iterates `injected_skill_ids`; tags only skills the reflection actually implicates.
- Broaden-via-comparison for UPDATE: same root cause in different niches → broaden `issue`, don't duplicate.
- Prompt caching for SM via `CachePoint(ttl="5m")`, mirroring RR.
- Hard removal cap removed: `harmful_count >= 3` no longer auto-REMOVES skills.
- `update_skills` signature: `source` is optional; `SkillbookView` dropped from parameters.
- Skillbook v1 legacy aliases removed: v2 is the only schema.
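The four-criterion generalization gate and the 15-50 word cap listed above can be read as simple predicates. The sketch below is one possible reading, not the SkillManager's implementation: in the release these rules may live in prompts rather than code. The `CandidateSkill` class, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateSkill:
    """Hypothetical shape for a proposed skill; not the project's model."""
    instance_domains: list[str]   # one entry per supporting trace instance
    has_named_slot: bool
    action_has_api_params: bool
    trigger_is_verifiable: bool


def passes_generalization_gate(c: CandidateSkill) -> bool:
    """All four criteria must hold for the skill to be accepted."""
    return (
        len(c.instance_domains) >= 3           # at least 3 instances
        and len(set(c.instance_domains)) >= 2  # across at least 2 domains
        and c.has_named_slot
        and not c.action_has_api_params        # no API-specific params in action
        and c.trigger_is_verifiable            # verifiable runtime trigger
    )


def within_word_cap(insight: str, lo: int = 15, hi: int = 50) -> bool:
    """Check the 15-50 word cap using whitespace-separated words."""
    return lo <= len(insight.split()) <= hi


broad = CandidateSkill(["retail", "retail", "airline"], True, False, True)
narrow = CandidateSkill(["retail", "retail", "retail"], True, False, True)
print(passes_generalization_gate(broad), passes_generalization_gate(narrow))
```

Here `narrow` fails only because all three instances come from a single domain.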
End-to-end retail result (Haiku 4.5)
| Metric | Value |
|---|---|
| Baseline pass@1 | 45.0% |
| With learned skillbook | 67.5% |
| Δ pass@1 | +22.5 pp (12 improved, 3 regressed) |
| Skillbook size | 35 skills |
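As a consistency check on the table, the improved/regressed counts can be reconciled with the percentage-point delta. This sketch assumes one trial per task and that "improved" and "regressed" count tasks whose pass/fail outcome flipped. The implied task count is inferred from the reported numbers, not stated in the release.

```python
baseline, learned = 0.450, 0.675   # pass@1 before and after, from the table
improved, regressed = 12, 3        # task flips, from the table

net_tasks = improved - regressed   # net tasks gained
delta = learned - baseline         # change in pass@1 as a fraction

# Inferred (not reported): the task count at which net_tasks yields this delta.
implied_tasks = round(net_tasks / delta)

baseline_passes = round(baseline * implied_tasks)
learned_passes = round(learned * implied_tasks)
print(implied_tasks, baseline_passes, learned_passes)
```

Under those assumptions the figures are mutually consistent: the pass counts differ by exactly the net number of flipped tasks.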
Tau-bench fix
`evaluation_type=ALL_WITH_NL_ASSERTIONS` is now set on both the `run_task` and `run_tasks` call sites in `ace-eval/src/ace_eval/e2e/benchmarks/tau_bench.py`. Retail and any future benchmark with `NL_ASSERTION` in `reward_basis` now produce real reward numbers instead of crashing in reward computation.
See CHANGELOG.md for full details.