This release includes breaking changes that platform teams should review before upgrading.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: v0.7.0 introduces `_PROMPT_ONLY_MODELS` and `_NO_RECOMMENDED_SAMPLING_MODELS` flags for `batch_eval`, renames eval dataset files to a versioned format, and adds `--html`/`--markdown` output options in `report.py`.
Why it matters: Step-enforcement, prerequisite, and unknown-tool errors now arrive as `role="tool"` messages with bracketed error prefixes instead of `role="user"` nudges, so code that inspects the message stream may need to be updated. Committed eval dataset files are also renamed to `eval_results_vX.Y.Z.jsonl`, which affects any script that refers to a dataset by file name.
Summary
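AI summary: A mixed release ([0.7.0], dated 2026-05-22) covering breaking changes, features, bug fixes, and documentation refactors, including new Q4/Q8 model entries and one known limitation.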
Changes in this release
| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Breaking | Medium | Changes error reporting: step enforcement and prerequisite violations now emit tool-channel messages with `[StepEnforcementError]` / `[PrereqError]`. | llm_adapter@2026-05-25 | high | — |
| Breaking | Medium | Unknown-tool handling now replies with `[UnknownToolError]` on the tool channel instead of user nudges. | llm_adapter@2026-05-25 | high | — |
| Feature | Medium | Adds `_PROMPT_ONLY_MODELS` flag to `batch_eval` to skip native FC for unsupported models. | llm_adapter@2026-05-25 | high | — |
| Feature | Medium | Adds `_NO_RECOMMENDED_SAMPLING_MODELS` flag to `batch_eval` for models lacking sampling guidance. | llm_adapter@2026-05-25 | high | — |
| Feature | Medium | Introduces `MODEL_REGISTRY.md` documenting all known models with status tiers. | llm_adapter@2026-05-25 | low | — |
| Feature | Medium | Versioned eval dataset files renamed to `eval_results_vX.Y.Z.jsonl`. | llm_adapter@2026-05-25 | low | — |
| Feature | Medium | Adds `--html` and `--markdown` flags to `report.py` for output formats. | llm_adapter@2026-05-25 | low | — |
| Feature | Low | Adds Granite 4.1 8B, Gemma-4-E4B, and phi-4 to the eval lineup. | granite4.1:30b@2026-05-25-audit | low | — |
| Feature | Low | Refreshes eval lineup by cutting weak models (Llama 3.1 8B, Mistral 7B v0.3, etc.) and adding stronger ones. | granite4.1:30b@2026-05-25-audit | low | — |
| Feature | Low | Updates eval dataset to `eval_results_v0.7.0.jsonl` with 96,200 rows and performance changes. | granite4.1:30b@2026-05-25-audit | low | — |
| Feature | Low | Modifies README opener to lead with contract before eval pitch and adds "What forge isn't" section. | granite4.1:30b@2026-05-25-audit | low | — |
| Feature | Low | Regenerates dashboard and markdown views against the v0.7.0 dataset, reshuffling leaderboard rankings. | granite4.1:30b@2026-05-25-audit | low | — |
| Feature | Low | Updates BACKEND_SETUP documentation with Anthropic section using `pip install "forge-guardrails[anthropic]"`. | granite4.1:30b@2026-05-25-audit | low | — |
| Bugfix | Medium | Updates `WorkflowRunner` docstring and tree with missing kwargs, error types, and message types. | llm_adapter@2026-05-25 | high | — |
| Bugfix | Medium | Corrects `CompactStrategy` and `ContextManager` signatures to use `budget_tokens` and `context_thresholds` callbacks. | llm_adapter@2026-05-25 | high | — |
| Bugfix | Medium | Adds missing sampling kwargs and `chat_template_kwargs` to `LlamafileClient` constructor documentation. | llm_adapter@2026-05-25 | high | — |
| Bugfix | Medium | Adds `MODEL_FAMILIES` entries for `granite-4.1-8b` and `phi-4-Q4_K_M` in `report.py` for correct rollups. | granite4.1:30b@2026-05-25-audit | low | — |
| Bugfix | Low | Updates WORKFLOW.md flowchart node names, edges, and compaction-priority table for new error wire shapes. | granite4.1:30b@2026-05-25-audit | low | — |
| Refactor | Low | Rewrites MODEL_GUIDE, ARCHITECTURE, BACKEND_SETUP documentation sections for conciseness and added Anthropic setup instructions. | granite4.1:30b@2026-05-25-audit | low | — |
| Refactor | Low | Removes stale `bfcl/` reference from WORKFLOW.md module diagram. | granite4.1:30b@2026-05-25-audit | low | — |
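The versioned file naming noted in the table (`eval_results_vX.Y.Z.jsonl`) sorts incorrectly as plain text once a version component reaches two digits, so a script that picks the newest dataset may want to compare versions numerically. The following is a minimal sketch under that assumption. `parse_version` and `latest_dataset` are hypothetical helpers, not part of forge, and the `v0.10.0` file name is invented purely to show the sorting issue.

```python
import re
from typing import List, Optional, Tuple

# Matches the naming pattern quoted in the release notes.
_NAME = re.compile(r"^eval_results_v(\d+)\.(\d+)\.(\d+)\.jsonl$")


def parse_version(filename: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) for a versioned dataset name, else None."""
    match = _NAME.match(filename)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def latest_dataset(filenames: List[str]) -> Optional[str]:
    """Pick the file with the highest numeric version, ignoring other names."""
    versioned = [(parse_version(name), name) for name in filenames]
    versioned = [(ver, name) for ver, name in versioned if ver is not None]
    return max(versioned)[1] if versioned else None


if __name__ == "__main__":
    # "eval_results_v0.10.0.jsonl" is a made-up future name for illustration.
    names = [
        "eval_results_v0.6.0.jsonl",
        "eval_results_v0.7.0.jsonl",
        "eval_results_v0.10.0.jsonl",
        "notes.md",
    ]
    print(latest_dataset(names))
```

Comparing integer tuples rather than strings is the design point here: a plain `sorted(names)[-1]` would rank `v0.7.0` above `v0.10.0`.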
Full changelog
[0.7.0] — 2026-05-22
Added
- Granite 4.1 8B + Gemma-4-E4B + phi-4 — added to the eval lineup. Granite 4.1 mirrors the IBM greedy-decoding convention pending formal published sampling guidance; phi-4 has no formal sampling recommendation and falls through to backend defaults.
- `_PROMPT_ONLY_MODELS` in `batch_eval` — skips native FC for models lacking training for the OpenAI `tool_calls` schema (currently: phi-4, verified via curl 2026-05-14).
- `_NO_RECOMMENDED_SAMPLING_MODELS` in `batch_eval` — runs `recommended_sampling=False` for models without formal sampling guidance from any official source, so the eval doesn't raise `UnsupportedModelError` on them.
- `MODEL_REGISTRY.md` — new doc enumerating every model forge knows about, classified as Current (in v0.7.0 eval), Retired (cut from current eval), or Unpublished (sampling params staged, no published eval). Sampling values, source links, identity-key conventions.
- Versioned eval datasets — committed dataset files renamed to `eval_results_vX.Y.Z.jsonl`. Prior versions kept in LFS for reproducibility.
- `report.py` `--html` + `--markdown` flags surfaced in README and EVAL_GUIDE examples.
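To make the two model sets in the Added list more concrete, the sketch below shows one way such sets could steer a per-model run: prompt-based tool calling for models in `_PROMPT_ONLY_MODELS`, and `recommended_sampling=False` for models in `_NO_RECOMMENDED_SAMPLING_MODELS`. Only the two set names, `recommended_sampling`, and phi-4's prompt-only status come from the notes. The `RunPlan` type, the `plan_run` helper, the mode labels, the model keys, and phi-4's membership in the second set are assumptions made for illustration and do not reflect forge's actual `batch_eval` code.

```python
from dataclasses import dataclass

# Assumed contents. The changelog names phi-4 as prompt-only; listing it in the
# second set is a guess based on "no formal sampling recommendation".
_PROMPT_ONLY_MODELS = frozenset({"phi-4"})
_NO_RECOMMENDED_SAMPLING_MODELS = frozenset({"phi-4"})


@dataclass(frozen=True)
class RunPlan:
    """Hypothetical per-model settings for one eval run."""
    model: str
    tool_mode: str               # "native" or "prompt" (illustrative labels)
    recommended_sampling: bool


def plan_run(model: str) -> RunPlan:
    """Derive run settings for a model from the two membership sets."""
    tool_mode = "prompt" if model in _PROMPT_ONLY_MODELS else "native"
    recommended = model not in _NO_RECOMMENDED_SAMPLING_MODELS
    return RunPlan(model, tool_mode, recommended)


if __name__ == "__main__":
    for name in ("phi-4", "granite-4.1-8b"):
        print(plan_run(name))
```

Keeping the exceptions in small named sets, rather than scattering per-model conditionals, means a newly supported model only needs a one-line change.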
Changed
- Step enforcement + prerequisite violations surface on the tool channel. Previously, `WorkflowRunner` emitted these as trailing `role="user"` nudges after the assistant `tool_call`. v0.7.0 emits one `role="tool"` message per blocked call with `[StepEnforcementError]` / `[PrereqError]` prefixes — the canonical "tool call failed, try again" wire shape OpenAI-tool-trained models are pretrained on. Surfaced by v4 forge-code dogfooding (gpt-oss-120b reliably exhausted prerequisite-violation budget under the old shape).
- Unknown-tool retry on the tool channel. Same refactor applied to `ResponseValidator` unknown-tool path: `[UnknownToolError]` tool-error reply instead of a user nudge.
- Eval lineup refresh — cut Llama 3.1 8B, Mistral 7B v0.3, Mistral Nemo 12B, Granite 4.0 (h-micro / h-tiny). All scored bare <30% on the v0.6.0 dataset — too weak to be informative, superseded by Ministral-3 / Granite 4.1 / phi-4. Sampling defaults retained in `sampling_defaults.py` for backward compatibility (see MODEL_REGISTRY Retired tier).
- Eval dataset — `eval_results_v0.7.0.jsonl` (96,200 rows, 74 cells; rig-01). Apples-to-apples delta on 21 common configs vs v0.6.0: +0.7pt overall, -1.2pt advanced_reasoning — both within CI. Published-leaderboard floor lifts +16.9pt via composition (weak-model cuts).
- Dashboard + markdown views regenerated against v0.7.0 dataset. Top of leaderboard reshuffled: Ministral-3 14B Reasoning Q4 LS/N now #1 at 84.5% (was Ministral-3 8B Instruct Q8 LS/P at 86.5% in v0.6.0; now #3 at 84.4%).
- MODEL_GUIDE rewrite — trimmed to opinions + rationale (333 → 145 lines). Full leaderboard, OG-18 100% list, hard suite top-5, models-to-avoid tables moved to the dashboard / markdown views. Sampling-parameters and "backend matters" sections retained. Native-vs-prompt heuristic corrected: not workload-driven, sensitivity is per-family.
- ARCHITECTURE rebuild — cut signature restating (1701 → 165 lines); the doc now covers design principles, surface modes, guardrail rationale, compaction priority rationale, respond-tool rationale, sampling opt-in semantics. Source is authoritative for class signatures; WORKFLOW.md owns the diagrams; ADRs own past decisions.
- BACKEND_SETUP rewrite — cut model-pick prose, Windows-specific install steps, Ollama Modelfile tutorial, llamafile distribution explainer, per-backend "run the eval" subsections, VRAM tables (360 → 135 lines). Per-backend section now: boot command + flag table + curl smoke-test + forge client snippet. Added Anthropic section using `pip install "forge-guardrails[anthropic]"`.
- README opener — leads with the contract (any tools, any order; structure opt-in via `required_steps` / `prerequisites` / `terminal_tool`) before the eval pitch. New "What forge isn't" (not an agent orchestrator, not a coding harness) preempts the conflations that surfaced on HN. Three-ways list reordered with proxy first (most popular entry point). Quick Start swapped from Ollama to llama-server.
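The tool-channel change described in the first two Changed entries can be pictured with plain OpenAI-style chat message dictionaries. The sketch below builds one `role="tool"` reply per blocked call and shows how a consumer might recognize such replies by prefix. The three bracketed error names come from the notes, and `role`, `tool_call_id`, and `content` are the standard OpenAI tool-message fields. The helper names, the input shape of `blocked`, and the message wording are hypothetical and are not forge's actual implementation.

```python
from typing import Dict, List

# Prefixes named in the release notes for guardrail failures on the tool channel.
TOOL_ERROR_PREFIXES = ("[StepEnforcementError]", "[PrereqError]", "[UnknownToolError]")


def blocked_call_replies(blocked: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Build one role="tool" reply per blocked tool call.

    Each input item is assumed to carry the tool call id, an error name,
    and a human-readable detail string.
    """
    return [
        {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": f"[{call['error']}] {call['detail']}",
        }
        for call in blocked
    ]


def is_guardrail_error(message: Dict[str, str]) -> bool:
    """True when a message is a tool-channel reply carrying a known error prefix."""
    return message.get("role") == "tool" and message.get("content", "").startswith(
        TOOL_ERROR_PREFIXES
    )


if __name__ == "__main__":
    replies = blocked_call_replies(
        [{"id": "call_1", "error": "PrereqError", "detail": "run read_file first"}]
    )
    print(replies[0])
    print(is_guardrail_error(replies[0]))
```

Code written against the earlier behavior, which looked for trailing `role="user"` nudges, would not match these messages, which is why the change is listed as breaking.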
Fixed
- WorkflowRunner docstring + tree — added missing `retry_nudge` kwarg, `cancel_event` parameter on `run()`, `PREREQUISITE_NUDGE` + `CONTEXT_WARNING` message types, `MaxIterationsError` / `PrerequisiteError` / `StepEnforcementError` / `WorkflowCancelledError` in Raises lists across docs.
- CompactStrategy + ContextManager signatures in docs — `trigger_tokens` → `budget_tokens` (the strategy owns its own threshold logic now); `compact_threshold` → `context_thresholds` + `on_context_threshold` callbacks.
- `LlamafileClient` constructor docs — added missing sampling kwargs (`top_p`, `top_k`, `min_p`, `repeat_penalty`, `presence_penalty`), `chat_template_kwargs`, `slot_id`.
- MODEL_FAMILIES in `report.py` — added entries for `granite-4.1-8b` (Q4/Q8) and `phi-4-Q4_K_M` so cross-backend rollups in `by-backend.md` group these new models correctly.
- WORKFLOW.md agentic-loop flowchart — node names + edges updated to reflect the tool-error wire shape (`STEP_TOOL_ERROR`, `PREREQ_TOOL_ERROR`, `UNKNOWN_TOOL_ERROR`); compaction-priority table fixed (`step_nudge` and `prerequisite_nudge` are `role=tool`, `retry_nudge` remains `role=user`).
- Stale `bfcl/` reference removed from WORKFLOW.md module diagram (directory was removed pre-v0.7.0; ADR-009 retained as historical artifact).
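Because the earlier docs showed `trigger_tokens` and `compact_threshold`, code copied from them may still use those names. A small scan like the one below could flag such call sites. This is a sketch only: the name mapping is taken from the Fixed entry above, while `find_stale_names` and the constructor calls in the sample are hypothetical and do not assert anything about the real `CompactStrategy` or `ContextManager` signatures.

```python
import re
from typing import List, Tuple

# Old documented name -> current name, as listed in the Fixed entry.
RENAMED = {
    "trigger_tokens": "budget_tokens",
    "compact_threshold": "context_thresholds",
}

# Whole-word match so that "context_thresholds" is never reported as stale.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMED)) + r")\b")


def find_stale_names(source: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, old_name, new_name) for each stale name in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMED[old]))
    return hits


if __name__ == "__main__":
    # Illustrative call shapes only; argument values are made up.
    sample = (
        "s = CompactStrategy(trigger_tokens=4000)\n"
        "m = ContextManager(compact_threshold=0.8)\n"
        "ok = CompactStrategy(budget_tokens=4000)\n"
    )
    for lineno, old, new in find_stale_names(sample):
        print(f"line {lineno}: {old} -> {new}")
```

The scan only reports locations. The second rename also introduces `on_context_threshold` callbacks, so those call sites likely need a manual rewrite rather than a text substitution.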
Known limitations
- Anthropic numbers not re-measured in v0.7.0. The Anthropic ablation matrix (~$272 to run) was not re-executed for v0.7.0. Numbers cited in any v0.7.0 doc are from the v0.6.0 dataset (`eval_results_v0.6.0.jsonl`). Tool-error-channel changes affect frontier models' wire on guardrail-fire paths too, but expected movement is small.