Forge

v0.7.0 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 2mo AI Agents & Assistants

View tool

✓ No known CVEs patched

Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

agentic-ai agentic-workflow agents function-calling llama-cpp llamafile

+5 more

llm ollama python self-hosted tool-calling

ReleasePort's take

Light signal

editorial:auto 2mo

v0.7.0 introduces _PROMPT_ONLY_MODELS and _NO_RECOMMENDED_SAMPLING_MODELS flags for batch_eval, renames eval dataset files to a versioned format, and adds --html/--markdown output options in report.py.

Why it matters: The release adds _PROMPT_ONLY_MODELS and _NO_RECOMMENDED_SAMPLING_MODELS flags to batch_eval, renames eval dataset files to a versioned format (eval_results_vX.Y.Z.jsonl), and expands CLI output options with --html and --markdown in report.py.

Summary

AI summary

Updates 0.7.0] — 2026-05-22, Known limitations, and Q4/Q8 across a mixed release.

Changes in this release

Type	Severity	Summary	CVE
Breaking	Medium	Changes error reporting: step enforcement and prerequisite violations now emit tool‑channel messages with [StepEnforcementError] / [PrereqError]. Changes error reporting: step enforcement and prerequisite violations now emit tool‑channel messages with [StepEnforcementError] / [PrereqError]. Source: llm_adapter@2026-05-25 Confidence: high	—
Breaking	Medium	Unknown‑tool handling now replies with [UnknownToolError] on the tool channel instead of user nudges. Unknown‑tool handling now replies with [UnknownToolError] on the tool channel instead of user nudges. Source: llm_adapter@2026-05-25 Confidence: high	—
Feature
Feature	Medium	Adds _PROMPT_ONLY_MODELS flag to batch_eval to skip native FC for unsupported models. Adds _PROMPT_ONLY_MODELS flag to batch_eval to skip native FC for unsupported models. Source: llm_adapter@2026-05-25 Confidence: high	—
Feature	Medium	Adds _NO_RECOMMENDED_SAMPLING_MODELS flag to batch_eval for models lacking sampling guidance. Adds _NO_RECOMMENDED_SAMPLING_MODELS flag to batch_eval for models lacking sampling guidance. Source: llm_adapter@2026-05-25 Confidence: high	—
Feature	Medium	Introduces MODEL_REGISTRY.md documenting all known models with status tiers. Introduces MODEL_REGISTRY.md documenting all known models with status tiers. Source: llm_adapter@2026-05-25 Confidence: low	—
Feature	Medium	Versioned eval dataset files renamed to eval_results_vX.Y.Z.jsonl. Versioned eval dataset files renamed to eval_results_vX.Y.Z.jsonl. Source: llm_adapter@2026-05-25 Confidence: low	—
Feature	Medium	Adds --html and --markdown flags to report.py for output formats. Adds --html and --markdown flags to report.py for output formats. Source: llm_adapter@2026-05-25 Confidence: low	—
Feature	Low	Adds Granite 4.1 8B, Gemma-4-E4B, and phi-4 to the eval lineup. Adds Granite 4.1 8B, Gemma-4-E4B, and phi-4 to the eval lineup. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Feature	Low	Refreshes eval lineup by cutting weak models (Llama 3.1 8B, Mistral 7B v0.3, etc.) and adding stronger ones. Refreshes eval lineup by cutting weak models (Llama 3.1 8B, Mistral 7B v0.3, etc.) and adding stronger ones. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Feature	Low	Updates eval dataset to eval_results_v0.7.0.jsonl with 96,200 rows and performance changes. Updates eval dataset to eval_results_v0.7.0.jsonl with 96,200 rows and performance changes. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Feature	Low	Modifies README opener to lead with contract before eval pitch and adds "What forge isn't" section. Modifies README opener to lead with contract before eval pitch and adds "What forge isn't" section. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Feature	Low	Regenerates dashboard and markdown views against the v0.7.0 dataset, reshuffling leaderboard rankings. Regenerates dashboard and markdown views against the v0.7.0 dataset, reshuffling leaderboard rankings. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Feature	Low	Updates BACKEND_SETUP documentation with Anthropic section using pip install "forge-guardrails[anthropic]". Updates BACKEND_SETUP documentation with Anthropic section using pip install "forge-guardrails[anthropic]". Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Bugfix
Bugfix	Medium	Updates WorkflowRunner docstring and tree with missing kwargs, error types, and message types. Updates WorkflowRunner docstring and tree with missing kwargs, error types, and message types. Source: llm_adapter@2026-05-25 Confidence: high	—
Bugfix	Medium	Corrects CompactStrategy and ContextManager signatures to use budget_tokens and context_thresholds callbacks. Corrects CompactStrategy and ContextManager signatures to use budget_tokens and context_thresholds callbacks. Source: llm_adapter@2026-05-25 Confidence: high	—
Bugfix	Medium	Adds missing sampling kwargs and chat_template_kwargs to LlamafileClient constructor documentation. Adds missing sampling kwargs and chat_template_kwargs to LlamafileClient constructor documentation. Source: llm_adapter@2026-05-25 Confidence: high	—
Bugfix	Medium	Adds MODEL_FAMILIES entries for granite-4.1-8b and phi-4-Q4_K_M in report.py for correct rollups. Adds MODEL_FAMILIES entries for granite-4.1-8b and phi-4-Q4_K_M in report.py for correct rollups. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Bugfix	Low	Updates WORKFLOW.md flowchart node names, edges, and compaction-priority table for new error wire shapes. Updates WORKFLOW.md flowchart node names, edges, and compaction-priority table for new error wire shapes. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Refactor	Low	Rewrites MODEL_GUIDE, ARCHITECTURE, BACKEND_SETUP documentation sections for conciseness and added Anthropic setup instructions. Rewrites MODEL_GUIDE, ARCHITECTURE, BACKEND_SETUP documentation sections for conciseness and added Anthropic setup instructions. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—
Refactor	Low	Removes stale bfcl/ reference from WORKFLOW.md module diagram. Removes stale bfcl/ reference from WORKFLOW.md module diagram. Source: granite4.1:30b@2026-05-25-audit Confidence: low	—

Full changelog

[0.7.0] — 2026-05-22

Added

Granite 4.1 8B + Gemma-4-E4B + phi-4 — added to the eval lineup. Granite 4.1 mirrors the IBM greedy-decoding convention pending formal published sampling guidance; phi-4 has no formal sampling recommendation and falls through to backend defaults.
_PROMPT_ONLY_MODELS in batch_eval — skips native FC for models lacking training for the OpenAI tool_calls schema (currently: phi-4, verified via curl 2026-05-14).
_NO_RECOMMENDED_SAMPLING_MODELS in batch_eval — runs recommended_sampling=False for models without formal sampling guidance from any official source, so the eval doesn't raise UnsupportedModelError on them.
MODEL_REGISTRY.md — new doc enumerating every model forge knows about, classified as Current (in v0.7.0 eval), Retired (cut from current eval), or Unpublished (sampling params staged, no published eval). Sampling values, source links, identity-key conventions.
Versioned eval datasets — committed dataset files renamed to eval_results_vX.Y.Z.jsonl. Prior versions kept in LFS for reproducibility.
report.py --html + --markdown flags surfaced in README and EVAL_GUIDE examples.

Changed

Step enforcement + prerequisite violations surface on the tool channel. Previously, WorkflowRunner emitted these as trailing role="user" nudges after the assistant tool_call. v0.7.0 emits one role="tool" message per blocked call with [StepEnforcementError] / [PrereqError] prefixes — the canonical "tool call failed, try again" wire shape OpenAI-tool-trained models are pretrained on. Surfaced by v4 forge-code dogfooding (gpt-oss-120b reliably exhausted prerequisite-violation budget under the old shape).
Unknown-tool retry on the tool channel. Same refactor applied to ResponseValidator unknown-tool path: [UnknownToolError] tool-error reply instead of a user nudge.
Eval lineup refresh — cut Llama 3.1 8B, Mistral 7B v0.3, Mistral Nemo 12B, Granite 4.0 (h-micro / h-tiny). All scored bare <30% on the v0.6.0 dataset — too weak to be informative, superseded by Ministral-3 / Granite 4.1 / phi-4. Sampling defaults retained in sampling_defaults.py for backward compatibility (see MODEL_REGISTRY Retired tier).
Eval dataset — eval_results_v0.7.0.jsonl (96,200 rows, 74 cells; rig-01). Apples-to-apples delta on 21 common configs vs v0.6.0: +0.7pt overall, -1.2pt advanced_reasoning — both within CI. Published-leaderboard floor lifts +16.9pt via composition (weak-model cuts).
Dashboard + markdown views regenerated against v0.7.0 dataset. Top of leaderboard reshuffled: Ministral-3 14B Reasoning Q4 LS/N now #1 at 84.5% (was Ministral-3 8B Instruct Q8 LS/P at 86.5% in v0.6.0; now #3 at 84.4%).
MODEL_GUIDE rewrite — trimmed to opinions + rationale (333 → 145 lines). Full leaderboard, OG-18 100% list, hard suite top-5, models-to-avoid tables moved to the dashboard / markdown views. Sampling-parameters and "backend matters" sections retained. Native-vs-prompt heuristic corrected: not workload-driven, sensitivity is per-family.
ARCHITECTURE rebuild — cut signature restating (1701 → 165 lines); the doc now covers design principles, surface modes, guardrail rationale, compaction priority rationale, respond-tool rationale, sampling opt-in semantics. Source is authoritative for class signatures; WORKFLOW.md owns the diagrams; ADRs own past decisions.
BACKEND_SETUP rewrite — cut model-pick prose, Windows-specific install steps, Ollama Modelfile tutorial, llamafile distribution explainer, per-backend "run the eval" subsections, VRAM tables (360 → 135 lines). Per-backend section now: boot command + flag table + curl smoke-test + forge client snippet. Added Anthropic section using pip install "forge-guardrails[anthropic]".
README opener — leads with the contract (any tools, any order; structure opt-in via required_steps/prerequisites/terminal_tool) before the eval pitch. New "What forge isn't" (not an agent orchestrator, not a coding harness) preempts the conflations that surfaced on HN. Three-ways list reordered with proxy first (most popular entry point). Quick Start swapped from Ollama to llama-server.

Fixed

WorkflowRunner docstring + tree — added missing retry_nudge kwarg, cancel_event parameter on run(), PREREQUISITE_NUDGE + CONTEXT_WARNING message types, MaxIterationsError / PrerequisiteError / StepEnforcementError / WorkflowCancelledError in Raises lists across docs.
CompactStrategy + ContextManager signatures in docs — trigger_tokens → budget_tokens (the strategy owns its own threshold logic now); compact_threshold → context_thresholds + on_context_threshold callbacks.
LlamafileClient constructor docs — added missing sampling kwargs (top_p, top_k, min_p, repeat_penalty, presence_penalty), chat_template_kwargs, slot_id.
MODEL_FAMILIES in report.py — added entries for granite-4.1-8b (Q4/Q8) and phi-4-Q4_K_M so cross-backend rollups in by-backend.md group these new models correctly.
WORKFLOW.md agentic-loop flowchart — node names + edges updated to reflect the tool-error wire shape (STEP_TOOL_ERROR, PREREQ_TOOL_ERROR, UNKNOWN_TOOL_ERROR); compaction-priority table fixed (step_nudge and prerequisite_nudge are role=tool, retry_nudge remains role=user).
Stale bfcl/ reference removed from WORKFLOW.md module diagram (directory was removed pre-v0.7.0; ADR-009 retained as historical artifact).

Known limitations

Anthropic numbers not re-measured in v0.7.0. The Anthropic ablation matrix (~$272 to run) was not re-executed for v0.7.0. Numbers cited in any v0.7.0 doc are from the v0.6.0 dataset (eval_results_v0.6.0.jsonl). Tool-error-channel changes affect frontier models' wire on guardrail-fire paths too, but expected movement is small.

View diff on GitHub

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Share on X Share on Bluesky

Track Forge

Get notified when new releases ship.

About Forge

All releases →

Related context

Related tools

Earlier breaking changes

v0.7.5 Changes default behavior to replay no reasoning blocks.
v0.7.4 Deprecates pydantic `.model_*` API on `ToolCall` and `TextResponse` dataclasses; construction no longer validates argument shape.
v0.7.3 Renames `--mode {native,prompt}` to `--backend-capability {native,prompt}`; no deprecation alias.

Featured in

2026-W21