Skip to content

Forge

v0.7.0 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

agentic-ai agentic-workflow agents function-calling llama-cpp llamafile
+5 more
llm ollama python self-hosted tool-calling

ReleasePort's take

Light signal
editorial:auto 10d

v0.7.0 introduces _PROMPT_ONLY_MODELS and _NO_RECOMMENDED_SAMPLING_MODELS flags for batch_eval, renames eval dataset files to a versioned format, and adds --html/--markdown output options in report.py.

Why it matters: The release adds _PROMPT_ONLY_MODELS and _NO_RECOMMENDED_SAMPLING_MODELS flags to batch_eval, renames eval dataset files to a versioned format (eval_results_vX.Y.Z.jsonl), and expands CLI output options with --html and --markdown in report.py.

Summary

AI summary

Updates 0.7.0] — 2026-05-22, Known limitations, and Q4/Q8 across a mixed release.

Changes in this release

Breaking Medium

Changes error reporting: step enforcement and prerequisite violations now emit tool‑channel messages with [StepEnforcementError] / [PrereqError].

Changes error reporting: step enforcement and prerequisite violations now emit tool‑channel messages with [StepEnforcementError] / [PrereqError].

Source: llm_adapter@2026-05-25

Confidence: high

Breaking Medium

Unknown‑tool handling now replies with [UnknownToolError] on the tool channel instead of user nudges.

Unknown‑tool handling now replies with [UnknownToolError] on the tool channel instead of user nudges.

Source: llm_adapter@2026-05-25

Confidence: high

Feature Medium

Adds _PROMPT_ONLY_MODELS flag to batch_eval to skip native FC for unsupported models.

Adds _PROMPT_ONLY_MODELS flag to batch_eval to skip native FC for unsupported models.

Source: llm_adapter@2026-05-25

Confidence: high

Feature Medium

Adds _NO_RECOMMENDED_SAMPLING_MODELS flag to batch_eval for models lacking sampling guidance.

Adds _NO_RECOMMENDED_SAMPLING_MODELS flag to batch_eval for models lacking sampling guidance.

Source: llm_adapter@2026-05-25

Confidence: high

Feature Medium

Introduces MODEL_REGISTRY.md documenting all known models with status tiers.

Introduces MODEL_REGISTRY.md documenting all known models with status tiers.

Source: llm_adapter@2026-05-25

Confidence: low

Feature Medium

Versioned eval dataset files renamed to eval_results_vX.Y.Z.jsonl.

Versioned eval dataset files renamed to eval_results_vX.Y.Z.jsonl.

Source: llm_adapter@2026-05-25

Confidence: low

Feature Medium

Adds --html and --markdown flags to report.py for output formats.

Adds --html and --markdown flags to report.py for output formats.

Source: llm_adapter@2026-05-25

Confidence: low

Feature Low

Adds Granite 4.1 8B, Gemma-4-E4B, and phi-4 to the eval lineup.

Adds Granite 4.1 8B, Gemma-4-E4B, and phi-4 to the eval lineup.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Feature Low

Refreshes eval lineup by cutting weak models (Llama 3.1 8B, Mistral 7B v0.3, etc.) and adding stronger ones.

Refreshes eval lineup by cutting weak models (Llama 3.1 8B, Mistral 7B v0.3, etc.) and adding stronger ones.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Feature Low

Updates eval dataset to eval_results_v0.7.0.jsonl with 96,200 rows and performance changes.

Updates eval dataset to eval_results_v0.7.0.jsonl with 96,200 rows and performance changes.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Feature Low

Modifies README opener to lead with contract before eval pitch and adds "What forge isn't" section.

Modifies README opener to lead with contract before eval pitch and adds "What forge isn't" section.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Feature Low

Regenerates dashboard and markdown views against the v0.7.0 dataset, reshuffling leaderboard rankings.

Regenerates dashboard and markdown views against the v0.7.0 dataset, reshuffling leaderboard rankings.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Feature Low

Updates BACKEND_SETUP documentation with Anthropic section using pip install "forge-guardrails[anthropic]".

Updates BACKEND_SETUP documentation with Anthropic section using pip install "forge-guardrails[anthropic]".

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Bugfix Medium

Updates WorkflowRunner docstring and tree with missing kwargs, error types, and message types.

Updates WorkflowRunner docstring and tree with missing kwargs, error types, and message types.

Source: llm_adapter@2026-05-25

Confidence: high

Bugfix Medium

Corrects CompactStrategy and ContextManager signatures to use budget_tokens and context_thresholds callbacks.

Corrects CompactStrategy and ContextManager signatures to use budget_tokens and context_thresholds callbacks.

Source: llm_adapter@2026-05-25

Confidence: high

Bugfix Medium

Adds missing sampling kwargs and chat_template_kwargs to LlamafileClient constructor documentation.

Adds missing sampling kwargs and chat_template_kwargs to LlamafileClient constructor documentation.

Source: llm_adapter@2026-05-25

Confidence: high

Bugfix Medium

Adds MODEL_FAMILIES entries for granite-4.1-8b and phi-4-Q4_K_M in report.py for correct rollups.

Adds MODEL_FAMILIES entries for granite-4.1-8b and phi-4-Q4_K_M in report.py for correct rollups.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Bugfix Low

Updates WORKFLOW.md flowchart node names, edges, and compaction-priority table for new error wire shapes.

Updates WORKFLOW.md flowchart node names, edges, and compaction-priority table for new error wire shapes.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Refactor Low

Rewrites MODEL_GUIDE, ARCHITECTURE, BACKEND_SETUP documentation sections for conciseness and added Anthropic setup instructions.

Rewrites MODEL_GUIDE, ARCHITECTURE, BACKEND_SETUP documentation sections for conciseness and added Anthropic setup instructions.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Refactor Low

Removes stale bfcl/ reference from WORKFLOW.md module diagram.

Removes stale bfcl/ reference from WORKFLOW.md module diagram.

Source: granite4.1:30b@2026-05-25-audit

Confidence: low

Full changelog

[0.7.0] — 2026-05-22

Added

  • Granite 4.1 8B + Gemma-4-E4B + phi-4 — added to the eval lineup. Granite 4.1 mirrors the IBM greedy-decoding convention pending formal published sampling guidance; phi-4 has no formal sampling recommendation and falls through to backend defaults.
  • _PROMPT_ONLY_MODELS in batch_eval — skips native FC for models lacking training for the OpenAI tool_calls schema (currently: phi-4, verified via curl 2026-05-14).
  • _NO_RECOMMENDED_SAMPLING_MODELS in batch_eval — runs recommended_sampling=False for models without formal sampling guidance from any official source, so the eval doesn't raise UnsupportedModelError on them.
  • MODEL_REGISTRY.md — new doc enumerating every model forge knows about, classified as Current (in v0.7.0 eval), Retired (cut from current eval), or Unpublished (sampling params staged, no published eval). Sampling values, source links, identity-key conventions.
  • Versioned eval datasets — committed dataset files renamed to eval_results_vX.Y.Z.jsonl. Prior versions kept in LFS for reproducibility.
  • report.py --html + --markdown flags surfaced in README and EVAL_GUIDE examples.

Changed

  • Step enforcement + prerequisite violations surface on the tool channel. Previously, WorkflowRunner emitted these as trailing role="user" nudges after the assistant tool_call. v0.7.0 emits one role="tool" message per blocked call with [StepEnforcementError] / [PrereqError] prefixes — the canonical "tool call failed, try again" wire shape OpenAI-tool-trained models are pretrained on. Surfaced by v4 forge-code dogfooding (gpt-oss-120b reliably exhausted prerequisite-violation budget under the old shape).
  • Unknown-tool retry on the tool channel. Same refactor applied to ResponseValidator unknown-tool path: [UnknownToolError] tool-error reply instead of a user nudge.
  • Eval lineup refresh — cut Llama 3.1 8B, Mistral 7B v0.3, Mistral Nemo 12B, Granite 4.0 (h-micro / h-tiny). All scored bare <30% on the v0.6.0 dataset — too weak to be informative, superseded by Ministral-3 / Granite 4.1 / phi-4. Sampling defaults retained in sampling_defaults.py for backward compatibility (see MODEL_REGISTRY Retired tier).
  • Eval dataseteval_results_v0.7.0.jsonl (96,200 rows, 74 cells; rig-01). Apples-to-apples delta on 21 common configs vs v0.6.0: +0.7pt overall, -1.2pt advanced_reasoning — both within CI. Published-leaderboard floor lifts +16.9pt via composition (weak-model cuts).
  • Dashboard + markdown views regenerated against v0.7.0 dataset. Top of leaderboard reshuffled: Ministral-3 14B Reasoning Q4 LS/N now #1 at 84.5% (was Ministral-3 8B Instruct Q8 LS/P at 86.5% in v0.6.0; now #3 at 84.4%).
  • MODEL_GUIDE rewrite — trimmed to opinions + rationale (333 → 145 lines). Full leaderboard, OG-18 100% list, hard suite top-5, models-to-avoid tables moved to the dashboard / markdown views. Sampling-parameters and "backend matters" sections retained. Native-vs-prompt heuristic corrected: not workload-driven, sensitivity is per-family.
  • ARCHITECTURE rebuild — cut signature restating (1701 → 165 lines); the doc now covers design principles, surface modes, guardrail rationale, compaction priority rationale, respond-tool rationale, sampling opt-in semantics. Source is authoritative for class signatures; WORKFLOW.md owns the diagrams; ADRs own past decisions.
  • BACKEND_SETUP rewrite — cut model-pick prose, Windows-specific install steps, Ollama Modelfile tutorial, llamafile distribution explainer, per-backend "run the eval" subsections, VRAM tables (360 → 135 lines). Per-backend section now: boot command + flag table + curl smoke-test + forge client snippet. Added Anthropic section using pip install "forge-guardrails[anthropic]".
  • README opener — leads with the contract (any tools, any order; structure opt-in via required_steps/prerequisites/terminal_tool) before the eval pitch. New "What forge isn't" (not an agent orchestrator, not a coding harness) preempts the conflations that surfaced on HN. Three-ways list reordered with proxy first (most popular entry point). Quick Start swapped from Ollama to llama-server.

Fixed

  • WorkflowRunner docstring + tree — added missing retry_nudge kwarg, cancel_event parameter on run(), PREREQUISITE_NUDGE + CONTEXT_WARNING message types, MaxIterationsError / PrerequisiteError / StepEnforcementError / WorkflowCancelledError in Raises lists across docs.
  • CompactStrategy + ContextManager signatures in docstrigger_tokensbudget_tokens (the strategy owns its own threshold logic now); compact_thresholdcontext_thresholds + on_context_threshold callbacks.
  • LlamafileClient constructor docs — added missing sampling kwargs (top_p, top_k, min_p, repeat_penalty, presence_penalty), chat_template_kwargs, slot_id.
  • MODEL_FAMILIES in report.py — added entries for granite-4.1-8b (Q4/Q8) and phi-4-Q4_K_M so cross-backend rollups in by-backend.md group these new models correctly.
  • WORKFLOW.md agentic-loop flowchart — node names + edges updated to reflect the tool-error wire shape (STEP_TOOL_ERROR, PREREQ_TOOL_ERROR, UNKNOWN_TOOL_ERROR); compaction-priority table fixed (step_nudge and prerequisite_nudge are role=tool, retry_nudge remains role=user).
  • Stale bfcl/ reference removed from WORKFLOW.md module diagram (directory was removed pre-v0.7.0; ADR-009 retained as historical artifact).

Known limitations

  • Anthropic numbers not re-measured in v0.7.0. The Anthropic ablation matrix (~$272 to run) was not re-executed for v0.7.0. Numbers cited in any v0.7.0 doc are from the v0.6.0 dataset (eval_results_v0.6.0.jsonl). Tool-error-channel changes affect frontier models' wire on guardrail-fire paths too, but expected movement is small.

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track Forge

Get notified when new releases ship.

Sign up free

About Forge

All releases →

Related context

Earlier breaking changes

  • v0.7.4 Deprecates pydantic `.model_*` API on `ToolCall` and `TextResponse` dataclasses; construction no longer validates argument shape.
  • v0.7.3 Renames `--mode {native,prompt}` to `--backend-capability {native,prompt}`; no deprecation alias.

Featured in

Beta — feedback welcome: [email protected]