chernistry/bernstein

v2.4.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

✓ No known CVEs patched

Topics

agent-framework agent-orchestrator agentic-ai ai-agents ai-coding aider
anthropic claude-code cli-tool codex-cli coding-agent deterministic-scheduler hmac-audit llm mcp-server model-context-protocol multi-agent parallel-worktrees python swe-bench

Summary

AI summary

Broad release spanning observability surfaces, single-writer run state, declarative planning gates, and CI and infrastructure hardening.

Changes in this release

Security Medium

Approval responses bound to server-minted single-use nonce; mismatches surface as 409 NONCE_MISMATCH, evicted replays as 410 NONCE_EXPIRED, foreclosing stale-button replay on superseded prompts.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Unified bernstein doctor observe aggregates four observability backends into one table with delta-since-last-check, per-PR sticky summary comment, and daily trends snapshot.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Single-writer RunActor owns canonical per-session state behind async event queue with bounded replay buffer emitting Gap marker on eviction.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Spec-quality gate refuses to advance feature spec until deterministic library-only rule set passes, routing failures through auto-fix loop and surfacing unresolved items to operator.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Declarative task DAG adds parallel_safe and story_id fields; backlog parser learns markdown checkboxes; topological_iter_with_parallel yields ready batches honouring cycle detection; bernstein plan dag / tasks dag render DAG with parallel batches highlighted.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Three-layer skill customization (BASE/TEAM/USER) under XDG paths with deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Empirical-confidence ledger backs model recommender with per-decision outcomes in SQLite store; prefers measured outcomes over capability-tier heuristic and bandit arm, refusing to return a value when the sample count is below the documented threshold (default 5).

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

bernstein doctor sonar subcommand pulls project measures from SonarQube with rich-table or JSON output; soft-fails when env vars unset.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

bernstein doctor glitchtip subcommand pulls last-24h issue counts, 7-day trend, and top unresolved issues from GlitchTip; soft-fails when token unset.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Per-PR sticky Sonar comment workflow posts advisory PR comment with project-level Sonar measures; never blocks merge.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Feature Medium

Daily GlitchTip alert sweep workflow mirrors fatal-level issues into sticky GitHub issues labelled glitchtip-alert and auto-closes when resolved.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Performance Medium

Sonar scan workflow now consumes existing coverage artifact via workflow_run, avoiding full re-run of unit suite and fitting memory budget.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: high

Bugfix Medium

Restores str() coercion in _run_git error formatter to prevent TypeError when Path used in argv list.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Refactor Medium

Bulk refurb autofix wave 4 (FURB184 + leftovers) applies mechanical idiom rewrites across src/, including 163 FURB184 fixes.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Refactor Medium

Refurb cluster D (FURB139 / 143 / 179 strings and enumerate) applies 16 autofixes for GraphQL query constants, redundant outer or, and nested list/set comprehensions.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Refactor Medium

Refurb cluster E (FURB182 / 183 / 142 / 101 misc) performs 33 safe rewrites: folds hashlib.update into sha256 constructor, replaces for x in iter s.add with s.update, switches open to Path.read_text/bytes, and simplifies empty format expressions.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Refactor Medium

Refurb cluster B (FURB109 / 108 / 126 control flow) uses tuples instead of lists for static membership, collapses x == a or x == b to x in (a,b), and drops redundant else after return.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Other Medium

Doc-drift refresh reconciles 16 documents with current source-of-truth public surfaces across concepts, GUI, SDD partitions, and more.

Source: granite4.1:8b-q6_K@2026-05-20

Confidence: low

Full changelog

v2.4.0 - Observability surfaces, single-writer run state, declarative planning gates

Release date: 2026-05-20
Commits since v2.3.1: 33

Highlights

  • Unified bernstein doctor observe umbrella rolls the four observability backends (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) into one aggregated table with delta-since-last-check, plus a per-PR sticky summary comment and a daily trends snapshot. Each backend soft-fails to SKIPPED when its env vars are unset, so a fresh checkout stays green.
  • Single-writer RunActor owns canonical per-session state behind one async event queue with a bounded replay buffer that emits an explicit Gap{up_to_seq} marker on eviction, making reconnect-after-eviction observable instead of silently lossy.
  • Spec-quality gate refuses to advance a feature spec until a deterministic, library-only rule set passes; failures route through a bounded auto-fix loop and surface unresolved items to the operator rather than dispatching an implementer against a weak spec.
  • Declarative task DAG: tasks gain parallel_safe and story_id fields, the backlog parser learns [T<id>] [P] [USn] markdown checkboxes, topological_iter_with_parallel yields ready batches honouring cycle detection, and bernstein plan dag / bernstein tasks dag render the DAG with parallel batches highlighted; replaces the file-overlap heuristic for tasks that declare the flag while preserving the legacy heuristic as a fall-back.
  • Three-layer skill customization (BASE / TEAM / USER) under XDG paths with a per-field deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly.
  • Empirical-confidence ledger backs the model recommender: an append-only SQLite store of per-decision outcomes feeds a sample-size-gated query that prefers measured outcomes over the capability-tier heuristic and over the bandit arm, refusing to return a value when a cell has fewer samples than a documented threshold (default 5).
  • Approval responses are now bound to a 16-byte server-minted single-use nonce; mismatches surface as 409 NONCE_MISMATCH and evicted replays as 410 NONCE_EXPIRED, foreclosing stale-button replay on superseded prompts.
  • Canonical stream-signal vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED) parseable from any wrapped CLI stdout so non-stream-json adapters surface lifecycle events through the same channel as native stream-json adapters.
  • CI hardening across the board: the Sonar scan consumes the existing coverage artifact via workflow_run (and workflow_dispatch bootstraps a coverage-bearing first scan), the review-bot-ack gate no longer cancels its own required check, the Schemathesis smoke timeout is widened to stop flaky cancellations, and the runtime Docker images are pinned back to python:3.13-slim.
  • Four refurb auto-fix waves (wave 4 plus clusters B / D / E) land about 320 mechanical idiom rewrites across src/, taking FURB142 to zero and substantially reducing the FURB184 / FURB138 / FURB124 / FURB182 / FURB101 / FURB109 / FURB108 / FURB126 backlog.
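
The bounded replay buffer from the RunActor highlight can be sketched as follows. This is a minimal illustration, not the project's ReplayBuffer: the class and method names (ReplayBufferSketch, replay_after) are hypothetical, and it assumes Gap{up_to_seq} means "everything up to and including this seq was evicted".

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Event:
    seq: int
    payload: Any


@dataclass(frozen=True)
class Gap:
    # Assumption: events with seq <= up_to_seq are no longer available.
    up_to_seq: int


class ReplayBufferSketch:
    """Hypothetical bounded ring of events with monotonic seq numbers."""

    def __init__(self, capacity: int = 1024) -> None:
        # deque(maxlen=...) silently drops the oldest item when full.
        self._events = deque(maxlen=capacity)
        self._next_seq = 1

    def append(self, payload: Any) -> Event:
        event = Event(self._next_seq, payload)
        self._next_seq += 1
        self._events.append(event)
        return event

    def replay_after(self, last_seen_seq: int) -> list[Union[Event, Gap]]:
        """Return events newer than last_seen_seq, led by a Gap if some were evicted."""
        out: list[Union[Event, Gap]] = []
        if self._events and last_seen_seq + 1 < self._events[0].seq:
            # The subscriber asked for a range the ring no longer holds.
            out.append(Gap(up_to_seq=self._events[0].seq - 1))
        out.extend(e for e in self._events if e.seq > last_seen_seq)
        return out


buf = ReplayBufferSketch(capacity=3)
for i in range(5):
    buf.append(f"e{i}")          # seq 1..5; seq 1 and 2 get evicted
replayed = buf.replay_after(0)   # starts with Gap(up_to_seq=2)
```

A subscriber that sees a Gap first knows it must resynchronise from canonical state rather than trust the replayed tail alone.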

What ships

Observability surfaces

  • Unified bernstein doctor observe (#1650). Umbrella command that runs each per-backend probe (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) in order and renders one aggregated Rich table with metric, value, delta-since-last-check, threshold, and status columns. Supports --json (machine-readable) and --watch (re-runs every 60 seconds). Each backend soft-fails to SKIPPED when its env vars are unset so the umbrella keeps running on a fresh checkout. Per-backend deltas are computed against a small snapshot cache at .sdd/observability/<backend>.json (suppressible via --no-persist). The dt, code-scanning, and observe Click commands are registered directly in bernstein.cli.main so the wiring survives independent refactors of advanced_cmd.py. A per-PR pr-observability-summary.yml workflow posts a sticky Markdown comment rendered from the observe JSON, and a daily docs-observability-snapshot.yml cron (06:00 UTC) writes docs/observability/snapshots/<date>.json and re-renders docs/observability/trends.md via a dependency-free unicode sparkline. Probe crash messages store only the exception type in persisted snapshots so tokens or URLs cannot leak. Docs at docs/observability/unified-doctor.md. Tests at tests/unit/cli/doctor/test_observe.py cover probe soft-fails, delta math, Click wiring, JSON shape, persistence toggle, and exit-code mapping.
  • bernstein doctor sonar (#1648). New subcommand pulling project measures from a configured SonarQube server: coverage, code smells by severity, bugs, vulnerabilities, security hotspots, and cognitive-complexity hotspots. Rich-table or --json output. Soft-fails (exit 0) when SONAR_HOST_URL / SONAR_TOKEN are unset and prints a one-line hint at docs/observability/sonar.md. Advisory baseline at $XDG_DATA_HOME/bernstein/sonar-baseline.json lets the parent bernstein doctor group nudge when open smells exceed the threshold or vulnerabilities regress. 28 hermetic tests via httpx.MockTransport.
  • bernstein doctor glitchtip (#1646). New subcommand pulling last-24h issue counts by severity, a 7-day trend, and the top unresolved issues from the configured GlitchTip server. Rich-table or --json output. Soft-fails when BERNSTEIN_GLITCHTIP_TOKEN is unset. Optional baseline cache at ~/.local/share/bernstein/glitchtip-baseline.json powers a nudge under bernstein doctor --suggest-docs when the GlitchTip API reports new unresolved issues since the last check. 25 unit tests cover the fetcher, baseline persistence, nudge logic, Click wiring, and soft-fail behaviour.
  • Sticky PR Sonar comment (#1648). New .github/workflows/sonar-pr-comment.yml posts a sticky advisory PR comment with project-level Sonar measures. Soft signal only, never blocks merge.
  • Daily GlitchTip alert sweep (#1646). New .github/workflows/glitchtip-insights.yml (06:30 UTC + workflow_dispatch) mirrors fatal-level GlitchTip issues into sticky GitHub issues labelled glitchtip-alert. The mirror auto-closes when the GlitchTip side resolves. Workflow now validates HTTP status on the resolved-issues fetch and runs gh issue subprocesses with check=True so reconciliation failures fail the run instead of being swallowed.

Security

  • Approval-nonce binding (#1642). Mints a 16-byte server-generated nonce per pending approval. The reply must echo the exact value or the gate refuses to resolve, foreclosing stale-button replay on superseded prompts and any path where the agent process could forge its own approval response.
    • core/approval/models: nonce field on PendingApproval (hex on the wire); to_dict(include_nonce=False) for adapter-facing serialisations; new ApprovalNonceMismatch / ApprovalNonceExpired errors.
    • core/approval/queue: resolve() validates the supplied nonce in constant time. Server-internal callers (TTL evict, wait_for timeout) keep the back-compat no-nonce path so they cannot deadlock.
    • core/routes/approvals: HTTP reply now requires a nonce. Mismatches surface 409 NONCE_MISMATCH. Replays against an evicted approval surface 410 NONCE_EXPIRED. The live-fragment HTML threads the nonce through the button handlers.
    • cli/commands/approval_cmd: approve-tool / reject-tool read the on-disk record and thread the nonce back through resolve().
    • A missing nonce body field defaults to an empty string at the schema layer so it flows through the handler and surfaces as 409 NONCE_MISMATCH via the existing _coerce_nonce guard, instead of being rejected at the Pydantic layer with 422.
    • Closes #1619.
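
The nonce binding described above can be illustrated with a small standard-library sketch. The class and exception names here are hypothetical stand-ins (the release's own types are ApprovalNonceMismatch / ApprovalNonceExpired, whose signatures are not shown in these notes); only the 16-byte size, hex encoding, constant-time comparison, and single-use behaviour come from the text.

```python
import hmac
import secrets


class NonceMismatch(Exception):
    """Hypothetical stand-in for the 409 NONCE_MISMATCH case."""


class NonceExpired(Exception):
    """Hypothetical stand-in for the 410 NONCE_EXPIRED case."""


class ApprovalGateSketch:
    def __init__(self) -> None:
        self._pending: dict[str, str] = {}  # approval id -> hex nonce

    def open(self, approval_id: str) -> str:
        nonce = secrets.token_bytes(16).hex()  # 16 random bytes, hex on the wire
        self._pending[approval_id] = nonce
        return nonce

    def evict(self, approval_id: str) -> None:
        self._pending.pop(approval_id, None)

    def resolve(self, approval_id: str, nonce: str) -> None:
        expected = self._pending.get(approval_id)
        if expected is None:
            raise NonceExpired(approval_id)
        # Compare as bytes in constant time; an empty or wrong nonce fails here.
        if not hmac.compare_digest(expected.encode(), nonce.encode()):
            raise NonceMismatch(approval_id)
        del self._pending[approval_id]  # single use: a second reply is rejected


gate = ApprovalGateSketch()
nonce = gate.open("approval-1")
gate.resolve("approval-1", nonce)  # succeeds exactly once
```

Encoding both sides to bytes before `hmac.compare_digest` avoids the `TypeError` that the function raises for non-ASCII `str` input.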

Reliability and runtime

  • Single-writer RunActor (#1641). Introduces a per-session actor that owns canonical run state. Mutations flow as typed events through one async queue. A pure apply_event reducer applies them with monotonic seq numbers. ReplayBuffer is a bounded ring (default 1024) that emits an explicit Gap{up_to_seq} marker when a subscriber asks for an evicted range, so a reconnect-after-eviction is observable instead of silently corrupt. The approval gate gains an opt-in session_id kwarg that mirrors approval events into a registered RunActor via run_actor_registry. The file-driven decision contract is unchanged; the actor feed runs alongside. Migrating the remaining writers (worker subprocess, watchdog, lifecycle hooks, hooks_receiver) is a follow-up. Refs #1630.
  • Canonical stream-signal protocol (#1638). New core/protocols/stream_signals.py defines a small text-line vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED), a parser, a producer-side format helper, and conformance helpers. CLIAdapter grows an optional stream_signal_parser hook; the default delegates to the canonical parser, adapters override to map a native protocol onto the canonical vocabulary. ConformanceReport surfaces missing terminal signals as a soft warning so adapters without canonical signals stay visible without failing. Tests cover parse, format round-trip, malformed-input resilience, concurrent multi-adapter parsing, terminal-signal check, default vs. override hook behaviour, plan, and question round-trip. Docs at docs/adapters/stream_signals.md describe the vocabulary with shell and Python wrapper examples. Resolves #1632.
  • Declarative task DAG (#1655). Adds a declarative task DAG layer so the planner sets per-task parallel safety at task-generation time instead of having the scheduler infer it from file overlap. The Task schema gains parallel_safe (default False) and story_id (Optional[str]) with round-trip support in Task.from_dict. The backlog parser recognises the [T<id>] [P] [USn] markdown checkbox format and the matching YAML frontmatter keys. New core/orchestration/task_dag.py provides TaskNode, TaskDag (markdown + YAML loaders), and topological_iter_with_parallel yielding ready batches; cycles raise TaskDagCycleError. adaptive_parallelism.tasks_safe_to_run_in_parallel consumes the declarative flag directly; the file-overlap heuristic is preserved only for legacy tasks that lack the attribute. CLI: bernstein plan dag --file <path> (also reachable as bernstein tasks dag --file) renders the DAG with parallel batches highlighted and lists story rollback groups. Docs at docs/orchestration/task-dag.md and docs/operations/task_format.md. Tests cover schema and parser round-trip, scheduler consumption, and single-task / sequential-chain / parallel-batch / mixed parallel-serial / cycle-detection paths. Closes #1634.
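
To make the "ready batches" idea concrete, here is a minimal sketch of a layered topological iteration that honours a per-task parallel flag. It is not the project's topological_iter_with_parallel; the Node and CycleError names, the deps field, and the choice to yield non-parallel-safe tasks one at a time are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


class CycleError(ValueError):
    """Hypothetical stand-in for TaskDagCycleError."""


@dataclass
class Node:
    id: str
    parallel_safe: bool = False
    deps: tuple[str, ...] = ()  # assumed to reference known task ids


def iter_batches(nodes: list[Node]) -> Iterator[list[str]]:
    by_id = {n.id: n for n in nodes}
    remaining = {n.id: set(n.deps) for n in nodes}
    while remaining:
        ready = sorted(tid for tid, deps in remaining.items() if not deps)
        if not ready:
            # Every remaining task waits on another remaining task.
            raise CycleError(f"cycle among: {sorted(remaining)}")
        parallel = [t for t in ready if by_id[t].parallel_safe]
        serial = [t for t in ready if not by_id[t].parallel_safe]
        if parallel:
            yield parallel            # one batch that may run concurrently
        for task_id in serial:
            yield [task_id]           # tasks without the flag run alone
        for task_id in ready:
            del remaining[task_id]
        for deps in remaining.values():
            deps.difference_update(ready)


tasks = [
    Node("T1"),
    Node("T2", parallel_safe=True, deps=("T1",)),
    Node("T3", parallel_safe=True, deps=("T1",)),
    Node("T4", deps=("T2", "T3")),
]
batches = list(iter_batches(tasks))
```

Because the function is a generator, a cycle is reported only when iteration reaches the stuck layer, so callers that want eager validation should exhaust it first.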

Quality and routing

  • Empirical-confidence ledger (#1653). New core/quality/empirical_confidence.py: an append-only SQLite ledger (agent_outcomes table) of per-decision outcomes, with a sample-size-gated ConfidenceQuery that returns None below the documented threshold (default 5) instead of fabricating a value. core/routing/model_recommender.py consults the ledger first; the existing capability-tier heuristic and the bandit arm remain as documented fall-backs for cells that have not accumulated enough samples. Default DB path: ${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db. Override via BERNSTEIN_CONFIDENCE_DB; threshold via BERNSTEIN_CONFIDENCE_MIN_SAMPLES. Docs at docs/quality/empirical-confidence.md cover the schema, the sample-size rationale, and the routing precedence order. 16 new ledger tests plus 8 router regression tests pass. Closes #1622.
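
A sample-size-gated query of this kind might look like the sketch below. Only the agent_outcomes table name and the default threshold of 5 come from the notes; the column names, function names, and the success-rate metric are assumptions for illustration.

```python
import sqlite3
from typing import Optional

MIN_SAMPLES = 5  # default threshold named in the notes


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per routing decision outcome.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_outcomes ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " model TEXT NOT NULL,"
        " task_kind TEXT NOT NULL,"
        " success INTEGER NOT NULL)"
    )
    return conn


def record_outcome(conn: sqlite3.Connection, model: str, task_kind: str, success: bool) -> None:
    # Append-only: this sketch never updates or deletes rows.
    conn.execute(
        "INSERT INTO agent_outcomes (model, task_kind, success) VALUES (?, ?, ?)",
        (model, task_kind, int(success)),
    )
    conn.commit()


def success_rate(
    conn: sqlite3.Connection, model: str, task_kind: str, min_samples: int = MIN_SAMPLES
) -> Optional[float]:
    count, wins = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(success), 0) FROM agent_outcomes"
        " WHERE model = ? AND task_kind = ?",
        (model, task_kind),
    ).fetchone()
    if count < min_samples:
        return None  # too few samples: let the caller fall back to a heuristic
    return wins / count


conn = open_ledger()
for ok in (True, True, False, True):
    record_outcome(conn, "model-a", "refactor", ok)
before = success_rate(conn, "model-a", "refactor")   # None: only 4 samples
record_outcome(conn, "model-a", "refactor", True)
after = success_rate(conn, "model-a", "refactor")    # 4 wins out of 5
```

Returning None rather than a low-confidence number forces the caller to choose a fall-back explicitly, which matches the precedence order described above.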

Planning gates

  • Spec-quality gate (#1652). New core/planning/spec_quality.py: a deterministic, library-only gate that evaluates a feature spec against a small, pluggable rule set before the orchestrator dispatches an implementer. Default rules cover acceptance-criteria-present, out-of-scope-present, tested-via-present, no-TODO, no-placeholder, and ref-paths-exist. Specs that fail any required rule route through a bounded auto-fix loop (default 3 iterations); when the budget is exhausted the gate raises SpecQualityUnresolvedError so callers can surface the unresolved items without re-evaluating. Rules are pluggable through the bernstein.spec_quality_rules entry-point group; broken plugins are skipped, never crash the gate, and plugin RuleResult ids are normalised to the owning rule. CLI surfaces: bernstein spec check <path> and bernstein spec auto-fix <path> (dry-run vs --write, strict vs no-strict). Path-like spec strings that raise OSError fall back to inline mode. Docs at docs/planning/spec-quality-gate.md. Tests at tests/unit/planning/test_spec_quality.py and tests/unit/cli/test_spec_cmd.py. Closes #1631.
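
The bounded auto-fix loop can be sketched as below. The function names, the rule and fixer signatures, and the demo fixer are hypothetical; only the rule ids, the default budget of 3 iterations, and the "raise when the budget is exhausted" behaviour are taken from the notes.

```python
from typing import Callable

Rule = Callable[[str], bool]             # returns True when the spec passes
Fixer = Callable[[str, list[str]], str]  # receives the spec and the failed rule ids


class SpecUnresolved(Exception):
    """Hypothetical stand-in for SpecQualityUnresolvedError."""

    def __init__(self, failed: list[str]) -> None:
        super().__init__(f"unresolved rules: {failed}")
        self.failed = failed


def failing_rules(spec: str, rules: dict[str, Rule]) -> list[str]:
    return [rule_id for rule_id, rule in rules.items() if not rule(spec)]


def run_gate(spec: str, rules: dict[str, Rule], fixer: Fixer, max_iterations: int = 3) -> str:
    failed = failing_rules(spec, rules)
    attempts = 0
    while failed and attempts < max_iterations:
        spec = fixer(spec, failed)
        failed = failing_rules(spec, rules)
        attempts += 1
    if failed:
        # Budget exhausted: hand the unresolved rule ids to the operator.
        raise SpecUnresolved(failed)
    return spec


RULES: dict[str, Rule] = {
    "acceptance-criteria-present": lambda s: "Acceptance criteria" in s,
    "no-TODO": lambda s: "TODO" not in s,
}


def demo_fixer(spec: str, failed: list[str]) -> str:
    # Toy fixer for illustration only.
    if "no-TODO" in failed:
        spec = spec.replace("TODO", "").strip()
    if "acceptance-criteria-present" in failed:
        spec += "\n\nAcceptance criteria: login succeeds with valid credentials."
    return spec


clean = run_gate("Add login. TODO", RULES, demo_fixer)
```

Carrying the failed rule ids on the exception lets a caller report them without evaluating the rules a second time.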

Skill customization

  • Three-layer skill merge (#1654). New core/skills/layered.py: BASE / TEAM / USER skill layers under XDG paths with a per-field merge spec where scalars override, tables deep-merge, keyed arrays replace by name / id / code, and unkeyed arrays append. Layers fall through cleanly when absent. CLI: bernstein skills list --layered surfaces layer-of-origin, and bernstein skills show <name> --per-layer shows the merged result alongside the raw per-layer diff. Docs at docs/skills/layered-merge.md. 30 new tests pin merge precedence, per-field granularity, deterministic output, and missing-layer fall-through. Closes #1624.
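
The four merge rules can be expressed in a few lines. This sketch is an interpretation, not the code in core/skills/layered.py: the function names are hypothetical, "keyed" is taken to mean every array item is a table carrying a name, id, or code field, and "replace" is taken to mean the whole item is swapped rather than deep-merged.

```python
from typing import Any, Optional

KEY_FIELDS = ("name", "id", "code")


def _key_of(item: Any) -> Optional[tuple]:
    """Return (field, value) for a keyed array item, else None."""
    if isinstance(item, dict):
        for field in KEY_FIELDS:
            if field in item:
                return (field, item[field])
    return None


def merge(lower: Any, upper: Any) -> Any:
    if isinstance(lower, dict) and isinstance(upper, dict):
        out = dict(lower)
        for key, value in upper.items():
            out[key] = merge(lower[key], value) if key in lower else value
        return out  # tables deep-merge
    if isinstance(lower, list) and isinstance(upper, list):
        if all(_key_of(item) is not None for item in lower + upper):
            merged = {_key_of(item): item for item in lower}
            for item in upper:
                merged[_key_of(item)] = item  # keyed arrays replace by key
            return list(merged.values())
        return lower + upper  # unkeyed arrays append
    return upper  # scalars override


def merge_layers(*layers: Optional[dict]) -> dict:
    result: dict = {}
    for layer in layers:
        if layer is not None:  # a missing layer falls through
            result = merge(result, layer)
    return result


base = {
    "model": "small",
    "limits": {"cpu": 1, "mem": 2},
    "tools": [{"name": "git", "v": 1}, {"name": "rg", "v": 1}],
    "tags": ["x"],
}
user = {
    "model": "large",
    "limits": {"mem": 4},
    "tools": [{"name": "git", "v": 2}],
    "tags": ["y"],
}
merged = merge_layers(base, None, user)  # BASE, missing TEAM, USER
```

Using a dict keyed by (field, value) keeps the lower layer's item order stable when an upper layer replaces an entry, which helps keep the output deterministic.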

Correctness

  • _run_git error formatter (#1644). Re-add the str() coercion inside the OSError / TimeoutExpired handler of git_context._run_git. The refurb wave 3 auto-fix (#1615) had dropped it, so calls with a Path inside the argv list (test_context, test_context_builder, test_failure_reduction all do this indirectly via cochange_files) raised FileNotFoundError, and the handler then crashed on " ".join(...) with expected str instance, PosixPath found, turning a debug log into a TypeError that bubbled up. Same fix as #1591, regressed by the wave-3 auto-fix.
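
The failure mode is easy to reproduce in isolation. The snippet below is a minimal reproduction rather than the body of _run_git; format_argv is a hypothetical helper name.

```python
from pathlib import Path

argv = ["git", "log", "--", Path("README.md")]

try:
    " ".join(argv)  # str.join rejects non-str items such as Path
    join_error = None
except TypeError as exc:
    join_error = exc


def format_argv(parts) -> str:
    """Coerce every argv item to str before joining, as the fix does."""
    return " ".join(str(part) for part in parts)


message = format_argv(argv)
```

Coercing inside the error formatter keeps the handler itself from raising, so the original OSError or TimeoutExpired stays the reported failure.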

CI and infrastructure

  • Sonar scan via workflow_run (#1645). The Sonar scan workflow was re-running the full unit suite under a single pytest --cov invocation. That suite needs per-file isolation to fit the runner memory budget, which is why ci.yml shards it across files and takes about 25 minutes. The naive single-process run only reached 5 percent of files within the 30 minute step timeout (the job-level timeout bump in #1616 did not lift the inner step cap). Switch sonar-scan.yml to a workflow_run trigger that fires after a successful CI run on main, download the coverage-report artifact CI already publishes, and feed it directly to the Sonar scanner. Also add sonar.ws.timeout=600 to guard the scanner client against slow server responses, and pin sonar.scm.revision to the upstream CI head SHA so the scan reports against the right commit.
  • Lint repair after #1638 (#1640). ruff format --check failed on core/quality/review_pipeline/review_gate.py after the stream-signal PR landed. Applying ruff format collapses several string and comprehension wrappings under the project's 120-character line length. No behaviour change.
  • Lint repair after #1655 (#1657). The task-DAG merge turned main red on Lint. Move Iterator and Path imports under TYPE_CHECKING in core/orchestration/task_dag.py (TC003, 2 sites), replace == True with is True in tests/unit/tasks/test_parallel_flag.py (E712), and run ruff format across the four files added or touched by #1655. No behaviour change.
  • Schemathesis smoke timeout (#1659). Widen the Schemathesis smoke step timeout so the property-based API smoke run stops being cancelled mid-flight under the normal main merge cadence, removing a recurring flaky-cancellation source on the merge train.
  • Docker runtime pin (#1664). The published runtime image (Dockerfile) and the demo image (docker/demo/Dockerfile) referenced python:3.14-slim while their inline comments still read python:3.12-slim. Both build the bernstein wheel and run adapter dependencies that require <=3.13, so both are pinned back to python:3.13-slim by digest with the stale comments corrected to match the repository python policy.
  • Sonar-scan workflow_run bootstrap (#1665). The workflow_run listener only fires when the upstream CI run on main concludes success, but ci.yml cancels in-progress runs per branch, so main CI almost never reaches success and the scan job's if-guard kept skipping. Make workflow_dispatch a reliable bootstrap and re-scan path: resolve the most recent successful CI run on main and pull its coverage-report artifact so a manual scan carries full Python coverage instead of scanning coverage-less. The workflow_run path is unchanged.
  • Review-bot-ack concurrency (#1666). The review-bot-ack workflow emits a required status check on every PR. With cancel-in-progress: true and a per-PR concurrency group, overlapping events (synchronize on push, pull_request_review on review submit) routinely cancelled an in-flight gate run, and a CANCELLED conclusion reads as a non-success required check that stalled the merge queue. Scope the concurrency group per-PR and per-head-sha and set cancel-in-progress: false so every commit's gate run completes against its own sha. Adds a CI workflow-health sweep summary at docs/ci/workflow-health-2026-05-20.md covering all 47 registered workflows.

Documentation

  • Doc-drift refresh (#1677). Reconcile docs/concepts/ and docs/gui/ prose with the current source-of-truth public surfaces across 16 documents, correcting renamed CLI surfaces, signatures, and config knobs: action-cache subcommands and metric names, swarm-migration --id flag, validate_with_retry positional signature, FeatureContract-driven spec-as-test assertions, select_sandbox(backends, ...) return and raises, team-hub 64 KiB manifest cap, BestOfNDefaults config knobs, cpu_pause_threshold load-units default, route_for_phase per-phase router, fingerprint-memoization default_store factory, LineageReader.iter_records(run_id) with --limit, and the async summarize_diff returning a list. docs/sdd/ verified in sync (no change).

Quality and refurb waves

  • Wave 4 (FURB184 + leftovers) (#1643). Conservative libcst / ast-based rewrites that preserve semantics. Counts in src/: FURB184 197 -> 34 (163 fixed), FURB138 42 -> 8 (34 fixed), FURB124 29 -> 3 (26 fixed), FURB142 16 -> 0 (16 fixed), FURB113 23 -> 21 (2 fixed; remainder have intervening comments that act as section dividers). Followed by a ruff format pass over 36 files to wrap E501 long-line comprehensions, plus four targeted fixes for broken seen in seen self-referential dedup comprehensions in spec_assertions, pr_review_aggregator, review_responder.models, and tui.approval_panel (replaced with dict.fromkeys() for order-preserving dedup).
  • Cluster D (FURB139 / 143 / 179 strings and enumerate) (#1647). 16 refurb autofixes: FURB139 drops leading / trailing newlines in nine multi-line GraphQL query constants by switching to line-continuation backslashes; FURB143 drops one redundant outer or "" after str(... or "") in jira_dc_adapter; FURB179 flattens six nested list / set comprehensions to itertools.chain.from_iterable in bulletin, orchestrator (x4), and capability_matrix. Three FURB143 alerts skipped intentionally where defensive or "" guards external API boundaries (importlib.metadata fields, externally-typed input strings).
  • Cluster E (FURB182 / 183 / 142 / 101 misc) (#1649). 33 safe refurb rewrites across 21 files: FURB182 folds the first hashlib.update() into the sha256() constructor (10 sites); FURB142 replaces for x in iter: s.add(...) with s.update(...) (16 sites); FURB101 replaces with open(p) as f: y = f.read() with Path(p).read_text/bytes() (5 sites); FURB183 replaces f"{x}" with str(x) where the format spec is empty (2 sites). Refurb now reports 0 alerts for these rules in src/.
  • Cluster B (FURB109 / 108 / 126 control flow) (#1651). 53 refurb idiom fixes across 44 files in src/bernstein/: FURB109 (23 sites) uses tuples instead of lists for static in membership and for iteration over fixed sequences; FURB108 (18 sites) collapses x == a or x == b chains to x in (a, b); FURB126 (12 sites) drops redundant else / case _ after a return and relies on fall-through. Pure control-flow and literal rewrites with no behavioural change; verified with ruff check clean on touched files, compileall clean, and a targeted pytest sweep (320+ tests) over affected modules.
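
For readers unfamiliar with these rules, the snippet below shows before-and-after shapes for a few of the rewrites named above. The variable names and data are invented for illustration and do not come from the bernstein source.

```python
import hashlib
from itertools import chain

# Order-preserving dedup via dict.fromkeys (the wave 4 targeted fix).
items = ["b", "a", "b", "c", "a"]
deduped = list(dict.fromkeys(items))

# FURB179: flatten nested iterables with chain.from_iterable
# instead of a nested comprehension.
nested = [[1, 2], [3], [4, 5]]
flat_before = [x for group in nested for x in group]
flat_after = list(chain.from_iterable(nested))

# FURB182: fold the first update() into the constructor.
h_before = hashlib.sha256()
h_before.update(b"payload")
h_after = hashlib.sha256(b"payload")

# FURB142: replace a loop of s.add(...) with s.update(...).
seen_before = set()
for x in (1, 2, 2, 3):
    seen_before.add(x)
seen_after = set()
seen_after.update((1, 2, 2, 3))

# FURB108 / FURB109: collapse an equality chain into tuple membership.
status = "failed"
terminal_before = status == "done" or status == "failed"
terminal_after = status in ("done", "failed")
```

Each pair produces the same value, which is why these rewrites are classed as mechanical.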

New and changed CLI commands

  • bernstein plan dag --file <path> / bernstein tasks dag --file <path> (new). Renders the task DAG with parallel batches highlighted and lists story rollback groups derived from story_id annotations.
  • bernstein doctor sonar (new). Surfaces project measures from SonarQube. Flags: --json, baseline cache override via XDG_DATA_HOME.
  • bernstein doctor glitchtip (new). Surfaces last-24h issue counts, 7-day trend, and top unresolved issues. Flags: --json, --top-n (IntRange(min=1)).
  • bernstein doctor --suggest-docs (extended). Now also prints one-line GlitchTip and Sonar nudges when the respective APIs report new unresolved issues or threshold regressions since the cached baseline; failures are logged and suppressed (never crashes the doctor command).
  • bernstein approve-tool / bernstein reject-tool (changed). Read the on-disk pending-approval record and thread the server-minted nonce back through resolve(). Operators using the CLI path see no behaviour change; integrators calling resolve() directly must thread the nonce or use the server-internal back-compat path.

Upgrade notes

  • Drop-in upgrade from v2.3.1 for configuration and audit data: no config-schema changes, no audit-chain changes. The one breaking change is the approval API change below.
  • Approval API change. HTTP approval replies now require a nonce field. The live-fragment HTML threads the nonce through automatically; external integrators calling the approval endpoint directly need to echo the nonce from the pending-approval payload. Missing or empty nonce returns 409 NONCE_MISMATCH. Replays against an evicted approval return 410 NONCE_EXPIRED.
  • Sonar workflow trigger changed. .github/workflows/sonar-scan.yml is now workflow_run against the CI workflow on main. Operators with a fork running their own Sonar scan should mirror the same trigger or set SONAR_HOST_URL / SONAR_TOKEN to point at their own server.
  • New optional env vars. BERNSTEIN_GLITCHTIP_TOKEN (for bernstein doctor glitchtip), optional overrides BERNSTEIN_GLITCHTIP_BASE_URL and BERNSTEIN_GLITCHTIP_ORG. SONAR_HOST_URL and SONAR_TOKEN for bernstein doctor sonar. The GitHub workflows expect GLITCHTIP_API_TOKEN and (for Sonar) SONAR_TOKEN as repo secrets. None of these are required; both commands soft-fail with a one-line hint when unset.
  • RunActor is opt-in. Existing flows that do not pass session_id into the approval gate continue to work unchanged.
  • Empirical-confidence ledger is created lazily. On first write, an SQLite file is created at ${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db. Override the path with BERNSTEIN_CONFIDENCE_DB, the sample threshold with BERNSTEIN_CONFIDENCE_MIN_SAMPLES. The model recommender falls back to the existing capability-tier and bandit paths when the ledger lacks a qualifying sample, so existing runs are unaffected.

Internal

  • Review-bot acknowledgement gate caught seven CodeRabbit must-address findings on #1646 across workflow status validation, gh issue subprocess check=True, doc clarification on soft-fail conditions, narrower import-time exception handling, logging of unexpected fetch failures, IntRange(min=1) on --top-n, and dropping a truthy fallback in summarise_severity / _bucket_trend_by_day that was inflating legitimate zero counts to one.
  • Sourcery flagged the empty-nonce-body case on #1642; default the field to an empty string at the schema layer so the documented 409 NONCE_MISMATCH contract holds.
  • _run_git regression test coverage hardened by re-adding the str() coercion in the error formatter and re-running the three failing tests (test_context::test_returns_list, test_context_builder::test_includes_file_summary_for_python_files, test_failure_reduction::test_task_context_includes_file_info).

Acknowledgements

This release is operator-only; no external contributor PRs landed in the v2.3.1..v2.4.0 window.

Full changelog

feat (10)

  • f84bde93 feat(adapters): canonical stream-signal protocol for adapter stdout (#1638)
  • 2df0e9c1 feat(orchestration): single-writer run-state actor with bounded replay buffer (#1641)
  • fd231bc7 feat(security): bind approval responses to single-use nonce (#1642)
  • 51f330a6 feat(observability): sonar insights surface + doctor subcommand + delta nudge (#1648)
  • 1a50c36a feat(observability): GlitchTip insights surface + doctor subcommand + daily alert workflow (#1646)
  • c42607b4 feat(quality): empirical confidence from outcome history (#1653)
  • ec430dff feat(orchestration): task DAG with explicit parallel flag + story-link grouping (#1655)
  • 05f582a2 feat(planning): auto spec-quality checklist refuses to advance until clean (#1652)
  • 381c3b6f feat(skills): three-layer customization with deterministic merge (#1654)
  • 15b5b1d0 feat(observability): unified bernstein doctor observe + per-PR insights summary + daily trends (#1650)

fix (8)

  • 80c819b8 fix(lint): repair main-red after #1638 merge (#1640)
  • 27ba6885 fix(test): restore str() coercion in _run_git error formatter (#1644)
  • b7bc28fe fix(ci): reuse coverage artifact in Sonar scan instead of re-running tests (#1645)
  • a0f26de7 fix(lint): repair main-red after #1655 task-DAG merge (#1657)
  • a1449b4f fix(ci): widen Schemathesis smoke timeout to stop flaky cancellations (#1659)
  • ab72c5bd fix(docker): pin runtime images to python:3.13-slim (#1664)
  • b7f288d5 fix(ci): repair sonar-scan workflow_run trigger so first scan populates the project (#1665)
  • 006743ee fix(ci): stop review-bot-ack from cancelling its own required check (#1666)

refactor (4)

  • 6fe31edc refactor: bulk refurb autofix wave 4 (FURB184 + leftovers) (#1643)
  • eb112d2e refactor: refurb cluster E (FURB182/183/142/101 misc) (#1649)
  • 2fb0d26c refactor: refurb cluster D (FURB139/143/179 strings/enumerate) (#1647)
  • d684739c refactor: refurb cluster B (FURB109/108/126 control flow) (#1651)

docs (1)

  • ca6a2dab docs(refresh): concepts + gui + sdd partitions per drift playbook (#1677)

chore / deps (10)

  • e6ca20a2 chore(release): v2.4.0 (#1658)
  • 7b047af9 chore(deps): update marocchino/sticky-pull-request-comment action to v2.9.4 (#1661)
  • 49b9766b chore(deps): update dependency python to 3.13 (#1663)
  • 1107e76d chore(deps): update peter-evans/create-pull-request action to v7.0.11 (#1662)
  • 93d47267 chore(deps): bump peter-evans/create-pull-request from 7.0.11 to 8.1.1 (#1667)
  • e999818e chore(deps): update marocchino/sticky-pull-request-comment action to v3 (#1671)
  • 852e3778 chore(deps): bump marocchino/sticky-pull-request-comment (#1669)
  • 1016b352 chore(deps): update gcr.io/oss-fuzz-base/base-builder-python docker digest to 04d1a93 (#1670)
  • 48051a8b chore(deps): bump actions/setup-python from 5 to 6 (#1668)
  • 17021db4 chore(deps): update python:3.13-slim docker digest to 9ca3cf9 (#1678)

Breaking Changes

  • Approval API now requires a `nonce` field; missing or empty nonce returns `409 NONCE_MISMATCH`, evicted replays return `410 NONCE_EXPIRED`.

About chernistry/bernstein

Deterministic multi-agent orchestrator for 18 CLI coding agents (Claude Code, Codex, Cursor, Aider, Gemini CLI, OpenAI Agents SDK, and more). MCP server mode (stdio + HTTP/SSE) exposes the orchestrator to any MCP client. Git worktree isolation per agent, HMAC-chained audit trail, cost-aware model routing via contextual bandit. ~11K monthly PyPI downloads, Apache 2.0.
