chernistry/bernstein

v2.4.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 2mo AI Agents & Assistants

✓ No known CVEs patched in this version

Topics

agent-orchestrator agentic-ai ai-agents aider air-gap audit-trail claude-code cli-tool codex-cli coding-agent deterministic-replay deterministic-scheduler hmac-audit mcp-server model-context-protocol multi-agent parallel-worktrees provenance python reproducibility

Summary

AI summary

Broad release spanning observability surfaces, approval security, single-writer run state, planning gates, skill customization, and CI infrastructure.

Changes in this release

Type	Severity	Summary	Confidence	CVE
Security	Medium	Approval responses bound to a server-minted single-use nonce; mismatches surface as 409 NONCE_MISMATCH, evicted replays as 410 NONCE_EXPIRED, foreclosing stale-button replay on superseded prompts.	high	None
Feature	Medium	Unified bernstein doctor observe aggregates four observability backends into one table with delta-since-last-check, a per-PR sticky summary comment, and a daily trends snapshot.	high	None
Feature	Medium	Single-writer RunActor owns canonical per-session state behind an async event queue with a bounded replay buffer that emits a Gap marker on eviction.	high	None
Feature	Medium	Spec-quality gate refuses to advance a feature spec until a deterministic library-only rule set passes, routing failures through an auto-fix loop and surfacing unresolved items to the operator.	high	None
Feature	Medium	Declarative task DAG adds parallel_safe and story_id fields; backlog parser learns markdown checkboxes; topological_iter_with_parallel yields ready batches honouring cycle detection; bernstein plan dag / tasks dag render the DAG with parallel batches highlighted.	high	None
Feature	Medium	Three-layer skill customization (BASE/TEAM/USER) under XDG paths with a deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly.	high	None
Feature	Medium	Empirical-confidence ledger backs the model recommender with per-decision outcomes in a SQLite store; prefers measured outcomes over the capability-tier heuristic and the bandit arm, returning no value when the sample count is below the documented threshold (default 5).	high	None
Feature	Medium	bernstein doctor sonar subcommand pulls project measures from SonarQube with rich-table or JSON output; soft-fails when env vars are unset.	high	None
Feature	Medium	bernstein doctor glitchtip subcommand pulls last-24h issue counts, a 7-day trend, and top unresolved issues from GlitchTip; soft-fails when the token is unset.	high	None
Feature	Medium	Per-PR sticky Sonar comment workflow posts an advisory PR comment with project-level Sonar measures; never blocks merge.	high	None
Feature	Medium	Daily GlitchTip alert sweep workflow mirrors fatal-level issues into sticky GitHub issues labelled glitchtip-alert and auto-closes them when resolved.	high	None
Performance	Medium	Sonar scan workflow now consumes the existing coverage artifact via workflow_run, avoiding a full re-run of the unit suite and fitting the memory budget.	high	None
Bugfix	Medium	Restores str() coercion in the _run_git error formatter to prevent a TypeError when a Path is used in the argv list.	low	None
Refactor	Medium	Bulk refurb autofix wave 4 (FURB184 + leftovers) applies mechanical idiom rewrites across src/, fixing about 163 FURB184 alerts.	low	None
Refactor	Medium	Refurb cluster D (FURB139 / 143 / 179 strings and enumerate) applies 16 autofixes for GraphQL query constants, a redundant outer or, and nested list/set comprehensions.	low	None
Refactor	Medium	Refurb cluster E (FURB182 / 183 / 142 / 101 misc) performs 33 safe rewrites: folds hashlib.update() into the sha256() constructor, replaces for x in iter: s.add(...) loops with s.update(...), switches open() reads to Path.read_text/bytes(), and replaces f"{x}" with str(x) where the format spec is empty.	low	None
Refactor	Medium	Refurb cluster B (FURB109 / 108 / 126 control flow) uses tuples instead of lists for static membership, collapses x == a or x == b to x in (a, b), and drops redundant else after return.	low	None
Other	Medium	Doc-drift refresh reconciles 16 documents with current source-of-truth public surfaces across concepts, GUI, SDD partitions, and more.	low	None

Source for all summaries: granite4.1:8b-q6_K@2026-05-20

Full changelog

v2.4.0 - Observability surfaces, single-writer run state, declarative planning gates

Release date: 2026-05-20
Commits since v2.3.1: 33

Highlights

Unified bernstein doctor observe umbrella rolls the four observability backends (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) into one aggregated table with delta-since-last-check, plus a per-PR sticky summary comment and a daily trends snapshot. Each backend soft-fails to SKIPPED when its env vars are unset, so a fresh checkout stays green.
Single-writer RunActor owns canonical per-session state behind one async event queue with a bounded replay buffer that emits an explicit Gap{up_to_seq} marker on eviction, making reconnect-after-eviction observable instead of silently lossy.
Spec-quality gate refuses to advance a feature spec until a deterministic, library-only rule set passes; failures route through a bounded auto-fix loop and surface unresolved items to the operator rather than dispatching an implementer against a weak spec.
Declarative task DAG: tasks gain parallel_safe and story_id fields, the backlog parser learns [T<id>] [P] [USn] markdown checkboxes, topological_iter_with_parallel yields ready batches honouring cycle detection, and bernstein plan dag / bernstein tasks dag render the DAG with parallel batches highlighted; replaces the file-overlap heuristic for tasks that declare the flag while preserving the legacy heuristic as a fall-back.
Three-layer skill customization (BASE / TEAM / USER) under XDG paths with a per-field deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly.
Empirical-confidence ledger backs the model recommender: an append-only SQLite store of per-decision outcomes feeds a sample-size-gated query that prefers measured outcomes over the capability-tier heuristic and over the bandit arm, refusing to return a value below a documented threshold (default 5).
Approval responses are now bound to a 16-byte server-minted single-use nonce; mismatches surface as 409 NONCE_MISMATCH and evicted replays as 410 NONCE_EXPIRED, foreclosing stale-button replay on superseded prompts.
Canonical stream-signal vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED) parseable from any wrapped CLI stdout so non-stream-json adapters surface lifecycle events through the same channel as native stream-json adapters.
CI hardening across the board: the Sonar scan consumes the existing coverage artifact via workflow_run (and workflow_dispatch bootstraps a coverage-bearing first scan), the review-bot-ack gate no longer cancels its own required check, the Schemathesis smoke timeout is widened to stop flaky cancellations, and the runtime Docker images are pinned back to python:3.13-slim.
Four refurb auto-fix waves (wave 4 plus clusters B / D / E) land about 320 mechanical idiom rewrites across src/, taking FURB142 to zero and substantially reducing the FURB184 / FURB138 / FURB124 / FURB182 / FURB101 / FURB109 / FURB108 / FURB126 backlog.

What ships

Observability surfaces

Unified bernstein doctor observe (#1650). Umbrella command that runs each per-backend probe (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) in order and renders one aggregated Rich table with metric, value, delta-since-last-check, threshold, and status columns. Supports --json (machine-readable) and --watch (re-runs every 60 seconds). Each backend soft-fails to SKIPPED when its env vars are unset so the umbrella keeps running on a fresh checkout. Per-backend deltas are computed against a small snapshot cache at .sdd/observability/<backend>.json (suppressible via --no-persist). The dt, code-scanning, and observe Click commands are registered directly in bernstein.cli.main so the wiring survives independent refactors of advanced_cmd.py. A per-PR pr-observability-summary.yml workflow posts a sticky Markdown comment rendered from the observe JSON, and a daily docs-observability-snapshot.yml cron (06:00 UTC) writes docs/observability/snapshots/<date>.json and re-renders docs/observability/trends.md via a dependency-free unicode sparkline. Probe crash messages store only the exception type in persisted snapshots so tokens or URLs cannot leak. Docs at docs/observability/unified-doctor.md. Tests at tests/unit/cli/doctor/test_observe.py cover probe soft-fails, delta math, Click wiring, JSON shape, persistence toggle, and exit-code mapping.
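The soft-fail and delta-since-last-check behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `probe_with_delta`, its parameters, and the result dictionary shape are hypothetical names; only the per-backend JSON snapshot, the SKIPPED status, and the persistence toggle come from the notes.

```python
import json
import os
from pathlib import Path


def probe_with_delta(backend, required_env, fetch, cache_dir, persist=True):
    """Run one backend probe and compare it with the previous snapshot.

    Hypothetical helper: `fetch` is any callable returning {metric: number}.
    """
    # Soft-fail: a backend whose env vars are unset is reported, never raised.
    if any(not os.environ.get(name) for name in required_env):
        return {"backend": backend, "status": "SKIPPED", "metrics": {}}

    current = fetch()
    snapshot = Path(cache_dir) / f"{backend}.json"
    previous = json.loads(snapshot.read_text()) if snapshot.exists() else {}

    metrics = {
        name: {
            "value": value,
            # No delta on the first run, when there is nothing to compare with.
            "delta": value - previous[name] if name in previous else None,
        }
        for name, value in current.items()
    }
    if persist:  # plays the role of the --no-persist toggle
        snapshot.parent.mkdir(parents=True, exist_ok=True)
        snapshot.write_text(json.dumps(current))
    return {"backend": backend, "status": "OK", "metrics": metrics}
```

Keeping the snapshot as a plain `{metric: value}` map means the delta is a single subtraction per metric, and a missing or first-run snapshot degrades to "no delta" rather than an error.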
bernstein doctor sonar (#1648). New subcommand that pulls project measures from a configured SonarQube server: coverage, code smells by severity, bugs, vulnerabilities, security hotspots, and cognitive-complexity hotspots. Rich-table or --json output. Soft-fails (exit 0) when SONAR_HOST_URL / SONAR_TOKEN are unset and prints a one-line hint pointing to docs/observability/sonar.md. Advisory baseline at $XDG_DATA_HOME/bernstein/sonar-baseline.json lets the parent bernstein doctor group nudge when open smells exceed the threshold or vulnerabilities regress. 28 hermetic tests via httpx.MockTransport.
bernstein doctor glitchtip (#1646). New subcommand pulling last-24h issue counts by severity, a 7-day trend, and the top unresolved issues from the configured GlitchTip server. Rich-table or --json output. Soft-fails when BERNSTEIN_GLITCHTIP_TOKEN is unset. Optional baseline cache at ~/.local/share/bernstein/glitchtip-baseline.json powers a nudge under bernstein doctor --suggest-docs when the GlitchTip API reports new unresolved issues since the last check. 25 unit tests cover the fetcher, baseline persistence, nudge logic, Click wiring, and soft-fail behaviour.
Sticky PR Sonar comment (#1648). New .github/workflows/sonar-pr-comment.yml posts a sticky advisory PR comment with project-level Sonar measures. Soft signal only, never blocks merge.
Daily GlitchTip alert sweep (#1646). New .github/workflows/glitchtip-insights.yml (06:30 UTC + workflow_dispatch) mirrors fatal-level GlitchTip issues into sticky GitHub issues labelled glitchtip-alert. The mirror auto-closes when the GlitchTip side resolves. Workflow now validates HTTP status on the resolved-issues fetch and runs gh issue subprocesses with check=True so reconciliation failures fail the run instead of being swallowed.

Security

Approval-nonce binding (#1642). Mints a 16-byte server-generated nonce per pending approval. The reply must echo the exact value or the gate refuses to resolve, foreclosing stale-button replay on superseded prompts and any path where the agent process could forge its own approval response.
- core/approval/models: nonce field on PendingApproval (hex on the wire); to_dict(include_nonce=False) for adapter-facing serialisations; new ApprovalNonceMismatch / ApprovalNonceExpired errors.
- core/approval/queue: resolve() validates the supplied nonce in constant time. Server-internal callers (TTL evict, wait_for timeout) keep the back-compat no-nonce path so they cannot deadlock.
- core/routes/approvals: HTTP reply now requires a nonce. Mismatches surface 409 NONCE_MISMATCH. Replays against an evicted approval surface 410 NONCE_EXPIRED. The live-fragment HTML threads the nonce through the button handlers.
- cli/commands/approval_cmd: approve-tool / reject-tool read the on-disk record and thread the nonce back through resolve().
- A missing nonce body field defaults to an empty string at the schema layer so it flows through the handler and surfaces as 409 NONCE_MISMATCH via the existing _coerce_nonce guard, instead of being rejected at the Pydantic layer with 422.
- Closes #1619.
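A minimal sketch of the nonce binding described in this section, assuming an in-memory store. `ApprovalGate`, `NonceMismatch`, and `NonceExpired` are hypothetical stand-ins for the real queue and error classes; the 16-byte size, hex encoding, constant-time comparison, and single-use behaviour follow the notes.

```python
import hmac
import secrets


class NonceMismatch(Exception):
    """Stand-in for the case the notes map to 409 NONCE_MISMATCH."""


class NonceExpired(Exception):
    """Stand-in for the case the notes map to 410 NONCE_EXPIRED."""


class ApprovalGate:
    """Hypothetical in-memory gate; not the project's approval queue."""

    def __init__(self):
        self._pending = {}  # approval_id -> nonce bytes

    def open(self, approval_id):
        nonce = secrets.token_bytes(16)  # server-minted, 16 bytes
        self._pending[approval_id] = nonce
        return nonce.hex()  # hex on the wire

    def evict(self, approval_id):
        self._pending.pop(approval_id, None)

    def resolve(self, approval_id, nonce_hex):
        expected = self._pending.get(approval_id)
        if expected is None:
            raise NonceExpired(approval_id)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected.hex().encode(), nonce_hex.encode()):
            raise NonceMismatch(approval_id)
        del self._pending[approval_id]  # single use: a replay now expires
        return True
```

In this sketch an empty nonce takes the same mismatch path as a wrong one, which mirrors the documented choice to report a missing field as a mismatch rather than a schema error.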

Reliability and runtime

Single-writer RunActor (#1641). Introduces a per-session actor that owns canonical run state. Mutations flow as typed events through one async queue. A pure apply_event reducer applies them with monotonic seq numbers. ReplayBuffer is a bounded ring (default 1024) that emits an explicit Gap{up_to_seq} marker when a subscriber asks for an evicted range, so a reconnect-after-eviction is observable instead of silently corrupt. The approval gate gains an opt-in session_id kwarg that mirrors approval events into a registered RunActor via run_actor_registry. The file-driven decision contract is unchanged; the actor feed runs alongside. Migrating the remaining writers (worker subprocess, watchdog, lifecycle hooks, hooks_receiver) is a follow-up. Refs #1630.
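The bounded replay buffer with an explicit gap marker might look roughly like the sketch below. The `Gap` field name and the default capacity of 1024 come from the notes; the `append` / `replay_from` method names and the tuple event shape are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    up_to_seq: int  # highest sequence number that is no longer available


class ReplayBuffer:
    """Hypothetical bounded ring of (seq, payload) pairs."""

    def __init__(self, capacity=1024):
        self._events = deque(maxlen=capacity)  # oldest entries fall off
        self._next_seq = 1

    def append(self, payload):
        seq = self._next_seq  # monotonic sequence numbers
        self._next_seq += 1
        self._events.append((seq, payload))
        return seq

    def replay_from(self, after_seq):
        """Return events after `after_seq`, led by a Gap if some were evicted."""
        out = []
        oldest = self._events[0][0] if self._events else self._next_seq
        if after_seq + 1 < oldest:
            # The subscriber asked for a range the ring no longer holds.
            out.append(Gap(up_to_seq=oldest - 1))
        out.extend(item for item in self._events if item[0] > after_seq)
        return out
```

A subscriber that receives a `Gap` knows it must resynchronise from canonical state instead of assuming the replay was complete.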
Canonical stream-signal protocol (#1638). New core/protocols/stream_signals.py defines a small text-line vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED), a parser, a producer-side format helper, and conformance helpers. CLIAdapter grows an optional stream_signal_parser hook; the default delegates to the canonical parser, adapters override to map a native protocol onto the canonical vocabulary. ConformanceReport surfaces missing terminal signals as a soft warning so adapters without canonical signals stay visible without failing. Tests cover parse, format round-trip, malformed-input resilience, concurrent multi-adapter parsing, terminal-signal check, default vs. override hook behaviour, plan, and question round-trip. Docs at docs/adapters/stream_signals.md describe the vocabulary with shell and Python wrapper examples. Resolves #1632.
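As a rough illustration of a text-line signal vocabulary with a parser and a producer-side formatter, consider the sketch below. The six signal names come from the notes, but the wire format is an assumption: here a signal is a line consisting of the name, optionally followed by `: payload`. The actual format is specified in docs/adapters/stream_signals.md and may differ; `format_signal`, `parse_line`, and `parse_stream` are hypothetical names.

```python
import re

SIGNALS = ("COMPLETED", "FAILED", "QUESTION", "PLAN_DRAFT", "PLAN_READY", "BLOCKED")

# Assumed line shape: NAME or "NAME: payload", nothing else on the line.
_LINE = re.compile(r"^(" + "|".join(SIGNALS) + r")(?::\s*(.*))?$")


def format_signal(name, payload=""):
    """Producer side: render one signal as a single line of text."""
    if name not in SIGNALS:
        raise ValueError(f"unknown signal: {name}")
    if "\n" in payload:
        raise ValueError("payload must fit on one line")
    return f"{name}: {payload}" if payload else name


def parse_line(line):
    """Consumer side: return (name, payload), or None for ordinary output."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


def parse_stream(lines):
    """Pick the signals out of mixed stdout, ignoring everything else."""
    return [sig for line in lines if (sig := parse_line(line)) is not None]
```

Returning `None` for non-matching lines, rather than raising, is what lets a wrapper scan arbitrary CLI output without tripping on malformed input.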
Declarative task DAG (#1655). Adds a declarative task DAG layer so the planner sets per-task parallel safety at task-generation time instead of having the scheduler infer it from file overlap. The Task schema gains parallel_safe (default False) and story_id (Optional[str]) with round-trip support in Task.from_dict. The backlog parser recognises the [T<id>] [P] [USn] markdown checkbox format and the matching YAML frontmatter keys. New core/orchestration/task_dag.py provides TaskNode, TaskDag (markdown + YAML loaders), and topological_iter_with_parallel yielding ready batches; cycles raise TaskDagCycleError. adaptive_parallelism.tasks_safe_to_run_in_parallel consumes the declarative flag directly; the file-overlap heuristic is preserved only for legacy tasks that lack the attribute. CLI: bernstein plan dag --file <path> (also reachable as bernstein tasks dag --file) renders the DAG with parallel batches highlighted and lists story rollback groups. Docs at docs/orchestration/task-dag.md and docs/operations/task_format.md. Tests cover schema and parser round-trip, scheduler consumption, and single-task / sequential-chain / parallel-batch / mixed parallel-serial / cycle-detection paths. Closes #1634.
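One way to yield ready batches with cycle detection is a layered variant of Kahn's algorithm, sketched below. This is not the project's `topological_iter_with_parallel`; `ready_batches` and `CycleError` are hypothetical names, and the batching policy (parallel-safe tasks in a layer grouped together, all others emitted one at a time) is an assumption about the intended semantics.

```python
class CycleError(ValueError):
    """Raised when the remaining tasks all wait on each other."""


def ready_batches(deps, parallel_safe):
    """Yield lists of task ids that may start together.

    deps:          {task_id: set of prerequisite task ids}
    parallel_safe: {task_id: bool}; missing ids default to False,
                   matching the documented default for the flag.
    Every prerequisite must itself be a key of `deps`.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            # Nothing is unblocked, yet tasks remain: there is a cycle.
            raise CycleError(f"cycle among: {sorted(remaining)}")
        parallel = [t for t in ready if parallel_safe.get(t, False)]
        if parallel:
            yield parallel
        for task in ready:
            if not parallel_safe.get(task, False):
                yield [task]  # serial tasks run alone
        for task in ready:
            del remaining[task]
        for pre in remaining.values():
            pre.difference_update(ready)
```

Because this is a generator, a cycle is reported when iteration reaches it, not when the function is called.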

Quality and routing

Empirical-confidence ledger (#1653). New core/quality/empirical_confidence.py: an append-only SQLite ledger (agent_outcomes table) of per-decision outcomes, with a sample-size-gated ConfidenceQuery that returns None below the documented threshold (default 5) instead of fabricating a value. core/routing/model_recommender.py consults the ledger first; the existing capability-tier heuristic and the bandit arm remain as documented fall-backs for cells that have not accumulated enough samples. Default DB path: ${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db. Override via BERNSTEIN_CONFIDENCE_DB; threshold via BERNSTEIN_CONFIDENCE_MIN_SAMPLES. Docs at docs/quality/empirical-confidence.md cover the schema, the sample-size rationale, and the routing precedence order. 16 new ledger tests plus 8 router regression tests pass. Closes #1622.
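The sample-size gate can be illustrated with a small SQLite sketch. The `agent_outcomes` table name and the default threshold of 5 come from the notes; the column layout and the `open_ledger`, `record_outcome`, and `confidence` helpers are assumptions made for this example.

```python
import sqlite3

# Assumed columns; the real schema is documented in docs/quality/empirical-confidence.md.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_outcomes (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    model     TEXT    NOT NULL,
    task_kind TEXT    NOT NULL,
    success   INTEGER NOT NULL
)
"""


def open_ledger(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_outcome(conn, model, task_kind, success):
    """Append-only: outcomes are inserted, never updated or deleted."""
    conn.execute(
        "INSERT INTO agent_outcomes (model, task_kind, success) VALUES (?, ?, ?)",
        (model, task_kind, int(success)),
    )
    conn.commit()


def confidence(conn, model, task_kind, min_samples=5):
    """Return the measured success rate, or None below the sample threshold."""
    samples, wins = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(success), 0) FROM agent_outcomes "
        "WHERE model = ? AND task_kind = ?",
        (model, task_kind),
    ).fetchone()
    if samples < min_samples:
        return None  # caller falls back to the heuristic or bandit path
    return wins / samples
```

Returning `None` instead of a low-sample estimate keeps the fallback decision explicit at the call site.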

Planning gates

Spec-quality gate (#1652). New core/planning/spec_quality.py: a deterministic, library-only gate that evaluates a feature spec against a small, pluggable rule set before the orchestrator dispatches an implementer. Default rules cover acceptance-criteria-present, out-of-scope-present, tested-via-present, no-TODO, no-placeholder, and ref-paths-exist. Specs that fail any required rule route through a bounded auto-fix loop (default 3 iterations); when the budget is exhausted the gate raises SpecQualityUnresolvedError so callers can surface the unresolved items without re-evaluating. Rules are pluggable through the bernstein.spec_quality_rules entry-point group; broken plugins are skipped, never crash the gate, and plugin RuleResult ids are normalised to the owning rule. CLI surfaces: bernstein spec check <path> and bernstein spec auto-fix <path> (dry-run vs --write, strict vs no-strict). Path-like spec strings that raise OSError fall back to inline mode. Docs at docs/planning/spec-quality-gate.md. Tests at tests/unit/planning/test_spec_quality.py and tests/unit/cli/test_spec_cmd.py. Closes #1631.
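A bounded check-and-fix loop of the kind described here could be sketched as follows. The rule ids and the default budget of three iterations come from the notes; the heading strings each rule looks for, the `fixer` callback, and the `SpecUnresolved` exception are hypothetical stand-ins for the real rule set and `SpecQualityUnresolvedError`.

```python
import re


class SpecUnresolved(Exception):
    """Carries the rule ids that still fail after the fix budget is spent."""

    def __init__(self, failed):
        super().__init__(", ".join(failed))
        self.failed = failed


# Deterministic, library-only checks; the matched headings are assumptions.
RULES = {
    "acceptance-criteria-present": lambda spec: "## Acceptance criteria" in spec,
    "out-of-scope-present": lambda spec: "## Out of scope" in spec,
    "no-TODO": lambda spec: re.search(r"\bTODO\b", spec) is None,
}


def check(spec):
    """Return the ids of every rule the spec currently fails."""
    return [rule_id for rule_id, passes in RULES.items() if not passes(spec)]


def gate(spec, fixer, max_iterations=3):
    """Return a passing spec, or raise once the auto-fix budget is exhausted."""
    failed = check(spec)
    for _ in range(max_iterations):
        if not failed:
            return spec
        spec = fixer(spec, failed)  # hypothetical auto-fix step
        failed = check(spec)
    if failed:
        raise SpecUnresolved(failed)
    return spec
```

Attaching the failing rule ids to the exception lets the caller report unresolved items without evaluating the spec again.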

Skill customization

Three-layer skill merge (#1654). New core/skills/layered.py: BASE / TEAM / USER skill layers under XDG paths with a per-field merge spec where scalars override, tables deep-merge, keyed arrays replace by name / id / code, and unkeyed arrays append. Layers fall through cleanly when absent. CLI: bernstein skills list --layered surfaces layer-of-origin, and bernstein skills show <name> --per-layer shows the merged result alongside the raw per-layer diff. Docs at docs/skills/layered-merge.md. 30 new tests pin merge precedence, per-field granularity, deterministic output, and missing-layer fall-through. Closes #1624.
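The per-field merge rules above can be illustrated with a short recursive sketch. It is an assumption-laden illustration rather than the contents of core/skills/layered.py: `merge_value` and `merge_layers` are hypothetical names, and an array is treated as "keyed" here only when every item is a table carrying `name`, `id`, or `code`.

```python
KEY_FIELDS = ("name", "id", "code")


def _key_of(item):
    """Return an identity for a keyed array item, or None if it has none."""
    if isinstance(item, dict):
        for field in KEY_FIELDS:
            if field in item:
                return (field, item[field])
    return None


def merge_value(base, override):
    # Tables deep-merge key by key.
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = merge_value(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(override, list):
        items = base + override
        if items and all(_key_of(item) is not None for item in items):
            # Keyed arrays: an override item replaces the base item with the same key.
            merged = list(base)
            index = {_key_of(item): pos for pos, item in enumerate(merged)}
            for item in override:
                key = _key_of(item)
                if key in index:
                    merged[index[key]] = item
                else:
                    index[key] = len(merged)
                    merged.append(item)
            return merged
        return items  # unkeyed arrays append
    return override  # scalars override


def merge_layers(*layers):
    """Merge BASE, TEAM, USER in order; None or empty layers fall through."""
    result = {}
    for layer in layers:
        if layer:
            result = merge_value(result, layer)
    return result
```

Since each rule depends only on the value types and the layer order, the same inputs always produce the same merged output.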

Correctness

_run_git error formatter (#1644). Re-add the str() coercion inside the OSError / TimeoutExpired handler of git_context._run_git. The refurb wave 3 auto-fix (#1615) had dropped it, so calls with a Path inside the argv list (test_context, test_context_builder, test_failure_reduction all do this indirectly via cochange_files) raised FileNotFoundError, and the handler then crashed on " ".join(...) with expected str instance, PosixPath found, turning a debug log into a TypeError that bubbled up. Same fix as #1591, regressed by the wave-3 auto-fix.
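The failure mode is easy to reproduce in isolation: `str.join` rejects any non-string item, so an argv list containing a `Path` cannot be joined directly. The snippet below is a standalone illustration; `format_argv` is a hypothetical helper, not the body of `git_context._run_git`.

```python
from pathlib import Path


def format_argv(argv):
    """Render an argv list for a log message, coercing every part to str."""
    return " ".join(str(part) for part in argv)


argv = ["git", "log", "--", Path("app.py")]

try:
    " ".join(argv)  # what the error handler did without the coercion
    failure = None
except TypeError as exc:
    failure = str(exc)  # mentions "expected str instance"

safe = format_argv(argv)  # "git log -- app.py"
```

Coercing inside the formatter rather than at each call site keeps the error path safe regardless of what callers put in the argv list.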

CI and infrastructure

Sonar scan via workflow_run (#1645). The Sonar scan workflow was re-running the full unit suite under a single pytest --cov invocation. That suite needs per-file isolation to fit the runner memory budget, which is why ci.yml shards it across files and takes about 25 minutes. The naive single-process run only reached 5 percent of files within the 30 minute step timeout (the job-level timeout bump in #1616 did not lift the inner step cap). Switch sonar-scan.yml to a workflow_run trigger that fires after a successful CI run on main, download the coverage-report artifact CI already publishes, and feed it directly to the Sonar scanner. Also add sonar.ws.timeout=600 to guard the scanner client against slow server responses, and pin sonar.scm.revision to the upstream CI head SHA so the scan reports against the right commit.
Lint repair after #1638 (#1640). ruff format --check failed on core/quality/review_pipeline/review_gate.py after the stream-signal PR landed. Applying ruff format collapses several string and comprehension wrappings under the project's 120-character line length. No behaviour change.
Lint repair after #1655 (#1657). The task-DAG merge turned main red on Lint. Move Iterator and Path imports under TYPE_CHECKING in core/orchestration/task_dag.py (TC003, 2 sites), replace == True with is True in tests/unit/tasks/test_parallel_flag.py (E712), and run ruff format across the four files added or touched by #1655. No behaviour change.
Schemathesis smoke timeout (#1659). Widen the Schemathesis smoke step timeout so the property-based API smoke run stops being cancelled mid-flight under the normal main merge cadence, removing a recurring flaky-cancellation source on the merge train.
Docker runtime pin (#1664). The published runtime image (Dockerfile) and the demo image (docker/demo/Dockerfile) referenced python:3.14-slim while their inline comments still read python:3.12-slim. Both build the bernstein wheel and run adapter dependencies that require <=3.13, so both are pinned back to python:3.13-slim by digest with the stale comments corrected to match the repository python policy.
Sonar-scan workflow_run bootstrap (#1665). The workflow_run listener only fires when the upstream CI run on main concludes success, but ci.yml cancels in-progress runs per branch, so main CI almost never reaches success and the scan job's if-guard kept skipping. Make workflow_dispatch a reliable bootstrap and re-scan path: resolve the most recent successful CI run on main and pull its coverage-report artifact so a manual scan carries full Python coverage instead of scanning coverage-less. The workflow_run path is unchanged.
Review-bot-ack concurrency (#1666). The review-bot-ack workflow emits a required status check on every PR. With cancel-in-progress: true and a per-PR concurrency group, overlapping events (synchronize on push, pull_request_review on review submit) routinely cancelled an in-flight gate run, and a CANCELLED conclusion reads as a non-success required check that stalled the merge queue. Scope the concurrency group per-PR and per-head-sha and set cancel-in-progress: false so every commit's gate run completes against its own sha. Adds a CI workflow-health sweep summary at docs/ci/workflow-health-2026-05-20.md covering all 47 registered workflows.

Documentation

Doc-drift refresh (#1677). Reconcile docs/concepts/ and docs/gui/ prose with the current source-of-truth public surfaces across 16 documents, correcting renamed CLI surfaces, signatures, and config knobs: action-cache subcommands and metric names, swarm-migration --id flag, validate_with_retry positional signature, FeatureContract-driven spec-as-test assertions, select_sandbox(backends, ...) return and raises, team-hub 64 KiB manifest cap, BestOfNDefaults config knobs, cpu_pause_threshold load-units default, route_for_phase per-phase router, fingerprint-memoization default_store factory, LineageReader.iter_records(run_id) with --limit, and the async summarize_diff returning a list. docs/sdd/ verified in sync (no change).

Quality and refurb waves

Wave 4 (FURB184 + leftovers) (#1643). Conservative libcst / ast-based rewrites that preserve semantics. Counts in src/: FURB184 197 -> 34 (163 fixed), FURB138 42 -> 8 (34 fixed), FURB124 29 -> 3 (26 fixed), FURB142 16 -> 0 (16 fixed), FURB113 23 -> 21 (2 fixed; remainder have intervening comments that act as section dividers). Followed by a ruff format pass over 36 files to wrap E501 long-line comprehensions, plus four targeted fixes for broken seen in seen self-referential dedup comprehensions in spec_assertions, pr_review_aggregator, review_responder.models, and tui.approval_panel (replaced with dict.fromkeys() for order-preserving dedup).
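For reference, the `dict.fromkeys()` idiom mentioned above removes duplicates while keeping first-seen order, because dictionaries preserve insertion order. A minimal sketch, with `dedupe` as a hypothetical helper name and the restriction that items must be hashable:

```python
def dedupe(items):
    """Drop repeated items, keeping the first occurrence of each in order."""
    return list(dict.fromkeys(items))


ordered = dedupe(["b", "a", "b", "c", "a"])  # ['b', 'a', 'c']
```

Unlike a comprehension that consults a `seen` set it is still building, this form has no hidden state to get wrong.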
Cluster D (FURB139 / 143 / 179 strings and enumerate) (#1647). 16 refurb autofixes: FURB139 drops leading / trailing newlines in nine multi-line GraphQL query constants by switching to line-continuation backslashes; FURB143 drops one redundant outer or "" after str(... or "") in jira_dc_adapter; FURB179 flattens six nested list / set comprehensions to itertools.chain.from_iterable in bulletin, orchestrator (x4), and capability_matrix. Three FURB143 alerts skipped intentionally where defensive or "" guards external API boundaries (importlib.metadata fields, externally-typed input strings).
Cluster E (FURB182 / 183 / 142 / 101 misc) (#1649). 33 safe refurb rewrites across 21 files: FURB182 folds the first hashlib.update() into the sha256() constructor (10 sites); FURB142 replaces for x in iter: s.add(...) with s.update(...) (16 sites); FURB101 replaces with open(p) as f: y = f.read() with Path(p).read_text/bytes() (5 sites); FURB183 replaces f"{x}" with str(x) where the format spec is empty (2 sites). Refurb now reports 0 alerts for these rules in src/.
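The four rewrite families in this cluster are behaviour-preserving, which the following standalone comparison illustrates. The values and file name are made up for the example; none of this is code from the repository.

```python
import hashlib
import tempfile
from pathlib import Path

data = b"payload"

# FURB182: fold the first update() into the constructor.
before_hash = hashlib.sha256()
before_hash.update(data)
after_hash = hashlib.sha256(data)

# FURB142: replace an add() loop with a single update().
words = ["a", "b", "a"]
loop_set = set()
for word in words:
    loop_set.add(word.upper())
update_set = set()
update_set.update(word.upper() for word in words)

# FURB101: replace open()/read() with Path.read_text().
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_text("hello", encoding="utf-8")
    with open(path, encoding="utf-8") as handle:
        via_open = handle.read()
    via_path = path.read_text(encoding="utf-8")

# FURB183: an f-string with an empty format spec is just str().
value = 42
via_fstring = f"{value}"
via_str = str(value)
```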
Cluster B (FURB109 / 108 / 126 control flow) (#1651). 53 refurb idiom fixes across 44 files in src/bernstein/: FURB109 (23 sites) uses tuples instead of lists for static in membership and for iteration over fixed sequences; FURB108 (18 sites) collapses x == a or x == b chains to x in (a, b); FURB126 (12 sites) drops redundant else / case _ after a return and relies on fall-through. Pure control-flow and literal rewrites with no behavioural change; verified with ruff check clean on touched files, compileall clean, and a targeted pytest sweep (320+ tests) over affected modules.

New and changed CLI commands

bernstein plan dag --file <path> / bernstein tasks dag --file <path> (new). Renders the task DAG with parallel batches highlighted and lists story rollback groups derived from story_id annotations.
bernstein doctor sonar (new). Surfaces project measures from SonarQube. Flags: --json, baseline cache override via XDG_DATA_HOME.
bernstein doctor glitchtip (new). Surfaces last-24h issue counts, 7-day trend, and top unresolved issues. Flags: --json, --top-n (IntRange(min=1)).
bernstein doctor --suggest-docs (extended). Now also prints one-line GlitchTip and Sonar nudges when the respective APIs report new unresolved issues or threshold regressions since the cached baseline; failures are logged and suppressed (never crashes the doctor command).
bernstein approve-tool / bernstein reject-tool (changed). Read the on-disk pending-approval record and thread the server-minted nonce back through resolve(). Operators using the CLI path see no behaviour change; integrators calling resolve() directly must thread the nonce or use the server-internal back-compat path.

Upgrade notes

Drop-in upgrade from v2.3.1. No config-schema changes, no audit-chain changes.
Approval API change. HTTP approval replies now require a nonce field. The live-fragment HTML threads the nonce through automatically; external integrators calling the approval endpoint directly need to echo the nonce from the pending-approval payload. Missing or empty nonce returns 409 NONCE_MISMATCH. Replays against an evicted approval return 410 NONCE_EXPIRED.
Sonar workflow trigger changed. .github/workflows/sonar-scan.yml is now workflow_run against the CI workflow on main. Operators with a fork running their own Sonar scan should mirror the same trigger or set SONAR_HOST_URL / SONAR_TOKEN to point at their own server.
New optional env vars. BERNSTEIN_GLITCHTIP_TOKEN (for bernstein doctor glitchtip), optional overrides BERNSTEIN_GLITCHTIP_BASE_URL and BERNSTEIN_GLITCHTIP_ORG. SONAR_HOST_URL and SONAR_TOKEN for bernstein doctor sonar. The GitHub workflows expect GLITCHTIP_API_TOKEN and (for Sonar) SONAR_TOKEN as repo secrets. None of these are required; both commands soft-fail with a one-line hint when unset.
RunActor is opt-in. Existing flows that do not pass session_id into the approval gate continue to work unchanged.
Empirical-confidence ledger is created lazily. On first write, an SQLite file is created at ${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db. Override the path with BERNSTEIN_CONFIDENCE_DB, the sample threshold with BERNSTEIN_CONFIDENCE_MIN_SAMPLES. The model recommender falls back to the existing capability-tier and bandit paths when the ledger lacks a qualifying sample, so existing runs are unaffected.
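The ledger path and threshold resolution described in the last note can be sketched as below. The environment variable names and defaults come from the notes; `confidence_db_path` and `min_samples` are hypothetical helpers, and an empty `XDG_DATA_HOME` is treated as unset, matching the `${VAR:-default}` shell form.

```python
import os
from pathlib import Path


def confidence_db_path(env=None):
    """Resolve the ledger location: explicit override first, then XDG default."""
    env = os.environ if env is None else env
    override = env.get("BERNSTEIN_CONFIDENCE_DB")
    if override:
        return Path(override)
    data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(data_home) / "bernstein" / "empirical-confidence.db"


def min_samples(env=None):
    """Resolve the sample threshold, defaulting to the documented value of 5."""
    env = os.environ if env is None else env
    return int(env.get("BERNSTEIN_CONFIDENCE_MIN_SAMPLES", "5"))
```

Passing `env` explicitly makes the resolution order easy to test without touching the process environment.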

Internal

Review-bot acknowledgement gate caught seven CodeRabbit must-address findings on #1646 across workflow status validation, gh issue subprocess check=True, doc clarification on soft-fail conditions, narrower import-time exception handling, logging of unexpected fetch failures, IntRange(min=1) on --top-n, and dropping a truthy fallback in summarise_severity / _bucket_trend_by_day that was inflating legitimate zero counts to one.
Sourcery flagged the empty-nonce-body case on #1642; default the field to an empty string at the schema layer so the documented 409 NONCE_MISMATCH contract holds.
_run_git regression test coverage hardened by re-adding the str() coercion in the error formatter and re-running the three failing tests (test_context::test_returns_list, test_context_builder::test_includes_file_summary_for_python_files, test_failure_reduction::test_task_context_includes_file_info).

Acknowledgements

This release is operator-only; no external contributor PRs landed in the v2.3.1..v2.4.0 window.

Full changelog

feat (10)

f84bde93 feat(adapters): canonical stream-signal protocol for adapter stdout (#1638)
2df0e9c1 feat(orchestration): single-writer run-state actor with bounded replay buffer (#1641)
fd231bc7 feat(security): bind approval responses to single-use nonce (#1642)
51f330a6 feat(observability): sonar insights surface + doctor subcommand + delta nudge (#1648)
1a50c36a feat(observability): GlitchTip insights surface + doctor subcommand + daily alert workflow (#1646)
c42607b4 feat(quality): empirical confidence from outcome history (#1653)
ec430dff feat(orchestration): task DAG with explicit parallel flag + story-link grouping (#1655)
05f582a2 feat(planning): auto spec-quality checklist refuses to advance until clean (#1652)
381c3b6f feat(skills): three-layer customization with deterministic merge (#1654)
15b5b1d0 feat(observability): unified bernstein doctor observe + per-PR insights summary + daily trends (#1650)

fix (8)

80c819b8 fix(lint): repair main-red after #1638 merge (#1640)
27ba6885 fix(test): restore str() coercion in _run_git error formatter (#1644)
b7bc28fe fix(ci): reuse coverage artifact in Sonar scan instead of re-running tests (#1645)
a0f26de7 fix(lint): repair main-red after #1655 task-DAG merge (#1657)
a1449b4f fix(ci): widen Schemathesis smoke timeout to stop flaky cancellations (#1659)
ab72c5bd fix(docker): pin runtime images to python:3.13-slim (#1664)
b7f288d5 fix(ci): repair sonar-scan workflow_run trigger so first scan populates the project (#1665)
006743ee fix(ci): stop review-bot-ack from cancelling its own required check (#1666)

refactor (4)

6fe31edc refactor: bulk refurb autofix wave 4 (FURB184 + leftovers) (#1643)
eb112d2e refactor: refurb cluster E (FURB182/183/142/101 misc) (#1649)
2fb0d26c refactor: refurb cluster D (FURB139/143/179 strings/enumerate) (#1647)
d684739c refactor: refurb cluster B (FURB109/108/126 control flow) (#1651)

docs (1)

ca6a2dab docs(refresh): concepts + gui + sdd partitions per drift playbook (#1677)

chore / deps (10)

e6ca20a2 chore(release): v2.4.0 (#1658)
7b047af9 chore(deps): update marocchino/sticky-pull-request-comment action to v2.9.4 (#1661)
49b9766b chore(deps): update dependency python to 3.13 (#1663)
1107e76d chore(deps): update peter-evans/create-pull-request action to v7.0.11 (#1662)
93d47267 chore(deps): bump peter-evans/create-pull-request from 7.0.11 to 8.1.1 (#1667)
e999818e chore(deps): update marocchino/sticky-pull-request-comment action to v3 (#1671)
852e3778 chore(deps): bump marocchino/sticky-pull-request-comment (#1669)
1016b352 chore(deps): update gcr.io/oss-fuzz-base/base-builder-python docker digest to 04d1a93 (#1670)
48051a8b chore(deps): bump actions/setup-python from 5 to 6 (#1668)
17021db4 chore(deps): update python:3.13-slim docker digest to 9ca3cf9 (#1678)

Breaking Changes

Approval API now requires a `nonce` field; missing or empty nonce returns `409 NONCE_MISMATCH`, evicted replays return `410 NONCE_EXPIRED`.
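A minimal sketch of how a single-use nonce contract like this can behave, for readers adapting clients to the new status codes. Everything here is an assumption for illustration: the NonceStore class, its method names, and the exact mapping (a superseded or empty nonce yields 409, a replay of an already consumed nonce yields 410) are not taken from the bernstein source.

```python
import secrets


class NonceStore:
    """Hypothetical in-memory store binding each prompt to one live nonce."""

    def __init__(self):
        self._live = {}        # prompt_id -> nonce awaiting a response
        self._evicted = set()  # consumed nonces (unbounded here for brevity)

    def mint(self, prompt_id):
        # Re-minting for the same prompt supersedes the previous nonce.
        nonce = secrets.token_urlsafe(16)
        self._live[prompt_id] = nonce
        return nonce

    def respond(self, prompt_id, nonce=""):
        # Replay of a nonce that was already consumed and evicted.
        if nonce and nonce in self._evicted:
            return 410, "NONCE_EXPIRED"
        expected = self._live.get(prompt_id)
        # Missing, empty, or stale nonce: reject as a mismatch.
        if not nonce or expected is None or not secrets.compare_digest(
            nonce.encode(), expected.encode()
        ):
            return 409, "NONCE_MISMATCH"
        # Single use: consume the nonce so it cannot be replayed.
        del self._live[prompt_id]
        self._evicted.add(nonce)
        return 200, "OK"


if __name__ == "__main__":
    store = NonceStore()
    first = store.mint("prompt-1")
    second = store.mint("prompt-1")            # supersedes `first`
    print(store.respond("prompt-1", first))    # stale button
    print(store.respond("prompt-1", second))   # accepted once
    print(store.respond("prompt-1", second))   # replay
```

Clients upgrading from v2.3.x would need to echo back the nonce they received with the prompt; under the assumptions above, a client that omits the field is rejected the same way as one sending a stale value.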

About chernistry/bernstein

Deterministic multi-agent orchestrator for 18 CLI coding agents (Claude Code, Codex, Cursor, Aider, Gemini CLI, OpenAI Agents SDK, and more). MCP server mode (stdio + HTTP/SSE) exposes the orchestrator to any MCP client. Git worktree isolation per agent, HMAC-chained audit trail, cost-aware model routing via contextual bandit. ~11K monthly PyPI downloads, Apache 2.0.

Related context

Earlier breaking changes

v3.7.1 `bernstein approve` and `bernstein reject` now enforce identifier regex `[A-Za-z0-9._-]{1,64}`.
v3.7.1 Tampered mission ledger reports as unverified rather than not-found.
v3.7.1 `mission define` now refuses phases without gate tasks.
v3.5.0 MCP client, transport, and gateway become stateless; calls carry content‑derived trace IDs in _meta.