This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Broad release that touches CI and infrastructure, observability surfaces, and feature work.
Changes in this release
| Type | Severity | Summary | CVE | Confidence |
|---|---|---|---|---|
| Security | Medium | Approval responses bound to server-minted single-use nonce; mismatches surface as `409 NONCE_MISMATCH`, evicted replays as `410 NONCE_EXPIRED`, foreclosing stale-button replay on superseded prompts. | — | high |
| Feature | Medium | Unified `bernstein doctor observe` aggregates four observability backends into one table with delta-since-last-check, per-PR sticky summary comment, and daily trends snapshot. | — | high |
| Feature | Medium | Single-writer `RunActor` owns canonical per-session state behind async event queue with bounded replay buffer emitting `Gap` marker on eviction. | — | high |
| Feature | Medium | Spec-quality gate refuses to advance feature spec until deterministic library-only rule set passes, routing failures through auto-fix loop and surfacing unresolved items to operator. | — | high |
| Feature | Medium | Declarative task DAG adds `parallel_safe` and `story_id` fields; backlog parser learns markdown checkboxes; `topological_iter_with_parallel` yields ready batches honouring cycle detection; `bernstein plan dag` / `bernstein tasks dag` render DAG with parallel batches highlighted. | — | high |
| Feature | Medium | Three-layer skill customization (BASE/TEAM/USER) under XDG paths with deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly. | — | high |
| Feature | Medium | Empirical-confidence ledger backs model recommender with per-decision outcomes in SQLite store; prefers measured outcomes over capability-tier heuristic and bandit arm, refusing values below documented threshold (default 5). | — | high |
| Feature | Medium | `bernstein doctor sonar` subcommand pulls project measures from SonarQube with rich-table or JSON output; soft-fails when env vars unset. | — | high |
| Feature | Medium | `bernstein doctor glitchtip` subcommand pulls last-24h issue counts, 7-day trend, and top unresolved issues from GlitchTip; soft-fails when token unset. | — | high |
| Feature | Medium | Per-PR sticky Sonar comment workflow posts advisory PR comment with project-level Sonar measures; never blocks merge. | — | high |
| Feature | Medium | Daily GlitchTip alert sweep workflow mirrors fatal-level issues into sticky GitHub issues labelled `glitchtip-alert` and auto-closes when resolved. | — | high |
| Performance | Medium | Sonar scan workflow now consumes existing coverage artifact via `workflow_run`, avoiding full re-run of unit suite and fitting memory budget. | — | high |
| Bugfix | Medium | Restores `str()` coercion in `_run_git` error formatter to prevent `TypeError` when `Path` used in argv list. | — | low |
| Refactor | Medium | Bulk refurb autofix wave 4 (FURB184 + leftovers) applies mechanical idiom rewrites across `src/`, including 163 FURB184 fixes. | — | low |
| Refactor | Medium | Refurb cluster D (FURB139 / 143 / 179 strings and enumerate) applies 16 autofixes for GraphQL query constants, redundant outer `or`, and nested list/set comprehensions. | — | low |
| Refactor | Medium | Refurb cluster E (FURB182 / 183 / 142 / 101 misc) performs 33 safe rewrites: folds `hashlib.update` into `sha256` constructor, replaces `for x in iter: s.add` with `s.update`, switches `open` to `Path.read_text/bytes`, and simplifies empty format expressions. | — | low |
| Refactor | Medium | Refurb cluster B (FURB109 / 108 / 126 control flow) uses tuples instead of lists for static membership, collapses `x == a or x == b` to `x in (a, b)`, and drops redundant `else` after `return`. | — | low |
| Other | Medium | Doc-drift refresh reconciles 16 documents with current source-of-truth public surfaces across concepts, GUI, SDD partitions, and more. | — | low |

Summary source: granite4.1:8b-q6_K@2026-05-20.
Full changelog
v2.4.0 - Observability surfaces, single-writer run state, declarative planning gates
Release date: 2026-05-20
Commits since v2.3.1: 33
Highlights
- Unified `bernstein doctor observe` umbrella rolls the four observability backends (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) into one aggregated table with delta-since-last-check, plus a per-PR sticky summary comment and a daily trends snapshot. Each backend soft-fails to `SKIPPED` when its env vars are unset, so a fresh checkout stays green.
- Single-writer `RunActor` owns canonical per-session state behind one async event queue with a bounded replay buffer that emits an explicit `Gap{up_to_seq}` marker on eviction, making reconnect-after-eviction observable instead of silently lossy.
- Spec-quality gate refuses to advance a feature spec until a deterministic, library-only rule set passes; failures route through a bounded auto-fix loop and surface unresolved items to the operator rather than dispatching an implementer against a weak spec.
- Declarative task DAG: tasks gain `parallel_safe` and `story_id` fields, the backlog parser learns `[T<id>] [P] [USn]` markdown checkboxes, `topological_iter_with_parallel` yields ready batches honouring cycle detection, and `bernstein plan dag` / `bernstein tasks dag` render the DAG with parallel batches highlighted; this replaces the file-overlap heuristic for tasks that declare the flag while preserving the legacy heuristic as a fall-back.
- Three-layer skill customization (BASE / TEAM / USER) under XDG paths with a per-field deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append; missing layers fall through cleanly.
- Empirical-confidence ledger backs the model recommender: an append-only SQLite store of per-decision outcomes feeds a sample-size-gated query that prefers measured outcomes over the capability-tier heuristic and over the bandit arm, refusing to return a value below a documented threshold (default 5).
- Approval responses are now bound to a 16-byte server-minted single-use nonce; mismatches surface as `409 NONCE_MISMATCH` and evicted replays as `410 NONCE_EXPIRED`, foreclosing stale-button replay on superseded prompts.
- Canonical stream-signal vocabulary (`COMPLETED`, `FAILED`, `QUESTION`, `PLAN_DRAFT`, `PLAN_READY`, `BLOCKED`) parseable from any wrapped CLI stdout so non-stream-json adapters surface lifecycle events through the same channel as native stream-json adapters.
- CI hardening across the board: the Sonar scan consumes the existing coverage artifact via `workflow_run` (and `workflow_dispatch` bootstraps a coverage-bearing first scan), the review-bot-ack gate no longer cancels its own required check, the Schemathesis smoke timeout is widened to stop flaky cancellations, and the runtime Docker images are pinned back to `python:3.13-slim`.
- Four refurb auto-fix waves (wave 4 plus clusters B / D / E) land about 320 mechanical idiom rewrites across `src/`, taking FURB142 to zero and substantially reducing the FURB184 / FURB138 / FURB124 / FURB182 / FURB101 / FURB109 / FURB108 / FURB126 backlog.
What ships
Observability surfaces
- Unified `bernstein doctor observe` (#1650). Umbrella command that runs each per-backend probe (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) in order and renders one aggregated Rich table with metric, value, delta-since-last-check, threshold, and status columns. Supports `--json` (machine-readable) and `--watch` (re-runs every 60 seconds). Each backend soft-fails to `SKIPPED` when its env vars are unset so the umbrella keeps running on a fresh checkout. Per-backend deltas are computed against a small snapshot cache at `.sdd/observability/<backend>.json` (suppressible via `--no-persist`). The `dt`, `code-scanning`, and `observe` Click commands are registered directly in `bernstein.cli.main` so the wiring survives independent refactors of `advanced_cmd.py`. A per-PR `pr-observability-summary.yml` workflow posts a sticky Markdown comment rendered from the observe JSON, and a daily `docs-observability-snapshot.yml` cron (06:00 UTC) writes `docs/observability/snapshots/<date>.json` and re-renders `docs/observability/trends.md` via a dependency-free unicode sparkline. Probe crash messages store only the exception type in persisted snapshots so tokens or URLs cannot leak. Docs at `docs/observability/unified-doctor.md`. Tests at `tests/unit/cli/doctor/test_observe.py` cover probe soft-fails, delta math, Click wiring, JSON shape, persistence toggle, and exit-code mapping.
- `bernstein doctor sonar` (#1648). New subcommand pulling project measures from a configured SonarQube server: coverage, code smells by severity, bugs, vulnerabilities, security hotspots, and cognitive-complexity hotspots. Rich-table or `--json` output. Soft-fails (exit 0) when `SONAR_HOST_URL` / `SONAR_TOKEN` are unset and prints a one-line hint at `docs/observability/sonar.md`. Advisory baseline at `$XDG_DATA_HOME/bernstein/sonar-baseline.json` lets the parent `bernstein doctor` group nudge when open smells exceed the threshold or vulnerabilities regress. 28 hermetic tests via `httpx.MockTransport`.
- `bernstein doctor glitchtip` (#1646). New subcommand pulling last-24h issue counts by severity, a 7-day trend, and the top unresolved issues from the configured GlitchTip server. Rich-table or `--json` output. Soft-fails when `BERNSTEIN_GLITCHTIP_TOKEN` is unset. Optional baseline cache at `~/.local/share/bernstein/glitchtip-baseline.json` powers a nudge under `bernstein doctor --suggest-docs` when the GlitchTip API reports new unresolved issues since the last check. 25 unit tests cover the fetcher, baseline persistence, nudge logic, Click wiring, and soft-fail behaviour.
- Sticky PR Sonar comment (#1648). New `.github/workflows/sonar-pr-comment.yml` posts a sticky advisory PR comment with project-level Sonar measures. Soft signal only, never blocks merge.
- Daily GlitchTip alert sweep (#1646). New `.github/workflows/glitchtip-insights.yml` (06:30 UTC + `workflow_dispatch`) mirrors fatal-level GlitchTip issues into sticky GitHub issues labelled `glitchtip-alert`. The mirror auto-closes when the GlitchTip side resolves. Workflow now validates HTTP status on the resolved-issues fetch and runs `gh issue` subprocesses with `check=True` so reconciliation failures fail the run instead of being swallowed.
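
The delta-since-last-check column described above can be pictured as a diff against a small per-backend JSON snapshot. The following is a minimal sketch under that reading, not the shipped implementation: the function name `delta_since_last_check`, the metric names, and the `persist` parameter (standing in for the documented `--no-persist` switch) are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def delta_since_last_check(backend, current, cache_dir, persist=True):
    """Return {metric: delta} against the previous snapshot (None if unseen)."""
    snapshot = Path(cache_dir) / f"{backend}.json"
    previous = json.loads(snapshot.read_text()) if snapshot.exists() else {}
    deltas = {
        name: (value - previous[name]) if name in previous else None
        for name, value in current.items()
    }
    if persist:  # hypothetical equivalent of leaving --no-persist off
        snapshot.parent.mkdir(parents=True, exist_ok=True)
        snapshot.write_text(json.dumps(current))
    return deltas


with tempfile.TemporaryDirectory() as tmp:
    # First run has no snapshot, so every delta is None.
    first = delta_since_last_check("sonar", {"bugs": 4, "code_smells": 120}, tmp)
    # Second run diffs against the values persisted by the first.
    second = delta_since_last_check("sonar", {"bugs": 3, "code_smells": 125}, tmp)
```

Returning `None` for a metric with no prior value keeps "no baseline yet" distinct from "no change", which matters on the fresh-checkout case the notes call out.
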
Security
- Approval-nonce binding (#1642). Mints a 16-byte server-generated nonce per pending approval. The reply must echo the exact value or the gate refuses to resolve, foreclosing stale-button replay on superseded prompts and any path where the agent process could forge its own approval response.
  - `core/approval/models`: `nonce` field on `PendingApproval` (hex on the wire); `to_dict(include_nonce=False)` for adapter-facing serialisations; new `ApprovalNonceMismatch` / `ApprovalNonceExpired` errors.
  - `core/approval/queue`: `resolve()` validates the supplied nonce in constant time. Server-internal callers (TTL evict, `wait_for` timeout) keep the back-compat no-nonce path so they cannot deadlock.
  - `core/routes/approvals`: HTTP reply now requires a nonce. Mismatches surface `409 NONCE_MISMATCH`. Replays against an evicted approval surface `410 NONCE_EXPIRED`. The live-fragment HTML threads the nonce through the button handlers.
  - `cli/commands/approval_cmd`: `approve-tool` / `reject-tool` read the on-disk record and thread the nonce back through `resolve()`.
  - A missing `nonce` body field defaults to an empty string at the schema layer so it flows through the handler and surfaces as `409 NONCE_MISMATCH` via the existing `_coerce_nonce` guard, instead of being rejected at the Pydantic layer with `422`.
  - Closes #1619.
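
The nonce contract above (16 random bytes, hex on the wire, constant-time check, single use) can be sketched with the standard library alone. This is an illustration of the technique, not bernstein's code: `ApprovalGate`, `NonceMismatch`, and `NonceExpired` are hypothetical names standing in for the real queue and for the `409` / `410` cases.

```python
import hmac
import secrets


class NonceMismatch(Exception):
    """Hypothetical stand-in for the 409 NONCE_MISMATCH case."""


class NonceExpired(Exception):
    """Hypothetical stand-in for the 410 NONCE_EXPIRED case."""


class ApprovalGate:
    def __init__(self):
        self._pending = {}  # approval_id -> hex nonce

    def open(self, approval_id):
        nonce = secrets.token_hex(16)  # 16 random bytes, 32 hex characters
        self._pending[approval_id] = nonce
        return nonce

    def evict(self, approval_id):
        self._pending.pop(approval_id, None)

    def resolve(self, approval_id, supplied_nonce):
        expected = self._pending.get(approval_id)
        if expected is None:
            raise NonceExpired(approval_id)
        # compare_digest avoids leaking the match position through timing.
        if not hmac.compare_digest(expected.encode(), supplied_nonce.encode()):
            raise NonceMismatch(approval_id)
        del self._pending[approval_id]  # single use: a replay now finds nothing
        return True
```

In this sketch an empty nonce takes the same mismatch path as a wrong one, mirroring the documented choice to surface a missing field as `409 NONCE_MISMATCH` rather than a schema error.
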
Reliability and runtime
- Single-writer `RunActor` (#1641). Introduces a per-session actor that owns canonical run state. Mutations flow as typed events through one async queue. A pure `apply_event` reducer applies them with monotonic seq numbers. `ReplayBuffer` is a bounded ring (default 1024) that emits an explicit `Gap{up_to_seq}` marker when a subscriber asks for an evicted range, so a reconnect-after-eviction is observable instead of silently corrupt. The approval gate gains an opt-in `session_id` kwarg that mirrors approval events into a registered `RunActor` via `run_actor_registry`. The file-driven decision contract is unchanged; the actor feed runs alongside. Migrating the remaining writers (worker subprocess, watchdog, lifecycle hooks, `hooks_receiver`) is a follow-up. Refs #1630.
- Canonical stream-signal protocol (#1638). New `core/protocols/stream_signals.py` defines a small text-line vocabulary (`COMPLETED`, `FAILED`, `QUESTION`, `PLAN_DRAFT`, `PLAN_READY`, `BLOCKED`), a parser, a producer-side format helper, and conformance helpers. `CLIAdapter` grows an optional `stream_signal_parser` hook; the default delegates to the canonical parser, adapters override to map a native protocol onto the canonical vocabulary. `ConformanceReport` surfaces missing terminal signals as a soft warning so adapters without canonical signals stay visible without failing. Tests cover parse, format round-trip, malformed-input resilience, concurrent multi-adapter parsing, terminal-signal check, default vs. override hook behaviour, plan, and question round-trip. Docs at `docs/adapters/stream_signals.md` describe the vocabulary with shell and Python wrapper examples. Resolves #1632.
- Declarative task DAG (#1655). Adds a declarative task DAG layer so the planner sets per-task parallel safety at task-generation time instead of having the scheduler infer it from file overlap. The `Task` schema gains `parallel_safe` (default `False`) and `story_id` (`Optional[str]`) with round-trip support in `Task.from_dict`. The backlog parser recognises the `[T<id>] [P] [USn]` markdown checkbox format and the matching YAML frontmatter keys. New `core/orchestration/task_dag.py` provides `TaskNode`, `TaskDag` (markdown + YAML loaders), and `topological_iter_with_parallel` yielding ready batches; cycles raise `TaskDagCycleError`. `adaptive_parallelism.tasks_safe_to_run_in_parallel` consumes the declarative flag directly; the file-overlap heuristic is preserved only for legacy tasks that lack the attribute. CLI: `bernstein plan dag --file <path>` (also reachable as `bernstein tasks dag --file`) renders the DAG with parallel batches highlighted and lists story rollback groups. Docs at `docs/orchestration/task-dag.md` and `docs/operations/task_format.md`. Tests cover schema and parser round-trip, scheduler consumption, and single-task / sequential-chain / parallel-batch / mixed parallel-serial / cycle-detection paths. Closes #1634.
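
For the `RunActor` entry above, the idea of a bounded ring that reports eviction explicitly can be sketched as follows. The class and field names echo the release notes, but the method names (`append`, `replay_from`) and the tuple event shape are assumptions for illustration, not the shipped API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    up_to_seq: int  # highest sequence number that can no longer be replayed


class ReplayBuffer:
    def __init__(self, capacity=1024):
        self._events = deque(maxlen=capacity)  # (seq, payload); oldest dropped first
        self._next_seq = 1

    def append(self, payload):
        seq = self._next_seq
        self._events.append((seq, payload))
        self._next_seq += 1
        return seq

    def replay_from(self, after_seq):
        """Return events with seq > after_seq, led by a Gap if some were evicted."""
        out = []
        oldest = self._events[0][0] if self._events else self._next_seq
        if after_seq + 1 < oldest:
            out.append(Gap(up_to_seq=oldest - 1))
        out.extend(item for item in self._events if item[0] > after_seq)
        return out


buf = ReplayBuffer(capacity=3)
for name in ["a", "b", "c", "d", "e"]:
    buf.append(name)

late = buf.replay_from(1)     # subscriber missed seq 2, which was evicted
current = buf.replay_from(3)  # subscriber is within the retained window
```

A subscriber that receives a `Gap` knows its local view is incomplete and can resynchronise from canonical state instead of applying a partial stream.
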
Quality and routing
- Empirical-confidence ledger (#1653). New `core/quality/empirical_confidence.py`: an append-only SQLite ledger (`agent_outcomes` table) of per-decision outcomes, with a sample-size-gated `ConfidenceQuery` that returns `None` below the documented threshold (default 5) instead of fabricating a value. `core/routing/model_recommender.py` consults the ledger first; the existing capability-tier heuristic and the bandit arm remain as documented fall-backs for cells that have not accumulated enough samples. Default DB path: `${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db`. Override via `BERNSTEIN_CONFIDENCE_DB`; threshold via `BERNSTEIN_CONFIDENCE_MIN_SAMPLES`. Docs at `docs/quality/empirical-confidence.md` cover the schema, the sample-size rationale, and the routing precedence order. 16 new ledger tests plus 8 router regression tests pass. Closes #1622.
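
A sample-size-gated query of this kind can be sketched with `sqlite3`. Only the table name `agent_outcomes` and the default threshold of 5 come from the notes; the column layout, the function names, and the `heuristic_pick` fall-back argument are assumptions made for the example.

```python
import sqlite3


def open_ledger(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_outcomes ("
        "model TEXT NOT NULL, task_kind TEXT NOT NULL, success INTEGER NOT NULL)"
    )
    return conn


def record_outcome(conn, model, task_kind, success):
    # Append-only: rows are inserted, never updated or deleted.
    conn.execute(
        "INSERT INTO agent_outcomes VALUES (?, ?, ?)", (model, task_kind, int(success))
    )


def confidence(conn, model, task_kind, min_samples=5):
    """Return the measured success rate, or None below the sample threshold."""
    n, wins = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(success), 0) FROM agent_outcomes "
        "WHERE model = ? AND task_kind = ?",
        (model, task_kind),
    ).fetchone()
    return wins / n if n and n >= min_samples else None


def recommend(conn, task_kind, candidates, heuristic_pick):
    """Prefer measured outcomes; otherwise defer to the existing heuristic."""
    scored = {m: confidence(conn, m, task_kind) for m in candidates}
    measured = {m: s for m, s in scored.items() if s is not None}
    return max(measured, key=measured.get) if measured else heuristic_pick


conn = open_ledger()
for ok in [True, True, False, True]:
    record_outcome(conn, "model-a", "refactor", ok)
below_threshold = confidence(conn, "model-a", "refactor")  # 4 samples
record_outcome(conn, "model-a", "refactor", True)
at_threshold = confidence(conn, "model-a", "refactor")     # 5 samples
```

Returning `None` rather than a low-sample estimate is what lets the caller tell "unmeasured" apart from "measured and poor" and fall back accordingly.
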
Planning gates
- Spec-quality gate (#1652). New `core/planning/spec_quality.py`: a deterministic, library-only gate that evaluates a feature spec against a small, pluggable rule set before the orchestrator dispatches an implementer. Default rules cover acceptance-criteria-present, out-of-scope-present, tested-via-present, no-TODO, no-placeholder, and ref-paths-exist. Specs that fail any required rule route through a bounded auto-fix loop (default 3 iterations); when the budget is exhausted the gate raises `SpecQualityUnresolvedError` so callers can surface the unresolved items without re-evaluating. Rules are pluggable through the `bernstein.spec_quality_rules` entry-point group; broken plugins are skipped, never crash the gate, and plugin `RuleResult` ids are normalised to the owning rule. CLI surfaces: `bernstein spec check <path>` and `bernstein spec auto-fix <path>` (dry-run vs `--write`, strict vs no-strict). Path-like spec strings that raise `OSError` fall back to inline mode. Docs at `docs/planning/spec-quality-gate.md`. Tests at `tests/unit/planning/test_spec_quality.py` and `tests/unit/cli/test_spec_cmd.py`. Closes #1631.
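
The gate's control flow (check, bounded auto-fix, raise with the unresolved rule ids) might look roughly like the sketch below. The two rule ids are taken from the default list above; everything else, including `SpecUnresolved`, `gate`, the heading string the first rule looks for, and the toy fixer, is hypothetical.

```python
class SpecUnresolved(Exception):
    """Hypothetical stand-in for SpecQualityUnresolvedError."""

    def __init__(self, failures):
        super().__init__(", ".join(failures))
        self.failures = failures


# Two illustrative rules: rule id -> pure predicate over the spec text.
RULES = {
    "acceptance-criteria-present": lambda spec: "## Acceptance criteria" in spec,
    "no-TODO": lambda spec: "TODO" not in spec,
}


def check(spec):
    return [rule_id for rule_id, passes in RULES.items() if not passes(spec)]


def gate(spec, auto_fix, max_iterations=3):
    """Return a passing spec, or raise once the auto-fix budget is exhausted."""
    failures = check(spec)
    for _ in range(max_iterations):
        if not failures:
            return spec
        spec = auto_fix(spec, failures)
        failures = check(spec)
    if failures:
        raise SpecUnresolved(failures)
    return spec


def strip_todos(spec, failures):
    # Toy fixer: removes TODO markers but cannot invent acceptance criteria.
    return spec.replace("TODO", "") if "no-TODO" in failures else spec
```

Carrying the failing rule ids on the exception matches the stated goal of letting callers surface unresolved items without running the rules again.
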
Skill customization
- Three-layer skill merge (#1654). New `core/skills/layered.py`: BASE / TEAM / USER skill layers under XDG paths with a per-field merge spec where scalars override, tables deep-merge, keyed arrays replace by `name` / `id` / `code`, and unkeyed arrays append. Layers fall through cleanly when absent. CLI: `bernstein skills list --layered` surfaces layer-of-origin, and `bernstein skills show <name> --per-layer` shows the merged result alongside the raw per-layer diff. Docs at `docs/skills/layered-merge.md`. 30 new tests pin merge precedence, per-field granularity, deterministic output, and missing-layer fall-through. Closes #1624.
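
The four merge rules can be expressed in a few lines over plain dicts and lists. This is a sketch of the stated rules only: the function names are invented, and details the notes leave open (for example, how an array mixing keyed and unkeyed items is treated) are resolved here by assumption, with mixed arrays treated as unkeyed.

```python
def merge_layers(base, override, key_fields=("name", "id", "code")):
    """Merge one layer onto another using the four rules from the release notes."""
    merged = dict(base)
    for field, value in override.items():
        current = merged.get(field)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[field] = merge_layers(current, value, key_fields)  # tables deep-merge
        elif isinstance(current, list) and isinstance(value, list):
            merged[field] = _merge_arrays(current, value, key_fields)
        else:
            merged[field] = value  # scalars override
    return merged


def _item_key(item, key_fields):
    if isinstance(item, dict):
        for name in key_fields:
            if name in item:
                return (name, item[name])
    return None


def _merge_arrays(base, override, key_fields):
    if not all(_item_key(i, key_fields) for i in base + override):
        return base + override  # unkeyed arrays append
    result = {_item_key(i, key_fields): i for i in base}
    result.update({_item_key(i, key_fields): i for i in override})  # replace by key
    return list(result.values())


def resolve_skill(*layers):
    merged = {}
    for layer in layers:  # BASE, TEAM, USER; a missing layer is passed as None
        merged = merge_layers(merged, layer or {})
    return merged


base = {
    "model": "small",
    "limits": {"tokens": 1000, "retries": 2},
    "tools": [{"name": "grep", "timeout": 5}],
    "tags": ["core"],
}
user = {
    "model": "large",
    "limits": {"retries": 4},
    "tools": [{"name": "grep", "timeout": 30}],
    "tags": ["mine"],
}
resolved = resolve_skill(base, None, user)  # TEAM layer absent
```

Because dicts preserve insertion order, a replaced keyed item stays at its original position, which keeps the merged output deterministic.
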
Correctness
- `_run_git` error formatter (#1644). Re-add the `str()` coercion inside the `OSError` / `TimeoutExpired` handler of `git_context._run_git`. The refurb wave 3 auto-fix (#1615) had dropped it, so calls with a `Path` inside the `argv` list (`test_context`, `test_context_builder`, `test_failure_reduction` all do this indirectly via `cochange_files`) raised `FileNotFoundError`, and the handler then crashed on `" ".join(...)` with `expected str instance, PosixPath found`, turning a debug log into a `TypeError` that bubbled up. Same fix as #1591, regressed by the wave-3 auto-fix.
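
The failure mode is easy to reproduce in isolation. The snippet below is a standalone illustration, not the `_run_git` source; the `argv` contents are made up.

```python
from pathlib import Path

argv = ["git", "log", "--oneline", Path("src/app.py")]

try:
    " ".join(argv)  # str.join rejects any item that is not a str
except TypeError as exc:
    error = str(exc)

# Coercing each part first keeps the error formatter from raising itself.
command_line = " ".join(str(part) for part in argv)
```

The general point is that a log or error formatter should never be able to raise on the values it is trying to report.
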
CI and infrastructure
- Sonar scan via `workflow_run` (#1645). The Sonar scan workflow was re-running the full unit suite under a single `pytest --cov` invocation. That suite needs per-file isolation to fit the runner memory budget, which is why `ci.yml` shards it across files and takes about 25 minutes. The naive single-process run only reached 5 percent of files within the 30 minute step timeout (the job-level timeout bump in #1616 did not lift the inner step cap). Switch `sonar-scan.yml` to a `workflow_run` trigger that fires after a successful CI run on main, download the `coverage-report` artifact CI already publishes, and feed it directly to the Sonar scanner. Also add `sonar.ws.timeout=600` to guard the scanner client against slow server responses, and pin `sonar.scm.revision` to the upstream CI head SHA so the scan reports against the right commit.
- Lint repair after #1638 (#1640). `ruff format --check` failed on `core/quality/review_pipeline/review_gate.py` after the stream-signal PR landed. Applying `ruff format` collapses several string and comprehension wrappings under the project's 120-character line length. No behaviour change.
- Lint repair after #1655 (#1657). The task-DAG merge turned main red on `Lint`. Move `Iterator` and `Path` imports under `TYPE_CHECKING` in `core/orchestration/task_dag.py` (TC003, 2 sites), replace `== True` with `is True` in `tests/unit/tasks/test_parallel_flag.py` (E712), and run `ruff format` across the four files added or touched by #1655. No behaviour change.
- Schemathesis smoke timeout (#1659). Widen the Schemathesis smoke step timeout so the property-based API smoke run stops being cancelled mid-flight under the normal main merge cadence, removing a recurring flaky-cancellation source on the merge train.
- Docker runtime pin (#1664). The published runtime image (`Dockerfile`) and the demo image (`docker/demo/Dockerfile`) referenced `python:3.14-slim` while their inline comments still read `python:3.12-slim`. Both build the bernstein wheel and run adapter dependencies that require `<=3.13`, so both are pinned back to `python:3.13-slim` by digest with the stale comments corrected to match the repository python policy.
- Sonar-scan `workflow_run` bootstrap (#1665). The `workflow_run` listener only fires when the upstream CI run on main concludes `success`, but `ci.yml` cancels in-progress runs per branch, so main CI almost never reaches `success` and the scan job's if-guard kept skipping. Make `workflow_dispatch` a reliable bootstrap and re-scan path: resolve the most recent successful CI run on main and pull its `coverage-report` artifact so a manual scan carries full Python coverage instead of scanning coverage-less. The `workflow_run` path is unchanged.
- Review-bot-ack concurrency (#1666). The review-bot-ack workflow emits a required status check on every PR. With `cancel-in-progress: true` and a per-PR concurrency group, overlapping events (`synchronize` on push, `pull_request_review` on review submit) routinely cancelled an in-flight gate run, and a `CANCELLED` conclusion reads as a non-success required check that stalled the merge queue. Scope the concurrency group per-PR and per-head-sha and set `cancel-in-progress: false` so every commit's gate run completes against its own sha. Adds a CI workflow-health sweep summary at `docs/ci/workflow-health-2026-05-20.md` covering all 47 registered workflows.
Documentation
- Doc-drift refresh (#1677). Reconcile `docs/concepts/` and `docs/gui/` prose with the current source-of-truth public surfaces across 16 documents, correcting renamed CLI surfaces, signatures, and config knobs: action-cache subcommands and metric names, swarm-migration `--id` flag, `validate_with_retry` positional signature, `FeatureContract`-driven spec-as-test assertions, `select_sandbox(backends, ...)` return and raises, team-hub 64 KiB manifest cap, `BestOfNDefaults` config knobs, `cpu_pause_threshold` load-units default, `route_for_phase` per-phase router, fingerprint-memoization `default_store` factory, `LineageReader.iter_records(run_id)` with `--limit`, and the async `summarize_diff` returning a list. `docs/sdd/` verified in sync (no change).
Quality and refurb waves
- Wave 4 (FURB184 + leftovers) (#1643). Conservative libcst / ast-based rewrites that preserve semantics. Counts in `src/`: FURB184 197 -> 34 (163 fixed), FURB138 42 -> 8 (34 fixed), FURB124 29 -> 3 (26 fixed), FURB142 16 -> 0 (16 fixed), FURB113 23 -> 21 (2 fixed; remainder have intervening comments that act as section dividers). Followed by a `ruff format` pass over 36 files to wrap `E501` long-line comprehensions, plus four targeted fixes for broken `seen in seen` self-referential dedup comprehensions in `spec_assertions`, `pr_review_aggregator`, `review_responder.models`, and `tui.approval_panel` (replaced with `dict.fromkeys()` for order-preserving dedup).
- Cluster D (FURB139 / 143 / 179 strings and enumerate) (#1647). 16 refurb autofixes: FURB139 drops leading / trailing newlines in nine multi-line GraphQL query constants by switching to line-continuation backslashes; FURB143 drops one redundant outer `or ""` after `str(... or "")` in `jira_dc_adapter`; FURB179 flattens six nested list / set comprehensions to `itertools.chain.from_iterable` in `bulletin`, `orchestrator` (x4), and `capability_matrix`. Three FURB143 alerts skipped intentionally where defensive `or ""` guards external API boundaries (`importlib.metadata` fields, externally-typed input strings).
- Cluster E (FURB182 / 183 / 142 / 101 misc) (#1649). 33 safe refurb rewrites across 21 files: FURB182 folds the first `hashlib.update()` into the `sha256()` constructor (10 sites); FURB142 replaces `for x in iter: s.add(...)` with `s.update(...)` (16 sites); FURB101 replaces `with open(p) as f: y = f.read()` with `Path(p).read_text/bytes()` (5 sites); FURB183 replaces `f"{x}"` with `str(x)` where the format spec is empty (2 sites). Refurb now reports 0 alerts for these rules in `src/`.
- Cluster B (FURB109 / 108 / 126 control flow) (#1651). 53 refurb idiom fixes across 44 files in `src/bernstein/`: FURB109 (23 sites) uses tuples instead of lists for static `in` membership and `for` iteration over fixed sequences; FURB108 (18 sites) collapses `x == a or x == b` chains to `x in (a, b)`; FURB126 (12 sites) drops redundant `else` / `case _` after a `return` and relies on fall-through. Pure control-flow and literal rewrites with no behavioural change; verified with `ruff check` clean on touched files, `compileall` clean, and a targeted pytest sweep (320+ tests) over affected modules.
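
A few of the rewrites named above, shown side by side on toy data. The rule numbers are as cited in these notes; the variables and values are made up, and the pairs are only meant to show that each "before" and "after" form is equivalent.

```python
import hashlib
from itertools import chain

status = "failed"
items = ["b", "a", "b", "c", "a"]
groups = [[1, 2], [3], [4, 5]]

# FURB108: chained equality -> membership test against a tuple
before_108 = status == "failed" or status == "blocked"
after_108 = status in ("failed", "blocked")

# FURB142: per-item add in a loop -> one set.update call
before_142 = set()
for item in items:
    before_142.add(item.upper())
after_142 = set()
after_142.update(item.upper() for item in items)

# FURB182: first update() folded into the constructor
h = hashlib.sha256()
h.update(b"payload")
before_182 = h.hexdigest()
after_182 = hashlib.sha256(b"payload").hexdigest()

# FURB179: nested comprehension -> itertools.chain.from_iterable
before_179 = [x for group in groups for x in group]
after_179 = list(chain.from_iterable(groups))

# Order-preserving dedup via dict.fromkeys, as in the wave 4 targeted fixes
deduped = list(dict.fromkeys(items))
```
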
New and changed CLI commands
- `bernstein plan dag --file <path>` / `bernstein tasks dag --file <path>` (new). Renders the task DAG with parallel batches highlighted and lists story rollback groups derived from `story_id` annotations.
- `bernstein doctor sonar` (new). Surfaces project measures from SonarQube. Flags: `--json`, baseline cache override via `XDG_DATA_HOME`.
- `bernstein doctor glitchtip` (new). Surfaces last-24h issue counts, 7-day trend, and top unresolved issues. Flags: `--json`, `--top-n` (`IntRange(min=1)`).
- `bernstein doctor --suggest-docs` (extended). Now also prints one-line GlitchTip and Sonar nudges when the respective APIs report new unresolved issues or threshold regressions since the cached baseline; failures are logged and suppressed (never crashes the doctor command).
- `bernstein approve-tool` / `bernstein reject-tool` (changed). Read the on-disk pending-approval record and thread the server-minted nonce back through `resolve()`. Operators using the CLI path see no behaviour change; integrators calling `resolve()` directly must thread the nonce or use the server-internal back-compat path.
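
The parallel batches that `bernstein plan dag` highlights can be pictured as layers of a topological sort. The sketch below is a Kahn-style layering written for illustration; it is not `topological_iter_with_parallel`, and the batching policy (ready tasks flagged parallel-safe run together, others one per batch) is an assumption. It also assumes every prerequisite appears as a key in `deps`.

```python
def ready_batches(deps, parallel_safe):
    """Yield lists of task ids whose prerequisites are all done.

    deps maps task id -> set of prerequisite ids. Ready tasks flagged in
    parallel_safe are batched together; other tasks run one per batch.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError(f"cycle among tasks: {sorted(remaining)}")
        parallel = [t for t in ready if parallel_safe.get(t, False)]
        batch = parallel or ready[:1]
        yield batch
        for task in batch:
            del remaining[task]
        for pre in remaining.values():
            pre.difference_update(batch)


deps = {"T1": set(), "T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}}
flags = {"T2": True, "T3": True}
batches = list(ready_batches(deps, flags))
```

With these inputs `T2` and `T3` land in the same batch because both declare the flag and both depend only on `T1`; a dependency cycle surfaces as an exception on iteration rather than an endless loop.
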
Upgrade notes
- Drop-in upgrade from v2.3.1. No config-schema changes, no audit-chain changes.
- Approval API change. HTTP approval replies now require a `nonce` field. The live-fragment HTML threads the nonce through automatically; external integrators calling the approval endpoint directly need to echo the `nonce` from the pending-approval payload. Missing or empty `nonce` returns `409 NONCE_MISMATCH`. Replays against an evicted approval return `410 NONCE_EXPIRED`.
- Sonar workflow trigger changed. `.github/workflows/sonar-scan.yml` is now `workflow_run` against the CI workflow on main. Operators with a fork running their own Sonar scan should mirror the same trigger or set `SONAR_HOST_URL` / `SONAR_TOKEN` to point at their own server.
- New optional env vars. `BERNSTEIN_GLITCHTIP_TOKEN` (for `bernstein doctor glitchtip`), optional overrides `BERNSTEIN_GLITCHTIP_BASE_URL` and `BERNSTEIN_GLITCHTIP_ORG`. `SONAR_HOST_URL` and `SONAR_TOKEN` for `bernstein doctor sonar`. The GitHub workflows expect `GLITCHTIP_API_TOKEN` and (for Sonar) `SONAR_TOKEN` as repo secrets. None of these are required; both commands soft-fail with a one-line hint when unset.
- `RunActor` is opt-in. Existing flows that do not pass `session_id` into the approval gate continue to work unchanged.
- Empirical-confidence ledger is created lazily. On first write, an SQLite file is created at `${XDG_DATA_HOME:-~/.local/share}/bernstein/empirical-confidence.db`. Override the path with `BERNSTEIN_CONFIDENCE_DB`, the sample threshold with `BERNSTEIN_CONFIDENCE_MIN_SAMPLES`. The model recommender falls back to the existing capability-tier and bandit paths when the ledger lacks a qualifying sample, so existing runs are unaffected.
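
For operators checking where the ledger will land, the documented precedence can be reproduced in a few lines. The environment variable names and defaults come from the notes above; the helper names `confidence_db_path` and `min_samples` are hypothetical, and the actual resolution code may differ in detail.

```python
import os
from pathlib import Path


def confidence_db_path(env=None):
    """Resolve the ledger path: explicit override first, then the XDG default."""
    env = os.environ if env is None else env
    override = env.get("BERNSTEIN_CONFIDENCE_DB")
    if override:
        return Path(override).expanduser()
    # Mirrors ${XDG_DATA_HOME:-~/.local/share}: unset or empty uses the default.
    data_home = env.get("XDG_DATA_HOME") or "~/.local/share"
    return Path(data_home).expanduser() / "bernstein" / "empirical-confidence.db"


def min_samples(env=None):
    env = os.environ if env is None else env
    return int(env.get("BERNSTEIN_CONFIDENCE_MIN_SAMPLES", "5"))
```

Passing a plain dict as `env` makes it possible to check a planned configuration without touching the real environment.
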
Internal
- Review-bot acknowledgement gate caught seven CodeRabbit must-address findings on #1646 across workflow status validation, `gh issue` subprocess `check=True`, doc clarification on soft-fail conditions, narrower import-time exception handling, logging of unexpected fetch failures, `IntRange(min=1)` on `--top-n`, and dropping a truthy fallback in `summarise_severity` / `_bucket_trend_by_day` that was inflating legitimate zero counts to one.
- Sourcery flagged the empty-nonce-body case on #1642; default the field to an empty string at the schema layer so the documented `409 NONCE_MISMATCH` contract holds.
- `_run_git` regression test coverage hardened by re-adding the `str()` coercion in the error formatter and re-running the three failing tests (`test_context::test_returns_list`, `test_context_builder::test_includes_file_summary_for_python_files`, `test_failure_reduction::test_task_context_includes_file_info`).
Acknowledgements
This release is operator-only; no external contributor PRs landed in the v2.3.1..v2.4.0 window.
Full changelog
feat (10)
- `f84bde93` feat(adapters): canonical stream-signal protocol for adapter stdout (#1638)
- `2df0e9c1` feat(orchestration): single-writer run-state actor with bounded replay buffer (#1641)
- `fd231bc7` feat(security): bind approval responses to single-use nonce (#1642)
- `51f330a6` feat(observability): sonar insights surface + doctor subcommand + delta nudge (#1648)
- `1a50c36a` feat(observability): GlitchTip insights surface + doctor subcommand + daily alert workflow (#1646)
- `c42607b4` feat(quality): empirical confidence from outcome history (#1653)
- `ec430dff` feat(orchestration): task DAG with explicit parallel flag + story-link grouping (#1655)
- `05f582a2` feat(planning): auto spec-quality checklist refuses to advance until clean (#1652)
- `381c3b6f` feat(skills): three-layer customization with deterministic merge (#1654)
- `15b5b1d0` feat(observability): unified bernstein doctor observe + per-PR insights summary + daily trends (#1650)

fix (8)
- `80c819b8` fix(lint): repair main-red after #1638 merge (#1640)
- `27ba6885` fix(test): restore str() coercion in _run_git error formatter (#1644)
- `b7bc28fe` fix(ci): reuse coverage artifact in Sonar scan instead of re-running tests (#1645)
- `a0f26de7` fix(lint): repair main-red after #1655 task-DAG merge (#1657)
- `a1449b4f` fix(ci): widen Schemathesis smoke timeout to stop flaky cancellations (#1659)
- `ab72c5bd` fix(docker): pin runtime images to python:3.13-slim (#1664)
- `b7f288d5` fix(ci): repair sonar-scan workflow_run trigger so first scan populates the project (#1665)
- `006743ee` fix(ci): stop review-bot-ack from cancelling its own required check (#1666)

refactor (4)
- `6fe31edc` refactor: bulk refurb autofix wave 4 (FURB184 + leftovers) (#1643)
- `eb112d2e` refactor: refurb cluster E (FURB182/183/142/101 misc) (#1649)
- `2fb0d26c` refactor: refurb cluster D (FURB139/143/179 strings/enumerate) (#1647)
- `d684739c` refactor: refurb cluster B (FURB109/108/126 control flow) (#1651)

docs (1)
- `ca6a2dab` docs(refresh): concepts + gui + sdd partitions per drift playbook (#1677)

chore / deps (10)
- `e6ca20a2` chore(release): v2.4.0 (#1658)
- `7b047af9` chore(deps): update marocchino/sticky-pull-request-comment action to v2.9.4 (#1661)
- `49b9766b` chore(deps): update dependency python to 3.13 (#1663)
- `1107e76d` chore(deps): update peter-evans/create-pull-request action to v7.0.11 (#1662)
- `93d47267` chore(deps): bump peter-evans/create-pull-request from 7.0.11 to 8.1.1 (#1667)
- `e999818e` chore(deps): update marocchino/sticky-pull-request-comment action to v3 (#1671)
- `852e3778` chore(deps): bump marocchino/sticky-pull-request-comment (#1669)
- `1016b352` chore(deps): update gcr.io/oss-fuzz-base/base-builder-python docker digest to 04d1a93 (#1670)
- `48051a8b` chore(deps): bump actions/setup-python from 5 to 6 (#1668)
- `17021db4` chore(deps): update python:3.13-slim docker digest to 9ca3cf9 (#1678)
Breaking Changes
- Approval API now requires a `nonce` field; missing or empty nonce returns `409 NONCE_MISMATCH`, evicted replays return `410 NONCE_EXPIRED`.
About chernistry/bernstein
Deterministic multi-agent orchestrator for 18 CLI coding agents (Claude Code, Codex, Cursor, Aider, Gemini CLI, OpenAI Agents SDK, and more). MCP server mode (stdio + HTTP/SSE) exposes the orchestrator to any MCP client. Git worktree isolation per agent, HMAC-chained audit trail, cost-aware model routing via contextual bandit. ~11K monthly PyPI downloads, Apache 2.0.