◂ home
// channel · orchestration
active100.3 MHZ

Nexus

A provider-agnostic agent-orchestration framework delivered over MCP — you hand it a goal and scoped, single-purpose agents frame, build, and review the work, none of them knowing which model from which vendor ran their turn.

briefgoal, boundedclaudecodexgrokparked · 1 yaml linereviewunion-to-refute · ×2 vendorsmain
[ core · debate · spawn ]three MCPs; scope enforced structurally, not by prompt prose

[latest2026-06-10]It runs itself — refactored its own engine through its own MCP, −39 lines on main.

8/8 vs 0/8
panel vs solo reviewer
16 runs · one planted trap
1 line
to reroute a vendor
nexus.yaml
[updated2026-06-10][sections9][stack7]
// session :: nexus.overviewtransmit

The orchestration logic is the durable asset. The model behind any given turn is a swappable detail.

── vision.md

Pitch

Nexus is the layer my agents use when one model isn't enough. You hand it a goal over MCP; it stands up an orchestrator that turns the goal into a problem brief, runs scoped, single-purpose agents that do the work through structured handoffs and reviews, and returns a result. The agents can run on any model from any provider — and none of them know which.

Most agent frameworks weld themselves to a single vendor. That's fragile two ways: every model's strengths move month to month, and the billing underneath you can change overnight (the forcing example: headless Claude Code moving from subscription coverage to a metered credit pool). Nexus's bet is to treat the execution engine as swappable and routing as configuration — so you can play each model to its strength and survive any one vendor changing terms. The differentiator is the thing a single-provider harness structurally cannot do: orchestrate one task across vendors, each doing what it's best at.

That isn't only a design claim anymore. A single task runs end-to-end over the real MCP wire with Claude framing and Codex building — and when the work hits the review gate, a two-vendor panel judges it, because sixteen measured runs showed a single same-vendor reviewer shipping a planted wrong answer 8 of 8 times while the cross-vendor panel caught it 8 of 8. The sharpest proof is the most recent: Nexus, registered into Claude Code as an MCP server, refactored its own engine — brief, scoped execution, panel review, landed on main.

3providersclaude · grok · codex
5backends3 delegated cli · 2 native
1line to reroutevendor swap in nexus.yaml
8/8panel catch ratevs 0/8 single-vendor · planted wrong answer

The sixty-second version

The whole page in nine lines — each one links to its long form.

  • The seam is the betarchitecture: ExecutionBackend.run_turn(TurnRequest) → TurnResult is the single method every layer above it calls — two loop modes (delegated CLI, native API) behind it, three packages over a shared core.
  • Routing is configurationproviders: three providers, five registered backends; swapping who executes a step is one line of nexus.yaml, and the orchestrator, brief, and agent prompt never change.
  • Three vendors, three auth storiescredentials: a CLI that owns its own login, a subscription piggyback, and from-scratch PKCE OAuth — normalized to one interface, with per-spawn token projection.
  • An agent is a pure function of its briefscope: the tool surface it can reach is enforced at the MCP layer, and the same Session.authorize gates both the out-of-process relay and the in-process native loop.
  • "I'm done" never completes a tasklifecycle: the caller drives a walk loop through execute → review → bounded retry; only an independent review's advance commits work, and the gate is enforced in code.
  • The blind spot is the vendorreview: one planted wrong answer, sixteen live runs — a single same-vendor reviewer shipped it 8 of 8, the cross-vendor panel caught it 8 of 8. The panel is the shipped default; the reflect-lead turns a catch into a cure.
  • Honest about what's provenverification: what CI proves versus what's run by hand — including the run where Nexus, registered as an MCP server, refactored its own engine and landed it on main.
  • Debate rides the same substratedebate: a structured multi-agent protocol that earns its keep by costing almost nothing to carry.
  • How it got heredecisions: the reshape arc — monolith to three composable MCPs — narrated from Nexus's own Engram memory, the project's history told by the engine it orchestrates.

This is the second time I've built an orchestrator. The first was a monolith — a twelve-agent named roster, a Director gate, a SQLite cache. This one keeps the spine that was load-bearing and drops everything that wasn't.

// section 01 :: architecture1 / 9

Architecture

Three packages over a shared core, and the one seam — ExecutionBackend.run_turn — that makes every model swappable.

// session :: nexus.architecturetransmit

Everything above the seam calls one method and never learns which vendor ran. That obliviousness is the whole bet.

── design note

Nexus is a uv workspace of three Python packages. There's no daemon, no database, no central server registry — workspace folders on disk are the source of truth, and every spawn carries its own self-describing launch config, so any caller can compose a topology.

nexus-core
engine library

The shared primitives — not an MCP. The spawn_agent atom, the ExecutionBackend seam, scope / session / workspace / registry, the nexus.yaml config surface, and the wire-dialect normalizers. Both servers build on it.

nexus-spawn-mcp
the product

A FastMCP stdio server that turns one agentic task into a scoped, auditable, multi-turn lifecycle the caller drives turn-by-turn. Fifteen caller-facing MCP tools, plus the larger scoped surface only spawned agents see. This is the thing Nexus actually is.

nexus-debate-mcp
sideshow

A structured multi-agent debate protocol on the same substrate. Genuinely neat, deliberately a passenger — it inherits core upgrades for free and costs almost nothing to carry. Its own section is last.

Engram (optional peer)
memory

A provenance-aware memory engine in its own repo, attached as an optional peer MCP. uv sync is standalone; uv sync --extra memory opts in. Absent it, a no-op bridge keeps the product running.

The one seam

The architecture is organized around a single contract. ExecutionBackend.run_turn(request, *, tool_router, on_event) → TurnResult is a runtime_checkable Protocol: one method that runs exactly one agentic turn and returns a normalized result. Every spawn site in the system — orchestrate, execute, review, reflect, even the background cartographer — only ever calls run_turn. None of them learns which vendor or which loop mode produced the turn; the deterministic broker above them disposes of results without ever spawning at all.

callers · every spawn site routes through the seamone seam · provider-obliviousorchestratorexecutorreview panel×2 vendorsreflectorcartographerbackgroundbrokerthe deterministic brokerdisposes results —it never spawnsExecutionBackend.run_turn(TurnRequest)→ TurnResultthe obliviousness boundarynothing above learns the vendor or loop modedelegated · agent runs inside the vendor's own cli / harnessclaude-cligrok-cliparkedcodex-clithe proven primary rungnative · nexus owns the loop over the responses apigrok-sub-tokenparkedcodex-sub-tokenfallback rungclaudecodexparked (grok) · flaky, parked for reliability
// one seam — run_turn(TurnRequest) -> TurnResult · the chosen vendor never crosses it

Two loop modes hide behind the seam, and the choice between them is invisible to everything upstream:

  • Delegated — Nexus spawns the vendor's own CLI (claude -p, grok, codex exec) and lets that harness run the whole agentic loop, with its coding system prompt, plan mode, self-verification, and loop-detector intact. Nexus just parses the stream. After a deliberate pivot, this is the proven primary rung: re-encoding a vendor's hard-won discipline in prompts for a hand-rolled loop lost to simply running the agent inside the harness that already has it.
  • Native — Nexus owns the loop itself, calling the model once per iteration over the OpenAI-Responses API and dispatching scoped tool calls in-process. This is the fallback rung — it exists, it's tested, but native loops flailed outside the vendor harness on exactly the kind of multi-file refactor the delegated path handled cleanly.

TurnRequest and TurnResult are the provider-neutral dataclasses that cross the seam. A delegated backend reads the CLI fields and ignores the tool router; a native backend reads the tools and ignores the CLI fields. Same contract, two implementations.

The spawn atom

Underneath the delegated backends sits one stateless primitive. spawn_agent runs exactly one claude -p invocation and returns when the process exits — no loop, no retry, no state. The caller owns the workspace, the retry policy, and the orchestration; the atom just runs an agent once.

Workspaces are the truth

Every task and every debate is a Workspace — a folder under ~/.nexus-spawn/workspaces/ (or ~/.nexus-debate/), with status.json as the single source of truth. There is no database. State is persisted with atomic writes: a per-write unique tmp filename plus os.replace, specifically to dodge the Windows PermissionError you hit when concurrent writers race on a rename. Crash recovery is "open the workspace and read."

Two fields make cross-server traceability work without a registry: every workspace carries a parent_workspace_id and a linked_workspace_ids list, so a task that spawned a debate — or a debate that spun out follow-up tasks — is traceable across MCPs that otherwise never talk to each other.

Where provider knowledge is allowed to live

The discipline that earns the framework its name is a boundary, not a feature: provider choice lives only in router.py resolving a nexus.yaml binding at dispatch time. The orchestrator composes briefs; it never picks a vendor. The agents are anonymous, typed, and scoped; they never learn who ran. Vendor-specific quirks — a --effort flag Claude honors, a reasoning.effort field the Grok proxy rejects — are pushed to the very edge, inside the individual backends. Above the seam, the code is vendor-blind on purpose.

// section 02 :: providers & routing2 / 9

Providers & routing

Three vendors, five backends, and one routing surface — swapping who runs a work-kind is a one-line edit the agents never see.

// session :: nexus.providerstransmit

The provider layer is what the framework's name is about. Above the ExecutionBackend seam everything is vendor-blind (see Architecture); this section is the part below the line — which vendors exist, how a work-kind resolves to one, and what "cross-provider" has actually been proven to do.

Three providers, five backends

BACKEND_FACTORIES registers exactly five backend ids across three providers:

  • Claudeclaude-cli (delegated; wraps spawn_agent). The Claude Code CLI owns its own login and runs the whole loop; Nexus never manages Anthropic credentials. It's reached only as a metered CLI, deliberately not via a borrowed subscription token — that's Anthropic's drawn line, and it's exactly why routing matters: the best-at-code model is now the metered one.
  • Grokgrok-cli (delegated) and grok-sub-token (native, over the SuperGrok Responses proxy).
  • Codex / ChatGPTcodex-cli (delegated) and codex-sub-token (native, over the ChatGPT-subscription Codex endpoint).

The three delegated CLIs are the primary rung; the two native sub-token backends are the fallback. There is no local-model, Gemini, or API-key backend in the code — those live in the design docs as a north star, not on this page's ledger.

Routing is configuration

work-kind · typegroup · nexus.yamlresolved backend · modeloperations tierorchestrategroupanthropicclaude-cliopus-4-7execution tierexecutegroupopenaixaiparkedflip executor.group→ xai when reliablecodex-cligpt-5.5operations tierreviewgrouppanel · 2 vendorscodex-cligpt-5.5lens · boundary inputsclaude-cliopus-4-7lens · hidden contractsunion-to-refute · shipped default (8/8 vs 0/8 measured)xaiparkedgrok rejoinswhen un-parkedoperations tierreflectgroupanthropicclaude-cliopus-4-7fires when review gates a turn — remove the binding to disableswapping group · openai → xai is a one-line editorchestrator, brief, and agent prompt unchanged — the agent never learns which vendor ranthe proven runclaude framescodex buildscodex + claude reviewcomplete3/3 live over the mcp wire · panel default since jun 9 · self-hosted refactor landed on main
// routing is configuration — one task, two vendors cross-cutting · grok parked

nexus.yaml is the single control surface, with three kinds of entry:

  • groups — named, ordered fallback chains of backend ids (anthropic: [claude-cli], openai: [codex-cli], xai: [grok-cli]). What you list is what runs; no improvising.
  • types — a work-kind, which is a (tier, phase) bound to a group, a model, and an optional provider-neutral effort that each adapter clamps to its vendor's range. Four ship bound: orchestrator, executor, reviewer, reflector — and the background cartographer is routable the same way.
  • panels — named rosters that fan a review gate out to N voters, each member a (group, model) plus a lens: a config-pushed adversarial probing focus. The two-vendor review panel ships as the default roster, and the measurement behind that decision has its own section (The review gate).

At dispatch time, ProviderRouter.resolve(tier, phase) indexes the type bindings and returns a concrete backend plus model and effort overrides. That's the whole indirection: swapping group: openai → xai for a work-kind is a one-line edit, and the orchestrator, the brief, and the agent prompt are all unchanged. The agent never learns which vendor ran it. That's "an agent is a pure function of its brief," made operational.

Effort defaults are part of the routing story too, because they were set by evidence rather than taste: work-bearing phases never run at medium — a reviewer at medium rubber-stamped the planted trap three runs out of three, so implement and review default to the top of each vendor's range and orchestrate one notch below.

What cross-provider actually means

The honest version, because this is the differentiator and overstating it would cheapen it. The proven end-to-end spine is Claude frames → Codex builds → review fans out to Codex and Claude → complete, driven over the real MCP stdio wire, with each phase's backend asserted from the task log. The original 3-of-3 proof runs predate the panel and ran review on Codex alone; everything since — the in-harness dogfoods, the self-hosted refactor, the sixteen-run review measurement — has run with two vendors actively judging each other's lane. When review gates a turn, a Claude reflect-lead does the frame diagnosis. One task, two vendors, four distinct seats — and the driver never knows.

Grok is supported but parked. The shipped config routes execute and review to Codex, not Grok, with a one-line note to flip back. Grok was proven in earlier isolated runs, but it's been flaky — stalling and over-thinking even inside its own CLI, and lately returning empty reviews that the panel drops as non-votes — so it's benched for reliability rather than shipped as a brittle demo. The thing worth noticing is that parking it was one line. The provider-agnostic design is what turned "a vendor went unreliable" into a config edit instead of a rewrite — the architecture paying off under real failure.

Wire dialects at the edge

The two native backends share one normalizer (wire_responses.py) that flattens the OpenAI-Responses SSE stream into a single vocabulary — text and thinking deltas, the tool-call trio, message_end with normalized token usage, error. The provider-specific quirks stay at the edge, where they belong: Grok uses plain-string messages and reasoning.summary only (its proxy 400s on reasoning.effort); Codex uses the strict nested-content shape, honors reasoning.effort, and carries a ChatGPT-Account-Id header decoded from its access-token JWT. The delegated CLIs each have their own JSONL stream parser doing the same normalization. Everything funnels into the same TurnRequest / TurnResult, so the chosen vendor never crosses the seam.

// section 03 :: credentials & auth3 / 9

Credentials & auth

Three genuinely different auth strategies behind one interface, and the token-projection trick that keeps a single-use rotating refresh from burning itself.

// session :: nexus.credentialstransmit

Provider-agnosticism is easy to claim and hard to wire, because the three vendors don't authenticate alike. Claude manages itself; Grok wants you to reuse a login you already have; Codex needs a full OAuth flow Nexus runs on its own. Hiding all three behind the same ExecutionBackend seam meant building each from scratch — no shared SDK papers over the differences.

three credential strategies · one interfacenexus authclaudemetered cli · per anthropic termsthe claude CLI ownsits own loginnexus holds NO claude token store/ never touches itgroksubscription · piggyback then ownparked~/.grok/auth.jsonclaim (bootstrap)authoritative~/.nexus/auth/grok-build.jsonsingle-use rotating refresh;nexus refreshes only its own chaincodexsubscription · own PKCE oauthlocalhost:1455 PKCEauth.openai.comsole authority~/.nexus/auth/openai-codex.jsonaccount id decoded from JWT→ ChatGPT-Account-Id headerper-spawn token projection · grok + codexpatch a FRESH access token into a COPY of the auth fileinside a hermeticGROK_HOME / CODEX_HOMEso the relocated CLI never burns its own single-use rotating refresh(no surprise browser login)
// three credential strategies, one interface · token projection keeps the rotating refresh safe

Three strategies

  • Claude — none of Nexus's business. The claude CLI logs in itself; Nexus holds no Claude token store and never touches one. This is deliberate: Anthropic reserves subscription OAuth for its own clients, so Claude is reached only through the metered CLI, never a borrowed token.
  • Grok — piggyback, then take ownership. Nexus bootstraps from the grok CLI's existing interactive login at ~/.grok/auth.json, then promotes the tokens into its own store at ~/.nexus/auth/grok-build.json and makes that authoritative. It refreshes only its own chain and treats ~/.grok as bootstrap-and-recovery, because xAI's refresh tokens rotate single-use — and two stores both spending the same rotating token is how you get a surprise browser login mid-run.
  • Codex — Nexus runs its own OAuth. No Codex CLI required: Nexus runs a full PKCE flow against auth.openai.com (using the published Codex client id), catches the loopback callback on localhost:1455, and owns the tokens at ~/.nexus/auth/openai-codex.json as the sole authority. The account id and plan are decoded from the access-token JWT and sent as a ChatGPT-Account-Id header.

The rotating-token problem, solved structurally

The subtle failure these strategies share: a delegated CLI, relocated into a hermetic per-run home so it can't see the user's global plugins, will try to run its own refresh against a single-use rotating token — and burn it, breaking the next run.

The fix is token projection. Nexus refreshes a fresh access token in its own authoritative store, then patches that token into a copy of the auth file placed inside a hermetic GROK_HOME / CODEX_HOME. The relocated CLI finds a valid access token already sitting there and never runs its own refresh, so the rotating chain is only ever advanced by the one owner that's allowed to. It's the kind of bug you only find by hitting it — here, as per-spawn browser popups — and it's pinned with regression tests.

// section 04 :: scoped context & enforcement4 / 9

Scoped context & enforcement

An agent is a pure function of its brief — and the tool surface it can reach is enforced at the MCP layer, identically across two transports.

// session :: nexus.scopetransmit

Prompts never say "don't touch X." The tool to touch X is simply not in the agent's list.

── vision.md

The bet the first orchestrator taught me: prompt prose is the wrong unit of control. "You are the planner, do not edit files" is a sentence, and sentences are negotiable — a long-running agent drifts toward whatever it was told last. The right unit is the tool surface itself. A planner that has no edit tool cannot quietly "help" by editing; a reviewer whose phase excludes write tools cannot rubber-stamp by fixing what it found. Scope isn't asked for, it's constructed.

Scope is assembled, never inherited

A ParticipantScope is built from explicit inputs only. It does not silently inherit an agent profile's tools or MCP servers — the assembler takes what it's given and nothing else, so the full tool surface is always visible at the construction site rather than accreting from defaults three layers away. Each spawn gets a freshly-registered Participant whose effective scope is the brief's scope, unioned with the tier/phase overlay and any per-turn override, minus an explicit do_not_touch list. The session token that identifies it is generated per-participant, kept in memory only, and never written to any artifact.

Fan-out tightens this rather than loosening it. When a plan step is a band of parallel units, each unit's effective writes are replaced by exactly its declared slice — not unioned, replaced — and the slices are validated pairwise-disjoint and inside the brief's write scope before the plan is accepted. Two concurrent agents physically cannot contend for a file, by construction rather than by convention.

One authorize, two transports

Here's the part I'm proud of. Scope has to survive both loop modes — the delegated CLI running out-of-process and the native loop running in-process — and it does, because both converge on the same gate.

participant scopetwo transports → one authorizeParticipantScopeallowed_mcp_tools (wildcards)read / write pathsdo_not_touchassembled explicitly — never inherited from a profiledelegated · out-of-processwrite_participant_mcp_config→ --mcp-configspawned vendor CLIloads the configrelaycarries each callTCP loopback → CoordinatorIPCcrosses the process boundarynative · in-processDirectCoordinatorClientcalls straight inonly the transport differssame authorize, same handlersSession.authorize(token, tool, args)authorizedhandler → workspacethe call runsout of scopeAuthErrorthe handler is NEVER invoked
// one authorize, two transports — out-of-scope never reaches the handler

For a delegated spawn, scope is rendered into a --mcp-config file; the vendor CLI starts a short-lived relay that connects back to the coordinator over a TCP loopback socket, carrying the session token. For a native turn, a DirectCoordinatorClient routes calls in-process. Different transports — a JSON-RPC socket versus a direct call — but both land on Session.authorize(token, tool, args), and both run the same handlers behind it. An out-of-scope call returns an AuthError and the handler is never invoked. There's a dedicated test asserting exactly that, because it's the headline guarantee: zero-knowledge scoping is enforced in code, not described in a prompt, regardless of which path the turn took.

Defense in depth on the CLI

The delegated path inherits a vendor harness with its own tool surface, so scoping it down takes more than an allowlist. Each control closes a specific hole:

  • --allowedTools is always emitted, even empty, to override the CLI's permissive default allowlist.
  • --tools "" disables the built-in Read/Edit/Bash so they can't leak around the scoped MCP surface.
  • "strict scope" adds --setting-sources "" and disables slash commands, to defeat auto-discovered plugins (an LSP server, a skill pack) that would otherwise smuggle tools in.
  • It deliberately stops short of --bare, which would also break keychain auth — a reminder that hardening has a cost ceiling.

The honest boundary: scope and the tool allowlist are the hard controls. Prompt restraint is not treated as a safety boundary — it's a hint, layered on top. The thing that actually can't happen is the call the runtime refuses, and that refusal is a property of Session.authorize, not of the model behaving.

// section 05 :: the walk loop5 / 9

The walk loop

The caller drives a task turn-by-turn; inside each turn an independent review gate decides whether the work advances, a reflect-lead can correct the frame, and disjoint slices fan out.

// session :: nexus.lifecycletransmit

nexus-spawn-mcp never auto-runs a whole task. The caller — Claude Code, Cursor, any MCP client, positioned as the Advisory layer — gathers context itself, hands the server a pre-digested goal, and then drives. The shape is a walk loop, and the load-bearing idea is what happens inside one step.

part 1 · the caller drives the lifecycletask_create→ framingtask_orchestrate→ active | clarification_neededtask_run_turn ⇄ task_statusrepeat until a stop statusthe caller loopstask_result→ returnedstop when status ∈ { complete · blocked · escalated · cancelled }part 2 · inside one task_run_turnload-bearing ruleexecution’s own action=‘complete’ does NOT bypass review — only review’s ‘advance’ completes the task.the gate itself is structural — a plan that omits the review entry gets one appended in codespawn executionone plan step — or a band of N disjoint units in parallelhandoff persistedstatus NOT yet committedreview gatereview disposes the steprun review→ two-vendor panel verdictverdictadvanceretry · escalatecommit handoffstatus committed · the step advancesif action = completetask → completeterminal · successreflect-lead · if routedread-onlyframe vs work → re_brief | re_dispatch | accept_terminalre_brief · re_dispatchaccept_terminalsupersede handoffre-dispatch with retry_hintup to max_review_retries = 2, then escalateterminal escalatereview overrides executionre-dispatch · execution re-runsadvanceretryescalatereflect-leadexecution proposes · review disposes
// inside the walk loop — review disposes, not execution

The caller's loop

task_create mints a workspace and writes status.json at framing. task_orchestrate runs the Operations-tier orchestrator once, which either returns a validated ProblemBrief (status → active) or clarification_needed with up to three questions (the ambiguity escape hatch — the caller answers and re-calls). Then the caller loops task_run_turn / task_status until the task reaches a terminal state — complete, blocked, escalated, or cancelled — and reads task_result. Fifteen caller-facing MCP tools total, including task_log, task_cancel, task_reopen, and the tier/agent registration surface; the spawned agents see a separate, larger scoped tool surface that the caller never touches.

There's no error status. An unrecoverable infrastructure failure lands the task on terminal escalated with an error field attached — deliberately, so a poller doesn't spin forever on an active task that's actually wedged.

One turn is one plan step

The subtlety: one task_run_turn is one plan step, not one spawn. A single call transparently walks the cursor's execution entry, then any review gate that follows it, then any retries or reflections the verdict demands — and only returns when the step settles. Between calls the cursor always rests on a non-review entry, so the caller never has to think about reviews at all.

Inside the call, the driver:

  1. Spawns the execution agent and persists its handoff artifact — but does not yet commit the task's status.
  2. If a review phase follows, runs review first and lets its verdict decide.

The review verdict is a three-way ladder. Advance commits the handoff — and if the action was complete, the task completes. Retry supersedes the handoff and re-dispatches execution with the reviewer's retry_hint, up to max_review_retries (default 2), then escalates. Escalate synthesizes a terminal handoff outright. And when a reflector is routed, a gating verdict first passes through the reflect-lead — which can convert a mechanical retry into a re-briefed one, or a dead-end escalate into a correction (its own section: The review gate).

Fan-out: parallel work units

One plan step can also be a band. Consecutive plan entries marked with the same parallel_group run concurrently — each unit a fully scoped agent on its own slice — and fold worst-wins into a single result the rest of the walk consumes unchanged, with one review judging the combined diff. The safety is structural, settled at brief acceptance: every band unit must name its write slice, slices must be pairwise disjoint and inside the brief's write scope, and each unit's effective writes are narrowed to exactly its slice. Review entries can never band.

The part I didn't expect to land this early: given only prompt guidance about when banding is appropriate, the orchestrator emitted a correct two-unit band unhinted on a real maintenance task — right disjoint slices, self-invented phase names, one combined review — and both units ran measurably concurrently. Decomposition is supposed to be the orchestrator's judgment, and at least once, it was.

Tiers, honestly

The design vocabulary is a four-tier context gradient — Advisory knows everything, Operations knows the project, Execution knows its slice, Field knows its gesture. In code, two tiers are implemented: Operations and Execution. Advisory is the caller, and Field is reserved-but-unbuilt (it's waiting on a front-door integration). Operations is phase-aware — an orchestrate phase that builds the brief, a review phase that verifies, and a reflect phase that diagnoses, all read-only to the filesystem; Execution has one implement phase with scoped file I/O. Custom tiers can be registered at runtime, but they can't shadow a built-in or a reserved name.

A per-phase tool budget bounds each turn's MCP activity (25 calls for Operations phases, 60 for Execution — a multi-file refactor that runs its own tests needs the headroom), and the terminal submit_* tools always bypass it so an agent can never be budgeted out of finishing.

The Cartographer

One piece sits outside the tier model on purpose. The Cartographer is infrastructure, not a tier — not in any plan, not registerable as an agent. It maintains exactly one file: a small (4 KB-capped) context-map.md per project, refreshed fire-and-forget after a task completes, that gives the next task's orchestrator a starting frame without bloating the brief. It does one job and it never raises. It was also the last spawn in the system still reaching a vendor directly — it now routes through the same ExecutionBackend seam as everything else, so even the background maintenance is a nexus.yaml line.

// section 06 :: the review gate6 / 9

The review gate

Confident-wrong is the failure that matters — the probe tools, the two-vendor panel, and the reflect-lead that turned a measured catch into a measured cure.

// session :: nexus.reviewtransmit

The failure an orchestrator actually has to prevent isn't an agent erroring out — errors are loud. It's an agent confidently shipping the wrong answer. The testbed plants exactly that: a module where two functions share a same-name _normalize that looks duplicated but isn't (one floors negatives to zero, the other must let them through), a brief that asks for consolidation, and a shipped test suite that skips negatives — so the wrong answer comes back green. The real scorer is a held-out behavioral check the agent never sees. Every claim in this section is measured against that trap.

the gate · union-to-refutedepth × breadthexecution hands off — action=‘complete’claims are checked, not trustedcodex · gpt-5.5vendor alens · boundary inputsnegatives · zero · empty · caps —probe each merged behavior separatelyrun_probe ×N · run_tests · read_diffclaude · opus-4-7vendor blens · hidden contractsbehaviors the criteria don’t name —anchor expectations in the original coderun_probe ×N · run_tests · git show HEAD:each member authorsits own probesblocking_concerns ≠ ∅→ coerced off advance, in codeunion-to-refuteadvance only if all advance · alignment = min · any escalate escalatesan empty member is an infra failure → clean non-vote, never a fabricated verdictadvancehandoff commits · status committedif the action was complete, the task completesretry · escalatea gating verdict — the work does not advancereflect-leadoperations · reflect · read-onlyre-runs the reviewer’s probe — verify, don’t trust · reads the originals from historyrules FRAME vs WORK → re_brief | re_dispatch | accept_terminalan amendment structurally cannot touch scope or the tier planre_brief → amended frame → re-executecorrections consume the same bounded retry budget — the broker still disposessingle same-vendor reviewer · 0/8the planted wrong answer shipped every runtwo-vendor panel · 8/8caught it every run · fisher exact p ≈ 8e-5run 1 · the trap is caughtrun 2 · re-briefed work advances
// the gauntlet — two vendors, distinct lenses, union-to-refute · a catch becomes a cure

Depth: a reviewer that red-teams

A reviewer that checks the brief's stated criteria will rubber-stamp this trap, because the wrong answer satisfies the stated criteria. So the reviewer is armed and instructed to refute, not confirm:

  • run_probe lets a reviewer author and run a short behavior-preservation check of its own against the live worktree — sibling to run_tests, same sandbox posture, and a failing probe comes back as a result, not an error. The review prompt's discipline is refute-by-default: enumerate the behaviors the criteria don't name — negatives, empties, boundaries, each merged code path — and write a probe for each.
  • blocking_concerns is the code backstop. It's a verdict-bearing field on the review schema: if a reviewer records a probe-confirmed regression there, the validator coerces the verdict off advance — path-independently, whether the vote came from a single reviewer, a panel member, or a salvage parse. A reviewer that documents a regression structurally cannot then wave it through.

Depth turned out to be contagious in a good way: on ordinary runs the executor now probes its own edge inputs unprompted and pre-empts the trap before review ever sees it. Detection moved earlier in the pipeline.

Breadth: the two-vendor panel

Depth alone wasn't enough, and this is the project's sharpest finding. A six-run ladder put a single Codex reviewer against the identical trap, moving one variable per rung: more effort — missed. An adversarial lens — missed, and the probe it wrote actually hit the revealing input but took its expected values from the code under review: circular verification, visible in the tool-call log. An explicit instruction quoting that exact failure — missed again. The blind spot isn't probing skill or prompt quality; it's epistemic anchoring on the brief's framing ("consolidate the duplicate") over the artifact evidence. No knob moved it. Claude caught the same trap first try with no special instructions — and a different model would presumably have its own, different blind spot. "Different frames of mind" stopped being a metaphor and became a measured property.

So review fans out: a panel of reviewers, each from a different vendor, each with a config-assigned lens — a distinct adversarial probing focus that narrows what they probe, never what they judge. Votes aggregate union-to-refute: advance only if every member advances, any failed criterion forces a retry, any escalate escalates, alignment is the minimum. An empty response is an infrastructure failure, not a verdict — it's dropped as a clean non-vote rather than salvaged into a fabricated opinion (that bug shipped a false escalate once; it's regression-tested now).

0/8single reviewersame-vendor · planted trap shipped every run
8/82-vendor panelcaught it every run
p≈8e-5fisher exact16 live runs · perfect separation
0/14single, lifetimeacross effort · lens · anchor instruction

The measurement behind the default: sixteen live runs against the identical planted answer, eight per arm. The single same-vendor reviewer caught it 0/8 — the confident-wrong result shipped every time, with an eerily uniform signature: exactly one non-revealing probe, then advance. The panel caught it 8/8, every catch carrying a probe-confirmed regression. Counting the earlier ladder, the single reviewer is 0/14 lifetime on this trap. On this failure class, cross-provider diversity isn't an improvement — it's the difference between always shipping the wrong answer and never shipping it. That's why the two-vendor lensed panel is the shipped default, at the cost of one extra review spawn per gate, riding subscriptions.

The reflect-lead: catch → cure

A caught regression used to dead-end in a mechanical retry or a terminal escalate. Now it feeds a third Operations phase: reflect. When review gates a turn, the reflect-lead — read-only, with review's verify surface, so it can re-run the reviewer's probe rather than trust it — diagnoses whether the fault is in the work or in the frame, and emits one of re_brief | re_dispatch | accept_terminal. A re-brief carries a structured amendment that cannot touch scope or the tier plan (model-validated), and the merged brief re-runs the orchestrator's invariants. Corrections consume the same bounded retry budget the broker already enforces. Unbind the reflector routing line and the driver falls back to the mechanical paths, byte-compatibly.

The first live run closed the whole loop on the planted trap. The panel caught the regression; the reflect-lead re-ran the probe, read the original functions from git history, and called it: a frame failure — the brief's "single helper, exact signature" criterion and behavior preservation were mutually unsatisfiable as stated. Its amended brief stated the divergence as fact and wrote the previously-held-out ground truth into the criteria. A different vendor re-executed under the new frame, the panel advanced, and the external held-out check passed. Caught → diagnosed → re-briefed → re-built → verified: the system corrected its own frame mid-task, which is the briefs-over-plans wager operating as a mechanism instead of a slogan.

// section 07 :: verification & evidence7 / 9

Verification & evidence

What "it works" actually rests on — a trust gradient from reproducible unit tests to operator-run cross-provider proof, kept deliberately honest.

// session :: nexus.verificationtransmit

A portfolio page can claim anything. This section is the part that says which claims are reproducible, which are run by hand, and where the line is — because for an orchestration framework the gap between "the tests pass" and "it really drives multiple vendors to a correct answer" is the whole interesting part.

trust gradientthe proof ladder · 6 rungsoperator-run / humanreproducible / CI-abletier 1rungunit floor712 green across 3 packages · 457 core+spawn · seconds, no network · no standing redautomatedtier 2rungin-process e2efull lifecycle with the spawn mockedautomatedtier 3rungstdio probereal server over MCP · lists 15 tools · no auth, no spawnautomatedtier 4rungoperator-run lifecyclesanthropic-only floor · real-task drivers · planted-trap discriminatorshuman-runtier 5rungcross-provider proof3/3 live over the wire · per-phase provenance from task_log16-run review measurement — single 0/8 vs panel 8/8human-runtier 6rungnexus on nexusin-harness advisor drive · self-hosted engine refactor landed on mainunhinted parallel fan-outhuman-run
// the proof ladder — reproducible at the bottom, operator-run at the top

The trust gradient

At the bottom, the reproducible floor: 712 tests green across the three packages — 457 in the core and the product alone, in seconds, with no network. That's the orchestration logic — config parsing, routing, scope assembly, the state machine, schema round-trips, panel aggregation, banding invariants — locked down by unit tests even though the models behind it are swappable. For a long stretch the debate package carried one known-red test; it's green now, and the fix is its own proof point: Nexus repaired it itself, as one slice of a banded dogfood task. (The counts churn week to week; treat them as "as of early June 2026.")

Above that sit the in-process end-to-end tests that mock the spawn, then the stdio probe that boots the real server over MCP and lists its fifteen tools without auth or a spawn. Then the things a human has to run: the anthropic-only floor's full lifecycle, the real-task drivers, the sixteen-run review measurement — and at the top, the runs where Nexus worked on its own codebase.

712tests green457 core+spawn · no network · no standing red
3/3live cross-providerfull lifecycle over real MCP stdio
16measured review runssingle 0/8 · panel 8/8
2self-hosted commitsnexus refactoring nexus, landed on main

The cross-provider proof, the honest way

The headline run is verified the way it should be. An external client drives the product as a black box over real MCP stdio — task_create → task_orchestrate → (task_run_turn loop) → task_log — exactly as a third-party harness would. Then two independent checks: it reads the file the agent edited and confirms the change is on disk (not just claimed), and it asserts from the task log which backend actually ran each phase. The provenance constant pins it: orchestrate on claude-cli, execute on codex-cli, review on codex-cli. Three of three.

The honest caveat travels with the claim: that 3/3 is operator-run over real logins, asserted via task-log provenance — there's no committed transcript, so it isn't reproducible from the repo the way the unit suite is. "Proven" here means proven by hand, on the wire, with evidence — not green in CI. (Those runs also predate the panel default; every dogfood since has run review fanned across two vendors.)

The dogfood ledger

The strongest evidence is the most recent, and it's Nexus operating on real work — including its own:

  • First in-harness drive. The server registered into Claude Code via .mcp.json, and Claude Code drove a full lifecycle in-conversation as the Advisory tier — including surviving a live provider failure mid-walk: Codex's token was revoked vendor-side, the specific diagnostic propagated, the task went terminal escalated instead of stranding the poller, and after a re-login the Advisor reopened it and the walk completed. Failure handling proven by an actual failure.
  • Self-hosted refactor. Nexus refactored its own engine — the long-deferred deduplication of its two native Responses loops — through its own registered MCP surface: scoped worktree, Codex execution, two-vendor panel review (each member visibly probing per its lens; the Claude member pulled the originals from git history to anchor its checks), externally verified, landed on main at net −39 lines.
  • Unhinted fan-out. A real two-item maintenance task on the Nexus repo itself, no plan hints — the orchestrator banded it into two parallel units with correct disjoint slices on its own, both ran concurrently, one panel reviewed the combined diff, and the commit landed. That run is also what fixed the standing-red debate test.

Measuring the thing that matters

The testbed is built to catch the failure that actually hurts: an agent that confidently ships the wrong answer. _real_task.py carries planted tasks with external ground truth the agent never sees, and a four-way scorer that splits outcomes into CORRECT, CONFIDENT-WRONG (the dangerous one), HONEST-ESCALATE (the safe failure), and FALSE-ESCALATE. The sharpest instrument is the review-only discriminator: programmatically plant the wrong answer, fabricate a confident complete handoff whose claims are technically true, dispatch only the review gate, and read the binary verdict — caught or shipped. Run sixteen times, it produced the 0/8-versus-8/8 separation that made the two-vendor panel the shipped default; the full story, including the negative results that led there, is in The review gate.

A debugging win

One more, because it's the kind of thing that only shows up in a build log. A persistent "narration failure" — the orchestrator printing tool calls as prose instead of calling them — had been blamed on the model. A census of 363 spawn captures root-caused it instead to an MCP connection race in Nexus's own relay: when the scoped relay hadn't finished connecting, the model saw zero tools and narrated — 42 narrated runs out of 42 had zero tools at turn start. The fix was a bridge-first lazy-connect relay plus telling the CLI to block startup until the relay is up, and it's verified live: the orchestrator now connects and produces a brief in seconds, every run, under real concurrent spawns. It reversed the previous session's conclusion. The lesson logged: suspect your own plumbing before you blame the model.

// section 08 :: the debate sideshow8 / 9

The debate sideshow

A structured multi-agent debate protocol on the same substrate — where strawmanning is made structurally impossible, not just discouraged.

// session :: nexus.debatetransmit

A confession up front: this isn't the main event. nexus-debate-mcp is a deliberate sideshow — a second MCP server that rides the same scoped-spawn substrate as the product and inherits its upgrades for free. It earns its place by costing almost nothing to carry, and by being a genuinely interesting protocol in its own right. So: briefly, and honestly.

The idea is epistemic parallelism. A single long-running agent on a hard architectural question accumulates confirmation bias — it drifts toward whatever it heard last. N agents that each form an independent prior from a disjoint scoped context, then resolve their tensions through a structured protocol, don't. It's closer to peer review than to a chat.

debate pipeline · the sideshowsame substrate · own ride2–4 subpasses · adaptiveframeframerno toolsopeningN parallelisolated priorsarbiter:analyzetensions +steelmanscross-examrebutper roundarbiter:check-closecontinue?or close?arbiter:convergeagree? hold?finalclosingstatementssynthesizesynthesis.md · assembledin python · verbatim recordsthe steelman shieldrouting under the cross-exam stageAgent A openingAgent B openingArbiterwrites steelman{A,B}steelman{A}steelman{B}B's cross-examresponds to STEELMAN of Anever A's raw rebuttalopenings ARE visible · raw cross-exam rebuttals are withheld → strawmanning is structurally impossibleone real run on disk · 5 panelists · 3 rounds · 86m26s · 3 converged / 4 contesteda demonstration, not a benchmark
// the debate sideshow — steelman routing makes strawmanning structurally impossible

The pipeline

A framer composes the panel (it reasons from the prompt with no tools at all), extracts the contestable claims, and assigns roles. Panelists run an opening round fully in parallel and in isolation — none of them sees another's response. An Arbiter (read-only, schema-bound) reads all the openings and maps the disagreements into TensionFindings, each carrying a steelman of every side. Then a cross-exam loop runs until the Arbiter closes it, a convergence pass splits converged from contested, a final round runs, and a synthesis is assembled.

The cross-exam loop is adaptive, not a fixed cap: the Arbiter ratifies a close decision driven by a measured revision rate and tension density, bounded by a hard floor and ceiling. And the synthesis is assembled in Python from the verbatim round records — the Arbiter's expensive call produces only the interpretive layer, so no argument is lost to summarization, and an honest abstention survives when the panel is genuinely split.

The steelman, as structure

The signature idea is that strawmanning is made structurally impossible, not merely discouraged. During cross-exam, an agent responds to the Arbiter's steelman of the opposing position — a maximum-force reconstruction in the Arbiter's own words — delivered through a scoped debate://self/tension resource that carries only the steelman. The opponent's raw cross-exam rebuttals are simply absent from that agent's tool scope. You can't weaken an argument you were never handed.

To keep the panel from becoming an echo chamber, the framer is held to two composition invariants: if the caller's registry has topic-relevant agents, at least one must be on the panel; and if any registered agent is on it, at least one ephemeral outsider must be too — because registered agents tend to have co-built the systems they'd be debating.

What it actually ran

5panelistsall ephemeral
3roundsopening · cross-exam · final
86mone real run3 converged · 4 contested
~55%cross-exam revisionpositions moved under pressure

There's exactly one complete debate on disk, and I'll call it what it is — a demonstration, not a benchmark. The topic was Nexus's own architecture: a five-member panel, three rounds, about 86 minutes of wall clock on Claude models at varying effort tiers, ending in a reasoned per-claim abstention rather than a forced winner. Roughly 55% of cross-exam responses moved from their opening position under the steelman pressure, which is the signal you'd hope for. It's slow and expensive by design — depth over speed — and it's one run, so it proves the protocol executes end-to-end on a hard problem, nothing more.

// section 09 :: decisions9 / 9

Decisions

The reshape arc, reconstructed from Nexus's own Engram memory — real session dates, the reasoning captured as it happened, the June frontier filled from the build log.

// session :: nexus.decisionstransmit

Every other section describes Nexus as it runs today. This one is the history, and it's reconstructed the way the rest of the stack works: from Nexus's own Engram store. Engram is the memory engine Nexus orchestrates against; here it narrates the orchestrator that uses it. Every date below is a drawer's real session_date — pulled with recall_thread, the same narrative-recall call an agent makes — not a commit. The quotes are verbatim from the memories written in the moment.

memory ends · build log beyondMCP primitivesAPR 16Advisory is youAPR 18Provider-agnosticMAY 24Seam, liveMAY 25In-harnessMAY 28Briefs, not plansMAY 30Proven 3/3JUN 2Panel + reflectJUN 9Runs itselfJUN 10
// the arc — reconstructed from Nexus's own Engram memory · the june frontier from the build log

Two honest edges, up front. The store's Nexus record runs from mid-April through May 30 — so the genuinely-earliest monolith design is referenced inside it but predates it. And the memory lags the work: the June frontier below (the narration-race fix, the review measurement, the reflect-lead cure, the self-hosted runs) isn't authored into Engram yet, so those entries are filled from the build log — marked where they appear, and drawn dashed on the arc above.

I. Foundations — April

Before the provider story, Nexus had to stop being a platform. The first monolith — a UI that, in the build log's words, "collapsed under its own weight" — got reframed into composable MCP primitives, and the orchestration tier model that survives today was locked here.

  1. Three primitives, not a platform

    The reframe that started everything: a heavy standalone app became three composable MCP primitives — memory (Engram), scoped spawning, and debate. Each could stand alone, be registered by any harness, and stop carrying the weight of a bespoke UI. The platform's job was redistributed to the caller; the primitives kept only what was load-bearing.

  2. The advisory layer is you

    Porting the orchestration into nexus-spawn-mcp was "a chance to do it better," not a line-by-line migration. The tier taxonomy locked here — Advisory · Operations · Execution · Field — and the key insight reshaped the whole contract: Advisory isn't a spawned agent, it's the caller. "The advisory layer is YOU. You control when an orchestration is sent, not the user." That demoted the Orchestrator from a thinker to a checker — "not a framer that figures things out; a validator + structurer of what's already been figured out." The first real testbed reproduced the failure it was built to avoid: a capable model given thin context hung for five minutes, while the same model with a rich brief produced one in fifteen seconds — "capable model, thin context… the system-failure zone."

  3. Review gets the final word

    Operations became phase-aware — an orchestrate phase that builds the brief and a review phase that reads the diff and never writes source — rather than spinning up a separate Review tier. The load-bearing bug, caught and fixed the same day: an execution agent's action='complete' could bypass review. The rule became "Review has the final word on commit." The same stretch hardened scope (--setting-sources "" + disabled slash-commands to kill auto-discovered plugins, deliberately not --bare because the subscription login has no API key) and learned a lasting lesson — spawned agents under strict scope don't see MCP resources in their tool list, so "tools, not resources": expose read_my_brief and friends as tools.

  4. Engram, deliberately paused

    An audit found premature authority-profile glue wiring Nexus to Engram before Engram's own design had settled. The call was to pause deep integration rather than build against a moving target — the decision that, downstream, made memory an optional peer rather than a hard dependency for the product. (The one place it stayed hard-wired is the debate sideshow.)

II. The reset — late May

After a lineage retrospective — two months of work audited as a path "from the rev-nexus exploration monolith into disciplined extracted projects" — Nexus was reset around a single new thesis. This is the densest week in the store.

  1. Use any provider you want

    Kyle locked the north star: a provider-agnostic orchestration MCP. "Use any provider you want, however you want," and "Nexus should be something I can implement literally here into claude code if i wanted to." The consequences fell out fast — Nexus owns its own Python provider layer (not Mantle's), routing lives in config-driven fallback chains, and there is no Anthropic sub-proxy because "anthropic is the only provider that has been crystal clear on that one." The config split that still holds: tier is an authority/scope band, type is a routable work-kind, and the human writes the YAML "so the advisor and agents make as little decisions as possible."

  2. Lock the seams, stub the brain

    The discipline that made a v1 shippable: commit the expensive-to-change interfaces, leave the internals crude. "v1 should land the core orchestration and patterns. From there, we deep dive into each of the parts at its own layer." Review v1 became exactly this — a read-only reviewer sees the brief, the diff, and prior handoffs and emits advance | retry | escalate; the broker enforces the rest. Dimensions, debt ledgers, and tiered-review sophistication were deferred on purpose. The seams are still locked; some of the brains are still stubs, and the page says which.

  3. The framework argues its own design

    "We're in lockstep, which can be a powerful thing — but also make us blind to the gaps." So the debate framework was pointed at Nexus's own VISION.md and DESIGN.md, with a rival framework as a fair foil. The 86-minute panel returned a sharp verdict: "every contested bet's survival turned on implementation work v1 defers, not on the philosophy." Provider-obliviousness converged as holding; scoped-context was called "sound in principle, unearned in v1." Its derived claims turned the roadmap toward exactly the layer that had been stubbed — review — and it shook out two real framework bugs while it ran. The system pressure-tested itself and changed its own plan.

  4. The seam ships — live, and cross-vendor

    The load-bearing abstraction landed as nexus_core/execution.py: ExecutionBackend, TurnRequest, TurnResult, with ClaudeCliBackend wrapping the existing spawn verbatim — "define the contract, wrap claude-cli behind it, prove the wrap with a parity test." Then the first non-Claude backend went live: GrokBuildBackend over the xAI Responses API, and after a token-refresh fix the call came back clean — "Live. Green. Cross-vendor." The keystone followed: DirectCoordinatorClient routed a native turn through the same Session.authorize as the relay, an out-of-scope call returned an error with "the handler never invoked," and Grok made a live scoped tool call. "The keystone proof — live."

III. Proof — late May

With the seam real, the question shifted from "does it wire" to "does it actually work, across vendors, on real tasks."

  1. Run it inside the harness

    The native loops were thrashing on real refactors, and the diagnosis reframed the whole execution strategy: "I think grok messing up is due to the grok-build model being used outside its harness." Sharpened — "you tried to reimplement grok's native loop-detector in prose and failed. The CLI is that loop-detector." So the delegated CLIs (grok-cli, then codex-cli) were promoted to the primary rung: run the agent inside the harness that already has the discipline. Proven on the exact refactor that had failed three times natively — "same task, same model, same scoped tools, same prompt; the only change was running grok inside its CLI harness" — native gave "0 writes / partial / regression," the harness gave a correct refactor with duplication 6 → 0.

  2. One task, three vendors — first try

    The showcase: nexus.yaml bound orchestrate, execute, and review to different providers, the router and DirectCoordinatorClient wired into all three spawn sites, and the run went "completely, first try" — Claude orchestrated, Grok executed, Codex reviewed, the file on disk was byte-exact, Codex advanced at alignment 1.0, the task completed. Kyle's reaction is in the memory: "wait hold on did that work." The thesis, stated plainly: the driver "never knew which vendor ran any turn."

  3. Briefs, not plans

    The briefs-over-plans wager became a concrete pressure test rather than an article of faith: hand an agent the bounds (problem, success, what's out of scope) versus a rigid step-by-step plan, and measure. Fix the bounds, free the path. The early signal favored briefs — a plan-framed run followed a now-stale plan to a wrong answer that review caught — and the result was honestly small (n=1 per arm), so the switch stayed in the code to keep measuring rather than declaring.

IV. The frontier — June

The record ends here, in the memory. These last entries are from the build log (STATUS.md) — real, recent, and not yet written back into Engram. They're the work the rest of this page leans on.

  1. Breadth isn't depth

    The intuition that a cross-provider review panel — different vendors voting — would catch confident-wrong answers turned out to be wrong: a planted behavioral trap passed, because every reviewer checked the stated criteria and the wrong answer satisfied them. Diversity can't see a shared blind spot. The conclusion redirected the next move from breadth to depth, and added a four-way outcome scorer (CORRECT / CONFIDENT-WRONG / HONEST-ESCALATE / FALSE-ESCALATE) to measure the exact failure a gate is supposed to prevent.

  2. The narration race, not a provider tax

    The orchestrator's habit of narrating tool calls as prose had been blamed on the model. A census of all 363 spawn captures said otherwise — P(narrated | zero tools at turn start) = 42/42 — and root-caused it to an MCP connection race in Nexus's own relay: under concurrent spawns the relay dialed the coordinator before answering the CLI's initialize, leaving the agent with zero tools. The fix was a bridge-first, lazy-connect relay plus an alwaysLoad flag — "NOT a provider tax; a fixable race in our code." The same stretch shipped the depth the prior finding called for: a run_probe tool for an adversarial reviewer, with a blocking_concerns validator that code-coerces a recorded regression off advance. The lesson logged: suspect your own plumbing first.

  3. Proven across vendors — and grok parked, in one line

    The product server, driven by an external client over real MCP stdio, ran the full lifecycle 3/3 with per-phase provenance asserted from the task log: orchestrate=claude-cli, execute=codex-cli, review=codex-cli. Two vendors actively cross-cutting one task, the claude MCP-race fix verified live. The honest footnote in the same breath: Grok, flaky even in its own CLI, was parked — execute and review routed to Codex with a one-line nexus.yaml edit, no change to the orchestrator, the briefs, or the prompts. That one-line mitigation is the thesis paying off under real failure.

  4. The blind spot is the vendor, not the knobs

    A six-run ladder put a single Codex reviewer against the identical planted trap, moving one knob per rung — more effort, an adversarial lens, finally an explicit instruction quoting the exact observed failure. It missed every time; one lensed probe even hit the revealing input but took its expected values from the code under review — circular verification. The diagnosis: epistemic anchoring on the brief's framing, immune to prompting. Claude caught the same trap first try. So the two-vendor lensed panel became the shipped default — and the same day's dogfood exposed that the review gate itself had been a plan artifact (unhinted plans simply omitted it, twice out of twice), so brief acceptance now appends and re-tiers the gate in code. Two structural answers to two measured holes.

  5. The reflect-lead closes the loop

    A third Operations phase landed: when review gates a turn, a read-only reflect-lead re-runs the reviewer's probe, reads the originals from history, and rules frame or work — emitting re_brief | re_dispatch | accept_terminal, with amendments structurally barred from touching scope or the tier plan. Its first live run cured the planted trap end-to-end: the panel caught the regression, the reflection called the brief's criteria "mutually unsatisfiable as stated," the amended brief wrote the held-out ground truth into the criteria, a different vendor re-executed under the new frame, and the external check passed. Caught → diagnosed → re-briefed → re-built → verified. The persistent project-lead has its first organ.

  6. It runs itself

    Three closures in two days. Nexus refactored its own engine — the long-deferred native-loop dedup — through its own registered MCP surface, panel-reviewed, landed on main at net −39 lines. A real maintenance task fanned out unhinted: the orchestrator banded two disjoint slices on its own, both ran concurrently, and the commit that landed fixed the repo's one standing-red test. And the panel decision got its distribution: sixteen live discriminator runs, single reviewer 0/8, two-vendor panel 8/8, Fisher-exact p ≈ 8e-5. The framework is no longer just proven over the wire — it's a tool its own repo uses.

What's not in this log

Refactors that didn't change what Nexus is. There's an order of magnitude more churn — schema round-trips, Windows-safety fixes, test passes climbing 332 → 712 — than this page admits. A decision log isn't a changelog: each entry above named something the framework started or stopped doing on purpose. And the most recent entries are the least settled — June is real but young, and the version of this page that's reconstructed entirely from memory is the one the next round of authoring writes.