// project

active

Mantle

A lightweight TypeScript agent harness with provenance-aware long-term memory wired in before inference, mid-stream voice, and three LLM providers including ChatGPT subscription billing.

[updated 2026-04-01] [sections 5] [stack 6]

// session :: mantle.overview

The agent shouldn't have to ask for memory. Memory should already be on the table when it sits down.

── design note

Pitch

Mantle is the runtime my agents actually live in. The harness itself is intentionally small — an agent loop, a tool registry, three provider adapters, a vanilla-JS WebSocket UI. Three npm runtime deps. No web framework on either side of the wire.

The complexity that earns its keep lives in the layers wired around the loop. Engram, my long-term memory engine, is loaded before inference: every user turn arrives at the model with a relevant memory pack already injected into the system prompt's dynamic zone. Voice streams mid-reply through a sentence-boundary chunker so the first audio chunk lands while the model is still drafting later ones. Heartbeat runs a per-agent archival pipeline that turns session transcripts into framed memories on its own, in the background, on event-gated intervals. Multi-agent is per-agent isolation by construction — each agent's memory pool is a physically separate ChromaDB store. Three providers, including OpenAI Codex routed through ChatGPT subscription billing instead of API credits.

3 runtime npm deps: anthropic, openai, yaml

3 LLM providers: claude · grok · chatgpt

2 memory pools: framed + raw

4 cache breakpoints: anthropic max

What's wired in

Engram integration — companion-mode deployment, 4-type speech-act library, 6-intent retrieval matrix, per-agent isolated stores
Pre-inference memory pack — ~6 query variants fanned into one batched search, FTS5 fallback, optional temporal retrieval, ambient memories on every turn
Heartbeat archival pipeline — event-gated session-ingest + session-mine tasks build the two-pool retrieval over time without operator intervention
Voice — Chatterbox-streaming TTS with mid-reply sentence chunking, audio-driven text reveal, per-message replay, per-agent voice tuning. Whisper STT endpoint live; hands-free input loop in progress.
Multi-agent — per-agent workspaces, per-agent Engram stores, persona masks, per-agent provider/model/voice/heartbeat overrides
Cron — SQLite-backed scheduled jobs with Engram pre/post-run hooks for context enrichment and outcome storage
Claude Code CLI mode — sessions can route through the claude binary instead of the in-process loop; Mantle still injects the per-turn dynamic context via --append-system-prompt
Skills — two-tier (global + per-agent), token-budgeted, loaded into the system prompt's stable zone
Prompt caching — a three-zone system prompt (stable / persona / dynamic) gives both providers the longest possible cache prefix; Anthropic gets explicit cache_control, xAI auto-caches by prefix

Reading order

Architecture is the loop and how the providers, prompt zones, and cache breakpoints fit. Memory is the Engram integration — the part worth reading the code for. Voice is the mid-stream pipeline. Runtime is everything operational that doesn't fit elsewhere — multi-agent, personas, skills, cron, CC CLI mode. Decisions is the log of load-bearing calls.

This isn't a thesis project — Mantle exists so my agents have somewhere to live. But the calls that shaped it are deliberate, and the integration with Engram is the part that's actually interesting.

// section 01 :: architecture (1 / 5)

Architecture

One loop, three providers, three-zone prompts, one tool registry.

// session :: mantle.architecture

The loop

The core loop is provider-agnostic by design. Every provider adapter — Claude (Anthropic SDK), Grok (OpenAI SDK pointed at xAI), OpenAI Codex (Responses API on chatgpt.com/backend-api/codex) — emits the same normalized ProviderEvent channel: text_delta, thinking_delta, tool_call_delta, message_end, tool_use. The loop reads that channel, dispatches tool calls through a single registry, persists everything as JSONL transcript, and feeds results back for the next iteration.

user message
   ↓
buildSystemPrompt   ← stable + persona + dynamic (memory pack)
   ↓
provider.stream()   ← claude | grok | openai-codex
   ↓
ProviderEvent[]     ← text · thinking · tool_call · message_end
   ↓
registry.dispatch(toolCall)   ← core + MCP-bridged tools
   ↓
   loop until end_turn
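
A minimal sketch of how a loop might fold that normalized channel into one turn. The event names come from the text above; the payload fields (`text`, `id`, `name`, `args`, `stopReason`) and the `foldProviderEvents` helper are assumptions for illustration, not Mantle's actual types.

```typescript
// Event names are from the section above; payload fields are assumed.
type ProviderEvent =
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; text: string }
  | { type: "tool_call_delta"; id: string; argsFragment: string }
  | { type: "tool_use"; id: string; name: string; args: unknown }
  | { type: "message_end"; stopReason: "end_turn" | "tool_use" };

interface FoldedTurn {
  text: string;
  thinking: string;
  toolCalls: { id: string; name: string; args: unknown }[];
  endTurn: boolean;
}

// Reduce one provider response into the pieces the loop acts on.
function foldProviderEvents(events: ProviderEvent[]): FoldedTurn {
  const turn: FoldedTurn = { text: "", thinking: "", toolCalls: [], endTurn: false };
  for (const ev of events) {
    switch (ev.type) {
      case "text_delta":
        turn.text += ev.text;
        break;
      case "thinking_delta":
        turn.thinking += ev.text;
        break;
      case "tool_call_delta":
        break; // partial arguments are ignored here; the complete call arrives as tool_use
      case "tool_use":
        turn.toolCalls.push({ id: ev.id, name: ev.name, args: ev.args });
        break;
      case "message_end":
        turn.endTurn = ev.stopReason === "end_turn";
        break;
    }
  }
  return turn;
}
```

Because every adapter emits the same union, the loop only needs this one switch regardless of which provider produced the stream.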

The harness is what's not in there: no web framework, no agent framework, no orchestration library. Three runtime deps. Everything else is platform.

Three providers, one event channel

Claude: Anthropic SDK. Extended thinking gated by thinkingLevel (off → 0, low → 4096, medium → 10000, high → 16384). Reads back cache_read_input_tokens and cache_creation_input_tokens per turn.
Grok: OpenAI SDK with baseURL pointed at xAI. Reasoning splits across two architectures: grok-4.3 honors reasoning_effort; the older grok-4.20-* and grok-4-1-fast-* lineups encode reasoning depth in the model id and reject the param at the API layer. Provider gates this via a MODELS_WITH_CONFIGURABLE_REASONING allowlist.
OpenAI Codex: Routes through the user's ChatGPT subscription via OAuth (PKCE) instead of API credits. Backend is chatgpt.com/backend-api/codex, not api.openai.com. Uses the Responses API (not Chat Completions). Stream mandatory, instructions field for system prompt, strict input shape. x-codex-* response headers prove subscription billing without checking the dashboard.
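
The reasoning knobs for the first two adapters can be sketched as plain lookup tables. The budget values are the ones listed above. The allowlist contents are an assumption (only grok-4.3 is named in the text as honoring the parameter), and `grokReasoningParams` is a hypothetical helper name.

```typescript
type ThinkingLevel = "off" | "low" | "medium" | "high";

// Budgets as listed for the Claude adapter above.
const THINKING_BUDGET_TOKENS: Record<ThinkingLevel, number> = {
  off: 0,
  low: 4096,
  medium: 10000,
  high: 16384,
};

// Assumed contents: the text names grok-4.3 as the model that accepts reasoning_effort.
const MODELS_WITH_CONFIGURABLE_REASONING = new Set<string>(["grok-4.3"]);

// Extra request params for a Grok call. The param is omitted entirely for
// models outside the allowlist, since they reject it at the API layer.
function grokReasoningParams(
  model: string,
  effort: "low" | "high",
): { reasoning_effort?: "low" | "high" } {
  return MODELS_WITH_CONFIGURABLE_REASONING.has(model) ? { reasoning_effort: effort } : {};
}
```

Keeping the gate as an allowlist rather than a denylist means a newly added model id fails safe: the param is simply not sent until someone confirms the model accepts it.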

The interesting one is Codex. ChatGPT Plus / Pro / Team / Enterprise subscriptions include Codex inference quota; Mantle reuses OpenAI's published Codex CLI client_id to ride that auth flow and route requests through the same backend the official CLI does. Token rotation, JWT identity decode, and refresh dedup live in src/auth/openai-codex.ts. The ChatGPT-Account-Id header is required and comes from the JWT's chatgpt_account_id claim.

System prompt: three zones

buildSystemPrompt (in src/agent/prompt-builder.ts) emits three zones, in order. The ordering is not cosmetic — it's the cache discipline.

stable: Base identity, workspace files (AGENTS.md, IDENTITY.md, USER.md, MEMORY.md), skills index. Invalidates only when a workspace file is edited or a skill is toggled.
persona: Active persona profile. Invalidates only on mask swap. Skipped entirely in heartbeat mode (archival judgment shouldn't be biased by voice).
dynamic: Persona transition note + timestamp + per-turn memory pack. Not cached — it's expected to change every turn. The whole point of having three zones is that the dynamic zone can change without invalidating the stable + persona prefix.

Anthropic gets four explicit cache_control markers (the API maximum): end of tools array, end of stable, end of persona, last assistant message's last block. xAI auto-caches by prefix match — same zone ordering produces the same cache hit pattern without explicit markers. Codex doesn't expose cache breakpoints (and rejects the param), so its cost profile is whatever the ChatGPT subscription absorbs.
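
A sketch of the zone ordering as Anthropic-style system blocks, assuming the zones arrive as pre-rendered strings. `buildSystemBlocks` is a hypothetical stand-in for buildSystemPrompt, and only the two system-prompt markers are shown; the tools-array and last-assistant-message markers live elsewhere in the request.

```typescript
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Zone order is stable, then persona, then dynamic. Only the first two end in
// a cache marker; the dynamic zone stays unmarked so it can change every turn.
function buildSystemBlocks(
  stable: string,
  persona: string | null,
  dynamic: string,
): SystemBlock[] {
  const blocks: SystemBlock[] = [
    { type: "text", text: stable, cache_control: { type: "ephemeral" } },
  ];
  if (persona !== null) {
    // Heartbeat mode would pass null here and skip the zone entirely.
    blocks.push({ type: "text", text: persona, cache_control: { type: "ephemeral" } });
  }
  blocks.push({ type: "text", text: dynamic });
  return blocks;
}
```

The same ordering serves a prefix-matching provider without any markers: as long as the stable and persona strings are byte-identical between turns, the shared prefix is unchanged.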

Tool descriptions live in the structured tools API param only, never as prose in the system prompt. Earlier versions duplicated them as Markdown — that was ~12 kB extra per turn for no benefit.

Tool registry

A single Registry (src/tools/registry.ts) holds every tool the agent can call. Three sources feed it:

Core tools in src/tools/core/ — filesystem, bash, web, memory wrappers, sessions, cron, agent-attachments, research.
MCP-bridged tools via stdio. Engram is the canonical bridge (~25 tools), plus Brave Search and @playwright/mcp for browser automation as upstream MCP servers. bridgeMcpTools wires upstream tools into the registry; new tools appear without TypeScript changes.
Per-task allow-lists — heartbeat tasks declare a tools: whitelist, and the registry filters dispatch through it. The model literally cannot call anything outside the allow-list for that invocation. (See memory for the two-phase archival pipeline that uses this.)
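
A minimal sketch of allow-list filtering at dispatch time. `ToolRegistry`, the handler shape, and the result type are assumptions; the real Registry in src/tools/registry.ts may differ.

```typescript
type ToolHandler = (args: unknown) => unknown;

type DispatchResult =
  | { ok: true; value: unknown }
  | { ok: false; error: "tool-not-available" };

class ToolRegistry {
  private tools = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.tools.set(name, handler);
  }

  has(name: string): boolean {
    return this.tools.has(name);
  }

  // With an allow-list, anything outside it is refused before the handler
  // is even looked up, so a registered tool can still be unreachable.
  dispatch(name: string, args: unknown, allow?: readonly string[]): DispatchResult {
    if (allow !== undefined && allow.indexOf(name) === -1) {
      return { ok: false, error: "tool-not-available" };
    }
    const handler = this.tools.get(name);
    if (handler === undefined) {
      return { ok: false, error: "tool-not-available" };
    }
    return { ok: true, value: handler(args) };
  }
}
```

Returning the same error for "filtered" and "not registered" keeps the model from learning which tools exist outside its allow-list.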

Sessions

JSONL files in .mantle/sessions/<agentId>/, one per session, with a per-agent index.json. Each entry is a turn: user message, assistant text, thinking, tool calls, tool results. Both the in-process loop and Claude Code CLI mode write this same format.

Compaction (src/agent/compaction.ts) summarizes long sessions when the turn budget gets tight, producing a compact rolling summary that replaces older turns in context.

Claude Code CLI mode

Sessions can route through the claude CLI binary instead of the in-process loop. CC owns model selection and tool execution; Mantle forwards the per-turn dynamic context (persona, memory pack, transition note) via --append-system-prompt and renders CC's event stream into the same UI. Useful when the goal is CC's tool surface + agent loop with Mantle's memory injection.

The trade-off: voice mode doesn't apply to CC sessions (CC owns the loop, no text_delta interception layer to hook the chunker into).

What's not in the harness

A short list of things that aren't part of Mantle's core, on purpose:

No web framework. Server is Bun.serve() returning HTTP + WebSocket + static. UI is plain HTML/JS/CSS. The simplicity is load-bearing — every dep is a future maintenance bill.
No agent framework. No LangChain, no LlamaIndex, no autogen. The loop is ~200 lines.
No orchestrator. Multi-agent is per-agent state, not coordinated execution.
No vendored Engram. Mantle installs Engram editable from the sibling repo and respawns the MCP adapter on schema changes. Edits to ../rev-engram/src/engram*/*.py show up on the next spawn.
Not yet: sandboxed tool execution, NEXUS companion-app integration, Phase 2 voice (hands-free VAD + Whisper input loop).

// section 02 :: memory (2 / 5)

Memory

Engram integration. Two pools, six query variants, one pre-injected pack.

// session :: mantle.memory

Most agent harnesses give the model a recall tool and trust it to call when needed. Mantle doesn't.

── design note

This is the part of the harness worth reading the code for. Mantle integrates Engram — a 6-signal provenance-aware retrieval engine — and uses it differently from the conventional pattern.

Companion deployment

Engram ships with two deployment shapes: multi-agent (named-agent authority hierarchy, 10-type memory library) and companion (role-based authority, 4-type speech-act library, 6-intent retrieval matrix). Mantle runs Engram in companion mode. The agent's voice and personality come from workspace files (SOUL.md, AGENTS.md, IDENTITY.md); Engram is purely the retrieval and storage layer.

The companion library is built around speech acts the agent might hold about its user:

want: A goal, plan, or thing the user intends to do. Boosted under the procedural query intent.
preference: A stable taste, habit, or how-they-like-things pattern. Boosted under the preference intent.
opinion: What the user thinks or feels about a thing. Boosted under the reflection intent.
observation: Biographical fact or current state. Boosted under state_check and recall intents.

The intent × type matrix is what shapes retrieval. A "what's the user planning to do this week?" turn ranks want drawers higher than observation drawers, even when the cosine similarities are equal.
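
A toy version of the intent × type boost, to make the tie-break concrete. The type-to-intent pairing is from the list above; the 1.5 multiplier is an invented number, the sixth intent is left out because the text doesn't name it, and the real scoring combines more signals than similarity alone.

```typescript
type MemoryType = "want" | "preference" | "opinion" | "observation";
type Intent = "procedural" | "preference" | "reflection" | "state_check" | "recall";

// Which type each named intent boosts, per the list above.
const BOOSTED_TYPE: Record<Intent, MemoryType> = {
  procedural: "want",
  preference: "preference",
  reflection: "opinion",
  state_check: "observation",
  recall: "observation",
};

interface Drawer {
  text: string;
  type: MemoryType;
  similarity: number;
}

// Rank drawers by similarity times an intent-dependent type multiplier.
// The default boost of 1.5 is illustrative, not Engram's actual value.
function rankByIntent(drawers: Drawer[], intent: Intent, boost = 1.5): Drawer[] {
  const score = (d: Drawer) => d.similarity * (d.type === BOOSTED_TYPE[intent] ? boost : 1);
  return drawers.slice().sort((a, b) => score(b) - score(a));
}
```

With two drawers at identical cosine similarity, the one whose type matches the intent's boosted type sorts first, which is the behavior described for the "planning this week" example.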

Two-pool memory model

Engram stores two physically separate ChromaDB collections, with separate retrieval surfaces:

memory pool: Framed interpretations. "Kyle prefers small focused PRs over bundled ones." Authored from the agent's perspective with type, signature, pin status, confidence. Scored on all 6 Engram signals (similarity × salience × authority × confidence × type-multiplier × scope-penalty).
source pool: Raw chunked content. Full text of session transcripts, ingested code, ingested docs. Similarity-only retrieval. Separate query surface (engram_search_source / recall_source).

The pools never compete because they're served by different tools. "What does Kyle think about the auth flow?" hits the memory pool. "What did Kyle literally say last Tuesday about the cron bug?" hits the source pool. Engram's piece-4 scope-boundary invariant — that source chunks cannot enter the memory scoring pool — is enforced structurally here: separate collections, separate query tools, no shared ranking.

The pre-inference memory pack

Most agent harnesses give the model a recall or search_memory tool and trust it to call when needed. Mantle doesn't. On every user turn, before any inference happens, buildMemoryPack (src/agent/memory-pack.ts) runs and assembles a markdown block that gets injected into the system prompt's dynamic zone.

Six stages:

Query fan-out — multi-source, multi-variant query set built from the user message, the prior assistant text (windowed to 600 chars), and the prior user text.
Three concurrent calls via Promise.all: batched topical search, optional temporal search, ambient sample.
FTS5 fallback — only if topical returned empty.
Pack assembly — top-9 topical hits, temporal section, reminiscing tail.
Empty-pack note — even when nothing relevant came back, the pack still injects "memory was queried and returned nothing — don't search for more" so the agent doesn't waste a turn calling recall.
Inject into dynamic zone — per-turn variation never invalidates the larger stable + persona cache prefix.

Stage 1 — query fan-out

The user message is decomposed into roughly 6 variants:

Full text of the user turn
Per-clause split on .!?;, top 3 by length, min 12 chars per clause
Stripped-filler versions of full + each clause (articles + short prepositions are kept — Jina's phrase embeddings are structure-sensitive)
Template HyDE reframings — Kyle wants {topic} / Kyle prefers {topic} / Kyle thinks {topic} / Kyle is {topic} — fired only when the topic phrase has ≥5 content words, since shorter topics let the framing tokens dominate the embedding

Plus the prior assistant turn (windowed to 600 chars) — critical for "tell me more" / "go ahead" follow-ups where the user message has no topical signal — and the prior user turn for thread continuity.

Stripped variants below 8 chars are dropped. Single-word fragments suffer from Jina word-sense ambiguity and aren't worth the embed call.
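
The fan-out rules above can be sketched as a pure function. The thresholds (12-char clauses, top 3 by length, 8-char stripped floor, 5 content words for HyDE) are from the text. The filler and non-content word lists are invented, the whole stripped message is used as the HyDE topic for simplicity, and the prior-turn variants are left out.

```typescript
// Hypothetical word lists; the real sets are not given in the text.
const FILLER_WORDS = new Set(["um", "uh", "just", "really", "basically", "actually", "please", "hey"]);
const NON_CONTENT_WORDS = new Set(["a", "an", "the", "of", "to", "in", "on", "at", "for", "and", "or", "is", "it", "we", "can", "i", "you"]);

const bareWord = (w: string) => w.toLowerCase().replace(/[^a-z0-9']/g, "");

// Split on .!?; and keep the three longest clauses of at least 12 chars.
function splitClauses(text: string): string[] {
  return text
    .split(/[.!?;]+/)
    .map((c) => c.trim())
    .filter((c) => c.length >= 12)
    .sort((a, b) => b.length - a.length)
    .slice(0, 3);
}

// Drop filler tokens only; articles and short prepositions are kept.
function stripFiller(text: string): string {
  return text
    .split(/\s+/)
    .filter((w) => w.length > 0 && !FILLER_WORDS.has(bareWord(w)))
    .join(" ");
}

function contentWordCount(text: string): number {
  return text
    .split(/\s+/)
    .map(bareWord)
    .filter((w) => w.length > 0 && !NON_CONTENT_WORDS.has(w)).length;
}

function buildQueryVariants(userText: string, userName = "Kyle"): string[] {
  const full = userText.trim();
  const base = [full, ...splitClauses(full)];
  const stripped = base.map(stripFiller).filter((s) => s.length >= 8);
  const topic = stripFiller(full).replace(/[.!?;]+$/, "");
  // Template reframings fire only when the topic is long enough to dominate the frame.
  const hyde =
    contentWordCount(topic) >= 5
      ? ["wants", "prefers", "thinks", "is"].map((verb) => `${userName} ${verb} ${topic}`)
      : [];
  return Array.from(new Set([...base, ...stripped, ...hyde]));
}
```

A terse follow-up like "ok go" produces a single variant here, which is exactly the case where the prior assistant turn has to carry the topical signal.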

Stage 2 — three concurrent calls

Promise.all([
  engram_search_batch(queryVariants, { min_score: 0.10 }),  // single embed pass
  hasTemporalPhrase ? engram_search_temporal(...) : null,
  engram_sample_drawers({ k: 6, exclude_types: ["observation"] })
])

engram_search_batch is one embed pass for N queries — N forward passes' worth of cost collapsed to ~one. engram_sample_drawers skips embedding entirely (just chroma rows + scoring), so the speculative ambient pull is essentially free.

Temporal retrieval fires only when chrono-node parses a date phrase from the user turn. Grain (day / week / month / year) is inferred from the matched phrase; mode is auto-selected — empty residual after stripping the temporal phrase fires session-enum mode, substantive residual fires semantic-within-range. False-positive parses ("fix the May bug") that return zero results are silently suppressed.

Stage 3 — FTS5 fallback

Only fires sequentially if topical merge came back empty: engram_search with min_score=0 to surface keyword-heavy hits the embedding floor missed. The fallback lives in Engram's SQLite FTS5 index, not Chroma, so it catches signature-verbatim hits at 100% R@1 on unique phrases.

Stage 4 — pack assembly

─── Recalled Memories ──────────────────────
  topical    (top 9 by score, with type/wing/room context)
  temporal   (when present, with date grain)
  ambient    (3 random drawers, voiceful types only)
─── End Recall ─────────────────────────────

Each hit renders with type, wing, room, and the framed text. The agent gets memory as background context, not as a tool. Mid-turn recall calls are still possible (the tool is registered) but actively discouraged in operating instructions, since the pre-injected pack already covers the topical retrieval budget for the turn.
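
A rough sketch of the assembly step, assuming hits arrive as plain records. The top-9 and 3-ambient limits follow the layout above; the `PackHit` shape, the line format, and the exact empty-pack wording are assumptions.

```typescript
interface PackHit {
  type: string;
  wing: string;
  room: string;
  text: string;
  score: number;
}

// Wording adapted from the empty-pack note described earlier.
const EMPTY_PACK_NOTE = "memory was queried and returned nothing; don't search for more";

function renderMemoryPack(topical: PackHit[], temporal: PackHit[], ambient: PackHit[]): string {
  const line = (h: PackHit) => `  [${h.type} | ${h.wing}/${h.room}] ${h.text}`;
  const top = topical.slice().sort((a, b) => b.score - a.score).slice(0, 9);
  const tail = ambient.slice(0, 3);

  const body: string[] = [];
  if (top.length > 0) body.push("topical", ...top.map(line));
  if (temporal.length > 0) body.push("temporal", ...temporal.map(line));
  if (tail.length > 0) body.push("ambient", ...tail.map(line));
  // An empty pack still says so, so the agent doesn't spend a turn on recall.
  if (body.length === 0) body.push(EMPTY_PACK_NOTE);

  return ["─── Recalled Memories ───", ...body, "─── End Recall ───"].join("\n");
}
```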

MANTLE_DISABLE_MEMORY_TOOLS=1 hides the recall tools from the agent surface entirely while keeping them callable from the pack builder. Used to test whether the pack alone serves the agent's needs.

Per-agent isolation and lazy spawn

Each Mantle agent runs its own Engram MCP adapter process pointed at its own ENGRAM_PATH. Memory pools never cross agents by default — Sly's drawers and Vega's drawers live in physically separate ChromaDB stores; one agent's writes are invisible to the other unless they explicitly share a path.

Spawning is lazy. On startup Mantle does not spin up Python:

Loads Engram's tool surface from a cached schema (.mantle/cache/engram-schema.json) so the registry can register engram_* tools without spawning anything.
If the cache is missing (first boot ever, or user deleted it), probes the default agent's Engram once, captures the schema, writes the cache. That probe stays warm as the default agent's client.
Every other agent stays dormant until it receives its first chat turn, heartbeat tick, cron job, or memory-pack query.

Concurrent first-spawns for the same agent are deduped through a single in-flight promise — a burst of parallel calls (the memory pack fires up to 12 batched searches in one shot) only triggers one Python process.
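
The in-flight dedup is a small pattern worth showing. This is a generic sketch, not the code in src/engram/manager.ts: `LazySpawner` and its `spawn` callback are hypothetical names, and forgetting a failed spawn so it can be retried is my assumption.

```typescript
class LazySpawner<T> {
  private inFlight = new Map<string, Promise<T>>();

  constructor(private spawn: (agentId: string) => Promise<T>) {}

  // All concurrent callers for the same agent share one promise,
  // so spawn() runs once no matter how many calls arrive in a burst.
  get(agentId: string): Promise<T> {
    let pending = this.inFlight.get(agentId);
    if (pending === undefined) {
      pending = this.spawn(agentId);
      this.inFlight.set(agentId, pending);
      // Assumed behavior: drop a failed spawn so a later call can retry.
      pending.catch(() => this.inFlight.delete(agentId));
    }
    return pending;
  }
}
```

The resolved promise stays in the map, so it doubles as the warm-client cache after the first spawn completes.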

The Jina embedder is daemon-level, not per-agent, so it loads once across all agents.

Implementation: src/engram/manager.ts.

Daemon split

Engram runs as two processes: a long-lived daemon (engram_daemon, holds the Chroma stores + Jina embedder) and a thin per-agent MCP adapter (engram_mcp, stdio→HTTP translator) spawned by Mantle.

Mantle does not manage the daemon — start it separately and leave it running. Adapter env: ENGRAM_DAEMON_URL (default 127.0.0.1:49765), ENGRAM_PATH (per-agent store), auth token resolved from ENGRAM_AUTH_TOKEN or ENGRAM_AUTH_FILE (default ~/.engram/auth.json, daemon bootstraps on first run).

Boot flow: Mantle probes daemonUrl/healthz first; if unreachable, Engram is disabled for the session and Mantle runs without memory. No boot stall, no failed spawn, no pages of Python tracebacks.

Heartbeat archival pipeline

This is what builds the two-pool retrieval over time, without operator intervention. A timer-based background runner (src/heartbeat/runner.ts) executes scheduled tasks per agent. Each task runs in its own agent loop with a stripped "archivist" system prompt (no SOUL.md, no persona, neutral judgment) and a per-task tool allow-list.

The default HEARTBEAT.md defines a two-phase pipeline:

session-ingest: Renders each JSONL session transcript to a sibling .md file (one H2 per turn), then ingests it into the source pool under wing=sessions, room=<session-id>. Mechanically scoped to render_session_markdown and engram_ingest_source. Cannot write framed memories.
session-mine: Walks recent sessions, extracts framed companion memories (wants, preferences, opinions, observations), checks both pools for duplicates before writing into the memory pool. Mechanically scoped to memory-write tools. Cannot ingest sources.

The two-lens retrieval falls out structurally. Phase 1 fills the source pool with raw transcript chunks; Phase 2 fills the memory pool with interpretive framings of the same material. Same source content, different query surfaces, different ranking semantics.

The tool allow-list is mechanical, not advisory. Phase 1 literally cannot call remember. Phase 2 literally cannot call engram_ingest_source. The registry filters dispatch through the per-task whitelist; calls outside it return as tool-not-available rather than executing.

Tasks are event-gated via watch: paths. A task with watch: [sessions, MEMORY.md] only fires when the interval elapsed and at least one watched path changed since the last run. Idle days cost zero tokens.
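
The gate is a conjunction of two checks, which can be sketched as a pure predicate. Detecting change by comparing modification times against the last run is an assumption; `WatchedTask` and `shouldRunTask` are hypothetical names.

```typescript
interface WatchedTask {
  intervalMinutes: number;
  lastRunMs: number;
  watch: string[];
}

// mtimes maps each watched path to its last-modified time in epoch ms.
// A task fires only when the interval has elapsed AND something it watches changed.
function shouldRunTask(
  task: WatchedTask,
  nowMs: number,
  mtimes: Record<string, number>,
): boolean {
  const intervalElapsed = nowMs - task.lastRunMs >= task.intervalMinutes * 60_000;
  const changed = task.watch.some((path) => (mtimes[path] ?? 0) > task.lastRunMs);
  return intervalElapsed && changed;
}
```

On an idle day the second check is false on every tick, so no agent loop is started and no tokens are spent.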

Authoring discipline

The companion library expects framed interpretations, not verbatim logs. The agent is the first interpreter; the retrieval agent is the second. From AGENTS.md-shaped guidance:

"Kyle is planning to check out the new pho place this week. Worth asking how it went."

Not: User said: "I'm going to try that new pho place".

Memories should carry forward intent, context, and the agent's interpretation. Verbatim quotes belong in the source pool — they're already there, automatically, via session-ingest.

Tool surface

High-level wrappers in src/tools/core/memory.ts: remember, recall, recall_source, memory_status. The raw Engram surface (~25 tools as of v0.7) is bridged automatically through bridgeMcpTools — engram_search, engram_search_batch, engram_search_source, engram_search_temporal, engram_sample_drawers, engram_check_duplicate, engram_build_context_pack, engram_record_retrieval, and the rest. Wrappers gate on registry.has(...) so they degrade gracefully when a tool isn't bridged yet.

engram_research and engram_watch_* aren't wired in the daemon adapter yet — the wrappers handle that absence rather than the agent needing to know which version of Engram is running.

// section 03 :: voice (3 / 5)

Voice

Mid-stream TTS via Chatterbox, audio-driven text reveal, per-message replay.

// session :: mantle.voice

What's here

Mid-stream TTS replies via chatterbox-streaming (the generate_stream-supporting fork of the original 0.5B Chatterbox — not the distilled turbo variant), Whisper for transcription, plus on-demand per-message replay and per-agent voice tuning. Phase 1 is shipped — TTS-out is fully integrated. Phase 2 is in progress: hands-free input via browser-side Silero VAD into the Whisper endpoint.

3 voice tuning knobs: temp · cfg · exaggeration

50 first-chunk char floor (vs 60 for subsequent)

1 loopback sidecar: port 7333

Sidecar

A Python FastAPI service at voice/server.py, port 7333, loopback only. Spawned eagerly by VoiceManager (src/voice/manager.ts) on mantle start: the idle process is cheap, and models load lazily via POST /voice/load. Reuses the Engram .venv (pip install -r voice/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124).

GET  /health
GET  /voice/status
POST /voice/load           # serialized — see below
POST /voice/unload
POST /voice/tts/synthesize → audio/wav
POST /voice/stt/transcribe → { text, language, duration_s, inference_ms }

TTS + STT loads are SERIALIZED. Transformers v5's _LazyModule is not thread-safe; parallel imports fail with a misleading cannot import name 'LlamaModel'. Any new transformers-touching engine has to go through the same sequential chain in /voice/load. This bit me hard once.

Mid-stream pipeline

StreamChunker (src/voice/stream-chunker.ts) watches text_delta events from the agent loop. Greedy sentence-merge: emits when the buffer ends in .!? followed by whitespace AND length ≥ minChars.

The first chunk uses a lower threshold (firstChunkMinChars, default 50) for live streaming so the opening sentence ships sooner than a full 60-char chunk would. Subsequent chunks use the standard minChars (60) for prosody. The 50-char default produces ~4s of opening audio, comfortably longer than worst-case chunk-2 synthesis (~1.5s at default cfm timesteps), so back-to-back playback lands gap-free.
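
A simplified sketch of the greedy sentence-merge, assuming the 50/60 defaults above. `SentenceChunker` is a hypothetical stand-in for StreamChunker; it scans for the last sentence boundary in the buffer rather than only checking the buffer's tail, which is one reasonable reading of the rule.

```typescript
class SentenceChunker {
  private buffer = "";
  private emitted = 0;

  constructor(
    private firstChunkMinChars = 50,
    private minChars = 60,
  ) {}

  // Append a text delta; returns zero or one chunk ready for synthesis.
  feed(delta: string): string[] {
    this.buffer += delta;
    const min = this.emitted === 0 ? this.firstChunkMinChars : this.minChars;
    // Scan back for the last .!? that is followed by whitespace.
    for (let i = this.buffer.length - 2; i >= 0; i--) {
      const isBoundary =
        /[.!?]/.test(this.buffer.charAt(i)) && /\s/.test(this.buffer.charAt(i + 1));
      if (!isBoundary) continue;
      if (i + 1 < min) break; // latest boundary is still too short: keep merging
      const chunk = this.buffer.slice(0, i + 1);
      this.buffer = this.buffer.slice(i + 1); // leading whitespace stays with the next chunk
      this.emitted++;
      return [chunk];
    }
    return [];
  }

  // Call when the message ends to release whatever is left.
  flush(): string[] {
    const rest = this.buffer;
    this.buffer = "";
    return rest.trim().length > 0 ? [rest] : [];
  }
}
```

Short sentences merge into the following one until the threshold is met, which is why the first chunk's lower floor matters: it is the only thing standing between the first delta and the first audio.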

text_delta stream
  ↓
StreamChunker        ← sentence-merge, minChars threshold
  ↓ (parallel synth, serial emit)
voiceClient.synthesize()  ─┐
                           ├─ tts_audio { synthId, chunkIdx, text, audioBase64 }
                           │
  Promise chain (idx order)─┘
  ↓
WebSocket → browser

Each chunk fires voiceClient.synthesize() immediately (parallel), but emit-to-UI chains through a Promise so events reach the WS in idx order. The chain checks abortSignal at the start of each step, so replays cancel mid-stream cleanly.
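
The parallel-synth, serial-emit shape can be sketched with a single promise chain. `OrderedEmitter` is a hypothetical name, the `AbortFlag` object stands in for an AbortSignal, and error handling for a failed synth is omitted.

```typescript
// Stand-in for an AbortSignal so the sketch has no platform dependency.
interface AbortFlag {
  aborted: boolean;
}

class OrderedEmitter<T> {
  private chain: Promise<void> = Promise.resolve();
  private nextIdx = 0;

  constructor(
    private emit: (chunkIdx: number, value: T) => void,
    private abort: AbortFlag,
  ) {}

  // The synth promise is already running when it is pushed;
  // only the emit step waits for its turn in the chain.
  push(synth: Promise<T>): void {
    const idx = this.nextIdx++;
    this.chain = this.chain.then(async () => {
      if (this.abort.aborted) return; // checked at the start of each step
      this.emit(idx, await synth);
    });
  }

  // Resolves once every queued chunk has been emitted or skipped.
  done(): Promise<void> {
    return this.chain;
  }
}
```

If chunk 2 finishes synthesis before chunk 1, its result simply waits in its already-resolved promise until the chain reaches it.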

The chunker preserves leading whitespace on each chunk so paragraph breaks (\n\n) survive into the UI's markdown render. The synth side strips that whitespace before sending to chatterbox; the python normalizer would otherwise turn it into a . glottal stop.

Per-turn `synthId` state

Every voice-bound turn (live reply OR per-message replay) gets a server-generated UUID synthId stamped on every event for that turn. The client keeps voice state in Map<synthId, VoiceTurn>. Each VoiceTurn tracks pending chunks awaiting in-order release, in-flight decode promises, decoded-vs-played counters, and per-turn callback hooks.

This replaced an earlier global-state model whose reset-on-tts_done raced async decodeAudioData calls and dropped the final chunk. The per-turn model means a new turn starting before the previous one's audio finishes can't write into the wrong bubble — each turn captures its target DOM bubble at _onTurnStart (closure) and renders into that one for its lifetime.

Audio-driven text reveal

When tts_start arrives without a replayId, the UI flips voiceModeForTurn = true and suppresses text_delta rendering for that turn. The visible bubble is built chunk-by-chunk as audio plays. A chunk-scoped typewriter paces text reveal to the audio buffer's duration so visible text never trails the voice. onChunkEnd snaps any tail in.

The "Responding…" indicator (three pulsing dots) gets appended at turn start and removed in onPlaybackComplete — which fires when chunksScheduled === chunksPlayed AND tts_done has arrived AND no decodes are in flight. That condition matters: it's distinct from "decoded and queued."

Per-message replay

Every completed assistant message gets a .msg-speaker-btn (line-art SVG, swaps to a hollow square when playing). Click → re-synthesize the bubble's text via the agent's voice pipeline + tuning. Click again or another speaker → stop.

inbound WS: replay { agentId, text, replayId }
   ↓
build VoicePipeline for agent      ← per-agent voice file, tuning resolved fresh
   ↓
StreamChunker.feed(text)
   ↓
tts_start { synthId, replayId }
   ↓
... tts_audio chunks ...
   ↓
tts_done { synthId }

Only one replay plays globally; clicking a different speaker stops the current first. extractMessageTextForVoice walks all .msg-content children (a single bubble can have empty placeholder, text-then-tools-then-text, etc.) and joins with \n\n. Speaker buttons are skipped on bubbles still streaming (they have a .streaming-cursor descendant).

stopCurrentReplay sends replay_stop to the server AND calls MantleVoice.purgeTurn(synthId) for instant local cutoff. Purge filters the playback queue by synthId AND stops the current AudioBufferSource if its tagged _mantleSynthId matches.

Voice-mode prompt: three layers

Voice quality requires three complementary layers that can't substitute for each other:

prompt: VOICE_MODE_PROMPT appended to the dynamic zone when voiceMode: true. Semantic guidance — write spoken English, use [laugh]/[sigh] proactively, defer code and links.
normalizer: voice/normalizer.py, 23-step pipeline ported from rev-nexus. Mechanical Chatterbox-quirk handling — strip double quotes (sigh artifact), fix ellipsis (comma-collapse), validate [bracket] tag whitelist, pad short segments, strip markdown / URLs / emoji.
display strip: ws.ts::stripDisplayTags. Bracket tags like [chuckle] stay in the synth text (chatterbox turns them into real sound effects) but are stripped from the text shipped on tts_audio so the user doesn't see raw [chuckle] in the bubble.

The prompt shapes what only the model decides; the normalizer enforces what only post-processing can guarantee; the display strip keeps the audio cue out of the visible text. All three are required.
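
A sketch of the display-strip layer, assuming a whitelist of just the three tags named in this section. `stripDisplayTagsSketch` is a hypothetical stand-in for ws.ts::stripDisplayTags; the real whitelist and whitespace handling may differ.

```typescript
// Assumed whitelist: only the tags mentioned in this section.
const DISPLAY_TAGS = ["laugh", "sigh", "chuckle"];
const DISPLAY_TAG_PATTERN = new RegExp(`\\[(?:${DISPLAY_TAGS.join("|")})\\] ?`, "g");

// Remove whitelisted bracket tags (and one trailing space) from the text shown
// in the bubble. Unknown brackets and leading whitespace are left untouched.
function stripDisplayTagsSketch(text: string): string {
  return text.replace(DISPLAY_TAG_PATTERN, "").replace(/ +$/, "");
}
```

The synth text keeps the tags, so only the copy attached to the outgoing event goes through this function.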

Per-agent voice tuning

AgentConfig.voice?: AgentVoiceConfig overrides the global config.voice.defaults field-by-field. Three knobs (down from six in the turbo era — chatterbox-streaming's generate_stream API doesn't expose top_k/top_p/repetition_penalty/cfm_timesteps):

temperature: Sampling temperature. Default 0.7. Lower = more consistent, higher = more variance per take.
cfgWeight: Classifier-free guidance — the speaker-anchoring lever turbo dropped. 0.0 = no anchoring (model prior leaks, accent drift); 0.5 = balanced default; 1.0 = strong fidelity to reference clip. Raise if accent drifts on shorter clips.
exaggeration: Baked into the cached T3Cond via prepare_conditionals. 0.0 = flat, 0.5 = default, 1.0 = highly expressive. Costs nothing at synth time — only re-prepares the conditional when the value changes.

Resolution happens fresh inside buildVoicePipeline on each call — saved tuning takes effect on the next reply / replay without restart.
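
Field-by-field override resolution, sketched with the defaults documented above. `VoiceTuning` and `resolveVoiceTuning` are illustrative names rather than the actual AgentVoiceConfig code.

```typescript
interface VoiceTuning {
  temperature: number;
  cfgWeight: number;
  exaggeration: number;
}

// Defaults as documented in the list above.
const VOICE_DEFAULTS: VoiceTuning = { temperature: 0.7, cfgWeight: 0.5, exaggeration: 0.5 };

// Each knob falls back independently; a null override (after "Reset") means all defaults.
function resolveVoiceTuning(
  defaults: VoiceTuning,
  override?: Partial<VoiceTuning> | null,
): VoiceTuning {
  return {
    temperature: override?.temperature ?? defaults.temperature,
    cfgWeight: override?.cfgWeight ?? defaults.cfgWeight,
    exaggeration: override?.exaggeration ?? defaults.exaggeration,
  };
}
```

Using `??` rather than `||` matters here: 0.0 is a meaningful value for cfgWeight and exaggeration and must not fall through to the default.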

The UI exposes this as a gear button next to the voice toggle: equalizer-style sliders, agent-accent fill, modified-vs-default highlight. "Test" calls POST /api/voice/preview without persisting; "Save" PUTs the agent config; "Reset" wipes overrides (voice: null).

Voice files

voices/<agent-id>.wav at project root. Convention: filename = agent id, lowercase. Centralized rather than per-workspace so a future voice library / multi-agent sharing UI fits cleanly. Missing file → toggle stays is-unavailable with a tooltip pointing back to voices/.

STT (Whisper)

/voice/stt/transcribe accepts a raw WAV body and an optional ?language= to skip auto-detect. Returns text + detected language + duration + inference time. Whisper's own vad_filter is disabled on this path — caller is expected to have endpointed via browser-side Silero VAD.

Phase 2 (the hands-free input loop) is the work in progress: VAD → Whisper → user message → memory pack → agent loop → TTS, all without a click. The Whisper endpoint is live; the browser pipeline is being assembled.

Known gaps

No AbortSignal on the python synth fetch — replay stop cancels server-side scheduling but in-flight synths complete (audio dropped before WS send).
No "free VRAM" UI control — toggle-off keeps models warm. Hit POST /api/voice/unload directly if you need the GPU back.
No voice support on Claude Code CLI sessions (CC owns its own loop; no text_delta interception layer to wire the chunker into).
HTTP headers are Latin-1, so X-Normalized-Text is ASCII-stripped; future headers need the same care.

// section 04 :: runtime (4 / 5)

Runtime

Multi-agent, personas, skills, cron, Claude Code CLI mode.

// session :: mantle.runtime

The operational layer. Everything in this section is what makes Mantle useful day-to-day rather than what makes it interesting architecturally — but a couple of these calls (especially CC CLI mode and the mechanical heartbeat allow-list) are load-bearing for keeping the harness small.

Multi-agent

Each agent owns a workspace directory, its own session storage, and its own Engram store. Adding an agent is mechanical:

copy templates/agent-workspace/ → workspaces/<id>/
fill in {{user}} / {{name}} / {{date}} placeholders
add an entry to config.json's agents[]
restart

The harness handles per-agent session storage, Engram isolation, skill scoping, and heartbeat scheduling automatically.

{
  "agents": [
    {
      "id": "sly",
      "name": "Sly",
      "workspace": "./workspaces/sly",
      "engramPath": "~/.rev-mantle/engram-sly",
      "defaultProvider": "grok",
      "accentColor": "#00d4aa",
      "heartbeat": { "enabled": true, "intervalMinutes": 30 }
    }
  ],
  "defaultAgent": "sly"
}

engramPath is optional. Unset → each agent gets ~/.rev-mantle/engram-<id> so memory pools are isolated by construction. Set explicitly to share — e.g. point Sly and Vega at the same path and one agent's memories are recallable by the other. Per-agent provider, model, persona, voice tuning, and heartbeat behavior all flow from the same agents[] entry.
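
The default-path rule is small enough to sketch directly. `AgentEntry`, `resolveEngramPath`, and `sharesStore` are hypothetical names; tilde expansion is left to whatever consumes the path.

```typescript
interface AgentEntry {
  id: string;
  engramPath?: string;
}

// Unset means an isolated per-agent store derived from the agent id.
function resolveEngramPath(agent: AgentEntry): string {
  return agent.engramPath ?? `~/.rev-mantle/engram-${agent.id}`;
}

// Two agents share memory only when their resolved paths match.
function sharesStore(a: AgentEntry, b: AgentEntry): boolean {
  return resolveEngramPath(a) === resolveEngramPath(b);
}
```

Because the default embeds the agent id, two agents can only collide when one of them sets engramPath explicitly.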

The active agent is selectable from the UI sidebar. Switching agents loads their workspace files into the system prompt's stable zone — which invalidates the cache for that agent until the next turn settles.

Personas

Per-agent persona masks (personas.json in the agent's workspace) layer on top of SOUL.md. The persona block lives in its own zone of the system prompt, separately cached from the stable zone — switching personas mid-session only invalidates the persona cache, not the stable cache.

Switching personas mid-session also injects a transition note into the dynamic zone so the agent acknowledges the mask change naturally rather than silently shifting voice. The transition note is per-session: lastMessagePersona lives in the session index, and the note fires only on actual swaps, not on every message.

Persona is selectable per message via the profile bar in the UI. Heartbeat tasks always run with persona stripped — archival judgment shouldn't be biased by voice.

Skills (two-tier)

A SKILL.md system loads named skill packs into the system prompt's stable zone with token budgeting:

Global — ./skills/ at the Mantle root, available to all agents
Per-agent — <workspace>/skills/, available only to that agent

Name conflicts: agent-specific skills override globals. Per-agent disable / enable is configurable via agent.disabledSkills[] / agent.enabledSkills[]; global disable via config.skills.disabled[]. Enable/disable checks run in order and the first match wins: agent-disable → agent-enable → global-disable → on by default.
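
The resolution order reads naturally as a chain of early returns. A minimal sketch, assuming the three config lists are plain string arrays; the SkillConfig type and function names are hypothetical, and skill packs are reduced to name → path maps for illustration.

```typescript
interface SkillConfig {
  agentDisabled: string[];  // agent.disabledSkills[]
  agentEnabled: string[];   // agent.enabledSkills[]
  globalDisabled: string[]; // config.skills.disabled[]
}

// agent-disable → agent-enable → global-disable → on by default
function isSkillEnabled(name: string, cfg: SkillConfig): boolean {
  if (cfg.agentDisabled.includes(name)) return false;
  if (cfg.agentEnabled.includes(name)) return true;
  if (cfg.globalDisabled.includes(name)) return false;
  return true;
}

// Agent-specific packs override globals on name conflict.
function mergeSkills(globals: Map<string, string>, perAgent: Map<string, string>): Map<string, string> {
  const merged = new Map(globals);
  perAgent.forEach((path, name) => merged.set(name, path));
  return merged;
}
```

Note that agent-enable sits above global-disable, so a single agent can opt back into a skill that is switched off globally.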

Skills are application content, not core code. The repository ships only skills/memory-maintenance/ as a canonical SKILL.md format example. Workspace skill packs and operator skill packs live outside the tree.

Cron with Engram hooks

A SQLite-backed scheduled job system (src/cron/) lives alongside heartbeat for programmatic scheduling. Coexists with the heartbeat runner — heartbeat is for archival, cron is for everything else.

at: One-shot. ISO timestamp or relative like "20m". Three retries on transient errors (rate_limit, overloaded, network, timeout, 5xx).
every: Fixed interval, minimum 60s.
cron: Standard 5/6-field cron expression via croner. TZ-aware.
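
For the at: and every: forms, a rough sketch of the parsing and the 60s floor. The text only shows "20m" as a relative form, so the s/m/h/d unit set is an assumption, as are the function names; cron: expressions are left to croner and not reimplemented here.

```typescript
// Assumed unit suffixes; the source only shows "20m".
const UNIT_MS: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
const MIN_EVERY_MS = 60_000; // "minimum 60s"

// Resolve an `at:` spec to an absolute time: relative offset first, ISO timestamp second.
function resolveAt(spec: string, now: Date): Date {
  const rel = /^(\d+)([smhd])$/.exec(spec.trim());
  if (rel) return new Date(now.getTime() + Number(rel[1]) * UNIT_MS[rel[2]]);
  const abs = new Date(spec);
  if (Number.isNaN(abs.getTime())) throw new Error(`unparseable at: ${spec}`);
  return abs;
}

// Reject `every:` intervals below the floor instead of silently clamping.
function validateEvery(intervalMs: number): number {
  if (intervalMs < MIN_EVERY_MS) throw new Error(`interval below ${MIN_EVERY_MS}ms`);
  return intervalMs;
}
```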

Each job picks a session target: isolated (default — fresh session per run), persistent (cron-<jobId> — same session across runs), or session:<id> (fixed existing session).
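
The three session targets map to a session id roughly like this. The SessionTarget type and the freshId callback are illustrative; only the target names and the cron-<jobId> convention come from the text.

```typescript
type SessionTarget = "isolated" | "persistent" | `session:${string}`;

// freshId is a hypothetical generator for new session ids.
function resolveSessionId(target: SessionTarget, jobId: string, freshId: () => string): string {
  if (target === "isolated") return freshId();          // fresh session per run
  if (target === "persistent") return `cron-${jobId}`;  // same session across runs
  return target.slice("session:".length);               // fixed existing session
}
```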

Engram hooks come in three shapes:

Pre-run context enrichment — query Engram for relevant memory before the job's agent loop runs, injected as additional context.
Pre-run conditional skip — query Engram, evaluate the result, skip the run entirely if conditions don't match. Costs zero tokens when skipped.
Post-run outcome storage — store the run outcome as a memory drawer with wing: cron-history, room: <job-name>, and a causal chain via parent_id linking sequential runs.

Backoff: [30s, 60s, 5m, 15m, 1h] on consecutive errors. Auto-disable after 5 consecutive failures (configurable). Run logs are JSONL per job.
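
The backoff ladder and the auto-disable threshold, sketched as one function. JobState and recordFailure are hypothetical names. Whether the fifth failure waits out the 1h slot or disables immediately is not stated above; this sketch disables immediately, so the 1h slot only comes into play when the threshold is configured higher than 5.

```typescript
const BACKOFF_MS = [30_000, 60_000, 5 * 60_000, 15 * 60_000, 60 * 60_000];

interface JobState {
  consecutiveErrors: number;
  enabled: boolean;
}

// Returns the delay before the next attempt, or null when the job was auto-disabled.
function recordFailure(state: JobState, maxFailures = 5): number | null {
  state.consecutiveErrors += 1;
  if (state.consecutiveErrors >= maxFailures) {
    state.enabled = false;
    return null;
  }
  const slot = Math.min(state.consecutiveErrors - 1, BACKOFF_MS.length - 1);
  return BACKOFF_MS[slot];
}

// Any successful run resets the ladder.
function recordSuccess(state: JobState): void {
  state.consecutiveErrors = 0;
}
```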

The agent has a cron_jobs tool with actions list / create / update / delete / run / history / analyze. Guardrails: 20 jobs per agent max, 1 minute minimum interval, agents only manage their own jobs. UI panel sits between Skills and Sessions in the sidebar.

Claude Code CLI mode

Sessions can route through the claude CLI binary instead of the in-process agent loop. CC owns model selection, tool execution, and the loop itself; Mantle stays in the role of context provider.

user message → Mantle UI → Mantle WS handler
   ↓
buildSystemPrompt({ heartbeat: false, voiceMode: false })
   ↓ persona + memory pack + transition note
spawn `claude --append-system-prompt <dynamic context>` ...
   ↓
CC event stream
   ↓
parsed and rendered into the same UI

Useful when the goal is CC's tool surface and agent loop with Mantle's memory injection layered on top. CC's session storage is used for the actual loop state; Mantle persists its own session record alongside.

The trade-off: voice mode doesn't apply to CC sessions (CC owns the loop, no text_delta interception layer to hook the chunker into).

Slash commands

Parsed client-side in app.js. Unrecognized / prefixes pass through as messages.

/think [off|low|medium|high]   adjust extended thinking budget
/reasoning [on|off]            toggle reasoning surfacing
/clear                         clear current session
/new                           start a new session
/compact                       compact current session
/stop                          abort in-flight inference
/model                         show / pick model
/tools                         list available tools
/status                        show agent + provider state
/help                          slash command reference
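
The pass-through rule is the interesting part of the parser: a leading / only counts when the command name is known. A sketch in TypeScript (app.js itself is vanilla JS); the Parsed type and parseInput name are illustrative, and the command set is copied from the list above.

```typescript
const COMMANDS = new Set([
  "think", "reasoning", "clear", "new", "compact", "stop", "model", "tools", "status", "help",
]);

type Parsed =
  | { kind: "command"; name: string; args: string[] }
  | { kind: "message"; text: string };

function parseInput(input: string): Parsed {
  const match = /^\/(\w+)(?:\s+(.*))?$/.exec(input.trim());
  // Unrecognized / prefixes fall through untouched as ordinary messages.
  if (!match || !COMMANDS.has(match[1])) return { kind: "message", text: input };
  const args = match[2] ? match[2].trim().split(/\s+/) : [];
  return { kind: "command", name: match[1], args };
}
```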

Backend support is plumbed: GET /api/tools, POST /api/agents/:id/sessions/:id/compact, WebSocket {type:"stop"} aborts via AbortController.

Attachments

Files dropped into the chat input upload to per-agent attachment storage and surface in the agent's tool registry as agent_attachments — list / read by id / etc. Useful for "look at this image" / "review this doc" turns where the agent needs durable access to the file across multiple tool calls in one turn.

Status

Working: chat, multi-agent, Engram integration, pre-inference memory pack, lazy spawn, heartbeat, cron, personas, Claude Code CLI mode, prompt caching, attachments, persona transitions, mid-stream TTS, per-message replay, per-agent voice tuning, ChatGPT subscription billing.

Not yet implemented: Playwright/browser automation as a first-class tool (the upstream @playwright/mcp server is bridged, but a native first-class tool is on the list), sandboxed tool execution, NEXUS companion-app integration, Phase 2 voice (hands-free VAD + Whisper input loop).

// section 05 :: decisions 5 / 5

Decisions

Load-bearing calls that shaped the harness.

// session :: mantle.decisions

2026-04-28
Pre-inference memory pack over in-loop recall
◇ shipped
Most agent harnesses give the model a recall or search_memory tool and trust it to call when needed. Mantle inverts that: assemble the memory pack before inference, inject into the system prompt's dynamic zone, treat memory as already-on-the-table context rather than a tool the agent has to know to use. Saves a round-trip per turn, makes memory injection an architectural commitment rather than a behavioral nudge. The recall tool is still registered but actively discouraged in operating instructions — the pack already covers the topical retrieval budget.
2026-04-25
Two-pool memory model
◇ shipped
Framed memories scored on all 6 Engram signals; raw transcripts in a physically separate ChromaDB collection with similarity-only retrieval and a separate query surface. The pools never compete because they're served by different tools. Engram's piece-4 scope-boundary invariant — that source chunks cannot enter the memory scoring pool — is enforced structurally here: separate collections, separate query tools, no shared ranking. Two-lens retrieval falls out for free.
2026-04-22
Mechanical tool allow-list for heartbeat tasks
◇ shipped
Each heartbeat task declares tools: [...] and the registry filters dispatch through it. The model literally cannot call anything outside the allow-list for that invocation. Phase 1 (session-ingest) mechanically can't call remember. Phase 2 (session-mine) mechanically can't call engram_ingest_source. The two-pool separation falls out of mechanical enforcement instead of being a discipline the archivist agent has to maintain. Same machinery means per-task provider/model overrides also Just Work.
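
A minimal sketch of what "mechanical" means here: the filter sits in dispatch, not in the prompt. The ToolRegistry class and its method names are hypothetical, and the per-phase allow-lists in the usage note are inferred from the text rather than copied from Mantle's task definitions.

```typescript
type ToolHandler = (args: unknown) => string;

class ToolRegistry {
  private tools = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.tools.set(name, handler);
  }

  // Tool names advertised to the model for one invocation.
  visible(allow: string[]): string[] {
    return allow.filter((name) => this.tools.has(name));
  }

  // Dispatch function for one invocation: anything outside `allow` is unreachable.
  scoped(allow: string[]): (name: string, args: unknown) => string {
    const allowed = new Set(allow);
    return (name, args) => {
      const handler = allowed.has(name) ? this.tools.get(name) : undefined;
      if (!handler) throw new Error(`tool not available: ${name}`);
      return handler(args);
    };
  }
}
```

A session-ingest task would take something like registry.scoped(["engram_ingest_source"]) and a session-mine task registry.scoped(["remember"]); each dispatcher throws on the other phase's tool no matter what the model asks for.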
2026-04-20
Daemon split for Engram
◇ shipped
Earlier versions ran one Engram process per agent, each holding its own copy of the embedder, so memory grew linearly with agent count. The daemon split fixes that: engram_daemon holds the Chroma stores and the Jina embedder, and engram_mcp is the thin per-agent stdio adapter that translates MCP calls to HTTP against the daemon. Mantle does NOT manage the daemon: start it separately and leave it running. Edits to ../rev-engram/src/engram*/*.py show up on the next adapter spawn (no vendoring).
2026-04-18
Lazy spawn with first-call dedup
◇ shipped
Engram processes do not spawn at boot. Mantle loads the tool surface from a cached schema (.mantle/cache/engram-schema.json) so the registry can register engram_* tools without spawning Python. Each agent's adapter stays dormant until that agent's first chat turn, heartbeat tick, cron job, or memory-pack query. Concurrent first-spawns for the same agent are deduped via a single in-flight promise: a burst of 12 parallel batched searches triggers one Python process, not 12.
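
The first-call dedup is the standard in-flight promise pattern. A sketch with a hypothetical LazySpawner class, where the spawn callback stands in for starting the Python adapter; evicting a failed spawn so the next call can retry is an assumption, not something the text specifies.

```typescript
class LazySpawner<T> {
  private inFlight = new Map<string, Promise<T>>();
  private spawn: (agentId: string) => Promise<T>;

  constructor(spawn: (agentId: string) => Promise<T>) {
    this.spawn = spawn;
  }

  // Every caller for the same agent shares one promise, so one process is started.
  get(agentId: string): Promise<T> {
    let pending = this.inFlight.get(agentId);
    if (!pending) {
      pending = this.spawn(agentId).catch((err) => {
        this.inFlight.delete(agentId); // assumed: allow a retry after a failed spawn
        throw err;
      });
      this.inFlight.set(agentId, pending);
    }
    return pending;
  }
}
```

The promise is stored before anything is awaited, which is what makes the burst case safe: the second through twelfth callers find the entry already in the map.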
2026-04-15
Three-zone system prompt
◇ shipped
Stable + persona + dynamic. The ordering is cache discipline, not cosmetics. Anthropic gets four explicit cache_control markers (the API maximum); xAI auto-caches by prefix match. The pre-inference memory pack lives in the dynamic zone, so per-turn variation never invalidates the longer stable + persona prefix. Heartbeat mode strips the persona zone entirely (archival judgment shouldn't be biased by voice): same machinery, just a shorter cached prefix.
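
For the Anthropic side, the zone ordering can be sketched as a list of system text blocks. The block shape (type: "text" with an optional cache_control: { type: "ephemeral" }) follows Anthropic's prompt-caching format; the function and option names are hypothetical. Only the stable and persona markers are shown, since the text does not say where the other two breakpoints sit.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface Zones {
  stable: string;
  persona?: string;
  dynamic: string; // memory pack, transition note, other per-turn context
}

function buildSystemBlocks(zones: Zones, opts: { heartbeat: boolean }): TextBlock[] {
  const blocks: TextBlock[] = [
    { type: "text", text: zones.stable, cache_control: { type: "ephemeral" } },
  ];
  // Heartbeat mode drops the persona zone entirely.
  if (!opts.heartbeat && zones.persona) {
    blocks.push({ type: "text", text: zones.persona, cache_control: { type: "ephemeral" } });
  }
  // The dynamic zone goes last and carries no marker, so it never invalidates the prefix.
  blocks.push({ type: "text", text: zones.dynamic });
  return blocks;
}
```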
2026-04-10
Per-turn synthId voice state
◇ shipped
First voice implementation kept playback state in module-level globals and reset on tts_done. That raced async decodeAudioData calls — the final chunk was decoded after the reset and dropped silently. Replaced with Map<synthId, VoiceTurn> where every voice-bound turn (live or replay) gets a server-generated UUID stamped on every event. Each VoiceTurn tracks pending chunks, in-flight decode promises, decoded-vs-played counters. New turn starting before the previous finishes can't corrupt the previous one's bubble.
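
A stripped-down sketch of the per-turn bookkeeping. The field and method names are hypothetical and the real VoiceTurn tracks more (actual chunks and decode promises rather than counters); the point is that tts_done marks the turn finished but cleanup waits until decode and playback have caught up.

```typescript
interface VoiceTurn {
  pending: number;  // chunks received, decode still in flight
  decoded: number;
  played: number;
  ttsDone: boolean; // server signalled no more chunks for this synthId
}

class VoiceTurns {
  private turns = new Map<string, VoiceTurn>();

  private turn(synthId: string): VoiceTurn {
    let t = this.turns.get(synthId);
    if (!t) {
      t = { pending: 0, decoded: 0, played: 0, ttsDone: false };
      this.turns.set(synthId, t);
    }
    return t;
  }

  onChunk(synthId: string): void { this.turn(synthId).pending += 1; }

  onDecoded(synthId: string): void {
    const t = this.turn(synthId);
    t.pending -= 1;
    t.decoded += 1;
  }

  onPlayed(synthId: string): void {
    this.turn(synthId).played += 1;
    this.sweep(synthId);
  }

  onTtsDone(synthId: string): void {
    this.turn(synthId).ttsDone = true;
    this.sweep(synthId);
  }

  has(synthId: string): boolean { return this.turns.has(synthId); }

  // Drop a turn only when it is done AND nothing is still decoding or unplayed.
  private sweep(synthId: string): void {
    const t = this.turns.get(synthId);
    if (t && t.ttsDone && t.pending === 0 && t.played === t.decoded) this.turns.delete(synthId);
  }
}
```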
2026-04-08
OpenAI Codex via ChatGPT subscription, not API credits
◇ shipped
The openai-codex provider routes through chatgpt.com/backend-api/codex over OAuth (PKCE), reusing OpenAI's published Codex CLI client_id. Billing flows through the user's ChatGPT Plus/Pro/Team/Enterprise quota windows rather than platform.openai.com credits. x-codex-* response headers prove subscription billing without checking the dashboard. Constraints from the codex backend are real: stream: true mandatory, system prompt in the top-level instructions field, strict input shape, no temperature / max_output_tokens / metadata. The provider just doesn't include them.
2026-04-05
Mid-stream sentence chunking with first-chunk floor
◇ shipped
The first-chunk threshold (firstChunkMinChars, default 50) is lower than the threshold for subsequent chunks (minChars, default 60), so opening audio ships sooner than a full prosody chunk would. The 50-char default produces ~4s of opening audio, comfortably longer than worst-case chunk-2 synth (~1.5s at default cfm timesteps), so back-to-back playback lands gap-free. The replay path overrides the first-chunk threshold to 60: the full text is already in hand, so there is no TTFB benefit, only gap risk to avoid.
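
A sketch of a chunker with a lower floor for the first chunk. The option names and defaults come from the text; the class, its methods, and the boundary rule (a . ! or ? followed by whitespace) are assumptions, and the real chunker likely handles abbreviations and other edge cases this one ignores.

```typescript
interface ChunkerOptions {
  firstChunkMinChars: number;
  minChars: number;
}

class SentenceChunker {
  private buffer = "";
  private emitted = 0;
  private opts: ChunkerOptions;

  constructor(opts: ChunkerOptions = { firstChunkMinChars: 50, minChars: 60 }) {
    this.opts = opts;
  }

  // Feed one streamed text delta; returns any chunks that are now ready for TTS.
  push(delta: string): string[] {
    this.buffer += delta;
    const out: string[] = [];
    for (;;) {
      const floor = this.emitted === 0 ? this.opts.firstChunkMinChars : this.opts.minChars;
      const cut = this.findBoundary(floor);
      if (cut === -1) break;
      out.push(this.buffer.slice(0, cut));
      this.buffer = this.buffer.slice(cut).replace(/^\s+/, "");
      this.emitted += 1;
    }
    return out;
  }

  // Emit whatever is left when the stream ends.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }

  // Index just past the first sentence terminator at or beyond `floor` chars, else -1.
  private findBoundary(floor: number): number {
    const terminator = /[.!?]\s/g;
    let match: RegExpExecArray | null;
    while ((match = terminator.exec(this.buffer)) !== null) {
      const end = match.index + 1;
      if (end >= floor) return end;
    }
    return -1;
  }
}
```

For the replay path described above, the same class would be constructed with firstChunkMinChars set to 60.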
2026-04-02
Three runtime npm deps
◇ shipped
Three headline deps: @anthropic-ai/sdk, openai (used for both xAI and Codex backends), and yaml. Two small utilities ride along: chrono-node for date parsing in the memory pack and croner for cron expressions. Everything else is platform: Bun.serve() for HTTP/WebSocket, vanilla HTML/JS/CSS for the UI, bun:sqlite for cron storage. No web framework on either side of the wire; no agent framework anywhere. Each dep is a future maintenance bill, and the loop is small enough not to need a framework.
2026-03-28
No vendored Engram, install editable from sibling repo
◇ shipped
Engram lives in ../rev-engram/. Mantle's .venv/ installs it via pip install -e ../rev-engram rather than vendoring a copy or pulling from PyPI. Edits to ../rev-engram/src/engram*/*.py show up on the next adapter spawn: no rebuild, no version-pin lag during co-development. The schema cache (.mantle/cache/engram-schema.json) makes this safe at boot: even if the editable install changes underneath us, Mantle registers the tool surface from the cached schema and re-probes only when the cache is missing.
