Mantle
A lightweight TypeScript agent harness with provenance-aware long-term memory wired in before inference, mid-stream voice, and three LLM providers, one of them billed through a ChatGPT subscription instead of API credits.
The agent shouldn't have to ask for memory. Memory should already be on the table when it sits down.
Pitch
Mantle is the runtime my agents actually live in. The harness itself is intentionally small — an agent loop, a tool registry, three provider adapters, a vanilla-JS WebSocket UI. Three npm runtime deps. No web framework on either side of the wire.
The complexity that earns its keep lives in the layers wired around the loop. Engram, my long-term memory engine, is loaded before inference: every user turn arrives at the model with a relevant memory pack already injected into the system prompt's dynamic zone. Voice streams mid-reply through a sentence-boundary chunker so the first audio chunk lands while the model is still drafting later ones. Heartbeat runs a per-agent archival pipeline that turns session transcripts into framed memories on its own, in the background, on event-gated intervals. Multi-agent is per-agent isolation by construction — each agent's memory pool is a physically separate ChromaDB store. Three providers, including OpenAI Codex routed through ChatGPT subscription billing instead of API credits.
What's wired in
- Engram integration — companion-mode deployment, 4-type speech-act library, 6-intent retrieval matrix, per-agent isolated stores
- Pre-inference memory pack — ~6 query variants fanned into one batched search, FTS5 fallback, optional temporal retrieval, ambient memories on every turn
- Heartbeat archival pipeline — event-gated session-ingest + session-mine tasks build the two-pool retrieval over time without operator intervention
- Voice — Chatterbox-streaming TTS with mid-reply sentence chunking, audio-driven text reveal, per-message replay, per-agent voice tuning. Whisper STT endpoint live; hands-free input loop in progress.
- Multi-agent — per-agent workspaces, per-agent Engram stores, persona masks, per-agent provider/model/voice/heartbeat overrides
- Cron — SQLite-backed scheduled jobs with Engram pre/post-run hooks for context enrichment and outcome storage
- Claude Code CLI mode — sessions can route through the `claude` binary instead of the in-process loop; Mantle still injects the per-turn dynamic context via `--append-system-prompt`
- Skills — two-tier (global + per-agent), token-budgeted, loaded into the system prompt's stable zone
- Prompt caching — a three-zone system prompt (stable / persona / dynamic) gives both providers the longest possible cache prefix; Anthropic gets explicit `cache_control`, xAI auto-caches by prefix
Reading order
Architecture is the loop and how the providers, prompt zones, and cache breakpoints fit. Memory is the Engram integration — the part worth reading the code for. Voice is the mid-stream pipeline. Runtime is everything operational that doesn't fit elsewhere — multi-agent, personas, skills, cron, CC CLI mode. Decisions is the log of load-bearing calls.
This isn't a thesis project — Mantle exists so my agents have somewhere to live. But the calls that shaped it are deliberate, and the integration with Engram is the part that's actually interesting.
Architecture
One loop, three providers, three-zone prompts, one tool registry.
The loop
The core loop is provider-agnostic by design. Every provider adapter — Claude (Anthropic SDK), Grok (OpenAI SDK pointed at xAI), OpenAI Codex (Responses API on chatgpt.com/backend-api/codex) — emits the same normalized ProviderEvent channel: text_delta, thinking_delta, tool_call_delta, message_end, tool_use. The loop reads that channel, dispatches tool calls through a single registry, persists everything as JSONL transcript, and feeds results back for the next iteration.
```
user message
    ↓
buildSystemPrompt            ← stable + persona + dynamic (memory pack)
    ↓
provider.stream()            ← claude | grok | openai-codex
    ↓
ProviderEvent[]              ← text · thinking · tool_call · message_end
    ↓
registry.dispatch(toolCall)  ← core + MCP-bridged tools
    ↓
loop until end_turn
```

The harness is what's not in there: no web framework, no agent framework, no orchestration library. Three runtime deps. Everything else is platform.
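As a rough sketch of the contract the adapters share (field names are illustrative, not the exact types in the repo):

```ts
// Sketch of the normalized provider event channel every adapter maps its SDK stream into.
// Field names are assumptions for illustration, not the repo's actual definitions.
type ProviderEvent =
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; text: string }
  | { type: "tool_call_delta"; id: string; name?: string; argsDelta: string }
  | { type: "tool_use"; id: string; name: string; args: unknown }
  | { type: "message_end"; stopReason: "end_turn" | "tool_use" | "max_tokens" };

interface Provider {
  // Streams a single assistant turn as normalized events; the loop consumes this
  // identically regardless of which backend produced it.
  stream(
    messages: unknown[],
    opts: { system: string; signal?: AbortSignal },
  ): AsyncIterable<ProviderEvent>;
}
```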
Three providers, one event channel
- Claude provider · default-claude
  Anthropic SDK. Extended thinking gated by `thinkingLevel` (off → 0, low → 4096, medium → 10000, high → 16384). Reads back `cache_read_input_tokens` and `cache_creation_input_tokens` per turn.
- Grok provider · default
  OpenAI SDK with `baseURL` pointed at xAI. Reasoning splits across two architectures: `grok-4.3` honors `reasoning_effort`; the older `grok-4.20-*` and `grok-4-1-fast-*` lineups encode reasoning depth in the model id and reject the param at the API layer. The provider gates this via a `MODELS_WITH_CONFIGURABLE_REASONING` allowlist.
- OpenAI Codex provider · subscription
  Routes through the user's ChatGPT subscription via OAuth (PKCE) instead of API credits. Backend is `chatgpt.com/backend-api/codex`, not `api.openai.com`. Uses the Responses API (not Chat Completions). Stream mandatory, `instructions` field for system prompt, strict input shape. `x-codex-*` response headers prove subscription billing without checking the dashboard.
The interesting one is Codex. ChatGPT Plus / Pro / Team / Enterprise subscriptions include Codex inference quota; Mantle reuses OpenAI's published Codex CLI client_id to ride that auth flow and route requests through the same backend the official CLI does. Token rotation, JWT identity decode, and refresh dedup live in src/auth/openai-codex.ts. The ChatGPT-Account-Id header is required and comes from the JWT's chatgpt_account_id claim.
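A hedged sketch of the per-request plumbing this implies — the header name and JWT claim come from the paragraph above; the helper functions and exact claim nesting are assumptions, not the repo's API:

```ts
// Illustrative only: decode the JWT payload and build the headers the codex backend
// expects. decodeJwtClaims / codexHeaders are hypothetical names for this sketch.
function decodeJwtClaims(token: string): Record<string, string> {
  const payload = token.split(".")[1];
  return JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
}

function codexHeaders(accessToken: string): Record<string, string> {
  return {
    Authorization: `Bearer ${accessToken}`,
    // Required header; claim location per the text above — exact nesting may differ.
    "ChatGPT-Account-Id": decodeJwtClaims(accessToken).chatgpt_account_id,
    "Content-Type": "application/json",
  };
}
```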
System prompt: three zones
buildSystemPrompt (in src/agent/prompt-builder.ts) emits three zones, in order. The ordering is not cosmetic — it's the cache discipline.
- stable zone · cache-friendly
  Base identity, workspace files (`AGENTS.md`, `IDENTITY.md`, `USER.md`, `MEMORY.md`), skills index. Invalidates only when a workspace file is edited or a skill is toggled.
- persona zone · cache-friendly
Active persona profile. Invalidates only on mask swap. Skipped entirely in heartbeat mode (archival judgment shouldn't be biased by voice).
- dynamic zone · per-turn
Persona transition note + timestamp + per-turn memory pack. Not cached — it's expected to change every turn. The whole point of having three zones is that the dynamic zone can change without invalidating the stable + persona prefix.
Anthropic gets four explicit cache_control markers (the API maximum): end of tools array, end of stable, end of persona, last assistant message's last block. xAI auto-caches by prefix match — same zone ordering produces the same cache hit pattern without explicit markers. Codex doesn't expose cache breakpoints (and rejects the param), so its cost profile is whatever the ChatGPT subscription absorbs.
Tool descriptions live in the structured tools API param only, never as prose in the system prompt. Earlier versions duplicated them as Markdown — that was ~12 kB extra per turn for no benefit.
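A minimal sketch of how the zone ordering maps onto Anthropic's block-level cache markers — the `PromptZones` shape and helper are illustrative; breakpoints 1 and 4 (end of the tools array, last assistant block) live outside this helper:

```ts
// Sketch: assemble the system array so stable and persona carry cache_control markers
// and the per-turn dynamic zone trails uncached. Names are assumptions for illustration.
interface PromptZones { stable: string; persona?: string; dynamic: string }

function toAnthropicSystem(z: PromptZones) {
  const blocks: Array<{ type: "text"; text: string; cache_control?: { type: "ephemeral" } }> = [];
  blocks.push({ type: "text", text: z.stable, cache_control: { type: "ephemeral" } });   // end of stable
  if (z.persona) {
    blocks.push({ type: "text", text: z.persona, cache_control: { type: "ephemeral" } }); // end of persona
  }
  blocks.push({ type: "text", text: z.dynamic });  // per-turn memory pack — never cached
  return blocks;
}
```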
Tool registry
A single Registry (src/tools/registry.ts) holds every tool the agent can call. Three sources feed it:
- Core tools in `src/tools/core/` — filesystem, bash, web, memory wrappers, sessions, cron, agent-attachments, research.
- MCP-bridged tools via stdio. Engram is the canonical bridge (~25 tools), plus Brave Search and `@playwright/mcp` for browser automation as upstream MCP servers. `bridgeMcpTools` wires upstream tools into the registry; new tools appear without TypeScript changes.
- Per-task allow-lists — heartbeat tasks declare a `tools:` whitelist, and the registry filters dispatch through it. The model literally cannot call anything outside the allow-list for that invocation. (See memory for the two-phase archival pipeline that uses this.)
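A minimal sketch of the allow-list filter at dispatch time, assuming a registry shaped roughly like this (not the actual class in src/tools/registry.ts):

```ts
// Illustrative registry: per-invocation allow-list filtering is mechanical, not advisory.
type ToolHandler = (args: unknown) => Promise<string>;

class Registry {
  private tools = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler) {
    this.tools.set(name, handler);
  }

  async dispatch(name: string, args: unknown, allowList?: string[]): Promise<string> {
    if (allowList && !allowList.includes(name)) {
      return `tool-not-available: ${name}`;   // outside the whitelist → never executes
    }
    const handler = this.tools.get(name);
    if (!handler) return `tool-not-available: ${name}`;
    return handler(args);
  }
}
```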
Sessions
JSONL files in .mantle/sessions/<agentId>/, one per session, with a per-agent index.json. Each entry is a turn: user message, assistant text, thinking, tool calls, tool results. The same format both the in-process loop and the Claude Code CLI mode write.
Compaction (src/agent/compaction.ts) summarizes long sessions when the turn budget gets tight — produces a compact rolling summary that replaces older turns in context.
Claude Code CLI mode
Sessions can route through the claude CLI binary instead of the in-process loop. CC owns model selection and tool execution; Mantle forwards the per-turn dynamic context (persona, memory pack, transition note) via --append-system-prompt and renders CC's event stream into the same UI. Useful when the goal is CC's tool surface + agent loop with Mantle's memory injection.
The trade-off: voice mode doesn't apply to CC sessions (CC owns the loop, no text_delta interception layer to hook the chunker into).
What's not in the harness
A short list of things that aren't part of Mantle's core, on purpose:
- No web framework. Server is `Bun.serve()` returning HTTP + WebSocket + static. UI is plain HTML/JS/CSS. The simplicity is load-bearing — every dep is a future maintenance bill.
- No agent framework. No LangChain, no LlamaIndex, no autogen. The loop is ~200 lines.
- No orchestrator. Multi-agent is per-agent state, not coordinated execution.
- No vendored Engram. Mantle installs Engram editable from the sibling repo and respawns the MCP adapter on schema changes. Edits to `../rev-engram/src/engram*/*.py` show up next spawn.
- Not yet: sandboxed tool execution, NEXUS companion-app integration, Phase 2 voice (hands-free VAD + Whisper input loop).
Memory
Engram integration. Two pools, six query variants, one pre-injected pack.
Most agent harnesses give the model a `recall` tool and trust it to call when needed. Mantle doesn't.
This is the part of the harness worth reading the code for. Mantle integrates Engram — a 6-signal provenance-aware retrieval engine — and uses it differently from the conventional pattern.
Companion deployment
Engram ships with two deployment shapes: multi-agent (named-agent authority hierarchy, 10-type memory library) and companion (role-based authority, 4-type speech-act library, 6-intent retrieval matrix). Mantle runs Engram in companion mode. The agent's voice and personality come from workspace files (SOUL.md, AGENTS.md, IDENTITY.md); Engram is purely the retrieval and storage layer.
The companion library is built around speech acts the agent might hold about its user:
- want · companion type
  A goal, plan, or thing the user intends to do. Boosted under the `procedural` query intent.
- preference · companion type
  A stable taste, habit, or how-they-like-things pattern. Boosted under the `preference` intent.
- opinion · companion type
  What the user thinks or feels about a thing. Boosted under the `reflection` intent.
- observation · companion type
  Biographical fact or current state. Boosted under the `state_check` and `recall` intents.
The intent × type matrix is what shapes retrieval. A "what's the user planning to do this week?" turn ranks want drawers higher than observation drawers, even when the cosine similarities are equal.
Two-pool memory model
Engram stores two physically separate ChromaDB collections, with separate retrieval surfaces:
- memory pool · `engram_drawers`
  Framed interpretations. "Kyle prefers small focused PRs over bundled ones." Authored from the agent's perspective with type, signature, pin status, confidence. Scored on all 6 Engram signals (similarity × salience × authority × confidence × type-multiplier × scope-penalty).
- source pool · `engram_source_chunks`
  Raw chunked content. Full text of session transcripts, ingested code, ingested docs. Similarity-only retrieval. Separate query surface (`engram_search_source` / `recall_source`).
The pools never compete because they're served by different tools. "What does Kyle think about the auth flow?" hits the memory pool. "What did Kyle literally say last Tuesday about the cron bug?" hits the source pool. Engram's piece-4 scope-boundary invariant — that source chunks cannot enter the memory scoring pool — is enforced structurally here: separate collections, separate query tools, no shared ranking.
The pre-inference memory pack
Most agent harnesses give the model a recall or search_memory tool and trust it to call when needed. Mantle doesn't. On every user turn, before any inference happens, buildMemoryPack (src/agent/memory-pack.ts) runs and assembles a markdown block that gets injected into the system prompt's dynamic zone.
Six stages:
- Query fan-out — multi-source, multi-variant query set built from the user message, the prior assistant text (windowed to 600 chars), and the prior user text.
- Three concurrent calls via `Promise.all`: batched topical search, optional temporal search, ambient sample.
- FTS5 fallback — only if topical returned empty.
- Pack assembly — top-9 topical hits, temporal section, reminiscing tail.
- Empty-pack note — even when nothing relevant came back, the pack still injects "memory was queried and returned nothing — don't search for more" so the agent doesn't waste a turn calling recall.
- Inject into dynamic zone — per-turn variation never invalidates the larger stable + persona cache prefix.
Stage 1 — query fan-out
The user message is decomposed into roughly 6 variants:
- Full text of the user turn
- Per-clause split on `.!?;`, top 3 by length, min 12 chars per clause
- Stripped-filler versions of full + each clause (articles + short prepositions are kept — Jina's phrase embeddings are structure-sensitive)
- Template HyDE reframings — `Kyle wants {topic}` / `Kyle prefers {topic}` / `Kyle thinks {topic}` / `Kyle is {topic}` — fired only when the topic phrase has ≥5 content words, since shorter topics let the framing tokens dominate the embedding
Plus the prior assistant turn (windowed to 600 chars) — critical for "tell me more" / "go ahead" follow-ups where the user message has no topical signal — and the prior user turn for thread continuity.
Stripped variants below 8 chars are dropped. Single-word fragments suffer from Jina word-sense ambiguity and aren't worth the embed call.
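A hedged sketch of the fan-out under the thresholds above — the stop-word list and function names are placeholders for illustration, not the repo's:

```ts
// Illustrative query fan-out: clause split, filler strip, HyDE templates, prior turns.
// Thresholds mirror the prose above; the filler list is a stand-in.
function buildQueryVariants(userTurn: string, priorAssistant: string, priorUser: string): string[] {
  const variants = [userTurn];

  const clauses = userTurn.split(/[.!?;]/)
    .map(c => c.trim())
    .filter(c => c.length >= 12)
    .sort((a, b) => b.length - a.length)
    .slice(0, 3);
  variants.push(...clauses);

  // Filler strip keeps articles and short prepositions (phrase embeddings are
  // structure-sensitive); this stop-list is a placeholder.
  const stripFiller = (s: string) => s.replace(/\b(just|really|basically|kind of)\b/gi, "").trim();
  variants.push(...[userTurn, ...clauses].map(stripFiller).filter(v => v.length >= 8));

  // Template HyDE reframings only when the topic carries enough content words.
  const topic = stripFiller(userTurn);
  if (topic.split(/\s+/).length >= 5) {
    variants.push(`Kyle wants ${topic}`, `Kyle prefers ${topic}`, `Kyle thinks ${topic}`, `Kyle is ${topic}`);
  }

  // Prior turns cover "tell me more" follow-ups with no topical signal of their own.
  variants.push(priorAssistant.slice(-600), priorUser);
  return [...new Set(variants.filter(Boolean))];
}
```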
Stage 2 — three concurrent calls
```
Promise.all([
  engram_search_batch(queryVariants, { min_score: 0.10 }),      // single embed pass
  hasTemporalPhrase ? engram_search_temporal(...) : null,
  engram_sample_drawers({ k: 6, exclude_types: ["observation"] })
])
```

engram_search_batch is one embed pass for N queries — N forward passes' worth of cost collapsed to ~one. engram_sample_drawers skips embedding entirely (just chroma rows + scoring), so the speculative ambient pull is essentially free.
Temporal retrieval fires only when chrono-node parses a date phrase from the user turn. Grain (day / week / month / year) is inferred from the matched phrase; mode is auto-selected — empty residual after stripping the temporal phrase fires session-enum mode, substantive residual fires semantic-within-range. False-positive parses ("fix the May bug") that return zero results are silently suppressed.
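A sketch of that gate using chrono-node's parse API (grain inference omitted; the return shape and names are illustrative):

```ts
import * as chrono from "chrono-node";

// Only fire temporal retrieval when a date phrase actually parses; pick the mode from
// what's left of the turn once the phrase is stripped. Illustrative sketch.
function temporalQuery(userTurn: string) {
  const parsed = chrono.parse(userTurn);
  if (parsed.length === 0) return null;            // no date phrase → skip the temporal call

  const phrase = parsed[0].text;                    // e.g. "last Tuesday", "this week"
  const residual = userTurn.replace(phrase, "").trim();

  return {
    phrase,
    // Empty residual → enumerate sessions in range; substantive residual → semantic search
    // constrained to the range.
    mode: residual.length === 0 ? "session-enum" : "semantic-within-range",
    residual,
  };
}
```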
Stage 3 — FTS5 fallback
Only fires sequentially if topical merge came back empty: engram_search with min_score=0 to surface keyword-heavy hits the embedding floor missed. The fallback lives in Engram's SQLite FTS5 index, not Chroma, so it catches signature-verbatim hits at 100% R@1 on unique phrases.
Stage 4 — pack assembly
```
─── Recalled Memories ──────────────────────
topical    (top 9 by score, with type/wing/room context)
temporal   (when present, with date grain)
ambient    (3 random drawers, voiceful types only)
─── End Recall ─────────────────────────────
```

Each hit renders with type, wing, room, and the framed text. The agent gets memory as background context, not as a tool. Mid-turn recall calls are still possible (the tool is registered) but actively discouraged in operating instructions, since the pre-injected pack already covers the topical retrieval budget for the turn.
MANTLE_DISABLE_MEMORY_TOOLS=1 hides the recall tools from the agent surface entirely while keeping them callable from the pack builder. Used to test whether the pack alone serves the agent's needs.
Per-agent isolation and lazy spawn
Each Mantle agent runs its own Engram MCP adapter process pointed at its own ENGRAM_PATH. Memory pools never cross agents by default — Sly's drawers and Vega's drawers live in physically separate ChromaDB stores; one agent's writes are invisible to the other unless they explicitly share a path.
Spawning is lazy. On startup Mantle does not spin up Python:
- Loads Engram's tool surface from a cached schema (`.mantle/cache/engram-schema.json`) so the registry can register `engram_*` tools without spawning anything.
- If the cache is missing (first boot ever, or user deleted it), probes the default agent's Engram once, captures the schema, writes the cache. That probe stays warm as the default agent's client.
- Other agents stay dormant until they receive their first chat turn, heartbeat tick, cron job, or memory-pack query.
Concurrent first-spawns for the same agent are deduped through a single in-flight promise — a burst of parallel calls (the memory pack fires up to 12 batched searches in one shot) only triggers one Python process.
The Jina embedder is daemon-level, not per-agent, so it loads once across all agents.
Implementation: src/engram/manager.ts.
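A minimal sketch of the dedup pattern, assuming a manager shaped roughly like this (not the actual src/engram/manager.ts):

```ts
// Illustrative first-spawn dedup: concurrent callers for the same agent await one
// in-flight promise, so a burst of parallel queries spawns a single adapter process.
interface EngramClient { call(tool: string, args: unknown): Promise<unknown> }

class EngramManager {
  private clients = new Map<string, EngramClient>();
  private spawning = new Map<string, Promise<EngramClient>>();

  async getClient(agentId: string): Promise<EngramClient> {
    const existing = this.clients.get(agentId);
    if (existing) return existing;

    let inFlight = this.spawning.get(agentId);
    if (!inFlight) {
      inFlight = this.spawnAdapter(agentId).then(client => {
        this.clients.set(agentId, client);
        this.spawning.delete(agentId);
        return client;
      });
      this.spawning.set(agentId, inFlight);
    }
    return inFlight;
  }

  private async spawnAdapter(agentId: string): Promise<EngramClient> {
    // Spawn engram_mcp pointed at this agent's ENGRAM_PATH (omitted in this sketch).
    return {} as EngramClient;
  }
}
```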
Daemon split
Engram runs as two processes: a long-lived daemon (engram_daemon, holds the Chroma stores + Jina embedder) and a thin per-agent MCP adapter (engram_mcp, stdio→HTTP translator) spawned by Mantle.
Mantle does not manage the daemon — start it separately and leave it running. Adapter env: ENGRAM_DAEMON_URL (default 127.0.0.1:49765), ENGRAM_PATH (per-agent store), auth token resolved from ENGRAM_AUTH_TOKEN or ENGRAM_AUTH_FILE (default ~/.engram/auth.json, daemon bootstraps on first run).
Boot flow: Mantle probes daemonUrl/healthz first; if unreachable, Engram is disabled for the session and Mantle runs without memory. No boot stall, no failed spawn, no pages of Python tracebacks.
Heartbeat archival pipeline
This is what builds the two-pool retrieval over time, without operator intervention. A timer-based background runner (src/heartbeat/runner.ts) executes scheduled tasks per agent. Each task runs in its own agent loop with a stripped "archivist" system prompt (no SOUL.md, no persona, neutral judgment) and a per-task tool allow-list.
The default HEARTBEAT.md defines a two-phase pipeline:
- session-ingest · phase 1 · every 30m
  Renders each JSONL session transcript to a sibling `.md` file (one H2 per turn), then ingests it into the source pool under `wing=sessions, room=<session-id>`. Mechanically scoped to `render_session_markdown` and `engram_ingest_source`. Cannot write framed memories.
- session-mine · phase 2 · every 2h
  Walks recent sessions, extracts framed companion memories (wants, preferences, opinions, observations), checks both pools for duplicates before writing into the memory pool. Mechanically scoped to memory-write tools. Cannot ingest sources.
The two-lens retrieval falls out structurally. Phase 1 fills the source pool with raw transcript chunks; Phase 2 fills the memory pool with interpretive framings of the same material. Same source content, different query surfaces, different ranking semantics.
The tool allow-list is mechanical, not advisory. Phase 1 literally cannot call remember. Phase 2 literally cannot call engram_ingest_source. The registry filters dispatch through the per-task whitelist; calls outside it return as tool-not-available rather than executing.
Tasks are event-gated via watch: paths. A task with watch: [sessions, MEMORY.md] only fires when the interval elapsed and at least one watched path changed since the last run. Idle days cost zero tokens.
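A sketch of the gate check under those rules — field names are illustrative and mtime-based change detection is an assumption, not necessarily how the runner tracks changes:

```ts
import { statSync } from "node:fs";

// A task fires only when its interval has elapsed AND at least one watched path
// changed since the last run. Illustrative sketch, not the repo's runner.
interface HeartbeatTask { name: string; intervalMs: number; watch: string[] }

function shouldRun(task: HeartbeatTask, lastRunAt: number, now = Date.now()): boolean {
  if (now - lastRunAt < task.intervalMs) return false;
  return task.watch.some(path => {
    try { return statSync(path).mtimeMs > lastRunAt; }  // changed since last run?
    catch { return false; }                              // missing path never triggers
  });
}
```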
Authoring discipline
The companion library expects framed interpretations, not verbatim logs. The agent is the first interpreter; the retrieval agent is the second. From AGENTS.md-shaped guidance:
"Kyle is planning to check out the new pho place this week. Worth asking how it went."
Not:
User said: "I'm going to try that new pho place".
Memories should carry forward intent, context, and the agent's interpretation. Verbatim quotes belong in the source pool — they're already there, automatically, via session-ingest.
Tool surface
High-level wrappers in src/tools/core/memory.ts: remember, recall, recall_source, memory_status. The raw Engram surface (~25 tools as of v0.7) is bridged automatically through bridgeMcpTools — engram_search, engram_search_batch, engram_search_source, engram_search_temporal, engram_sample_drawers, engram_check_duplicate, engram_build_context_pack, engram_record_retrieval, and the rest. Wrappers gate on registry.has(...) so they degrade gracefully when a tool isn't bridged yet.
engram_research and engram_watch_* aren't wired in the daemon adapter yet — the wrappers handle that absence rather than the agent needing to know which version of Engram is running.
Voice
Mid-stream TTS via Chatterbox, audio-driven text reveal, per-message replay.
What's here
Mid-stream TTS replies via chatterbox-streaming (the generate_stream-supporting fork of the original 0.5B Chatterbox — not the distilled turbo variant), Whisper for transcription, plus on-demand per-message replay and per-agent voice tuning. Phase 1 is shipped — TTS-out is fully integrated. Phase 2 is in progress: hands-free input via browser-side Silero VAD into the Whisper endpoint.
Sidecar
A Python FastAPI service at voice/server.py, port 7333, loopback only. Spawned eagerly by VoiceManager (src/voice/manager.ts) on mantle start — it's a cheap idle process, and the model loads happen lazily via POST /voice/load. Reuses the Engram .venv (pip install -r voice/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124).
```
GET  /health
GET  /voice/status
POST /voice/load              # serialized — see below
POST /voice/unload
POST /voice/tts/synthesize    → audio/wav
POST /voice/stt/transcribe    → { text, language, duration_s, inference_ms }
```

TTS + STT loads are SERIALIZED. Transformers v5's `_LazyModule` is not thread-safe; parallel imports fail with a misleading `cannot import name 'LlamaModel'`. Any new transformers-touching engine has to go through the same sequential chain in `/voice/load`. This bit me hard once.
Mid-stream pipeline
StreamChunker (src/voice/stream-chunker.ts) watches text_delta events from the agent loop. Greedy sentence-merge: emits when the buffer ends in .!? followed by whitespace AND length ≥ minChars.
The first chunk uses a lower threshold (firstChunkMinChars, default 50) for live streaming so the opening sentence ships sooner than a full 60-char chunk would. Subsequent chunks use the standard minChars (60) for prosody. The 50-char default produces ~4s of opening audio, comfortably longer than worst-case chunk-2 synthesis (~1.5s at default cfm timesteps), so back-to-back playback lands gap-free.
```
text_delta stream
    ↓
StreamChunker                ← sentence-merge, minChars threshold
    ↓  (parallel synth, serial emit)
voiceClient.synthesize() ─┐
                          ├─ tts_audio { synthId, chunkIdx, text, audioBase64 }
                          │
Promise chain (idx order) ─┘
    ↓
WebSocket → browser
```

Each chunk fires voiceClient.synthesize() immediately (parallel), but emit-to-UI chains through a Promise so events reach the WS in idx order. The chain checks abortSignal at the start of each step, so replays cancel mid-stream cleanly.
The chunker preserves leading whitespace on each chunk so paragraph breaks (\n\n) survive into the UI's markdown render. The synth side strips that whitespace before sending to chatterbox; the python normalizer would otherwise turn it into a . glottal stop.
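A minimal sketch of the greedy sentence-merge rule with the two-tier floor — illustrative, not the actual src/voice/stream-chunker.ts:

```ts
// Buffer text_delta; emit when the buffer ends in .!? followed by whitespace and the
// accumulated length meets the floor. The first chunk uses a lower floor for latency.
class StreamChunker {
  private buffer = "";
  private emitted = 0;

  constructor(
    private onChunk: (text: string, idx: number) => void,
    private minChars = 60,
    private firstChunkMinChars = 50,
  ) {}

  feed(delta: string) {
    this.buffer += delta;
    const floor = this.emitted === 0 ? this.firstChunkMinChars : this.minChars;
    if (/[.!?]\s+$/.test(this.buffer) && this.buffer.length >= floor) {
      this.onChunk(this.buffer, this.emitted++);
      this.buffer = "";
    }
  }

  flush() {
    if (this.buffer.trim()) this.onChunk(this.buffer, this.emitted++);
    this.buffer = "";
  }
}
```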
Per-turn synthId state
Every voice-bound turn (live reply OR per-message replay) gets a server-generated UUID synthId stamped on every event for that turn. The client keeps voice state in Map<synthId, VoiceTurn>. Each VoiceTurn tracks pending chunks awaiting in-order release, in-flight decode promises, decoded-vs-played counters, and per-turn callback hooks.
This replaced an earlier global-state model whose reset-on-tts_done raced async decodeAudioData calls and dropped the final chunk. The per-turn model means a new turn starting before the previous one's audio finishes can't write into the wrong bubble — each turn captures its target DOM bubble at _onTurnStart (closure) and renders into that one for its lifetime.
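A sketch of that per-turn state, including the playback-completion condition the next subsection describes (field names are illustrative):

```ts
// Illustrative per-turn voice state keyed by synthId, browser side.
interface VoiceTurn {
  synthId: string;
  bubble: HTMLElement;                    // captured at _onTurnStart, never re-resolved
  pending: Map<number, AudioBuffer>;      // decoded chunks awaiting in-order release
  decoding: Set<Promise<AudioBuffer>>;    // in-flight decodeAudioData calls
  chunksScheduled: number;
  chunksPlayed: number;
  done: boolean;                          // tts_done received
}

const turns = new Map<string, VoiceTurn>();

// Playback is complete only when everything scheduled has played, tts_done has arrived,
// and no decodes are still in flight — distinct from "decoded and queued".
function playbackComplete(t: VoiceTurn): boolean {
  return t.done && t.decoding.size === 0 && t.chunksScheduled === t.chunksPlayed;
}
```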
Audio-driven text reveal
When tts_start arrives without a replayId, the UI flips voiceModeForTurn = true and suppresses text_delta rendering for that turn. The visible bubble is built chunk-by-chunk as audio plays. A chunk-scoped typewriter paces text reveal to the audio buffer's duration so visible text never trails the voice. onChunkEnd snaps any tail in.
The "Responding…" indicator (three pulsing dots) gets appended at turn start and removed in onPlaybackComplete — which fires when chunksScheduled === chunksPlayed AND tts_done has arrived AND no decodes are in flight. That condition matters: it's distinct from "decoded and queued."
Per-message replay
Every completed assistant message gets a .msg-speaker-btn (line-art SVG, swaps to a hollow square when playing). Click → re-synthesize the bubble's text via the agent's voice pipeline + tuning. Click again or another speaker → stop.
```
inbound WS: replay { agentId, text, replayId }
    ↓
build VoicePipeline for agent   ← per-agent voice file, tuning resolved fresh
    ↓
StreamChunker.feed(text)
    ↓
tts_start { synthId, replayId }
    ↓
... tts_audio chunks ...
    ↓
tts_done { synthId }
```

Only one replay plays globally; clicking a different speaker stops the current one first. extractMessageTextForVoice walks all .msg-content children (a single bubble can have an empty placeholder, text-then-tools-then-text, etc.) and joins with \n\n. Speaker buttons are skipped on bubbles still streaming (they have a .streaming-cursor descendant).
stopCurrentReplay sends replay_stop to the server AND calls MantleVoice.purgeTurn(synthId) for instant local cutoff. Purge filters the playback queue by synthId AND stops the current AudioBufferSource if its tagged _mantleSynthId matches.
Voice-mode prompt: three layers
Voice quality requires three complementary layers that can't substitute for each other:
- prompt · model side
  `VOICE_MODE_PROMPT` appended to the dynamic zone when `voiceMode: true`. Semantic guidance — write spoken English, use `[laugh]` / `[sigh]` proactively, defer code and links.
- normalizer · post-model
  `voice/normalizer.py`, 23-step pipeline ported from rev-nexus. Mechanical Chatterbox-quirk handling — strip double quotes (sigh artifact), fix ellipsis (comma-collapse), validate `[bracket]` tag whitelist, pad short segments, strip markdown / URLs / emoji.
- display strip · ui side
  `ws.ts::stripDisplayTags`. Bracket tags like `[chuckle]` stay in the synth text (chatterbox turns them into real sound effects) but are stripped from the `text` shipped on `tts_audio` so the user doesn't see raw `[chuckle]` in the bubble.
The prompt shapes what only the model decides; the normalizer enforces what only post-processing can guarantee; the display strip keeps the audio cue out of the visible text. All three are required.
Per-agent voice tuning
AgentConfig.voice?: AgentVoiceConfig overrides the global config.voice.defaults field-by-field. Three knobs (down from six in the turbo era — chatterbox-streaming's generate_stream API doesn't expose top_k/top_p/repetition_penalty/cfm_timesteps):
- temperature · 0.5–0.9 typical
  Sampling temperature. Default 0.7. Lower = more consistent, higher = more variance per take.
- cfgWeight · speaker anchoring
  Classifier-free guidance — the speaker-anchoring lever turbo dropped. 0.0 = no anchoring (model prior leaks, accent drift); 0.5 = balanced default; 1.0 = strong fidelity to reference clip. Raise if accent drifts on shorter clips.
- exaggeration · emotion intensity
  Baked into the cached `T3Cond` via `prepare_conditionals`. 0.0 = flat, 0.5 = default, 1.0 = highly expressive. Costs nothing at synth time — only re-prepares the conditional when the value changes.
Resolution happens fresh inside buildVoicePipeline on each call — saved tuning takes effect on the next reply / replay without restart.
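A minimal sketch of that field-by-field resolution, with defaults taken from the knob descriptions above (names are illustrative):

```ts
// Agent overrides win per field; global defaults fill the rest. Resolved fresh on
// every pipeline build, so saved tuning applies on the next reply without a restart.
interface VoiceTuning { temperature: number; cfgWeight: number; exaggeration: number }

const GLOBAL_DEFAULTS: VoiceTuning = { temperature: 0.7, cfgWeight: 0.5, exaggeration: 0.5 };

function resolveTuning(agentOverrides?: Partial<VoiceTuning> | null): VoiceTuning {
  return { ...GLOBAL_DEFAULTS, ...(agentOverrides ?? {}) };  // voice: null → pure defaults
}
```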
The UI exposes this as a gear button next to the voice toggle: equalizer-style sliders, agent-accent fill, modified-vs-default highlight. "Test" calls POST /api/voice/preview without persisting; "Save" PUTs the agent config; "Reset" wipes overrides (voice: null).
Voice files
voices/<agent-id>.wav at project root. Convention: filename = agent id, lowercase. Centralized rather than per-workspace so a future voice library / multi-agent sharing UI fits cleanly. Missing file → toggle stays is-unavailable with a tooltip pointing back to voices/.
STT (Whisper)
/voice/stt/transcribe accepts a raw WAV body and an optional ?language= to skip auto-detect. Returns text + detected language + duration + inference time. Whisper's own vad_filter is disabled on this path — caller is expected to have endpointed via browser-side Silero VAD.
Phase 2 (the hands-free input loop) is the work in progress: VAD → Whisper → user message → memory pack → agent loop → TTS, all without a click. The Whisper endpoint is live; the browser pipeline is being assembled.
Known gaps
- No `AbortSignal` on the python synth fetch — replay stop cancels server-side scheduling but in-flight synths complete (audio dropped before WS send).
- No "free VRAM" UI control — toggle-off keeps models warm. Hit `POST /api/voice/unload` directly if you need the GPU back.
- No voice support on Claude Code CLI sessions (CC owns its own loop; no `text_delta` interception layer to wire the chunker into).
- HTTP headers are Latin-1; `X-Normalized-Text` is ASCII-stripped, future headers need the same care.
Runtime
Multi-agent, personas, skills, cron, Claude Code CLI mode.
The operational layer. Everything in this section is what makes Mantle useful day-to-day rather than what makes it interesting architecturally — but a couple of these calls (especially CC CLI mode and the mechanical heartbeat allow-list) are load-bearing for keeping the harness small.
Multi-agent
Each agent owns a workspace directory, its own session storage, and its own Engram store. Adding an agent is mechanical:
```
copy templates/agent-workspace/ → workspaces/<id>/
fill in {{user}} / {{name}} / {{date}} placeholders
add an entry to config.json's agents[]
restart
```

The harness handles per-agent session storage, Engram isolation, skill scoping, and heartbeat scheduling automatically.
```json
{
  "agents": [
    {
      "id": "sly",
      "name": "Sly",
      "workspace": "./workspaces/sly",
      "engramPath": "~/.rev-mantle/engram-sly",
      "defaultProvider": "grok",
      "accentColor": "#00d4aa",
      "heartbeat": { "enabled": true, "intervalMinutes": 30 }
    }
  ],
  "defaultAgent": "sly"
}
```

`engramPath` is optional. Unset → each agent gets `~/.rev-mantle/engram-<id>` so memory pools are isolated by construction. Set explicitly to share — e.g. point Sly and Vega at the same path and one agent's memories are recallable by the other. Per-agent provider, model, persona, voice tuning, and heartbeat behavior all flow from the same `agents[]` entry.
The active agent is selectable from the UI sidebar. Switching agents loads their workspace files into the system prompt's stable zone — which invalidates the cache for that agent until the next turn settles.
Personas
Per-agent persona masks (personas.json in the agent's workspace) layer on top of SOUL.md. The persona block lives in its own zone of the system prompt, separately cached from the stable zone — switching personas mid-session only invalidates the persona cache, not the stable cache.
Switching personas mid-session also injects a transition note into the dynamic zone so the agent acknowledges the mask change naturally rather than silently shifting voice. The transition note is per-session: lastMessagePersona lives in the session index, and the note fires only on actual swaps, not on every message.
Persona is selectable per message via the profile bar in the UI. Heartbeat tasks always run with persona stripped — archival judgment shouldn't be biased by voice.
Skills (two-tier)
A SKILL.md system loads named skill packs into the system prompt's stable zone with token budgeting:
- Global — `./skills/` at the Mantle root, available to all agents
- Per-agent — `<workspace>/skills/`, available only to that agent
Resolution: agent-specific skills override globals on name conflict. Per-agent disable / enable is configurable via agent.disabledSkills[] / agent.enabledSkills[]. Global disable via config.skills.disabled[]. Resolution order: agent-disable → agent-enable → global-disable → on by default.
Skills are application content, not core code. The repository ships only skills/memory-maintenance/ as a canonical SKILL.md format example. Workspace skill packs and operator skill packs live outside the tree.
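A minimal sketch of the resolution order — config field names here are illustrative stand-ins for the agent.disabledSkills[] / agent.enabledSkills[] / config.skills.disabled[] settings above:

```ts
// Agent-disable → agent-enable → global-disable → on by default.
interface SkillConfig {
  agentDisabled?: string[];
  agentEnabled?: string[];
  globalDisabled?: string[];
}

function isSkillEnabled(name: string, cfg: SkillConfig): boolean {
  if (cfg.agentDisabled?.includes(name)) return false;  // strongest: explicit per-agent off
  if (cfg.agentEnabled?.includes(name)) return true;     // per-agent on overrides global off
  if (cfg.globalDisabled?.includes(name)) return false;
  return true;                                            // on by default
}
```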
Cron with Engram hooks
A SQLite-backed scheduled job system (src/cron/) lives alongside heartbeat for programmatic scheduling. Coexists with the heartbeat runner — heartbeat is for archival, cron is for everything else.
- at schedule type
One-shot. ISO timestamp or relative like "20m". Three retries on transient errors (rate_limit, overloaded, network, timeout, 5xx).
- every schedule type
Fixed interval, minimum 60s.
- cron schedule type
Standard 5/6-field cron expression via croner. TZ-aware.
Each job picks a session target: isolated (default — fresh session per run), persistent (cron-<jobId> — same session across runs), or session:<id> (fixed existing session).
Engram hooks come in three shapes:
- Pre-run context enrichment — query Engram for relevant memory before the job's agent loop runs, injected as additional context.
- Pre-run conditional skip — query Engram, evaluate the result, skip the run entirely if conditions don't match. Costs zero tokens when skipped.
- Post-run outcome storage — store the run outcome as a memory drawer with `wing: cron-history`, `room: <job-name>`, and a causal chain via `parent_id` linking sequential runs.
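A hedged sketch of the two pre-run hook shapes — the hook fields and wrapper functions are hypothetical, not the cron module's actual API:

```ts
// Illustrative pre-run hooks: enrich the job's context with recalled memory, or skip the
// run entirely (zero tokens) when the memory check comes back empty.
interface CronJobHooks {
  preRunQuery?: string;   // Engram query fired before the job's agent loop
  skipIfEmpty?: boolean;  // conditional skip when the query returns nothing
}

async function runJobWithHooks(job: { prompt: string; hooks?: CronJobHooks }) {
  let context = "";
  if (job.hooks?.preRunQuery) {
    const hits = await engramSearch(job.hooks.preRunQuery);
    if (hits.length === 0 && job.hooks.skipIfEmpty) return { skipped: true };
    context = hits.map(h => `- ${h.text}`).join("\n");
  }
  return runAgentLoop(`${job.prompt}\n\nRelevant memory:\n${context}`);
}

// Hypothetical wrappers standing in for the real Engram / agent-loop calls.
declare function engramSearch(q: string): Promise<Array<{ text: string }>>;
declare function runAgentLoop(prompt: string): Promise<unknown>;
```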
Backoff: [30s, 60s, 5m, 15m, 1h] on consecutive errors. Auto-disable after 5 consecutive failures (configurable). Run logs are JSONL per job.
The agent has a cron_jobs tool with actions list / create / update / delete / run / history / analyze. Guardrails: 20 jobs per agent max, 1 minute minimum interval, agents only manage their own jobs. UI panel sits between Skills and Sessions in the sidebar.
Claude Code CLI mode
Sessions can route through the claude CLI binary instead of the in-process agent loop. CC owns model selection, tool execution, and the loop itself; Mantle stays in the role of context provider.
```
user message → Mantle UI → Mantle WS handler
    ↓
buildSystemPrompt({ heartbeat: false, voiceMode: false })
    ↓  persona + memory pack + transition note
spawn `claude --append-system-prompt <dynamic context>` ...
    ↓
CC event stream
    ↓
parsed and rendered into the same UI
```

Useful when the goal is CC's tool surface and agent loop with Mantle's memory injection layered on top. CC's session storage is used for the actual loop state; Mantle persists its own session record alongside.
The trade-off: voice mode doesn't apply to CC sessions (CC owns the loop, no text_delta interception layer to hook the chunker into).
Slash commands
Parsed client-side in app.js. Unrecognized / prefixes pass through as messages.
```
/think [off|low|medium|high]   adjust extended thinking budget
/reasoning [on|off]            toggle reasoning surfacing
/clear                         clear current session
/new                           start a new session
/compact                       compact current session
/stop                          abort in-flight inference
/model                         show / pick model
/tools                         list available tools
/status                        show agent + provider state
/help                          slash command reference
```

Backend support is plumbed: GET /api/tools, POST /api/agents/:id/sessions/:id/compact, WebSocket {type:"stop"} aborts via AbortController.
Attachments
Files dropped into the chat input upload to per-agent attachment storage and surface in the agent's tool registry as agent_attachments — list / read by id / etc. Useful for "look at this image" / "review this doc" turns where the agent needs durable access to the file across multiple tool calls in one turn.
Status
Working: chat, multi-agent, Engram integration, pre-inference memory pack, lazy spawn, heartbeat, cron, personas, Claude Code CLI mode, prompt caching, attachments, persona transitions, mid-stream TTS, per-message replay, per-agent voice tuning, ChatGPT subscription billing.
Not yet implemented: Playwright/browser automation as a first-class tool (the upstream @playwright/mcp server is bridged, but a native first-class tool is on the list), sandboxed tool execution, NEXUS companion-app integration, Phase 2 voice (hands-free VAD + Whisper input loop).
Decisions
Load-bearing calls that shaped the harness.
Pre-inference memory pack over in-loop recall
◇ shipped
Most agent harnesses give the model a `recall` or `search_memory` tool and trust it to call when needed. Mantle inverts that: assemble the memory pack before inference, inject it into the system prompt's dynamic zone, treat memory as already-on-the-table context rather than a tool the agent has to know to use. Saves a round-trip per turn, makes memory injection an architectural commitment rather than a behavioral nudge. The recall tool is still registered but actively discouraged in operating instructions — the pack already covers the topical retrieval budget.

Two-pool memory model
◇ shipped
Framed memories scored on all 6 Engram signals; raw transcripts in a physically separate ChromaDB collection with similarity-only retrieval and a separate query surface. The pools never compete because they're served by different tools. Engram's piece-4 scope-boundary invariant — that source chunks cannot enter the memory scoring pool — is enforced structurally here: separate collections, separate query tools, no shared ranking. Two-lens retrieval falls out for free.

Mechanical tool allow-list for heartbeat tasks
◇ shipped
Each heartbeat task declares `tools: [...]` and the registry filters dispatch through it. The model literally cannot call anything outside the allow-list for that invocation. Phase 1 (session-ingest) mechanically can't call `remember`. Phase 2 (session-mine) mechanically can't call `engram_ingest_source`. The two-pool separation falls out of mechanical enforcement instead of being a discipline the archivist agent has to maintain. Same machinery means per-task provider/model overrides also Just Work.

Daemon split for Engram
◇ shipped
Earlier versions ran one Engram process per agent, with each one holding its own copy of the embedder. Memory blew up linearly in agent count. Pivoted to the daemon split: `engram_daemon` holds the Chroma stores and the Jina embedder, `engram_mcp` is the thin per-agent stdio adapter that translates MCP calls to HTTP against the daemon. Mantle does NOT manage the daemon — start it separately, leave it running. Edits to `../rev-engram/src/engram*/*.py` show up next adapter spawn (no vendoring).

Lazy spawn with first-call dedup
◇ shipped
Engram processes do not spawn at boot. Mantle loads the tool surface from a cached schema (`.mantle/cache/engram-schema.json`) so the registry can register `engram_*` tools without spawning Python. Other agents stay dormant until first chat turn / heartbeat tick / cron job / memory-pack query. Concurrent first-spawns for the same agent are deduped via a single in-flight promise — a burst of 12 parallel batched searches triggers one Python process, not 12.

Three-zone system prompt
◇ shipped
Stable + persona + dynamic. Ordering is the cache discipline, not cosmetic. Anthropic gets four explicit `cache_control` markers (the API maximum); xAI auto-caches by prefix match. The pre-inference memory pack lives in the dynamic zone, so per-turn variation never invalidates the longer stable + persona prefix. Heartbeat mode strips the persona zone entirely (archival judgment shouldn't be biased by voice) — same machinery, just less in stable.

Per-turn synthId voice state
◇ shipped
First voice implementation kept playback state in module-level globals and reset on `tts_done`. That raced async `decodeAudioData` calls — the final chunk was decoded after the reset and dropped silently. Replaced with `Map<synthId, VoiceTurn>` where every voice-bound turn (live or replay) gets a server-generated UUID stamped on every event. Each `VoiceTurn` tracks pending chunks, in-flight decode promises, decoded-vs-played counters. A new turn starting before the previous finishes can't corrupt the previous one's bubble.

OpenAI Codex via ChatGPT subscription, not API credits
◇ shipped
The `openai-codex` provider routes through `chatgpt.com/backend-api/codex` over OAuth (PKCE), reusing OpenAI's published Codex CLI `client_id`. Billing flows through the user's ChatGPT Plus/Pro/Team/Enterprise quota windows rather than `platform.openai.com` credits. `x-codex-*` response headers prove subscription billing without checking the dashboard. Constraints from the codex backend are real: `stream: true` mandatory, system prompt in the top-level `instructions` field, strict input shape, no `temperature` / `max_output_tokens` / `metadata`. The provider just doesn't include them.

Mid-stream sentence chunking with first-chunk floor
◇ shipped
First-chunk threshold (`firstChunkMinChars`, default 50) lower than subsequent (`minChars`, default 60) so opening audio ships sooner than a full prosody chunk would. The 50-char default produces ~4s of opening audio, comfortably longer than worst-case chunk-2 synth (~1.5s at default cfm timesteps), so back-to-back playback lands gap-free. Replay path overrides first-chunk to 60 — full text is already in hand, so no TTFB benefit, just gap risk to avoid.

Three runtime npm deps
◇ shipped
`@anthropic-ai/sdk`, `openai` (used for both xAI and Codex backends), `yaml`. Plus `chrono-node` for date parsing in the memory pack and `croner` for cron expressions. Everything else is platform — `Bun.serve()` for HTTP/WebSocket, vanilla HTML/JS/CSS for the UI, `bun:sqlite` for cron storage. No web framework on either side of the wire; no agent framework anywhere. Each dep is a future maintenance bill, and the loop is small enough not to need them.

No vendored Engram, install editable from sibling repo
◇ shipped
Engram lives in `../rev-engram/`. Mantle's `.venv/` installs it via `pip install -e ../rev-engram` rather than vendoring a copy or pulling from PyPI. Edits to `../rev-engram/src/engram*/*.py` show up next adapter spawn — no rebuild, no version-pin lag during co-development. The schema cache (`.mantle/cache/engram-schema.json`) makes this safe at boot: even if the editable install changes underneath us, Mantle registers the tool surface from the cached schema and re-probes only when the cache is missing.