Mantle
A TypeScript agent harness with a deliberately small core — memory wired in before inference, a model-backend catalog behind one event channel, multi-agent channels and two-way voice — under a feature shell that bolts on and tears off without the core importing a line of it.
[latest2026-06-07]Rebuilding the face — Svelte 5 ui-next replaces a 10k-line vanilla UI.
The agent shouldn't have to ask for memory. Memory should already be on the table when it sits down.
Pitch
Mantle is the runtime my agents actually live in — a companion-agent hub built around one loop strong enough to drive all of it: chat, autonomous cron and heartbeat jobs, multi-agent rooms, voice. The loop is the engine; everything heavier — memory, voice, music, channels — hangs off it without the loop having to know it's there, which is mostly what keeps a harness this broad from collapsing into a tangle.
The features all lean that way. Engram, my long-term memory engine, is loaded before inference, so every turn arrives with a relevant memory pack already on the table. Channels put several agents and you in one shared conversation. Voice runs both directions, with full-duplex realtime calls as a mode of their own. And one event channel normalizes eight model backends — Claude, ChatGPT, Grok, and a fully local llama.cpp across API keys, subscription logins, and CLI subprocesses — so nothing downstream knows which one produced a token.
It can stay light, too. Run it on a single API key and it's just the loop and a chat UI; the heavy parts — a local memory daemon, on-device models, local voice synthesis — are all opt-in, there when you want them and gone when you don't.
The sixty-second version
The whole page in eight lines — each one links to its long form.
- A harness, not a framework — architecture: one loop that drives every kind of turn, a feature shell that hangs off it without coupling, and a three-zone prompt built for caching. Music is the throwaway exhibit — a whole feature that clips out without the loop noticing.
- One turn, up to 100 iterations — agent loop: build the prompt → stream → dispatch tools → persist → loop, behind one normalized event channel and the guards that keep a long turn honest.
- Every way to get tokens — backends: an 8-cell
(vendor × mode)catalog — API keys, subscription OAuth, fully local, and CLI subprocesses — where adding a backend is one file and one row. - Memory already on the table — memory: Engram assembles a relevant pack and injects it before the model speaks, instead of handing it a
recalltool; per-agent stores, kept current by a heartbeat. - A room full of agents — channels: several companions and you in one shared transcript, with @-mention turn-taking, agent-to-agent riffs, and private asides — a bolt-on room that rips out clean.
- The agent edits the harness — systems deck: a Cursor-style surface where the assistant proposes diffs to your skills, cron, and heartbeat — through the same loop everything else runs — and nothing hits disk until you accept.
- Voice, both directions — voice: mid-stream TTS and hands-free mic input, plus a separate full-duplex call mode proxied to Grok Voice.
- A strong wall, not a sandbox — security: an argon2 login wall over a hardened tool surface, and an honest line about where the hardening stops.
This isn't a thesis project — Mantle exists so my agents have somewhere to live. But the calls that shaped it are deliberate, and the way memory and the loop fit together is the part worth the read.
Architecture
One loop that drives every kind of turn, with the heavy features hanging off it — plus a three-zone prompt built for caching.
The point of the harness is everything the loop doesn't have to know about.
Mantle's core is small and does a lot: one agent loop, a tool registry, a model-backend catalog, and the prompt builder that feeds them. That same loop drives every kind of turn the harness runs, and everything with real surface area — memory, voice, channels, music, the whole UI — hangs off it as a feature shell. The core stays small not because the harness does little, but because the loop never has to carry any of what's bolted around it.
One loop, every kind of turn
A chat message, a cron job, a heartbeat tick, a subagent, a channel sub-turn, an async result delivered back later — they all enter through the same function, runTriggeredAgentTurn (src/agent/triggered-turn.ts). It resolves the backend, builds the system prompt, filters the tool surface, and pins the ToolContext — and the pin is the load-bearing part. agentId, sessionId, and workspacePath are fixed inside the front door and cannot be overridden by the caller, so per-agent isolation is a property of the entry point, not something every caller reassembles. The tool surface is filtered the same way each time — an allow-list or a deny filter, and then the agent's own disabled-tools list runs last over whatever survives.
What the front door deliberately doesn't own is the agent lock (whose scope is a per-trigger decision) and session bookkeeping — those stay with each caller, which is why a chat turn and a heartbeat tick differ only in how much the caller wires up first. One loop, driven six ways, is what lets a harness this broad stay a single readable engine instead of six forked ones.
The feature shell
The heavy features attach as rooms. A room (src/rooms/types.ts) is just an object with an id and an onAgentPurge hook to drop its per-agent state when an agent is deleted; a REST prefix, a set of tools, a footprint reporter, and start/stop are all optional. Music and channels are the two today, collected at boot by a thin registry. The loop never imports them — a room that needs to broadcast over the WebSocket gets that function injected at construction rather than reaching up the stack for it.
None of that is exotic; it's ordinary layering, the kind any codebase should want. An import-direction check (scripts/check-arch.ts) keeps the core honest about it in CI, so it can't quietly grow a dependency on the shell — and the payoff is mundane and real: tearing a feature out is deleting its folder and the line that registers it. The next exhibit makes that literal.
The system prompt: three zones
buildSystemPrompt (src/agent/prompt-builder.ts) assembles every turn's prompt from three zones in a fixed order — and the order is the cache discipline, not cosmetics.
stable and persona are the long, cacheable prefix; dynamic is the per-turn tail that carries the recalled memory pack and a fresh timestamp, so it changes every turn without invalidating the prefix above it. A few of the choices inside are load-bearing:
- Environment early. Just after a one-line identity header, the stable zone injects an environment block — working directory, platform, shell, git branch — so the agent reads where am I before the operating rules. The branch comes from reading
.git/HEADand walking up to the repo root, not agitsubprocess, which keeps the build sub-millisecond even though it runs every turn. - Standing skills inline, the rest by reference. Skills flagged
alwaysget their full body in the stable zone; the others are a one-line-per-skill index in the dynamic zone, so editing anySKILL.mdre-renders cheaply instead of busting the stable cache. - Tool descriptions never appear as prose — only in the structured
toolsparameter, never duplicated into the prompt.
The ordering is shared across backends; the cache mechanism is not — Claude gets four explicit cache_control breakpoints (the API maximum), Grok auto-caches by prefix, Codex caches server-side, and local reuses the llama.cpp KV cache. Same zones, exploited four ways; that's the Backends story. Workspace files aren't all-or-nothing either: each ## section of AGENTS.md / IDENTITY.md / SOUL.md / USER.md can be switched off per agent (the state is monotonic — only the off entries persist, so a new heading defaults on). It's prompt surgery without editing the underlying files, and the surface for it lives in the systems deck.
What's not in the harness
- No agent framework. The loop imports nothing from the shell around it; tools arrive through an injected
executeToolCall. The same loop drives chat, heartbeat, cron, a subagent, and a channel turn — all that differs is the caller. - No web framework on the server. It's
Bun.serve()returning HTTP, a WebSocket, and static files. The UI is a framework — Svelte 5 — but that's the one place a framework earns its keep; the wire between them stays plain. - No vendored Engram. Mantle installs its memory engine editable from the sibling repo, so edits to its Python show up on the next adapter spawn.
- Not yet: sandboxed tool execution, and the Nexus companion-app integration.
The face
The browser UI is a standalone Svelte 5 (runes) + Vite app — not SvelteKit, no SSR — served as a static bundle by the same Bun process. It's a mounted SPA in four tiers — a core runtime of $state modules, a shared component kit, the core views, and fifteen bolt-on rooms that mirror the backend's shape. The one piece that refuses the framework is the streaming hot path: token text is written to the DOM imperatively by a manual reveal clock, and Svelte is kept away from that node on purpose — reconciliation would lose on a token-by-token workload. Everywhere else, the framework pays for itself.
Agent loop
One turn, up to 100 iterations — a normalized event channel, bounded-parallel tools, and the guards that keep a long loop honest.
The harness is mostly platform; the loop is the engine. runAgentLoop (src/agent/loop.ts) is where a user turn becomes model output, tool calls, and the next turn's context. It's compact, and most of it is judgment, not plumbing.
The cycle
A turn runs up to 100 iterations. Each iteration loads the transcript, builds the three-zone system prompt (see Architecture), streams from the active backend, and collects the response. If the model stopped at end_turn or max_tokens — or called no tools — the turn is done. Otherwise the loop dispatches the tool calls, persists their results as a user message (the Anthropic convention), and goes back around with the results in context.
One normalized event channel
The provider contract is deliberately tiny:
interface Provider {
name: string;
stream(params): AsyncIterable<ProviderEvent>;
}ProviderEvent is just eight variants — text and thinking deltas, the tool-call start/delta/end trio, message_end, and error. That thinness is the whole reason a new backend is cheap to add: an adapter only has to normalize its wire format into those eight events. (How each of the backends does that is the Backends story.)
The loop doesn't pass that stream straight through. It emits a richer thirteen-variant stream to the UI:
Six events forward as-is. The two tool-argument deltas get accumulated and parsed into a single tool_call_input. And six are synthesized by the loop — the tool-execution lifecycle (tool_call_executing, tool_call_progress, tool_call_result, agent_attachment), plus blank_response and note_delivered. They're synthesized because the provider never sees them: tools run in the harness, and a steered-in note arrives from the user, not the model.
The provider streams the model's words. The loop owns everything that happens when the model reaches for a tool.
Tool dispatch within a step
When a turn calls several tools, they run in parallel — wall-clock becomes max(tools) instead of the sum — but parallelism is bounded. A model that emits twenty bash calls in one turn shouldn't spawn twenty processes at once, so dispatch goes through per-category FIFO semaphores:
Results are reassembled into the original tool_use order before they go back to the model (Anthropic requires it), but each tool's tool_call_result is emitted to the UI the instant it finishes — so a fast read visibly completes while a slow bash still spins. The semaphore slot is taken before tool_call_executing fires, so the "Ns" timer in the UI measures real execution, not time spent queued behind the cap.
One more within-turn optimization: a read-cache. Re-reading the same file or re-running the same glob with identical arguments inside one turn doesn't re-ship the content — it returns a short stub pointing back at the iteration that already has it in context. Write tools invalidate overlapping paths, and a bash call (which can mutate anything) clears the cache outright, so a read after an edit always runs fresh.
Keeping a long loop honest
A 100-iteration ceiling stays safe because the loop has cheaper ways to stop early:
- Loop-detector — a 30-call rolling window fingerprinted by
(tool, args, result). The same call with the same args three times appends an inline hint to the result so the model self-corrects; five times aborts the turn cleanly. It also catches a tool failing in a streak, and read/edit/read/edit ping-pong that never converges. - Idle watchdog — if no stream event arrives for 90 seconds, the backend request is aborted. It's composed with the user's
/stopand a 10-minute turn deadline viaAbortSignal.any, so all three unwind through the same plumbing, and it resets on every event (extended thinking keeps it alive). - Blank-response retry — Grok occasionally finishes with no content and no tool calls. The loop silently re-streams once, without burning an iteration, before surfacing a retry affordance.
- Truncation — any single tool result over ~24K characters is cut to an 80/20 head-and-tail with an omission marker, so one chatty tool can't crowd the context mid-loop.
None of these are proofs. The loop-detector can't always tell a stuck loop from legitimate repetition, so it nudges before it aborts; truncation can clip a tail that mattered. They're pragmatic floors that make a 100-iteration ceiling safe to offer — and a finally block flushes the session index on the way out, so even an aborted turn leaves the transcript consistent for /retry.
Second loops
The same loop runs more than chat turns. Subagents (spawn_agent — depth ≤ 2, up to four per parent) and background tasks each fire their own runAgentLoop, and when they finish their result re-enters the parent conversation as a fresh synthetic turn. The delivery is durable: results land in a per-agent JSONL outbox and are appended idempotently on the next lock release — so a result is never lost because the parent was mid-turn when it finished, and never doubled if a retry fires. Every one of these, like chat and the scheduled turns, enters through the single front door described in Architecture; what comes back is a typed TurnOutcome (its stop cause, token usage, last text), so callers branch on a struct instead of re-scanning the transcript.
Backends
Every way to reach a model — an 8-cell (vendor × mode) catalog spanning API keys, subscription logins, CLI subprocesses, and a local sidecar, all normalized to one event stream.
The agent loop is backend-agnostic: it speaks one normalized ProviderEvent channel and doesn't care what produced it (see Agent loop). So the layer beneath it is a catalog — a single array in providers/catalog.ts that enumerates every way Mantle can reach a model as a (vendor × access-mode) cell. Eight cells today, and nothing is constructed at boot: a cell is built on demand, per turn, by the one resolver every dispatch site calls.
Two integration shapes hide behind that uniform catalog. Most cells are in-process providers — they implement a single method, stream(), returning the ProviderEvent channel, and let runAgentLoop drive the tools and the iteration. Two are CLI subprocesses — they hand the entire agentic turn to a real vendor binary and translate its output back. The resolver knows the difference and refuses to route a CLI cell as a provider; they're two execution worlds, kept separate on purpose. Either way the output lands on the same AgentStreamEvent stream, so the UI never has to care which one ran.
| backend | reaches | how it runs |
|---|---|---|
anthropic/api | Claude, API key | in-process · loop drives |
anthropic/cli | Claude Code binary | subprocess · owns the turn |
openai/api | ChatGPT, API key | in-process · loop drives |
openai/subscription | ChatGPT, Codex login | in-process · loop drives |
xai/api | Grok, API key | in-process · loop drives |
xai/subscription | Grok Build login | in-process · loop drives |
xai/cli | Grok Build binary | subprocess · owns the turn |
local | llama.cpp, your GPU | in-process · loop drives |
Adding a cell is genuinely cheap: a new API provider is one file and one row, because the shared streaming layer already handles the wire shape — the OpenAI adapter is sixty lines, the Grok one sixty-four. google is reserved in the vendor list with no cell behind it; that's the whole cost of leaving the door open for Gemini.
API providers — key plus SDK
The easy path. An API provider is a thin adapter that normalizes one vendor SDK's stream into ProviderEvents:
anthropic/api— the Anthropic SDK. Extended thinking maps to a token budget perthinkingLevel, and it places the fourcache_controlbreakpoints described in Architecture.openai/api— the OpenAI SDK againstapi.openai.com, Chat Completions. (It has to ask for usage explicitly; OpenAI streams no token counts otherwise.)xai/api— the same OpenAI SDK pointed atapi.x.ai. Reasoning effort is gated by an allowlist, because only some Grok models accept thereasoning_effortparam — the rest bake reasoning depth into the model id and reject the field. xAI auto-caches by prefix.
Because the shape is just "SDK in, ProviderEvent out," adding another API model — gemini, deepseek, whatever — is a single adapter file. Boot is tolerant: a missing backend isn't fatal, only a configuration with no usable backend at all.
Subscription, not API credits
API keys bill per token. The next two cells bill against a flat subscription instead, by riding the OAuth a coding-CLI product already uses — in-process, over a token, no subprocess.
openai/subscriptionrides a ChatGPT subscription. Mantle runs its own PKCE login (mantle auth login) againstauth.openai.comwith the Codex CLI's published client id, stores the tokens at.mantle/auth/openai-codex.json(mode0600), and calls the Responses API atchatgpt.com/backend-api/codex— notapi.openai.com. Concurrent turns share one in-flight refresh, so the rotating refresh token is never spent twice.xai/subscription(Grok Build) rides an X / SuperGrok subscription, but skips its own login entirely: it reuses the credentials Grok's own CLI already wrote at~/.grok/auth.json— you're already signed in, so there's nothing to set up. It refreshes into Mantle's own store and uses whichever of the two holds the fresher access token, since these refresh tokens rotate single-use. The backend is the Responses API atcli-chat-proxy.grok.com.
The two differ only in how they get the token — codex logs in itself, Grok Build piggybacks on a login you already have — but once the token is in hand, both are ordinary in-process providers.
Subprocess delegation
Sometimes you don't want Mantle's loop at all — you want the real CLI's full agent: its tools, its orchestration, its model selection. In that mode Mantle stops being the loop and becomes a context provider, injecting only the per-turn dynamic block (persona + memory pack) and rendering the CLI's event stream into the same UI.
anthropic/cli(Claude Code) —claude --print --output-format stream-json --resume <id> --append-system-prompt <dynamic> --mcp-config <engram>. Claude Code owns the model and tools; because its stream carries structured tool events, its tool calls render live in the UI. Mantle captures the session id on the first turn and--resumes it after.xai/cli(Grok Build CLI) —grok -p --output-format streaming-json --session-id <id> --rules <dynamic> --yolo. Grok's headless stream surfaces only text and thinking — tools run silently inside its own loop — so the UI shows its thoughts and final answer. Mantle mints the session id up front, since Grok's-sflag is create-or-resume.
This is where Grok Build is unusual: it's the one source that runs both ways — as a token provider above (xai/subscription), and as a CLI subprocess here. It takes the Codex idea and the Claude Code idea and offers both.
Local — a purpose-built sidecar
local runs models entirely on your machine. The provider is almost the xai/api adapter pointed at localhost: it talks to a llama.cpp llama-server over that server's OpenAI-compatible /v1 API.
What makes it more than "point at a port" is the sidecar lifecycle. A LocalModelManager spawns llama-server, polls its /health, forwards its logs, and kills it on shutdown — the same pattern Mantle uses for the voice sidecar. llama-server serves one model per process, so switching models is kill-and-respawn; it's lazy (nothing loads until a local model is used, and an idle model auto-unloads to hand the GPU back), and a cold load — multiple gigabytes — feeds keep-alive ticks to the loop's watchdog so a long load doesn't trip the 90-second idle abort. (Not Ollama — Mantle drives llama.cpp directly.)
Models live in a registry. mantle pull <hf-link> fetches a GGUF — picking the right quant and reassembling multi-part split files — and records it in local/registry.json with its per-model spawn settings:
- tool mode off · core · custom · all
Many small local models can't reliably function-call, and the full ~76-tool surface overflows llama.cpp's tool-call grammar besides. So a model declares what it gets: nothing, a curated 14-tool core set (the default), an explicit allow-list, or everything.
- reasoning inline think
Reasoning models (DeepSeek-R1, Qwen3, QwQ) emit chain-of-thought in the content stream; the provider splits inline
<think>…</think>out into the thinking channel.- spawn settings per model
Context size, GPU layers, thread count, KV-cache type, sampling, and an optional chat-template override — applied as
llama-serverflags on load.--jinjaactivates the model's own template so tool calls work.
The model browser
Pulling a model by hand means already knowing the repo and the right quant. The in-app model browser removes that step: it searches HuggingFace's GGUF repos directly — sorted and faceted, cursor-paginated straight off the HTTP Link header — and renders a master-detail view from the parsed GGUF header (exact parameter count, trained context length, architecture, chat template) beside the rendered model card. Split quant sets (model-00001-of-00003.gguf) collapse into one entry with a summed size; projector shards are filtered out.
What earns the "GPU-aware" label is fit estimation. The weight sizes are exact, but a model's KV-cache footprint can't be read from the HuggingFace API, so it's modeled with a GQA-calibrated power-law — fit against known points (Qwen2.5-7B ≈ 56 KB/token, Llama-3-70B ≈ 320 KB/token). The browser reads total VRAM from nvidia-smi, classifies each quant as fits / tight / partial-offload at a working context, and recommends the largest that comfortably fits. It's a calibrated heuristic, not a VRAM profiler — it calls fits-versus-tight reliably, and errs conservative. On install the same math runs backward: it solves for the largest context window that fits live free VRAM, caps it at the model's trained length, and steps the KV cache from f16 down to q8_0 to buy more room.
Whatever the shape — SDK, subscription token, CLI binary, or local sidecar — the output is the same ProviderEvent (or, for the subprocess modes, the same AgentStreamEvent). Supporting a new source is choosing a shape and writing one adapter.
Memory
How Mantle uses Engram — three surfaces, a pack assembled before inference, and per-agent isolation. The part worth reading the code for.
Most agent harnesses give the model a
recalltool and trust it to call when needed. Mantle doesn't.
Mantle integrates Engram — a provenance-aware retrieval engine — and uses it differently from the conventional pattern. The scoring math and retrieval theory live on the Engram page; this section is about how Mantle uses it.
Three surfaces
The agent reaches memory through three surfaces, in order of immediacy:
- MEMORY.md working memory
A small, curated file in the agent's workspace, always present in the system prompt's stable zone (see Architecture). It's the always-on scratchpad — current goals, recent state. The heartbeat migrates stale entries out to Engram so it stays short.
- memory pool engram_drawers
Framed interpretations — "the user prefers small, focused PRs." Scored on all of Engram's signals, and surfaced automatically through the pre-inference pack below. The agent doesn't fetch these; they arrive.
- source pool engram_source_chunks
Raw chunked content — full transcripts and ingested docs. Similarity-only, on a separate query surface (
engram_search_source/recall_source). What the agent reaches for when it needs the verbatim, not the interpretation.
The two Engram pools never compete because different tools serve them. "What does the user think about the auth flow?" hits the memory pool; "what did the user literally say last Tuesday?" hits the source pool. Engram enforces that scope boundary structurally — source chunks can't enter the memory scoring pool. (Mantle runs Engram in its companion deployment: a four-type speech-act library — want, preference, opinion, observation — with role-based authority. The type/intent matrix that shapes ranking is Engram's; see its page.)
The pre-inference memory pack
Most harnesses register a recall tool and trust the model to call it. Mantle inverts that: on every user turn, before any inference, buildMemoryPack (src/agent/memory-pack.ts) assembles a block of relevant memories and injects it into the system prompt's dynamic zone. Memory is already on the table when the model sits down.
The fan-out is the part that earns its keep. A conversational turn dilutes its own topic across filler, so the user message — plus the prior assistant and user turns — is decomposed into many query variants: full text, a stopword-stripped version (articles and short prepositions are kept — the embedder's phrase representations are structure-sensitive), per-clause splits, and four memory-shaped HyDE reframings keyed to the user's name (wants… / prefers… / thinks… / is…), fired only when the topic carries ≥5 content words so the framing tokens don't dominate. All of them ride one batched embed pass.
Those queries fan out into a single wave — four concurrent Engram calls plus a keyword fallback — and the results assemble into the pack in a fixed priority order, the whole thing raced against a wall-clock budget so a slow store degrades to "no pack" rather than stalling the turn.
What the pack pulls
The pack isn't one query. It's a few kinds of recall with different jobs, assembled into three sections in a deliberate priority order.
- topical primary · every turn
The main event: a batched semantic search across every query variant — top-9 above a 0.10 score floor — run alongside a second navigated search that tags each hit with its currency, so the pack can tell the latest state of an evolving memory from a superseded earlier version. Together they lead the pack as Recalled; everything else is supporting cast.
- fallback safety net · only if empty
A keyword FTS5 search at score 0, run only when topical comes back empty. The embedding floor can miss exact-phrase hits — a name, an error string, a signature term — that keyword matching catches. It merges into Recalled, so a literal match never returns silence just because it scored below the semantic floor.
- temporal conditional · on date language
When
chrono-nodeparses a date phrase from the turn, a date-window query runs alongside topical. Two modes: a pure recap ("what did I do Tuesday") enumerates that window's sessions; a date-plus-topic query ranks memories semantically within the range. False-positive parses — a stray "May" that returns nothing — are dropped silently. It lands as its own Temporal section.- reminiscing ambient · every turn
A random sample of non-observation drawers, pulled with no embedding, so it's nearly free. It isn't about the current topic — it's shared-history color the agent can raise proactively ("that reminds me…"). Three normally; six when topical found nothing, so even an off-topic turn keeps some texture. It closes the pack as Reminiscing.
The priority is deliberate: relevance first, then time-anchored recall, then ambient history — and when relevance comes up empty, the pack leans harder on reminiscing instead of going blank.
Two details make the pack read as context rather than a tool result:
- It tells the model to use the memories as background — let them inform the reply, don't quote them back, and silently ignore any that turn out off-topic.
- When retrieval finds nothing at all, the pack still injects a note saying memory was queried and came back empty — don't search for more — so the agent doesn't burn a turn calling
recall.
Recalled text is treated as untrusted on the way in: newlines are collapsed before it's rendered, so a stored memory can't forge a markdown heading that impersonates the pack's own section structure. The recall tool stays registered, but operating instructions actively discourage mid-turn calls — the pack already spent the turn's retrieval budget.
The pack costs tokens and a retrieval round-trip on every turn, and semantic recall will sometimes surface a memory that doesn't fit. The "use as background, silently ignore what's off-topic" framing holds the cost in place — the bet is that memory already on the table beats a tool the model has to remember to call.
Per-agent isolation and lazy spawn
Each agent runs its own Engram MCP adapter pointed at its own ENGRAM_PATH, so memory pools never cross agents — one agent's drawers are physically invisible to another unless they're pointed at the same store on purpose.
Spawning is lazy. At boot, Mantle registers the engram_* tools from a cached schema (.mantle/cache/engram-schema.json) without starting any Python — and re-probes only if the daemon's tool surface has changed. An agent's adapter spawns on its first actual memory call — chat turn, heartbeat tick, cron job, or pack query — and concurrent first-calls dedupe through a single in-flight promise, so the pack firing a dozen batched searches at once still triggers one process. (src/engram/manager.ts.)
Daemon split
Engram runs as two processes: a long-lived engram_daemon that holds the Chroma stores and the embedder, and the thin per-agent engram_mcp adapter Mantle spawns — a stdio-to-HTTP translator holding neither. Mantle does not manage the daemon; you start it separately and leave it running. At boot Mantle probes daemonUrl/healthz (default 127.0.0.1:49765) with a short timeout, and if it's unreachable, memory is simply disabled for the session — no boot stall, no failed spawns, no pages of Python tracebacks. This daemon is the one genuinely heavy thing behind an otherwise light core — a separate Python service with a vector store and an embedder. The loop stays light; rich memory doesn't come free.
Heartbeat builds the pools
The two pools fill over time without anyone tending them, via the heartbeat archival pipeline (the runner mechanics are in the systems deck). It runs as two heartbeat tasks — session-ingest and session-mine — each its own agent loop with a stripped archivist prompt and a mechanical tool allow-list:
- session-ingest renders each session transcript to markdown and ingests it into the source pool. It's scoped to the ingest tools — it cannot write framed memories.
- session-mine walks recent sessions and extracts framed companion memories into the memory pool. It's scoped to the memory-write tools — it cannot ingest raw sources.
The two-lens retrieval falls out structurally: same source material, two pools, two query surfaces. The separation isn't a discipline the archivist has to maintain — the registry enforces it by filtering each phase's tool calls against its allow-list.
Authoring discipline
The companion library wants framed interpretations, not verbatim logs. The agent is the first interpreter:
"The user is planning to try the new pho place this week — worth asking how it went."
Not:
User said: "I'm going to try that new pho place".
A memory carries forward intent and context, not a quote. The verbatim version is already in the source pool, captured automatically by the ingest task.
Tool surface
High-level wrappers in src/tools/core/memory.ts sit over the raw Engram surface, which bridgeEngramTools registers — one registry entry per Engram tool, each routing to the calling agent's own adapter at call time. The agent's visible retrieval surface is recall and its variants plus memory_status; remember and recall_source are registered but hidden from chat, so an agent doesn't author or delete its own memories mid-conversation — that's the heartbeat's job. The wrappers gate on registry.has(...), so when a given Engram build doesn't expose a tool, they degrade quietly instead of the agent needing to know which version is running.
Channels
Multi-agent rooms — several companions and you in one shared transcript, with @-mention turn-taking, agent-to-agent riffs, and private asides. A bolt-on room that rips out clean.
A channel is a group room: you and any number of named agent companions in one shared, author-tagged transcript, taking turns. It's the multi-agent surface — say Wren and a second agent in the same conversation, talking to you and to each other — and it's a room in the architectural sense: a folder under src/rooms/channel/ and a thin branch in the WebSocket handler. Delete the folder and the feature is gone; the core loop never imported it (see Architecture for the contract that guarantees that).
The interesting problem a channel has to solve is that a model provider wants a strict user-then-assistant alternation, and a shared room is the opposite — many authors, interleaved.
One transcript, a view per agent
The transcript is stored once, as JSONL, with every row stamped with its author. When it's a given agent's turn to speak, the controller doesn't hand the model the raw transcript — it projects it. That agent's own assistant rows replay as assistant; every other row — whether it came from you or from another agent — becomes a user message prefixed with the speaker's name, so the model can tell who said what. The same transcript, projected differently for each agent, is what lets one room hold a real multi-party conversation.
The load-bearing detail is the collapse. After projection, any run of consecutive rows that mapped to the same role is merged into one message — because two adjacent user messages (you, then another agent) would make the provider reject the request outright. That invariant is guarded by a fuzz test that runs the projection across two hundred random speaker interleavings and asserts the alternation never breaks. A context window walks back from the newest row, so per-turn cost stays flat as a channel ages.
The stamping has to happen the moment each row is written, not after the turn. A channel agent runs the full loop, tool calls and all, and every intermediate row (a recall result, a mid-thought) has to carry its author from birth — stamp only the final reply and the projection mis-attributes the tool rows to someone else, so the agent never sees its own recall come back. A ChannelSessionManager stamps the author onto every row as it's persisted; it's a subclass, so the core session manager stays untouched.
Who speaks, and when
Each message you send builds a speaker queue, in tiers: agents with their live mic on (auto-respond), in roster order; then anyone you @-mentioned who isn't already queued; and if neither, the agent who spoke last. Mentions are checked against the actual roster, so a hallucinated handle is simply ignored. Speakers run strictly in sequence — each agent's whole sub-turn finishes and persists before the next agent's projection reads the transcript — so speaker N sees everything speakers 1…N-1 just said, with no extra synchronization.
Past the opening queue, an optional volley lets the agents riff with each other:
In free style an agent hands the floor on by @-mentioning another; a ping-pong guard vetoes an A→B→A→B loop and falls back to the rotation. In round-robin the floor just rotates. Either way the volley is capped — twelve agent turns, then the floor returns to you — and any agent can bow out early with channel_yield, a pseudo-tool intercepted in the controller that never reaches the registry. When every live-mic agent has yielded, the riff ends. A "Jump in" button preempts the whole thing: it aborts the current sub-turn and skips the rest of the queue at once.
That last part rides the lock. Channel sub-turns hold the per-agent lock at rank 3 — above autonomous work (heartbeat, cron, background) but below a direct 1:1 message at rank 4 — so a companion mid-riff yields the instant you DM it, and the preemption signal propagates to the whole volley, not just the agent currently holding the lock.
Private asides
You can mark a message as a whisper to a subset of the room. The scope rides on the row; the projection filters whisper-scoped rows out of every non-member's view entirely (they see a hole, and the collapse merges around it); and every row an agent writes while answering an aside inherits the same scope. The volley and the rotation stay inside the whispered set, so a side conversation can never pull in someone who wasn't part of it. Whisper to agents who aren't in the channel and the server refuses and tells you — quietly sending it to the whole room is the one wrong answer.
What an agent can do in a room
A channel is a hangout, not a workbench, and the tool surface says so: in a room an agent gets recall and Engram search, web_fetch and web search — no bash, no filesystem, no spawning. The prompt tells it as much: look a fact up if you need to and drop the answer conversationally, don't paste a wall of results. Each speaker still gets its own unmodified persona and, optionally, a per-speaker memory pack — it's that agent in the room, not a generic one.
The systems deck
A Cursor-style surface where you and your agent manage the harness — skills, cron, heartbeat — proposed as file diffs through the same loop everything else runs, with nothing written until you accept.
Mantle's agents run on configuration that's just files and rows: skills are SKILL.md packs, the heartbeat is a HEARTBEAT.md, cron jobs are records in a small database. The systems deck is the full-page surface for all of it — tabs for skills, tools, cron, and the heartbeat — and its trick is that you don't edit alone. A chat dock sits alongside whatever you have open, and the agent in it can propose the edits for you, Cursor-style: as a diff you accept or reject hunk by hunk, with nothing touching disk until you say so.
The agent edits the harness
Ask the dock to "add a skill that summarizes my open PRs" or "make the memory-maintenance task run hourly," and what runs is not a special code path — it's a real agent turn through the same front door as chat, cron, and everything else (see Architecture), against a hidden, persistent session that survives refreshes. What's different is its output. The turn is given two classes of pseudo-tool, and which one it reaches for decides what you see:
- file edits propose_edit → diff
For anything that's a file — a skill, the heartbeat, a cron job's spec — the agent calls
propose_editwith the full revised text. Nothing is written. The new text is diffed against what's open (a dependency-free LCS line diff) and rendered as an inline, Cursor-style review: accept or reject each hunk, cherry-pick some and drop others, or take it all. The buffer is written only once every hunk has a verdict.- structured changes confirm card
For a structured mutation — create a cron job, disable a skill, set the heartbeat interval — the agent calls a scoped tool that stages a confirm card instead of acting. You approve it, and only then does the real registry tool run. File edits are never eligible for auto-approval; they always come back as a diff.
There's a trust dial for that second class: you can grant standing approval for a specific action on a specific agent — Wren may create cron jobs — and that kind stops prompting. The instructions the agent follows on each deck page aren't baked into the binary, either; they're builder skills loaded live from disk and injected into the prompt's dynamic zone, so the assistant's behavior on the cron page versus the skills page is itself an editable file.
The boundary that makes this safe to hand an agent is simple and visible: nothing hits disk until you accept it. A proposal is staged server-side; an approval is the only thing that calls a write. And the skill writer validates frontmatter the same way the loader does, so the agent can't produce a skill that discovery would silently skip over.
What you're configuring: the autonomous tier
The deck earns its place because Mantle agents do things on their own, and you need somewhere to see and shape it. Two systems run without anyone typing.
Heartbeat
A per-agent timer reads tasks from HEARTBEAT.md and runs the due ones. The load-bearing part is the event gate: a task fires only when its interval has elapsed and something it watches has actually changed — a sentinel like sessions resolves to the session index, so a "mine new transcripts" task wakes after new conversations exist, not on a bare clock. Idle days cost zero tokens. Each task runs its own loop with a stripped archivist prompt (no persona, no soul — archival judgment shouldn't be inflected by voice) and a mechanical tool allow-list, and it yields the moment you start chatting. This is the machinery behind the two-pool memory archival in Memory.
Cron
Cron is the programmatic side: SQLite-backed jobs on a schedule — one-shot, fixed-interval, or a cron expression — that run any agent turn. Three things lift it above a task runner:
- It delivers where it should. A finished run can stay silent, raise a toast, post into your most recent chat session as a synthetic message, or — the default — let the run itself decide, via a
cron_reportverdict it files as its last action. That verdict also rides into the next run's prompt as[Previous run: …], so a recurring job has continuity across invocations. - It can pace itself. A run can call
cron_snoozeto push its next fire out, clamped to a sane range; a one-shot that snoozes re-arms instead of completing. - It's memory-aware. Optional Engram hooks let a job skip unless it has enough relevant memory to act on, prepend recalled context to its prompt, and store each run's outcome chained to the last — a causal history of the job, in memory.
Crash recovery and a missed-job catch-up (capped and staggered, so a cold start doesn't stampede the API) round it out — and the containment matches how a job was made: one an agent schedules itself mid-turn inherits that turn's tool allow-list and can't widen its own surface, while one you approve here in the dock runs as your change, gated by the confirm card rather than an allow-list.
Voice
Two-way voice — mid-stream TTS over two backends, hands-free mic input, and a separate full-duplex call mode with no agent loop behind it.
Voice is two independent subsystems. Speaking turns the model's reply into audio while it is still being written. Listening turns the user's speech into an ordinary chat message. They run separately and never reach into each other — except for one rule: while the agent is speaking, the mic goes quiet, so the agent's own voice can't trip the listener. A live back-and-forth is a third thing entirely, with its own machinery and no agent loop behind it — that's Calls, at the end.
Speaking
The reply doesn't wait for the model to finish. A StreamChunker (src/voice/stream-chunker.ts) watches the text_delta stream and flushes complete sentences the moment they land. Boundaries are sentence-final .!? followed by whitespace — commas and semicolons are deliberately not boundaries, because the neural TTS lays out prosody per sentence and splitting on clauses produces stilted, list-y delivery. The opening sentence ships on a lower character floor (50 vs 60) because it dominates time-to-first-audio; 50 characters is ~4 seconds of speech, comfortably longer than the next chunk's synth time, so playback runs gap-free.
Two backends, one interface
Each chunk goes to the agent's selected TTS backend, and buildVoicePipeline hides which one behind a single feed / flushAndWait interface, so neither the agent loop nor the per-message replay path knows or cares what's synthesizing.
- chatterbox-streaming local · GPU
A Python sidecar (
voice/server.py, FastAPI, loopback port 7333) spawned at startup as a cheap idle process; the models lazy-load only when voice mode is first toggled. Voice is a cloned reference.wav, so each agent can sound like a specific recorded voice. It synthesizes in sub-chunks streamed back as NDJSON, so audio starts arriving before the sentence is finished.- xAI hosted TTS hosted · no GPU
A single REST call to
api.x.ai/v1/ttsper chunk, authed with the same key as the Grok provider, returning one mp3 the browser decodes natively. Five built-in voices (eve, ara, rex, sal, leo), no local hardware. The trade is fidelity-of-cloning for zero-setup.
The backend is a per-agent choice, not a fork in the code — the pipeline shape is identical on both sides. If the chosen backend isn't viable for a turn (sidecar down, no voice file, no API key), buildVoicePipeline returns null and the turn falls back to text silently rather than erroring.
Parallel synth, serial emit
Synthesis and delivery are deliberately decoupled. Every chunk starts synthesizing the instant the chunker emits it — chunk two is already in the synth while chunk one's audio is still streaming to the browser. But a chained promise (emitChain) serializes the WebSocket sends, so the audio reaches the browser strictly in order even though it was produced out of order. The browser keys every event by a per-turn synthId and holds its own per-turn state — pending chunks, decode tracking, reveal hooks — so a new reply that starts before the previous one's audio drains can't write into the wrong bubble.
On the chatterbox path there's one more guard. Neural TTS occasionally hallucinates a multi-second tail — a sigh, a yawn, an elongated noise — that spans several sub-chunks. A running budget (text length ÷ 12 chars/sec, ×1.3 headroom, floor 4s) watches cumulative audio and aborts the synth mid-stream when it overshoots, closing the HTTP connection so the sidecar stops generating and releases its lock for the next chunk.
Reveal and pacing
Audio drives the text, not the other way around. The visible bubble fills chunk-by-chunk as each chunk starts playing, so the words never run ahead of the voice. Between chunks the player inserts a pause keyed to how the prior chunk ended — 250ms after a sentence-final period, ~80ms after a comma, semicolon, or colon — so spoken cadence survives the seams where the chunker cut the stream apart.
Sounding like speech
Reading a chat reply aloud verbatim sounds wrong — markdown, URLs, and code don't speak. Three layers fix that, and none can substitute for the others:
- prompt model side
A voice-mode instruction appended to the dynamic zone tells the model to write spoken English — short sentences, paralinguistic tags like
[laugh]and[sigh]where they fit, code and links deferred. This shapes what only the model can decide.- normalizer post-model
A sidecar text pipeline (
voice/normalizer.py) mechanically cleans what the model can't be trusted to: strips markdown, URLs, and emoji, expands currency and acronyms, and rewrites the punctuation that makes the TTS model misbehave (straight quotes and em-dashes are known triggers for a runaway sigh). This enforces what only post-processing can guarantee.- display strip ui side
The text sent to the synth and the text shown in the bubble are separate strings. Bracket tags and other synth-only markup are stripped from the display copy, so the bubble stays clean even when the synth copy carries instructions meant for the voice, not the eye.
Listening
The listen path is hands-free and entirely browser-driven up to the transcript. getUserMedia opens the mic with echo cancellation, noise suppression, and automatic gain control — the same primitives video-call apps use — and feeds a Silero VAD that decides when an utterance starts and ends. The endpointing is tuned to feel like the assistants people are used to: ~1150ms of trailing silence ends a turn, generous enough for natural thinking pauses without cutting the user off mid-thought. On speech-end the captured audio is WAV-encoded at 16kHz and POSTed to the sidecar, where faster-whisper (large-v3-turbo) transcribes it and the result is auto-sent through the normal chat path — lazy session creation, the memory pack, attachments, all of it — exactly as if it had been typed.
Two filters keep junk out. Below a minimum speech length, the VAD drops the clip as a cough or a click. After transcription, a short list of known Whisper hallucination phrases — "thank you," "you," stray "[music]" — is dropped silently, because Whisper emits those with high internal confidence on near-silent input and sending one would start a turn the user never asked for. Whisper's own VAD filter is left off on this path, since the browser already endpointed the utterance. Speech recognition loads on demand — toggling the mic warms faster-whisper into VRAM, so TTS-only sessions never pay for it — and the two engines load sequentially, because the underlying transformers runtime races on a concurrent first-import.
One rule: the mic yields to the voice
The two halves coordinate at exactly one point. Echo cancellation alone isn't enough — speakers bleed into the mic at any volume, and the canceller only references the call audio path, not the synthesized playback — so when a turn begins speaking, the mic pauses outright. It resumes 500ms after the last audio drains, a cooldown that covers both the tail still leaving the speakers and the moment the VAD needs to re-arm. Overlapping turns (a replay started while a streamed reply is still playing) are reference-counted, so the mic only wakes back up once everything has gone quiet.
The chatterbox backend exposes three runtime knobs — temperature, cfgWeight (how tightly the output anchors to the reference clip; the reliable default sits high, since it also suppresses the hallucinated tails), and exaggeration (expressiveness, baked into cached speaker conditioning). They resolve fresh on every turn, so a change takes effect on the next reply without a restart, and any finished message carries a speaker button that re-synthesizes it through the same pipeline.
Calls
A call is a different mode. Everything else in Mantle runs through the agent loop — a turn arrives, the model thinks, tools fire, a reply comes back. A call runs none of that. It's a live, full-duplex conversation: audio flowing both directions at once, the user free to interrupt at any moment, with xAI's Grok Voice Agent doing the listening and the speaking. Mantle's job is to sit in the middle and otherwise stay out of the way.
Why Mantle sits in the middle
The call could in principle run straight from the browser to xAI — except a browser WebSocket can't attach an Authorization header, and the realtime endpoint authenticates with a bearer key on the upgrade request. So the browser opens a socket to Mantle, and Mantle opens the real one to wss://api.x.ai/v1/realtime with the key attached. From there it's a relay: PCM frames from the mic go up to xAI as input_audio_buffer.append; the assistant's synthesized voice streams back as output_audio.delta and is relayed down to the browser. It's the same key as the Grok provider — one xAI account, two very different uses. Holding it server-side is the whole reason for the proxy, and it pays a second dividend: the transcript and the duration cap live somewhere the browser can't reach around.
The turn belongs to the server
In the chat loop, Mantle decides when a turn is over. In a call, xAI does. The session opens with server-side voice activity detection, and it owns turn-taking completely — when the user has started talking, when they've stopped (~700ms of trailing silence), and when to start the reply. Barge-in falls out of that for free: the instant the user talks over the agent, xAI fires a speech-started event and cancels the response in progress, so the agent stops mid-sentence without Mantle coordinating anything. The user's speech is transcribed (by whisper-1, on xAI's side) and streamed back as text alongside the audio, so the UI captions both halves. There are no tools in a call — if the agent needs one, it says so and follows up in chat afterward.
A different prompt: VOICE.md
A call doesn't get the chat system prompt — that's built for a tool-using, memory-backed agent, none of which applies when the agent is only talking. So a call loads a single file from the agent's workspace, VOICE.md: a distilled, voice-shaped personality, used alone, followed by a short call-mode footer (short turns, no markdown, stop when interrupted). If an agent has no VOICE.md, the call falls back to the full chat prompt with that footer and logs a one-time nudge to write one. Leaving the memory pack out is deliberate: a live conversation's latency budget has no room for a pre-inference retrieval pass. If you want memory in a call, you write it into VOICE.md by hand.
A call still leaves the same trail a chat does — every completed turn is appended to the session JSONL as an ordinary row, tagged so the sidebar renders it as a call. Re-opening one replays its prior turns into the xAI conversation before any audio flows, so the agent resumes with full context, but stays quiet until the user actually speaks.
A call is the one place the harness hands the whole conversation to someone else and just carries the wire. Everything else — the loop, the backends, memory, voice — Mantle owns end to end.
Security
The trust boundary — a login wall, a hardened tool surface, and an honest line about where the hardening stops.
Mantle is built to be reached over the network — another machine, a phone over Tailscale — not just localhost. That takes a real front door and a clear idea of how much damage an agent holding a shell should be able to do. Both are here, and the honest framing is the point: a strong wall, deliberately not a sandbox.
The front door
Access is gated by a session cookie, and the credential handling is conventional in the way security code should be. Passwords are hashed with argon2id (via Bun.password — salted, constant-time to verify); there's a single owner account by design, though the store is an array so multi-user can land later without a schema rewrite. The session token is stateless — a small payload signed with HMAC-SHA256 against a 32-byte secret kept at mode 0600, verified in constant time — so there's no server-side session table to hold. The cookie is HttpOnly and SameSite=Lax (which is also the CSRF control), and the gate sits in front of the API, the WebSocket upgrade (a browser can't attach an auth header to a socket, so the cookie is validated before it opens), and uploads. Shutdown is exempt — but only from loopback, so mantle stop works locally while a remote caller still has to authenticate.
The storage is fail-closed on purpose. Credentials are written atomically (tmp + rename) so a crash mid-write can't lock the owner out, and a corrupt users.json is treated as "an account exists" rather than reopening the public setup screen. Recovery runs through the CLI (mantle user passwd|reset), which reads the store directly and works whether the server is up or down.
On transport, TLS is supported and opt-in. Point server.tls at a cert and key and Bun terminates HTTPS, the session cookie gains its Secure flag, and a configured-but-missing cert is a fatal boot error rather than a silent downgrade to plain HTTP — there's no way to think you're encrypted and not be. Off localhost it's also what the browser's mic and WebRTC need, since those require a secure context. Left unset, Mantle serves plain HTTP and leans on Tailscale's WireGuard for transport encryption — which is the sensible default for a single operator on a private tailnet, and where most of its life is actually spent.
The tool surface points inward
The other half of the model assumes the agent itself can be turned against its operator — a web page it fetches or a document it ingests carrying a prompt-injection payload. Two boundaries make exfiltration hard:
- filesystem boundary contained to allowed roots
The filesystem tools are confined to an allow-list of roots, and
.mantle/authplus the config file are always denied — so a compromised agent can't read its own API keys or session secret throughread_file. Paths are canonicalized first, so a symlink or NTFS junction is contained by where it points, not its name. (A Windows reserved-device-name guard —nul,con, thecom/lptports — closes a phantom-path bug along the way.)- ssrf guard every redirect hop
web_fetchresolves the target host and refuses loopback, RFC1918, link-local, the Tailscale CGNAT range, and the cloud-metadata IP — re-checking on every redirect hop and capping the body size. Theweb_fetchwrapper also strips hop-by-hop, Host, and Cookie headers, so the agent can't be talked into pivoting into the private network, hitting a metadata endpoint, or smuggling credentials out.
Where the wall stops
The outbound side of auth — riding a ChatGPT or SuperGrok subscription over OAuth instead of API keys — is its own story, told under Backends. What lives here is the inbound half: who gets to talk to the agent, and how far a hijacked agent can reach on the way back out.