Session Model
This document is the canonical reference for Helix Agents' session-centric storage model. The framework uses sessionId as the primary key for all state operations.
For framework-level concepts (Run, Agent, Tool, Sub-Agents, etc.) see ./concepts.md. For the actual step-by-step execution loop see ./execution-flow.md.
Session vs Run
- Session: A conversation container, identified by sessionId. Contains all messages, custom state, and checkpoints.
- Run: A single execution within a session. When a session is interrupted, suspended (HITL), or resumed, a new run starts but continues the same session. Each run captures startSequence from the stream to enable run-scoped chunk filtering (prevents content duplication when refreshing mid-stream in multi-run sessions).

Every execute() and resume() turn writes a run record, so listRuns() grows by one per turn on all runtimes (JS, Temporal, DBOS, Cloudflare DO, Cloudflare Workflows). Temporal reached this parity via the continuation run-record fix (FU-TEMPORAL-CONTINUATION-RUN-RECORD): previously a continuation execute() on an existing (e.g. completed) session skipped createRun, so Temporal under-reported turn count (1 vs N). The executor now writes an executor-side run record before returning the handle for every turn.

As of v7, RunStatus includes three suspension variants ('suspended_client_tool', 'suspended_awaiting_children', 'suspended_step_partial') written by runtimes that suspend at HITL boundaries. These mirror RunOutcome.kind discriminators surfaced via AgentResult.status. See ./concepts.md for the full HITL model.

On Temporal and Cloudflare Workflows, distinct runs within a single session are tagged with the __resume-N workflow-id suffix convention (${prefix}__${agentType}__${sessionId}__resume-${N}, single-dash; spec §5). The counter lives on SessionState.resumeCount and is incremented atomically via incrementResumeCount. See ./concepts.md §Client-Executed Tools for per-runtime resume mechanics.

Continuing a completed structured-output session now works. Because stepCount is non-monotonic across turns (each turn restarts its own counter; see getLatestCheckpoint below), a continued turn resets stepCount → 0 and gets a fresh per-turn maxSteps budget. A structured-output agent that completed via __finish__ used to leave a dangling tool_use that broke the next LLM call on continuation; the __finish__ history invariant + heal makes that transcript valid, so a follow-up execute(sessionId) continues cleanly (preserving memory, returning a fresh typed result). The same per-turn stepCount → 0 reset applies to persistent-companion re-consult (continuing a completed child), uniformly across all runtimes (JS, Temporal, DBOS, Cloudflare DO, Cloudflare Workflows). See ./concepts.md §Persistent-companion continuation + the __finish__ heal.
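The resume workflow-id convention can be illustrated with a small helper. This is a minimal sketch: buildResumeWorkflowId and the ResumeIdParts shape are hypothetical names introduced here, and only the ID format itself is taken from the text above.

```typescript
// Hypothetical helper: only the ID format comes from the convention above.
interface ResumeIdParts {
  prefix: string;
  agentType: string;
  sessionId: string;
  resumeCount: number; // assumed to be the value of SessionState.resumeCount after the atomic increment
}

function buildResumeWorkflowId(parts: ResumeIdParts): string {
  const { prefix, agentType, sessionId, resumeCount } = parts;
  // Segments are separated by double underscores; the resume suffix uses a single dash.
  return `${prefix}__${agentType}__${sessionId}__resume-${resumeCount}`;
}

const exampleId = buildResumeWorkflowId({
  prefix: 'helix',
  agentType: 'researcher',
  sessionId: 'session-123',
  resumeCount: 2,
});
```

Because the counter is incremented atomically, two resumes of the same session should never be handed the same N, which is what keeps the resulting workflow IDs unique.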
Key Benefits
- Efficient Message Storage: Messages are stored once per session, not duplicated per run. This is O(n) storage vs O(n²) for run-centric models.
- Natural Conversation Continuity: Reusing the same sessionId automatically continues the conversation with full history.
- Clean Sub-Agent Isolation: Each sub-agent gets its own sessionId, preventing state conflicts.
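The O(n) vs O(n²) storage claim can be checked with a small count. The sketch below assumes every run adds the same number of messages and that a run-centric store copies the full history so far into each run record; both assumptions are illustrative and are not taken from a specific run-centric implementation.

```typescript
// Session-centric: each message is stored exactly once.
function sessionCentricStored(runs: number, perRun: number): number {
  return runs * perRun;
}

// Run-centric (assumed model): run k stores the whole history so far,
// i.e. k * perRun messages.
function runCentricStored(runs: number, perRun: number): number {
  let total = 0;
  for (let run = 1; run <= runs; run++) {
    total += run * perRun;
  }
  return total; // equals perRun * runs * (runs + 1) / 2
}
```

With 10 runs of 2 messages each, the session-centric count is 20 stored messages and the run-centric count is 110, and the gap widens quadratically as the conversation grows.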
Usage
// Start a new session (sessionId is required)
const sessionId = `session-${Date.now()}`;
const handle = await executor.execute(agent, { message: 'Hello' }, { sessionId });

// Continue the same session (pass the same sessionId)
const handle2 = await executor.execute(
  agent,
  { message: 'Follow up' },
  {
    sessionId: handle.sessionId,
  }
);

// Branch from a checkpoint (creates a new session from existing state)
const newSessionId = `session-${Date.now()}`;
const handle3 = await executor.execute(
  agent,
  { message: 'What if...' },
  {
    sessionId: newSessionId,
    branch: { fromSessionId: handle.sessionId, checkpointId: 'cp_123' },
  }
);

v7 SessionState Shape
The full SessionState<TState, TOutput> interface lives at packages/core/src/types/session.ts:86-295. v7 added a number of suspension- and concurrency-related fields. The canonical shape is:
interface SessionState<TState, TOutput> {
  // Identity
  sessionId: string;
  agentType: string;
  streamId?: string;

  // Custom application state + status
  customState: TState;
  status: SessionStatus; // 'active' | 'completed' | 'failed' | 'interrupted' | 'paused'
  stepCount: number;
  output?: TOutput;
  error?: string;

  // v7: γ-cascade discriminator. Currently 'parent_suspended' marks a
  // child that was failed because its parent suspended; the cascade in
  // applyResultsAndReload re-spawns these on parent resume.
  failureReason?: string;

  // Interrupt context (set when status === 'interrupted' or 'paused')
  interruptContext?: InterruptContext;

  // v7 HITL suspension state
  pendingClientToolCalls?: Record<string, PendingClientToolCall>;
  suspendedAwaitingChildren?: Record<string, SuspendedChildWait>;
  suspendedStepId?: string;
  completedClientToolCalls?: Record<string, number>; // root-only
  clientToolCallOwnership?: ClientToolCallOwnership; // root-only

  // v7 tracing continuity (sessionId-seeded)
  tracingContext?: { traceId: string; rootSpanId: string };

  // v7 session GC + cross-session links
  expiresAt?: number;
  parentSessionId?: string;
  rootSessionId?: string;

  // v7 DBOS write-once mode binding
  mode?: 'standard' | 'persistent';

  // v7 distributed coordination
  version: number; // monotonic; incremented on every modification
  resumeCount: number; // counter for unique resume workflow IDs

  // Checkpoint tracking
  checkpointId?: string;
  checkpointedAt?: number;
  checkpointSource?: 'staging' | 'save';

  // User context
  userId?: string;
  tags?: string[];
  metadata?: Record<string, string>;

  // v7 persisted workspace ref (so it survives interrupt/resume)
  workspaceRef?: WorkspaceRef;

  // Timestamps
  createdAt: number;
  updatedAt: number;
}

v7-NEW field summary
| Field | Purpose |
|---|---|
| failureReason | γ-cascade discriminator (e.g. 'parent_suspended'); used by applyResultsAndReload to decide re-spawn vs. drain. |
| pendingClientToolCalls | Map of toolCallId → pending entry; canonical signal for "awaiting client submission". |
| suspendedAwaitingChildren | Map of parentToolCallId → child wait info; populated when parent paused awaiting sub-agents. |
| suspendedStepId | Mid-step suspension marker for mixed server+client tool batches. |
| completedClientToolCalls | Root-only timestamp map; makes 'already_completed' durable across runtime restarts. |
| clientToolCallOwnership | Root-only toolCallId → owningSessionId; routes submissions to the owning sub-agent. |
| tracingContext | sessionId-seeded traceId + rootSpanId; one trace per session across runs. |
| expiresAt | Operator GC hint for abandoned sessions. |
| mode | Write-once 'standard' / 'persistent' binding (DBOS-enforced). |
| version / resumeCount | Optimistic concurrency + unique resume workflow IDs. |
| workspaceRef | Persisted workspace ref (so it survives interrupt/resume cycles). |
| parentSessionId / rootSessionId | Sub-agent cross-session linkage; rootSessionId enables O(1) ownership writes. |
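To make the two root-only routing fields concrete, here is a minimal sketch of how a caller might read them. The RootSessionView shape, both function names, and the fallback to the root session are assumptions made for illustration; the real types live in packages/core and may differ.

```typescript
// Assumed shape: toolCallId -> owningSessionId, as described in the table above.
type ClientToolCallOwnership = Record<string, string>;

// Narrow, illustrative view of the root session's state.
interface RootSessionView {
  sessionId: string;
  pendingClientToolCalls?: Record<string, unknown>;
  clientToolCallOwnership?: ClientToolCallOwnership;
}

// True when at least one client tool call is still waiting for a submission.
function isAwaitingClientSubmission(state: RootSessionView): boolean {
  return Object.keys(state.pendingClientToolCalls ?? {}).length > 0;
}

// Resolve which session should receive a submitted tool result.
// Falling back to the root when no ownership entry exists is an assumption.
function resolveOwningSession(root: RootSessionView, toolCallId: string): string {
  const ownership = root.clientToolCallOwnership ?? {};
  return Object.prototype.hasOwnProperty.call(ownership, toolCallId)
    ? ownership[toolCallId]
    : root.sessionId;
}
```

Keeping the ownership map on the root means a submission needs only one lookup to find its target, regardless of how deeply the owning sub-agent is nested.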
State Store Interface
All state stores implement SessionStateStore (defined at packages/core/src/store/state-store.ts). v7 introduces several new atomic primitives that runtime code now depends on heavily.
Lifecycle
- createSession(sessionId, options): Atomically create a session (throws if it already exists). All implementations guarantee exactly-one-wins semantics for concurrent calls with the same sessionId.
- sessionExists(sessionId) / deleteSession(sessionId) / cloneSession(...): Standard lifecycle helpers.
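The exactly-one-wins contract of createSession can be sketched with an in-memory stand-in. MemorySessions, the { agentType } options shape, and countWinners are hypothetical; the sketch is synchronous for brevity, whereas a real store would be asynchronous and would lean on its backend (for example a unique-key insert) for atomicity.

```typescript
// In-memory stand-in; not one of the framework's stores.
class MemorySessions {
  private sessions: Record<string, { agentType: string; createdAt: number }> = {};

  createSession(sessionId: string, options: { agentType: string }): void {
    // Check-and-insert happens in one synchronous step, so no caller can interleave.
    if (this.sessionExists(sessionId)) {
      throw new Error(`session ${sessionId} already exists`);
    }
    this.sessions[sessionId] = { agentType: options.agentType, createdAt: Date.now() };
  }

  sessionExists(sessionId: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.sessions, sessionId);
  }
}

// Several callers try to create the same session; count how many succeed.
function countWinners(store: MemorySessions, sessionId: string, attempts: number): number {
  let winners = 0;
  for (let i = 0; i < attempts; i++) {
    try {
      store.createSession(sessionId, { agentType: 'assistant' });
      winners++;
    } catch (err) {
      // This caller lost the race; the session already exists.
    }
  }
  return winners;
}
```

Whatever the number of attempts, countWinners reports a single winner, which is the property a custom store has to preserve under real concurrency.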
State
- loadState(sessionId) / saveState(sessionId, state): Load / save session state.
- mergeCustomState(sessionId, writes): Atomically merge StepWrites from ImmerStateTracker into custom state.
- updateStatus(sessionId, status, context?): Atomic status update (no CAS).
v7 atomic primitives
- compareAndSetStatus(sessionId, expectedStatuses, newStatus, options?): Atomic CAS on session status (and optional expectedVersion). Returns a discriminated result:
  - { ok: true; newVersion: number } on success
  - { ok: false; currentStatus: SessionStatus; currentVersion: number } on mismatch

  options accepts interruptContext, error, and expectedVersion. Used to prevent double-resume races.
- saveStateAndPromoteStaging(sessionId, state, appendMessages, checkpointMeta, options?): Atomic write of state + appended messages + staging promotion + checkpoint creation in one operation. Honors expectedVersion (throws StaleStateError on mismatch). Cross-runtime invariant C-1: when a runtime suspends, this is the single primitive that persists pending tool calls, ownership, completed phase-1 messages, the checkpoint, and suspendedStepId atomically.
- incrementStepCount(sessionId) / incrementResumeCount(sessionId): Atomic counters.
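The double-resume protection offered by compareAndSetStatus can be sketched against a single in-memory row. The Row type and the synchronous signature are simplifications made for this example; the result type mirrors the discriminated shape described above, and a real store would perform the check and the write inside one transaction or script.

```typescript
type SessionStatus = 'active' | 'completed' | 'failed' | 'interrupted' | 'paused';

type CasResult =
  | { ok: true; newVersion: number }
  | { ok: false; currentStatus: SessionStatus; currentVersion: number };

// Illustrative stand-in for the stored session row.
interface Row {
  status: SessionStatus;
  version: number;
}

function compareAndSetStatus(
  row: Row,
  expectedStatuses: SessionStatus[],
  newStatus: SessionStatus,
  options: { expectedVersion?: number } = {},
): CasResult {
  const statusOk = expectedStatuses.indexOf(row.status) !== -1;
  const versionOk =
    options.expectedVersion === undefined || options.expectedVersion === row.version;
  if (!statusOk || !versionOk) {
    // Report what the caller would need to decide whether to retry.
    return { ok: false, currentStatus: row.status, currentVersion: row.version };
  }
  row.status = newStatus;
  row.version += 1;
  return { ok: true, newVersion: row.version };
}

// Two resume attempts on the same paused session: only the first succeeds.
const pausedRow: Row = { status: 'paused', version: 3 };
const firstAttempt = compareAndSetStatus(pausedRow, ['paused', 'interrupted'], 'active');
const secondAttempt = compareAndSetStatus(pausedRow, ['paused', 'interrupted'], 'active');
```

The losing caller receives the current status and version rather than an exception, so it can tell that another process already resumed the session and back off.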
Interrupt flag
- setInterruptFlag(sessionId, reason?): Durable interrupt request (writes durably so other processes can observe it).
- checkInterruptFlag(sessionId): Atomic check-and-clear; polled by the runLoop at the top of every step iteration. Foundation for cross-process interrupt parity (JS, Cloudflare DO, and Cloudflare Workflows all rely on it).
- clearInterruptFlag(sessionId): Explicit clear (rarely used directly; checkInterruptFlag clears as part of the read).
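The check-and-clear behaviour of the interrupt flag, and the way a run loop polls it, can be sketched as follows. InterruptFlags, the InterruptFlag return shape, and runSteps are hypothetical; the actual return type of checkInterruptFlag is not specified in this document, and a real store would persist the flag durably rather than keep it in memory.

```typescript
// Assumed shape of what a poller gets back when a flag is set.
interface InterruptFlag {
  reason?: string;
}

class InterruptFlags {
  private flags: Record<string, InterruptFlag> = {};

  setInterruptFlag(sessionId: string, reason?: string): void {
    this.flags[sessionId] = { reason };
  }

  // Read and clear in one step, so a single request is observed by one poller only.
  checkInterruptFlag(sessionId: string): InterruptFlag | undefined {
    const flag = this.flags[sessionId];
    delete this.flags[sessionId];
    return flag;
  }

  clearInterruptFlag(sessionId: string): void {
    delete this.flags[sessionId];
  }
}

// Simplified run loop: poll the flag at the top of every step iteration.
// `interruptAt` simulates another process requesting an interrupt before that step.
function runSteps(flags: InterruptFlags, sessionId: string, maxSteps: number, interruptAt: number): number {
  let completed = 0;
  for (let step = 0; step < maxSteps; step++) {
    if (step === interruptAt) {
      flags.setInterruptFlag(sessionId, 'user_requested');
    }
    if (flags.checkInterruptFlag(sessionId)) {
      break; // stop before executing this step
    }
    completed++;
  }
  return completed;
}
```

Because the read clears the flag, a later turn on the same session does not see a stale interrupt request.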
Messages, runs, checkpoints, sub-sessions, staging
- appendMessages / getMessages / getMessageCount / truncateMessages
- createRun / updateRunStatus / getCurrentRun / listRuns / getRun
- createCheckpoint / getLatestCheckpoint / getCheckpoint / listCheckpoints
- addSubSessionRefs / updateSubSessionRef / getSubSessionRefs
- stageChanges / getStagedChanges / promoteStaging / discardStaging / hasStagedChanges / cleanupOrphanedStaging
Optional extensions
- patchMetadata?(sessionId, patch): Used by runtime-dbos persistent mode to record the active DBOS workflow ID.
Third-party stores: atomic implementation required
There is no non-atomic fallback. A previously-exported defaultSaveStateAndPromoteStaging(store, ...) helper (sequential appendMessages → saveState → promoteStaging) was removed in P3.R3-BC-FALLBACK because the crash-between-calls window it created is exactly the corruption the atomic primitive was added to prevent. All five in-tree stores (memory, redis, postgres, D1, DO) implement the atomic version; custom stores must do the same.
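The crash window described above can be made concrete with a toy model. Everything in this sketch is illustrative: the Snapshot shape, sequentialSave, and atomicSave are invented for the example, the error is a plain Error tagged with the name StaleStateError as a stand-in for the framework's class, and a real store would get atomicity from a transaction or script rather than from an in-memory swap.

```typescript
// Simplified snapshot of one session; field names are illustrative.
interface Snapshot {
  version: number;
  messages: string[];
  staged: string[]; // staged changes awaiting promotion
  checkpointId?: string;
}

// Shape of the removed sequential fallback. A crash between steps leaves a
// torn session: messages appended, state and staging untouched.
function sequentialSave(snap: Snapshot, append: string[], checkpointId: string, crashAfterAppend: boolean): void {
  snap.messages = snap.messages.concat(append); // appendMessages
  if (crashAfterAppend) throw new Error('simulated crash');
  snap.version += 1; // saveState
  snap.checkpointId = checkpointId;
  snap.staged = []; // promoteStaging
}

// Atomic shape: validate first, build the next snapshot on the side, and
// return it so the caller can swap it in with a single assignment.
function atomicSave(snap: Snapshot, append: string[], checkpointId: string, expectedVersion?: number): Snapshot {
  if (expectedVersion !== undefined && expectedVersion !== snap.version) {
    const err = new Error(`expected version ${expectedVersion}, found ${snap.version}`);
    err.name = 'StaleStateError'; // stand-in for the framework's error class
    throw err;
  }
  return {
    version: snap.version + 1,
    messages: snap.messages.concat(append),
    staged: [],
    checkpointId,
  };
}
```

In the sequential shape a failure after the first step leaves new messages next to an old version and unpromoted staging, while the atomic shape either yields a complete new snapshot or leaves the old one untouched.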
getLatestCheckpoint must return the most-recently-WRITTEN checkpoint
getLatestCheckpoint(sessionId) must return the checkpoint that was written most recently — i.e. the one the session row's checkpointId pointer references — not the checkpoint with the highest stepCount.
stepCount is not monotonic across turns/runs: each execute()/resume() turn restarts its own step counter, so a later turn's first checkpoint (including a client-tool suspension checkpoint at step 0/1) can have a lower stepCount than the previous turn's final checkpoint. The runtime calls getLatestCheckpoint during resume/retry to find the boundary of committed state and then truncateMessages past it. If the store returns a stale higher-stepCount checkpoint, the truncation deletes the current turn's user message and the suspending assistant tool_use message, leaving an orphaned tool_result that fails the next LLM call (unexpected tool_use_id ...).
Every checkpoint-writing path (createCheckpoint, saveState, promoteStaging, saveStateAndPromoteStaging) updates the session's checkpointId pointer atomically, so resolving via that pointer is correct on all backends. The in-tree SQL/Redis stores resolve the pointer first and fall back to step_count-ordering only when the pointer is absent or dangling (legacy data). The cross-store contract test (packages/core/src/testing/checkpoint-operations.ts) enforces this invariant; custom stores must satisfy it.
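The pointer-first resolution described above can be sketched as follows. The Checkpoint and SessionRow shapes and the array-based lookup are simplifications for illustration; an in-tree store would resolve the pointer with a keyed lookup and use a step_count-ordered query only for the legacy fallback.

```typescript
// Illustrative shapes; the real checkpoint and session rows carry more fields.
interface Checkpoint {
  id: string;
  stepCount: number;
}

interface SessionRow {
  checkpointId?: string;
}

function getLatestCheckpoint(row: SessionRow, checkpoints: Checkpoint[]): Checkpoint | undefined {
  // 1. Resolve via the session's checkpointId pointer (most recently written).
  if (row.checkpointId !== undefined) {
    for (const cp of checkpoints) {
      if (cp.id === row.checkpointId) return cp;
    }
  }
  // 2. Pointer absent or dangling (legacy data): fall back to highest stepCount.
  let best: Checkpoint | undefined;
  for (const cp of checkpoints) {
    if (best === undefined || cp.stepCount > best.stepCount) best = cp;
  }
  return best;
}

// Turn 1 ended at step 4; turn 2 restarted its counter and suspended at step 1.
const storedCheckpoints: Checkpoint[] = [
  { id: 'cp_turn1_final', stepCount: 4 },
  { id: 'cp_turn2_suspend', stepCount: 1 },
];
const latestCheckpoint = getLatestCheckpoint({ checkpointId: 'cp_turn2_suspend' }, storedCheckpoints);
```

Ordering by stepCount alone would pick cp_turn1_final in this scenario, which is the stale result that leads to the truncation failure described earlier.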