State Stores
This guide covers the SessionStateStore interface and the v7 contracts every implementation must satisfy. For end-user custom-state patterns (defining schemas, reading/writing state inside tools), see State Management.
Overview
The state store is the single durable surface for an agent's session data. v7's stateless-suspension redesign elevated this to a load-bearing role: every HITL pause writes its full continuation context into SessionState.suspensionContext so the next process to wake up on the session can pick up where the previous one left off, even across restarts and machine moves.
In v7, every in-tree state store implements the full SessionStateStore interface atomically. Third-party stores must do the same: the non-atomic fallback helper has been removed (see Implementing in a custom store).
What changed in v7
If you maintain a custom SessionStateStore implementation or operate sessions at the SQL level, the v7 changes that matter:
- New required method: saveStateAndPromoteStaging. Atomic write-and-promote that replaces the v6 two-call dance.
- Forward-only schema migrations added to all in-tree stores: Postgres V5, D1 V8, DO SQLite V4. They add a suspension_context column and indexes on pendingClientToolCalls for efficient expiration queries. See Storage migrations.
- compareAndSetStatus return shape. Old: Promise<boolean>. New: discriminated { ok: true; newVersion } | { ok: false; currentStatus; currentVersion }. The single most-commonly-tripped v7 breaking change.
- New SessionState fields: suspendedAwaitingChildren, suspendedStepId, tracingContext, expiresAt. Custom stores must persist all of them, even if the columns store JSON.
- expiredSessionCleanup: operator-driven helper for reaping sessions whose expiresAt is in the past.
If you are upgrading from v6, read the v6 to v7 migration guide end-to-end before deploying. The rest of this page describes the v7 model.
saveStateAndPromoteStaging
SessionStateStore.saveStateAndPromoteStaging(sessionId, state, opts) atomically:
- Persists the full SessionState (messages, custom state, suspensionContext, all the v7 fields).
- Promotes any staged Immer patches into the canonical state.
- Bumps the session version.
In v7 this is the canonical write path used by the run loop after every step. The previous v6 flow — saveStaging followed by a separate commitStaging call — has a small window where a crash leaves staging written but unpromoted. The atomic primitive closes that window.
Implementing in a custom store
If you maintain a third-party SessionStateStore, you MUST implement saveStateAndPromoteStaging atomically — run both writes inside a single transaction (Postgres) or compare-and-swap (Redis/DO).
Earlier versions exported a defaultSaveStateAndPromoteStaging() helper from @helix-agents/core that performed the legacy two-call flow (non-atomic). That helper was removed in P3.R3-BC-FALLBACK: a sequential appendMessages → saveState → promoteStaging opens a small window where a crash between calls leaves staging written but unpromoted, which is exactly the corruption the atomic primitive was added to prevent. All five in-tree stores (memory, redis, postgres, D1, DO) implement the atomic version; custom stores must do the same.
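As a rough illustration of what "atomic" means here, the sketch below models an in-memory store that builds the next record off to the side and publishes it with a single assignment, so a failure partway through leaves the previous record untouched. The State, StagedWrite, and SessionRecord shapes, the applyWrite helper, and the MemoryStoreSketch class are hypothetical simplifications, not the real @helix-agents/core types. The method is synchronous only for brevity.

```typescript
// Hypothetical, simplified shapes: the real SessionState and staging types are richer.
type State = Record<string, unknown>;
interface StagedWrite { path: string; value: unknown }
interface SessionRecord { state: State; staged: StagedWrite[]; version: number }

// Apply one staged write to a shallow copy of the state (top-level keys only).
function applyWrite(state: State, write: StagedWrite): State {
  if (write.path === '') throw new Error('staged write has an empty path');
  return { ...state, [write.path]: write.value };
}

class MemoryStoreSketch {
  private records = new Map<string, SessionRecord>();

  // Queue a write in staging without touching the canonical state.
  stage(sessionId: string, write: StagedWrite): void {
    const rec = this.records.get(sessionId) ?? { state: {}, staged: [], version: 0 };
    this.records.set(sessionId, { ...rec, staged: [...rec.staged, write] });
  }

  // Persist the state, promote staged writes, and bump the version as one unit.
  saveStateAndPromoteStaging(sessionId: string, state: State): number {
    const prev = this.records.get(sessionId) ?? { state: {}, staged: [], version: 0 };
    // Build the next record without mutating the stored one; a throw here
    // leaves `prev` fully intact.
    const promoted = prev.staged.reduce<State>(applyWrite, { ...state });
    const next: SessionRecord = { state: promoted, staged: [], version: prev.version + 1 };
    // Single publish point: readers see either `prev` or `next`, never a mix.
    this.records.set(sessionId, next);
    return next.version;
  }

  get(sessionId: string): SessionRecord | undefined {
    return this.records.get(sessionId);
  }
}
```

A SQL-backed store would get the same all-or-nothing property by issuing the equivalent statements inside one transaction rather than by swapping an in-memory reference.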
compareAndSetStatus returns an object
The status-CAS API changed in v7 to surface what the store saw, not just whether the swap succeeded:
```typescript
// v6
const ok = await store.compareAndSetStatus(sessionId, ['active'], 'paused');
if (ok) { /* ... */ }

// v7
const result = await store.compareAndSetStatus(
  sessionId,
  ['active'],
  'paused',
);
if (result.ok) {
  console.log('promoted to version', result.newVersion);
} else {
  console.log(
    'lost CAS: store is at',
    result.currentStatus,
    'version',
    result.currentVersion,
  );
}
```

Every call site in your codebase must be updated. The lossy boolean form is gone.
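For store authors, here is a minimal sketch of how an implementation might produce the new result shape. The StoredSession record and the standalone casStatus function are assumptions for illustration. Only the shape of the result union comes from this page; the field types (string status, number version) and the choice to bump the version by one on a successful swap are guesses.

```typescript
// Result union as described on this page; the field types are assumed.
type CasResult =
  | { ok: true; newVersion: number }
  | { ok: false; currentStatus: string; currentVersion: number };

// Hypothetical stored record.
interface StoredSession { status: string; version: number }

// Swap the status only if the current one is among `expected`.
// On failure, report what the store actually holds so the caller can decide
// whether to retry, back off, or give up.
function casStatus(
  session: StoredSession,
  expected: readonly string[],
  next: string,
): CasResult {
  const matches = expected.some((status) => status === session.status);
  if (!matches) {
    return { ok: false, currentStatus: session.status, currentVersion: session.version };
  }
  session.status = next;
  session.version += 1; // assumption: a successful swap bumps the version by one
  return { ok: true, newVersion: session.version };
}
```

In a real store the check and the write must happen as one atomic step (a conditional UPDATE, a Lua script, or similar); this single-threaded function only shows the return contract.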
New SessionState fields
v7 adds the following fields to SessionState. Custom stores that serialize state must round-trip all of them.
| Field | Type | Purpose |
|---|---|---|
| suspensionContext | SuspensionContext \| undefined | Continuation context for HITL pauses. Read on resume to restore loop. |
| suspendedAwaitingChildren | SuspendedChildWait[] \| undefined | Per-child waits for cascading sub-agent suspensions. |
| suspendedStepId | string \| undefined | The step ID a suspended_step_partial outcome is anchored to. |
| tracingContext | TracingContext \| undefined | Persisted Langfuse / OTel trace IDs so resume continues the same trace. |
| expiresAt | number \| undefined | Epoch ms TTL, read by expiredSessionCleanup. |
In Postgres, the new fields serialize into the state JSONB column; no schema migration is needed beyond V5's suspension_context column, which exists for indexing. In D1 / DO SQLite they live in the state TEXT column.
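One way a custom store might guard against silently dropping these fields is a round-trip check in its test suite. The sketch below assumes JSON serialization into a single column. V7Fields, Codec, jsonCodec, and lostFields are hypothetical names, and the field types are reduced to placeholders; only the field names come from the table above.

```typescript
// Placeholder types: the real SuspensionContext, SuspendedChildWait and
// TracingContext are defined by the framework.
interface V7Fields {
  suspensionContext?: Record<string, unknown>;
  suspendedAwaitingChildren?: Array<Record<string, unknown>>;
  suspendedStepId?: string;
  tracingContext?: Record<string, unknown>;
  expiresAt?: number;
}

// Hypothetical pair of functions a store uses to write and read its state column.
interface Codec {
  write: (state: V7Fields) => string;
  read: (raw: string) => V7Fields;
}

const jsonCodec: Codec = {
  write: (state) => JSON.stringify(state),
  read: (raw) => JSON.parse(raw) as V7Fields,
};

const V7_KEYS: Array<keyof V7Fields> = [
  'suspensionContext',
  'suspendedAwaitingChildren',
  'suspendedStepId',
  'tracingContext',
  'expiresAt',
];

// Returns the names of fields that were set before the round trip
// but are missing or different afterwards.
function lostFields(state: V7Fields, codec: Codec): string[] {
  const restored = codec.read(codec.write(state));
  return V7_KEYS.filter(
    (key) =>
      state[key] !== undefined &&
      JSON.stringify(restored[key]) !== JSON.stringify(state[key]),
  );
}
```

Running lostFields against the store's real serializer with every field populated should return an empty array; any name it returns is a field the store fails to persist.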
Storage migrations
Every in-tree store ships a forward migration in v7. Apply migrations before rolling new code; new code reading old data is fine, but old code reading new data is undefined behavior.
| Package | Migration | Notes |
|---|---|---|
@helix-agents/store-postgres | V10 | StepWrites unification: replaces patches/merge_changes with writes. TRUNCATEs staging (ephemeral). |
@helix-agents/store-cloudflare (D1) | V13 | StepWrites unification: same shape change as Postgres V10 for D1 staging table. |
@helix-agents/store-cloudflare (DO) | V7 | StepWrites unification: same shape change for DO SQLite staging table. |
@helix-agents/store-redis | (none) | Lua scripts updated in-place; schemaless — no migration table entry. |
@helix-agents/store-memory | (none) | In-memory; no migration needed. |
Verify the active migration version with:
Postgres and D1:
```sql
SELECT version FROM __agents_migrations
ORDER BY version DESC
LIMIT 1;
```

Postgres should show 10 or higher; D1 should show 13 or higher.
Durable Objects (DO SQLite):
DO uses its own schema_info key-value table (not __agents_migrations). Check the schema version from within the DO process:
```typescript
// Inside the DO (e.g. in a debug handler or health endpoint):
const rows = this.storage.sql
  .exec("SELECT value FROM schema_info WHERE key = 'version'")
  .toArray();
console.log('DO schema version:', rows[0]?.value);
// Should be '7' or higher after the V7 StepWrites migration.
```

Or, if you have direct SQLite access to the DO's storage:

```sql
SELECT value FROM schema_info WHERE key = 'version';
```

The returned value should be 7 or higher.
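A deployment preflight can turn these checks into a hard gate. The sketch below assumes the version value has already been fetched with one of the queries above. MIN_SCHEMA_VERSION and assertSchemaVersion are hypothetical names, not part of any @helix-agents package, and the minimums are the ones quoted in this section.

```typescript
// Minimum versions quoted in this section (StepWrites migrations).
const MIN_SCHEMA_VERSION = { postgres: 10, d1: 13, durableObject: 7 } as const;
type StoreKind = keyof typeof MIN_SCHEMA_VERSION;

// Accepts the raw query result: a number (Postgres, D1) or a string (DO schema_info),
// or undefined when the query returned no row.
function assertSchemaVersion(kind: StoreKind, raw: string | number | undefined): number {
  const version = Number(raw);
  if (raw === undefined || !Number.isInteger(version)) {
    throw new Error(`${kind}: could not read a schema version (got ${String(raw)})`);
  }
  const minimum = MIN_SCHEMA_VERSION[kind];
  if (version < minimum) {
    throw new Error(`${kind}: schema version ${version} is below required ${minimum}`);
  }
  return version;
}
```

Calling this at process start, before the worker accepts traffic, makes a missed migration fail fast instead of surfacing later as a failed staging write.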
Forward-only (Phase B / StepWrites migration)
Phase B's schema migrations (Postgres V10, D1 V13, DO V7) are forward-only. They TRUNCATE the ephemeral staging table and DROP the patches/merge_changes columns. In-flight steps at migration time are recovered by checkpoint replay.
Rolling deploys are NOT safe for this migration — it's a flag-day:
- If migrations apply first: old workers fail on stageChanges (the old code's INSERT does not supply the writes column and still targets the dropped patches/merge_changes columns).
- If migrations apply after: new workers fail on stageChanges (the writes column they insert into does not exist yet).
Apply migrations during a brief maintenance window or via a full cutover.
Forward-only (v7 stateless suspension)
Rolling back from v7 to v6 after applying the v7 migrations (Postgres V5, D1 V8, DO V4) is unsafe by default. Sessions paused under v7 carry suspension context that v6 does not know how to read; resuming them under v6 silently loses the context. See the migration guide's rollback semantics for the recovery procedure.
Operator-driven session cleanup
v7 adds expiredSessionCleanup to @helix-agents/agent-server — a helper for reaping sessions whose expiresAt is in the past. The helper:
- Pages through stateStore.listSessions() (configurable page size, default 200).
- Loads each session and checks expiresAt.
- For each expired non-terminal session: enumerates owned workspace snapshots via the matching WorkspaceProvider's snapshot.list / snapshot.delete capability and deletes them (closes the R2 cost-amplification gap).
- CAS's the session status to 'failed' with reason 'session_expired'.

Per-session failures (load errors, snapshot errors, CAS conflicts) are logged but do not abort the loop.
The framework does not run this automatically. Wire it into a scheduled job (cron, Cloudflare Alarm, k8s CronJob).
```typescript
import { expiredSessionCleanup } from '@helix-agents/agent-server';

// Cloudflare Workers scheduled handler (invoked by a Cron Trigger).
// stateStore, workspaceProviders and consoleLogger are constructed elsewhere.
export default {
  async scheduled(_event, env, _ctx) {
    const summary = await expiredSessionCleanup({
      stateStore,
      workspaceProviders, // Map<providerId, WorkspaceProvider>
      logger: consoleLogger,
    });
    console.log('cleanup summary', summary);
  },
};
```

The returned summary ({ detected, marked, alreadyTerminal, snapshotsDeleted, errors }) gives operators a per-run observability handle.
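To make the control flow of the steps above concrete, here is a simplified model of the loop. It is not the helper's real implementation: SessionRow, CleanupSummary, TERMINAL_STATUSES, and cleanupSketch are hypothetical, snapshot deletion is reduced to a comment, the summary's counters are assumed to be plain numbers, and counting already-terminal sessions under detected is a guess.

```typescript
// Hypothetical, reduced shapes; the real store and summary types are richer.
interface SessionRow { id: string; status: string; expiresAt?: number }
interface CleanupSummary { detected: number; marked: number; alreadyTerminal: number; errors: number }

// Assumed set of terminal statuses.
const TERMINAL_STATUSES = ['completed', 'failed'];

function cleanupSketch(
  pages: SessionRow[][],
  now: number,
  markFailed: (id: string) => boolean, // stands in for the status CAS; may throw
): CleanupSummary {
  const summary: CleanupSummary = { detected: 0, marked: 0, alreadyTerminal: 0, errors: 0 };
  for (const page of pages) {
    for (const session of page) {
      // Skip sessions with no TTL or a TTL still in the future.
      if (session.expiresAt === undefined || session.expiresAt >= now) continue;
      summary.detected += 1;
      if (TERMINAL_STATUSES.indexOf(session.status) !== -1) {
        summary.alreadyTerminal += 1;
        continue;
      }
      try {
        // Snapshot deletion would happen here, before the status change.
        if (markFailed(session.id)) summary.marked += 1;
        else summary.errors += 1; // lost the CAS: count it and move on
      } catch {
        summary.errors += 1; // a per-session failure never aborts the loop
      }
    }
  }
  return summary;
}
```

The property worth noticing is the try/catch placement: it wraps one session at a time, which is what lets a single bad session be counted without stopping the rest of the run.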
See also
- State Management — defining and using custom state
- Checkpoints — how snapshots layer on top of state
- v6 to v7 migration guide
- Storage Overview — reference for each store implementation