
Future work (separate sessions)

This doc captures meaningful sub-projects that are scoped to be tackled in a future, dedicated session rather than as inline follow-ups to the current branch's work.

These are NOT in the active follow-ups backlog (docs/dev/follow-ups.md) — they're roadmap items that need their own spec/plan/implementation cycle.


Workspaces parity matrix (Temporal / CFW Workflows / DBOS)

Why future: Substantial standalone feature. Each runtime needs workspace providers (local-bash, etc.) wrapped in its own deterministic boundary (activities for Temporal, steps for DBOS, workflow steps for CFW Workflows). Cross-cuts state-store semantics, sandboxing, filesystem isolation, and per-runtime sandbox enforcement.

Current state:

  • JS runtime: ✅ Full workspace support
  • CF Durable Objects (via agent-server): ✅ Full workspace support
  • Temporal (runtime-temporal): ❌ Run-start fail-fast at executor.ts:268. Documented as "not supported" in CLAUDE.md.
  • CFW Workflows (runtime-cloudflare/src/workflow.ts:392): ❌ Run-start fail-fast.
  • DBOS (runtime-dbos): ❌ Run-start fail-fast — DBOSAgentExecutor's execute() / resume() / retry() call assertRuntimeSupportsWorkspaces (the same guard Temporal and CFW Workflows use) before any DBOS workflow starts, so a workspace-declaring agent gets an immediate error rather than silent breakage. Full provider support remains future work (this section).
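
The run-start guard described above can be pictured roughly as follows. This is a minimal sketch, not the project's code: only the name assertRuntimeSupportsWorkspaces comes from this document, while the AgentLike shape, the runtime-name union, and the error text are assumptions made for illustration.

```typescript
// Hypothetical shapes; only the guard's name appears in this document.
type RuntimeName = 'js' | 'cf-durable-objects' | 'temporal' | 'cfw-workflows' | 'dbos';

interface AgentLike {
  name: string;
  workspaces?: Record<string, unknown>; // assumed: declared workspace configs by key
}

// Mirrors the "Current state" list above: only these two have full support today.
const RUNTIMES_WITH_WORKSPACES: ReadonlySet<RuntimeName> = new Set<RuntimeName>([
  'js',
  'cf-durable-objects',
]);

function assertRuntimeSupportsWorkspaces(agent: AgentLike, runtime: RuntimeName): void {
  const declared = Object.keys(agent.workspaces ?? {});
  if (declared.length === 0) return; // nothing declared, nothing to check
  if (RUNTIMES_WITH_WORKSPACES.has(runtime)) return;
  // Fail before any workflow starts, so the problem is loud rather than silent.
  throw new Error(
    `Agent "${agent.name}" declares workspaces (${declared.join(', ')}) ` +
      `but runtime "${runtime}" does not support them yet.`,
  );
}
```

Calling the guard at the top of execute() / resume() / retry() is what turns a workspace-declaring agent into an immediate error instead of a partially working run.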

Scope for the future session:

  1. Decide architectural model:

    • Option A: workspace providers run inside activities/steps (durability boundary preserved; activities can call non-deterministic I/O).
    • Option B: workspaces become "workflow-level" abstractions with deterministic state in the workflow body and async I/O delegated to dedicated workspace activities.
    • Option C: narrow workspaces to read-only scopes initially; full mutation support comes later.
  2. Per-runtime implementation:

    • Temporal: workspace activities + remove the fail-fast guard.
    • CFW Workflows: workspace step wrappers + remove the fail-fast guard.
    • DBOS: workspace step wrappers + remove the fail-fast guard (the guard has already landed; see Current state above).
  3. Cross-runtime parity tests in packages/e2e exercising every workspace operation against every runtime.

  4. Update CLAUDE.md's parity matrix when complete.

Estimated effort: 2-3 weeks across all three runtimes. Dependencies: None — A.2/A.3/B all complete on the HITL surface. Owner: TBD; needs its own brainstorm session to choose the architectural model before scoping the plan.


Docker orphan-container reaper (label-scoped sweep)

Why future: Small, self-contained reliability feature, but it needs its own design decision (when/where to run the sweep, how to scope it so it never reaps a live workspace) and a gated integration test against a real daemon. Out of scope for the round-2 review fixes.

Current state: DockerWorkspaceProvider.openInternal has a real leak window: if the Docker daemon successfully creates the container but the createContainer HTTP response is lost (the call rejects), containerId stays undefined, so the compensation path in the catch cannot remove the now-orphaned container — it leaks. Every container the provider creates already carries helix.workspace.id and helix.session.id labels (buildContainerSpec), so a label-scoped reaper COULD enumerate and remove these orphans, but no such reaper is implemented today. The inline comment at provider.ts (the createContainer call site) points here.

Scope for the future session:

  1. Add a label-filtered list capability to the DockerEngine interface (e.g. listContainers({ labelSelector })) + its DockerodeEngine and FakeDockerEngine implementations.
  2. Implement a reaper that lists containers carrying helix.workspace.id, identifies ones with no live workspace (e.g. by age + absence from an in-process registry, or a "created-but-never-started" heuristic), and removes them via the now-idempotent engine.remove().
  3. Decide the trigger: opt-in periodic sweep, a one-shot sweep at provider construction, and/or a sweep on the next open(). Must never reap a container that belongs to a workspace this process is actively using.
  4. Gated integration test (HELIX_DOCKER_TESTS=1): simulate an orphan (create a labelled container without tracking it), run the reaper, assert it is removed and that a live workspace's container is left untouched.
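
One possible shape for the sweep in steps 1 and 2, as a sketch under stated assumptions: the DockerEngineLike interface, the ContainerSummary fields, and the age-plus-registry heuristic are all hypothetical here. Only the helix.workspace.id label, the proposed listContainers({ labelSelector }) call, and the idempotent remove() come from this document.

```typescript
// Hypothetical shapes: the document only proposes listContainers({ labelSelector }).
interface ContainerSummary {
  id: string;
  labels: Record<string, string>;
  createdAtMs: number;
}

interface DockerEngineLike {
  listContainers(opts: { labelSelector: string }): Promise<ContainerSummary[]>;
  remove(id: string): Promise<void>; // assumed idempotent (404 swallowed)
}

interface ReapOptions {
  liveWorkspaceIds: ReadonlySet<string>; // in-process registry of active workspaces
  minAgeMs: number; // grace period so an in-flight open() is never reaped
  nowMs: number;
}

async function reapOrphans(engine: DockerEngineLike, opts: ReapOptions): Promise<string[]> {
  const candidates = await engine.listContainers({ labelSelector: 'helix.workspace.id' });
  const removed: string[] = [];
  for (const container of candidates) {
    const workspaceId = container.labels['helix.workspace.id'];
    if (workspaceId === undefined) continue; // not ours: never touch it
    if (opts.liveWorkspaceIds.has(workspaceId)) continue; // actively used by this process
    if (opts.nowMs - container.createdAtMs < opts.minAgeMs) continue; // too young to judge
    await engine.remove(container.id);
    removed.push(container.id);
  }
  return removed;
}
```

The registry check only protects workspaces owned by the current process. If several processes share a daemon, the age threshold alone is not a safe liveness signal, which is part of the design decision in step 3.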

Estimated effort: ~1-2 days. Dependencies: None — builds on the idempotent remove() (404-swallow) already shipped. Priority: low — the leak window only opens on a lost createContainer response (rare), and orphaned containers are stopped (sleep infinity with no started process) so they consume little.


Local-sandbox seatbelt profile delivery: verify -f reliability across macOS versions

Why future: Small investigation plus a possible hardening fallback; not blocking, but confidentiality/integrity-relevant.

Current state: workspace-local-sandbox delivers the seatbelt SBPL profile via sandbox-exec -f <file> — the profile is written to .helix-sandbox.sb inside the workspace tmpdir and self-protected by a trailing (deny file-write* (literal <profile>)) (SBPL last-match-wins). The real-seatbelt tests pass on the current dev macOS version, but we have validated only one macOS release + profile shape. sandbox-exec -f with subpath path-filter rules has a reputation for behaving inconsistently across macOS releases; if that affected our file-write* subpath rules, the write-confinement could mis-apply WITHOUT failing loudly (the command would still run, just with a wider write boundary than intended).

Scope for the future session:

  1. Exercise the real seatbelt profile (write-confined-to-workspace + network-deny + the self-protect deny) across 2-3 macOS versions and record a verified-version matrix.
  2. If any inconsistency surfaces, add an inline -p <profile-string> delivery path as a fallback (we already build the full profile string in buildSeatbeltProfile, so it is cheap) and select between -f and -p per host.
  3. Either way, document which delivery mechanism is used and the verified macOS versions.
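
To make the -f versus -p choice in step 2 concrete, here is a rough sketch. It is not the real buildSeatbeltProfile, whose output this document does not show: the rule set below is an assumed profile shape (write-confined, network-deny, trailing self-protect deny), and buildProfileSketch, sbplQuote, and sandboxExecArgs are hypothetical names. The -f and -p flags are the standard sandbox-exec options for a profile file and an inline profile string.

```typescript
// Hypothetical sketch; the real buildSeatbeltProfile is not shown in this document.
function sbplQuote(path: string): string {
  // SBPL string literals are double-quoted; escape backslashes and quotes.
  return `"${path.replace(/\\/g, '\\\\').replace(/"/g, '\\"')}"`;
}

function buildProfileSketch(workspaceDir: string, profilePath: string): string {
  return [
    '(version 1)',
    '(allow default)',
    '(deny network*)',
    '(deny file-write*)',
    `(allow file-write* (subpath ${sbplQuote(workspaceDir)}))`,
    // Must stay last: with last-match-wins, this keeps the profile file itself read-only
    // even though it lives inside the writable workspace directory.
    `(deny file-write* (literal ${sbplQuote(profilePath)}))`,
  ].join('\n');
}

type ProfileDelivery = 'file' | 'inline';

// Returns the argv that would follow `sandbox-exec`.
function sandboxExecArgs(
  delivery: ProfileDelivery,
  profile: string,
  profilePath: string,
  command: string[],
): string[] {
  const profileArgs = delivery === 'file' ? ['-f', profilePath] : ['-p', profile];
  return [...profileArgs, ...command];
}
```

Because both paths consume the same profile string, a per-host selection between them would not change the rules, only how they reach sandbox-exec.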

Estimated effort: ~1 day. Dependencies: none. Priority: medium — no evidence of a problem on the versions we run today, but the failure mode (silent under-confinement) is the kind we want to rule out explicitly.


Docker workspace: optional container-reuse ("reconnect") mode

Why future: Real design tension with the recreate-over-persisted-tmpdir resume model; needs a brainstorm before scoping.

Current state: DockerWorkspaceProvider.resolve() recreates a FRESH container over the persisted, bind-mounted tmpdir. This is robust for cross-process / cross-host resume (only the workspace files need to survive), but it discards any state that lived ONLY inside the container across a suspend/resume — installed packages, /tmp contents, a long-running side-process, a warmed-up toolchain.

Scope for the future session:

  1. Add an opt-in reuse: 'recreate' | 'reconnect' provider/config option, default 'recreate' (the current behavior — no change for existing agents).
  2. In 'reconnect' mode, find the surviving container on resume (the container already carries a helix.workspace.id label); reconnect if it is still running, fall back to recreate if it is gone.
  3. Reconnect-drift detection: when reconnecting, compare the live container's HostConfig (cap-drop, network mode, resource limits, user, read-only rootfs, etc.) against the requested hardened spec, and warn (or refuse) if they diverge — an existing container cannot be re-hardened in place, only recreated.
  4. Reconcile with the stateless-suspension contract: a reconnect-mode workspace is no longer purely file-resumable (it depends on the container surviving the process restart). Document the trade-off and the failure behavior when the container is gone.
  5. Gated integration test (HELIX_DOCKER_TESTS=1): resume reconnects to the same container and observes in-container-only state; the drift warning fires when the live config differs from the requested spec.
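
The drift check in step 3 and the fallback in step 2 might look something like the sketch below. HardenedSpecSketch is a hypothetical, flattened subset of the fields named above, not the provider's ContainerSpec, and the 'warn' | 'refuse' policy parameter is an assumption about how the "warn (or refuse)" choice could be exposed.

```typescript
// Hypothetical subset of the hardened spec; field names are illustrative.
interface HardenedSpecSketch {
  networkMode: string;
  capDrop: string[];
  readonlyRootfs: boolean;
  memoryBytes: number;
  pidsLimit: number;
  user: string;
}

function sameMembers(a: string[], b: string[]): boolean {
  const left = Array.from(new Set(a)).sort();
  const right = Array.from(new Set(b)).sort();
  return left.length === right.length && left.every((value, i) => value === right[i]);
}

// Returns the names of the fields where the live container differs from the request.
function detectDrift(live: HardenedSpecSketch, requested: HardenedSpecSketch): string[] {
  const drift: string[] = [];
  if (live.networkMode !== requested.networkMode) drift.push('networkMode');
  if (!sameMembers(live.capDrop, requested.capDrop)) drift.push('capDrop');
  if (live.readonlyRootfs !== requested.readonlyRootfs) drift.push('readonlyRootfs');
  if (live.memoryBytes !== requested.memoryBytes) drift.push('memoryBytes');
  if (live.pidsLimit !== requested.pidsLimit) drift.push('pidsLimit');
  if (live.user !== requested.user) drift.push('user');
  return drift;
}

type ResumeAction = 'reconnect' | 'recreate' | 'refuse';

function chooseResumeAction(
  live: HardenedSpecSketch | undefined,
  requested: HardenedSpecSketch,
  onDrift: 'warn' | 'refuse',
): { action: ResumeAction; drift: string[] } {
  if (live === undefined) return { action: 'recreate', drift: [] }; // container is gone
  const drift = detectDrift(live, requested);
  if (drift.length > 0 && onDrift === 'refuse') return { action: 'refuse', drift };
  return { action: 'reconnect', drift }; // caller logs a warning when drift is non-empty
}
```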

Estimated effort: ~3-5 days incl. the brainstorm. Dependencies: none technically; the tension with the stateless-suspension model needs a design decision first. Priority: low-medium — driven by demand for preserving in-container state across resume.


Generic mount capability for workspace providers

Why future: Ergonomics / consolidation; overlaps existing per-provider knobs, so the first task is deciding whether a new capability is warranted at all.

Current state: Mounting external storage into a workspace is provider-specific today: workspace-local-sandbox exposes readWritePaths / readOnlyPaths, workspace-docker bind-mounts only its own session tmpdir (no user-configurable extra mounts), and cloudflare-sandbox has bucketMounts. There is no unified, first-class "mount this host directory (or bucket) into the workspace at path P, read-only or read-write" capability with consistent semantics across providers.

Scope for the future session:

  1. Decide whether a generic mount capability (source → workspace path, ro/rw) is worth a new capability slot, or whether per-provider config plus better docs is sufficient.
  2. If pursued: a mount config that local-sandbox maps to an extra bind + isolation-allowlist entry, docker maps to an extra HostConfig.Bind, and cloudflare-sandbox maps to bucketMounts — with consistent semantics and the same hardening guarantees (writes stay confined unless the mount is explicitly rw).
  3. Treat the security surface explicitly: a mount widens the read/write boundary, so it must be opt-in, audited, and reflected in the capability/ref so it survives resume.
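
If the capability is pursued, the per-provider mapping in step 2 could be sketched like this. MountSpecSketch and both mapping functions are hypothetical. The readOnlyPaths / readWritePaths names come from the local-sandbox config mentioned above, and the source:target:mode string is the usual Docker bind-mount syntax. The cloudflare-sandbox mapping is omitted because the shape of bucketMounts is not described here.

```typescript
// Hypothetical unified mount description; not an existing type in the codebase.
interface MountSpecSketch {
  source: string; // host directory
  target: string; // path inside the workspace
  mode?: 'ro' | 'rw'; // omitted means read-only, so writes stay confined by default
}

// Docker bind-mount string: "<host path>:<container path>:<ro|rw>".
function toDockerBind(mount: MountSpecSketch): string {
  return `${mount.source}:${mount.target}:${mount.mode ?? 'ro'}`;
}

// Assumed mapping onto the local-sandbox allowlists named in this document.
function toLocalSandboxPaths(mounts: MountSpecSketch[]): {
  readOnlyPaths: string[];
  readWritePaths: string[];
} {
  const readOnlyPaths: string[] = [];
  const readWritePaths: string[] = [];
  for (const mount of mounts) {
    (mount.mode === 'rw' ? readWritePaths : readOnlyPaths).push(mount.source);
  }
  return { readOnlyPaths, readWritePaths };
}
```

Defaulting to read-only keeps the sketch aligned with step 3: widening the write boundary always takes an explicit rw.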

Estimated effort: ~3-4 days if pursued as a capability; ~half a day if it lands as per-provider config + docs consolidation. Dependencies: none. Priority: low — current per-provider knobs cover most cases.


Docker workspace: optional advanced HostConfig escape hatches

Why future: Power-user knobs that relax the hardened baseline ONLY when explicitly requested; demand-driven.

Current state: DockerWorkspaceConfig deliberately exposes a minimal, hardened surface — image, network (off/allow), memoryMb, cpus, pidsLimit, pullPolicy. The daemon supports more (per-resource ulimits, memory-swap, re-adding a single dropped capability, custom security-opt, extra tmpfs mounts). A legitimate workload occasionally needs one of these (e.g. a higher file-descriptor ulimit, or re-adding one capability a tool requires), and today there is no escape hatch short of forking the provider.

Scope for the future session:

  1. Add explicit, documented optional fields for the most-requested knobs (candidates: ulimits, memorySwapMb, capAdd, extra tmpfs paths), each clearly marked as relaxing the hardened baseline.
  2. Keep the secure defaults UNCHANGED — these are additive opt-ins; an agent that sets none gets exactly today's hardened spec.
  3. Validate / clamp where it matters (e.g. continue to refuse privileged-style escalations, or gate the riskiest ones behind an explicit acknowledgment flag).
  4. Thread them through ContainerSpec → dockerode HostConfig + the ref payload (so they survive resume) + unit and gated-integration tests asserting the knob actually takes effect on the daemon.
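
Step 3's validate / clamp idea, as a rough sketch. Everything named here is hypothetical: the option fields, the acknowledgment flag, the refused-capability list, and the ulimit ceiling are illustrative choices, not decisions this document has made.

```typescript
// Hypothetical option shape; none of these fields exist on DockerWorkspaceConfig today.
interface AdvancedHostOptionsSketch {
  capAdd?: string[];
  nofileUlimit?: number;
  acknowledgeRelaxedHardening?: boolean;
}

interface NormalizedAdvancedOptions {
  capAdd: string[];
  nofileUlimit: number | undefined;
}

// Illustrative denylist of capabilities that amount to privileged-style escalation.
const REFUSED_CAPABILITIES: ReadonlySet<string> = new Set(['SYS_ADMIN', 'SYS_MODULE', 'SYS_PTRACE']);
const MAX_NOFILE = 1048576; // illustrative ceiling

function validateAdvancedOptions(opts: AdvancedHostOptionsSketch): NormalizedAdvancedOptions {
  // Accept both "CAP_NET_ADMIN" and "net_admin" spellings.
  const capAdd = (opts.capAdd ?? []).map((cap) => cap.toUpperCase().replace(/^CAP_/, ''));
  for (const cap of capAdd) {
    if (REFUSED_CAPABILITIES.has(cap)) throw new Error(`capability ${cap} is never allowed`);
  }
  if (capAdd.length > 0 && opts.acknowledgeRelaxedHardening !== true) {
    throw new Error('capAdd relaxes the hardened baseline; set acknowledgeRelaxedHardening');
  }
  const nofileUlimit =
    opts.nofileUlimit === undefined
      ? undefined
      : Math.min(Math.max(1, Math.floor(opts.nofileUlimit)), MAX_NOFILE);
  return { capAdd, nofileUlimit };
}
```

An agent that passes an empty options object gets an empty capAdd and no ulimit override, which matches the "secure defaults unchanged" requirement in step 2.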

Estimated effort: ~2-3 days. Dependencies: none. Priority: low — add when a concrete need appears; the minimal hardened surface is intentional.


Investigate decoupling the workspace filesystem from the execution sandbox

Why future: Architectural exploration that could be a significant model change; it needs investigation and a design decision before any commitment.

Current state: Each provider couples its fs and shell together — workspace-docker exposes a host-side TmpdirFileSystem over the same directory the container sees at /workspace; workspace-local-sandbox (and local-bash) expose a tmpdir filesystem and a shell over the same tmpdir. This coupling guarantees coherence (a file written via fs is immediately visible to shell and vice versa) and keeps the providers simple, but it means you cannot pair one provider's filesystem with a different provider's execution backend.

Investigation goals:

  1. Determine whether there is real demand for composing a filesystem implementation with a SEPARATE execution backend — e.g. a durable / remote filesystem paired with a container exec, or a single filesystem shared across multiple exec backends.
  2. If so, sketch what a decoupled model would look like in the Workspace abstraction (e.g. a FileSystemProvider and an ExecutionProvider that a Workspace composes), and crucially HOW it would preserve the coherence guarantee — the exec backend must see the filesystem's bytes — without the bind-mount shortcut.
  3. Weigh the cost honestly: the bind-mount coupling is load-bearing for our providers' simplicity and for reusing the single hardened TmpdirFileSystem. Decoupling reintroduces a byte-transport / synchronization problem (getting an arbitrary fs's bytes into an arbitrary exec environment) that bind mounts currently solve for free.
  4. Decide whether this is worth pursuing, or whether the coupled model is the right long-term choice.

Estimated effort: ~1-2 days investigation + design before any implementation scoping. Dependencies: none. Priority: low — exploratory; pursue only if a concrete composition need emerges.


F6 — Atomic-merge across parallel sibling writers (counters + array-append on DBOS)

Why future: Real feature work that needs an API design decision. Consistently scoped out of D, A.1, A.2, and the original B carving as "separate sub-project" (per test-infrastructure-roadmap.md F6 entry and the A.2 design spec).

Current state: Two distinct gaps live under this F6:

  1. Scalar counters on every runtime. Parallel scalar customState writes are last-write-wins on every runtime + every state store (verified in packages/e2e/src/__tests__/staged-state.integ.test.ts:335-338). Even runtime-js + runtime-cloudflare's in-memory delta-merge doesn't cover scalars — arrayDeltaMode: true only classifies pure-append array mutations as deltas; everything else (scalars, objects, modified-or-shrunk arrays) falls through to { kind: 'replace' } and is LWW. Closing this needs a new IncrementBy / DecrementBy WriteOp kind.

  2. Array-append on DBOS (parity gap with JS / Cloudflare). With arrayDeltaMode: true, runtime-js and runtime-cloudflare merge parallel sibling array-appends via per-tool trackers + an in-memory applyStepWrites at promotion — both deltas survive. On DBOS, each parallel tool runs as an independent durable step that writes its delta to the SAME staging key (the stepId) and loads state fresh from the store, so the last writer's delta overwrites the first — only ONE append survives at promotion. The DBOS staging-atomicity test (packages/runtime-dbos/src/__tests__/integration/staging-atomicity.integ.test.ts) asserts length === 1 for two parallel writers and documents the gap explicitly. Closing this needs either (a) per-tool staging slots scoped by (stepId, toolCallId) instead of just stepId, plus a fan-in merge at promotion; or (b) collapsing the parallel work into a single orchestrator step.
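
The DBOS gap in item 2 comes down to how the staging slot is keyed. The toy model below contrasts the two keyings. It is an illustration only: AppendDelta and both functions are hypothetical, and real staging goes through the state store rather than an in-memory Map.

```typescript
// Illustrative model only; names and shapes are assumptions, not runtime-dbos code.
interface AppendDelta {
  toolCallId: string;
  items: string[];
}

// Behavior described above: one slot per stepId, so a later write replaces an earlier one.
function promoteWithStepSlots(base: string[], stepId: string, writes: AppendDelta[]): string[] {
  const slots = new Map<string, AppendDelta>();
  for (const write of writes) slots.set(stepId, write);
  const survivor = slots.get(stepId);
  return survivor ? [...base, ...survivor.items] : [...base];
}

// Option (a): one slot per (stepId, toolCallId), merged in a fan-in pass at promotion.
function promoteWithToolSlots(base: string[], stepId: string, writes: AppendDelta[]): string[] {
  const slots = new Map<string, AppendDelta>();
  for (const write of writes) slots.set(`${stepId}:${write.toolCallId}`, write);
  const merged = [...base];
  // Sort keys so the merge order does not depend on arrival order.
  for (const key of Array.from(slots.keys()).sort()) {
    const delta = slots.get(key);
    if (delta) merged.push(...delta.items);
  }
  return merged;
}
```

With two parallel writers, the first function keeps one append (the length === 1 that the current test pins) and the second keeps both.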

Scope (was "counter-only" before; broadened in the integ-fix fix-round): atomic-merge across parallel sibling writers covers BOTH counters (everywhere) and array-append on DBOS specifically.

Workaround for users today: Use array-append semantics (push entries, assert array length) on JS / Cloudflare; on DBOS, serialize parallel writes via a single orchestrator tool that does the appends sequentially, or fan out via separate child workflows.

Scope for the future session — API decision needed:

  1. Option F6.A — New MergeChanges opcodes. Add IncrementBy / DecrementBy opcodes to the existing MergeChanges schema. Tools opt in via either Immer-equivalent diff-detection (recognize "this was an increment") or a new explicit API (ctx.incrementState('count', 1)). Pros: schema-free. Cons: detection is fragile; explicit API diverges from Immer style.

  2. Option F6.B — Schema-decoration for merge strategy: z.number().describe('@merge:counter'). State store reads schema, applies per-key merge on commit. Pros: declarative, automatic. Cons: couples store to schema; describe-based markers are brittle.

  3. Option F6.C — Explicit increment API only (no opcodes): ctx.incrementState(path, 1) generates the right opcode internally. Pros: simplest, explicit, no magic. Cons: diverges from natural Immer pattern users already know.
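
Whichever option is chosen, the merge has to behave differently from a replace. The sketch below shows why, under the assumption that a counter write is carried as an incrementBy op. The WriteOp shape is hypothetical: only the replace kind and the proposed IncrementBy name come from this document.

```typescript
// IncrementBy is the op kind proposed above; this shape for it is an assumption.
type WriteOp =
  | { kind: 'replace'; path: string; value: number }
  | { kind: 'incrementBy'; path: string; amount: number };

function applyWriteOps(state: Record<string, number>, ops: WriteOp[]): Record<string, number> {
  const next: Record<string, number> = { ...state };
  for (const op of ops) {
    if (op.kind === 'replace') {
      next[op.path] = op.value; // last write wins
    } else {
      next[op.path] = (next[op.path] ?? 0) + op.amount; // commutative, so order is irrelevant
    }
  }
  return next;
}

// Two parallel siblings both read count = 0 and each want to add 1.
const initialState = { count: 0 };
const viaReplace = applyWriteOps(initialState, [
  { kind: 'replace', path: 'count', value: initialState.count + 1 },
  { kind: 'replace', path: 'count', value: initialState.count + 1 },
]);
const viaIncrement = applyWriteOps(initialState, [
  { kind: 'incrementBy', path: 'count', amount: 1 },
  { kind: 'incrementBy', path: 'count', amount: 1 },
]);
```

Here viaReplace.count ends at 1 (one update is lost) while viaIncrement.count ends at 2. A DecrementBy op would be the same branch with a negative amount.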

Implementation work (regardless of API choice): ~1 week for the counter half; the DBOS array-append parity-fix adds another ~3-5 days (per-tool staging slot keyed by (stepId, toolCallId) + fan-in merge at promoteStaging, plus updates to every store's appendCustomState contract).

  • Core: new opcode types + apply logic
  • Each state store's applyMergeToCustomState (5 stores)
  • runtime-dbos: per-tool staging-slot scoping so parallel sibling appends don't overwrite each other (separate from the counter work but shares the merge-on-promotion plumbing)
  • Cross-runtime parity tests
  • Re-tighten staging-atomicity.integ.test.ts from length === 1 to length === N for N concurrent writers (covers both the counter case AND the DBOS array-append parity case)
  • Update CLAUDE.md / upgrade guide to remove the documented limit

Dependencies: None — independent of HITL/sub-agent/workspace work. Owner: TBD; needs its own brainstorm session to pick the API model before scoping the plan.


CFW Workflows γ-cascade re-spawn — ✅ landed

Closed by: commit 6cbd78808, tracked as FU-A2-40 in docs/dev/follow-ups.md. See the "Done items" section there for the full closure write-up. Mirrors runtime-temporal's FU-A2-09 closure: parent's commitSuspendedStep marks each suspendedAwaitingChildren entry as failed:'parent_suspended'; resume branch's applyResultsAndReload surfaces them via childrenToRespawn; the workflow body re-dispatches via workflowBinding.create({ id: 'agent__<type>__<id>__respawn-<attempt>' }) and polls each child's durable state until terminal, then drains via recordSubSessionResult + a final clear step that resets the parent's suspension discriminators when fully resolved.

Coverage: subagent-respawn-on-resume.integ.test.ts (3 D1+Miniflare tests). 220/220 runtime-cloudflare integ + 1048/1048 unit pass.


CF DO + CFW Workflows harness setup helpers (O5 — Phase 1 DONE)

Status: Phase 1 ✅ COMPLETE. cf-do-d1 and cfw-workflows-d1 are now first-class participants in the shared parity matrix, proven end-to-end on the lifecycle-hooks-parity suite. The IMPL_PENDING_O5 marker has been DROPPED from both descriptors. FU-A2-38 + FU-A2-39 ✅ landed earlier (see "Done items" in docs/dev/follow-ups.md) and were the foundation Phase 1 built on. Phases 2..N (the other harness parity suites) remain open, via the now-established recipe.

What landed (Phase 1 — the reusable machinery, built once):

  • Context-split registries: harness/registries/node-backends.ts (the 7 Node descriptors) + harness/registries/cf-backends.ts (the 2 CF descriptors). The CF entrypoints import CF_BACKENDS directly so the workerd bundle's static import graph never pulls the Node-only loaders (ioredis / pg / @temporalio / @dbos-inc). backend-descriptor.ts re-exports a composed ALL_BACKENDS for back-compat.
  • selectViable(registry, opts): the filter pipeline extracted to operate on any passed-in registry. getViableBackends is now a thin wrapper over selectViable(NODE_BACKENDS, …); CF entrypoints call selectViable(CF_BACKENDS, …) explicitly.
  • recordedHooks bridge: harness/recorded-hooks.ts provides a portable RecordedHooks recorder + wireAgentHooks(agent, rec). BackendEnv gained recordedHooks / resetRecordedHooks / prepareAgent, so the identical hook-cardinality + firing-order + previousRunId assertions hold whether the hooks fire in-process (Node) or inside the DO / workflow body (CF).
  • Node-import-free shared scenario module: parity/scenarios/lifecycle-hooks.ts exports runLifecycleHooksParity(backends, opts) — single-sourced scenario logic run by three thin entrypoints (Node .integ.test.ts, CF-DO .cf.test.ts, CFW .wf-noiso.test.ts).
  • Generic injectable HarnessAgentServer DO in packages/e2e/src/test-worker.ts (+ binding + migration v9 in wrangler.cloudflare.toml) — a createAgentServer-based DO that runs any injected hook-wired agent.
  • Both CF executor adapters: setupCfDoD1 (stub.fetch AgentExecutor adapter over the HARNESS_AGENT_SERVER DO) + cf-d1-common (shared D1 store + clearStore + LLM slot), and setupCfwWorkflowsD1 (env.AGENT_WORKFLOW.create() + poll adapter, consumed from runtime-cloudflare's workflows pool via @helix-agents/e2e deep exports).
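
The registry-agnostic filter idea behind selectViable can be sketched as below. This is not the harness implementation: the descriptor fields, the option shape, and the skip-reason bookkeeping are hypothetical, and the function is suffixed Sketch to keep it distinct from the real selectViable(registry, opts).

```typescript
// Hypothetical descriptor and option shapes; the real harness types are not shown here.
interface BackendDescriptorSketch {
  id: string;
  requiredEnv: string[]; // env vars that must be set for this backend to run
}

interface SelectOptionsSketch {
  env: Record<string, string | undefined>;
  only?: string[]; // optional allowlist of backend ids
}

interface SelectionSketch {
  viable: BackendDescriptorSketch[];
  skipped: Array<{ id: string; reason: string }>;
}

// Takes the registry as a parameter, so a CF entrypoint never has to import Node descriptors.
function selectViableSketch(
  registry: BackendDescriptorSketch[],
  opts: SelectOptionsSketch,
): SelectionSketch {
  const result: SelectionSketch = { viable: [], skipped: [] };
  for (const backend of registry) {
    if (opts.only && opts.only.indexOf(backend.id) === -1) {
      result.skipped.push({ id: backend.id, reason: 'not in allowlist' });
      continue;
    }
    const missing = backend.requiredEnv.filter((key) => !opts.env[key]);
    if (missing.length > 0) {
      result.skipped.push({ id: backend.id, reason: `missing env: ${missing.join(', ')}` });
      continue;
    }
    result.viable.push(backend);
  }
  return result;
}
```

Keeping the skip reasons next to the viable list is what makes an "all backends filtered out" summary cheap to print.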

Verified: the lifecycle-hooks-parity suite is parity-complete across all runtimes:

  • Node (7 backends × 3 scenarios = 21 tests): js-memory/redis/postgres, temporal-memory/redis/postgres, dbos-postgres — all green.
  • CF-DO (cf-do-d1, workerd e2e pool): Scenarios 1 + 2 EXECUTE + pass; Scenario 3 gated (FU-O5-CF-DO-APPROVAL-STREAM-READ).
  • CFW (cfw-workflows-d1, runtime-cloudflare workflows pool): Scenarios 1 + 3 EXECUTE + pass; Scenario 2 gated (FU-O5-CFW-TRACING-CONTEXT-PERSISTENCE).

Two gated divergences (tracked, NOT weakened — gated scenarios it.skip):

  • FU-O5-CF-DO-APPROVAL-STREAM-READ — CF-DO emits the tool_approval_request chunk into the DO's INTERNAL stream, not the binding-side DurableObjectStreamManager, and the approvalId is not persisted on state, so the harness can't recover it. Gates Scenario 3 on cf-do-d1.
  • FU-O5-CFW-TRACING-CONTEXT-PERSISTENCE — the CFW D1 store models tracingContext as suspension-scoped (NULLed on the resume-leg completion save), so the suspend-time write doesn't survive to completion. Gates Scenario 2 on cfw-workflows-d1.

Phases 2..N — what's left: apply the now-established recipe (shared scenario module + selectViable(CF_BACKENDS, …) + the two adapters + recordedHooks) to the OTHER harness parity suites (usage-subagent, expired-session, concurrency-invariants-stress, approval-gate-hook-parity, atomic-suspend-write, …). Each is a thin entrypoint over the existing machinery.

Priority: medium — Phase 1 establishes the pattern; remaining suites are incremental.


Redis customState pipeline → Lua atomic script — ✅ landed

Closed by: commit 7a2432d7a, tracked as FU-A2-41 in docs/dev/follow-ups.md. See the "Done items" section there for the full closure write-up. The SAVE_STATE_ATOMIC_SCRIPT now performs the CAS version check, main hash field write, TTL application, and the complete customState replacement (scalars + arrays + array-key index + per-key TTLs) inside a single EVAL call. The pre-fix non-atomic pipeline + orphan-recovery loop are both gone — partial state is now structurally impossible (any crash before script completion leaves the prior state untouched).

Coverage: packages/store-redis/src/__tests__/integration/save-state-atomic.integ.test.ts covers happy path, replacement semantics (absent-key-drop), scalar↔array type transitions, CAS rejection on stale version, 10-way parallel-save serialization, no-orphan-list-keys after a sequence of array/scalar/delete transitions, and empty-customState clears.
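
The semantics that the script guarantees can be modeled in a few lines. This is an in-memory model of the behavior described above, not the Lua script: the StoredSession shape, the starting version of 0, and the +1 version bump are assumptions made for the sketch.

```typescript
// In-memory model of the CAS + full-replacement semantics; not the actual Lua script.
interface StoredSession {
  version: number;
  customState: Record<string, unknown>;
}

type SaveResult = { ok: true; version: number } | { ok: false; reason: 'stale_version' };

function saveStateAtomicModel(
  store: Map<string, StoredSession>,
  sessionId: string,
  expectedVersion: number,
  nextCustomState: Record<string, unknown>,
): SaveResult {
  const current = store.get(sessionId) ?? { version: 0, customState: {} };
  // CAS check first: on mismatch nothing is written, so the prior state stays untouched.
  if (current.version !== expectedVersion) return { ok: false, reason: 'stale_version' };
  const version = current.version + 1;
  // Full replacement: keys absent from nextCustomState are dropped, not merged.
  store.set(sessionId, { version, customState: { ...nextCustomState } });
  return { ok: true, version };
}
```

In Redis the check and the write sit inside one EVAL, so no other command can run between them. The model gets the same effect only because a synchronous function cannot be interleaved.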


Pre-existing Temporal integration test bisect (FU-A2-42)

Why future: Diagnostic + fix; needs git bisect across the v7 stateless-suspension commit train. Tracked in docs/dev/follow-ups.md as FU-A2-42.

Current state: packages/runtime-temporal/src/__tests__/integration/temporal.integ.test.ts has 12 failing tests when run against a live Temporal server. Verified failing on commit a7b325f67 (the commit BEFORE round-2 work started), so they predate the round-2 fixes. The simplest failure (should use provided initial state) shows that initialState: { notes: ['Pre-existing 1'] } is being lost between runner.executeWorkflow and the workflow's customState — empty array reaches the state store.

Scope for the future session:

  1. Run git bisect between the last known-good commit (the v6→v7 transition point) and a7b325f67 to identify the breaking commit.
  2. Verify initialState reaches executeWorkflow and the workflow's Immer baseline.
  3. Likely a small fix once root cause is known.

Estimated effort: 0.5–1 day. Dependencies: None. Priority: medium — these tests cover real consumer-facing surfaces (initial state, conversation continuation, branching).


P2 polish backlog (round-2 review remainders) ✅ closed

Status: closed (P2 polish bundled sweep).

All six items landed in one session — interface contracts updated, parity tests added, and (per the user's standing "no module-level state" rule) module-level mutable state across packages audited and either encapsulated or explicitly documented as intentional.

P2 backlog items:

  1. failStream idempotency guard — added to memory store, Cloudflare D1 StreamDurableObject, and DOStreamManager. RedisStreamManager already had it via its CAS_TO_TERMINAL_SCRIPT. The contract is now documented at the interface level (packages/core/src/store/stream-manager.ts): endStream/failStream are idempotent — the first terminal writer wins. Cross-store parity pinned by packages/e2e/src/__tests__/stream-terminal-state-parity.integ.test.ts.
  2. getViableBackends skip-reason visibility — the harness now prints (a) a one-line summary when ALL backends got filtered out (the "0 tests ran with no signal why" pain point), and (b) a per-call structured breakdown when HELIX_TEST_VERBOSE_SKIP=1 is set.
  3. Debug console.log cleanup in non-canonical examples — audited every example app's source; the only "debug leftover" was examples/research-assistant-cloudflare-do/src/tools/web-search.ts (a tool-body console.log). Removed. The other console.log usages in examples are legitimate CLI-banner / demo-script output and the my-agent-server.ts lifecycle-hook example is a template illustrating where users would add their own logging.
  4. chunkParseCache per-instance LRU — closed earlier as part of the Redis round-3 closures (commit 8a411c26a); now a per-instance ChunkParseCache class with proper LRU eviction.
  5. Memory-store latestSequence parity test — the new stream-terminal-state-parity.integ.test.ts asserts uniformly across memory + Redis that getStreamInfo().latestSequence stays the monotonic counter after cleanupToStep shrinks the chunks table.
  6. prepareHelixReconnectRequest getter-form docs — JSDoc cross-references the resumeFromSequence / existingMessageId function-form contract (mirroring the prepareHelixChatRequest docs) so consumers know the getter pattern works on the reconnect path too.
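
The "first terminal writer wins" contract from item 1 can be illustrated with a small in-memory model. It is a sketch, not any of the real stream managers: the class name, the boolean return values, and the string error are assumptions chosen to make the contract observable.

```typescript
// Minimal model of idempotent terminal transitions; not a real StreamManager.
type StreamStatus = 'active' | 'ended' | 'failed';

class TerminalOnceStream {
  private current: StreamStatus = 'active';
  private failure: string | undefined = undefined;

  // Both methods return true only for the call that actually performed the transition.
  endStream(): boolean {
    return this.toTerminal('ended', undefined);
  }

  failStream(error: string): boolean {
    return this.toTerminal('failed', error);
  }

  info(): { status: StreamStatus; error: string | undefined } {
    return { status: this.current, error: this.failure };
  }

  private toTerminal(next: StreamStatus, error: string | undefined): boolean {
    if (this.current !== 'active') return false; // a terminal writer already won: no-op
    this.current = next;
    this.failure = error;
    return true;
  }
}
```

A late failStream after endStream (or a second failStream with a different error) leaves the recorded status and error unchanged, which is the property the parity test pins across stores.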

Bonus — module-level mutable state audit (per the user's "no module-level stuff" rule, applied to ALL of packages/, not just the P2 backlog scope):

  • packages/core/src/tracing/tracing-hooks.ts: tracingStateMap + cleanupCounter were module-level singletons shared across every createTracingHooks invocation in the process; rewritten as a TracingStateRegistry class instantiated per-invocation. The standalone getTraceContextFromHook / injectTraceContext exports (which read the singleton) now throw with a clear migration message pointing at @helix-agents/tracing-langfuse.
  • packages/runtime-dbos/src/steps/execute-companion-tool.ts — the let _companionDeps: BindCompanionToolDeps | null plain module-level singleton converted to a CompanionToolStep class with a static deps field, mirroring the ExecuteToolStep / ExecuteSubAgentStep patterns elsewhere in the package. (True per-instance scoping isn't possible without DBOS structural changes — workflow bodies are registered globally and run outside any executor instance context — but the class form gives the bind/get pair a named container so the lifecycle is greppable.)
  • packages/runtime-cloudflare/src/client-tool-workflow-helper.ts (rootOwnershipLocks) — documented inline as intentional isolate-scoped serialization. Within a CFW isolate every concurrent workflow shares the lock-space for the same rootId, which is exactly the serialization required. Moving to per-instance scoping would break the cross-workflow serialization.
  • packages/runtime-cloudflare/src/workspaces/sandbox/code.ts (warnedLanguages) — documented inline as intentional cross-process log-spam suppression. The Set is bounded by the count of distinct non-pinned languages (currently 0).
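
The singleton-to-registry change in the first bullet follows a common pattern, sketched below. The class body and hook names are hypothetical; only the idea of a TracingStateRegistry instantiated once per createTracingHooks invocation comes from this document.

```typescript
// Sketch of per-invocation state; the real TracingStateRegistry is not shown in this document.
interface SpanStateSketch {
  startedAtMs: number;
}

class TracingStateRegistrySketch {
  private readonly states = new Map<string, SpanStateSketch>();

  start(runId: string, nowMs: number): void {
    this.states.set(runId, { startedAtMs: nowMs });
  }

  finish(runId: string): SpanStateSketch | undefined {
    const state = this.states.get(runId);
    this.states.delete(runId);
    return state;
  }

  size(): number {
    return this.states.size;
  }
}

function createTracingHooksSketch() {
  // Created inside the factory, so each invocation owns its own state.
  const registry = new TracingStateRegistrySketch();
  return {
    onRunStart: (runId: string, nowMs: number): void => registry.start(runId, nowMs),
    onRunEnd: (runId: string): SpanStateSketch | undefined => registry.finish(runId),
    pending: (): number => registry.size(),
  };
}
```

Two hook sets created this way cannot see each other's runs, which is the isolation a module-level Map cannot provide.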

Round-3 remaining items (deferred)

Round-3 surfaced ~50 findings across 8 review angles (security, errors, concurrency, perf, type safety, migration, observability, build/release). Batches H–N landed the P0s + high-impact P1s. The items below are real but each is bounded enough to defer.

Redis round-3 closures landed (4 commits on omnara/stateless-suspension-redesign)

The Redis-side P3.R3-CONC, P3.R3-PERF, P3.R3-OBS, P3.R3-MISC, and P2-polish items shipped in commits 7a2432d7a..296ad28e7:

  • 7a2432d7a — FU-A2-41 atomic saveState Lua (no more split customState pipeline; orphan-recovery loop deleted)
  • 11723f815 — P3.R3-CONC + P3.R3-MISC: patchMetadata, updateStatus, setInterruptFlag atomic Lua + compareAndSetStatus const hoist
  • 8a411c26a — Redis-stream hardening: initStream atomic, scan-based getStreamCount, NOT_ACTIVE log, Logger option, maxChunks=0 startup warning, per-instance LRU chunkParseCache
  • 296ad28e7 — deleteSession TOCTOU (enumerate all status indexes), listRuns per-status secondary sorted-set index, cleanupOrphanedStagingData pipelined + structured summary log

Verification: 154 unit + 484 integ tests in store-redis (incl. 4 new test files dedicated to the Redis closures with 34 new tests covering atomicity, CAS, concurrency, type transitions, structural-orphan absence, and observability). Cross-runtime e2e suite: 1390 passed | 61 skipped | 1 failed (C-3 temporal-memory pre-existing timing flake; unrelated). Full npm run test:integration matrix: only 12 FU-A2-42 runtime-temporal failures (pre-existing) — every Redis-touching package green.

The remaining bullet items below are Postgres / D1 / non-Redis or not yet closed.

P3.R3-CONC: Concurrency CAS Lua follow-ups

Status: ✅ mostly closed — 5 of 6 items landed across commits 11723f815, 8a411c26a, 296ad28e7 (also tracked as FU-A2-44 in follow-ups.md). Only the RedisLockManager fencing-token bit remains. Surfaced by: round-3 review #3 P1.C1 / P1.C3 + P2 cluster.

Closed sub-items:

  1. patchMetadata — converted to atomic PATCH_METADATA_SCRIPT (read + merge + write in one EVAL). Two concurrent patches with disjoint keys can no longer race; per-key last-write-wins is preserved.
  2. updateStatus — converted to UPDATE_STATUS_ATOMIC_SCRIPT (status field write + interrupt-context handling + secondary-index ZREM/ZADD + TTL refresh, all in one EVAL). Concurrent transitions can no longer leave a session indexed in two status:* ZSETs.
  3. initStream — converted to atomic INIT_STREAM_SCRIPT (HSETNX status + sequence/createdAt + HSET updatedAt + EXPIRE meta/chunks in one EVAL). Closed the window where a crash between HSETNX status and EXPIRE left the stream key without a TTL.
  4. deleteSession TOCTOU — cleanup ZREM list now enumerates ALL valid SessionStatus values (not just the one read back from HGETALL), so a concurrent updateStatus between the HGETALL and the index Lua can't leave the sessionId orphaned in a status index we didn't touch.
  5. setInterruptFlag — converted to SET_INTERRUPT_FLAG_SCRIPT (HSET reason + timestamp + EXPIRE in one EVAL). Closed the narrow window where a process crash between the HSET and the EXPIRE left the flag without a TTL. Cross-store contract preserved: last-writer-wins on reason.

Still open:

  • RedisLockManager fencing-token INCR not atomic with the lock acquisition. Redlock's mutex still bounds the race per getNextFencingToken call, so this is low priority — not blocking production. Tracked for completeness; fix would consolidate the fencing-token bump into the same EVAL as the lock acquisition.

Effort for remaining: ~half-day. Priority: low — narrow race, bounded by Redlock's mutex.

P3.R3-PERF: Performance follow-ups ✅ closed (no longer tracked)

The six landed sub-items (Redis listRuns status index, D1 promoteStaging single targeted read, Postgres concurrentStatements migration phase, Postgres + D1 message_count denormalization, Redis cleanupOrphanedStagingData pipelining, Redis maxChunks=0 warning) are documented in the closure commits themselves. The one remaining sub-item — Postgres + D1 listSessions cursor pagination — is dropped from the radar; the per-row win from the denormalization closes the dominant cost, and cursor pagination is only a measurable win on dashboards that paginate beyond ~10k rows, which is rare in HITL deployments. If a real workload hits the OFFSET wall, re-open with intent.

P3.R3-OBS: Observability follow-ups ✅ closed

Status: closed (P3.R3-OBS sweep).

All seven sub-items landed across the sweep. Highlights:

  1. RedisStateStore.cleanupOrphanedStagingData Logger wired through constructor options (Round-3 closures, commit 296ad28e7).
  2. safeInvokeHook logger param tightened to required Logger. The logger ?? console fallback is gone — callers without a structured logger pass noopLogger. Test asserts no console writes happen on the noop path.
  3. parentSpanSource plumbed into Langfuse span metadata via registerResumedRun({ metadata: { 'helix.tracing.parentSpanSource' } }). Operators can now filter resumed runs by recovery path ('checkpoint' / 'session-state' / 'none') on their dashboards. Test pinned at suspend-resume-hooks.test.ts.
  4. Redis stream writer NOT_ACTIVE throw logs (commit 8a411c26a).
  5. interruptAgent 504 path emits LogEvent.agentServer.interruptDeadlineExceeded warn with { sessionId, deadlineMs, reason } before throwing. abortAgent has parity.
  6. skipped_record_tokens severity now depends on cause: debug when the LLM step errored (expected downstream effect of the already-logged upstream failure), warn when there's no error cause (adapter bug worth flagging).
  7. LogEvent canonical vocabulary added at packages/core/src/logger/events.ts. Migrated call sites: safeInvokeHook, agentServer.{authenticate,interrupt,abort,submitSchemaLimits}, usage.hook.{skipped,record_tokens,record_tool,record_subagent}, langfuse.lifecycle_hook_failed. Test enforces the <domain>.<subject>.<action> snake_case convention.
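Item 2 can be sketched as follows, assuming a minimal logger shape. `LoggerSketch`, `noopLoggerSketch`, and `safeInvokeHookSketch` are hypothetical stand-ins rather than the real signatures, and the sketch handles synchronous hooks only.

```typescript
// Hypothetical minimal Logger shape; the real interface may carry more methods.
interface LoggerSketch {
  info(message: string, fields?: Record<string, unknown>): void;
  warn(message: string, fields?: Record<string, unknown>): void;
  error(message: string, fields?: Record<string, unknown>): void;
}

// A logger that discards everything, for callers without a structured logger.
const noopLoggerSketch: LoggerSketch = {
  info: () => {},
  warn: () => {},
  error: () => {},
};

// The logger parameter is required, so there is no `logger ?? console`
// branch to fall into. A failing hook is reported and swallowed.
function safeInvokeHookSketch(
  hookName: string,
  hook: () => void,
  logger: LoggerSketch,
): boolean {
  try {
    hook();
    return true;
  } catch (err) {
    logger.warn('hook failed', { hookName, error: String(err) });
    return false;
  }
}

// Collecting logger used to observe what the invoker reports.
const capturedWarnings: string[] = [];
const capturingLoggerSketch: LoggerSketch = {
  ...noopLoggerSketch,
  warn: (message) => {
    capturedWarnings.push(message);
  },
};

const hookSucceeded = safeInvokeHookSketch(
  'onStep',
  () => {
    throw new Error('boom');
  },
  capturingLoggerSketch,
);
```

Making the parameter required pushes the choice onto the caller: passing the no-op logger is an explicit decision to drop output, rather than an implicit fallback to console.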

Additional console-fallback removals folded into this sweep (consistent with the user directive "we should always be using the logger"):

  • agent-server.ts allowUnauthenticated console.warn fallback — removed; warning routes through configured logger only.
  • ai-sdk/src/react/index.ts useAutoResync / useResumableChat console.error fallbacks — removed; consumers wire onError or read resyncError from the hook return.
  • core/src/orchestration/wait-for-status-transition.ts dev-only console.warn — removed.
  • core/src/workspace/types/metrics.ts consoleMetrics constant replaced with createLoggerMetrics(logger: Logger) factory.

The only remaining production-code console.* calls live in core/src/logger/console.ts (the legitimate consoleLogger implementation operators opt into) and core/src/logger/default.ts.

P3.R3-TYPE: Type-safety polish

Status: ✅ mostly closed — items 1-5 addressed by FU-TYPE-SAFETY-2026-05 (see follow-ups.md "Done items"). Only the zod/v3 ↔ zod/v4 schema-drift migration remains. Surfaced by: round-3 review #5 P2 cluster.

Closed via FU-TYPE-SAFETY-2026-05:

  1. Tool<any, any>[] and AgentConfig<any, any> epidemic — Stage A introduced AnyTool / AnyAgentConfig aliases as Tool<z.ZodType, z.ZodType> / AgentConfig<z.ZodType, z.ZodType> and propagated across 40+ files.
  2. AgentHooks<any, any> — tightened to AgentHooks<unknown, unknown> alongside the HookManager.invoke<K extends keyof AgentHooks> tightening that removed 6+ triple-casts.
  3. AgentConfig<any, any> in PersistentAgentConfig.agent — covered by the same Stage A bulk replace.
  4. getAllToolInvocations / getToolParts return types — Stage A switched to AISDKToolPart[] using the local isToolPart guard (which now validates state + toolCallId, not just the discriminator).
  5. ToolContext.getState<T>(): documented as a caller-driven API contract rather than refactored. JSDoc on core/src/types/tool.ts explains the contract end-to-end; the recommended pattern is Schema.parse(ctx.getState()) for safety-critical tools. Threading TState through Tool generics would be a deeper refactor that diverges from the current input/output-only generic shape; the documented contract is the chosen trade-off.
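The Schema.parse(ctx.getState()) pattern from item 5 might look like the sketch below. To stay dependency-free it uses a hand-rolled `ParserSketch` in place of a zod schema; `ToolContextSketch`, `makeContextSketch`, and `CounterStateParser` are hypothetical names, not the project's types.

```typescript
// Hypothetical stand-in for a schema object exposing `parse`, the same shape
// a zod schema offers. Hand-rolled so the sketch has no dependencies.
interface ParserSketch<T> {
  parse(input: unknown): T;
}

interface CounterStateSketch {
  count: number;
}

const CounterStateParser: ParserSketch<CounterStateSketch> = {
  parse(input: unknown): CounterStateSketch {
    if (typeof input !== 'object' || input === null) {
      throw new Error('state must be an object');
    }
    const count = (input as { count?: unknown }).count;
    if (typeof count !== 'number') {
      throw new Error('state.count must be a number');
    }
    return { count };
  },
};

// Hypothetical minimal context: getState<T>() trusts the caller's T and
// performs no runtime check, which is the caller-driven contract.
interface ToolContextSketch {
  getState<T>(): T;
}

function makeContextSketch(raw: unknown): ToolContextSketch {
  return {
    getState<T>(): T {
      return raw as T;
    },
  };
}

// Safety-critical tools validate instead of trusting the type parameter.
function readCount(ctx: ToolContextSketch): number {
  return CounterStateParser.parse(ctx.getState<unknown>()).count;
}
```

Requesting `unknown` from getState and parsing immediately keeps the unchecked cast out of tool code; a malformed state then fails at the parse call rather than further downstream.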

Still open:

  1. zod vs zod/v4 schema drift between ErrorDetailSchema (zod) and StreamFailEventSchema's inline copy (zod/v4). Migrate error-detail.ts to zod/v4 so both surfaces agree on a single schema source.

Effort for remaining: ~half-day. Priority: low — drift is detectable by tests; no runtime breakage today.

P3.R3-SEC: Security polish

Status: open. Surfaced by: round-3 review #1 P2 cluster.

  1. Body-size caps on /chat, /start, /resume agent-server routes (only /submit-tool-result has one today).
  2. SUBAGENT_TOOL_PREFIX collision check in buildEffectiveTools (mirrors the existing workspace_ / companion__ checks).
  3. sessionId length cap + character allowlist at the HTTP boundary before Redis key interpolation (log-injection vector via ANSI escapes in long sessionIds).
  4. Demo auth.ts: strengthen the comment and add an explicit "this checks ONLY presence" note to the function body.

Effort: ~half-day total. Priority: medium — bounded attack surface, but body-size DoS is real on auth-disabled deployments.
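Item 3 could be approached along these lines. This is a sketch under assumed limits: the 128-character cap, the allowlist pattern, and the `session:` key prefix are illustrative choices, and `sessionIdProblemSketch` / `sessionKeySketch` are hypothetical helpers.

```typescript
// Hypothetical limits; the real cap and allowlist are a design decision.
const MAX_SESSION_ID_LENGTH_SKETCH = 128;
const SESSION_ID_PATTERN_SKETCH = /^[A-Za-z0-9_-]+$/;

type SessionIdProblem = 'empty' | 'too_long' | 'bad_characters';

// Returns null when the id is acceptable, otherwise the reason it is not.
function sessionIdProblemSketch(raw: string): SessionIdProblem | null {
  if (raw.length === 0) return 'empty';
  if (raw.length > MAX_SESSION_ID_LENGTH_SKETCH) return 'too_long';
  if (!SESSION_ID_PATTERN_SKETCH.test(raw)) return 'bad_characters';
  return null;
}

// Only a validated id is ever interpolated into a key. The error message
// carries the reason code, never the raw input, so escapes cannot reach logs.
function sessionKeySketch(raw: string): string {
  const problem = sessionIdProblemSketch(raw);
  if (problem !== null) throw new Error(`invalid sessionId: ${problem}`);
  return `session:${raw}`;
}
```

An allowlist rejects ANSI escape sequences and other control characters by construction, so the check does not need to enumerate dangerous bytes.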

P3.R3-MISC: Round-3 P3 polish

Status: open

  • expiredSessionCleanup uses logger.info?. on REQUIRED interface methods (info/warn/error are non-optional in the Logger interface; ?. hides type errors).
  • cleanupOrphanedStagingData silently deletes parse-failed keys; log the count separately from genuine orphans.
  • getStreamCount: replace blocking KEYS with SCAN (or add an @internal warning).
  • compareAndSetStatus Lua script string-built per call → hoist to module-level const.

Effort: ~half-day. Priority: low.
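For the getStreamCount item, a cursor-based count could be sketched as below. `ScanClientSketch` is a hypothetical structural interface that mirrors the SCAN reply shape ([nextCursor, keys]); the actual client method signature used in the store is an assumption, and the in-memory fake exists only so the loop can be exercised.

```typescript
// Minimal structural interface for the one call the sketch needs.
interface ScanClientSketch {
  scan(cursor: string, pattern: string, count: number): Promise<[string, string[]]>;
}

// Counts keys matching a pattern by walking the cursor until it returns to "0".
// SCAN may report a key more than once, so results are de-duplicated.
async function countKeysWithScanSketch(
  client: ScanClientSketch,
  pattern: string,
  batchSize = 100,
): Promise<number> {
  const seen = new Set<string>();
  let cursor = '0';
  do {
    const [nextCursor, keys] = await client.scan(cursor, pattern, batchSize);
    for (const key of keys) seen.add(key);
    cursor = nextCursor;
  } while (cursor !== '0');
  return seen.size;
}

// In-memory fake that pages through a fixed key list, for illustration only.
// It supports a trailing-* prefix pattern and uses the cursor as an offset.
function makeFakeScanClient(allKeys: string[]): ScanClientSketch {
  return {
    async scan(cursor: string, pattern: string, count: number): Promise<[string, string[]]> {
      const prefix = pattern.replace(/\*$/, '');
      const matching = allKeys.filter((k) => k.startsWith(prefix));
      const start = Number(cursor);
      const page = matching.slice(start, start + count);
      const next = start + count >= matching.length ? '0' : String(start + count);
      const reply: [string, string[]] = [next, page];
      return reply;
    },
  };
}
```

Unlike KEYS, each SCAN call does a bounded amount of work, so other commands can run between pages; the cost is that the count is not a point-in-time snapshot.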


Back-compat removal pass — deferred deletions

Per round-3 inverted-review pair: 4 high-confidence deletions landed in commit 676d1b339. The following are real back-compat affordances we don't want, but each deletion is gated by a test or audit that needs sub-project scope:

P3.R3-BC-FALLBACK: defaultSaveStateAndPromoteStaging ✅ closed

Status: closed (P3.R3 back-compat sweep).

defaultSaveStateAndPromoteStaging was removed from packages/core/src/store/state-store.ts and all docs were updated to require atomic saveStateAndPromoteStaging from custom stores. The non-atomic sequential fallback opens a crash window between appendMessages → saveState → promoteStaging that defeats the purpose of the atomic primitive — the "STATE CORRUPTION RISK" doc note added in round-2 was the signal that the right answer was deletion, not "kept-and-warned." All in-tree stores (memory, redis, postgres, D1, DO) already shipped atomic implementations.

Docs updated:

  • docs/guide/state-stores.md — removed the fallback example, rephrased "two paths" as a single required atomic implementation.
  • docs/internals/session-model.md — removed "Fallback for third-party stores" subsection.
  • docs/upgrade-guides/v6-to-v7-stateless-suspension.md — clarified that the helper was removed and atomic is mandatory.
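The crash window described in this section can be made concrete with an in-memory sketch. The record shape, `saveSequentiallySketch`, and `saveAtomicallySketch` are hypothetical and do not reflect the real saveStateAndPromoteStaging signature; `crashAfter` simulates a process dying between writes.

```typescript
// Hypothetical in-memory session record; field names are illustrative only.
interface SessionRecordSketch {
  messages: string[];
  state: { step: number };
  staged: string[];
  promoted: string[];
}

interface SaveInputSketch {
  newMessages: string[];
  nextState: { step: number };
}

// Sequential shape: three separate writes. A crash after any one of them
// leaves the record partially updated.
function saveSequentiallySketch(
  record: SessionRecordSketch,
  input: SaveInputSketch,
  crashAfter?: 'appendMessages' | 'saveState',
): void {
  record.messages.push(...input.newMessages);
  if (crashAfter === 'appendMessages') throw new Error('simulated crash');
  record.state = input.nextState;
  if (crashAfter === 'saveState') throw new Error('simulated crash');
  record.promoted.push(...record.staged);
  record.staged = [];
}

// Atomic shape: build the next record on the side, then swap it in with a
// single assignment. Either every change is visible or none is.
function saveAtomicallySketch(
  holder: { record: SessionRecordSketch },
  input: SaveInputSketch,
): void {
  const current = holder.record;
  holder.record = {
    messages: [...current.messages, ...input.newMessages],
    state: input.nextState,
    staged: [],
    promoted: [...current.promoted, ...current.staged],
  };
}
```

In the sequential shape a crash after the first write leaves new messages paired with the old state and unpromoted staging, which is the inconsistency the atomic primitive is meant to rule out.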

P3.R3-BC-LUA-FALLBACK: allowSequentialFallback in Redis ✅ closed

Status: closed (commit 4a47638fe).

RedisStateStoreOptions.allowSequentialFallback option deleted, private promoteStagingSequential method deleted, conditional gone. The promoteStaging catch block now unconditionally rethrows the original Lua-EVAL error — there's no quiet non-atomic fallback that could silently corrupt state on a misconfigured production deployment.

Four unit-test files that previously used ioredis-mock (which doesn't support EVAL, hence why the fallback existed at all) were migrated to integration tests against real Redis under packages/store-redis/src/__tests__/integration/. This matches the project's existing integ-suite pattern and removes the last allowSequentialFallback consumer.

The in-code comment at redis-state.ts:2919-2927 documents the removal rationale (the catch-block "STATE CORRUPTION RISK" doc note added in round-2 was the signal that the right answer was deletion, not "kept-and-warned").

P3.R3-BC-CONVERTER: isToolResultError content-shape fallback ✅ closed

Status: closed (P3.R3 back-compat sweep).

The heuristic content-inspection fallback was deleted in packages/ai-sdk/src/converter/helix-to-aisdk-converter.ts. isToolResultError now reads ONLY the explicit metadata[COMMON_METADATA_KEYS.TOOL_FAILED] flag. Absent the flag, the tool is treated as successful, which is the safer default for partial-success outputs like { error: '', data: [...] }. Two regression tests at helix-to-aisdk-converter.test.ts (lines marked P3.R3-BC-CONVERTER closure) lock the new behavior.
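A flag-only check of this kind might look like the following sketch. The key string `'toolFailed'` is a placeholder for whatever COMMON_METADATA_KEYS.TOOL_FAILED resolves to, the message shape is hypothetical, and treating only a strict `true` as failure is an assumption.

```typescript
// Hypothetical key name; the real value lives behind COMMON_METADATA_KEYS.TOOL_FAILED.
const TOOL_FAILED_KEY_SKETCH = 'toolFailed';

interface ToolResultMessageSketch {
  content: string;
  metadata?: Record<string, unknown>;
}

// Only the explicit flag counts. The content is never inspected.
function isToolResultErrorSketch(message: ToolResultMessageSketch): boolean {
  return message.metadata?.[TOOL_FAILED_KEY_SKETCH] === true;
}

// Partial-success output: has an `error` field, but no failure flag.
const partialSuccess: ToolResultMessageSketch = {
  content: JSON.stringify({ error: '', data: [1, 2, 3] }),
};

// Explicitly flagged failure.
const flaggedFailure: ToolResultMessageSketch = {
  content: 'upstream timed out',
  metadata: { [TOOL_FAILED_KEY_SKETCH]: true },
};
```

Under this check `partialSuccess` is classified as a success even though its content mentions an `error` field, while `flaggedFailure` is classified as an error regardless of its content.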

P3.R3-BC-FRONTENDHANDLER: FrontendHandler redundancy with handleChatStream ✅ closed

Status: closed in v8. FrontendHandler + createFrontendHandler + createCloudflareFrontendHandler removed. The replacement surface (handleChatStream, buildSnapshot, getUIMessages, createCloudflareChatHandler) shipped in v7 and is the only public path going forward. See docs/upgrade-guides/v7-to-v8.md for the migration walkthrough and three observable behavior gaps (missing-stream HTTP 200 vs 204, no ValidationError class on bad-request rejection, derived generateMessageId for multi-turn de-dup).

Total deleted: 9730 LOC across 9 files (the 1425-LOC handler-factory.ts, its 77-case unit test, six FrontendHandler-only integ tests, and the Cloudflare convenience factory). The FrontendHandlerError base class survives (still used by route handlers + the express adapter's catch blocks); FrontendResponse survives (still used by buildSSEResponse + express adapters).

P3.R3-BC-MISC: Smaller back-compat removals ✅ closed

Status: closed (P3.R3 back-compat sweep).

  • buildAgentInput bare-string fallback — ✅ done. The bare-string return path at handle-chat-stream.ts was removed; the function now always returns the structured AgentInputObject form ({ message: [userMsg] }). Closure documented inline at handle-chat-stream.ts:821-830.
  • ReplayContent legacy ordering branch — ✅ done. The ~65-LOC duplicate emit path that flattened text / reasoning / toolCalls into a hardcoded order was deleted. The remaining ~25-line synthesizeOrderedItemsFromFlatFields helper is called ONLY at the input boundary as a convenience normalization for callers that didn't build an orderedItems array — the emit loop has a single code path against orderedItems regardless of input shape. Closure documented inline at replay-events.ts:175-194.
  • D1 migration chain collapse — ✅ done (commit b23000542). runMigration() now detects fresh databases and applies a single collapsed schema instead of walking V1..V10 incrementally — saves ~100-500ms on every fresh-DB worker boot. Schema parity vs the incremental path is pinned by an integ test that uses PRAGMA table_info / index_list to compare both paths structurally.
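The fresh-database shortcut in the last bullet can be sketched as a pure planning function. How runMigration() actually detects a fresh database is not stated here, so the sketch assumes a `null` current version means "no schema yet"; `planMigrationSketch` and the version constant are hypothetical.

```typescript
// Hypothetical latest version; the source mentions a V1..V10 chain.
const LATEST_SCHEMA_VERSION_SKETCH = 10;

type MigrationPlanSketch =
  | { kind: 'collapsed'; target: number }
  | { kind: 'incremental'; steps: number[] }
  | { kind: 'up_to_date' };

// `currentVersion` is null when no schema exists yet (assumed fresh-DB signal).
function planMigrationSketch(currentVersion: number | null): MigrationPlanSketch {
  if (currentVersion === null) {
    // Fresh database: apply the single collapsed schema.
    return { kind: 'collapsed', target: LATEST_SCHEMA_VERSION_SKETCH };
  }
  if (currentVersion >= LATEST_SCHEMA_VERSION_SKETCH) {
    return { kind: 'up_to_date' };
  }
  // Existing database: walk the remaining versions one by one.
  const steps: number[] = [];
  for (let v = currentVersion + 1; v <= LATEST_SCHEMA_VERSION_SKETCH; v++) {
    steps.push(v);
  }
  return { kind: 'incremental', steps };
}
```

Separating the plan from its execution keeps the branch easy to test, and it is the collapsed and incremental outcomes that a structural parity test needs to compare.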

Skills — deferred

Why future: Skills v1 (3-level progressive disclosure, in-code + filesystem providers) shipped intentionally scoped. The items below were either explicitly deferred in the design spec's "Future work" section or flagged by reviewers as fast-follows; each needs its own increment rather than inline polish on the v1 branch. See docs/superpowers/specs/2026-06-03-skills-progressive-disclosure-design.md and the Skills section of docs/internals/concepts.md for the shipped surface.

Deferred items:

  1. Claude-Code-compatible auto-discovery — a fileSystemSkillProvider pointed at ~/.claude/skills and ./.claude/skills (trivial follow-on; just default roots + the existing provider).
  2. Workspace-backed provider — read skills + run scripts through the workspace abstraction. Enables skills on Cloudflare (where node:fs is unavailable) AND true Level-3 script execution (v1 only discloses script content; running is delegated to the agent's shell/workspace tools).
  3. search_skills semantic/BM25 tool — for very large libraries where catalog-in-prompt no longer scales (catalog-in-prompt covers ~100 skills; beyond that, an index is needed).
  4. Programmatic re-load dedup + SKILL_INJECTION stamping: short-circuit a re-load_skill of an already-loaded skill at the dispatch layer (which has state.messages, unlike the tool's ToolContext), using the shipped collectLoadedSkillNames building block. Stamp { [SKILL_INJECTION]: true, skillIds } onto load_skill result messages at the cross-runtime tool-result construction seam, plus compaction reattachment of dropped skill bodies within a token budget.
  5. Request-context-tiered / dynamic skill sets — per-user/tenant skill catalogs, with explicit cache-invalidation accounting (v1 assumes a stable per-session skill set so the catalog stays in the cached prefix).
  6. Catalog description budget — a per-skill char cap + least-used-drop for very large libraries (v1 includes every skill's full description verbatim).
  7. allowed-tools enforcement — the field is parsed + carried on SkillMetadata in v1 but NOT wired into the permission/approval layer; enforcement (pre-approving a skill's declared tools) is a follow-up.
  8. fs provider hardening — (a) optional realpath-based symlink containment (v1's traversal guard is lexical and does not follow symlinks); (b) richer staleness detection that catches in-place SKILL.md/resource edits, not just root-entry add/remove; (c) the truncation notice on read_skill_file should reflect the actual returned slice, not the raw file size.
  9. DBOS catalog-resolution in a @DBOS.step() — v1 resolves an async (filesystem/remote) provider's catalog in the workflow body, which is fine for the deterministic in-code provider but non-deterministic for a remote/fs provider. A remote-provider story would move catalog resolution into a @DBOS.step().
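As one illustration of the deferred items, the catalog description budget (item 6) could work roughly like this. The sketch assumes a per-skill usage counter exists, which v1 does not necessarily track; `SkillEntrySketch`, `useCount`, and `budgetCatalogSketch` are hypothetical.

```typescript
// Hypothetical skill shape; `useCount` is an assumed usage statistic.
interface SkillEntrySketch {
  name: string;
  description: string;
  useCount: number;
}

// Caps each description, then drops least-used skills until the catalog
// fits the total budget. Both limits are illustrative parameters.
function budgetCatalogSketch(
  skills: SkillEntrySketch[],
  perSkillCap: number,
  totalBudget: number,
): SkillEntrySketch[] {
  const capped = skills.map((s) => ({
    ...s,
    description: s.description.slice(0, perSkillCap),
  }));
  // Most-used first, so trimming from the end drops the least-used.
  const byUse = [...capped].sort((a, b) => b.useCount - a.useCount);
  let total = byUse.reduce((sum, s) => sum + s.description.length, 0);
  while (byUse.length > 0 && total > totalBudget) {
    const dropped = byUse.pop();
    if (dropped) total -= dropped.description.length;
  }
  return byUse;
}
```

Note that the returned order is by usage rather than the original catalog order; a real implementation would likely need a stable order so the catalog stays in the cached prompt prefix.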

Newly-discovered (deep cross-runtime/integration review):

  1. DBOS async-provider determinism. resolveSkillsCatalog runs in the DBOS workflow body (not a @DBOS.step). For the in-code provider this is deterministic; for an async filesystem/remote provider it does IO in the workflow body. Impact is bounded — the recomputed catalog only feeds the checkpointed callLLM step (whose cached result is reused on replay), so a mid-run SKILL.md edit yields a divergent-but-discarded string, not a crash or wrong output. Fix later: resolve the catalog in a @DBOS.step (checkpointed), or reject async providers on DBOS at registerAgent. (Overlaps item 9; this is the determinism framing + the bounded-impact analysis.)
  2. CF DO/D1 oversized skill-body row. A load_skill body larger than Cloudflare DO SQL's ~2 MiB per-row limit fails with an opaque SQLite error — there is no pre-flight byte check on message-row inserts (the existing ~1.8 MiB cap applies only to stageChanges/customState, not message rows). Pre-existing for all large tool results; skills make large author-authored content more likely. Fix later: typed pre-flight size check on message inserts.
  3. DBOS SkillsRegistryHolder silent-miss hardening. Today a stale/empty holder resolves the catalog to '' silently, but the LiveModelRegistryHolder throw-guard front-runs it (the model resolves first and throws on a miss), so skilled agents fail loud on the model before the catalog can silently vanish — safe by ordering. Defense-in-depth: make SkillsRegistryHolder throw on a registered-but-missing agent so the guarantee isn't order-dependent.
  4. Test coverage. Add (a) a multi-step step-2 catalog assertion; (b) a skilled SUB-AGENT test. Mostly DONE in the comprehensive test pass: multi-step catalog-stability + body-persistence (JS), load_skill / read_skill_file dispatch on every runtime, skilled sub-agent isolation (JS, real parent→child path), skill-body cross-store round-trip + checkpoint/truncate (shared contract → all 5 stores), and a cross-runtime e2e matrix (packages/e2e/src/skills-cross-runtime-matrix*.test.ts, skills.cf.test.ts). Remaining: full parent→child sub-agent DISPATCH on Temporal/DBOS/CF requires the service-gated integration lane (Postgres/Temporal-devserver/workerd) — currently proven by code-reading + the per-runtime catalog/dispatch unit tests; the JS path has the full parent→child test.
  5. String tool-result serialization divergence across runtimes (general, affects skill bodies). runtime-js persists a string tool result RAW (run-loop.ts: typeof result === 'string' ? result : JSON.stringify(...)), but runtime-temporal/dbos/cloudflare build the message via the shared createToolResultMessage, which ALWAYS JSON.stringifys — so a string-returning tool's result (incl. load_skill/read_skill_file skill bodies) is persisted (and re-shown to the model) JSON-ENCODED (surrounding quotes + escaped inner quotes) on those three runtimes, vs clean raw text on JS. Surfaced by the e2e matrix live runs (tests normalize via a decode helper). Pre-existing for ALL string tool results — not skills-specific and not a correctness bug (the model reads both) — but it means skill bodies are noisier on 3 of 4 runtimes. Fix later (framework-wide, needs its own brainstorm — broad blast radius): unify string-tool-result persistence so all runtimes match runtime-js's raw-string handling.
  6. collectLoadedSkillNames counts a not-found load as "loaded". A not-found load_skill returns a graceful string result with no TOOL_FAILED marker, so the transcript-derived loaded-set helper treats the requested (missing) name as loaded. Harmless today (the helper is an exported inspector not yet wired into dispatch dedup), but when it is wired (deferred item 4) it should treat not-found as not-loaded, e.g. by marking not-found results with TOOL_FAILED or having the helper check result content. Note: collectLoadedSkillNames is now a public @helix-agents/core export (a pure, recovery-safe inspector; re-widened from internal so the e2e tests / consumer UIs can use it).
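The serialization divergence in item 5 is easy to reproduce in isolation. The two persistence functions below follow the expressions quoted in that item; the decode helper is a hypothetical stand-in for the one the e2e tests use, not its actual implementation.

```typescript
// runtime-js shape: a string result is persisted as-is.
function persistRawStringSketch(result: unknown): string {
  return typeof result === 'string' ? result : JSON.stringify(result);
}

// Shared-helper shape: every result is JSON-stringified, strings included.
function persistAlwaysStringifiedSketch(result: unknown): string {
  return JSON.stringify(result);
}

// Hypothetical decode helper: if the stored text is a JSON-encoded string,
// unwrap it; otherwise leave it alone.
function decodeIfJsonStringSketch(stored: string): string {
  try {
    const parsed: unknown = JSON.parse(stored);
    return typeof parsed === 'string' ? parsed : stored;
  } catch {
    return stored;
  }
}

const skillBody = 'Run "npm test" first.';
const rawForm = persistRawStringSketch(skillBody);
const encodedForm = persistAlwaysStringifiedSketch(skillBody);
// rawForm is the original text; encodedForm gains surrounding quotes and
// backslash-escaped inner quotes.
```

A decode-on-read helper like this is ambiguous for raw text that happens to be a valid JSON string literal, which is one reason unifying the write path is the cleaner long-term fix.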

Dependencies: none; all build on the shipped v1 surface. Priority: low-to-medium per item; deferred items (1) and (4) are the cheapest wins; newly-discovered item (5), the string tool-result serialization divergence, is higher-value-but-higher-risk (framework-wide).


(Other future work goes here as it's identified.)
