Interrupt and Resume
Helix Agents supports interrupting and resuming agent execution. This enables user-controlled pauses, crash recovery, and time-travel debugging.
Overview
When an agent is running, you can:
- Interrupt - Soft stop that saves state for later resumption
- Resume - Continue execution from where it stopped
This is different from abort, which is a hard stop that fails the agent permanently.
execute() vs resume() Semantics
Understanding the distinction between execute() and resume() is critical for correct agent lifecycle management.
execute() - Fresh Start
execute() always starts a fresh execution. Using the same sessionId continues the conversation within that session:
// New session with generated ID
const handle = await executor.execute(agent, 'Hello');
// New session with specific ID
const handle = await executor.execute(agent, 'Hello', { sessionId: 'my-session' });
// Continue conversation in existing session
const handle2 = await executor.execute(agent, 'Follow up', { sessionId: 'my-session' });When you call execute():
- Stream is reset - All previous stream chunks are cleared
- New run begins - A new run starts within the session
- Session state continues - Messages and custom state are preserved within the session
This means each execute() call starts a fresh run. If you call execute() with the same sessionId, the new run continues the conversation with access to previous messages. Clients streaming from a previous run will receive a stream_resync event notifying them of the discontinuity.
resume() - Continue Execution
resume() continues from where the agent stopped:
// Continue from last checkpoint
const newHandle = await handle.resume();
// Continue with additional context
const newHandle = await handle.resume({
mode: 'with_message',
message: 'Please continue',
});When you call resume():
- Stream is preserved - Existing stream chunks remain intact
- State is loaded - Agent state is restored from the checkpoint
- Execution continues - The agent resumes from the checkpoint step
When to Use Each
| Scenario | Method | Reason |
|---|---|---|
| Starting a new conversation | execute() | Fresh start needed |
| User sends new message to completed agent | execute() | New run, new stream |
| Crash recovery | resume() | Continue where left off |
| User interrupted, wants to continue | resume() | Preserve progress |
| Time-travel to earlier state | resume({ mode: 'from_checkpoint' }) | Branch from history |
| Tool confirmation received | resume({ mode: 'with_confirmation' }) | Continue paused run |
| Retry after failure | retry() | Restore checkpoint, re-attempt |
Stream Behavior Difference
The key technical difference is stream handling:
// execute() resets the stream
// - Clients see: stream_resync (if reconnecting) → new chunks from step 0
// - Old chunks are deleted
// resume() preserves the stream
// - Clients see: existing chunks → new chunks from checkpoint step
// - Stream continuity maintainedThis distinction matters for frontend integration. If you're building a chat UI:
- New conversation: Use
execute()- the UI should clear and show fresh messages - Reconnecting after disconnect: Use
resume()- the UI should show existing messages and continue
Retrying Failed Agents
Failed agents cannot be resumed with resume(). Use the retry() method for failure recovery.
Why retry() is Separate from resume()
| Aspect | resume() | retry() |
|---|---|---|
| Purpose | Continue interrupted/paused execution | Recover from failure |
| Valid statuses | interrupted, paused | failed |
| Stream behavior | Preserves chunks | Resets to checkpoint |
| State handling | Continues from current | Restores from checkpoint |
retry() Method
const result = await handle.result();
if (result.status === 'failed') {
// Retry from last checkpoint (default) - typically needs message
const retryHandle = await handle.retry({
message: 'Research quantum computing', // Re-provide the triggering message
});
// Or from specific checkpoint
const retryHandle = await handle.retry({
message: 'Research quantum computing',
checkpointId: 'cpv1-...',
});
// Or start completely fresh
const retryHandle = await handle.retry({
mode: 'from_start',
message: 'Research quantum computing',
});
}Message Requirement
When using from_checkpoint mode (default), the original user message that triggered the failure is part of the checkpoint state. You typically need to provide a message option to specify what to retry.
Without an explicit message, the retry will attempt to extract the last user message from the session history, but this may not work as expected in all cases. It's safer to always provide the message explicitly.
RetryOptions
| Option | Type | Default | Description |
|---|---|---|---|
mode | 'from_checkpoint' | 'from_start' | 'from_checkpoint' | How to retry |
checkpointId | string | Latest | Which checkpoint to restore from |
message | string | (extracted) | Message to retry with - recommended to always provide |
abortSignal | AbortSignal | - | Abort signal for cancellation |
Concurrency Safety
execute() Safety Check
execute() validates that the session is not already running:
import { AgentAlreadyRunningError } from '@helix-agents/core';
try {
await executor.execute(agent, message, { sessionId });
} catch (error) {
if (error instanceof AgentAlreadyRunningError) {
console.log(`Session ${error.sessionId} is already running`);
// Wait for existing execution or use different session
}
}This prevents state corruption from concurrent executions.
Atomic Status Transitions
All methods use appropriate concurrency control:
| Method | Protection | Mechanism |
|---|---|---|
execute() | Status check + StaleStateError | Rejects if running |
resume() | CAS | Atomic interrupted/paused → active |
retry() | CAS | Atomic failed → active |
When concurrent calls race past status checks, optimistic locking via version numbers ensures only one succeeds. The losing call receives a StaleStateError.
Basic Usage
Interrupting an Agent
Use handle.interrupt() to pause execution:
const handle = await executor.execute(agent, 'Research quantum computing');
// Later, interrupt the agent
await handle.interrupt('user_requested');
// Agent status is now 'interrupted'
const state = await handle.getState();
console.log(state.status); // 'interrupted'The agent will stop at the next safe point (between steps). Any in-progress step is rolled back to the last checkpoint.
Resuming an Agent
Use handle.resume() to continue execution:
// Check if we can resume
const { canResume, reason } = await handle.canResume();
if (canResume) {
// Resume execution
const newHandle = await handle.resume();
// Stream events from the resumed execution
for await (const chunk of (await newHandle.stream()) ?? []) {
console.log(chunk.type);
}
// Get final result
const result = await newHandle.result();
console.log(result.output);
}Resume Modes
The resume() method supports four modes:
continue (default)
Resume from where the agent stopped:
const newHandle = await handle.resume(); // Same as { mode: 'continue' }
const newHandle = await handle.resume({ mode: 'continue' });with_message
Resume and add a new user message to the conversation:
const newHandle = await handle.resume({
mode: 'with_message',
message: 'Actually, focus on quantum entanglement specifically',
});This is useful when the user wants to redirect the agent's focus.
with_confirmation
Resume a paused tool call with confirmation data:
// Agent paused waiting for confirmation on a tool
const newHandle = await handle.resume({
mode: 'with_confirmation',
data: { approved: true, notes: 'Proceed with the action' },
});This mode is used when a tool requires human-in-the-loop approval.
from_checkpoint
Resume from a specific historical checkpoint (time-travel):
// Get available checkpoints
const state = await handle.getState();
const checkpoints = await stateStore.listCheckpoints(handle.sessionId);
// Resume from an earlier checkpoint
const newHandle = await handle.resume({
mode: 'from_checkpoint',
checkpointId: checkpoints.items[0].id,
});Crash Recovery
If your process crashes while an agent is running, the state is preserved in the state store. On restart, you can resume:
// After restart, reconnect to the session
const handle = await executor.getHandle(agent, savedSessionId);
if (handle) {
const { canResume, reason } = await handle.canResume();
if (canResume) {
// Resume from last checkpoint
const resumed = await handle.resume();
const result = await resumed.result();
}
}For this to work, use a persistent state store like Redis:
import { RedisStateStore, RedisStreamManager } from '@helix-agents/store-redis';
const stateStore = new RedisStateStore({ host: 'localhost' });
const streamManager = new RedisStreamManager({ host: 'localhost' });
const executor = new JSAgentExecutor(stateStore, streamManager, llmAdapter);What Happens During Resume
When you call resume() on a crashed or interrupted agent, the framework performs cleanup before continuing:
- Load checkpoint - Get the last successfully completed step
- Clean up orphaned staging - Remove any uncommitted staged changes
- Truncate messages - Remove messages added after the checkpoint (prevents duplicates)
- Clean up stream chunks - Remove chunks after the checkpoint step
- Emit
stream_resync- Notify connected clients of the discontinuity - Continue execution - Resume from the checkpoint step
graph LR
subgraph Before ["Before Crash"]
direction LR
S1["Step 1"] --> S2["Step 2"] --> S3["Step 3"] --> S4["Step 4 ✗"]
end
S3 -. "Checkpoint" .-> S3
subgraph After ["After Resume"]
direction LR
R1["Step 1"] --> R2["Step 2"] --> R3["Step 3"] --> R4["Step 4"] --> R5["Step 5"]
end
R3 -. "Resume from<br/>checkpoint" .-> R3The truncation is critical for preventing duplicate messages. If step 4 partially completed (wrote messages but crashed before checkpoint), those messages are orphaned. truncateMessages() removes them so they're not duplicated when the step re-executes.
Message Count Protection
The framework only truncates messages when a checkpoint exists AND has messageCount > 0. This prevents data loss in edge cases like:
- First-step crash (no checkpoint yet)
- Old checkpoints from before
messageCountfield was added
Resumable Statuses
An agent can be resumed when its status is:
| Status | Resumable | Description |
|---|---|---|
running | Yes | Crash recovery - process died but state says running |
interrupted | Yes | User interrupted execution |
paused | Yes | Waiting for tool confirmation |
completed | No* | Agent finished successfully |
failed | No* | Agent failed with error |
*Terminal states can only be resumed with mode: 'from_checkpoint' to time-travel to a previous state.
Stream Events
The framework emits stream chunks for interrupt/resume events:
run_interrupted
Emitted when an agent is interrupted:
{
type: 'run_interrupted',
runId: 'run-123',
checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
reason: 'user_requested'
}run_resumed
Emitted when an agent resumes:
{
type: 'run_resumed',
runId: 'run-123',
fromCheckpointId: 'cpv1-run-123-s5-t1234567890-abc123',
fromStepCount: 5,
mode: 'continue'
}run_paused
Emitted when an agent pauses for confirmation:
{
type: 'run_paused',
runId: 'run-123',
reason: 'tool_confirmation_required',
pendingToolName: 'delete_file',
pendingToolCallId: 'call-456'
}checkpoint_created
Emitted when a checkpoint is saved:
{
type: 'checkpoint_created',
runId: 'run-123',
checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
stepCount: 5
}Error Handling
AgentAlreadyRunningError
Thrown when trying to resume an agent that's already running:
import { AgentAlreadyRunningError } from '@helix-agents/core';
try {
await handle.resume();
} catch (error) {
if (error instanceof AgentAlreadyRunningError) {
console.log(`Agent session ${error.sessionId} is already running`);
}
}AgentNotResumableError
Thrown when trying to resume an agent in a terminal state:
import { AgentNotResumableError } from '@helix-agents/core';
try {
await handle.resume();
} catch (error) {
if (error instanceof AgentNotResumableError) {
console.log(`Cannot resume: ${error.currentStatus}`);
}
}Runtime Considerations
JavaScript Runtime
The JS runtime handles interrupts in-process. When you call interrupt():
- The current step is marked for rollback
- Staged changes are discarded
- Status is set to 'interrupted'
- The run loop exits cleanly
Temporal Runtime
Temporal uses workflow signals for interrupts:
interrupt()sends a signal to the workflow- The workflow handles the signal at the next safe point
- State is persisted via Temporal's durability guarantees
- Resume creates a new workflow execution that continues from the checkpoint
Cloudflare Runtime
Cloudflare Workflows handle interrupts via instance coordination:
interrupt()updates the state and marks for stop- The workflow step completes and checks the interrupt flag
- Resume creates a new workflow instance with the same run ID
Interrupt Behavior During Sub-Agent Execution
When an agent spawns sub-agents (child workflows), interrupt signals propagate through the entire agent hierarchy. This enables responsive cancellation even during complex multi-agent operations.
How Interrupt Propagation Works
When you interrupt a parent agent that has running sub-agents:
- Parent receives interrupt - The interrupt flag is set on the parent agent
- Children are signaled - The parent propagates the interrupt to all running children
- Children stop gracefully - Each child detects the interrupt and stops at its next safe point
- Parent completes - Once all children have stopped, the parent returns with status
interrupted
Target Response Time: < 1 second from interrupt request to stopped execution, even with deeply nested sub-agents.
Per-Runtime Implementation
| Runtime | Mechanism | Latency |
|---|---|---|
| JS | AbortSignal propagation via batchController | Immediate (< 100ms) |
| Temporal | Signal handlers + Trigger primitive for instant wake | < 1 second |
| Cloudflare | Event-based: Promise.race with interrupt events | Immediate (< 100ms) |
JavaScript Runtime
The JS runtime uses AbortSignal for cooperative cancellation:
// Parent's AbortController is linked to child agents
// When parent.abort() is called, children receive abort signal immediately
// In custom tools, check the signal:
const myTool = defineTool({
execute: async (input, context) => {
for (const item of items) {
if (context.abortSignal.aborted) {
return { partial: true, processed: results };
}
// ... process item
}
},
});Temporal Runtime
Temporal uses workflow signals with a Trigger primitive for instant response:
// Platform adapter pattern for sub-second interrupt response:
import { Trigger, getExternalWorkflowHandle, setHandler } from '@temporalio/workflow';
const interruptTrigger = new Trigger<string>();
setHandler(interruptSignal, (reason) => {
interruptReason = reason;
interruptTrigger.resolve(reason); // Wake up immediately
});
// The workflow races child completion against interrupt trigger
// If interrupt wins, all running children are signaled to stopCloudflare Runtime
Cloudflare uses an event-based approach for immediate interrupt response:
When interrupt() is called:
- Interrupt flag is set via
stateStore.setInterruptFlag()(persisted) - Interrupt event is sent via
instance.sendEvent()for immediate wake-up - The workflow races completion events against interrupt events using
Promise.race - Whichever event arrives first wins the race
The workflow checks for interrupts at two points:
- Before each step: In the main execution loop
- Before spawning sub-agents: A pre-spawn check prevents spawning if already interrupted
This event-based approach provides immediate interrupt response (< 100ms) without polling overhead.
Best Practices for Sub-Agent Interrupts
- Keep sub-agent operations bounded - Long-running sub-agents delay interrupt response
- Use interruptible tools - Check
context.abortSignal.abortedin loops - Configure poll intervals appropriately - Shorter intervals = faster response, more overhead
- Design for partial results - Sub-agents should return meaningful partial output when interrupted
Best Practices
1. Always Check canResume()
Before resuming, verify the agent can be resumed:
const { canResume, reason } = await handle.canResume();
if (!canResume) {
console.log(`Cannot resume: ${reason}`);
return;
}2. Handle Resume Errors
Wrap resume calls in try-catch:
try {
const resumed = await handle.resume();
await resumed.result();
} catch (error) {
if (error instanceof AgentAlreadyRunningError) {
// Another process resumed first
} else if (error instanceof AgentNotResumableError) {
// Agent is in terminal state
}
}3. Use Persistent Storage for Production
For crash recovery to work, use Redis or another persistent store:
const stateStore = new RedisStateStore({
host: process.env.REDIS_HOST,
ttl: 86400 * 7, // 7 days retention
});4. Store Session IDs for Later Resumption
Save session IDs so you can reconnect after crashes:
const handle = await executor.execute(agent, input);
// Persist the session ID
await database.save({ sessionId: handle.sessionId, userId });
// Later, after restart
const savedSession = await database.get(userId);
const handle = await executor.getHandle(agent, savedSession.sessionId);Next Steps
- Checkpoints - Understand the checkpoint system
- Distributed Coordination - Multi-process coordination
- Streaming - Handle interrupt/resume stream events