
KLIS-5: Agent Lifecycle & Suspension

How agents handle interruption without losing work.

1. Purpose

Agents in a KLIS ecosystem are not continuous processes; they are interruptible state machines. KLIS-5 defines how agents gracefully handle interruptions (conflicts, preemption) without losing work.

2. Agent State Machine (Normative)

Implementations MUST track agents in one of the following high-level states:

  1. PLANNING: Reasoning about the task, generating the Intent Manifest. No network resources held.
  2. ACQUIRING: Submitting Manifest to Control Plane. Waiting for Granted or Conflict.
  3. EXECUTING: Holding valid leases. Performing side effects (I/O).
  4. PAUSED: Execution halted by a conflict-resolution instruction (WAIT or DIE). State serialized.
  5. VERIFYING: Work done. Leases held. Validating results.
  6. COMPLETED: Leases released. Success.
  7. ABORTED: Leases released. Failure/Rollback.
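
A minimal sketch of how an implementation might track these states. The state names come from the list above; the ALLOWED_TRANSITIONS table and the transition helper are illustrative assumptions, since this section names the states but does not enumerate every legal edge.

```python
from enum import Enum


class AgentState(Enum):
    PLANNING = "PLANNING"
    ACQUIRING = "ACQUIRING"
    EXECUTING = "EXECUTING"
    PAUSED = "PAUSED"
    VERIFYING = "VERIFYING"
    COMPLETED = "COMPLETED"
    ABORTED = "ABORTED"


# ASSUMPTION: this edge set is a plausible reading of the spec, not normative.
ALLOWED_TRANSITIONS = {
    AgentState.PLANNING: {AgentState.ACQUIRING, AgentState.ABORTED},
    AgentState.ACQUIRING: {AgentState.EXECUTING, AgentState.PAUSED, AgentState.ABORTED},
    AgentState.EXECUTING: {AgentState.VERIFYING, AgentState.PAUSED, AgentState.ABORTED},
    AgentState.PAUSED: {
        AgentState.ACQUIRING,
        AgentState.EXECUTING,
        AgentState.PLANNING,
        AgentState.ABORTED,
    },
    AgentState.VERIFYING: {AgentState.COMPLETED, AgentState.ABORTED},
    AgentState.COMPLETED: set(),  # terminal
    AgentState.ABORTED: set(),    # terminal
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return the new state, or raise ValueError if the edge is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting unknown edges up front keeps an agent from, for example, performing side effects after it has already reached COMPLETED.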

3. The PAUSE / RESUME Protocol

When an agent encounters a conflict that cannot be immediately resolved, it transitions to PAUSED.

3.1. The StateDigest

To survive suspension (and potential process restart), the agent MUST produce a StateDigest.

{
  "checkpoint_id": "ckpt_xyz",
  "resume_point": "STEP_4_WRITE_FILE",
  "context": { },
  "intent_backlog": [ ],
  "resolution_strategy": "RETRY | ABORT | MERGE"
}

  • Persistence: The Digest MUST be persisted (disk, DB, or returned to User UI).
  • Transportability: A different agent instance SHOULD be able to resume from the Digest (stateless compute).
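
The digest can be modeled as a small serializable record so that any agent instance can load it. The sketch below is one possible shape, not a normative schema: the field names mirror the example above, while the StateDigest class, its methods, and the reading of "RETRY | ABORT | MERGE" as "exactly one of these three values" are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any

# ASSUMPTION: resolution_strategy holds exactly one of these values.
RESOLUTION_STRATEGIES = {"RETRY", "ABORT", "MERGE"}


@dataclass
class StateDigest:
    checkpoint_id: str
    resume_point: str
    resolution_strategy: str
    context: dict[str, Any] = field(default_factory=dict)
    intent_backlog: list[Any] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.resolution_strategy not in RESOLUTION_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.resolution_strategy!r}")

    def to_json(self) -> str:
        """Serialize to a JSON string that another instance can load."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "StateDigest":
        """Rebuild a digest from its JSON form."""
        return cls(**json.loads(raw))


digest = StateDigest("ckpt_xyz", "STEP_4_WRITE_FILE", "RETRY")
restored = StateDigest.from_json(digest.to_json())
```

Keeping the digest free of in-memory handles (open files, sockets, leases) is what makes the round trip above possible on a different machine.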

3.2. Resume Semantics

When the conflict is resolved (e.g., Resource X is freed):

  1. Rehydration: Agent loads StateDigest.
  2. Re-Validation: Agent MUST re-check whether its previous assumptions still hold (e.g., did the file change while the agent was paused?).
  3. Re-Acquisition: Agent submits the Intent Manifest again.
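
The three steps above can be sketched as a single function that enforces their order. Everything here is hypothetical scaffolding: load_digest, revalidate, and submit_manifest stand in for whatever storage, checks, and Control Plane client an implementation provides, and the "Granted" verdict string is borrowed from the ACQUIRING state description. Returning to PAUSED on a renewed conflict is an assumption.

```python
from typing import Any, Callable

Digest = dict[str, Any]


def resume(
    load_digest: Callable[[], Digest],
    revalidate: Callable[[Digest], bool],
    submit_manifest: Callable[[Digest], str],
) -> str:
    """Run the resume steps in order and return the name of the next state."""
    digest = load_digest()               # 1. Rehydration
    if not revalidate(digest):           # 2. Re-Validation
        return "PLANNING"                # assumptions broken: re-plan
    verdict = submit_manifest(digest)    # 3. Re-Acquisition
    # ASSUMPTION: any verdict other than "Granted" sends the agent back to PAUSED.
    return "EXECUTING" if verdict == "Granted" else "PAUSED"


next_state = resume(
    load_digest=lambda: {"checkpoint_id": "ckpt_xyz", "intent_backlog": []},
    revalidate=lambda digest: True,
    submit_manifest=lambda digest: "Granted",
)
```

Re-validating before re-acquiring means an agent whose assumptions no longer hold never asks the Control Plane for leases it will not use.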

3.3. Semantic Rehydration

When transitioning from PAUSED to EXECUTING, the Environment MUST inject an "Observation Update" (the Resume Hint). This is normative. The agent MUST be informed of any resource changes that occurred during its downtime.
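
One way an Environment might compute such an update is to compare the resource versions the agent saw when it paused against the current ones. This section does not define the Resume Hint's format, so the build_resume_hint function, the version maps, and the output keys below are all assumptions made for illustration.

```python
def build_resume_hint(seen_at_pause: dict[str, str], current: dict[str, str]) -> dict:
    """Summarize what changed while the agent was paused.

    Both arguments map a resource name to an opaque version string
    (for example a content hash or a revision id).
    """
    changed = sorted(
        name
        for name, version in seen_at_pause.items()
        if name in current and current[name] != version
    )
    removed = sorted(name for name in seen_at_pause if name not in current)
    added = sorted(name for name in current if name not in seen_at_pause)
    # ASSUMPTION: hypothetical message shape, not defined by KLIS-5.
    return {
        "type": "observation_update",
        "changed": changed,
        "removed": removed,
        "added": added,
    }


hint = build_resume_hint(
    seen_at_pause={"src/app.py": "v1", "README.md": "v7"},
    current={"src/app.py": "v2", "docs/new.md": "v1"},
)
```

Here hint reports src/app.py as changed, README.md as removed, and docs/new.md as added, which is the information the agent needs before it decides whether its plan is still valid.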

4. Abort vs. Resume

  • Abort: If re-validation fails (e.g., the file the agent intended to write was deleted by another actor), the agent MUST drop the Digest and return to PLANNING (re-reasoning).
  • Resume: If conditions are stable, the agent proceeds to EXECUTING.
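
A sketch of how the abort-or-resume decision could be made for file resources, assuming the agent recorded a SHA-256 fingerprint of each file it depends on before pausing. The fingerprint and decide_after_pause helpers are hypothetical, and other resource types would need their own checks.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(path: str) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def decide_after_pause(expected: dict[str, Optional[str]]) -> str:
    """Compare recorded fingerprints with the current files.

    `expected` maps a file path to the fingerprint recorded before the
    pause (None means the file was expected to be absent).
    Returns "EXECUTING" if nothing changed, otherwise "PLANNING".
    """
    for path, recorded in expected.items():
        if fingerprint(path) != recorded:
            # Changed, deleted, or unexpectedly created: the plan is stale.
            return "PLANNING"
    return "EXECUTING"
```

An agent would build `expected` just before pausing (for example, storing it in the digest's context) and call decide_after_pause as part of re-validation.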

5. Rehydration Guarantees

The Control Plane cannot guarantee that the state of the world is unchanged. It guarantees only that the lock is now available.

  • Responsibility: It is the Agent's responsibility to refresh its view of the world (e.g., git pull or read_file again) upon resume. KLIS does not provide "Snapshot Isolation" for the filesystem.

6. Non-Goals

  • Process Migration: KLIS-5 defines the protocol for checking state out and back in, not OS-level process migration (e.g., CRIU).

7. Atomic Checkpointing

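To prevent "Partial Digest Corruption," implementations MUST follow the Temp-Rename pattern for StateDigest persistence. The digest MUST be written in full to a temporary file and then atomically renamed to the final destination. The temporary file should reside on the same filesystem as the destination, because a rename is atomic only within a single filesystem.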
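
A minimal sketch of the Temp-Rename pattern using only the Python standard library. The persist_digest name and the ".digest-" temp-file prefix are illustrative assumptions; the sketch also assumes a local filesystem where os.replace is atomic.

```python
import json
import os
import tempfile


def persist_digest(digest: dict, dest: str) -> None:
    """Write `digest` to `dest` so readers see the old or new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Create the temp file next to the destination so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".digest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(digest, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_path, dest)   # atomic swap into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash before os.replace leaves at worst a stray temporary file; the previously persisted digest, if any, is untouched.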