KLIS-5: Agent Lifecycle & Suspension
How agents handle interruption without losing work.
1. Purpose
Agents in a KLIS ecosystem are not continuous processes; they are interruptible state machines. KLIS-5 defines how agents gracefully handle interruptions (conflicts, preemption) without losing work.
2. Agent State Machine (Normative)
Implementations MUST track agents in one of the following high-level states:
- PLANNING: Reasoning about the task, generating the Intent Manifest. No network resources held.
- ACQUIRING: Submitting Manifest to Control Plane. Waiting for Granted or Conflict.
- EXECUTING: Holding valid leases. Performing side effects (I/O).
- PAUSED: Execution halted due to conflict (WAIT or DIE instruction). State serialized.
- VERIFYING: Work done. Leases held. Validating results.
- COMPLETED: Leases released. Success.
- ABORTED: Leases released. Failure/Rollback.
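The state list above can be sketched as an enum with a guarded transition function. This is a minimal illustration, not part of the specification: KLIS-5 names the states but does not enumerate the legal edges, so the ALLOWED table and the transition helper are assumptions inferred from the rest of this document.

```python
from enum import Enum


class AgentState(Enum):
    PLANNING = "PLANNING"
    ACQUIRING = "ACQUIRING"
    EXECUTING = "EXECUTING"
    PAUSED = "PAUSED"
    VERIFYING = "VERIFYING"
    COMPLETED = "COMPLETED"
    ABORTED = "ABORTED"


# Assumed edges (hypothetical): the spec lists states, not transitions.
ALLOWED = {
    AgentState.PLANNING: {AgentState.ACQUIRING, AgentState.ABORTED},
    AgentState.ACQUIRING: {AgentState.EXECUTING, AgentState.PAUSED, AgentState.ABORTED},
    AgentState.EXECUTING: {AgentState.VERIFYING, AgentState.PAUSED, AgentState.ABORTED},
    # PAUSED may resume, re-acquire, or fall back to re-planning.
    AgentState.PAUSED: {
        AgentState.EXECUTING,
        AgentState.ACQUIRING,
        AgentState.PLANNING,
        AgentState.ABORTED,
    },
    AgentState.VERIFYING: {AgentState.COMPLETED, AgentState.ABORTED},
    AgentState.COMPLETED: set(),  # terminal
    AgentState.ABORTED: set(),    # terminal
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return the target state, or raise if the edge is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the edges in one table means an implementation can reject an unexpected jump (for example, COMPLETED back to EXECUTING) at a single choke point.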
3. The PAUSE / RESUME Protocol
When an agent encounters a conflict that cannot be immediately resolved, it transitions to PAUSED.
3.1. The StateDigest
To survive suspension (and potential process restart), the agent MUST produce a StateDigest.
{
"checkpoint_id": "ckpt_xyz",
"resume_point": "STEP_4_WRITE_FILE",
"context": { },
"intent_backlog": [ ],
"resolution_strategy": "RETRY | ABORT | MERGE"
}
- Persistence: The Digest MUST be persisted (disk, DB, or returned to User UI).
- Transportability: A different agent instance SHOULD be able to resume from the Digest (stateless compute).
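One way to model the StateDigest in code is a small dataclass that round-trips through JSON, which is what makes it transportable between agent instances. The field names come from the example above. The validation rule (exactly one of RETRY, ABORT, MERGE) and the to_json / from_json helpers are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any

# Assumed reading of "RETRY | ABORT | MERGE": exactly one of these values.
RESOLUTION_STRATEGIES = {"RETRY", "ABORT", "MERGE"}


@dataclass
class StateDigest:
    checkpoint_id: str
    resume_point: str
    resolution_strategy: str
    context: dict[str, Any] = field(default_factory=dict)
    intent_backlog: list[Any] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.resolution_strategy not in RESOLUTION_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.resolution_strategy!r}")

    def to_json(self) -> str:
        """Serialize to a JSON string suitable for disk, a DB, or a UI."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "StateDigest":
        """Rebuild a digest, possibly in a different process or instance."""
        return cls(**json.loads(raw))
```

Because the digest holds only JSON-compatible values, a fresh agent instance can call StateDigest.from_json on the persisted string without sharing memory with the instance that paused.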
3.2. Resume Semantics
When the conflict is resolved (e.g., Resource X is freed):
- Rehydration: Agent loads StateDigest.
- Re-Validation: Agent MUST re-check if its previous assumptions are still true (e.g., did the file change while I was asleep?).
- Re-Acquisition: Agent submits the Intent Manifest again.
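The three steps above might be wired together as follows. This is a sketch under stated assumptions: load_digest, revalidate, and submit_manifest are hypothetical callables supplied by the implementation, the Control Plane reply is modeled as the plain strings "Granted" and "Conflict", and returning to PAUSED on a repeated conflict is an assumption rather than something KLIS-5 states.

```python
from typing import Any, Callable

Digest = dict[str, Any]


def resume(
    load_digest: Callable[[], Digest],
    revalidate: Callable[[Digest], bool],
    submit_manifest: Callable[[Digest], str],
) -> str:
    """Run the resume steps in order and return the name of the next state."""
    # 1. Rehydration: load the persisted StateDigest.
    digest = load_digest()

    # 2. Re-Validation: previous assumptions must still hold.
    if not revalidate(digest):
        return "PLANNING"  # stale assumptions, so re-reason from scratch

    # 3. Re-Acquisition: submit the Intent Manifest again.
    reply = submit_manifest(digest)
    if reply == "Granted":
        return "EXECUTING"
    return "PAUSED"  # assumed behavior when the conflict persists
```

Re-validation runs before re-acquisition so that an agent whose assumptions are stale does not take leases it will immediately give up.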
3.3. Semantic Rehydration
When transitioning from PAUSED to EXECUTING, the Environment MUST inject an "Observation Update" (the Resume Hint). This is normative. The agent MUST be informed of any resource changes that occurred during its downtime.
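KLIS-5 does not define the wire format of the Observation Update, so the following is only one possible shape. It assumes the Environment records a version token (for example, a content hash or mtime) per resource at pause time and again at resume time. The function name build_resume_hint and the keys of the returned dictionary are hypothetical.

```python
def build_resume_hint(at_pause: dict[str, str], at_resume: dict[str, str]) -> dict:
    """Diff two {resource: version} snapshots into an observation update."""
    common = at_pause.keys() & at_resume.keys()
    return {
        "type": "observation_update",
        # Resources present in both snapshots whose version differs.
        "changed": sorted(r for r in common if at_pause[r] != at_resume[r]),
        # Resources that disappeared while the agent was paused.
        "deleted": sorted(at_pause.keys() - at_resume.keys()),
        # Resources that appeared while the agent was paused.
        "created": sorted(at_resume.keys() - at_pause.keys()),
    }
```

An empty changed, deleted, and created list tells the agent that nothing it tracked moved during its downtime. The hint is still injected in that case so the agent does not have to infer stability from silence.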
4. Abort vs. Resume
- Abort: If re-validation fails (e.g., the file I wanted to write was deleted by someone else), the agent MUST drop the Digest and return to PLANNING (re-reasoning).
- Resume: If conditions are stable, the agent proceeds to EXECUTING.
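The abort-or-resume decision can be illustrated with a content fingerprint check. The sketch assumes the agent stored a SHA-256 hash of each file it depends on under a hypothetical "fingerprints" key in the digest's context. KLIS-5 does not prescribe how re-validation is performed, so this is one possible approach.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    try:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    except FileNotFoundError:
        return None


def next_state_after_pause(context: dict) -> str:
    """Return "EXECUTING" if every tracked file is unchanged, else "PLANNING"."""
    # "fingerprints" maps file path -> hash recorded at pause time (assumed key).
    for name, expected in context.get("fingerprints", {}).items():
        if fingerprint(Path(name)) != expected:
            # File was modified or deleted: drop the digest and re-plan.
            return "PLANNING"
    return "EXECUTING"
```

A deleted file yields None, which never equals a recorded hash, so deletion and modification both route the agent back to PLANNING.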
5. Rehydration Guarantees
The Control Plane cannot guarantee the state of the world hasn't changed. It only guarantees the lock is now available.
- Responsibility: It is the Agent's responsibility to git pull or read_file again upon resume. KLIS does not provide "Snapshot Isolation" for the filesystem.
6. Non-Goals
- Process Migration: KLIS-5 defines the protocol for checking out/in, not the OS-level process migration (CRIU).
7. Atomic Checkpointing
To prevent "Partial Digest Corruption," implementations MUST follow the Temp-Rename pattern for StateDigest persistence. The digest MUST be written to a temporary file and then atomically renamed to the final destination.
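A minimal sketch of the Temp-Rename pattern, assuming the digest is a JSON-serializable dictionary stored on a local filesystem. The function name persist_digest is hypothetical. Creating the temporary file in the destination directory keeps the rename on a single filesystem, which is what makes it atomic. The fsync call is an extra durability step that this document does not require.

```python
import json
import os
import tempfile


def persist_digest(digest: dict, dest: str) -> None:
    """Write the digest to a temp file, then atomically rename it over dest."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Same directory as dest, so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".digest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(digest, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # optional: push bytes to disk before rename
        # os.replace overwrites an existing dest atomically.
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the destination path sees either the previous complete digest or the new complete digest, never a partially written one.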