protocol
KLIS-4: Deadlock Prevention
How KLIS enforces Wait-Die to guarantee global progress.
1. Purpose
Distributed systems with locking are prone to Deadlocks (Cycle: A waits for B, B waits for A) and Livelocks (A and B both retry indefinitely). KLIS enforces a non-blocking or pre-emptive protocol to guarantee global progress.
2. Supported Strategies
KLIS Compliant Implementations MUST implement Wait-Die or Wound-Wait. KLIS Compliant Implementations MUST NOT use "Wait-Forever" (Blind Blocking).
2.1. The Priority Timestamp
Every agent session MUST accept a priority_timestamp (T) at initialization (usually the session start time).
- Older Transaction (Small T) = Higher Priority.
- Younger Transaction (Large T) = Lower Priority.
3. Wait-Die Protocol (Recommended)
When Agent A (Requester) requests a resource held by Agent B (Holder):
Case 1: requester.priority < holder.priority (Older waits for Younger)
- Concept: The "senior" agent is allowed to wait for the "junior" agent to finish.
- Action:
WAIT. Agent A enters a queue. - Timeout: Agent A MUST have a
WaitTimeout. If B takes too long, A aborts (prevents starvation).
Case 2: requester.priority > holder.priority (Younger requests from Older)
- Concept: The "junior" agent should not block the "senior" agent.
- Action:
DIE. Agent A request fails immediately. ADIEverdict MUST NOT result in a new timestamp. - Recovery: Agent A MUST rollback, wait a random backoff, and retry. It MUST persist its
txnTimestampin theStateDigestto gain seniority over time.
4. Contention Handling
4.1. Queue Fairness
- When a resource is freed, the Control Plane MUST notify waiters.
- Ordering: FIFO among waiters, or Priority-based (Oldest First).
- Thundering Herd: If multiple agents are waiting, the Control Plane SHOULD grant to one specific waiter rather than broadcasting "free" to all.
4.2. Exponential Backoff
- Agents that
DIEor are rejected MUST implement Exponential Backoff. Delay = min(Cap, Base * 2^Attempts) + Jitter- Jitter: Essential to prevent resonance.
4.3. Backoff Transparency
The DIE verdict MUST include a retryAfter hint (in ms) calculated via exponential backoff. The Wrapper/Supervisor MUST respect this hint to prevent "Thundering Herd" on the Control Plane.
5. Batch Acquisition Rule
To further minimize deadlock risk, KLIS encourages Atomic Batch Acquisition.
- Agents should request ALL needed resources for a unit of work at once.
- If ANY resource is unavailable/conflicted:
- If
WAITis allowed: Wait for all. - If
DIEis triggered: Release ALL implicitly held provisional locks and retry later.
- If
- Hold-and-Wait Prevention: An agent holding Resource X SHOULD NOT request Resource Y in a separate network call if ignoring Y would cause a deadlock. (Standard: Request all upfront).
6. Failure Modes
- Starvation: A "young" agent keeps dying because an "old" reliable agent is hogging the resource.
- Mitigation: The "young" agent keeps its timestamp. Eventually, it becomes the "oldest" agent in the system relative to new incoming requestors.
7. Security Considerations
- Priority Manipulation: Agents SHOULD NOT be able to self-assign arbitrary timestamps (e.g.,
T=0). Timestamps MUST be assigned by the trusted Control Plane.
8. Non-Goals
- Resource Preemption: KLIS-4 does not define "Wound-Wait" (where the older agent kills the younger agent's lease forcibly) as the default, though it is a valid alternative implementation. Wait-Die is preferred for simpler rollback logic.