protocol

KLIS-4: Deadlock Prevention

How KLIS enforces Wait-Die to guarantee global progress.

1. Purpose

Distributed systems with locking are prone to Deadlocks (Cycle: A waits for B, B waits for A) and Livelocks (A and B both retry indefinitely). KLIS enforces a non-blocking or pre-emptive protocol to guarantee global progress.

2. Supported Strategies

KLIS Compliant Implementations MUST implement Wait-Die or Wound-Wait. KLIS Compliant Implementations MUST NOT use "Wait-Forever" (Blind Blocking).

2.1. The Priority Timestamp

Every agent session MUST accept a priority_timestamp (T) at initialization (usually the session start time).

  • Older Transaction (Small T) = Higher Priority.
  • Younger Transaction (Large T) = Lower Priority.

When Agent A (Requester) requests a resource held by Agent B (Holder):

Case 1: requester.priority < holder.priority (Older waits for Younger)

  • Concept: The "senior" agent is allowed to wait for the "junior" agent to finish.
  • Action: WAIT. Agent A enters a queue.
  • Timeout: Agent A MUST have a WaitTimeout. If B takes too long, A aborts (prevents starvation).

Case 2: requester.priority > holder.priority (Younger requests from Older)

  • Concept: The "junior" agent should not block the "senior" agent.
  • Action: DIE. Agent A request fails immediately. A DIE verdict MUST NOT result in a new timestamp.
  • Recovery: Agent A MUST rollback, wait a random backoff, and retry. It MUST persist its txnTimestamp in the StateDigest to gain seniority over time.

4. Contention Handling

4.1. Queue Fairness

  • When a resource is freed, the Control Plane MUST notify waiters.
  • Ordering: FIFO among waiters, or Priority-based (Oldest First).
  • Thundering Herd: If multiple agents are waiting, the Control Plane SHOULD grant to one specific waiter rather than broadcasting "free" to all.

4.2. Exponential Backoff

  • Agents that DIE or are rejected MUST implement Exponential Backoff.
  • Delay = min(Cap, Base * 2^Attempts) + Jitter
  • Jitter: Essential to prevent resonance.

4.3. Backoff Transparency

The DIE verdict MUST include a retryAfter hint (in ms) calculated via exponential backoff. The Wrapper/Supervisor MUST respect this hint to prevent "Thundering Herd" on the Control Plane.

5. Batch Acquisition Rule

To further minimize deadlock risk, KLIS encourages Atomic Batch Acquisition.

  • Agents should request ALL needed resources for a unit of work at once.
  • If ANY resource is unavailable/conflicted:
    • If WAIT is allowed: Wait for all.
    • If DIE is triggered: Release ALL implicitly held provisional locks and retry later.
  • Hold-and-Wait Prevention: An agent holding Resource X SHOULD NOT request Resource Y in a separate network call if ignoring Y would cause a deadlock. (Standard: Request all upfront).

6. Failure Modes

  • Starvation: A "young" agent keeps dying because an "old" reliable agent is hogging the resource.
    • Mitigation: The "young" agent keeps its timestamp. Eventually, it becomes the "oldest" agent in the system relative to new incoming requestors.

7. Security Considerations

  • Priority Manipulation: Agents SHOULD NOT be able to self-assign arbitrary timestamps (e.g., T=0). Timestamps MUST be assigned by the trusted Control Plane.

8. Non-Goals

  • Resource Preemption: KLIS-4 does not define "Wound-Wait" (where the older agent kills the younger agent's lease forcibly) as the default, though it is a valid alternative implementation. Wait-Die is preferred for simpler rollback logic.