Sync Protocol

DSM replication has two related layers: the runtime control plane and the repair-oriented data plane.

End-To-End Flow

1. the application calls register.put(...), lease.acquire(...), or crdt.update(...)
2. DsmRuntime commits the change locally
3. RuntimePlatformSyncService emits a platform envelope carrying the delta
4. peers that are already in sync apply the delta
5. if a peer has fallen behind, RuntimeDataPlaneReplicationService repairs it
6. the cluster returns to a converged state
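The steps above can be sketched as a toy model. Everything here (the class, the version counter, the one-version-behind rule) is an illustrative assumption, not the real DSM API:

```java
// Toy model of the end-to-end flow: a local write produces a delta,
// in-sync peers apply it, and a lagging peer is flagged for repair.
// All names and the versioning rule are illustrative, not the DSM API.
import java.util.LinkedHashMap;
import java.util.Map;

public class EndToEndFlowSketch {
    /** A delta envelope: key/value plus the version it advances a peer to. */
    record Delta(String key, String value, long version) {}

    /** Minimal peer state: last applied version plus a key/value store. */
    static class Peer {
        long version;
        final Map<String, String> store = new LinkedHashMap<>();
        Peer(long version) { this.version = version; }
    }

    /**
     * Apply a delta only if the peer is exactly one version behind;
     * otherwise report that the data-plane repair path is needed.
     */
    static boolean apply(Peer peer, Delta delta) {
        if (delta.version() == peer.version + 1) {
            peer.store.put(delta.key(), delta.value());
            peer.version = delta.version();
            return true;                 // control-plane path succeeded
        }
        return false;                    // peer is behind: needs repair
    }

    public static void main(String[] args) {
        Peer current = new Peer(4);      // in sync with the writer
        Peer lagging = new Peer(2);      // missed deltas 3 and 4
        Delta delta = new Delta("route-hint", "edge-eu-west-1", 5);

        System.out.println("current applied: " + apply(current, delta)); // true
        System.out.println("lagging applied: " + apply(lagging, delta)); // false -> repair
    }
}
```

The in-sync peer takes the fast control-plane path; the lagging peer falls through to the repair path described below.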

Two-Layer Picture

			   normal path

local write -> control plane delta -> peers apply update

			   recovery path

peer falls behind -> digest check -> snapshot/replay -> peer catches up

Control Plane

RuntimePlatformSyncService replicates register, lease, and CRDT deltas over platform envelopes. This is the normal propagation path after the application mutates a collection handle.

ASCII view:

local mutation
	|
	v
[DsmRuntime]
	|
	v
[RuntimePlatformSyncService]
	|
	+--> peer-b applies delta
	+--> peer-c applies delta

This path is for the healthy, steady-state case where peers are online and current.

Example mental model:

register.put(route-hint)
	-> local runtime commits it
	-> sync service emits one delta envelope
	-> healthy peers apply the same update
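That mental model can be sketched in code: one mutation yields exactly one envelope, and every healthy peer applies that same envelope and converges. The class and method names are hypothetical, not the real sync service API:

```java
// Sketch of the control-plane fan-out: one local mutation, one delta
// envelope, identical application on every healthy peer.
// Names are illustrative, not the real RuntimePlatformSyncService API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ControlPlaneSketch {
    /** Each peer is just a key/value map in this sketch. */
    static List<Map<String, String>> fanOut(String key, String value, int peerCount) {
        // One envelope per mutation...
        Map.Entry<String, String> envelope = Map.entry(key, value);
        // ...applied identically by every healthy peer.
        List<Map<String, String>> peers = new ArrayList<>();
        for (int i = 0; i < peerCount; i++) {
            Map<String, String> store = new HashMap<>();
            store.put(envelope.getKey(), envelope.getValue());
            peers.add(store);
        }
        return peers;
    }

    public static void main(String[] args) {
        // register.put(route-hint) -> one envelope -> peer-b and peer-c converge
        List<Map<String, String>> peers = fanOut("route-hint", "edge-eu-west-1", 2);
        System.out.println("peer-b: " + peers.get(0));
        System.out.println("peer-c: " + peers.get(1));
        System.out.println("converged: " + peers.get(0).equals(peers.get(1))); // true
    }
}
```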

Data Plane Repair

RuntimeDataPlaneReplicationService repairs lagging peers using:

  • digest exchange
  • snapshot transfer
  • replay flows

That repair path matters when peers miss messages, rejoin after downtime, or need to reconcile state after cluster instability.

ASCII repair view:

peer-c falls behind
	|
	v
digest check -> state differs
	|
	+--> snapshot if peer is far behind
	|
	+--> replay if peer can catch up from history
	v
peer-c returns to current cluster state

Repair is why DSM does not have to rely on best-effort delta delivery alone.

Typical cases that trigger repair:

  • a node restarted and missed some deltas
  • a node was partitioned briefly
  • a peer joined after the latest state already moved forward
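The snapshot-versus-replay branch in the diagram above can be sketched as a small decision function. The history-window rule and all names here are assumptions for illustration, not the actual RuntimeDataPlaneReplicationService logic:

```java
// Sketch of the repair decision: compare versions (standing in for the
// digest exchange), then choose delta replay or snapshot transfer based
// on how far behind the peer is. The history-window rule is assumed.
public class RepairDecisionSketch {
    enum Repair { NONE, REPLAY, SNAPSHOT }

    /**
     * @param localVersion   the lagging peer's last applied version
     * @param clusterVersion the current cluster version
     * @param historyWindow  how many recent deltas are retained for replay
     */
    static Repair decide(long localVersion, long clusterVersion, long historyWindow) {
        if (localVersion == clusterVersion) {
            return Repair.NONE;              // digests match, nothing to do
        }
        long missed = clusterVersion - localVersion;
        return missed <= historyWindow
                ? Repair.REPLAY              // catch up from retained deltas
                : Repair.SNAPSHOT;           // too far behind: full state transfer
    }

    public static void main(String[] args) {
        System.out.println(decide(10, 10, 100)); // NONE
        System.out.println(decide(8, 10, 100));  // REPLAY
        System.out.println(decide(0, 500, 100)); // SNAPSHOT
    }
}
```

A restarted node that missed a few deltas lands on REPLAY; a peer that joined long after the state moved forward lands on SNAPSHOT.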

Why The Split Exists

The control plane handles normal steady-state mutation flow. The data plane exists to restore correctness and convergence when normal flow is not enough.

More concretely:

  • control plane optimizes for fast mutation propagation
  • repair plane optimizes for correctness after loss, delay, or rejoin

Without the repair plane, a missed message could leave a node permanently stale.

Collection-Specific Behavior

  • registers replicate metadata-backed entity updates
  • leases replicate ownership state and lease transitions
  • CRDTs replicate updates plus state repair toward convergence

Examples:

  • register: route-hints delta announces a new address for edge-eu-west-1
  • lease: shard-owner delta announces owner change or renew result for shard-17
  • CRDT: request-counter delta carries update intent while repair transfers merged state when needed
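The CRDT case from the list above is the one where "update intent" and "merged state" differ most visibly. A minimal grow-only counter makes the distinction concrete; this is a generic G-Counter sketch, not DSM's actual CRDT implementation:

```java
// Sketch of the CRDT case: normal deltas carry update intent (a single
// increment), while repair transfers merged state (per-node counts
// merged by max). A minimal G-Counter, illustrative only.
import java.util.HashMap;
import java.util.Map;

public class CrdtRepairSketch {
    /** Control-plane path: apply one node's increment delta. */
    static void applyDelta(Map<String, Long> counter, String node, long inc) {
        counter.merge(node, inc, Long::sum);
    }

    /** Repair path: merge full state, taking the max count per node. */
    static void mergeState(Map<String, Long> local, Map<String, Long> remote) {
        remote.forEach((node, count) -> local.merge(node, count, Math::max));
    }

    static long total(Map<String, Long> counter) {
        return counter.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Long> a = new HashMap<>();
        Map<String, Long> b = new HashMap<>();
        applyDelta(a, "node-a", 3);      // delta reaches node-a's replica...
        applyDelta(a, "node-b", 1);
        applyDelta(b, "node-b", 1);      // ...but b missed node-a's update
        mergeState(b, a);                // state repair converges b
        System.out.println(total(a) + " " + total(b)); // 4 4
    }
}
```

Because the merge is idempotent and commutative, repair can resend full state without risking double counting.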

Sequence Sketch

node-a                    node-b
	|                         |
	| put(route-hint)         |
	|---- delta envelope ---->|
	|                         | apply update
	|                         |

if node-b missed it:

node-a                    node-b
	|                         |
	|<--- digest mismatch ----|
	|---- replay/snapshot --->|
	|                         | catch up
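The digest mismatch step in the second sequence can be sketched as follows. Hashing a canonical (sorted) rendering of the state is an illustrative choice here, not the actual digest scheme:

```java
// Sketch of the digest exchange: each node digests its local state; a
// mismatch is what triggers replay or snapshot. The digest function is
// an illustrative stand-in, not the real protocol's digest.
import java.util.Map;
import java.util.TreeMap;

public class DigestSketch {
    /** Order-independent digest: hash a canonical (sorted) rendering. */
    static int digest(Map<String, String> store) {
        return new TreeMap<>(store).toString().hashCode();
    }

    static boolean needsRepair(Map<String, String> local, Map<String, String> remote) {
        return digest(local) != digest(remote);
    }

    public static void main(String[] args) {
        Map<String, String> nodeA = Map.of("route-hint", "edge-eu-west-1");
        Map<String, String> nodeB = Map.of();          // node-b missed the delta
        System.out.println(needsRepair(nodeA, nodeB)); // true -> replay/snapshot
        System.out.println(needsRepair(nodeA, Map.of("route-hint", "edge-eu-west-1"))); // false
    }
}
```

Exchanging a small digest instead of full state keeps the healthy-path cost low; full transfer only happens once a mismatch is detected.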

Where To Read Executable Behavior

If you want to see the protocol working in code rather than prose, start with:

  • dsm-integration-test/.../TwoNodeIntegrationTest for two-node register replication
  • dsm-integration-test/.../RuntimeIntegrationTest for register, lease, and CRDT behavior together