Observability

DSM uses DsmMetrics as its metrics SPI instead of coupling the runtime to one telemetry framework. The practical model is:

runtime event
	|
	+--> diagnostics snapshot for immediate inspection
	|
	+--> DsmMetrics callback for long-window visibility
	|
	+--> your metrics backend / alerting system

Core Signals

At minimum, the SPI exposes:

  • recordSyncLatency(long latencyMs)
  • reportClusterSize(int count)

Additional callbacks cover:

  • authentication failures
  • replay rejection
  • cluster admission denial
  • service ID mismatch
  • last-write-wins (LWW) discards and dropped messages
  • suspected partitions
  • lease acquisition, renewal, transfer, release, and verification
  • fencing rejection
  • active lease holder counts
  • backpressure decisions and queue depth

Signal Groups

Cluster Health

  • reportClusterSize
  • recordServiceIdMismatch
  • recordClusterAdmissionDenied
  • recordPartitionSuspected

Interpretation:

cluster size drops
	+ serviceId mismatch rises
		-> likely isolation / discovery mismatch

cluster size drops
	+ partition suspected rises
		-> likely membership or network instability
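
The two rules above can be expressed as a small decision function. This is a minimal sketch and not part of DSM: the `ClusterDiagnosis` enum, the `ClusterHealthRules` class, and the idea of passing counter deltas measured over one observation window are all assumptions made for illustration. When both counters rise, the sketch arbitrarily reports the service ID mismatch first.

```java
// Hypothetical outcome labels; DSM does not define these.
enum ClusterDiagnosis { NO_DROP, DISCOVERY_MISMATCH, NETWORK_INSTABILITY, UNCLASSIFIED_DROP }

final class ClusterHealthRules {
    private ClusterHealthRules() {}

    // previousSize / currentSize: values reported through reportClusterSize.
    // The two deltas are how much each counter grew during the same window.
    static ClusterDiagnosis classify(int previousSize, int currentSize,
                                     long serviceIdMismatchDelta,
                                     long partitionSuspectedDelta) {
        if (currentSize >= previousSize) {
            return ClusterDiagnosis.NO_DROP;
        }
        if (serviceIdMismatchDelta > 0) {
            return ClusterDiagnosis.DISCOVERY_MISMATCH;   // isolation / discovery mismatch
        }
        if (partitionSuspectedDelta > 0) {
            return ClusterDiagnosis.NETWORK_INSTABILITY;  // membership or network instability
        }
        return ClusterDiagnosis.UNCLASSIFIED_DROP;        // size dropped with no supporting signal
    }
}
```

A drop with neither counter rising is kept as its own outcome so that it is investigated rather than silently mapped to one of the two documented causes.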

Security And Transport Integrity

  • recordAuthFailure
  • recordReplayRejection
  • recordMessageDropped

Interpretation:

auth failures spike
	-> cluster secret or signing mismatch

replay rejection spikes
	-> nonce / replay protection issue or duplicate traffic pattern

Lease Health

  • recordLeaseAcquire
  • recordLeaseRenew
  • recordLeaseTransfer
  • recordLeaseRelease
  • recordLeaseVerify
  • recordFencingReject
  • reportActiveLeaseHolders

Interpretation:

renew failures rise
	+ fencing rejects rise
		-> investigate lease thrash and stale holders

active holders unstable
	+ acquire rejections high
		-> investigate ownership churn or timing windows

Backpressure And Queueing

  • recordBackpressureDecision
  • reportQueueDepth

Interpretation:

queue depth grows steadily
	+ backpressure decisions increase
		-> runtime under sustained pressure or downstream transport lag
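
One way to make "grows steadily" concrete is to look at a sliding window of samples. The sketch below is an assumption-laden illustration, not DSM code: `QueuePressureDetector` is a hypothetical helper, "steadily" is interpreted as strictly increasing queue depth across the whole window, and the backpressure signal is assumed to be a cumulative count of decisions that you track yourself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: flags sustained pressure over a fixed-size sample window.
final class QueuePressureDetector {
    private final int windowSize;
    // Each entry is {queueDepth, cumulativeBackpressureDecisions}.
    private final Deque<long[]> samples = new ArrayDeque<>();

    QueuePressureDetector(int windowSize) {
        if (windowSize < 2) {
            throw new IllegalArgumentException("windowSize must be at least 2");
        }
        this.windowSize = windowSize;
    }

    // Records one sample and returns true when the window is full, queue depth
    // rose at every step, and the backpressure total grew across the window.
    boolean observe(long queueDepth, long backpressureTotal) {
        samples.addLast(new long[] {queueDepth, backpressureTotal});
        if (samples.size() > windowSize) {
            samples.removeFirst();
        }
        if (samples.size() < windowSize) {
            return false;
        }
        long[] previous = null;
        for (long[] sample : samples) {
            if (previous != null && sample[0] <= previous[0]) {
                return false; // depth did not grow at this step
            }
            previous = sample;
        }
        return samples.getLast()[1] > samples.getFirst()[1];
    }
}
```

Requiring both conditions mirrors the pairing above: a growing queue without new backpressure decisions, or the reverse, is not reported as sustained pressure.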

How To Correlate Metrics With Diagnostics

alert fires
	|
	+--> check DsmMetrics series for trend direction
	|
	+--> query runtime.diagnostics() on affected node
	|
	+--> confirm whether the issue is:
			cluster membership
			wrong collection registration
			lease churn
			transport/security rejection

No-Op Default

Use DsmMetrics.noop() when the runtime should operate without external metrics wiring.

When you do wire metrics, map the SPI into your existing metrics system with these shapes:

  • timers for sync latency
  • gauges for cluster size, active holders, and queue depth
  • counters for failures, mismatches, rejections, and drops
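
The mapping above can be sketched with plain JDK types before committing to a metrics library. Everything here is illustrative: `SyncMetricsSketch` is a hypothetical stand-in that copies only the two signatures listed under Core Signals, and the no-argument `recordAuthFailure()` signature is an assumption, since this page names the callback but not its parameters. The real DsmMetrics interface has more methods and may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a small slice of the SPI.
interface SyncMetricsSketch {
    void recordSyncLatency(long latencyMs);
    void reportClusterSize(int count);
    void recordAuthFailure(); // assumed signature
}

// In-memory recorder showing the three shapes: timer, gauge, counter.
final class InMemorySyncMetrics implements SyncMetricsSketch {
    // Timer modelled as count + total + max.
    private final LongAdder latencyCount = new LongAdder();
    private final LongAdder latencyTotalMs = new LongAdder();
    private final AtomicLong latencyMaxMs = new AtomicLong();
    // Gauge: only the most recent value matters.
    private final AtomicInteger clusterSize = new AtomicInteger();
    // Counter: monotonically increasing.
    private final LongAdder authFailures = new LongAdder();

    @Override
    public void recordSyncLatency(long latencyMs) {
        latencyCount.increment();
        latencyTotalMs.add(latencyMs);
        latencyMaxMs.accumulateAndGet(latencyMs, Math::max);
    }

    @Override
    public void reportClusterSize(int count) {
        clusterSize.set(count);
    }

    @Override
    public void recordAuthFailure() {
        authFailures.increment();
    }

    long latencyCount() { return latencyCount.sum(); }
    long latencyMaxMs() { return latencyMaxMs.get(); }
    double latencyMeanMs() {
        long n = latencyCount.sum();
        return n == 0 ? 0.0 : (double) latencyTotalMs.sum() / n;
    }
    int clusterSize() { return clusterSize.get(); }
    long authFailures() { return authFailures.sum(); }
}
```

In a real integration, each field would typically be replaced by the equivalent timer, gauge, or counter type from your metrics library, with the callback bodies staying this thin.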

First Alerts To Add

  • cluster size below expected baseline
  • sustained replay rejection or auth failure
  • rising lease renew rejection or fencing rejection
  • queue depth above normal operating range
  • sustained backpressure decisions on one collection locator
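
Several of these alerts depend on a condition being sustained rather than momentary. If your alerting system does not provide that directly, the idea can be sketched as a consecutive-breach counter. `SustainedAlert` is a hypothetical helper, and the baseline of 5 nodes and the requirement of 3 consecutive samples in the usage note are made-up example values.

```java
import java.util.function.IntPredicate;

// Hypothetical helper: fires only after a condition holds for N samples in a row.
final class SustainedAlert {
    private final IntPredicate breach;
    private final int requiredConsecutive;
    private int streak;

    SustainedAlert(IntPredicate breach, int requiredConsecutive) {
        if (requiredConsecutive < 1) {
            throw new IllegalArgumentException("requiredConsecutive must be at least 1");
        }
        this.breach = breach;
        this.requiredConsecutive = requiredConsecutive;
    }

    // Feeds one sample; any non-breaching sample resets the streak.
    boolean sample(int value) {
        streak = breach.test(value) ? streak + 1 : 0;
        return streak >= requiredConsecutive;
    }
}
```

For example, `new SustainedAlert(size -> size < 5, 3)` would fire once the reported cluster size has stayed below 5 for three samples in a row, and would clear as soon as one sample returns to baseline.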