Observability
DSM uses DsmMetrics as its metrics SPI instead of coupling the runtime to one telemetry framework. The practical model is:

    runtime event
    |
    +--> diagnostics snapshot for immediate inspection
    |
    +--> DsmMetrics callback for long-window visibility
    |
    +--> your metrics backend / alerting system

Core Signals
At minimum the SPI exposes:
- recordSyncLatency(long latencyMs)
- reportClusterSize(int count)
Additional callbacks cover:
- authentication failures
- replay rejection
- cluster admission denial
- service ID mismatch
- LWW discards and dropped messages
- suspected partitions
- lease acquisition, renewal, transfer, release, and verification
- fencing rejection
- active lease holder counts
- backpressure decisions and queue depth
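
As a minimal sketch of how the two core callbacks might be implemented, the recorder below keeps the latest cluster size and the worst sync latency in memory, which can be handy in tests. `CoreSignals` is a hypothetical stand-in that declares only the two signatures listed above; it is not the real `DsmMetrics` interface, whose remaining callbacks and exact shape are not shown in this document.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the two core DsmMetrics callbacks (not the real SPI).
interface CoreSignals {
    void recordSyncLatency(long latencyMs);
    void reportClusterSize(int count);
}

// Keeps the most recent cluster size and the highest sync latency seen so far.
final class InMemoryCoreSignals implements CoreSignals {
    private final AtomicLong maxSyncLatencyMs = new AtomicLong();
    private final AtomicInteger lastClusterSize = new AtomicInteger();

    @Override
    public void recordSyncLatency(long latencyMs) {
        // Atomically retain the larger of the stored and the new value.
        maxSyncLatencyMs.accumulateAndGet(latencyMs, Math::max);
    }

    @Override
    public void reportClusterSize(int count) {
        // Gauge semantics: the newest report replaces the previous one.
        lastClusterSize.set(count);
    }

    long maxSyncLatencyMs() { return maxSyncLatencyMs.get(); }

    int lastClusterSize() { return lastClusterSize.get(); }
}
```

A test could pass an instance of such a recorder wherever the runtime accepts its metrics implementation, then assert on `maxSyncLatencyMs()` and `lastClusterSize()` after driving some traffic.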
Signal Groups
Cluster Health
- reportClusterSize
- recordServiceIdMismatch
- recordClusterAdmissionDenied
- recordPartitionSuspected
Interpretation:

    cluster size drops
    + serviceId mismatch rises
    -> likely isolation / discovery mismatch

    cluster size drops
    + partition suspected rises
    -> likely membership or network instability

Security And Transport Integrity
- recordAuthFailure
- recordReplayRejection
- recordMessageDropped
Interpretation:

    auth failures spike
    -> cluster secret or signing mismatch

    replay rejection spikes
    -> nonce / replay protection issue or duplicate traffic pattern

Lease Health
- recordLeaseAcquire
- recordLeaseRenew
- recordLeaseTransfer
- recordLeaseRelease
- recordLeaseVerify
- recordFencingReject
- reportActiveLeaseHolders
Interpretation:

    renew failures rise
    + fencing rejects rise
    -> investigate lease thrash and stale holders

    active holders unstable
    + acquire rejections high
    -> investigate ownership churn or timing windows

Backpressure And Queueing
- recordBackpressureDecision
- reportQueueDepth
Interpretation:

    queue depth grows steadily
    + backpressure decisions increase
    -> runtime under sustained pressure or downstream transport lag

How To Correlate Metrics With Diagnostics

    alert fires
    |
    +--> check DsmMetrics series for trend direction
    |
    +--> query runtime.diagnostics() on affected node
    |
    +--> confirm whether issue is:
           cluster membership
           wrong collection registration
           lease churn
           transport/security rejection

No-Op Default
DsmMetrics.noop() is available when you need a runtime that operates without external metrics wiring.
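
The sketch below illustrates the general null-object pattern that a no-op metrics sink typically follows, together with a fallback helper for wiring code. It does not show how DSM implements `DsmMetrics.noop()`: `MetricsSpi` and `MetricsWiring` are hypothetical names, and the interface carries only the two signatures this document lists.

```java
// Hypothetical stand-in SPI; it only illustrates the no-op (null object) pattern.
interface MetricsSpi {
    void recordSyncLatency(long latencyMs);
    void reportClusterSize(int count);

    // Returns an implementation that silently ignores every callback.
    static MetricsSpi noop() {
        return new MetricsSpi() {
            @Override public void recordSyncLatency(long latencyMs) { }
            @Override public void reportClusterSize(int count) { }
        };
    }
}

final class MetricsWiring {
    private MetricsWiring() { }

    // Falls back to the no-op sink when no metrics implementation is configured,
    // so callers never need a null check before emitting a signal.
    static MetricsSpi orNoop(MetricsSpi configured) {
        return configured != null ? configured : MetricsSpi.noop();
    }
}
```

With a fallback like this, call sites can invoke callbacks unconditionally, and swapping in a real backend later requires no changes to them.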
Recommended Adapter Pattern
Map the SPI into your existing metrics system with these shapes:
- timers for sync latency
- gauges for cluster size, active holders, and queue depth
- counters for failures, mismatches, rejections, and drops
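
The three shapes above can be sketched with a small in-memory adapter. This is an illustration under stated assumptions, not the DSM API: `MetricsSubset` is a hypothetical subset of the SPI, only `recordSyncLatency(long)` and `reportClusterSize(int)` have signatures given in this document, the parameter lists of `recordAuthFailure` and `reportQueueDepth` are guesses, and the metric names are invented. A production adapter would delegate to your metrics library instead of the maps used here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical subset of the SPI. The no-arg and int-arg shapes of the last
// two methods are assumptions; only the first two signatures are documented.
interface MetricsSubset {
    void recordSyncLatency(long latencyMs);
    void reportClusterSize(int count);
    void recordAuthFailure();
    void reportQueueDepth(int depth);
}

// In-memory stand-in for a metrics backend, keyed by illustrative metric names.
final class InMemoryAdapter implements MetricsSubset {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final Map<String, AtomicLong> gauges = new ConcurrentHashMap<>();

    private void increment(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    private void setGauge(String name, long value) {
        gauges.computeIfAbsent(name, k -> new AtomicLong()).set(value);
    }

    // Timer shape: track sample count and total so a mean can be derived.
    @Override public void recordSyncLatency(long latencyMs) {
        increment("dsm.sync.latency.count", 1);
        increment("dsm.sync.latency.total_ms", latencyMs);
    }

    // Gauge shape: the latest reported value replaces the previous one.
    @Override public void reportClusterSize(int count) { setGauge("dsm.cluster.size", count); }

    @Override public void reportQueueDepth(int depth) { setGauge("dsm.queue.depth", depth); }

    // Counter shape: a monotonically increasing event count.
    @Override public void recordAuthFailure() { increment("dsm.auth.failures", 1); }

    // Read accessors; unknown names read as zero.
    long counter(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0 : adder.sum();
    }

    long gauge(String name) {
        AtomicLong value = gauges.get(name);
        return value == null ? 0 : value.get();
    }
}
```

Keeping gauges as "last value wins" and counters as monotonic totals lets the backend compute rates and deltas itself, which is usually more robust than pre-aggregating inside the adapter.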
First Alerts To Add
- cluster size below expected baseline
- sustained replay rejection or auth failure
- rising lease renew rejection or fencing rejection
- queue depth above normal operating range
- sustained backpressure decisions on one collection locator
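
Most teams would express these alerts in their alerting system's own rule language. Purely to make the conditions concrete, here is a sketch of three of them as plain Java. The class name, alert names, and the choice of "three consecutive windows with at least one failure" as the meaning of "sustained" are all assumptions for illustration, not DSM defaults.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative alert rules; thresholds and alert names are assumptions.
final class FirstAlerts {
    private FirstAlerts() { }

    // True when each of the last `windows` samples is strictly above `threshold`.
    static boolean sustainedAbove(long[] samples, int windows, long threshold) {
        if (windows <= 0 || samples.length < windows) return false;
        for (int i = samples.length - windows; i < samples.length; i++) {
            if (samples[i] <= threshold) return false;
        }
        return true;
    }

    // Evaluates one observation against the caller's baselines.
    static List<String> evaluate(int clusterSize, int expectedClusterSize,
                                 long[] authFailuresPerWindow,
                                 int queueDepth, int maxNormalQueueDepth) {
        List<String> alerts = new ArrayList<>();
        if (clusterSize < expectedClusterSize) {
            alerts.add("cluster-size-below-baseline");
        }
        // "Sustained" here: the three most recent windows each saw a failure.
        if (sustainedAbove(authFailuresPerWindow, 3, 0)) {
            alerts.add("sustained-auth-failure");
        }
        if (queueDepth > maxNormalQueueDepth) {
            alerts.add("queue-depth-above-normal");
        }
        return alerts;
    }
}
```

Requiring several consecutive windows before firing trades a slower alert for fewer false positives from a single noisy interval; the right window count depends on how often your backend scrapes or flushes.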