# Troubleshooting
This page is the operator-facing runbook for common DSM integration failures. Start from the symptom you observe, then use the diagnostics API and the metrics trends together to confirm the cause.
## Universal Triage Flow

```text
DSM issue reported
  |
  +--> runtime.diagnostics().state()
  |       |
  |       +--> not ready -> fix lifecycle first
  |
  +--> runtime.diagnostics().clusterView()
  |       |
  |       +--> peers wrong/missing -> inspect clusterId, serviceId, membership
  |
  +--> runtime.diagnostics().collections()
  |       |
  |       +--> locator missing/wrong -> inspect registration or Spring config
  |
  +--> runtime.diagnostics().leaseCollections()
  |       |
  |       +--> churn/rejects -> inspect lease timings and fencing flow
  |
  +--> DsmMetrics trends
          |
          +--> security, replay, partition, backpressure, queue depth
```

## Symptom: Lease Thrashing
Typical signs:
- ownership changes too often
- `renewRejectCount` grows
- `acquireRejectCount` and `uncertainAcquireCount` grow
- downstream sees fencing rejections
ASCII picture:

```text
worker-a acquire -> renew fails -> worker-b acquire -> worker-a retries -> churn
```

What to check:
- `leaseCollections()` for renew and acquire rejection counters
- whether `renew-skew` is too small relative to scheduling jitter
- whether `term` is too short for the runtime environment
- whether downstream processing ignores fencing and keeps acting as a stale owner
Likely causes:
- renewals scheduled too late
- overloaded node misses renewal window
- duplicate workers competing for the same keys
- environment pauses causing expiry churn
Fix direction:
- increase lease term if operationally appropriate
- keep renewals comfortably inside the renew-skew window
- verify only one worker should target the key at a time
- inspect `recordLeaseRenew`, `recordLeaseAcquire`, and `recordFencingReject` metrics
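
The renew-timing check above can be reduced to a back-of-envelope calculation. This is a minimal, language-neutral sketch: the function and parameter names (`term_ms`, `renew_skew_ms`, and so on) are illustrative and not part of the DSM API.

```python
# Hypothetical sketch: does the renew schedule leave enough headroom?
# A renewal, plus worst-case scheduling jitter, must land before the point
# where the runtime starts treating the lease as at-risk (term - renew_skew).

def renew_headroom_ok(term_ms: int, renew_skew_ms: int,
                      renew_interval_ms: int, max_jitter_ms: int) -> bool:
    deadline_ms = term_ms - renew_skew_ms
    return renew_interval_ms + max_jitter_ms < deadline_ms

# A 10s term with a 2s skew window: renewing every 5s with up to 1s of
# jitter stays safely inside the window...
print(renew_headroom_ok(10_000, 2_000, 5_000, 1_000))   # True
# ...but renewing every 7.5s with 1s jitter eats into the skew window,
# which is exactly the late-renewal churn pattern described above.
print(renew_headroom_ok(10_000, 2_000, 7_500, 1_000))   # False
```

If the second case matches your configuration, shorten the renew interval or lengthen the term rather than tolerating intermittent expiry.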
## Symptom: Wrong Or Missing Locator
Typical signs:
- one node has entries but peers do not
- expected collection does not appear in `collections()`
- data appears isolated even though the runtime is running
ASCII picture:

```text
node-a -> shared/gateway/route-hints
node-b -> shared/worker/route-hints

same entry key
different locator
no shared collection state
```

What to check:
- `collections()` on every affected node
- locator triplet: `tenantId` / `applicationId` / `collectionId`
- `schemaId`
- collection consistency tier
Likely causes:
- typo in one locator segment
- one service registered under a different application domain
- Spring Boot property mismatch between environments
- one side changed `schemaId` or consistency tier incompatibly
Fix direction:
- make the locator identical on all intended peers
- keep `schemaId` aligned for compatible payloads
- centralize registration to avoid drift between modules
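
To spot which segment drifted, diff the locator tuples each node reports. A minimal sketch, assuming you have already collected the `(tenantId, applicationId, collectionId, schemaId)` values from `collections()` on each node; the function name and data shape are illustrative, not DSM API.

```python
# Hypothetical sketch: report, per locator field, the distinct values seen
# across nodes. Any field with more than one value is the drift.

def locator_drift(nodes: dict[str, tuple]) -> dict[str, set]:
    fields = ("tenantId", "applicationId", "collectionId", "schemaId")
    drift = {}
    for i, field in enumerate(fields):
        values = {locator[i] for locator in nodes.values()}
        if len(values) > 1:
            drift[field] = values
    return drift

# Mirrors the ASCII picture above: same tenant and collection,
# different application segment.
nodes = {
    "node-a": ("shared", "gateway", "route-hints", "v1"),
    "node-b": ("shared", "worker", "route-hints", "v1"),
}
print(locator_drift(nodes))  # drift only in applicationId
```

An empty result means the locators agree and the problem lies elsewhere (consistency tier, registration timing, or membership).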
## Symptom: Spring Bean Validation Failure At Startup
Typical signs:
- application fails during Spring Boot startup
- error mentions missing codec bean, missing entity factory, duplicate locator, or wrong consistency tier
ASCII picture:

```text
Spring startup
  |
  +--> bind dsm.* properties
  +--> validate collection definitions
  +--> resolve supporting beans
  +--> build DsmRuntime

failure happens here ^ before runtime is usable
```

What to check:
- required common fields are present: `tenant-id`, `application-id`, `collection-id`, `schema-id`, `codec-bean`
- `consistency-tier` matches `type`
- lease collections set `lease.entity-factory-bean`
- CRDT collections set `state-codec-bean`, `initial-state-bean`, and `merger-bean`
- explicit `bean-name` values are unique
Likely causes:
- bean name typo
- missing supporting bean definition
- copied register config reused for lease or CRDT without required nested fields
- duplicate locator definitions in one application
Fix direction:
- compare the failing config to Spring Properties
- compare the target workload to the matching cookbook page
- keep one centralized configuration source for all collection definitions
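
The validation rules listed above can be pre-checked before deploying. This is a hypothetical sketch of that logic, not the framework's actual validator: the rule set mirrors the fields named in this section, but the error strings and the `validate` function are illustrative.

```python
# Hypothetical sketch: apply the startup rules described above to a list of
# collection definitions (each a dict of dsm.* property names).

REQUIRED_COMMON = ("tenant-id", "application-id", "collection-id",
                   "schema-id", "codec-bean")

def validate(definitions: list[dict]) -> list[str]:
    errors, seen_locators = [], set()
    for d in definitions:
        for field in REQUIRED_COMMON:
            if field not in d:
                errors.append(f"missing {field}")
        locator = (d.get("tenant-id"), d.get("application-id"),
                   d.get("collection-id"))
        if locator in seen_locators:
            errors.append(f"duplicate locator {locator}")
        seen_locators.add(locator)
        # Per-type nested requirements, as listed under "What to check".
        if d.get("type") == "lease" and "lease.entity-factory-bean" not in d:
            errors.append("lease collection missing lease.entity-factory-bean")
        if d.get("type") == "crdt":
            for field in ("state-codec-bean", "initial-state-bean", "merger-bean"):
                if field not in d:
                    errors.append(f"crdt collection missing {field}")
    return errors

# A register-style config copied for a lease collection, with the nested
# lease fields forgotten -- the common copy-paste failure described above.
bad = [{"tenant-id": "shared", "application-id": "gateway",
        "collection-id": "jobs", "schema-id": "v1",
        "codec-bean": "jsonCodec", "type": "lease"}]
print(validate(bad))
```

Running this against your centralized configuration source in CI catches the typo before Spring startup does.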
## Symptom: clusterId Or serviceId Mismatch
Typical signs:
- node sees no peers or fewer peers than expected
recordServiceIdMismatchrises- peers appear present on the network but do not join the same DSM fabric
ASCII picture:

```text
node-a: clusterId=prod-eu-west, serviceId=gateway-service
node-b: clusterId=prod-eu-west, serviceId=worker-service

same network
different service family
membership or replication traffic is rejected/ignored
```

What to check:
- `clusterView().clusterId()` and `clusterView().serviceId()` on healthy and unhealthy nodes
- deployment-time environment variables or property overrides
- metrics for `recordServiceIdMismatch` and cluster admission denial
Likely causes:
- wrong environment profile
- copied service config from another service family
- one node deployed with stale configuration
Fix direction:
- keep `clusterId` stable per environment and cluster
- keep `serviceId` stable per service family
- validate both values during deployment rollout, not after startup
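
The admission rule implied by the picture above is simple: a peer only joins the same fabric when both identifiers match. A minimal sketch of that rule, with illustrative names (this is not the runtime's actual admission code):

```python
# Hypothetical sketch: both clusterId and serviceId must match for a peer
# to be admitted; a serviceId mismatch is the case that would drive the
# recordServiceIdMismatch metric upward.

def admit_peer(local: dict, remote: dict) -> tuple[bool, str]:
    if local["clusterId"] != remote["clusterId"]:
        return False, "clusterId mismatch"
    if local["serviceId"] != remote["serviceId"]:
        return False, "serviceId mismatch"
    return True, "admitted"

local = {"clusterId": "prod-eu-west", "serviceId": "gateway-service"}
# The node-b case from the ASCII picture: same cluster, wrong service family.
print(admit_peer(local, {"clusterId": "prod-eu-west",
                         "serviceId": "worker-service"}))
```

The same comparison, run as a rollout gate against the intended values, catches stale configuration before the node ever starts.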
## Symptom: Security Or Replay Rejections
Typical signs:
- `recordAuthFailure` or `recordReplayRejection` grows
- replication traffic is dropped or rejected
What to check:
- cluster secret alignment
- nonce window and clock drift settings
- whether one deployment rolled with a different signing configuration
Fix direction:
- align security settings across the cluster
- verify time synchronization assumptions
- inspect the security integration test pattern before changing runtime behavior
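
To build intuition for how the nonce window and clock drift interact, here is a hypothetical sketch of a replay check. It is not the runtime's actual security code: the function, its parameters, and the in-memory `seen` set are illustrative assumptions.

```python
# Hypothetical sketch: a message is rejected when its timestamp falls
# outside the allowed clock-drift window, or when its nonce was already
# seen inside that window -- the two rejection paths named above.

def accept_message(nonce: str, sent_at: float, now: float,
                   seen: set, max_drift_s: float) -> bool:
    if abs(now - sent_at) > max_drift_s:
        return False          # drift too large -> replay rejection
    if nonce in seen:
        return False          # duplicate nonce -> replay rejection
    seen.add(nonce)
    return True

seen = set()
print(accept_message("n1", 100.0, 101.0, seen, 30.0))  # fresh -> accepted
print(accept_message("n1", 100.0, 102.0, seen, 30.0))  # duplicate -> rejected
print(accept_message("n2", 100.0, 200.0, seen, 30.0))  # drift -> rejected
```

Note the drift case: a node whose clock skews past the window rejects perfectly legitimate traffic, which is why time synchronization is listed as something to verify before touching the security configuration.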
## Symptom: Queue Growth Or Backpressure
Typical signs:
- `reportQueueDepth` grows steadily
- `recordBackpressureDecision` increases
- updates appear delayed under load
What to check:
- which locator is showing queue growth
- whether the workload matches the selected collection type and QoS profile
- whether traffic volume is much higher than expected for control-plane usage
Fix direction:
- reduce unexpected write amplification
- confirm you are not using DSM as a bulk data plane
- inspect transport and downstream processing lag
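
The decision that `recordBackpressureDecision` counts is typically a depth-threshold policy. A minimal sketch under that assumption (the thresholds and the three-way outcome are illustrative, not the runtime's actual policy):

```python
# Hypothetical sketch: classify a write against per-locator queue depth.
# Below the soft limit, accept; between the limits, defer (accept but
# signal pressure upstream); at or above the hard limit, reject so the
# caller backs off instead of growing the queue further.

def backpressure_decision(queue_depth: int, soft_limit: int,
                          hard_limit: int) -> str:
    if queue_depth >= hard_limit:
        return "reject"
    if queue_depth >= soft_limit:
        return "defer"
    return "accept"

print(backpressure_decision(10, 100, 500))   # accept
print(backpressure_decision(150, 100, 500))  # defer
print(backpressure_decision(900, 100, 500))  # reject
```

If your metrics show the decision flipping to defer or reject under normal traffic, that is the signal the workload has outgrown control-plane usage, matching the "not a bulk data plane" guidance above.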