Troubleshooting

This page is the operator-facing runbook for common DSM integration failures. Start from the symptom you observe, then work through the matching diagnostics calls and metrics trends together to narrow down the cause.

Universal Triage Flow

```text
DSM issue reported
    |
    +--> runtime.diagnostics().state()
    |       |
    |       +--> not ready -> fix lifecycle first
    |
    +--> runtime.diagnostics().clusterView()
    |       |
    |       +--> peers wrong/missing -> inspect clusterId, serviceId, membership
    |
    +--> runtime.diagnostics().collections()
    |       |
    |       +--> locator missing/wrong -> inspect registration or Spring config
    |
    +--> runtime.diagnostics().leaseCollections()
    |       |
    |       +--> churn/rejects -> inspect lease timings and fencing flow
    |
    +--> DsmMetrics trends
            |
            +--> security, replay, partition, backpressure, queue depth
```
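The triage order above can also be read as an ordered checklist: probe each layer in sequence and act on the first one that looks wrong. A minimal sketch, assuming nothing about the DSM API itself (the step labels simply echo the diagram; the runner class is illustrative):

```java
import java.util.List;

// Illustrative triage checklist. The probe labels mirror the diagram above;
// this class is a sketch for operators, not part of the DSM library.
public class TriageFlow {
    record Step(String probe, String ifBad) {}

    static final List<Step> STEPS = List.of(
        new Step("runtime.diagnostics().state()", "fix lifecycle first"),
        new Step("runtime.diagnostics().clusterView()", "inspect clusterId, serviceId, membership"),
        new Step("runtime.diagnostics().collections()", "inspect registration or Spring config"),
        new Step("runtime.diagnostics().leaseCollections()", "inspect lease timings and fencing flow"),
        new Step("DsmMetrics trends", "security, replay, partition, backpressure, queue depth"));

    // Return the remediation for the first probe the operator flags as bad.
    public static String nextAction(int firstFailingStep) {
        return STEPS.get(firstFailingStep).ifBad();
    }

    public static void main(String[] args) {
        System.out.println(nextAction(0)); // fix lifecycle first
    }
}
```

The point of the ordering is that later probes are unreliable until earlier ones pass: a cluster view means little if the runtime never reached a ready state.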

Symptom: Lease Thrashing

Typical signs:

  • ownership changes too often
  • renewRejectCount grows
  • acquireRejectCount and uncertainAcquireCount grow
  • downstream sees fencing rejections

ASCII picture:

```text
worker-a acquire -> renew fails -> worker-b acquire -> worker-a retries -> churn
```

What to check:

  1. leaseCollections() for renew and acquire rejection counters
  2. whether renew-skew is too small relative to scheduling jitter
  3. whether term is too short for the runtime environment
  4. whether downstream processing ignores fencing and keeps acting as a stale owner

Likely causes:

  • renewals scheduled too late
  • overloaded node misses renewal window
  • duplicate workers competing for the same keys
  • environment pauses causing expiry churn

Fix direction:

  • increase lease term if operationally appropriate
  • keep renewals comfortably inside the renew-skew window
  • verify only one worker should target the key at a time
  • inspect recordLeaseRenew, recordLeaseAcquire, and recordFencingReject metrics
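Check 2 above (renew-skew versus scheduling jitter) reduces to a timing inequality: a renewal must land before the term minus the renew-skew, even under worst-case jitter. A minimal sketch of that arithmetic, with illustrative parameter names rather than DSM API:

```java
// Sketch: does a lease renewal schedule leave enough safety margin?
// Parameter names (termMillis, renewSkewMillis, ...) are illustrative.
public class LeaseTimingCheck {
    /**
     * A renewal scheduled at renewAtMillis (relative to acquisition) must
     * complete before term - renewSkew, even under worst-case jitter.
     */
    public static boolean renewalIsSafe(long termMillis, long renewSkewMillis,
                                        long renewAtMillis, long worstCaseJitterMillis) {
        long deadline = termMillis - renewSkewMillis;
        return renewAtMillis + worstCaseJitterMillis < deadline;
    }

    public static void main(String[] args) {
        // 10s term, 2s skew: renewing at 5s with 1s jitter is safe (6s < 8s)...
        System.out.println(renewalIsSafe(10_000, 2_000, 5_000, 1_000)); // true
        // ...but renewing at 7.5s with 1s jitter is not (8.5s >= 8s deadline).
        System.out.println(renewalIsSafe(10_000, 2_000, 7_500, 1_000)); // false
    }
}
```

If this inequality fails for your environment's observed jitter (GC pauses, container throttling), either increase the term or schedule renewals earlier.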

Symptom: Wrong Or Missing Locator

Typical signs:

  • one node has entries but peers do not
  • expected collection does not appear in collections()
  • data appears isolated even though the runtime is running

ASCII picture:

```text
node-a -> shared/gateway/route-hints
node-b -> shared/worker/route-hints

same entry key
different locator
no shared collection state
```

What to check:

  1. collections() on every affected node
  2. locator triplet: tenantId/applicationId/collectionId
  3. schemaId
  4. collection consistency tier

Likely causes:

  • typo in one locator segment
  • one service registered under a different application domain
  • Spring Boot property mismatch between environments
  • one side changed schemaId or consistency tier incompatibly

Fix direction:

  • make the locator identical on all intended peers
  • keep schemaId aligned for compatible payloads
  • centralize registration to avoid drift between modules
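The core rule is that two nodes share collection state only when every segment of the locator triplet matches. A minimal sketch of that comparison, using an illustrative record rather than the actual DSM locator type:

```java
import java.util.Objects;

// Sketch: a locator names "the same collection" only if every segment matches.
// The triplet shape mirrors tenantId/applicationId/collectionId as described
// above; the record itself is illustrative, not the DSM type.
public class LocatorCheck {
    record Locator(String tenantId, String applicationId, String collectionId) {}

    public static boolean sameCollection(Locator a, Locator b) {
        return Objects.equals(a.tenantId(), b.tenantId())
            && Objects.equals(a.applicationId(), b.applicationId())
            && Objects.equals(a.collectionId(), b.collectionId());
    }

    public static void main(String[] args) {
        // The two nodes from the ASCII picture: one differing segment
        // (applicationId: gateway vs worker) means no shared state.
        Locator nodeA = new Locator("shared", "gateway", "route-hints");
        Locator nodeB = new Locator("shared", "worker", "route-hints");
        System.out.println(sameCollection(nodeA, nodeB)); // false
    }
}
```

Because the comparison is all-or-nothing, a single typo in one segment silently produces two disjoint collections rather than an error, which is why centralized registration matters.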

Symptom: Spring Bean Validation Failure At Startup

Typical signs:

  • application fails during Spring Boot startup
  • error mentions missing codec bean, missing entity factory, duplicate locator, or wrong consistency tier

ASCII picture:

```text
Spring startup
    |
    +--> bind dsm.* properties
    +--> validate collection definitions
    +--> resolve supporting beans
    +--> build DsmRuntime

failure happens here ^ before runtime is usable
```

What to check:

  1. required common fields are present: tenant-id, application-id, collection-id, schema-id, codec-bean
  2. consistency-tier matches type
  3. lease collections set lease.entity-factory-bean
  4. CRDT collections set state-codec-bean, initial-state-bean, and merger-bean
  5. explicit bean-name values are unique

Likely causes:

  • bean name typo
  • missing supporting bean definition
  • copied register config reused for lease or CRDT without required nested fields
  • duplicate locator definitions in one application

Fix direction:

  • compare the failing config to Spring Properties
  • compare the target workload to the matching cookbook page
  • keep one centralized configuration source for all collection definitions
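Putting the required fields from the checklist above into one place can look roughly like the following. This is a hedged sketch only: the `dsm.*` prefix and the field names come from this page, but the exact nesting and property schema are assumptions, so compare against your actual Spring Properties reference before use.

```yaml
# Illustrative only: nesting and values are assumptions, not the documented
# schema. The field names match the startup checklist above.
dsm:
  collections:
    route-hints:
      tenant-id: shared
      application-id: gateway
      collection-id: route-hints
      schema-id: route-hints-v1
      codec-bean: routeHintsCodec       # must resolve to an existing bean
      consistency-tier: register        # must match the collection type
      # lease collections additionally require:
      # lease:
      #   entity-factory-bean: routeHintsLeaseFactory
      # CRDT collections additionally require:
      #   state-codec-bean, initial-state-bean, merger-bean
```

Keeping all definitions in one file like this makes duplicate locators and missing nested fields visible at review time instead of at startup.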

Symptom: clusterId Or serviceId Mismatch

Typical signs:

  • node sees no peers or fewer peers than expected
  • recordServiceIdMismatch rises
  • peers appear present on the network but do not join the same DSM fabric

ASCII picture:

```text
node-a: clusterId=prod-eu-west, serviceId=gateway-service
node-b: clusterId=prod-eu-west, serviceId=worker-service

same network
different service family
membership or replication traffic is rejected/ignored
```

What to check:

  1. clusterView().clusterId() and clusterView().serviceId() on healthy and unhealthy nodes
  2. deployment-time environment variables or property overrides
  3. metrics for recordServiceIdMismatch and cluster admission denial

Likely causes:

  • wrong environment profile
  • copied service config from another service family
  • one node deployed with stale configuration

Fix direction:

  • keep clusterId stable per environment and cluster
  • keep serviceId stable per service family
  • validate both values during deployment rollout, not after startup
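Validating during rollout, as the last point suggests, can be as simple as comparing a node's effective identity against what the environment expects before admitting it to the fleet. A minimal sketch, where the method and the string values are illustrative, not DSM API:

```java
// Sketch: fail fast at rollout when a node's cluster/service identity does
// not match what the environment expects. Names here are illustrative.
public class IdentityCheck {
    public static String validate(String clusterId, String serviceId,
                                  String expectedClusterId, String expectedServiceId) {
        if (!expectedClusterId.equals(clusterId)) {
            return "clusterId mismatch: expected " + expectedClusterId + ", got " + clusterId;
        }
        if (!expectedServiceId.equals(serviceId)) {
            return "serviceId mismatch: expected " + expectedServiceId + ", got " + serviceId;
        }
        return "ok";
    }

    public static void main(String[] args) {
        // node-b from the ASCII picture: same cluster, wrong service family,
        // so it will not join the gateway-service fabric.
        System.out.println(validate("prod-eu-west", "worker-service",
                                    "prod-eu-west", "gateway-service"));
    }
}
```

Running a check like this as a deployment gate catches stale or copied configuration before the node starts emitting recordServiceIdMismatch at runtime.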

Symptom: Security Or Replay Rejections

Typical signs:

  • recordAuthFailure or recordReplayRejection grows
  • replication traffic is dropped or rejected

What to check:

  1. cluster secret alignment
  2. nonce window and clock drift settings
  3. whether one deployment rolled with a different signing configuration

Fix direction:

  • align security settings across the cluster
  • verify time synchronization assumptions
  • inspect the security integration test pattern before changing runtime behavior
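The interaction between the nonce window and clock drift (checks 1-2 above) is easiest to see in miniature: a message is rejected either because its timestamp falls outside the window, or because its nonce was already seen. A sketch of that logic, with illustrative names and thresholds rather than the DSM implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the replay-rejection idea: accept a message only if its timestamp
// is within the nonce window (tolerating clock drift) and its nonce is fresh.
// Window size and names are illustrative, not DSM configuration.
public class ReplayCheck {
    private final Set<String> seenNonces = new HashSet<>();
    private final long windowMillis;

    public ReplayCheck(long windowMillis) { this.windowMillis = windowMillis; }

    public boolean accept(String nonce, long messageTimeMillis, long localTimeMillis) {
        long drift = Math.abs(localTimeMillis - messageTimeMillis);
        if (drift > windowMillis) return false; // outside window: check clock sync
        return seenNonces.add(nonce);           // false on repeat: replay rejected
    }

    public static void main(String[] args) {
        ReplayCheck check = new ReplayCheck(30_000);
        System.out.println(check.accept("n1", 1_000, 2_000));  // true: fresh, in window
        System.out.println(check.accept("n1", 1_000, 3_000));  // false: replayed nonce
        System.out.println(check.accept("n2", 1_000, 60_000)); // false: outside window
    }
}
```

Note the diagnostic split: rejections of fresh nonces point at clock drift or window sizing, while rejections of repeated nonces point at actual replays or duplicate senders.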

Symptom: Queue Growth Or Backpressure

Typical signs:

  • reportQueueDepth grows steadily
  • recordBackpressureDecision increases
  • updates appear delayed under load

What to check:

  1. which locator is showing queue growth
  2. whether the workload matches the selected collection type and QoS profile
  3. whether traffic volume is much higher than expected for control-plane usage

Fix direction:

  • reduce unexpected write amplification
  • confirm you are not using DSM as a bulk data plane
  • inspect transport and downstream processing lag
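When reading reportQueueDepth, the key distinction is steady growth versus a transient spike: only the former signals sustained write amplification or downstream lag. A minimal sketch of that trend check, with illustrative names (this is not the DsmMetrics API):

```java
// Sketch: distinguish steady queue growth from a transient spike by checking
// whether depth rises across every sample. Names are illustrative.
public class QueueTrend {
    public static boolean steadilyGrowing(long[] depthSamples) {
        for (int i = 1; i < depthSamples.length; i++) {
            if (depthSamples[i] <= depthSamples[i - 1]) return false;
        }
        return depthSamples.length >= 2;
    }

    public static void main(String[] args) {
        System.out.println(steadilyGrowing(new long[]{10, 25, 60, 140})); // true: investigate
        System.out.println(steadilyGrowing(new long[]{10, 80, 15, 12}));  // false: spike drained
    }
}
```

A spike that drains on its own usually means a burst the QoS profile absorbed; monotonic growth means intake persistently exceeds drain rate and the workload or collection type needs attention.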