Troubleshooting

This page is the operator-facing runbook for common DSM integration failures. Start from the symptom you observe, then work through the matching diagnostics calls and metrics trends together to narrow down the cause.

Universal Triage Flow

```text
DSM issue reported
    |
    +--> runtime.diagnostics().state()
    |       |
    |       +--> not ready -> fix lifecycle first
    |
    +--> runtime.diagnostics().clusterView()
    |       |
    |       +--> peers wrong/missing -> inspect clusterId, serviceId, membership
    |
    +--> runtime.diagnostics().collections()
    |       |
    |       +--> locator missing/wrong -> inspect registration or Spring config
    |
    +--> runtime.diagnostics().leaseCollections()
    |       |
    |       +--> churn/rejects -> inspect lease timings and fencing flow
    |
    +--> DsmMetrics trends
            |
            +--> security, replay, partition, backpressure, queue depth
```
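The triage order above can also be read as an ordered checklist: probe each layer in sequence and act on the first one that looks wrong. A minimal sketch, assuming nothing about the DSM API itself (the step labels simply echo the diagram; the runner class is illustrative):

```java
import java.util.List;

// Illustrative triage checklist. The probe labels mirror the diagram above;
// this class is a sketch for operators, not part of the DSM library.
public class TriageFlow {
    record Step(String probe, String ifBad) {}

    static final List<Step> STEPS = List.of(
        new Step("runtime.diagnostics().state()", "fix lifecycle first"),
        new Step("runtime.diagnostics().clusterView()", "inspect clusterId, serviceId, membership"),
        new Step("runtime.diagnostics().collections()", "inspect registration or Spring config"),
        new Step("runtime.diagnostics().leaseCollections()", "inspect lease timings and fencing flow"),
        new Step("DsmMetrics trends", "security, replay, partition, backpressure, queue depth"));

    // Return the remediation for the first probe the operator flags as bad.
    public static String nextAction(int firstFailingStep) {
        return STEPS.get(firstFailingStep).ifBad();
    }

    public static void main(String[] args) {
        System.out.println(nextAction(0)); // fix lifecycle first
    }
}
```

The point of the ordering is that later probes are unreliable until earlier ones pass: a cluster view means little if the runtime never reached a ready state.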

Symptom: Lease Thrashing

Typical signs:

  • ownership changes too often
  • renewRejectCount grows
  • acquireRejectCount and uncertainAcquireCount grow
  • downstream sees fencing rejections

ASCII picture:

```text
worker-a acquire -> renew fails -> worker-b acquire -> worker-a retries -> churn
```

What to check:

  1. leaseCollections() for renew and acquire rejection counters
  2. whether renew-skew is too small relative to scheduling jitter
  3. whether term is too short for the runtime environment
  4. whether downstream processing ignores fencing and keeps acting as a stale owner

Likely causes:

  • renewals scheduled too late
  • overloaded node misses renewal window
  • duplicate workers competing for the same keys
  • environment pauses causing expiry churn

Fix direction:

  • increase lease term if operationally appropriate
  • keep renewals comfortably inside the renew-skew window
  • verify only one worker should target the key at a time
  • inspect recordLeaseRenew, recordLeaseAcquire, and recordFencingReject metrics
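Check 2 above (renew-skew versus scheduling jitter) reduces to a timing inequality: a renewal must land before the term minus the renew-skew, even under worst-case jitter. A minimal sketch of that arithmetic, with illustrative parameter names rather than DSM API:

```java
// Sketch: does a lease renewal schedule leave enough safety margin?
// Parameter names (termMillis, renewSkewMillis, ...) are illustrative.
public class LeaseTimingCheck {
    /**
     * A renewal scheduled at renewAtMillis (relative to acquisition) must
     * complete before term - renewSkew, even under worst-case jitter.
     */
    public static boolean renewalIsSafe(long termMillis, long renewSkewMillis,
                                        long renewAtMillis, long worstCaseJitterMillis) {
        long deadline = termMillis - renewSkewMillis;
        return renewAtMillis + worstCaseJitterMillis < deadline;
    }

    public static void main(String[] args) {
        // 10s term, 2s skew: renewing at 5s with 1s jitter is safe (6s < 8s)...
        System.out.println(renewalIsSafe(10_000, 2_000, 5_000, 1_000)); // true
        // ...but renewing at 7.5s with 1s jitter is not (8.5s >= 8s deadline).
        System.out.println(renewalIsSafe(10_000, 2_000, 7_500, 1_000)); // false
    }
}
```

If this inequality fails for your environment's observed jitter (GC pauses, container throttling), either increase the term or schedule renewals earlier.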

Symptom: Wrong Or Missing Locator

Typical signs:

  • one node has entries but peers do not
  • expected collection does not appear in collections()
  • data appears isolated even though the runtime is running

ASCII picture:

```text
node-a -> shared/gateway/route-hints
node-b -> shared/worker/route-hints

same entry key
different locator
no shared collection state
```

What to check:

  1. collections() on every affected node
  2. locator triplet: tenantId/applicationId/collectionId
  3. schemaId
  4. collection consistency tier

Likely causes:

  • typo in one locator segment
  • one service registered under a different application domain
  • Spring Boot property mismatch between environments
  • one side changed schemaId or consistency tier incompatibly

Fix direction:

  • make the locator identical on all intended peers
  • keep schemaId aligned for compatible payloads
  • centralize registration to avoid drift between modules
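The core rule is that two nodes share collection state only when every segment of the locator triplet matches. A minimal sketch of that comparison, using an illustrative record rather than the actual DSM locator type:

```java
import java.util.Objects;

// Sketch: a locator names "the same collection" only if every segment matches.
// The triplet shape mirrors tenantId/applicationId/collectionId as described
// above; the record itself is illustrative, not the DSM type.
public class LocatorCheck {
    record Locator(String tenantId, String applicationId, String collectionId) {}

    public static boolean sameCollection(Locator a, Locator b) {
        return Objects.equals(a.tenantId(), b.tenantId())
            && Objects.equals(a.applicationId(), b.applicationId())
            && Objects.equals(a.collectionId(), b.collectionId());
    }

    public static void main(String[] args) {
        // The two nodes from the ASCII picture: one differing segment
        // (applicationId: gateway vs worker) means no shared state.
        Locator nodeA = new Locator("shared", "gateway", "route-hints");
        Locator nodeB = new Locator("shared", "worker", "route-hints");
        System.out.println(sameCollection(nodeA, nodeB)); // false
    }
}
```

Because the comparison is all-or-nothing, a single typo in one segment silently produces two disjoint collections rather than an error, which is why centralized registration matters.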

Symptom: Spring Bean Validation Failure At Startup

Typical signs:

  • application fails during Spring Boot startup
  • error mentions missing codec bean, missing entity factory, duplicate locator, or wrong consistency tier

ASCII picture:

```text
Spring startup
    |
    +--> bind dsm.* properties
    +--> validate collection definitions
    +--> resolve supporting beans
    +--> build DsmRuntime

failure happens here ^ before runtime is usable
```

What to check:

  1. required common fields are present: tenant-id, application-id, collection-id, schema-id, codec-bean
  2. consistency-tier matches type
  3. lease collections set lease.entity-factory-bean
  4. CRDT collections set state-codec-bean, initial-state-bean, and merger-bean
  5. explicit bean-name values are unique

Likely causes:

  • bean name typo
  • missing supporting bean definition
  • copied register config reused for lease or CRDT without required nested fields
  • duplicate locator definitions in one application

Fix direction:

  • compare the failing config to Spring Properties
  • compare the target workload to the matching cookbook page
  • keep one centralized configuration source for all collection definitions
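Putting the required fields from the checklist above into one place can look roughly like the following. This is a hedged sketch only: the `dsm.*` prefix and the field names come from this page, but the exact nesting and property schema are assumptions, so compare against your actual Spring Properties reference before use.

```yaml
# Illustrative only: nesting and values are assumptions, not the documented
# schema. The field names match the startup checklist above.
dsm:
  collections:
    route-hints:
      tenant-id: shared
      application-id: gateway
      collection-id: route-hints
      schema-id: route-hints-v1
      codec-bean: routeHintsCodec       # must resolve to an existing bean
      consistency-tier: register        # must match the collection type
      # lease collections additionally require:
      # lease:
      #   entity-factory-bean: routeHintsLeaseFactory
      # CRDT collections additionally require:
      #   state-codec-bean, initial-state-bean, merger-bean
```

Keeping all definitions in one file like this makes duplicate locators and missing nested fields visible at review time instead of at startup.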

Symptom: clusterId Or serviceId Mismatch

Typical signs:

  • node sees no peers or fewer peers than expected
  • recordServiceIdMismatch rises
  • peers appear present on the network but do not join the same DSM fabric

ASCII picture:

```text
node-a: clusterId=prod-eu-west, serviceId=gateway-service
node-b: clusterId=prod-eu-west, serviceId=worker-service

same network
different service family
membership or replication traffic is rejected/ignored
```

What to check:

  1. clusterView().clusterId() and clusterView().serviceId() on healthy and unhealthy nodes
  2. deployment-time environment variables or property overrides
  3. metrics for recordServiceIdMismatch and cluster admission denial

Likely causes:

  • wrong environment profile
  • copied service config from another service family
  • one node deployed with stale configuration

Fix direction:

  • keep clusterId stable per environment and cluster
  • keep serviceId stable per service family
  • validate both values during deployment rollout, not after startup
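Validating during rollout, as the last point suggests, can be as simple as comparing a node's effective identity against what the environment expects before admitting it to the fleet. A minimal sketch, where the method and the string values are illustrative, not DSM API:

```java
// Sketch: fail fast at rollout when a node's cluster/service identity does
// not match what the environment expects. Names here are illustrative.
public class IdentityCheck {
    public static String validate(String clusterId, String serviceId,
                                  String expectedClusterId, String expectedServiceId) {
        if (!expectedClusterId.equals(clusterId)) {
            return "clusterId mismatch: expected " + expectedClusterId + ", got " + clusterId;
        }
        if (!expectedServiceId.equals(serviceId)) {
            return "serviceId mismatch: expected " + expectedServiceId + ", got " + serviceId;
        }
        return "ok";
    }

    public static void main(String[] args) {
        // node-b from the ASCII picture: same cluster, wrong service family,
        // so it will not join the gateway-service fabric.
        System.out.println(validate("prod-eu-west", "worker-service",
                                    "prod-eu-west", "gateway-service"));
    }
}
```

Running a check like this as a deployment gate catches stale or copied configuration before the node starts emitting recordServiceIdMismatch at runtime.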

Symptom: Security Or Replay Rejections

Typical signs:

  • recordAuthFailure or recordReplayRejection grows
  • replication traffic is dropped or rejected

What to check:

  1. cluster secret alignment
  2. nonce window and clock drift settings
  3. whether one deployment rolled with a different signing configuration

Fix direction:

  • align security settings across the cluster
  • verify time synchronization assumptions
  • inspect the security integration test pattern before changing runtime behavior
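The interaction between the nonce window and clock drift (checks 1-2 above) is easiest to see in miniature: a message is rejected either because its timestamp falls outside the window, or because its nonce was already seen. A sketch of that logic, with illustrative names and thresholds rather than the DSM implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the replay-rejection idea: accept a message only if its timestamp
// is within the nonce window (tolerating clock drift) and its nonce is fresh.
// Window size and names are illustrative, not DSM configuration.
public class ReplayCheck {
    private final Set<String> seenNonces = new HashSet<>();
    private final long windowMillis;

    public ReplayCheck(long windowMillis) { this.windowMillis = windowMillis; }

    public boolean accept(String nonce, long messageTimeMillis, long localTimeMillis) {
        long drift = Math.abs(localTimeMillis - messageTimeMillis);
        if (drift > windowMillis) return false; // outside window: check clock sync
        return seenNonces.add(nonce);           // false on repeat: replay rejected
    }

    public static void main(String[] args) {
        ReplayCheck check = new ReplayCheck(30_000);
        System.out.println(check.accept("n1", 1_000, 2_000));  // true: fresh, in window
        System.out.println(check.accept("n1", 1_000, 3_000));  // false: replayed nonce
        System.out.println(check.accept("n2", 1_000, 60_000)); // false: outside window
    }
}
```

Note the diagnostic split: rejections of fresh nonces point at clock drift or window sizing, while rejections of repeated nonces point at actual replays or duplicate senders.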

Symptom: Queue Growth Or Backpressure

Typical signs:

  • reportQueueDepth grows steadily
  • recordBackpressureDecision increases
  • updates appear delayed under load

What to check:

  1. which locator is showing queue growth
  2. whether the workload matches the selected collection type and QoS profile
  3. whether traffic volume is much higher than expected for control-plane usage

Fix direction:

  • reduce unexpected write amplification
  • confirm you are not using DSM as a bulk data plane
  • inspect transport and downstream processing lag
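When reading reportQueueDepth, the key distinction is steady growth versus a transient spike: only the former signals sustained write amplification or downstream lag. A minimal sketch of that trend check, with illustrative names (this is not the DsmMetrics API):

```java
// Sketch: distinguish steady queue growth from a transient spike by checking
// whether depth rises across every sample. Names are illustrative.
public class QueueTrend {
    public static boolean steadilyGrowing(long[] depthSamples) {
        for (int i = 1; i < depthSamples.length; i++) {
            if (depthSamples[i] <= depthSamples[i - 1]) return false;
        }
        return depthSamples.length >= 2;
    }

    public static void main(String[] args) {
        System.out.println(steadilyGrowing(new long[]{10, 25, 60, 140})); // true: investigate
        System.out.println(steadilyGrowing(new long[]{10, 80, 15, 12}));  // false: spike drained
    }
}
```

A spike that drains on its own usually means a burst the QoS profile absorbed; monotonic growth means intake persistently exceeds drain rate and the workload or collection type needs attention.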