Things service cannot reconnect to mongo #1861
-
Oh, interesting, thanks for sharing @dimabarbul. We run MongoDB Atlas and never faced such an issue during a Mongo cluster upgrade. I think the best way to fix this would be that …
-
We deploy Ditto (tried 3.4.0 and 3.4.4) using the Helm chart. Sometimes, for example during a MongoDB cluster upgrade, we face the following issue.
The things service loses its connection to MongoDB (this is expected, as MongoDB might be temporarily unavailable during the cluster upgrade). But later, when MongoDB is back online, the things service does not reconnect. In the logs we can see errors like:
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'.
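The srvResolutionException above means the MongoDB driver could not resolve the SRV record behind a mongodb+srv:// seed list, so its view of the replica set stays empty (servers=[]). Purely to illustrate the difference (we did not change our configuration, and the MONGO_DB_URI variable name below is an assumption about how the things service picks up its MongoDB URI), an SRV-style connection string depends on a DNS SRV lookup at runtime, while a plain host-list URI does not:

```yaml
# Hypothetical environment snippet for the things container.
# The variable name, database name and placeholder hostnames are assumptions;
# only the two URI formats are the point.
env:
  # SRV seed list: the driver must resolve
  # _mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local via DNS
  # to discover the replica-set members (this is the lookup failing above).
  - name: MONGO_DB_URI
    value: "mongodb+srv://default-psmdb-db-rs0.databases.svc.cluster.local/ditto"
  # Host-list form: the members are given explicitly, so no SRV lookup
  # is involved at all.
  # - name: MONGO_DB_URI
  #   value: "mongodb://rs0-0.databases.svc.cluster.local:27017,rs0-1.databases.svc.cluster.local:27017/ditto?replicaSet=rs0"
```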
But the /alive and /ready health check endpoints still return OK. A request to http://ditto-gateway:8080/health returns 503, as expected. We waited for more than an hour and the issue did not go away. The only place where we found this error surfaced (apart from the logs) is a request to http://ditto-things:8080/status/health. The problem with this approach is that port 8080 is not declared in the Helm chart and is therefore not reachable for Kubernetes health probes.
Restarting the pod fixes the issue, but the restart has to be done manually, because Kubernetes itself does not see anything wrong.
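For completeness, a minimal sketch of what a Kubernetes liveness probe against that endpoint could look like if the chart exposed port 8080 on the things container. The port name, period and threshold are made-up values, and this assumes /status/health answers with a non-2xx code while persistence is DOWN:

```yaml
# Hypothetical fragment of the things container spec -- not part of the
# current Helm chart, which does not expose port 8080.
ports:
  - name: status
    containerPort: 8080
livenessProbe:
  httpGet:
    path: /status/health
    port: 8080
  periodSeconds: 30
  failureThreshold: 4   # tolerate short MongoDB outages before restarting the pod
```

The endpoint can be checked by hand first (as we did via http://ditto-things:8080/status/health) to confirm that it really reports the persistence failure before wiring it into a probe.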
Logs from the things service that might be related:
{"@timestamp":"2024-01-11T12:52:00.028+01:00","@Version":"1","message":"Stream failed","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"ERROR","level_value":40000,"stack_trace":"java.util.concurrent.CompletionException: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:26)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:23)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:504)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:278)\n\tat scala.concurrent.Promise.tryFailure(Promise.scala:117)\n\tat scala.concurrent.Promise.tryFailure$(Promise.scala:117)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:104)\n\tat org.apache.pekko.stream.impl.SeqStage$$anon$3.onUpstreamFailure(Sinks.scala:274)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:537)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:400)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:662)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute(ActorGraphInterpreter.scala:71)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute$(ActorGraphInterpreter.scala:67)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$BatchingActorInputBoundary$OnError.execute(ActorGraphInterpreter.scala:92)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:637)\n\tat 
org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.org$apache$pekko$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:813)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:831)\n\tat org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)\n\tat org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:729)\n\tat org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)\n\tat org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)\n\tat org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)\n\tat org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)\n\tat kamon.instrumentation.executor.ExecutorInstrumentation$InstrumentedForkJoinPool$TimingRunnable.run(ExecutorInstrumentation.scala:698)\n\tat org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)\nCaused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-37","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:00.029+01:00","@Version":"1","message":"Stream failed or shutdown. Next stream in starting from <>","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"INFO","level_value":20000,"sourceThread":"ditto-cluster-pekko.actor.default-dispatcher-11","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:07.879+01:00","@Version":"1","message":"Sending the health of this system as requested: StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"See detailed messages for: persistence.\"]], children=[StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. 
Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\"]], children=[], label=persistence]], label=null]","logger_name":"org.eclipse.ditto.internal.utils.health.status.StatusSupplierActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-6","level":"INFO","level_value":20000,"sourceThread":"Thread-38","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/statusSupplier"} {"@timestamp":"2024-01-11T12:52:12.009+01:00","@Version":"1","message":"Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]","logger_name":"org.eclipse.ditto.internal.utils.persistence.mongo.MongoHealthChecker","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-11","level":"ERROR","level_value":40000,"stack_trace":"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-39","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/healthCheckingActor/persistence"}