ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

freeart · 2019-08-02T09:31:26Z

OrientDB Version: 3.1.0 from develop branch

Java Version: docker openjdk:8-jdk-alpine

OS: docker openjdk:8-jdk-alpine

We have cluster of 2 orientdb in production and we get an error

com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (100ms) on acquiring lock on record #82:-1 on server 'odb1'. It is locked by request null

It happens 2 times in 1 month
Id records is always #82:-1 or #83:-1 and request is null
After that message the table is blocked for inserting until restart orientdb server
We did kill -5 before restart on both of dbs
The incident was about 1pm we think
In attachment you can find our config, docker-compose file, docker logs and kill -5 logs
issue.zip

The text was updated successfully, but these errors were encountered:

jonsalvas · 2019-08-06T16:44:57Z

Hi guys,

we are facing the same issue with 3.0.22. The warning itself is not the biggest issue as we implemented a retry logic to retry x times (as described in the documentation) until the lock is released and the transaction can be committed. The main issue is that after a certain amount of time (and after continuously inserting data into the database with JMeter and 10 threads in parallel) the database is completely locking for some time while the above WARN msg appear in log. Until the response from the database there is no way to read from or write into the database at all!!! Even if we try to update/insert other vertices. It is expected behavior that the update transaction of that specific vertex is locking others trying to write into that vertex but it seems that it is simply locking everything ! After a certain amount of time (may take up to one minute) the database is accepting requests again is however still logging the above message. By the way it makes no difference whether we are using "synchronous" or "asynchronous" or "undefined", the error still appears. In our preproduction database we also noticed some deadlock errors which we unfortunately were unable to reproduce locally.

We wrote a minimal script to reproduce the issue. Prerequisite is a schema with Class "TEST" and one Vertex with attribute with value=5. We attached the minimal javascript, some thread dumps, the used configs, logs and the JMeter Test for deeper analysis. As little side information: We are running the recent orientdb 3.0.22 image in kubernetes (using a helm chart from https://github.com/helm/charts/tree/master/incubator/orientdb).

debugging-orientdb.zip

@lvca @luigidellaquila Could you please assist with this issue as we have a production release approaching in a couple of weeks and this issue is a real blocker for us running orientdb in distributed mode.

There are several similar issues such as #8691 #7856 #8663 #8742 but they don't seem to be fixed or adressed.

freeart · 2019-08-08T08:56:00Z

@luigidellaquila We use orientdb in the public health system in a big EU country and we can't just restart db by every problem because of the regulatory agency. When it is happening the table is blocked to insert. It is very critical issue.

freeart · 2019-08-20T09:15:20Z

@luigidellaquila I wrote something stupid? Why nobody answers? We still have this issue and can't use orientdb literally as production product

wolf4ood · 2019-08-20T09:25:22Z

Hi @freeart

sorry for the delay, first of all i see that you are you using 3.1 from develop branch.

Is there a reason for that?

Are you using OrientDB from Java client?

luigidellaquila · 2019-08-20T09:26:13Z

Hi @freeart

Sorry for my late reply, I was away for two weeks and I still have some backlog...
Let me check it, I'll give you a feedback ASAP

Thanks

Luigi

freeart · 2019-08-20T09:35:13Z

@wolf4ood Hi, thanks for your reply. If you look how many problems I found in the stable branch you can imagine why we use develop version. Issues what I found
I'm using the rest api because of many problems in nodejs driver for 3.x version. I just wrote my tiny rest wrapper on orientdb rest api.
@luigidellaquila Thanks. Bosses want to "kill" me for my choice of db=) Maybe it's our fault in config, who knows. We have no strong devops.

wolf4ood · 2019-08-20T09:47:20Z

Hi @freeart

3.0.x as 3.1.x contains always the latest fix. 3.1.x . could be general more unstable since it's development of the new version of ODB.

Which kind of issue did you get with the nodejs driver?

Are you using some kind of load balancing for the 2 nodes?

Thanks

freeart · 2019-08-20T10:14:50Z

@wolf4ood I started use orientdb 3.x from beta and nodejs driver was super unstable and didn't know about new features 3.x and I was tired to read stackoverflow every problem I found in driver and I wrote my own.
Things like those are fundamental to me and I couldn't wait the stable because of lag between the develop branch and the stable snapshots:
#8662
#8658
#8647
#8646
First month I tried to use stable branch but issues was so critical like I needed to implement something in 2 days but db didn't do properly like #8543 I needed to do two queries insert and select instead of single insert return.

We didn't use load balancing. Should we use it?

wolf4ood · 2019-08-20T10:17:34Z

@freeart

i just wanted to know if you are writing concurrently in each node

Thanks

freeart · 2019-08-20T10:19:11Z

@wolf4ood I understand what you mean. I will ask our devops

freeart · 2019-08-20T11:52:30Z

@wolf4ood I confirm. We have orientdb in cluster on two real servers (there are in the same network) and we are writing to the first one and reading as well.

wolf4ood · 2019-08-20T12:21:49Z

@freeart

rids #82:-1 or #83:-1 means that there is some problem with insert operation. since those are temporary record id.

The strange thing is the deadlock after the exception. The exception itself it can happen while waiting for getting the lock on a resource. Usually it is solved by using a retry mechanism.

are you able to reproduce this on a test cluster?

freeart · 2019-08-20T12:41:51Z

@wolf4ood No, we can't reproduce it. But on production it happens often. I attached logs and seems like nodes of cluster have lost connection between them for 2 days and has happened deadlock after and I don't know it was cause or not.

freeart · 2019-08-20T12:48:11Z

Maybe @jonsalvas can help us. Seems like he wrote a test scenario. I don't understand exactly what is it in his attachment

wolf4ood · 2019-08-20T12:55:21Z

@jonsalvas use case is slightly different since it can be cause due load balancing which trigger concurrent writes on each node, but we exclude that for your use case.

wolf4ood · 2019-08-20T13:00:08Z

@freeart

one info that can help us identifying the deadlock is that when it happens, execute a thread dump of both of the nodes and share it with us

https://dzone.com/articles/how-to-take-thread-dumps-7-options

freeart · 2019-08-20T13:02:54Z

@wolf4ood Ok I will. Can you explain me when I write to first node and has happened network problem between nodes. What happened after it has connected? The second one needs to get new records from the first one to sync? If we are still writing to the first one can it be cause of issue?

wolf4ood · 2019-08-20T13:11:20Z

Hi @freeart

Not sure about the network problem, Did you experience some physical networking issue?

It should not cause this issue, writes are blocked meanwhile the records are being transfert, at least with community edition.

freeart · 2019-08-20T13:25:40Z

@wolf4ood I don't know what cause is it. Just see some times in logs
WARNI [odb2]->[odb1] Server 'odb1' did not respond to the gossip message (db=nvd, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
Some times: WARNI [odb2] Timeout (45001ms) on waiting for synchronous responses from nodes=[odb1] responsesSoFar=[] request=(id=0.107015 task=upd_db_status) [ODistributedDatabaseImpl]
So, next time when deadlock happens I will do jstack -l pid > ./threadDump.txt on both docker machines

jonsalvas · 2019-08-23T08:05:30Z

This morning I received the message in our Production environment:

What happened:

one of the nodes (in distributed cluster with 3 nodes) fell down due to OOM
everything worked fine, data was still received by the two other nodes
I restarted the node
Now one of the nodes started sending chunks to the restarted node. Meanwhile the restarted node already received data and threw a lot of 'Node is not online yet' in the log (which is fine I guess).
When finished the restarted node reported "ONLINE" to all others but received " Updated node status to 'ONLINE' [OHazelcastPlugin]Cannot route TX operation against distributed node
java.lang.NullPointerException"
Finally I got a lot of messages like: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (100ms) on acquiring lock on record Add trigger feature [moved] #3:-1 on server 'xxxx-odb-orientdb-1'. It is locked by request null
Regarding "retry": we have a retry logic implemented, which does not help (it is retrying 100 times) and then failing and requeing in our rabbitmq.
The database is not accepting any data anymore it seems that the database is in a deadlock situation from which it is not able to recover from. I have to restart everything which is really really bad!

I attached the log.

logs-from-orientdb-in-prd-xxx-odb-orientdb-1.txt

freeart · 2019-08-23T08:10:03Z

@jonsalvas Did you do the memory dump via jstack -l pid > ./threadDump.txt?

jonsalvas · 2019-08-23T08:19:25Z

@freeart No sorry forgot to do that. I was too nervous about getting our production database running again :-). Next time it happens I will definitely do it.

freeart · 2019-08-23T08:25:40Z

@jonsalvas We stopped to use cluster mode on production and we are looking for an another db as "plan B" solution. I don't want to wait for the "next time", because we can lose our project.

jonsalvas · 2019-08-23T08:50:17Z

@freeart That's sad to hear. To be honest: We were also thinking about switching to ArangoDB as Plan B. The problem is that we now have a buy-in due to orient-specific code which we would need to migrate. We therefore hope we can fix our problems asap.

wolf4ood · 2019-08-23T15:59:33Z

Hi @jonsalvas

We did a couple of fix on distributed, which could lead to the deadlock situation.

They should be available in 3.0.23.
Let me know if you are able to upgrade it and feedback if this issue is solved

Thanks

ygyg70 · 2019-08-31T10:21:18Z

I updated to 3.0.23, still experiencing one of the issues mentioned here:
#8742
See also
#8994
I am currently blocked

ahill134 · 2019-09-04T14:36:07Z

Also was running into similar issues and upgraded to 3.0.23. I am no longer receiving the acquiring lock issue however i am now receiving this:

INFO: Caught Network I/O errors on SERVERNAME:2424/INDEXNAME, trying an automatic reconnection... (error: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140) com.orientechnologies.common.io.OIOException: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140 at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:450) at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:331) at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:361) at com.orientechnologies.orient.client.remote.OStorageRemote.execute(OStorageRemote.java:1080) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentRemote.execute(ODatabaseDocumentRemote.java:434) at com.optum.tech360.orient.OrientDB3Util.executeQuery(OrientDB3Util.java:202) at com.optum.crispr.util.EsUtil.buildRecordsList(EsUtil.java:92) at com.optum.crispr.util.EsUtil.getIndexRecords(EsUtil.java:117) at com.optum.crispr.ThreadedOrientLoader.runOrientLoader(ThreadedOrientLoader.java:68) at com.optum.crispr.ThreadedOrientLoader.run(ThreadedOrientLoader.java:37) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.InvalidClassException: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.deserializeException(OChannelBinaryAsynchClient.java:322) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:284) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:307) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:196) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:158) at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1953) at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:348) at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:407) ... 14 more

Java Version:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

OS:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"

Orient DB Version: 3.0.23

After receiving this error I cannot insert into the database at all but I can create new classes/properties. It is very similar symptoms to the acquiring lock/deadlock issue, so i believe it is coming from the same source.

freeart · 2019-10-02T08:55:25Z

I got the issue about deadlock in single node configuration. Seems like it's not a problem about communication between nodes @wolf4ood

wolf4ood · 2019-10-02T09:04:35Z

@freeart

can you capture the thread dump when is in deadlock and send it to me at e.risa@sap.com ?

You can use on of the following methods

https://dzone.com/articles/how-to-take-thread-dumps-7-options

Thanks

freeart · 2019-10-08T08:19:11Z

@wolf4ood I can't capture the dump because of docker

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  1     0 root     S    5835m  71%   3   0% /usr/lib/jvm/java-1.8-openjdk/bin/java -server -Xms4G -Xmx4G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -Djava.awt.headless=true -Dfile.encoding=UTF8 -D

/orientdb # jstack -l 1 > ./dump.txt
1: Unable to get pid of LinuxThreads manager thread

jonsalvas · 2019-11-21T11:36:21Z

@jonsalvas use case is slightly different since it can be cause due load balancing which trigger concurrent writes on each node, but we exclude that for your use case.

@wolf4ood Could you please elaborate on this? How is this intended in a HA environment? Do we have to send all our write operations to a single node only, instead of a load balancer? How to handle HA, if the node goes down? Should the client handle that? etc. etc. Would be nice to see a best practice architecture. The documentation is not really helpful in this case.

We are currently testing 3.0.25 in production. It seems to run more stable, but we keep monitoring. As soon as the issue reappears I will attach the heap dump

ozgursucu · 2022-02-26T13:36:59Z

@tglman what is the current situation of this problem? We are also facing this problem frequently in our production environment.

tglman self-assigned this Jun 1, 2020

tglman added the bug label Jun 1, 2020

tglman added this to the 3.1.x milestone Jun 1, 2020

andrii0lomakin closed this as completed Aug 6, 2021

tglman modified the milestones: 3.1.x, 3.1.16 Feb 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

freeart commented Aug 2, 2019

jonsalvas commented Aug 6, 2019 •

edited

Loading

freeart commented Aug 8, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

luigidellaquila commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019 •

edited

Loading

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019 •

edited

Loading

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

jonsalvas commented Aug 23, 2019

freeart commented Aug 23, 2019

jonsalvas commented Aug 23, 2019

freeart commented Aug 23, 2019

jonsalvas commented Aug 23, 2019

wolf4ood commented Aug 23, 2019

ygyg70 commented Aug 31, 2019

ahill134 commented Sep 4, 2019 •

edited

Loading

freeart commented Oct 2, 2019

wolf4ood commented Oct 2, 2019

freeart commented Oct 8, 2019

jonsalvas commented Nov 21, 2019 •

edited

Loading

ozgursucu commented Feb 26, 2022 •

edited

Loading

ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

Comments

freeart commented Aug 2, 2019

OrientDB Version: 3.1.0 from develop branch

Java Version: docker openjdk:8-jdk-alpine

OS: docker openjdk:8-jdk-alpine

jonsalvas commented Aug 6, 2019 • edited Loading

freeart commented Aug 8, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

luigidellaquila commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019 • edited Loading

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

freeart commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019 • edited Loading

wolf4ood commented Aug 20, 2019

freeart commented Aug 20, 2019

jonsalvas commented Aug 23, 2019

freeart commented Aug 23, 2019

jonsalvas commented Aug 23, 2019

freeart commented Aug 23, 2019

jonsalvas commented Aug 23, 2019

wolf4ood commented Aug 23, 2019

ygyg70 commented Aug 31, 2019

ahill134 commented Sep 4, 2019 • edited Loading

freeart commented Oct 2, 2019

wolf4ood commented Oct 2, 2019

freeart commented Oct 8, 2019

jonsalvas commented Nov 21, 2019 • edited Loading

ozgursucu commented Feb 26, 2022 • edited Loading

jonsalvas commented Aug 6, 2019 •

edited

Loading

freeart commented Aug 20, 2019 •

edited

Loading

freeart commented Aug 20, 2019 •

edited

Loading

ahill134 commented Sep 4, 2019 •

edited

Loading

jonsalvas commented Nov 21, 2019 •

edited

Loading

ozgursucu commented Feb 26, 2022 •

edited

Loading