ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974
Comments
Hi guys, we are facing the same issue with 3.0.22. The warning itself is not the biggest issue, as we implemented retry logic that retries x times (as described in the documentation) until the lock is released and the transaction can be committed.
The main issue is that after a certain amount of time (continuously inserting data into the database with JMeter and 10 parallel threads) the database locks up completely for a while and the above WARN message appears in the log. Until the database responds again there is no way to read from or write to it at all, even if we try to update or insert other vertices. It is expected behavior that an update transaction on a specific vertex blocks others trying to write to that vertex, but it seems to lock everything. After a certain amount of time (it may take up to one minute) the database accepts requests again, but it still logs the above message.
By the way, it makes no difference whether we use "synchronous", "asynchronous" or "undefined"; the error still appears. In our preproduction database we also noticed some deadlock errors, which we were unfortunately unable to reproduce locally.
We wrote a minimal script to reproduce the issue. The prerequisite is a schema with class "TEST" and one vertex with an attribute with value=5. We attached the minimal JavaScript, some thread dumps, the configs used, logs and the JMeter test for deeper analysis. As a side note: we are running the recent orientdb 3.0.22 image in Kubernetes (using a Helm chart from https://github.com/helm/charts/tree/master/incubator/orientdb).
@lvca @luigidellaquila Could you please assist with this issue? We have a production release approaching in a couple of weeks, and this issue is a real blocker for us running OrientDB in distributed mode. There are several similar issues, such as #8691 #7856 #8663 #8742, but they don't seem to be fixed or addressed. |
@luigidellaquila We use OrientDB in the public health system of a big EU country and, because of the regulatory agency, we can't just restart the db on every problem. When it happens, the table is blocked for inserts. It is a very critical issue. |
@luigidellaquila Did I write something stupid? Why does nobody answer? We still have this issue and literally can't use OrientDB as a production product. |
Hi @freeart sorry for the delay, first of all I see that you are using 3.1 from the develop branch. Is there a reason for that? Are you using OrientDB from the Java client? |
Hi @freeart Sorry for my late reply, I was away for two weeks and I still have some backlog... Thanks Luigi |
@wolf4ood Hi, thanks for your reply. If you look at how many problems I found in the stable branch, you can imagine why we use the develop version. Issues that I found: |
Hi @freeart 3.0.x, like 3.1.x, always contains the latest fixes. 3.1.x could in general be more unstable, since it is the development line of the new version of ODB. Which kind of issue did you get with the Node.js driver? Are you using some kind of load balancing for the 2 nodes? Thanks |
@wolf4ood I started using OrientDB 3.x from the beta; the Node.js driver was very unstable and didn't know about the new 3.x features. I got tired of reading Stack Overflow for every problem I found in the driver, so I wrote my own. We didn't use load balancing. Should we use it? |
I just wanted to know if you are writing concurrently to each node. Thanks |
@wolf4ood I understand what you mean. I will ask our devops |
@wolf4ood I confirm. We run OrientDB as a cluster on two real servers (they are in the same network), and we both write to and read from the first one. |
The strange thing is the deadlock after the exception. The exception itself can happen while waiting to get the lock on a resource; usually it is solved by using a retry mechanism. Are you able to reproduce this on a test cluster? |
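For readers following the thread, the retry mechanism mentioned above can be sketched roughly as follows. This is only an illustration, not OrientDB's API: `RetrySketch`, `withRetry` and the backoff parameters are hypothetical names, and a plain `IllegalStateException` stands in for the lock-timeout exception. In real code the predicate would test for the exception type named in the issue title.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

// Hypothetical helper, not part of OrientDB: retries an operation a bounded
// number of times when it fails with an exception the caller marks as retryable.
final class RetrySketch {

    static <T> T withRetry(Callable<T> op, int maxAttempts, long baseDelayMs,
                           Predicate<Exception> isRetryable) throws Exception {
        if (maxAttempts < 1 || baseDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and baseDelayMs >= 0");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a failure that is not retryable.
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                // Exponential backoff with random jitter, so that concurrent
                // writers do not all retry at the same instant.
                long cap = baseDelayMs << Math.min(attempt - 1, 10);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transaction: fails twice with a stand-in "locked" error, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("record locked (simulated)");
            }
            return "committed";
        }, 5, 10, e -> e instanceof IllegalStateException);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Bounding the number of attempts matters here: as reported in this thread, the lock is sometimes never released, and an unbounded retry loop would then hang the client as well.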
@wolf4ood No, we can't reproduce it, but in production it happens often. I attached logs; it seems the cluster nodes lost the connection between them for 2 days and the deadlock happened afterwards. I don't know whether that was the cause or not. |
Maybe @jonsalvas can help us. It seems he wrote a test scenario. I don't understand exactly what is in his attachment. |
@jonsalvas' use case is slightly different, since it can be caused by load balancing, which triggers concurrent writes on each node, but we exclude that for your use case. |
One piece of information that can help us identify the deadlock: when it happens, take a thread dump of both nodes and share it with us https://dzone.com/articles/how-to-take-thread-dumps-7-options |
@wolf4ood OK, I will. Can you explain what happens when I am writing to the first node and a network problem occurs between the nodes? What happens after they reconnect? Does the second one need to get the new records from the first one to sync? If we are still writing to the first one, can that be the cause of the issue? |
Hi @freeart Not sure about the network problem. Did you experience some physical networking issue? It should not cause this issue; writes are blocked while the records are being transferred, at least with the community edition. |
@wolf4ood I don't know what the cause is. I just see this sometimes in the logs |
This morning I received the message in our Production environment: What happened:
I attached the log. |
@jonsalvas Did you do the memory dump via |
@freeart No, sorry, I forgot to do that. I was too nervous about getting our production database running again :-). Next time it happens I will definitely do it. |
@jonsalvas We stopped using cluster mode in production and we are looking for another db as a "plan B" solution. I don't want to wait for the "next time", because we could lose our project. |
@freeart That's sad to hear. To be honest: We were also thinking about switching to ArangoDB as Plan B. The problem is that we now have a buy-in due to orient-specific code which we would need to migrate. We therefore hope we can fix our problems asap. |
Hi @jonsalvas We made a couple of fixes in the distributed code for problems that could lead to the deadlock situation. They should be available in 3.0.23. Thanks |
I was also running into similar issues and upgraded to 3.0.23. I am no longer receiving the acquiring-lock issue; however, I am now receiving this:
Java Version: OS: OrientDB Version: 3.0.23 After receiving this error I cannot insert into the database at all, but I can create new classes/properties. The symptoms are very similar to the acquiring lock/deadlock issue, so I believe it is coming from the same source. |
I got the deadlock issue in a single-node configuration. It seems it's not a problem of communication between nodes @wolf4ood |
Can you capture a thread dump when it is in deadlock and send it to me at You can use one of the following methods https://dzone.com/articles/how-to-take-thread-dumps-7-options Thanks |
@wolf4ood I can't capture the dump because of Docker:
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 5835m 71% 3 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -server -Xms4G -Xmx4G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -Djava.awt.headless=true -Dfile.encoding=UTF8 -D
/orientdb # jstack -l 1 > ./dump.txt
1: Unable to get pid of LinuxThreads manager thread |
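When `jstack` cannot attach, as in the output above, the same kind of information can also be collected through the standard `java.lang.management` API. The sketch below assumes you are able to run code inside the JVM in question (or reach the equivalent MXBean over JMX); whether that is practical for a stock OrientDB container is an assumption, and `ThreadDumpSketch` is a hypothetical name used only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: produces a textual thread dump of the current JVM and
// reports deadlocked threads using the standard ThreadMXBean.
final class ThreadDumpSketch {

    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // Request monitor/synchronizer details only where the JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            // Note: ThreadInfo.toString() prints only the first few stack frames.
            out.append(info);
        }
        return out.toString();
    }

    static long[] deadlockedThreadIds() {
        // findDeadlockedThreads() returns null when no deadlock is detected.
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadIds().length);
        System.out.print(dumpAllThreads());
    }
}
```

`findDeadlockedThreads()` only reports cycles over object monitors and ownable synchronizers, so a hang caused by a lock that is never released (rather than a true cycle) would not be flagged and still needs the full dump to diagnose.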
@wolf4ood Could you please elaborate on this? How is this intended in a HA environment? Do we have to send all our write operations to a single node only, instead of a load balancer? How to handle HA, if the node goes down? Should the client handle that? etc. etc. Would be nice to see a best practice architecture. The documentation is not really helpful in this case. We are currently testing 3.0.25 in production. It seems to run more stable, but we keep monitoring. As soon as the issue reappears I will attach the heap dump |
@tglman what is the current situation of this problem? We are also facing this problem frequently in our production environment. |
OrientDB Version: 3.1.0 from develop branch
Java Version: docker openjdk:8-jdk-alpine
OS: docker openjdk:8-jdk-alpine
We have a cluster of 2 OrientDB nodes in production and we get an error
It has happened 2 times in 1 month
The record ID is always #82:-1 or #83:-1 and the request is null
After that message the table is blocked for inserts until the OrientDB server is restarted
We ran kill -5 on both dbs before restarting
We think the incident happened at about 1pm
In attachment you can find our config, docker-compose file, docker logs and kill -5 logs
issue.zip