Distributed Orientdb loses data when a node is restarted while under load #8162

KluSe · 2018-03-19T09:47:13Z

OrientDB Version: 2.3.30

Java Version: 1.8.0_45

OS: Ubuntu 14.04.3 LTS

Distributed Config: 3 nodes, no sharding

Expected behavior

When a node is rebooted and misses writes as a result, those writes should be replicated to it when it rejoins the cluster.

Actual behavior

Rebooting a node while all nodes are receiving concurrent writes sometimes causes the rebooted node to miss a record, and the record is not replicated to it afterwards.

Steps to reproduce

We run a test cluster with three nodes behind a load balancer, each with a Java-based application server (Apache Karaf). We start a load test against a Java-based REST API, which uses the OrientDB client to write to the local node. The load test runs 10 threads in parallel, load-balanced across all nodes. We use a single technical user. Each write creates two new vertices and an edge connecting them, wrapped in one transaction; no unique indexes are involved.

When we reboot the second node (clean shutdown, no kill -9 or poweroff), the other two nodes continue working. But after the second node rejoins the cluster, some records are occasionally missing on it:

Query on node 1 ("ErzeugtAm" is the createDate):

orientdb {db=b-OS}> SELECT @Rid, ErzeugtAm FROM TAAAngebot WHERE ErzeugtAm < '2018-03-19 08:39:35' AND ErzeugtAm > '2018-03-19 08:39:20';

+----+-------+-------------------+
|#   |RID    |ErzeugtAm          |
+----+-------+-------------------+
|0   |#57:747|2018-03-19 08:39:28|
|1   |#58:384|2018-03-19 08:39:23|
|2   |#58:385|2018-03-19 08:39:31|
|3   |#59:436|2018-03-19 08:39:21|
|4   |#59:437|2018-03-19 08:39:29|
|5   |#59:438|2018-03-19 08:39:34|
+----+-------+-------------------+

6 item(s) found. Query executed in 0.025 sec(s).

Query on node 2:

orientdb {db=b-OS}> SELECT @Rid, ErzeugtAm FROM TAAAngebot WHERE ErzeugtAm < '2018-03-19 08:39:35' AND ErzeugtAm > '2018-03-19 08:39:20';

+----+-------+-------------------+
|#   |RID    |ErzeugtAm          |
+----+-------+-------------------+ <--- record #57:747 is missing!
|0   |#58:384|2018-03-19 08:39:23|
|1   |#58:385|2018-03-19 08:39:31|
|2   |#59:436|2018-03-19 08:39:21|
|3   |#59:437|2018-03-19 08:39:29|
|4   |#59:438|2018-03-19 08:39:34|
+----+-------+-------------------+

5 item(s) found. Query executed in 0.066 sec(s).

Query on node 3:

orientdb {db=b-OS}> SELECT @Rid, ErzeugtAm FROM TAAAngebot WHERE ErzeugtAm < '2018-03-19 08:39:35' AND ErzeugtAm > '2018-03-19 08:39:20';

+----+-------+-------------------+
|#   |RID    |ErzeugtAm          |
+----+-------+-------------------+
|0   |#57:747|2018-03-19 08:39:28|
|1   |#58:384|2018-03-19 08:39:23|
|2   |#58:385|2018-03-19 08:39:31|
|3   |#59:436|2018-03-19 08:39:21|
|4   |#59:437|2018-03-19 08:39:29|
|5   |#59:438|2018-03-19 08:39:34|
+----+-------+-------------------+

6 item(s) found. Query executed in 0.02 sec(s).
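To spot the discrepancy mechanically rather than by eye, the three result sets above can be compared as sets. The following is a minimal sketch in Python: the RIDs are copied from the query output, while the function name `missing_per_node` and the node labels are illustrative only.

```python
# RIDs returned by the same SELECT on each node (copied from the query output above).
rids_by_node = {
    "node 1": {"#57:747", "#58:384", "#58:385", "#59:436", "#59:437", "#59:438"},
    "node 2": {"#58:384", "#58:385", "#59:436", "#59:437", "#59:438"},
    "node 3": {"#57:747", "#58:384", "#58:385", "#59:436", "#59:437", "#59:438"},
}


def missing_per_node(rids_by_node):
    """Return, for each node, the RIDs that at least one other node has but it lacks."""
    all_rids = set().union(*rids_by_node.values())
    return {node: sorted(all_rids - rids) for node, rids in rids_by_node.items()}


for node, missing in missing_per_node(rids_by_node).items():
    print(node, "is missing", missing or "nothing")
# node 1 is missing nothing
# node 2 is missing ['#57:747']
# node 3 is missing nothing
```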

orientdb {db=b-OS}> list clusters

CLUSTERS (collections)
+----+--------------------------------------+----+------------------------------------+-----+------------+---------------+--------------------+
|#   |NAME                                  |  ID|CLASS                               |COUNT|OWNER_SERVER| OTHER_SERVERS |AUTO_DEPLOY_NEW_NODE|
+----+--------------------------------------+----+------------------------------------+-----+------------+---------------+--------------------+
...
|87  |taaangebot                            |  57|TAAAngebot                          |  179|   b-OS-3   |[b-OS-2,b-OS-1]|        true        |
|88  |taaangebot_1                          |  58|TAAAngebot                          |   95|   b-OS-1   |[b-OS-2,b-OS-3]|        true        |
|89  |taaangebot_2                          |  59|TAAAngebot                          |   94|   b-OS-1   |[b-OS-2,b-OS-3]|        true        |
|90  |taaangebot_3                          |  60|TAAAngebot                          |   18|   b-OS-2   |[b-OS-1,b-OS-3]|        true        |
...
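A RID has the form #<cluster id>:<position>, so the missing record #57:747 can be traced to its cluster in the listing above. A minimal sketch in Python, with the four table rows hard-coded and the helper name `owner_of` made up for illustration:

```python
# Cluster ownership copied from the `list clusters` output above:
# cluster ID -> (cluster name, owner server).
cluster_owner = {
    57: ("taaangebot", "b-OS-3"),
    58: ("taaangebot_1", "b-OS-1"),
    59: ("taaangebot_2", "b-OS-1"),
    60: ("taaangebot_3", "b-OS-2"),
}


def owner_of(rid):
    """Split a RID such as '#57:747' into cluster ID and position, then look up the owner."""
    cluster_id, position = rid.lstrip("#").split(":")
    name, owner = cluster_owner[int(cluster_id)]
    return name, owner, int(position)


print(owner_of("#57:747"))  # ('taaangebot', 'b-OS-3', 747)
```

For #57:747 this gives cluster taaangebot (ID 57), whose owner is b-OS-3. If b-OS-3 corresponds to node 3 (an assumption; the issue does not state the mapping), the lost record belongs to a cluster owned by a node other than the restarted one.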

There are no visible errors in the log files:
server-1.log
server-2.log
server-3.log

Regards,
KluSe

tglman · 2018-04-16T15:48:16Z

Hi @KluSe,

Similar issues have already been fixed in recent hotfixes. Please update to the latest hotfix (2.2.34).

Let us know if you can still reproduce the problem with the latest hotfix.

Regards

denislemercier · 2018-08-01T16:18:36Z

Hello,

I had the same issue with OrientDB 3.0.4: a few documents are missing after a node restart.

I found the following:
While a node is down, if I remove the entire database instance folder and then restart the node, it works fine (the entire database is transferred).

Maybe the function OSyncDatabaseDeltaTask causes a problem.
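The workaround described above (removing the database folder while the node is down, so that the node receives a full copy on restart) could be scripted roughly as follows. This is only a sketch in Python: it moves the folder aside instead of deleting it, so the old copy can still be inspected, and the function name and any paths passed to it are hypothetical, not taken from the issue. The OrientDB server on that node must already be stopped.

```python
import shutil
import time
from pathlib import Path


def set_aside_database(db_dir, backup_root):
    """Move a stopped node's database folder out of the way.

    With no local copy left, the node has to fetch the whole database
    when it rejoins the cluster (behaviour as reported in the comment above).
    """
    db_dir = Path(db_dir)
    backup_root = Path(backup_root)
    if not db_dir.is_dir():
        raise FileNotFoundError(f"no database folder at {db_dir}")
    backup_root.mkdir(parents=True, exist_ok=True)
    # Timestamped name so repeated runs do not overwrite earlier copies.
    target = backup_root / f"{db_dir.name}-{time.strftime('%Y%m%d-%H%M%S')}"
    if target.exists():
        raise FileExistsError(f"backup target already exists: {target}")
    shutil.move(str(db_dir), str(target))
    return target
```

It would be called as `set_aside_database("/path/to/databases/b-OS", "/path/to/backups")`, where both paths are placeholders for the local installation.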

tglman self-assigned this Apr 16, 2018

andrii0lomakin added the distributed label May 6, 2018

andrii0lomakin removed the distributed label Sep 30, 2019

andrii0lomakin closed this as completed Aug 6, 2021
