
issue in setting up distributed sharded cluster #7622

Closed
kishorel opened this issue Aug 7, 2017 · 17 comments

kishorel commented Aug 7, 2017

OrientDB Version: 2.2.25

Java Version: 1.8.0_60

OS: CentOS release 6.9

Expected behavior

I'm trying to set up a cluster on 3 servers with a class "users" split across 3 clusters: "users_0",
"users_1", and "users_2". My servers are called node1, node2, and
node3. I want the clusters arranged so that I have 2 copies of each
cluster; that way, if 1 node goes down I still have access to all the data.

I've tried setting this up with the following steps:

Download OrientDB 2.2.25 Community and extract onto the 3 servers.

Run dserver.sh on all three machines, configure the root password, set the node name, and then stop the instance (CTRL+C).

Edit default-distributed-db-config.json on node1 as follows:

{
  "autoDeploy": true,
  "hotAlignment": false,
  "executionMode": "undefined",
  "readQuorum": 1,
  "writeQuorum": 2,
  "failureAvailableNodesLessQuorum": false,
  "readYourWrites": true,
  "clusters": {
    "internal": {
    },
    "index": {
    },
    "users_0": {
      "servers": [ "node1", "node2" ]
    },
    "users_1": {
      "servers": [ "node2", "node3" ]
    },
    "users_2": {
      "servers": [ "node3", "node1" ]
    },
    "*": {
      "servers": [ "<NEW_NODE>" ]
    }
  }
}

and hazelcast.xml as:

<tcp-ip enabled="true">
	<member>IP1</member>
	<member>IP2</member>
	<member>IP3</member>
</tcp-ip>
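
For context, that <tcp-ip> element belongs inside Hazelcast's network/join section. A minimal sketch of the surrounding configuration follows; the member IPs are placeholders, port 2434 is the one OrientDB's bundled hazelcast.xml ships with, and disabling multicast is an assumption that fits a static member list like this one:

<network>
	<port auto-increment="true">2434</port>
	<join>
		<!-- assumption: discover members only via the static list below, not multicast -->
		<multicast enabled="false"/>
		<tcp-ip enabled="true">
			<member>IP1</member>
			<member>IP2</member>
			<member>IP3</member>
		</tcp-ip>
	</join>
</network>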

Start node1 with dserver.sh.

Create a database using console on node1:
connect remote:localhost root password
create database remote:localhost/global root password plocal graph
Create a class and rename the default cluster:
create class users extends V (this created a number of clusters equal to the number of CPU cores on the server, so I removed all of them except users)
alter cluster users name users_0
Start node2 with dserver.sh, wait for the database to auto-deploy, then start node3 and wait for it to deploy.

After starting the 3rd node, I see the error below.

2017-08-07 16:37:17:193 WARNI [node3] - writing chunk #1 offset=0 size=441b [orientechnologies][node3] error on transferring database 'global' to '/tmp/orientdb/backup_global_users_1_server3_toInstall.zip'
com.orientechnologies.orient.core.exception.OConfigurationException: Database '/home/tomcat/orientdb/databases/global' is not configured on server (home=/home/tomcat/orientdb/databases/)
        at com.orientechnologies.orient.server.OServer.getStoragePath(OServer.java:575)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:935)
        at com.orientechnologies.orient.server.distributed.sql.OCommandExecutorSQLHASyncCluster.replaceCluster(OCommandExecutorSQLHASyncCluster.java:209)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.installDatabaseFromNetwork(ODistributedAbstractPlugin.java:1517)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.requestDatabaseFullSync(ODistributedAbstractPlugin.java:1232)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.requestFullDatabase(ODistributedAbstractPlugin.java:1002)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin$3.call(ODistributedAbstractPlugin.java:950)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin$3.call(ODistributedAbstractPlugin.java:878)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.executeInDistributedDatabaseLock(ODistributedAbstractPlugin.java:1686)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.installDatabase(ODistributedAbstractPlugin.java:877)
        at com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker.checkServerStatus(OClusterHealthChecker.java:190)
        at com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker.run(OClusterHealthChecker.java:53)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2017-08-07 16:37:17:196 INFO  [node3] Setting new distributed configuration for database: global (version=52)

CLUSTER CONFIGURATION [wQuorum: true] (LEGEND: X = Owner, o = Copy)
+-----------+-----------+----------+-------+------+------+-----------------+
|           |           |          |MASTER |MASTER|MASTER|     MASTER      |
|           |           |          |ONLINE |ONLINE|ONLINE|  NOT_AVAILABLE  |
|           |           |          |dynamic|static|static|     static      |
+-----------+-----------+----------+-------+------+------+-----------------+
|CLUSTER    |writeQuorum|readQuorum| node1 |node2 |node3 |node1502103884969|
+-----------+-----------+----------+-------+------+------+-----------------+
|*          |     3     |    1     |   X   |  o   |  o   |        o        |
|e_0        |     3     |    1     |   o   |  o   |  X   |        o        |
|e_2        |     3     |    1     |   o   |  X   |  o   |        o        |
|e_4        |     3     |    1     |   o   |  X   |  o   |        o        |
|e_5        |     3     |    1     |   o   |  X   |  o   |        o        |
|e_6        |     3     |    1     |   o   |  o   |  o   |        X        |
|e_7        |     3     |    1     |   o   |  o   |  X   |        o        |
|internal   |     3     |    1     |       |      |      |                 |
|ofunction_0|     3     |    1     |   o   |  X   |  o   |        o        |
|ofunction_1|     3     |    1     |   o   |  o   |  X   |        o        |
|orole_0    |     3     |    1     |   o   |  X   |  o   |        o        |
|orole_1    |     3     |    1     |   o   |  o   |  X   |        o        |
|oschedule_0|     3     |    1     |   o   |  X   |  o   |        o        |
|oschedule_1|     3     |    1     |   o   |  o   |  X   |        o        |
|osequence_0|     3     |    1     |   o   |  X   |  o   |        o        |
|osequence_1|     3     |    1     |   o   |  o   |  X   |        o        |
|ouser_0    |     3     |    1     |   o   |  X   |  o   |        o        |
|ouser_1    |     3     |    1     |   o   |  o   |  X   |        o        |
|user_0     |     3     |    1     |   X   |  o   |      |                 |
|user_1     |     3     |    1     |       |  X   |  o   |                 |
|user_2     |     3     |    1     |   o   |      |  X   |                 |
|v_0        |     3     |    1     |   o   |  o   |  X   |        o        |
|v_3        |     3     |    1     |   o   |  X   |  o   |        o        |
|v_4        |     3     |    1     |   o   |  o   |  X   |        o        |
|v_5        |     3     |    1     |   o   |  X   |  o   |        o        |
|v_6        |     3     |    1     |   o   |  X   |  o   |        o        |
|v_7        |     3     |    1     |   o   |  o   |  o   |        X        |
+-----------+-----------+----------+-------+------+------+-----------------+

 [ODistributedStorage]
2017-08-07 16:37:17:197 INFO  [node3] Broadcasting new distributed configuration for database: global (version=52)
 [OHazelcastPlugin]Error on checking cluster health
com.orientechnologies.orient.server.distributed.ODistributedException: Error on transferring database
        at com.orientechnologies.orient.server.distributed.sql.OCommandExecutorSQLHASyncCluster.replaceCluster(OCommandExecutorSQLHASyncCluster.java:251)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.installDatabaseFromNetwork(ODistributedAbstractPlugin.java:1517)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.requestDatabaseFullSync(ODistributedAbstractPlugin.java:1232)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.requestFullDatabase(ODistributedAbstractPlugin.java:1002)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin$3.call(ODistributedAbstractPlugin.java:950)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin$3.call(ODistributedAbstractPlugin.java:878)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.executeInDistributedDatabaseLock(ODistributedAbstractPlugin.java:1686)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.installDatabase(ODistributedAbstractPlugin.java:877)
        at com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker.checkServerStatus(OClusterHealthChecker.java:190)
        at com.orientechnologies.orient.server.distributed.impl.OClusterHealthChecker.run(OClusterHealthChecker.java:53)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Caused by: com.orientechnologies.orient.core.exception.OConfigurationException: Database '/home/tomcat/orientdb/databases/global' is not configured on server (home=/home/tomcat/orientdb/databases/)
        at com.orientechnologies.orient.server.OServer.getStoragePath(OServer.java:575)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:935)
        at com.orientechnologies.orient.server.distributed.sql.OCommandExecutorSQLHASyncCluster.replaceCluster(OCommandExecutorSQLHASyncCluster.java:209)
        ... 11 more


But at this point I have a database on 3 nodes, with a class called "users"
with only one cluster "users_0".

On node2, add the users_1 cluster:
alter class users addcluster users_1
Similarly, on node3:
alter class users addcluster users_2
If I reconnect all console sessions and execute "clusters", I now see
all 3 clusters of the users class on each node. When I insert a record from each
node, I see that a record created on node1 is replicated to node2; a record created on node2 is created but not replicated to node3; and a record created on node3 is replicated to node1.
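
For reference, the per-cluster server assignment can also be checked from the console, assuming your 2.2.x build includes the HA STATUS distributed command (a minimal sketch using the same "global" database and root credentials as above; -servers and -db print the server list and the cluster ownership table):

connect remote:node2/global root password
ha status -servers -db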

I could see the warning below from node2.

2017-08-07 20:54:55:339 WARNI [node2] Timeout (10004ms) on waiting for synchronous responses from nodes=[node3, node1] responsesSoFar=[node1] request=(id=1.1606 task=gossip timestamp: 1502119485335 lockManagerServer: node1) [ODistributedDatabaseImpl]
2017-08-07 20:54:55:339 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:05:342 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:15:347 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:25:351 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:35:356 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:45:359 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:55:55:364 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:05:367 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:15:373 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:25:376 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:35:381 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:45:385 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:56:55:390 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:57:05:393 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:57:15:398 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2017-08-07 20:57:25:402 WARNI [node2]->[node3] Server 'node3' did not respond to the gossip message (db=global, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]

Is there something wrong in the way I've created/configured the database or
is this a bug?

I even tried interchanging the nodes, but I see the same error whenever I add the third node.


kishorel commented Aug 9, 2017

Has anyone faced a similar issue? It works fine with two nodes, but adding the third one throws errors.

@SixDirection

I have also encountered the same problem


lvca commented Aug 28, 2017

@kishorel please use OrientDB >= 2.2.26 and restart the servers on that version with the database you already created.

@lvca lvca self-assigned this Aug 28, 2017
@kishorel

@lvca Thanks for the reply. Just to understand: is this a bug? We were actually planning to go to production, but ran into these issues at the initial setup stage. What is the stable release (if any)?


lvca commented Aug 29, 2017

It was a bug. 2.2.26 is stable with HA. With your configuration I suggest using "majority" as the writeQuorum, not a fixed 2. With 2-3 servers, majority is still 2, but at least if you're starting with 1 server you can still work with it.
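
For example, in default-distributed-db-config.json that would be a one-line change; a trimmed sketch of the file posted above with writeQuorum switched from the fixed value 2 to "majority" (the other keys and the clusters section stay as before):

{
  "autoDeploy": true,
  "executionMode": "undefined",
  "readQuorum": 1,
  "writeQuorum": "majority",
  "readYourWrites": true,
  "clusters": {
    ...
  }
}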


donhoff commented Nov 24, 2017

I hit the same problem on version 2.2.29. Does this bug still exist in v2.2.29? Thanks!

@xavier66

I hit the same problem on version 2.2.30! Please help, or is there any other method to scale up writes?

@xavier66

@lvca @kishorel @DongQingHe @prjhub CC

@randomandy

Having the same problem in 2.2.32

@hossein-md

I have something like this problem in 2.2.33 with a select query (#8238).

@cegprakash

Facing the same issue in 2.2.36


LianaN commented Oct 17, 2018

Facing the same issue with OrientDB 3 and writeQuorum set to majority.

@cegprakash

I am using the OrientDB REST API to avoid this issue. (Just in case you are getting started, you can give that a try.)
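
For anyone trying that route, a minimal sketch of a write through the REST command endpoint, assuming the default HTTP listener on port 2480, the "global" database from this thread, and root credentials (adjust host, database, and auth to your setup):

curl -u root:password -X POST "http://node1:2480/command/global/sql" -d "INSERT INTO users SET name = 'test'"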

@persisharma

Any update on this issue? Facing the same issue in OrientDB 3.0.12 as well.


hxgxs1 commented Jan 25, 2019

@persisharma Yes, I am facing this issue in 3.0.13 as well. Can somebody provide a solution for this?

@madmac2501

I'm also affected in OrientDB 3.0.30

@darkpey

darkpey commented Aug 14, 2021

Same issue in 3.2.0
