adding 2nd node in cluster: dbase copy hang with "node not online yet" #6666

Closed · rdelangh opened this issue Sep 7, 2016 · 155 comments

@rdelangh commented Sep 7, 2016

OrientDB Version, operating system, or hardware.

  • v2.2.7

Operating System

  • Linux

I have my database "cdrarch" running on a first server node, which has been started in distributed mode, and is ONLINE:
orientdb {db=cdrarch}> list servers

CONFIGURED SERVERS
+----+-----------------+------+-----------+-----------------------+-----------------+-----------------+---------------+---------------+---------+
|#   |Name             |Status|Connections|StartedOn              |Binary           |HTTP             |UsedMemory     |FreeMemory     |MaxMemory|
+----+-----------------+------+-----------+-----------------------+-----------------+-----------------+---------------+---------------+---------+
|0   |node1473107743544|ONLINE|1          |2016-09-07 10:27:24.808|10.49.69.190:2424|10.49.69.190:2480|5.51GB (38.74%)|1.71GB (12.05%)|14.22GB  |
+----+-----------------+------+-----------+-----------------------+-----------------+-----------------+---------------+---------------+---------+

I have copied the "hazelcast.xml" file to a second server, and started that server as well.

The logs on both nodes show that a socket connection is established; this is an extract from the logs on the second node:

2016-09-07 12:18:51:293 WARNI Assigning distributed node name: node1473243531293 [OHazelcastPlugin]

2016-09-07 12:18:52:399 INFO  [10.49.69.189]:2434 [cdrcontrols] [3.6.3] Creating MulticastJoiner [Node]
2016-09-07 12:18:52:405 INFO  [10.49.69.189]:2434 [cdrcontrols] [3.6.3] Address[10.49.69.189]:2434 is STARTING [LifecycleService]
...
2016-09-07 12:18:52:668 INFO  [10.49.69.189]:2434 [cdrcontrols] [3.6.3] Connecting to /10.49.69.190:2434, timeout: 0, bind-any: true [InitConnectionTask]
2016-09-07 12:18:52:675 INFO  [10.49.69.189]:2434 [cdrcontrols] [3.6.3] Established socket connection between /10.49.69.189:42394 and /10.49.69.190:2434 [TcpIpConnectionManager]
2016-09-07 12:18:58:692 INFO  [10.49.69.189]:2434 [cdrcontrols] [3.6.3]

Members [2] {
        Member [10.49.69.190]:2434
        Member [10.49.69.189]:2434 this
}

Then the database 'GratefulDeadConcerts' gets replicated:

2016-09-07 12:19:03:593 INFO  [node1473243531293] Current node started as MASTER for database 'GratefulDeadConcerts' [OHazelcastPlugin]
2016-09-07 12:19:03:596 INFO  [node1473243531293] New distributed configuration for database: GratefulDeadConcerts (version=1)

CLUSTER CONFIGURATION (LEGEND: X = Owner, o = Copy)
+--------+-----------+----------+-----------------+
|        |           |          |     MASTER      |
|        |           |          |     ONLINE      |
+--------+-----------+----------+-----------------+
|CLUSTER |writeQuorum|readQuorum|node1473107743544|
+--------+-----------+----------+-----------------+
|*       |     1     |    1     |        X        |
|internal|     1     |    1     |                 |
+--------+-----------+----------+-----------------+

 [OHazelcastPlugin]
2016-09-07 12:19:03:596 INFO  [node1473243531293] Saving distributed configuration file for database 'GratefulDeadConcerts' to: /opt/orientdb/databases/GratefulDeadConcerts/distributed-config.json [OHazelcastPlugin]
2016-09-07 12:19:03:600 INFO  [node1473243531293] Distributed servers status:

+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
|Name              |Status  |Databases                           |Conns|StartedOn|Binary           |HTTP             |UsedMemory               |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
|node1473243531293*|STARTING|                                    |0    |12:18:50 |10.49.69.189:2424|10.49.69.189:2480|75.77MB/491.00MB (15.43%)|
|node1473107743544 |ONLINE  |GratefulDeadConcerts=ONLINE (MASTER)|3    |10:27:24 |10.49.69.190:2424|10.49.69.190:2480|5.35GB/14.22GB (37.63%)  |
|                  |        |bvdd-test=ONLINE (MASTER)           |     |         |                 |                 |                         |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
...
2016-09-07 12:19:05:102 INFO  [node1473243531293] Publishing ONLINE status for database node1473243531293.GratefulDeadConcerts... [ODistributedDatabaseImpl]
2016-09-07 12:19:05:109 INFO  [node1473243531293] New distributed configuration for database: GratefulDeadConcerts (version=31)

CLUSTER CONFIGURATION (LEGEND: X = Owner, o = Copy)
+-------------+----+-----------+----------+-----------------+-----------------+
|             |    |           |          |     MASTER      |     MASTER      |
|             |    |           |          |     ONLINE      |     ONLINE      |
+-------------+----+-----------+----------+-----------------+-----------------+
|CLUSTER      |  id|writeQuorum|readQuorum|node1473107743544|node1473243531293|
+-------------+----+-----------+----------+-----------------+-----------------+
|*            |    |     2     |    1     |        X        |        o        |
|_studio_1    |  50|     2     |    1     |        o        |        X        |
|_studio_5    |  54|     2     |    1     |        o        |        X        |
|_studio_6    |  55|     2     |    1     |        o        |        X        |
|_studio_7    |  56|     2     |    1     |        o        |        X        |
|e_4          |  21|     2     |    1     |        o        |        X        |
|e_5          |  22|     2     |    1     |        o        |        X        |
...
|written_by_7 |  40|     2     |    1     |        o        |        X        |
+-------------+----+-----------+----------+-----------------+-----------------+

 [OHazelcastPlugin]
2016-09-07 12:19:05:109 INFO  [node1473243531293] Saving distributed configuration file for database 'GratefulDeadConcerts' to: /opt/orientdb/databases/GratefulDeadConcerts/distributed-config.json [OHazelcastPlugin]

So far so good, it seems.

Then it moves on to my big (300GB) database "cdrarch", and here we run into problems:

2016-09-07 12:19:05:119 INFO  [node1473243531293] Current node started as MASTER for database 'cdrarch' [OHazelcastPlugin]
2016-09-07 12:19:05:120 INFO  [node1473243531293] New distributed configuration for database: cdrarch (version=1)

CLUSTER CONFIGURATION (LEGEND: X = Owner, o = Copy)
+--------+-----------+----------+-----------------+
|        |           |          |     MASTER      |
|        |           |          |     ONLINE      |
+--------+-----------+----------+-----------------+
|CLUSTER |writeQuorum|readQuorum|node1473107743544|
+--------+-----------+----------+-----------------+
|*       |     1     |    1     |        X        |
|internal|     1     |    1     |                 |
+--------+-----------+----------+-----------------+
2016-09-07 12:19:05:121 INFO  [node1473243531293] Saving distributed configuration file for database 'cdrarch' to: /opt/orientdb/databases/cdrarch/distributed-config.json [OHazelcastPlugin]
2016-09-07 12:19:05:124 INFO  [node1473243531293] Distributed servers status:

+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
|Name              |Status  |Databases                           |Conns|StartedOn|Binary           |HTTP             |UsedMemory               |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
|node1473243531293*|STARTING|                                    |0    |12:18:50 |10.49.69.189:2424|10.49.69.189:2480|75.77MB/491.00MB (15.43%)|
|node1473107743544 |ONLINE  |GratefulDeadConcerts=ONLINE (MASTER)|3    |10:27:24 |10.49.69.190:2424|10.49.69.190:2480|5.35GB/14.22GB (37.63%)  |
|                  |        |cdrarch=ONLINE (MASTER)             |     |         |                 |                 |                         |
|                  |        |bvdd-test=ONLINE (MASTER)           |     |         |                 |                 |                         |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+-------------------------+
 [OHazelcastPlugin]
2016-09-07 12:19:05:131 WARNI [node1473243531293]->[[node1473107743544]] Requesting deploy of database 'cdrarch' on local server... [OHazelcastPlugin]
2016-09-07 12:19:05:444 WARNI [node1473243531293] Moving existent database 'cdrarch' in '/opt/orientdb/databases/cdrarch' to '/opt/orientdb/databases//../backup/databases/cdrarch' and get a fresh copy from a remote node... [OHazelcastPlugin]
2016-09-07 12:19:05:445 SEVER [node1473243531293] Error on moving existent database 'cdrarch' located in '/opt/orientdb/databases/cdrarch' to '/opt/orientdb/databases/../backup/databases/cdrarch'. Deleting old database... [OHazelcastPlugin]
2016-09-07 12:19:05:445 INFO  [node1473243531293]<-[node1473107743544] Copying remote database 'cdrarch' to: /tmp/orientdb/install_cdrarch.zip [OHazelcastPlugin]
2016-09-07 12:19:05:446 INFO  [node1473243531293]<-[node1473107743544] Installing database 'cdrarch' to: /opt/orientdb/databases/cdrarch... [OHazelcastPlugin]
2016-09-07 12:19:05:447 INFO  [node1473243531293] - writing chunk #1 offset=0 size=0b [OHazelcastPlugin]
2016-09-07 12:19:05:448 INFO  [node1473243531293] Database copied correctly, size=0b [ODistributedAbstractPlugin$3]
2016-09-07 12:19:07:476 INFO  Node is not online yet (status=STARTING), blocking the command until it is online (retry=1, queue=0 worker=0) [ODistributedWorker]
2016-09-07 12:19:09:476 INFO  Node is not online yet (status=STARTING), blocking the command until it is online (retry=2, queue=0 worker=0) [ODistributedWorker]
...

-> what's happening?

@rdelangh (Author) commented Sep 8, 2016

Made another (more patient) attempt:

  • stopped the server entirely (single-node setup)
  • started ODB (with "dserver.sh") on the first server node and waited until my 300GB dbase "cdrarch" was accessible via the "console.sh" tool; in the server logs:
    2016-09-08 08:34:45:767 INFO OrientDB Server is active v2.2.7 (build 2.2.x@rdcab5af4dce4b538bdb4b372abba46e3fc9f19b7; 2016-08-11 15:17:33+0000). [OServer]
  • started ODB (with "dserver.sh") on the second server node, and noticed the logs on the 1st server indicating that both nodes are connected and start synchronizing:

2016-09-08 08:45:15:558 INFO [node1473107743544] Distributed servers status:

+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
|Name              |Status  |Databases                           |Conns|StartedOn|Binary           |HTTP             |UsedMemory              |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
|node1473243531293 |STARTING|                                    |0    |08:45:04 |10.49.69.189:2424|10.49.69.189:2480|100.87MB/14.22GB (0.69%)|
|node1473107743544*|ONLINE  |GratefulDeadConcerts=ONLINE (MASTER)|1    |08:30:05 |10.49.69.190:2424|10.49.69.190:2480|342.71MB/14.22GB (2.35%)|
|                  |        |cdrarch=ONLINE (MASTER)             |     |         |                 |                 |                        |
|                  |        |bvdd-test=ONLINE (MASTER)           |     |         |                 |                 |                        |
|                  |        |c5qp=ONLINE (MASTER)                |     |         |                 |                 |                        |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+

  • a first (small) dbase "bvdd-test" gets exchanged, it seems:

2016-09-08 08:45:15:597 INFO [node1473107743544]<-[node1473243531293] Received new status node1473243531293.bvdd-test=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:15:702 INFO [node1473107743544] Received updated status node1473243531293.bvdd-test=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:15:704 INFO [node1473107743544] Received updated status node1473107743544.bvdd-test=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:15:704 INFO [node1473107743544]->[node1473243531293] Deploying database bvdd-test... [OSyncDatabaseTask]
2016-09-08 08:45:15:710 INFO [node1473107743544]->[node1473243531293] Creating backup of database 'bvdd-test' (compressionRate=7) in directory: /tmp/orientdb/backup_bvdd-test.zip. LSN=OLogSequenceNumber{segment=0, position=2723661}... [OSyncDatabaseTask]
2016-09-08 08:45:16:159 INFO [node1473107743544]->[node1473243531293] Backup of database 'bvdd-test' completed. lastOperationId=1.0... [OSyncDatabaseTask$1]
2016-09-08 08:45:16:314 INFO [node1473107743544]->[node1473243531293] - transferring chunk #1 offset=0 size=168274... [OSyncDatabaseTask]
2016-09-08 08:45:16:315 INFO [node1473107743544] Received updated status node1473107743544.bvdd-test=ONLINE [OHazelcastPlugin]
2016-09-08 08:45:16:317 INFO [node1473107743544]->[node1473243531293] Deploy database task completed [OSyncDatabaseTask]
2016-09-08 08:45:16:318 INFO [node1473107743544] Distributed servers status:
...
2016-09-08 08:45:17:931 INFO [node1473107743544]<-[node1473243531293] Received updated status node1473243531293.bvdd-test=ONLINE [OHazelcastPlugin]
2016-09-08 08:45:17:942 INFO [node1473107743544]<-[node1473243531293] Updated configuration db=bvdd-test [OHazelcastPlugin]
2016-09-08 08:45:17:945 INFO [node1473107743544] New distributed configuration for database: bvdd-test (version=12)...
2016-09-08 08:45:17:946 INFO [node1473107743544] Saving distributed configuration file for database 'bvdd-test' to: /opt/orientdb/databases/bvdd-test/distributed-config.json [OHazelcastPlugin]

  • but then this message is disturbing:
    2016-09-08 08:45:17:953 INFO [node1473107743544] Cannot install database 'bvdd-test' on local node, because no servers are ONLINE [OHazelcastPlugin]
  • for dbase "GratefulDeadConcerts" everything seems to go fine:
    ...
    2016-09-08 08:45:19:568 INFO [node1473107743544] Saving distributed configuration file for database 'GratefulDeadConcerts' to: /opt/orientdb/databases/GratefulDeadConcerts/distributed-config.json [OHazelcastPlugin]
    2016-09-08 08:45:19:574 INFO [node1473107743544] Current node started as MASTER for database 'GratefulDeadConcerts' [OHazelcastPlugin]
  • for the big dbase "cdrarch", it gets stuck:

2016-09-08 08:45:19:577 INFO [node1473107743544]<-[node1473243531293] Received new status node1473243531293.cdrarch=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:19:587 INFO [node1473107743544] Received updated status node1473243531293.cdrarch=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:19:588 INFO [node1473107743544] Received updated status node1473107743544.cdrarch=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:19:589 INFO [node1473107743544]->[node1473243531293] Deploying database cdrarch... [OSyncDatabaseTask]
...

2016-09-08 08:45:19:591 INFO [node1473107743544]->[node1473243531293] Creating backup of database 'cdrarch' (compressionRate=7) in directory: /tmp/orientdb/backup_cdrarch.zip. LSN=OLogSequenceNumber{segment=17575, position=70007445}... [OSyncDatabaseTask]
2016-09-08 08:45:19:893 INFO [node1473107743544]->[node1473243531293] - transferring chunk #1 offset=0 size=0... [OSyncDatabaseTask]
2016-09-08 08:45:19:894 INFO [node1473107743544] Received updated status node1473107743544.cdrarch=ONLINE [OHazelcastPlugin]
2016-09-08 08:45:19:897 INFO [node1473107743544]->[node1473243531293] Deploy database task completed [OSyncDatabaseTask]
2016-09-08 08:45:42:814 WARNI [node1473107743544] Timeout (20002ms) on waiting for synchronous responses from nodes=[node1473243531293] responsesSoFar=[] request=(id=0.0 task=heartbeat timestamp: 1473317122803) [ODistributedDatabaseImpl]
2016-09-08 08:45:42:815 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2016-09-08 08:45:52:818 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=bvdd-test, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2016-09-08 08:45:52:828 INFO [node1473107743544] Distributed servers status:
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
|Name              |Status  |Databases                           |Conns|StartedOn|Binary           |HTTP             |UsedMemory              |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
|node1473243531293 |STARTING|                                    |0    |08:45:04 |10.49.69.189:2424|10.49.69.189:2480|100.87MB/14.22GB (0.69%)|
|node1473107743544*|ONLINE  |GratefulDeadConcerts=ONLINE (MASTER)|3    |08:30:05 |10.49.69.190:2424|10.49.69.190:2480|401.81MB/14.22GB (2.76%)|
|                  |        |cdrarch=ONLINE (MASTER)             |     |         |                 |                 |                        |
|                  |        |bvdd-test=ONLINE (MASTER)           |     |         |                 |                 |                        |
|                  |        |c5qp=ONLINE (MASTER)                |     |         |                 |                 |                        |
+------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
[OHazelcastPlugin]
2016-09-08 08:46:02:828 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
2016-09-08 08:46:12:829 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=bvdd-test, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
...(and so on)

  • after about 45 mins, the situation is still the same:
    2016-09-08 09:29:44:069 WARNI [node1473107743544] Timeout (10001ms) on waiting for synchronous responses from nodes=[node1473243531293] responsesSoFar=[] request=(id=0.264 task=heartbeat timestamp: 1473319774067) [ODistributedDatabaseImpl]
    2016-09-08 09:29:44:070 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=GratefulDeadConcerts, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
    2016-09-08 09:29:54:070 WARNI [node1473107743544]->[node1473243531293] Server 'node1473243531293' did not respond to the heartbeat message (db=bvdd-test, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
  • on that second node, the logs show this:
    ...
    2016-09-08 09:31:33:550 INFO Node is not online yet (status=STARTING), blocking the command until it is online (retry=1321, queue=1 worker=3) [ODistributedWorker]
    2016-09-08 09:31:33:550 INFO Node is not online yet (status=STARTING), blocking the command until it is online (retry=1361, queue=1 worker=1) [ODistributedWorker]
    2016-09-08 09:31:33:550 INFO Node is not online yet (status=STARTING), blocking the command until it is online (retry=1386, queue=123 worker=0) [ODistributedWorker]
    ... (and so on)

=> are they stuck, or is something (supposedly) happening that simply takes an awful lot of time?

@rdelangh (Author) commented Sep 8, 2016

Found the cause:

  • in the "default-distributed-db-config.json" file I had explicitly configured node names, which did not match the random node names assigned to each server when I launched "dserver.sh"
  • so I stopped all servers, removed the "distributed-db-config.json" files that had been modified at startup, set the environment variable ORIENTDB_NODE_NAME on each server (see the example below), and then started "dserver.sh" again
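
For example, on each server, something along these lines (the node name and paths are illustrative, one unique name per server):

ORIENTDB_NODE_NAME=orient1; export ORIENTDB_NODE_NAME
cd /opt/orientdb
nohup bin/dserver.sh > log/dstartup.out 2>&1 </dev/null &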

The situation is better now (no more "Node is not online yet" loop), but only the first node reaches status ONLINE; the 2nd and 3rd nodes remain in status STARTING ...

@rdelangh (Author) commented Sep 9, 2016

Can the logging level be increased so that the reason why the 2nd and 3rd nodes remain in status STARTING hopefully becomes visible?

The docs do not describe how to set logging levels ...
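
The closest thing I can see is "config/orientdb-server-log.properties"; presumably raising the java.util.logging levels there and restarting would do it (a guess on my side, e.g.):

grep -n "level" config/orientdb-server-log.properties
# and then raise the relevant entries, for example:
#   .level = FINE
#   com.orientechnologies.orient.server.distributed.level = FINE
#   java.util.logging.FileHandler.level = FINE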

@rdelangh (Author) commented Sep 9, 2016

This seems to be the same issue as closed issues #4176 and #4789; however, those scenarios describe a conversion from a standalone server into a distributed setup, where, following the docs, the reporters copy their entire database to the new nodes.
In our case, with a database of 350GB, that is an unrealistic approach; not copying the dbase first is also documented as an alternative way of working.

1st node gets ONLINE,
2nd node is STARTING and is copying clusters, then remains in STARTING saying "Node is not online yet",
3rd node does exactly the same thing (remains in STARTING).

@lvca (Member) commented Sep 10, 2016

Are the 2nd and 3rd nodes starting at the same time?

@rdelangh (Author)

@lvca: It seems to make no difference; I tried a few different combinations: start the 1st node and wait until connectivity with the dbase is possible; then start the 2nd node and watch it synchronize some clusters until it gets into that loop ("Node is not online yet"); then start the 3rd node to see if that helps getting the clusters distributed, but to no avail.

@lvca lvca self-assigned this Sep 12, 2016
@lvca (Member) commented Sep 12, 2016

Do you have logs of the last attempt with nodes that start progressively? Could you share the log files in some way? A GIST is fine too. Thanks.

@rdelangh (Author)

@lvca, ok, I managed to upload the logfiles of the first and 2nd server to GIST:
[https://gist.github.com/rdelangh/192a9fd0c759121e305d2c5ab35dc9fa]

  • the logfile on GIST for the 1st server is named "orient1-orient-server.log.0"
  • the logfile on GIST for the 2nd server is named "orient2-orient-server.log.0"

@lvca (Member) commented Sep 19, 2016

Ok, this error should be fixed in the latest 2.2.10. Could you please retry?

@rdelangh (Author)

@lvca
Hi Luca, I tried with this version, but even the "dserver.sh" startup on the 1st server node fails. The logs clearly indicate that it is not starting as it should:
https://gist.github.com/rdelangh/4510549d4a2c03027af58fcace82d235

@lvca (Member) commented Sep 24, 2016

Do you have the logs of the other servers that cannot join?

@rdelangh (Author)

@lvca the first server cannot even start, so I did not attempt to start up any additional server.

@rdelangh (Author)

I started the first server with version 2.2.7 again, and this one starts fine. So there seems to be some issue with version 2.2.10.

@taburet (Contributor) commented Sep 26, 2016

Hi @rdelangh! Regarding the 2016-09-24 01:32:56:927 SEVER OCachePointer.finalize: readers != 0 [OCachePointer] entries in the latest log: could you please enable the leak detector using -Dmemory.directMemory.trackMode=true -Djava.util.logging.manager=com.orientechnologies.common.log.OLogManager$DebugLogManager and do a full start-work-stop cycle on the node? Log entries marked with the DIRECT-TRACK label should appear in the log if a leak is detected. BTW, are those OCachePointer.finalize: readers != 0 entries consistently reproducible, or are they random?
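
If it helps: one way to pass those two properties, assuming the node is started via server.sh/dserver.sh as quoted elsewhere in this thread, is to append them to JAVA_OPTS_SCRIPT before launching (a sketch, not the only way):

# in bin/server.sh (or exported before calling dserver.sh); note the escaped '$' inside double quotes
JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Dmemory.directMemory.trackMode=true"
JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Djava.util.logging.manager=com.orientechnologies.common.log.OLogManager\$DebugLogManager"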

@rdelangh (Author)

@taburet
These "OCachePointer.finalize: readers != 0" log entries start appearing at extreme high speed (zillions of entries, continuously) in the logfile some 30 seconds after I started the server, and before any client has been launched to use that server.

@lvca lvca added the bug label Sep 26, 2016
@lvca lvca assigned taburet and unassigned lvca Sep 26, 2016
@taburet (Contributor) commented Sep 26, 2016

Tested locally on a primitive 3-node setup – no leaks. Let's see what the leak detector will uncover on your side.

@rdelangh (Author)

@taburet

  1. awaiting a timeslot in which our current server (v2.2.7) can be shut down; I will then start v2.2.10 with the leak-detector settings
  2. meanwhile I found that the v2.2.7 "server.sh" script contains options for Java which are not included in the "server.sh" script bundled with v2.2.10; I don't know whether they are relevant for this issue:
    In the v2.2.7 "server.sh" script:
    JAVA_OPTS_SCRIPT="-Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9"

@rdelangh (Author) commented Mar 8, 2017

@lvca
so, I tried to shut down both servers again to change this timeout parameter.

  • on "orient6" (which was trying to get a replica of the dbases from "orient5"), that went without problems
  • on "orient5", the shutdown attempt is failing, blocking on some status conflicts; its logs:
...
2017-03-08 17:36:40:094 WARNI Shutting down node 'orient5'... [OHazelcastPlugin]
2017-03-08 17:36:40:094 WARNI [orient5] Updated node status to 'SHUTTINGDOWN' [OHazelcastPlugin]
2017-03-08 17:36:40:116 INFO  [orient5] Shutting down distributed database manager 'OSystem'. Pending objects: txs=0 locks=0 [ODistributedDatabaseImpl]
2017-03-08 17:36:40:140 INFO  [orient5] Shutting down distributed database manager 'navi'. Pending objects: txs=0 locks=0 [ODistributedDatabaseImpl]
2017-03-08 17:37:10:532 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:36:43.905 to 2017-03-08T17:37:10.530 since last heartbeat (+21625 ms) [ClusterHeartbeatManager]
2017-03-08 17:37:10:532 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 21625 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 17:37:31:137 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:37:10.530 to 2017-03-08T17:37:31.136 since last heartbeat (+15606 ms) [ClusterHeartbeatManager]
2017-03-08 17:37:31:137 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 15606 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 17:37:48:067 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:37:31.136 to 2017-03-08T17:37:48.067 since last heartbeat (+11931 ms) [ClusterHeartbeatManager]
2017-03-08 17:38:37:966 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:38:13.661 to 2017-03-08T17:38:37.966 since last heartbeat (+19305 ms) [ClusterHeartbeatManager]
2017-03-08 17:38:37:966 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 19305 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 17:38:37:966 INFO  [orient5] Shutting down distributed database manager 'cdrarch'. Pending objects: txs=0 locks=0 [ODistributedDatabaseImpl]
2017-03-08 17:39:35:304 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:38:37.966 to 2017-03-08T17:38:58.894 since last heartbeat (+15928 ms) [ClusterHeartbeatManager]
2017-03-08 17:40:20:618 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 15928 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 17:40:20:618 INFO  [orient5] Shutting down distributed database manager 'GratefulDeadConcerts'. Pending objects: txs=0 locks=0 [ODistributedDatabaseImpl]
2017-03-08 17:41:06:160 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:38:58.894 to 2017-03-08T17:40:20.618 since last heartbeat (+76724 ms) [ClusterHeartbeatManager]
2017-03-08 17:41:06:160 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 76724 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 17:41:46:296 INFO  [orient5] Node is not online yet (status=SHUTTINGDOWN), blocking the command until it is online (retry=1, queue=0 worker=7) [ODistributedWorker]
2017-03-08 17:42:10:473 INFO  [orient5] Node is not online yet (status=SHUTTINGDOWN), blocking the command until it is online (retry=1, queue=0 worker=6) [ODistributedWorker]
2017-03-08 17:42:10:473 INFO  [orient5] Node is not online yet (status=SHUTTINGDOWN), blocking the command until it is online (retry=1, queue=0 worker=5) [ODistributedWorker]

-> please advise.

@rdelangh (Author) commented Mar 8, 2017

Mind you, these "System clock apparently jumped" messages make no sense: both servers are well NTP-synchronized, and their NTP logfiles show not a single jump of the system clock of more than 1/1000 of a second!
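
For completeness, the kind of check this is based on (standard ntp tooling, nothing OrientDB-specific):

ntpq -p    # peer list; the 'offset' column is in milliseconds and stays far below 1 ms on both servers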

@rdelangh (Author) commented Mar 8, 2017

Got these extra log messages on "orient5":

...

2017-03-08 18:04:28:135 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:40:20.618 to 2017-03-08T17:41:06.161 since last heartbeat (+40543 ms) [ClusterHeartbeatManager]
2017-03-08 18:05:34:343 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 40543 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 18:05:34:344 INFO  [orient5] Removing server 'orient6' from all the databases (removeOnlyDynamicServers=true)... [OHazelcastPlugin]
2017-03-08 18:05:34:344 INFO  [orient5] Shutting down distributed database manager 'mobile'. Pending objects: txs=0 locks=0 [ODistributedDatabaseImpl][orient5]<-[orient6] Error on executing distributed request 1.70972: (command_sql(create cluster `cdr_af_20160911_1`)) worker=7
java.lang.OutOfMemoryError: GC overhead limit exceeded
$ANSI{green {db=cdrarch}} [orient5]<-[orient6] Error on executing distributed request 1.70971: (command_sql(create cluster `cdr_af_20160911_1`)) worker=6
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:328)
        at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297)
        at com.orientechnologies.common.concur.lock.OReadersWriterSpinLock.acquireReadLock(OReadersWriterSpinLock.java:84)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.OLocalPaginatedStorage.getConfiguration(OLocalPaginatedStorage.java:218)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:242)
        at com.orientechnologies.orient.server.OServer.openDatabaseBypassingSecurity(OServer.java:980)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:947)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:932)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedDatabaseImpl.getDatabaseInstance(ODistributedDatabaseImpl.java:775)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.initDatabaseInstance(ODistributedWorker.java:140)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:297)
        at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:101)
$ANSI{green {db=cdrarch}} [orient5]<-[orient6] Error on executing distributed request 1.70969: (command_sql(create cluster `cdr_af_20160911_1`)) worker=4
java.lang.OutOfMemoryError: GC overhead limit exceeded

2017-03-08 18:06:07:969 WARNI [orient5] Updated node status to 'OFFLINE' [OHazelcastPlugin]
2017-03-08 18:06:07:969 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T17:41:06.161 to 2017-03-08T18:05:34.344 since last heartbeat (+1463183 ms) [ClusterHeartbeatManager]
2017-03-08 18:06:07:969 INFO  [orient5] Removing server 'orient6' from database configuration 'navi' (removeOnlyDynamicServers=true)... [OHazelcastPlugin]Error during fuzzy checkpoint
java.lang.OutOfMemoryError: GC overhead limit exceeded

2017-03-08 18:06:07:970 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting master confirmation timestamps because of huge system clock jump! Clock-Jump: 1463183 ms, Master-Confirmation-Timeout: 350000 ms [ClusterHeartbeatManager]
2017-03-08 18:07:37:069 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 1463183 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]
2017-03-08 18:07:37:069 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Address[10.100.22.125]:2434 is SHUTTING_DOWN [LifecycleService]
2017-03-08 18:07:37:070 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] System clock apparently jumped from 2017-03-08T18:05:34.344 to 2017-03-08T18:07:37.070 since last heartbeat (+117726 ms) [ClusterHeartbeatManager]
2017-03-08 18:07:37:070 WARNI [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 117726 ms, Heartbeat-Timeout: 30000 ms [ClusterHeartbeatManager]

and still not shut down

@rdelangh (Author) commented Mar 8, 2017

So, there was a "GC overhead limit exceeded" error.

Could this (again) have been the problem for the replication from "orient5" to "orient6"?

Current settings in "server.sh" are such that the server process is started like this:

java -server -Xms4G -Xmx4G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Ddistributed=true -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/www -Dorientdb.build.number=UNKNOWN@r469bb1923cc9601a8264fda6ea79f0f9bd8448e6; 2017-02-23 10:30:12+0000 -cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-server-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* com.orientechnologies.orient.server.OServerMain

@rdelangh (Author) commented Mar 8, 2017

I notice that the command-line options for this server process contain "-Xms4G -Xmx4G"

However the "server.sh" script contains this:

if [ -z "$ORIENTDB_OPTS_MEMORY" ] ; then
    #ORIENTDB_OPTS_MEMORY="-Xms2G -Xmx2G"
    ORIENTDB_OPTS_MEMORY="-Xms1G -Xmx10G"
fi
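
One thing to try (see the later comments for the outcome) is to export the variable in the environment before calling the start script, so that the -z test above leaves it alone:

ORIENTDB_OPTS_MEMORY="-Xms4G -Xmx12G"; export ORIENTDB_OPTS_MEMORY   # values here are just an example
nohup bin/dserver.sh > log/dstartup.out 2>&1 </dev/null &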

@rdelangh (Author) commented Mar 8, 2017

ok, it's down:

...

2017-03-08 18:12:00:248 WARNI [orient5] Node removed id=Member [10.100.22.124]:2434 name=orient6 [OHazelcastPlugin]
2017-03-08 18:12:27:175 INFO  - shutdown storage: cdrarch... [Orient]
2017-03-08 18:12:27:598 INFO  - shutdown storage: GratefulDeadConcerts... [Orient][orient5]<-[orient6] Error on executing distributed request 1.70965 on local node: command_sql(create cluster `cdr_af_20160911_1`)
java.lang.OutOfMemoryError: GC overhead limit exceeded

2017-03-08 18:12:27:647 INFO  - shutdown storage: mobile... [Orient]
2017-03-08 18:12:40:241 INFO  OrientDB Engine shutdown complete [Orient]
2017-03-08 18:12:40:241 INFO  OrientDB Server shutdown complete
 [OServer]
2017-03-08 18:12:40:257 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Running shutdown hook... Current state: ACTIVE [Node]

So, the question is: despite my settings in "server.sh" to tune the "Xms" and "Xmx" values, the process gets started with "-Xms4G -Xmx4G", which is apparently not enough for a big database.
Those values might be fine for small databases, but that is a different class of deployment, let's assume...

@rdelangh (Author) commented Mar 8, 2017

I found that this problem exists in "orientdb-community-2.2.18-20170223.103119-11.tar.gz", but not (anymore) in "orientdb-community-2.2.18-20170304.005202-22.tar.gz" ...

@rdelangh (Author)

Correction: this problem still exists in "orientdb-community-2.2.18-20170304.005202-22.tar.gz" !

The script "bin/server.sh" contains customisations for the heap sizes:

more bin/server.sh
...
if [ -z "$ORIENTDB_OPTS_MEMORY" ] ; then
    ORIENTDB_OPTS_MEMORY="-Xms1G -Xmx10G"
fi
...
exec "$JAVA" $JAVA_OPTS \
    $ORIENTDB_OPTS_MEMORY \
    $JAVA_OPTS_SCRIPT \
    $ORIENTDB_SETTINGS \

However the program seems to ignore these settings and uses 4G for both parameters:

java -server -Xms4G -Xmx4G -Djna.nosys=true ...

Separate issue opened: #7226

@rdelangh (Author)

Overriding the environment variables outside of the "server.sh" script, as suggested in #7226, has the desired result: the server process starts with the new heap parameters ("-Xms4G -Xmx12G").

Retrying to add the 2nd server node "orient6" to the existing server "orient5":

  1. both servers have the ODB version orientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2

  2. config files of both server installations are identical except for the nodename value; the file "config/default-distributed-db-config.json" contains this on both servers:

{
  "autoDeploy": true,
  "readQuorum": 1,
  "writeQuorum": "majority",
  "executionMode": "undefined",
  "readYourWrites": true,
  "newNodeStrategy": "static",
  "servers": {
    "*": "master"
  },
  "clusters": {
    "internal": {
    },
    "cl_*_p0": { "servers": [ "orient1", "orient2", "orient4" ] },
    "cl_*_p1": { "servers": [ "orient2", "orient3", "orient5" ] },
    "cl_*_p2": { "servers": [ "orient3", "orient4", "orient6" ] },
    "cl_*_p3": { "servers": [ "orient4", "orient5", "orient1" ] },
    "cl_*_p4": { "servers": [ "orient5", "orient6", "orient2" ] },
    "cl_*_p5": { "servers": [ "orient6", "orient1", "orient3" ] },
    "*": {
      "servers": ["orient5"]
    }
  }
}

the FINE logging has been set in "config/orientdb-server-log.properties":

.level = FINE
com.orientechnologies.orient.server.distributed.level = FINE
java.util.logging.FileHandler.level = FINE
  3. both servers have a customized "bin/server.sh" script with the following explicit settings:
    JAVA_OPTS_SCRIPT="-Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true "
    JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Ddistributed.backupDirectory=\"\" "
    JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Dquery.parallelAuto=true "
    JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Dquery.parallelMinimumRecords=50000"
    JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Dstorage.wal.maxSize=51200"
    JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -Ddistributed.deployDbTaskTimeout=3600000"
    ORIENTDB_SETTINGS="-Dstorage.diskCache.bufferSize=48000 -Dmemory.chunk.size=33554432"
  1. server "orient5" has the databases ONLINE, server "orient6" has an empty "databases" directory ; in server "orient5", there are no files "databases/*/distributed-db-config.json"

  5. both servers are started with

ulimit -n 10000
ORIENTDB_OPTS_MEMORY="-Xms4G -Xmx12G" ; export ORIENTDB_OPTS_MEMORY
TZ=MET; export TZ
nohup bin/dserver.sh > log/dstartup.out 2>&1 </dev/null &
  6. NTP sync is active on both servers, and the system time is identical

  2. first server "orient5" is started:

...
2017-03-21 16:06:11:848 INFO  OrientDB Server is active v2.2.18-SNAPSHOT (build UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000). [OServer]
2017-03-21 16:06:16:851 INFO  [orient5] Distributed servers status:
+--------+------+------------------------------------+-----+---------+------------------+------------------+-----------------------+
|Name    |Status|Databases                           |Conns|StartedOn|Binary            |HTTP              |UsedMemory             |
+--------+------+------------------------------------+-----+---------+------------------+------------------+-----------------------+
|orient5*|ONLINE|navi=ONLINE (MASTER)                |0    |16:04:45 |10.100.22.125:2424|10.100.22.125:2480|1.12GB/10.67GB (10.53%)|
|        |      |cdrarch=ONLINE (MASTER)             |     |         |                  |                  |                       |
|        |      |GratefulDeadConcerts=ONLINE (MASTER)|     |         |                  |                  |                       |
|        |      |mobile=ONLINE (MASTER)              |     |         |                  |                  |                       |
+--------+------+------------------------------------+-----+---------+------------------+------------------+-----------------------+
 [OHazelcastPlugin]
  1. now we start "orient6"/10.100.22.124 and wait for the replications to complete with "orient5"/10.100.22.125 (if ever... sigh!):
    On "orient6":
...
2017-03-21 16:08:59:844 INFO  [10.100.22.124]:2434 [cdrcontrols] [3.6.5] Established socket connection between /10.100.22.124:51305 and /10.100.22.125:2434 [TcpIpConnectionManager]
...

on "orient5":

...
2017-03-21 16:08:59:849 INFO  [10.100.22.125]:2434 [cdrcontrols] [3.6.5] Established socket connection between /10.100.22.125:2434 and /10.100.22.124:51305 [TcpIpConnectionManager]
...

Let's wait once more and see how this goes...

@lvca (Member) commented Mar 24, 2017

@rdelangh If you put both nodes on the same server (different ports), does it work? It looks like a connection problem; maybe it is about your network settings. In orientdb-server-config.xml, are you binding to 0.0.0.0 or to 10.100.22.*?
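
A quick way to check is to look at the listener definitions in that file, e.g. (assuming the stock config layout):

grep -n "ip-address" config/orientdb-server-config.xml
# 0.0.0.0 binds the binary (2424) and HTTP (2480) listeners on all interfaces;
# a specific address restricts them to that interface only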

@lvca (Member) commented Mar 31, 2017

@rdelangh any news on this?

@rdelangh (Author)

hello @lvca
It seems to be running better now, but it is not yet entirely successful.
Please see the attached logs, which I reduced by filtering out the lines containing 'writing chunk', 'transferring chunk', and 'Node is not online yet (status=STARTING), blocking the command until it is online':
orient5-dserver.txt
orient6-dserver.txt

@rdelangh (Author) commented Mar 31, 2017

I noticed some issues with some clusters in dbase "cdrarch", which I think can be dropped.
However, trying to connect now on "orient5" to database "cdrarch", using "console.sh", seems to fail:

orientdb> CONNECT remote:localhost/cdrarch admin mypwd

Connecting to database [remote:localhost/cdrarch] with user 'admin'...

I stopped the ODB server on the 2nd node "orient6", but this connection attempt still hangs.
Logs on "orient5" keep on scrolling with

...
2017-03-31 12:52:12:999 FINE  Checking cluster health... [OClusterHealthChecker]
2017-03-31 12:52:12:999 FINE  Cluster health checking completed [OClusterHealthChecker]
2017-03-31 12:52:23:000 FINE  Checking cluster health... [OClusterHealthChecker]
2017-03-31 12:52:23:000 FINE  Cluster health checking completed [OClusterHealthChecker]
...

I can run a "jstat" on this server "orient5", but not sure if this indicates a problem situation:

$ ps -aef|grep java
orientdb   871     1 99 Mar28 ?        5-02:44:08 java -server -Xms14G -Xmn3G -Xmx20G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops -Djna.nosys=true -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true -Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=10000 -Dquery.parallelResultQueueSize=200000 -Dstorage.wal.maxSize=51200 -Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 -Ddistributed=true -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/www -Dorientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000 -cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-server-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* com.orientechnologies.orient.server.OServerMain
orientdb  2416  2405  0 10:53 pts/4    00:00:00 grep --color=auto java

orientdb@orient5:~$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 871 10s
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 1936.1  0.0   2359296.0 1542743.2 11534336.0 4565555.9  47104.0 46055.5 5888.0 5553.2  17994  459.700   0      0.000  459.700
393216.0 393216.0 1936.1  0.0   2359296.0 1542743.2 11534336.0 4565555.9  47104.0 46055.5 5888.0 5553.2  17994  459.700   0      0.000  459.700
393216.0 393216.0 1936.1  0.0   2359296.0 1542745.2 11534336.0 4565555.9  47104.0 46055.5 5888.0 5553.2  17994  459.700   0      0.000  459.700
393216.0 393216.0 1936.1  0.0   2359296.0 1542745.2 11534336.0 4565555.9  47104.0 46055.5 5888.0 5553.2  17994  459.700   0      0.000  459.700
...

@rdelangh (Author)

hi,
can I please get any feedback on this issue?

@rdelangh (Author)

hello,
is there any progress, or alternative paths through which we can identify the reason why replication of a big data set fails?

@lvca (Member) commented Apr 26, 2017

Hi @rdelangh, I've pushed some fixes in 2.2.19, which we just released. Could you please tell me if the problem has been fixed? Thanks.

@rdelangh (Author)

ok, thx for that.
I downloaded this version, will launch a replication attempt now and inform you of the outcome.

@lvca (Member) commented Apr 27, 2017

Cool, thanks.

@rdelangh (Author) commented May 3, 2017

Finally... that replication was successful. This was a long saga, thanks Luca for your persistence ;-)

I will do more tests now and let you know. Ultimately, I want to have up to 6 servers running together.

@lvca lvca closed this as completed May 3, 2017
@lvca (Member) commented May 3, 2017

Awesome. @rdelangh thank you for YOUR patience! I'm closing the issue, but in case you experience the same problem, don't hesitate to reopen/comment this. Now we have dynamic timeouts that are balanced with the workload (thanks to a new table of latencies we keep in RAM). Thanks.

@lvca lvca modified the milestones: 2.2.19, 2.2.14 May 3, 2017