# adding 2nd node in cluster: dbase copy hang with "node not online yet" #6666
Did another (more patient) attempt:
```
2016-09-08 08:45:15:558 INFO [node1473107743544] Distributed servers status: +------------------+--------+------------------------------------+-----+---------+-----------------+-----------------+------------------------+
2016-09-08 08:45:15:597 INFO [node1473107743544]<-[node1473243531293] Received new status node1473243531293.bvdd-test=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:19:577 INFO [node1473107743544]<-[node1473243531293] Received new status node1473243531293.cdrarch=SYNCHRONIZING [OHazelcastPlugin]
2016-09-08 08:45:19:591 INFO [node1473107743544]->[node1473243531293] Creating backup of database 'cdrarch' (compressionRate=7) in directory: /tmp/orientdb/backup_cdrarch.
```
=> Are they stuck, or is something (supposed to be) happening that simply takes an awfully long time?
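One hedged way to tell the two cases apart: the log above prints the backup directory used for the deploy, so watching its size should show whether data is still flowing (the path is the one from the log; the interval is arbitrary):

```sh
# If the deploy is progressing, the backup data under this directory
# should keep growing; if the nodes are stuck, the size will stall.
watch -n 60 du -sh /tmp/orientdb/backup_cdrarch
```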
Found out why:
The situation is better now (no more "Node is not online yet" loop), but only the first node reaches status ONLINE; the 2nd and 3rd nodes remain in status STARTING ...
Can the logging level be increased, so that the reason why the 2nd and 3rd nodes remain in status STARTING hopefully becomes visible? The docs do not describe anything about setting logging levels ...
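For what it's worth, OrientDB 2.2.x logs through java.util.logging, configured via config/orientdb-server-log.properties (the file named later in this thread). A minimal sketch of raising the verbosity; the exact property keys are an assumption, so check the file shipped with your distribution:

```sh
# Back up the JUL config, then bump every "... = INFO" level to FINE
# (assumed keys: .level, com.orientechnologies.level, handler levels).
cp config/orientdb-server-log.properties{,.bak}
sed -i 's/= INFO/= FINE/' config/orientdb-server-log.properties
```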
Seems to be the same issue as closed issues #4176 and #4789; however, those scenarios seem to be a conversion from a standalone server into a distributed setup where, according to the docs, the OPs copy their entire database to the new nodes. 1st node gets ONLINE,
Are the 2nd and 3rd nodes starting at the same time?
@lvca: It seems to make no difference; I tried a few different combinations: start the 1st node, wait until connectivity with the dbase is possible; then start the 2nd node and watch it synchronize some clusters until it gets into that loop ("Node is not online yet"); then start the 3rd node to see if that helps getting clusters distributed, but to no avail.
Do you have logs of the last attempt with nodes that start progressively? Could you share the log files in some way? A GIST is fine too. Thanks.
@lvca, ok, I managed to upload the log files of the first and 2nd server to GIST:
Ok, this error should be fixed in the latest 2.2.10. Could you please retry?
@lvca
Do you have the logs of the other servers that cannot join?
@lvca the first server cannot even start, so I did not attempt to start up any additional server.
I started the first server with version 2.2.7 again, and this one starts fine. So there seems to be some issue with version 2.2.10.
Hi @rdelangh! Regarding
@taburet
Tested locally on a primitive 3-node setup – no leaks. Let's see what the leak detector will uncover on your side.
@lvca
-> please advise.
Mind you that these "System clock apparently jumped" messages make no sense: both servers are well NTP-synchronized, and their NTP log files show no single jump of the system clock larger than 1/1000 of a second!
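A quick way to substantiate that claim, assuming the classic ntpd tooling is installed on both hosts:

```sh
# ntpq -p reports the offset and jitter against each peer in
# milliseconds; values far below 1 ms are consistent with the
# "no clock jumps" observation above.
ntpq -p
```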
Got these extra log messages on "orient5":
and still not shut down
So, there was a GC overhead limit exceeded. Could this have been (again) the problem for the replication from "orient5" to "orient6"? The current settings in "server.sh" are such that the server process is started like this:
I notice that the command-line options for this server process contain "-Xms4G -Xmx4G". However, the "server.sh" script contains this:
ok, it's down:
So, the question is: despite my settings in "server.sh" to tune the "Xms" and "Xmx" values, the process gets started with "-Xms4G" and "-Xmx4G", which apparently are not OK for a big database.
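A hedged way to confirm which heap flags the running server actually received (OServerMain is the server's main class; pgrep/ps options can differ per platform):

```sh
# Print only the -Xms/-Xmx arguments of the running OrientDB server JVM.
ps -o args= -p "$(pgrep -f OServerMain)" | tr ' ' '\n' | grep -E '^-Xm[sx]'
```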
I found that this problem exists in "orientdb-community-2.2.18-20170223.103119-11.tar.gz", but not (anymore) in "orientdb-community-2.2.18-20170304.005202-22.tar.gz" ...
Correction: this problem still exists in "orientdb-community-2.2.18-20170304.005202-22.tar.gz"! The script "bin/server.sh" contains customisations for the heap sizes:
However, the program seems to ignore these settings and uses 4G for both parameters:
Separate issue opened: #7226
Overriding the environment variables outside of the "server.sh" script, as suggested in #7226, has the desired result: the server process starts with the new heap parameters ("-Xms4G -Xmx12G"); a sketch of the override is shown below. Retrying once more to add a 2nd server node "orient6" to the existing server "orient5":
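For reference, a minimal sketch of that override, assuming the 2.2.x server.sh, which only falls back to its built-in heap defaults when ORIENTDB_OPTS_MEMORY is unset (verify the variable name against your copy of the script):

```sh
# Start the server with a larger heap without editing server.sh.
export ORIENTDB_OPTS_MEMORY="-Xms4G -Xmx12G"
./bin/server.sh
```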
FINE logging has been enabled in "config/orientdb-server-log.properties":
on "orient5":
Let's wait once more and see how this goes...
@rdelangh If you put both servers on the same host (different ports), does it work? It looks like a connection problem; maybe it is about your network settings. In
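A hedged sketch of that same-host suggestion: run two instances from two separate directories so their databases/ and config/ do not collide. The port handling below is an assumption: OrientDB's default server config uses port ranges (binary 2424-2430, HTTP 2480-2490) and Hazelcast can auto-increment its port, so a second instance can usually bind to the next free ports; verify in your orientdb-server-config.xml and hazelcast.xml.

```sh
# Duplicate the distribution and start both instances in distributed mode.
cp -r /opt/orientdb /opt/orientdb-node2
/opt/orientdb/bin/dserver.sh &
/opt/orientdb-node2/bin/dserver.sh &
```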
@rdelangh any news on this?
hello @lvca
I noticed some issues with some clusters in dbase "cdrarch", which I think can be dropped.
I stopped the ODB server on the 2nd node "orient6", but this connection attempt still hangs.
I can run "jstat" on this server "orient5", but I am not sure whether this indicates a problem situation:
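For reference, one way to sample the server's GC behaviour over time (a hedged sketch; jstat ships with the JDK, and OServerMain is the server's main class):

```sh
# -gcutil prints old-gen occupancy (O, %) and cumulative full-GC time
# (FGCT, s) every 5 s; an old generation pinned near 100% with FGCT
# climbing matches the earlier "GC overhead limit exceeded" failure mode.
jstat -gcutil "$(pgrep -f OServerMain)" 5000
```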
hi,
hello,
Hi @rdelangh, I've pushed some fixes in 2.2.19, which we just released. Could you please tell me if the problem has been fixed? Thanks.
ok, thx for that.
Cool, thanks.
Finally... that replication was successful. This was a long saga; thanks Luca for your persistence ;-) I will do more tests now and let you know. Ultimately, I want to have up to 6 servers running together.
Awesome. @rdelangh thank you for YOUR patience! I'm closing the issue, but in case you experience the same problem, don't hesitate to reopen/comment on this. We now have dynamic timeouts that are balanced with the workload (thanks to a new table of latencies we keep in RAM). Thanks.
OrientDB Version, operating system, or hardware.
Operating System
I have my database "cdrarch" running on a first server node, which has been started in distributed mode, and is ONLINE:
```
orientdb {db=cdrarch}> list servers
```
I have copied the "hazelcast.xml" file to a second server, and started that server as well.
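A hedged sketch of those two steps (host names and paths are illustrative; dserver.sh is the distributed-mode launcher shipped in the OrientDB distribution):

```sh
# Copy the shared Hazelcast config to the second host, then start that
# host in distributed mode.
scp config/hazelcast.xml orient6:/opt/orientdb/config/
ssh orient6 "/opt/orientdb/bin/dserver.sh"
```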
The logs on both nodes say that they get a socket connection; an extract from the logs on the second node:
Then the database 'GratefulDeadConcerts' gets replicated:
So far so good, it seems.
Then it starts with my big (300 GB) database "cdrarch", but here we get problems:
-> what's happening?