Server completely unresponsive after heavy query, no logs that give status, not possible to interact #7545
Comments
@rdelangh, can you please take a thread dump and attach it? I see the Orient team at their best when they have the required data points. It could be some deadlock or similar in a very specific scenario. BTW, I am using Orient 3.0 and just want to make sure this issue doesn't exist for 3.0; if it does exist, I don't want to discover it on my own sometime in the future. :-)
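For reference, taking such a thread dump usually looks like the sketch below; the PID-discovery pattern and the output path are assumptions, not taken from this report (the stock server.sh starts the `com.orientechnologies.orient.server.OServerMain` class).

```bash
# Find the OrientDB server process and dump all thread stacks with lock info.
ORIENTDB_PID=$(pgrep -f 'com.orientechnologies.orient.server.OServerMain' | head -n1)
jstack -l "$ORIENTDB_PID" > "/tmp/orientdb-threads-$(date +%Y%m%d-%H%M%S).txt"
```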
Hi @rdelangh, thank you for opening this issue, and thank you @careerscale for your comment; a thread dump would indeed help a lot. My feeling is that the problem here is the size of the result set (or the intermediate result sets), which overloads the memory and the GC, but it's hard to say without a working example or a thread dump. About v3.0: the query model is completely changed, queries are paginated and the result sets are almost never kept in memory, so the risk of GC overhead is very limited compared to 2.2. Thanks, Luigi
hi @luigidellaquila For the record: because this server was running version 2.2.18 (built early March 2017), I decided to try version 2.2.23, which is happily running on my other machine for another database.
So the process died... presumably because of the launching of "jstack".
Is version 3.0 already safe to start using for a production database?
Meanwhile, how come I cannot even start the database with version 2.2.23? It runs into OutOfMemory when I try to open the database from a "console.sh" session... (see above)
Hi @rdelangh, 3.0 is still under development, it's not yet safe for production. Thanks, Luigi
hi @luigidellaquila After a while, the console attempt failed again with the same message:
These are the current UNIX limits for the user "orientdb" which runs the server process:
I don't know if the number of open files could be an issue, like I had before (that's why the limit was raised from the default 1024 to 10000): in case that would be the reason, FYI the number of all files in the "cdrarch" database directory is:
(If the server had an issue opening ALL these files, then it would also have been a problem with the previous version 2.2.18, which ran fine until now; but that was not a problem.) Anyhow, I will try to revert back to running the server with version 2.2.18...
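When in doubt about which limits a running daemon actually inherited, a quick check plus a persistent raise can look like the sketch below; the PID-discovery pattern and the values are assumptions for illustration, not the settings used on this machine.

```bash
# Show the limits the live server process is really running with.
PID=$(pgrep -f OServerMain | head -n1)
grep -E 'open files|processes' "/proc/$PID/limits"

# Persistent raise for the "orientdb" user in /etc/security/limits.conf
# (illustrative values; they only apply to sessions started after the change):
#   orientdb  soft  nofile  65536
#   orientdb  hard  nofile  65536
#   orientdb  soft  nproc   16384
#   orientdb  hard  nproc   16384
```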
hi @rdelangh, From the error Regards
hi all, as presented in the "vmstat" outputs, there is
To make sure nothing in Ubuntu's kernel was left unreleased (maybe an Ubuntu bug), I rebooted the physical server and retried, but to no avail: the first lines of the Java server's exception trace show Lucene... Could it be that there are problems with the Lucene indexes getting too big? "console.sh" output (as before, unchanged):
server logs:
hi @rdelangh, From the error it really seems that too many threads are created; I feel something is getting stuck during boot. Could you try to get some thread dumps while booting Orient, so we can quickly see if there are too many threads starting and what they are? Kind Regards
Please find attached the thread dump, taken after the server ran out of memory while the client was trying to open the database.
From the command "ps -eaFm", I seem to have 490 threads running in the kernel (all processes together). I start up the server again, without a client trying to open a database (yet); the threads of this server process with PID 1157 are like:
Now I start the client "console.sh" and attempt to open the database; after 10 seconds I run the "jstack" command on the server process again. The output of that one is attached as "jstack.out2.gz". Then, after 20 more seconds, a few seconds before the OOM error occurs again, I managed to take another stack dump; see the attachment "jstack.out3.gz".
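A simple way to automate this kind of capture is to loop jstack while the server boots and the client connects; everything below (PID discovery, interval, paths) is an assumption for illustration.

```bash
# Take a thread dump and thread count every 5 seconds for one minute.
PID=$(pgrep -f OServerMain | head -n1)
for i in $(seq 1 12); do
  echo "dump $i: $(ps -o nlwp= -p "$PID") threads"
  jstack -l "$PID" > "/tmp/jstack.boot.$i.txt" 2>&1
  sleep 5
done
```

The dump taken closest to the OutOfMemoryError is normally the most interesting one.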
Gentlemen, any more feedback on this issue, please? Many thanks for any further investigation!
You have too many Lucene indexes. They exhaust the resources.
hi @robfrank ,
So I guess this is a catch-22 situation:
Which resource limit are we hitting with this number of Lucene indexes in the database?
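A few quick checks can narrow down which resource is actually exhausted: open file descriptors, native threads, or memory-mapped regions (the latter being relevant because Lucene memory-maps its index files). The PID discovery below is an assumption.

```bash
PID=$(pgrep -f OServerMain | head -n1)
echo "open file descriptors: $(ls /proc/$PID/fd | wc -l)"
grep -E 'open files|processes' "/proc/$PID/limits"
echo "threads: $(ps -o nlwp= -p "$PID")"
echo "mapped regions: $(wc -l < /proc/$PID/maps)  (limit: $(cat /proc/sys/vm/max_map_count))"
```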
Hi @rdelangh, Thanks for using OrientDB and for opening this issue. I see quite some risk in implementing such a big solution (with a 2.8 TB database) without considering a Support Contract with us. We offer different Support and Consultancy options. I believe you may benefit from having direct access to our Support Team (which is backed by the OrientDB Engineering Team), or from discussing a data-model review with us, etc. Have you ever considered this opportunity? http://orientdb.com/support/ Your use case seems quite interesting, and I believe it could be a win-win if you open a discussion with our Sales and Support Teams. That said, our Engineering team worked during the last weekend to refactor, in OrientDB v3.0, the lifecycle of Lucene indexes to better handle situations like yours (#7555). The changes have not been pushed yet as they are still under testing. I believe backporting this fix to v2.2 can be considered, but that is something I feel is better discussed with our Sales team first. In the hope this helps, Santo Leto
The changes introduced in 3.0.0-snapshot (issue #7555) have now been backported to version 2.2.24-snapshot. We are still running some tests and working to finalize release 2.2.24. Thanks,
With the latest changes in the Lucene jar, and after a few rebuilds of broken unique (non-Lucene) and non-unique (Lucene) indexes, the v2.2.23 server is operational again and can accept new data-loading and query requests.
That's great @rdelangh, I am closing this issue now. We are releasing the official 2.2.24 version. Again, please do consider an enterprise support subscription, to have direct access to our Support and Engineering teams, reduce the risk involved in your project, and support our open source project and the company behind it. Thanks, Santo Leto
Cool that this has been fixed. Thanks @rdelangh for your contribution.
I still encounter problems with file-locks on the Lucene indexes:
The method to resolve this temporarily is by
Typically the problem happens again when data has been loading for a while (by 3 parallel loading programs), then some minutes pass without loading, and then the loading is started again, but the problem with the file lock aborts it immediately.
again the same error occurred:
-> need to shut down this server, then restart it
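For completeness, a generic restart cycle with the stock scripts is sketched below; the installation path is a placeholder and, depending on configuration, shutdown.sh may ask for the server root credentials. This is only an illustration of the workaround described above, not the exact procedure used here.

```bash
cd /opt/orientdb/bin                                        # placeholder install path
./shutdown.sh                                               # graceful stop (may require root credentials)
while pgrep -f OServerMain > /dev/null; do sleep 2; done    # wait until the JVM is really gone
nohup ./server.sh > ../log/server.out 2>&1 &                # start it again in the background
```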
@rdelangh, which kind of file system are you using to store the index? Are you using NFS or a SAN? I ask because I tested the load scenario you described and I wasn't able to reproduce this kind of error.
@robfrank the fastest filesystem you can get: ZFS, with about 40 GB cached in RAM, so lightning fast. No, the type of filesystem will not be the reason; how else can you explain that a "SELECT count(*) ..." from the index returns its results in 24 seconds, whereas a "SELECT set(rid) AS x ..." from the same index returns its results only after 1830 seconds?
I just installed version 2.2.25 and rebuilt the index "idx_cdr_af_20170709_5" which was mentioned in the errors, but I still get the same errors in the server output while programs try to load records into this database:
The errors that the loading (client) programs get now mention yet another index, "idx_cdr_af_20170709_1":
It is getting worse:
I ran some more tests; this is what I see happening:
By running the "lsof" command in a loop per second, I could see the following phenomenom happening with the file locks:
Notice that it concerns a lock on the one file that was no longer open. Are there maybe some (failing) race conditions happening?
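A per-second watch like the one described can be scripted as below; the PID discovery, the `luceneIndexes` directory name, and the `write.lock` pattern (Lucene's standard lock-file name) are assumptions for illustration.

```bash
# Print, once per second, the Lucene index files and lock files the server holds open.
PID=$(pgrep -f OServerMain | head -n1)
while true; do
  date +%T
  lsof -p "$PID" 2>/dev/null | grep -E 'luceneIndexes|write\.lock'
  sleep 1
done
```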
Gentlemen, any update or feedback on this, please? The database is not usable in this state.
Hi, just finished a session of tests on 2.2.15-SNAPSHOT and 3.0.0-SNAPSHOT with this scenario:
I think the scenario is similar to yours. Can you confirm?
hi @robfrank, many thanks for following up on this!
Is the idle time after which a Lucene index file gets closed configurable? So, instead of the current idle time of (I guess) a few tens of seconds, I would like to raise it to a few hours (or 1 full day).
hello @robfrank, is the idle time after which a Lucene index file gets closed configurable? If so, can you please tell me in which file, or via which command-line arguments, it can be adjusted?
Yes, it is documented: BTW, I worked on a test trying to reproduce your use case, but the indexes are opened and closed without problems. I defined 2 indexes on a class, and I inserted records with both properties as well as records with only one.
thanks again @robfrank
No, I'm afraid not. At the moment these values are set at creation time; our internals don't allow these settings to be changed dynamically at the moment.
ok then, bad news.
Yes, you're right. I will check with the rest of the team how, and if, we can enable this scenario.
Finally the database is alive again, after dropping some tens of Lucene indexes and recreating them with the metadata
With that, the server starts up and not too many indexes are opened (lazy behaviour), but they are also no longer closed too quickly and then reopened again right after. I assume that this too-fast close/open sequence was the cause of some internal synchronisation problem.
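The general shape of such a drop-and-recreate step, run through console.sh against the remote server, is sketched below. The index, class, and property names and the analyzer setting are placeholders; the specific lifecycle metadata keys used here are not shown in this thread, so they are deliberately omitted.

```bash
# Write the SQL commands to a script and feed it to the OrientDB console.
cat > /tmp/recreate_index.osql <<'EOF'
CONNECT remote:localhost/cdrarch admin admin;
DROP INDEX idx_example;
CREATE INDEX idx_example ON SomeClass (someProperty) FULLTEXT ENGINE LUCENE METADATA {"analyzer": "org.apache.lucene.analysis.standard.StandardAnalyzer"};
EOF
"$ORIENTDB_HOME/bin/console.sh" /tmp/recreate_index.osql
```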
OrientDB Version: 2.2.18
Java Version: 1.8.0_92
OS: Ubuntu 16.04
Expected behavior
A server process that is capable of arranging its resources in such a way that it never gets stuck and unresponsive.
Actual behavior
The server had been running since early May without noticeable problems, after much tweaking of its memory resources:
The process is now running with the following arguments:
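Purely as an illustration of the kind of memory knobs involved (not the actual arguments used on this server), the stock server.sh of OrientDB 2.2 reads environment variables like these; the values shown are placeholders.

```bash
# Illustrative memory-related settings for bin/server.sh (placeholder values).
export ORIENTDB_OPTS_MEMORY="-Xms4g -Xmx4g"                      # JVM heap
export ORIENTDB_SETTINGS="-Dstorage.diskCache.bufferSize=32768"  # disk cache, in MB
export JAVA_OPTS_SCRIPT="-XX:MaxDirectMemorySize=512g"           # cap for off-heap/direct memory
./server.sh
```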
One of our users launched a not-so-abnormal query yesterday, July 12, via our client program that communicates with the server via REST; it eventually returned a timeout error after waiting for one full hour.
A new query today failed again with a timeout after waiting for one hour.
Checking the server logs, I can see nothing meaningful apart from a message about these queries yesterday (I removed part of the -very long- query statement to keep it readable here):
Since that moment, the server has been unresponsive: not accepting any call via REST anymore, not allowing login via "console.sh", not taking any connection from the web GUI, not accepting a connection from the "shutdown" script...
It was still writing messages in the logs about the memory settings:
Since it was totally unresponsive, we tried to send it a plain UNIX SIGTERM signal, but nothing appeared in its logs until about 10 minutes later, when the server wrote:
But since then (now 1.5 hours later), the server has still not shut down, and is still unresponsive to any client connection.
And most frustrating: nothing in its logs or output.
Only a few minutes ago, the server process stopped. Still: nothing at all in its logs or output... It simply went away.
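If this happens again, it is sometimes still possible to force diagnostics out of a hung JVM even when the regular tools fail; the commands below are standard JDK 8 tooling, with the PID discovery being an assumption.

```bash
PID=$(pgrep -f OServerMain | head -n1)
kill -3 "$PID"                                  # SIGQUIT: the JVM prints a thread dump to its stdout/console log
jcmd "$PID" Thread.print > /tmp/threads.txt     # alternative, if the attach mechanism still responds
jstack -F "$PID" > /tmp/threads-forced.txt      # last resort: forced dump (pauses the JVM)
```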
So what we want to ask here: