Server completely unresponsive after heavy query, no logs giving status, not possible to interact #7545

Closed
rdelangh opened this issue Jul 13, 2017 · 40 comments

@rdelangh

OrientDB Version: 2.2.18

Java Version: 1.8.0_92

OS: Ubuntu 16.04

Expected behavior

A server process that manages its resources in such a way that it never gets stuck or becomes unresponsive.

Actual behavior

The server had been running since early May without noticeable problems, after much tweaking of its memory resources.
The process is now running with the following arguments:

 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m \
-XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops \
-Djna.nosys=true -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 \
-Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true \
-Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=10000 \
-Dquery.parallelResultQueueSize=200000 -Dstorage.wal.maxSize=51200 \
-Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 \
-Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-log.properties \
-Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-config.xml \
-Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/www \
-Dorientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000 \
-cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-server-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* com.orientechnologies.orient.server.OServerMain

One of our users launched a not-so-abnormal query yesterday, July 12, via our client program that communicates with the server via REST; it eventually returned a timeout error after waiting for one full hour.
A new query today failed again with a timeout after waiting for one hour.

Checking the server logs, I can see nothing meaningful apart from a message about these queries yesterday (I removed part of the -very long- query statement to keep it readable here):

...
2017-07-12 12:23:21:517 INFO  {db=cdrarch} [TIP] Query 'SELECT expand(DISTINCT(rowid)) FROM \
(SELECT rowid FROM (SELECT unionAll($var1.x, $var2.x) AS rowid \
LET var1 = (SELECT unionAll($var_cdr_af_20170709_1.x, $var_cdr_af_20170709_2.x) AS x \
LET var_cdr_af_20170709_1 = (SELECT set(rid) AS x FROM INDEX:idx_cdr_af_20170709_4 \
WHERE key LUCENE ' *70210045* '), var_cdr_af_20170709_2 = (SELECT set(rid) AS x FROM \
INDEX:idx_cdr_af_20170709_3 WHERE key LUCENE ' *70210045* ')), var2 = (SELECT set(rid) AS x \
FROM INDEX:idx_cdr_icx_20170709_3 WHERE key LUCENE ' *70210045* ')) UNWIND rowid)' \
returned a result set with more than 10000 records. Check if you really need all these records, or \
reduce the resultset by using a LIMIT to improve both performance and used RAM [OProfilerStub]Error on \
formatting message '$ANSI{green {db=cdrarch}} [TIP] Query 'SELECT ifnull(if(eval(" FILENAME LIKE 'ICX%' "), \
PRT_DATE.format('ddMMyy'), DATE_START), '') AS DATE_START, ifnull(if(eval(" FILENAME LIKE 'ICX%' "), \
PRT_DATE.format('hhmmss'), TIME_START), '') AS TIME_START, ... AND PRT_DATE BETWEEN '2017-07-11 00:00:00.000' AND '2017-07-11 23:59:59.999'' \
returned a result set with more than 10000 records. Check if you really need all these records, or reduce \
the resultset by using a LIMIT to improve both performance and used RAM'. \
Exception: java.util.UnknownFormatConversionException: Conversion = '''

From that moment on, the server became unresponsive: it no longer accepted any call via REST, no longer allowed logins via "console.sh", took no connections from the web GUI, and did not accept a connection from the "shutdown" script...

It was still writing messages in the logs about the memory settings:

...
2017-07-12 12:24:25:010 INFO  Database 'OSystem' uses 45,832MB/54,000MB of DISKCACHE memory, while Heap is not completely used (usedHeap=13973MB maxHeap=24192MB). To improve performance set maxHeap to 16768MB and DISKCACHE to 61423MB [OAbstractProfiler$MemoryChecker]
2017-07-12 12:24:25:010 INFO  -> Open server.sh (or server.bat on Windows) and change the following variables: 1) MAXHEAP=-Xmx16768M 2) MAXDISKCACHE=61423 [OAbstractProfiler$MemoryChecker]
...
2017-07-13 12:39:38:098 INFO  Database 'cdrarch' uses 45,832MB/54,000MB of DISKCACHE memory, while Heap is not completely used (usedHeap=14279MB maxHeap=24192MB). To improve performance set maxHeap to 17134MB and DISKCACHE to 61057MB [OAbstractProfiler$MemoryChecker]
2017-07-13 12:39:38:098 INFO  -> Open server.sh (or server.bat on Windows) and change the following variables: 1) MAXHEAP=-Xmx17134M 2) MAXDISKCACHE=61057 [OAbstractProfiler$MemoryChecker]

Since it was totally unresponsive, we tried sending it a plain UNIX 'SIGTERM' signal, but nothing appeared in its logs until about 10 minutes later, when the server wrote:

...
2017-07-13 12:40:33:008 WARNI Received signal: SIGTERM [OSignalHandler]
2017-07-13 12:45:45:503 WARNI Received signal: SIGTERM [OSignalHandler]
2017-07-13 12:46:25:980 WARNI Received signal: SIGTERM [OSignalHandler]
2017-07-13 12:46:52:789 INFO  Database 'OSystem' uses 45,832MB/54,000MB of DISKCACHE memory, while Heap is not completely used (usedHeap=14280MB maxHeap=24192MB). To improve performance set maxHeap to 17136MB and DISKCACHE to 61056MB [OAbstractProfiler$MemoryChecker]
2017-07-13 12:47:09:594 INFO  OrientDB Server is shutting down... [OServer]
2017-07-13 12:47:18:324 INFO  -> Open server.sh (or server.bat on Windows) and change the following variables: 1) MAXHEAP=-Xmx17136M 2) MAXDISKCACHE=61056 [OAbstractProfiler$MemoryChecker]
...

But since then (now 1.5 hours later), the server has still not shut down, and is still unresponsive to any client connection.
And most frustrating: nothing in its logs or output.

Only a few minutes ago, the server process stopped. Still: nothing at all in its logs or output... It simply went away.

So what we want to ask here:

  1. make us less 'blind' to the status of the server: have it write more logs, or have it keep accepting client connections that request status information; the current mechanisms are clearly not robust enough.
  2. given the timeline of events, that one query (with its many results) might be related. When a query collects many records ('documents'), this seems to impact server stability. Is the server running out of resources? Can the underlying Java memory handling be kept under better control, so that what is happening can at least be logged?
@careerscale

@rdelangh, can you please take a thread dump and attach it? I have seen the Orient team at their best when they have the required data points. It could be some deadlock or similar in a very specific scenario.

BTW, I am using Orient 3.0; I just want to make sure this issue doesn't exist in 3.0, and if it does, I don't want to discover it on my own sometime in the future. :-)
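
For reference, a thread dump can usually be taken with jstack, or by sending SIGQUIT so the JVM prints the dump to its own stdout (the PID below is just a placeholder):

$ jstack <pid> > /tmp/threads.txt
$ kill -3 <pid>    # SIGQUIT: the JVM writes the thread dump to its stdout/stderr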

@luigidellaquila
Member

Hi @rdelangh

Thank you for opening this issue, and thank you @careerscale for your comment, a thread dump would help a lot indeed.

My feeling is that the problem here is the size of the result set (or the intermediate result sets) that overloads the memory and the GC, but it's hard to say without a working example or a thread dump.

About v 3.0, the query model is completely changed, queries are paginated and the result sets are almost never kept in memory, so the risk of GC overhead is very limited compared to 2.2.

Thanks

Luigi

@rdelangh
Author

hi @luigidellaquila
As I wrote: the server is fully unresponsive, so taking a thread dump was not possible either.

For the record: because this server was running version 2.2.18 (built early March 2017), I decided to try version 2.2.23, which is happily running on my other machine for another database.
I copied the same start script with the environment variable settings for heap etc., and launched it on this machine.
Then trying to open the database "cdrarch" again failed with 'OutOfMemory'...
Trying to shut down this server with the "shutdown" script also failed with 'OutOfMemory'...

$ bin/shutdown.sh -h orient2.bc -P 2424-2430 -u root -p $ourPwd ; tail -f log/orient-server.log.0
...
Sending shutdown command to remote OrientDB Server instance...
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
...

@rdelangh
Author

$ ps -aef|grep java
orientdb  6887     1  4 15:19 ?        00:04:46 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m ...

$ jstack 6887
6887: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

$ jstack -F 6887 |more
Attaching to process ID 6887, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: cannot open binary file
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: cannot open binary file
...
$ ps -aef|grep java
$ 

So the process died... most likely because of launching "jstack".

@rdelangh
Author

Is version 3.0 already safe to start using for a production database?

@rdelangh
Author

Meanwhile, how come I cannot even open the database with version 2.2.23? It runs into OutOfMemory when I try to open the database from a "console.sh" session... (see above)

@luigidellaquila
Member

Hi @rdelangh

3.0 is still under development; it's not yet safe for production.
The OOM on database open is quite strange. Do you think you can share it (on a private channel)? We can probably do some debugging on it and see what the problem is.

Thanks

Luigi

@rdelangh
Author

hi @luigidellaquila
What exactly would you want me to share? If you mean the database, that won't be possible due to its size (2.8TB).
I think it is an out-of-memory at the UNIX (Ubuntu) level:
I stopped the server, restarted it, and launched another attempt from "console.sh" to open the database;
in parallel, in other UNIX sessions, I had "vmstat" running, as well as a "while" loop showing the RSS and VSZ sizes of the process, and "jstat" to watch the Java internals of the server process.
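
For illustration, the monitoring described above amounts to something like this (PID and intervals are placeholders):

$ vmstat 5
$ while sleep 10; do ps -o pid,rss,vsz,args -p <serverPID>; done
$ jstat -gc <serverPID> 10s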

After a while, the console attempt failed again with the same message:

orientdb> CONNECT remote:localhost/cdrarch admin $ourpwd
Connecting to database [remote:localhost/cdrarch] with user 'admin'...
Error: com.orientechnologies.orient.enterprise.channel.binary.OResponseProcessingException: Exception during response processing
Error: java.lang.OutOfMemoryError: unable to create new native thread
orientdb> quit
  1. the system still had enough memory, as shown by the output of "vmstat":
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0      0 38162624 167088 16718172    0    0     0   142  457  907  0  0 100  0  0
 0  0      0 38159080 167088 16718180    0    0     0     0  239  515  0  0 100  0  0
 0  0      0 36670556 167088 16718180    0    0     0   137 1852 1407  1  4 95  0  0
 0  0      0 36663784 167088 16718004    0    0     0   138  423  951  0  0 100  0  0
 0  0      0 36663084 167088 16717100    0    0    15   146  551 1182  0  0 99  0  0
 0  0      0 36646640 167088 16717116    0    0     0     0  242  525  0  0 100  0  0
 0  0      0 36635464 167088 16717052    0    0     0   142  435  903  0  0 100  0  0
 0  0      0 36642324 167088 16717052    0    0     0     0  226  471  0  0 100  0  0
  1. the "jstat" output did not show dramatic activity from the server:
...
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0  0.0    0.0   2359296.0 1841816.5 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1841816.5 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1841816.5 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1841816.5 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1841816.5 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1889002.4 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0    0.0   2359296.0 1889002.4 11534336.0    0.0     4480.0 773.9  384.0   75.8       0    0.000   0      0.000    0.000
393216.0 393216.0  0.0   393215.3 2359296.0 512756.2 11534336.0  428955.4  29440.0 28431.7 3584.0 3266.3      1    0.513   0      0.000    0.513
393216.0 393216.0  0.0   393215.3 2359296.0 1247741.8 11534336.0  428955.4  29440.0 28431.7 3584.0 3266.3      1    0.513   0      0.000    0.513
393216.0 393216.0  0.0   393215.3 2359296.0 2310393.6 11534336.0  428955.4  29440.0 28431.7 3584.0 3266.3      1    0.513   0      0.000    0.513
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 393214.4  0.0   2359296.0 680215.9 11534336.0 1566890.9  29952.0 28915.8 3584.0 3274.4      2    1.230   0      0.000    1.230
393216.0 393216.0 393214.4  0.0   2359296.0 1583793.1 11534336.0 1566890.9  29952.0 28915.8 3584.0 3274.4      2    1.230   0      0.000    1.230
393216.0 393216.0 393214.4  0.0   2359296.0 2265422.4 11534336.0 1566890.9  29952.0 28915.8 3584.0 3274.4      2    1.230   0      0.000    1.230
393216.0 393216.0 393214.4  0.0   2359296.0 2316541.0 11534336.0 1566890.9  29952.0 28915.8 3584.0 3274.4      2    1.230   0      0.000    1.230
393216.0 393216.0  0.0   393207.3 2359296.0 109861.7 11534336.0 3054344.8  30208.0 29396.1 3584.0 3311.7      3    1.608   0      0.000    1.608
393216.0 393216.0  0.0   393207.3 2359296.0 267287.9 11534336.0 3054344.8  30208.0 29396.1 3584.0 3311.7      3    1.608   0      0.000    1.608
393216.0 393216.0  0.0   393207.3 2359296.0 288782.6 11534336.0 3054344.8  30208.0 29396.1 3584.0 3311.7      3    1.608   0      0.000    1.608
393216.0 393216.0  0.0   393207.3 2359296.0 316776.0 11534336.0 3054344.8  30208.0 29396.1 3584.0 3311.7      3    1.608   0      0.000    1.608
  1. the "ps" output of the server process, suddenly could not be launched anymore from the parent shell:
$ while sleep 10;do ps -o pid,rss,vsz,args -p 7837;done
...
  PID   RSS    VSZ COMMAND
 7837 52424040 366744052 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12
  PID   RSS    VSZ COMMAND
 7837 52581392 442541292 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12
-bash: fork: retry: No child processes
  PID   RSS    VSZ COMMAND
 7837 52638196 477968672 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
  PID   RSS    VSZ COMMAND
 7837 54142540 477968672 java -server -Xms14G -Xmn3G -Xmx24G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12
-bash: fork: retry: No child processes
...

These are the current UNIX limits for the user "orientdb" which runs the server process:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514442
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514442
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@rdelangh
Author

I don't know if the number of open files could be an issue, as it was before (that's why the limit was raised from the default 1024 to 10000). In case that would be the reason: FYI, the total number of files in the "cdrarch" database directory is:

find databases/cdrarch/ |wc -l
36216

(If the server had an issue opening ALL these files, then that would also have been a problem with the previous version 2.2.18, which ran until now; but it was not.)

Anyhow, I will try to revert back to running the server with version 2.2.18 ...

@tglman
Member

tglman commented Jul 13, 2017

hi @rdelangh,

The error java.lang.OutOfMemoryError: unable to create new native thread suggests that you ran out of memory on your host machine (note that this is not a heap error), so I would suggest double-checking whether heap + disk cache fit in the machine's memory. Your heap is pretty big (24GB); reducing it and the disk cache size may help.

Regards
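
A quick way to tell whether this is thread exhaustion rather than heap exhaustion is to compare the live native thread count of the process with the user-process limit (PID is a placeholder):

$ cat /proc/<pid>/status | grep Threads    # live native thread count of the JVM
$ ulimit -u                                # max user processes/threads for this user (514442 above)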

@rdelangh
Author

rdelangh commented Jul 13, 2017

hi all,

as presented in the "vmstat" outputs, there is

  • still plenty of "free" RAM on the physical host
  • no swap capacity used

To make sure nothing in Ubuntu's kernel was left unreleased (maybe an Ubuntu bug), I rebooted the physical server and retried, but to no avail: the first lines of the Java server's exception trace mention Lucene... Could it be that the Lucene indexes are getting too big?
Any possibility to trace what the server is doing while it is opening that database?

"console.sh" output (as before, unchanged):

$ /opt/orientdb/orientdb-community-2.2.23-SNAPSHOT/bin/console.sh
OrientDB console v.2.2.23-SNAPSHOT (build 93db75f83d3fab88741c04bee27592e6b04aaa06) https://www.orientdb.com
Type 'help' to display all the supported commands.
Installing extensions for GREMLIN language v.2.6.0
orientdb> CONNECT remote:localhost/cdrarch admin $ourpwd
Connecting to database [remote:localhost/cdrarch] with user 'admin'...
Error: com.orientechnologies.orient.enterprise.channel.binary.OResponseProcessingException: Exception during response processing
Error: java.lang.OutOfMemoryError: unable to create new native thread
orientdb>

server logs:

2017-07-13 21:07:21:743 WARNI Authenticated clients can execute any kind of code into the server by using the following allowed languages: [sql] [OServerSideScriptInterpreter]
2017-07-13 21:07:21:744 INFO  OrientDB Studio available at http://10.49.69.189:2480/studio/index.html [OServer]
2017-07-13 21:07:21:744 INFO  OrientDB Server is active v2.2.23-SNAPSHOT (build 93db75f83d3fab88741c04bee27592e6b04aaa06). [OServer]$ANSI{green {db=cdrarch}} Exception `7304356C` in storage `cdrarch`
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.open(OLuceneIndexEngineAbstract.java:231)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.reOpen(OLuceneIndexEngineAbstract.java:209)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.init(OLuceneIndexEngineAbstract.java:174)
        at com.orientechnologies.lucene.engine.OLuceneFullTextIndexEngine.init(OLuceneFullTextIndexEngine.java:67)
        at com.orientechnologies.lucene.index.OLuceneIndexNotUnique$3.callEngine(OLuceneIndexNotUnique.java:205)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doCallIndexEngine(OAbstractPaginatedStorage.java:2197)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.callIndexEngine(OAbstractPaginatedStorage.java:2168)
        at com.orientechnologies.lucene.index.OLuceneIndexNotUnique.onIndexEngineChange(OLuceneIndexNotUnique.java:201)
        at com.orientechnologies.orient.core.index.OIndexAbstract.loadFromConfiguration(OIndexAbstract.java:336)
        at com.orientechnologies.orient.core.index.OIndexManagerShared.fromStream(OIndexManagerShared.java:401)
        at com.orientechnologies.orient.core.type.ODocumentWrapperNoClass.reload(ODocumentWrapperNoClass.java:74)
        at com.orientechnologies.orient.core.index.OIndexManagerAbstract.load(OIndexManagerAbstract.java:85)
        at com.orientechnologies.orient.core.index.OIndexManagerAbstract.load(OIndexManagerAbstract.java:57)
        at com.orientechnologies.orient.core.metadata.OMetadataDefault$2.call(OMetadataDefault.java:191)
        at com.orientechnologies.orient.core.metadata.OMetadataDefault$2.call(OMetadataDefault.java:181)
        at com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:62)
        at com.orientechnologies.orient.core.storage.OStorageAbstract.getResource(OStorageAbstract.java:143)
        at com.orientechnologies.orient.core.metadata.OMetadataDefault.init(OMetadataDefault.java:180)
        at com.orientechnologies.orient.core.metadata.OMetadataDefault.load(OMetadataDefault.java:89)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.initAtFirstOpen(ODatabaseDocumentTx.java:3179)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:270)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:962)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:932)
        at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:923)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.openDatabase(ONetworkProtocolBinary.java:892)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.handshakeRequest(ONetworkProtocolBinary.java:234)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:199)
        at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)

@tglman
Member

tglman commented Jul 13, 2017

hi @rdelangh,

From the error it really seems that too many threads are being created; I feel something is getting stuck during boot. Could you try to get some thread dumps while booting Orient, so we can quickly see whether too many threads are starting and what they are?

Kind Regards
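
One way to do that is a small loop that snapshots the threads every few seconds while the server boots (PID and output path are placeholders):

$ while sleep 5; do jstack <serverPID> > /tmp/jstack.$(date +%s).txt; done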

@rdelangh
Author

Please find attached the thread dump, taken after the server ran out of memory while the client was trying to open the database.
jstack.out.gz

@rdelangh
Author

From the command "ps -eaFm ", I seem to have 490 threads running in the kernel (all processes together).
I had to stop some of the shell sessions to clear threads from the kernel, otherwise the "shutdown.sh" script could not contact the OrientDB server for shutting down. But now it can, and the server is down. The number of threads running now is reduced to 40.

I start up the server again, without any client trying to open a database (yet); the threads of this server process with PID 1157 look like this:

$ jstack 1157
2017-07-14 00:24:46
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode):

"Attach Listener" #30 daemon prio=9 os_prio=0 tid=0x00007f8dac001000 nid=0x555 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"OrientDB ONetworkProtocolHttpDb listen at 0.0.0.0:2480-2490" #28 prio=5 os_prio=0 tid=0x00007f8d64a0a800 nid=0x4c4 runnable [0x00007f8e49d05000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at com.orientechnologies.orient.server.network.OServerNetworkListener.run(OServerNetworkListener.java:189)

"OrientDB ONetworkProtocolBinary listen at 0.0.0.0:2424-2430" #26 prio=5 os_prio=0 tid=0x00007f8d64a0a000 nid=0x4c3 runnable [0x00007f8e49e06000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at com.orientechnologies.orient.server.network.OServerNetworkListener.run(OServerNetworkListener.java:189)

"Timer-0" #23 daemon prio=5 os_prio=0 tid=0x00007f8d64696000 nid=0x4c2 in Object.wait() [0x00007f8e4ad92000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007492fb1b8> (a java.util.TaskQueue)
        at java.util.TimerThread.mainLoop(Timer.java:552)
        - locked <0x00000007492fb1b8> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:505)

"Thread-0" #21 prio=5 os_prio=0 tid=0x00007f8eb85e3000 nid=0x4c1 waiting on condition [0x00007f8e4b5f4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000074a540900> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at com.orientechnologies.orient.server.OServer.waitForShutdown(OServer.java:528)
        at com.orientechnologies.orient.server.OServerMain$1.run(OServerMain.java:47)

"Service Thread" #20 daemon prio=9 os_prio=0 tid=0x00007f8eb8558000 nid=0x4bf runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread14" #19 daemon prio=9 os_prio=0 tid=0x00007f8eb8555000 nid=0x4be waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread13" #18 daemon prio=9 os_prio=0 tid=0x00007f8eb8553800 nid=0x4bd waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
...

Now I start the client "console.sh" and attempt to open the database; after 10 seconds I run the "jstack" command on the server process again. Its output is attached as "jstack.out2.gz".
jstack.out2.gz

Then, after 20 more seconds, a few seconds before the OOM error occurred again, I managed to take another stack dump; see the attachment "jstack.out3.gz".
jstack.out3.gz

@rdelangh
Author

Gentlemen, any more feedback on this issue, please?
Our database cannot start up anymore: not with the newer version 2.2.23, and not with version 2.2.18, which it had been running on for the past 2 months...
Since the software (2.2.18) has not changed in the meantime, it must be a change inside my database.
I see a lot of "Lucene" strings mentioned in these thread dumps; could it be related to the Lucene indexes?

many thanks for any further investigations!

@robfrank
Contributor

You have too many Lucene indexes; they exhaust the resources.
For now, I don't have a solution to offer you.
As a workaround, try to start the server without Lucene support: remove the orientdb-lucene jar and drop as many indexes as you can (I don't know if the server will start without the jar).
Then put the orientdb-lucene jar back.
I haven't tried this solution myself, so I don't know whether it will work.
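
If the server does come up, dropping an index from the console is a plain SQL statement, for example (the index name here is just one taken from this thread):

orientdb> DROP INDEX idx_cdr_af_20170709_4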

@rdelangh
Author

hi @robfrank ,
many thanks for your feedback.

  1. I renamed the file "./lib/orientdb-lucene-2.2.23-SNAPSHOT.jar" to "./lib/orientdb-lucene-2.2.23-SNAPSHOT.jar.orig", then started the 2.2.23 server
  2. then I started "console.sh" and attempted to open the database, but it failed immediately:
$ /opt/orientdb/orientdb-community-2.2.23-SNAPSHOT/bin/console.sh

OrientDB console v.2.2.23-SNAPSHOT (build 93db75f83d3fab88741c04bee27592e6b04aaa06) https://www.orientdb.com
Type 'help' to display all the supported commands.
Installing extensions for GREMLIN language v.2.6.0

orientdb> CONNECT remote:localhost/cdrarch admin $ourpwd

Connecting to database [remote:localhost/cdrarch] with user 'admin'...
Error: com.orientechnologies.orient.core.exception.OStorageException: Cannot open local storage '/opt/orientdb/databases/cdrarch' with mode=rw
        DB name="cdrarch"
        DB name="cdrarch"

Error: com.orientechnologies.orient.core.index.OIndexException: Index type FULLTEXT is not supported by the engine 'LUCENE'. Types are [NOTUNIQUE_HASH_INDEX,FULLTEXT,UNIQUE_HASH_INDEX,UNIQUE,DICTIONARY,DICTIONARY_HASH_INDEX,FULLTEXT_HASH_INDEX,NOTUNIQUE]
        DB name="cdrarch"

So I guess this is a catch-22 situation:

  • the database has too many Lucene indexes to be able to start
  • in order to remove indexes, the server should be started with Lucene support

Which resource limit are we hitting with this number of Lucene indexes in the database?
If we know the limit, then maybe we can raise something in the operating system to cope with it.
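
To narrow down which limit is being hit, it may help to compare the number of Lucene index directories with the live thread and file-descriptor counts of the server process (PID is a placeholder):

$ ls databases/cdrarch/luceneIndexes | wc -l      # number of Lucene indexes on disk
$ ls /proc/<pid>/task | wc -l                     # native threads in the JVM
$ ls /proc/<pid>/fd | wc -l                       # open file descriptors vs 'ulimit -n' (10000 above)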

@santo-it

Hi @rdelangh,

Thanks for using OrientDB and for opening this issue.

I see quite some risk in implementing such a big solution (with a 2.8TB database) without considering a Support Contract with us. We offer different Support and Consultancy options. I believe you would benefit from having direct access to our Support Team (which is backed by the OrientDB Engineering Team), or from discussing a data model review with us, etc. Have you ever considered this opportunity?

http://orientdb.com/support/
http://orientdb.com/consulting/

Your use case seems quite interesting, and I believe it could be a win-win if you open a discussion with our Sales and Support Teams.

That said, our Engineering team worked over the last weekend to refactor, in OrientDB v3.0, the lifecycle of Lucene indexes to better handle situations like yours (#7555). The changes have not been pushed yet, as they are still under testing.

I believe backporting this fix to v2.2 can be considered, but this is something I feel is better discussed with our Sales team first.

In the hope this helps,

Santo Leto,
Director, Global Customer Support, OrientDB

@santo-it self-assigned this Jul 19, 2017
@santo-it added this to the 2.2.24 milestone Jul 19, 2017
@santo-it

Changes introduced in 3.0.0-snapshot (issue #7555) have now been backported to version 2.2.24-snapshot. We are still running some tests and working to finalize release 2.2.24.

Thanks,

@rdelangh
Author

With the latest changes in the Lucene jar, and a few rebuilds of broken unique (non-Lucene) and non-unique (Lucene) indexes, the v2.2.23 server is operational again and can accept new data-loading and query requests.
The server stays well under control now, no longer consuming awful amounts of RAM, and uses 'lazy' indexes.
Thanks a lot to @robfrank, he is a true guru!!
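
For reference, such rebuilds can be run from the console with OrientDB's REBUILD INDEX statement, e.g. (the index name is just one taken from this thread):

orientdb> REBUILD INDEX idx_cdr_af_20170709_5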

@santo-it

That's great @rdelangh,

I am closing this issue now. We are releasing the official 2.2.24 version.

Again, please do consider an enterprise support subscription, to have direct access to our support and engineering teams, reduce the risk involved with your project, and support our open-source project and the company behind it.

Thanks,

Santo Leto,
Director, Global Customer Support, OrientDB

@lvca
Member

lvca commented Jul 20, 2017

Cool, this has been fixed. Thanks @rdelangh for your contribution.

@rdelangh
Author

I still encounter problems with file-locks on the Lucene indexes:

com.orientechnologies.orient.core.exception.OStorageException: Cannot put key value entry in index
        DB name="cdrarch"
--> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
--> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_7/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T19:35:31Z)

The way to resolve this temporarily is to:

  • shut down the server
  • restart the server

Typically, the problem happens again after data has been loading for a while (from 3 parallel loading programs), followed by some minutes without loading; when loading is then restarted, the file-lock problem aborts it immediately.
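
Concretely, the temporary workaround is a full stop/start cycle along these lines (host, ports and credentials as in the earlier shutdown example):

$ bin/shutdown.sh -h orient2.bc -P 2424-2430 -u root -p $ourPwd
$ bin/server.sh &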

@rdelangh
Author

rdelangh commented Jul 26, 2017

the same error occurred again:

com.orientechnologies.orient.core.exception.OStorageException: Cannot put key value entry in index
        DB name="cdrarch"
--> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
--> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-02T00:37:28Z)

-> need to shut down this server, then restart it.
During server shutdown, it generates the following messages in its log:

...
2017-07-26 23:36:05:355 INFO  - shutdown storage: cdrarch... [Orient]Error on closing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-02T00:37:32Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.commitAndCloseWriter(OLuceneIndexEngineAbstract.java:299)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.close(OLuceneIndexEngineAbstract.java:529)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doClose(OAbstractPaginatedStorage.java:4192)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.close(OAbstractPaginatedStorage.java:548)
        at com.orientechnologies.orient.core.storage.OStorageAbstract.shutdown(OStorageAbstract.java:222)
        at com.orientechnologies.orient.core.Orient.shutdownAllStorages(Orient.java:397)
        at com.orientechnologies.orient.core.Orient.access$500(Orient.java:56)
        at com.orientechnologies.orient.core.Orient$OShutdownEnginesHandler.shutdown(Orient.java:933)
        at com.orientechnologies.orient.core.Orient.shutdown(Orient.java:330)
        at com.orientechnologies.orient.server.OServer.deinit(OServer.java:514)
        at com.orientechnologies.orient.server.OServer.shutdown(OServer.java:440)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary$5.run(ONetworkProtocolBinary.java:2237)

2017-07-26 23:36:07:404 INFO  - shutdown storage: OSystem... [Orient]
2017-07-26 23:36:12:885 INFO  OrientDB Engine shutdown complete [Orient]
2017-07-26 23:36:12:886 INFO  OrientDB Server shutdown complete

@lvca reopened this Jul 26, 2017
@robfrank
Contributor

@rdelangh which kind of file system are you using to store the index? Are you using NFS or a SAN? Because I tested the load scenario you depicted and I wasn't able to reproduce this kind of error.

@rdelangh
Author

@robfrank the fastest filesystem you can get: ZFS, with about 40GB cached in RAM, so lightning fast.

No, the type of filesystem will not be the reason; how else would you explain that a "SELECT count(*) ..." from the index returns its results in 24 seconds, whereas a "SELECT set(rid) AS x ..." from the same index returns its results only after 1830 seconds?
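
For reference, the two query shapes being compared look roughly like this (index name and search term as in the earlier query; the count(*) form is reconstructed from the description above):

SELECT count(*) FROM INDEX:idx_cdr_icx_20170709_3 WHERE key LUCENE ' *70210045* '
SELECT set(rid) AS x FROM INDEX:idx_cdr_icx_20170709_3 WHERE key LUCENE ' *70210045* '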

@rdelangh
Author

rdelangh commented Jul 27, 2017

I just installed version 2.2.25 and rebuilt the index "idx_cdr_af_20170709_5", which was mentioned in the errors, but I still get the same errors in the server output while programs try to load records into this database:

org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.commitAndCloseWriter(OLuceneIndexEngineAbstract.java:302)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.close(OLuceneIndexEngineAbstract.java:531)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:208)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on flushing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.flush(OLuceneIndexEngineAbstract.java:315)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:213)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on closing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.commitAndCloseWriter(OLuceneIndexEngineAbstract.java:302)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.close(OLuceneIndexEngineAbstract.java:531)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:208)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on flushing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.flush(OLuceneIndexEngineAbstract.java:315)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:213)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on closing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.commitAndCloseWriter(OLuceneIndexEngineAbstract.java:302)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.close(OLuceneIndexEngineAbstract.java:531)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:208)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on flushing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.flush(OLuceneIndexEngineAbstract.java:315)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:213)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on closing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.commitAndCloseWriter(OLuceneIndexEngineAbstract.java:302)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.close(OLuceneIndexEngineAbstract.java:531)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:208)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Error on flushing Lucene index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)
        at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37)
        at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:746)
        at org.apache.lucene.index.IndexFileDeleter.deletePendingFiles(IndexFileDeleter.java:530)
        at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:638)
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4408)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2874)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2977)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2944)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract.flush(OLuceneIndexEngineAbstract.java:315)
        at com.orientechnologies.lucene.engine.OLuceneIndexEngineAbstract$2.run(OLuceneIndexEngineAbstract.java:213)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
...

The errors that the loading (client) programs get now mention yet another index, "idx_cdr_af_20170709_1":

com.orientechnologies.orient.core.exception.OStorageException: Cannot put key value entry in index
        DB name="cdrarch"
--> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
--> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-19T11:07:39Z)

@rdelangh
Author

It is getting worse:

  • I stop the server
  • replace it with version 2.2.25
  • start up the new server
  • open the database with a separate connection from the client "console.sh" (takes 7 mins)
  • launch a sequential program that loads files into the database: this runs fine for some 130 files, but then fails again with the familiar error:
com.orientechnologies.orient.core.exception.OStorageException: Cannot put key value entry in index
        DB name="cdrarch"
--> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
--> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-27T11:28:18Z)

@rdelangh
Author

I ran some more tests:
after having restarted the 2.2.25 server (again), I launched a loop that runs the command "lsof -p $serverPID | grep luceneIndexes" every second, in order to capture, over time, which files from the "luceneIndexes" subdirectory are opened, and their lock status.
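
The loop was along the lines of:

$ while sleep 1; do lsof -p $serverPID | grep luceneIndexes; done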

This is what I see happening:

  • right after startup: 1129 open file descriptors, none from "luceneIndexes"
  • after loading a first data-file:
lsof -p $PID | grep luceneIndexes
java    7273 orientdb  mem    REG               0,39    23373562  170329 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2hc.cfs
java    7273 orientdb  mem    REG               0,39    25273410  142570 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_25y.cfs
java    7273 orientdb  mem    REG               0,39    36202149  152274 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2cm.cfs
java    7273 orientdb  mem    REG               0,39   150883152  174511 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20o.fdt
java    7273 orientdb  mem    REG               0,39    34477099  174931 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20o.nvd
java    7273 orientdb  mem    REG               0,39   161027962  174718 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20o_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39    62829559  174716 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20o_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39    81735219  172505 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1at.fdt
java    7273 orientdb  mem    REG               0,39    74520561  172586 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1at_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39    24494565  172584 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1at_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39   157414075   76965 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_3h.fdt
java    7273 orientdb  mem    REG               0,39    42767233   77456 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_3h.nvd
java    7273 orientdb  mem    REG               0,39   197322248   77053 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_3h_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39   106788951   77051 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_3h_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39     6295470  171784 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_2l.fdt
java    7273 orientdb  mem    REG               0,39     6745968  171800 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_2l_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39     7577496  171280 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_1r.fdt
java    7273 orientdb  mem    REG               0,39     7653125  171283 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_1r_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39     6962813  170699 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_v.fdt
java    7273 orientdb  mem    REG               0,39     7241739  170702 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_v_Lucene50_0.tim
java    7273 orientdb  mem    REG               0,39    12352931  169592 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2pz.cfs
java    7273 orientdb  mem    REG               0,39    15983352  170799 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2m2.cfs
java    7273 orientdb  mem    REG               0,39     1321696  171798 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_2l_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39      646727  171282 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_1r_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39     1639763  171281 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_1r_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39     1480494  170700 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_v_Lucene50_0.doc
java    7273 orientdb  mem    REG               0,39     4077199  168649 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2ls.cfs
java    7273 orientdb  mem    REG               0,39     6123198  173457 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20e.cfs
java    7273 orientdb  mem    REG               0,39      595977  170701 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_v_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39      999005  174360 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2s7.cfs
java    7273 orientdb  mem    REG               0,39    13610790  150306 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1z0.cfs
java    7273 orientdb  mem    REG               0,39      543609  171799 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_2l_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39     3250311  168415 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2pp.cfs
java    7273 orientdb  mem    REG               0,39     5863665  149261 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1yr.cfs
java    7273 orientdb  mem    REG               0,39     7076068  171400 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1aj.cfs
java    7273 orientdb  mem    REG               0,39      299881  174457 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_3f.cfs
java    7273 orientdb  mem    REG               0,39    16424415  172828 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1at.nvd
java    7273 orientdb  mem    REG               0,39     3363741  178736 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2li.cfs
java    7273 orientdb  mem    REG               0,39     6024767  174717 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_20o_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39     7133397   77052 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_3h_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39          59  171802 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_2l.nvd
java    7273 orientdb  mem    REG               0,39          59  171285 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_1r.nvd
java    7273 orientdb  mem    REG               0,39          59  170704 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_v.nvd
java    7273 orientdb  mem    REG               0,39        2381  175034 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2se.cfs
java    7273 orientdb  mem    REG               0,39        6571  175015 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sd.cfs
java    7273 orientdb  mem    REG               0,39       42132  174882 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sc.cfs
java    7273 orientdb  mem    REG               0,39       42017  174669 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sa.cfs
java    7273 orientdb  mem    REG               0,39       42008  174575 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2s9.cfs
java    7273 orientdb  mem    REG               0,39       50780  174480 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2s8.cfs
java    7273 orientdb  mem    REG               0,39      946937  173394 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2rx.cfs
java    7273 orientdb  mem    REG               0,39      761104  172406 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2rn.cfs
java    7273 orientdb  mem    REG               0,39     1733967  172585 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_1at_Lucene50_0.pos
java    7273 orientdb  mem    REG               0,39       38183  174762 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sb.cfs
java    7273 orientdb  217wW  REG               0,39           0   75027 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock
java    7273 orientdb 1131w   REG               0,39           0  175053 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sf.fdx
java    7273 orientdb 1132w   REG               0,39           0  175054 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2sf.fdt
java    7273 orientdb 1133wW  REG               0,39           0  170196 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock
java    7273 orientdb 1134w   REG               0,39           0  175057 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_3g.fdx
java    7273 orientdb 1135w   REG               0,39           0  175058 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/_3g.fdt
  • some 2 minutes later: the same command shows no open files at all from the "luceneIndexes" directory

  • while loading another file, this was the output for roughly 20 seconds:

$ lsof -p $PID | grep luceneIndexes
java    7273 orientdb  mem    REG               0,39    23373562  170329 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2hc.cfs
java    7273 orientdb  mem    REG               0,39    25273410  142570 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_25y.cfs
...
java    7273 orientdb  mem    REG               0,39          59  190708 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_et.nvd
java    7273 orientdb  mem    REG               0,39          59  189524 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_r3.nvd
java    7273 orientdb  924wW  REG               0,39           0   75027 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock
java    7273 orientdb 1133wW  REG               0,39           0  174724 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/write.lock
java    7273 orientdb 1136wW  REG               0,39           0  174728 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/write.lock
java    7273 orientdb 1139wW  REG               0,39           0  174734 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock
java    7273 orientdb 1142wW  REG               0,39           0  174742 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/write.lock
  • over the next 2 seconds some locks are gone, but 2 remain in place:
$ lsof -p $PID | grep luceneIndexes
java    7273 orientdb  mem    REG               0,39    23373562  170329 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_2hc.cfs
java    7273 orientdb  mem    REG               0,39    25273410  142570 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/_25y.cfs
...
java    7273 orientdb  mem    REG               0,39          59  190708 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_et.nvd
java    7273 orientdb  mem    REG               0,39          59  189524 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_r3.nvd
java    7273 orientdb  mem       REG               0,39          59  155333 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/_25o.nvd
java    7273 orientdb  mem       REG               0,39          59  184670 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/_sq.nvd
java    7273 orientdb 1139wW     REG               0,39           0  174734 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock
java    7273 orientdb 1142wW     REG               0,39           0  174742 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/write.lock
  • then the next second, no more open files from the "luceneIndexes" directory

By running the "lsof" command in a loop, once per second, I could observe the following phenomenon with the file locks.
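The loop was essentially this sketch (the PID is the server process from the listings above; the separator line is only there for readability):

$ PID=7273
$ while true; do date; lsof -p $PID | grep luceneIndexes; echo '----'; sleep 1; done

The per-second observations were as follows: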

  • initially, while records are being loaded by a single (not parallel) program, the open regular files are:
java    7273 orientdb  136wW  REG               0,39           0   75027 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock
java    7273 orientdb  149wW  REG               0,39           0  174724 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/write.lock
java    7273 orientdb  410w   REG               0,39           0  172920 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_40x.fdx
java    7273 orientdb  577wW  REG               0,39           0  148481 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_7/write.lock
java    7273 orientdb  603w   REG               0,39        8191  172921 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/_40x.fdt
java    7273 orientdb  608w   REG               0,39           0  172922 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/_40t.fdx
java    7273 orientdb  615w   REG               0,39       16384  172923 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/_40t.fdt
java    7273 orientdb  616w   REG               0,39           0  172924 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/_1z6.fdx
java    7273 orientdb  620w   REG               0,39           0  172925 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/_1z6.fdt
java    7273 orientdb  640wW  REG               0,39           0  170196 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock
java    7273 orientdb  651w   REG               0,39           0  172926 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/_40q.fdx
java    7273 orientdb  861w   REG               0,39           0  172927 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/_40q.fdt
java    7273 orientdb  975wW  REG               0,39           0  174728 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/write.lock
java    7273 orientdb 1007wW  REG               0,39           0  174734 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock
java    7273 orientdb 1077wW  REG               0,39           0  174742 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/write.lock
  • then a little cleanup seems to happen, because this is all that remains in the output:
java    7273 orientdb  136wW  REG               0,39           0   75027 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock
java    7273 orientdb  149wW  REG               0,39           0  174724 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/write.lock
java    7273 orientdb  577wW  REG               0,39           0  148481 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_7/write.lock
java    7273 orientdb  640wW  REG               0,39           0  170196 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock
java    7273 orientdb  975wW  REG               0,39           0  174728 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/write.lock
java    7273 orientdb 1007wW  REG               0,39           0  174734 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock
java    7273 orientdb 1077wW  REG               0,39           0  174742 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/write.lock
  • then the next second, suddenly the file "/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock" is closed, as seen from the output of "lsof":
java    7273 orientdb  136wW  REG               0,39           0   75027 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_1/write.lock
java    7273 orientdb  149wW  REG               0,39           0  174724 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_2/write.lock
java    7273 orientdb  577wW  REG               0,39           0  148481 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_7/write.lock
java    7273 orientdb  975wW  REG               0,39           0  174728 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_3/write.lock
java    7273 orientdb 1007wW  REG               0,39           0  174734 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_4/write.lock
java    7273 orientdb 1077wW  REG               0,39           0  174742 /opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_6/write.lock
  • the exact same situation remains for some 12 seconds, then the file-loading program fails with the usual error:
com.orientechnologies.orient.core.exception.OStorageException: Cannot put key value entry in index
        DB name="cdrarch"
--> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
--> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2017-07-27T11:28:18Z)

Notice that the error concerns the lock on the one file that was no longer open:
/opt/orientdb/databases/cdrarch/luceneIndexes/idx_cdr_af_20170709_5/write.lock

Could there be some (failing) race condition happening here?

@rdelangh
Author

Gentlemen, any update or feedback on this, please? The database is not usable in this state.
-> Is there any other information I can provide so that you can debug/diagnose this better?

@robfrank
Contributor

Hi, just finished a session of tests on 2.2.15-SNAPSHOT and 3.0.0-SNAPSHOT with this scenario:

  • two loaders that concurrently store tweets with 2 lucene indexes
  • loaders are started and stopped to allow the server to flush and close idle indexes
  • the server itself is sometimes stopped and restarted as well

I think the scenario is similar to yours. Can you confirm?
I can't reproduce your problems.
About the Lucene lock file: it is essentially a placeholder. For the sake of explanation, I report here how it works, as explained by a Lucene developer:

The existence of the write.lock file has nothing to do with actual locking. The lock file is just a placeholder file (0 bytes) on which the lock during indexing is applied (the actual locking is done on this file using fcntl). When indexing finishes, the lock is removed from this file, but the file stays alive. This was different in Lucene 3, where the lock file was also removed. But this caused problems because of non-atomic file operations. Lock factory only checks the "lock on the file", not "the existence of the file as a lock".

(https://lists.gt.net/lucene/java-user/261725)

@rdelangh
Author

Hi @robfrank, many thanks for following up on this!
Your scenario is not exactly the same, and I think the relevant difference might be that I have 6 Lucene indexes, each on a different property. Some of these properties do not exist in the majority of the records, like the property on which the index "idx_cdr_af_20170709_5" is defined.
So, although there is a continuous stream of records being inserted, this Lucene index "idx_cdr_af_20170709_5" is apparently not updated for a while, and therefore gets closed (as my traces of open files above indicate).
But then, after a while, some record arrives that does have a value for this particular property, so that Lucene index needs to be reopened to add an entry for the new record. I guess it is exactly at that point that the server loses track of which files are already open and locked, and which are not. Of course, you are the only one who can cross-check this assumption against the behaviour of the code.
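A minimal sketch of the data shape I mean (the class and property names below are only illustrative; the index names match the real ones):

CREATE CLASS cdr_af_20170709
CREATE PROPERTY cdr_af_20170709.field_1 STRING
CREATE PROPERTY cdr_af_20170709.field_5 STRING
CREATE INDEX idx_cdr_af_20170709_1 ON cdr_af_20170709 (field_1) FULLTEXT ENGINE LUCENE
CREATE INDEX idx_cdr_af_20170709_5 ON cdr_af_20170709 (field_5) FULLTEXT ENGINE LUCENE

Almost every inserted record carries field_1, so idx_cdr_af_20170709_1 is constantly updated:

INSERT INTO cdr_af_20170709 SET field_1 = 'common value'

Only occasionally does a record carry field_5, so idx_cdr_af_20170709_5 sits idle long enough to be closed and must be reopened when such a record finally arrives:

INSERT INTO cdr_af_20170709 SET field_1 = 'common value', field_5 = 'rare value'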

@rdelangh
Author

Is the idle time after which a Lucene index file gets closed configurable?
If so, I could raise that period substantially to avoid the file being closed too soon. As long as new records are being inserted into the database, the index files can remain open for several hours; once many hours have passed without new records for the old indexes, those index files can indeed be closed.

So, instead of the current idle time of (I guess) a few tens of seconds, I would like to raise it to a few hours (or a full day).

@rdelangh
Author

Hello @robfrank, is the idle time after which a Lucene index file gets closed configurable? If so, can you please tell me in which file, or via which command-line arguments, it can be adjusted?

@robfrank
Contributor

Yes, it is documented:
http://orientdb.com/docs/last/Full-Text-Index.html#index-lifecycle

BTW, I worked on a test trying to reproduce your use case, but indexes are opened and closed without problems. I defined 2 indexes on a class and inserted records with both properties as well as records with only one.
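For example, the lifecycle interval is passed as index metadata at creation time, along these lines (class, property and interval value here are purely illustrative; the exact keys are described in the linked page):

CREATE INDEX City.name_ft ON City (name) FULLTEXT ENGINE LUCENE METADATA { "closeAfterInterval": 600000 }

Here "closeAfterInterval" is the number of milliseconds of inactivity after which the index is closed.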

@rdelangh
Author

Thanks again, @robfrank.
I read in this doc that the intervals are defined at index-creation time;
is it possible to adjust that metadata after the index has already been created?
That way I could simply modify the metadata (small effort) and stop/restart the database, instead of having to drop and recreate each and every Lucene index (huge effort) with modified creation metadata...

@robfrank
Contributor

I'm afraid not. These values are set at creation time; our internals don't allow these settings to be changed dynamically at the moment.

@rdelangh
Author

OK then, bad news.
This is something you might put on your (likely long) list of enhancements, i.e. adding more dynamic reconfiguration options to the database engine:
given the (big) database sizes this software is written for, dropping and recreating indexes just to reconfigure such timing parameters is a very sad waste of (much!) time.

@robfrank
Contributor

Yes, you're right. I will check with the rest of the team whether and how we can enable this scenario.

@rdelangh
Author

rdelangh commented Aug 2, 2017

Finally the database is alive again, after dropping some tens of Lucene indexes and recreating them with the metadata

"closeAfterInterval": 86400000

With that, the server starts up and not too many indexes are opened (lazy behaviour), but they are also no longer closed too quickly only to be reopened again very soon after. I assume that this too-fast close/open sequence is the cause of some internal synchronisation problem.
-> to be (re)checked by you, please
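For reference, each index was recreated roughly along these lines (the class and property names below are placeholders for the real ones; the index name and the interval are real):

DROP INDEX idx_cdr_af_20170709_5
CREATE INDEX idx_cdr_af_20170709_5 ON cdr_af_20170709 (field_5) FULLTEXT ENGINE LUCENE METADATA { "closeAfterInterval": 86400000 }

86400000 ms corresponds to 24 hours, so an index is now only closed after a full day without updates.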

@lvca lvca modified the milestones: 2.2.x (next hotfix), 2.2.24 Aug 2, 2017
robfrank added a commit that referenced this issue Aug 4, 2017