is 'old-heap' growing by memory-leak? #7228

Closed
rdelangh opened this issue Mar 13, 2017 · 56 comments

@rdelangh

rdelangh commented Mar 13, 2017

OrientDB Version: 2.2.18

Java Version: 1.8.0_92

OS: Ubuntu-16.04

Expected behavior

When 'not-so-heavy' data loading is active on our standalone server, I notice that the Old-generation size grows with each Full Garbage Collection that runs:

% jstat -gc -h10 17079 10s
...
14848.0 15872.0 14356.0  0.0   4162560.0 3986224.8 5716480.0  5633496.7  37592.0 36577.7 4096.0 3739.3   3666  182.964   5      5.399  188.363
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
14848.0 15360.0  0.0   8140.3 4163072.0 238585.4 5716480.0  5642829.5  37592.0 36577.7 4096.0 3739.3   3669  183.129   5      5.399  188.528
14848.0 41984.0  0.0    0.0   4110336.0 3888676.8 5973504.0  2838417.4  37592.0 36577.7 4096.0 3739.3   3670  183.316   6      7.345  190.661
37376.0 38400.0  0.0   10415.9 4111872.0 1463619.6 5973504.0  2845162.4  37592.0 36577.7 4096.0 3739.3   3673  183.452   6      7.345  190.797
...

Our server is started with heap size parameters "java -server -Xms4G -Xmx12G ..."

However, when we start an additional, heavy data-loading program, the old-generation capacity is very quickly exhausted and the server eventually dies/halts with an Out-Of-Memory error.

-> is this growing Old-generation Capacity ("OC" column) figure normal?
-> a Full GC is not reducing the Old-generation Usage ("OU" column), so the "OC" increases further at each FGC

@andrii0lomakin
Member

@rdelangh, as usual, we need a heap dump, ideally taken after each full GC. I do not remember whether it is possible for you to send us a heap dump; if not, could you send us screenshots of the heap dumps?

@rdelangh
Author

hi @Laa,

"to take it after each full GC" -> how can I tell when an FGC is happening? You are suggesting that I then manually trigger a heap dump.

And indeed, the heap dump will be around the Xmx size of 12GB, so it is impossible to send it to you.

Can you suggest which tool I should use to open such a big heap dump, which command options allow opening such a big (>= 12GB) file, and which screenshots you want?

@rdelangh
Author

Is this useful?

jmap -heap <pid>
Attaching to process ID 17079, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.92-b14

using thread-local object allocation.
Parallel GC with 33 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 12884901888 (12288.0MB)
   NewSize                  = 1431306240 (1365.0MB)
   MaxNewSize               = 4294967296 (4096.0MB)
   OldSize                  = 2863661056 (2731.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 3441950720 (3282.5MB)
   used     = 2670891296 (2547.160430908203MB)
   free     = 771059424 (735.3395690917969MB)
   77.59818525234435% used
From Space:
   capacity = 5767168 (5.5MB)
   used     = 5406736 (5.1562652587890625MB)
   free     = 360432 (0.3437347412109375MB)
   93.7502774325284% used
To Space:
   capacity = 18350080 (17.5MB)
   used     = 0 (0.0MB)
   free     = 18350080 (17.5MB)
   0.0% used
PS Old Generation
   capacity = 6530531328 (6228.0MB)
   used     = 4365460296 (4163.227363586426MB)
   free     = 2165071032 (2064.772636413574MB)
   66.84693904281352% used

13855 interned Strings occupying 1276864 bytes.

@rdelangh
Author

attached
jmap-histo.txt

is the output of "jmap -histo"; hope it helps

@cmassi

cmassi commented Mar 13, 2017

To see how the objects are growing, take a look at a sequence of ASCII outputs from jmap -histo:live (this forces a full GC and then collects the histogram): https://docs.oracle.com/javase/8/docs/technotes/tools/unix/jmap.html
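
For example, a minimal sketch of how such a sequence could be captured (using the server PID 17079 seen in the jstat output above; adjust the PID and interval to your situation):

# take a live histogram (forces a full GC) every 10 minutes; stop with Ctrl-C
$ while true; do jmap -histo:live 17079 > histo_$(date +%Y%m%d_%H%M%S).txt; sleep 600; done

Comparing the top entries of consecutive files shows which classes keep accumulating instances.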

@rdelangh
Author

rdelangh commented Mar 14, 2017

@cmassi , @robfrank , @Laa
I had to increase the Xmx parameter and added the "-XX:+UseCompressedOops", and apparently got a stable 'high-water mark' for the old-generation heap capacity:

 java -server -Xms4G -Xmx12G -XX:+UseCompressedOops -Djna.nosys=true ...

From repeated 'jstat' outputs I saw a continuous increase of the OC at each full GC, until finally a decrease after full GC 11:

...

11776.0 28672.0 11769.2  0.0   1577472.0 201597.2 7113728.0  7099833.0  37592.0 36241.2 4096.0 3699.7   5104  499.054  10     22.870  521.924
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
24064.0 22016.0  0.0   21759.1 1553920.0 669679.9 7113728.0  7101665.7  37592.0 36241.9 4096.0 3699.7   5105  499.090  10     22.870  521.960
17408.0 28160.0  0.0    0.0   1530880.0 860371.9 6729216.0  3238519.1  37592.0 36197.3 4096.0 3693.1   5106  499.158  11     24.892  524.050
7168.0 28160.0 6982.6  0.0   1485312.0 49609.7  6729216.0  3241704.6  37592.0 36197.3 4096.0 3693.1   5108  499.230  11     24.892  524.121
...

same after fullgc 12:

...

16384.0 24064.0 16376.4  0.0   2226176.0 151432.1 6729216.0  6717509.2  37592.0 36205.2 4096.0 3693.1   6128  539.809  11     24.892  564.701
14336.0 22528.0 14112.0  0.0   2141184.0 1034595.4 6729216.0  6721386.1  37592.0 36205.2 4096.0 3693.1   6130  539.867  11     24.892  564.759
18432.0 24064.0 18430.5  0.0   2060800.0 1666781.4 6564352.0  3185646.1  37592.0 36205.2 4096.0 3693.1   6132  539.928  12     26.108  566.036
...

Will keep an eye on it...

@cmassi

cmassi commented Mar 15, 2017

We can see from the latest jstat output that with Full GC number 12 the Old usage has decreased, together with the Old+Eden capacity, but Eden was nearly full.
There is no leak so far.
If you are interested in reducing the time spent in Full GC, which also resizes the generations (26 secs spent in full GC), you should try to work out the best size for each generation under heavy load (check min/average/max occupation and capacity), give each generation some extra headroom to handle peaks (and so avoid OOM), and remove the adaptive size policy.

e.g (for ParallelGC):
-Xms12g -Xmn3g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=512m -XX:-UseAdaptiveSizePolicy .

To study gc behaviour you need a detailed log

e.g: -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCID -XX:+PrintGCDetails -Xloggc:$ORIENTDB_HOME/log/gc_%p_%t.log .

Optionally you can study promotion of objects in survivors too, with -XX:+PrintTenuringDistribution
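
Putting the pieces together, a server command line combining the suggested sizing and logging flags might look roughly like this (purely an illustration; the actual OrientDB options and class path from the server startup script are omitted, and the sizes must be tuned to your load):

java -server -Xms12g -Xmn3g -Xmx12g \
     -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=512m \
     -XX:-UseAdaptiveSizePolicy \
     -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCID -XX:+PrintGCDetails \
     -XX:+PrintTenuringDistribution \
     -Xloggc:$ORIENTDB_HOME/log/gc_%p_%t.log \
     ... com.orientechnologies.orient.server.OServerMain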

@rdelangh
Author

@cmassi
thank you for the feedback; I just do not know how to set the space/capacity for each generation. The only relevant parameters that I can find in the ODB documentation are Xms and Xmx.

Now, with Xms=4g and Xmx=12g, I hit an OOM again under heavy load and the ODB engine has become inaccessible:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  76    623.980 2262.917
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  76    623.980 2262.917
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  77    666.147 2305.084
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  77    666.147 2305.084
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  77    666.147 2305.084
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  77    666.147 2305.084
740864.0 1397760.0 149862.5  0.0   1398784.0 1398784.0 8388608.0  8388201.0  39808.0 38609.6 4224.0 3864.6  38452 1638.937  78    708.223 2347.160
...

Shall I set those parameters that you mentioned? I.e.

-Xms12g -Xmn3g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=512m -XX:-UseAdaptiveSizePolicy

@cmassi

cmassi commented Mar 23, 2017

The documentation of the XX flags for the Hotspot JVM is available on the Oracle web site, not in the OrientDB documentation.

The OOMs are not always the same; they can be thrown because one of the spaces (old/meta/class) fills up, or even due to a native allocation issue.

We cannot see timestamps in the jstat output, but it seems each Full GC takes about 9 secs (708/78), and probably most of that time has been spent growing the heap from the initial 4g to the current 11gb (S0Cx2 + EC + OC = 11269120), yet it seems it cannot grow any further after the last 3 FGCs (please note that S1C looks wrong, because the two survivors always have identical capacity, since objects are copied from one to the other and back).

Meta/Class spaces are quite small, so you can use small sizes, e.g.: -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m

According to this jstat output it seems the OOM was due to the Old generation being nearly full, so it cannot accept the promotion of new objects from survivor+eden.

You can try to further increase Old, raising the max to 14gb.

The New generation is set up with -XX:NewSize / -XX:MaxNewSize, or simply with -Xmn. It is made up of the 2 survivors and eden, so nearly 3gb.

The OrientDB java process also uses direct memory and makes other native allocations, so check the process size to avoid swapping of the java process (the process size should stay below the available RAM).
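
For example, the resident size of the java process can be compared against the installed memory with standard tools (the PID is a placeholder):

$ ps -o pid,vsz,rss -p <pid>   # RSS = resident size of the process, in KB
$ free -g                      # physical memory still available, in GB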



Try -Xms14g -Xmn3g -Xmx14g -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy



Pay attention to OOMs caused by wrong generation sizes when removing the adaptive size policy (-XX:-UseAdaptiveSizePolicy), so remove -XX:+HeapDumpOnOutOfMemoryError until you have set up the correct sizes for each space (generation).

Anyway, to really study the GC behaviour you have to understand how objects move (promotion, cleanup) by reading the detailed log.

@rdelangh
Author

rdelangh commented Mar 26, 2017

Did that; the server is running quite fine under a reasonable (though nowhere near the maximum required) load. Going to stress the server a bit more now.

@rdelangh
Author

The server is running with the following command-line args:

UID        PID  PPID  C STIME TTY          TIME CMD
orientdb  9879     1 20 Mar27 ?        03:37:12 java -server -Xms14G -Xmn3G -Xmx18G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops -Djna.nosys=true -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true -Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=10000 -Dquery.parallelResultQueueSize=200000 -Dstorage.wal.maxSize=51200 -Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/www -Dorientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000 -cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-server-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* com.orientechnologies.orient.server.OServerMain

"jstat" outputs are:

...
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 46000.3  0.0   2359296.0 1036639.3 11534336.0 5757716.7  38144.0 36970.7 4096.0 3790.2   1976   74.375   0      0.000   74.375
393216.0 393216.0 36832.4  0.0   2359296.0 1828153.9 11534336.0 5777078.5  38144.0 36970.7 4096.0 3790.2   1986   74.730   0      0.000   74.730
393216.0 393216.0  0.0   27095.9 2359296.0 386313.8 11534336.0 5798588.9  38144.0 36971.9 4096.0 3790.2   1997   75.111   0      0.000   75.111
393216.0 393216.0  0.0   36556.8 2359296.0 1516297.3 11534336.0 5810722.1  38144.0 36971.9 4096.0 3790.2   2007   75.480   0      0.000   75.480
393216.0 393216.0 37393.6  0.0   2359296.0 153809.2 11534336.0 5823612.0  38144.0 36972.5 4096.0 3790.2   2018   75.975   0      0.000   75.975
393216.0 393216.0 32680.8  0.0   2359296.0 1641432.9 11534336.0 5835027.5  38144.0 36972.5 4096.0 3790.2   2028   76.414   0      0.000   76.414
393216.0 393216.0  0.0   57806.1 2359296.0 1100656.5 11534336.0 5847019.4  38144.0 36972.5 4096.0 3790.2   2039   76.832   0      0.000   76.832
393216.0 393216.0 41307.9  0.0   2359296.0 1949798.2 11534336.0 5871701.6  38144.0 36972.5 4096.0 3790.2   2050   77.263   0      0.000   77.263
393216.0 393216.0  0.0   33043.3 2359296.0 1892463.5 11534336.0 5892800.7  38144.0 36972.5 4096.0 3790.2   2061   77.632   0      0.000   77.632
393216.0 393216.0 36911.4  0.0   2359296.0 1830153.5 11534336.0 5903316.5  38144.0 36972.5 4096.0 3790.2   2072   78.035   0      0.000   78.035
...

The server itself still has quite some unused RAM (installed RAM=128GB):

$ vmstat 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0    788 40803252 105920 13794260    0    0    56   105    0    0  3  0 96  1  0
 2  0    788 40822440 105920 13794760    0    0  1391 15428 9154 10217  8  1 90  1  0
 4  1    788 40785956 105920 13801652    0    0  1648 18393 12993 9999  9  1 89  1  0
...

However, sometimes I cannot start a new ODB client program "console.sh":

$ /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/bin/console.sh "CONNECT ... ; desc myclass; quit"
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread

-> which of the JVM settings should be further adjusted? Is this connection from the client program trying to consume additional 'new' generation space?

@andrii0lomakin
Member

Hi @rdelangh, what is the output of ulimit -a?

@andrii0lomakin
Member

Could you also execute ps -o nlwp <pid> and provide the output?

@rdelangh
Author

@Laa

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514444
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514444
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ ps -o nlwp 9879
NLWP
 342

@andrii0lomakin
Member

@rdelangh did you take these values right after you saw the problem with the last OOM?

@rdelangh
Author

Trying again to launch an extra JVM program ("jstat") fails with:

orientdb@orient2:~$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 9879 10s
Error occurred during initialization of VM
Cannot create VM thread. Out of system resources.

At that very moment, only 2 Java processes are active on this machine:

  • the ODB server
  • one "console.sh" client program
$ ps -aef|grep java
orientdb  9879     1 41 Mar27 ?        08:27:38 java -server -Xms14G -Xmn3G -Xmx18G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops -Djna.nosys=true -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true -Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=10000 -Dquery.parallelResultQueueSize=200000 -Dstorage.wal.maxSize=51200 -Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/www -Dorientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000 -cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-server-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* com.orientechnologies.orient.server.OServerMain
qcontrol 38625     1  0 12:56 ?        00:00:07 java -client -Xmx1024m -XX:MaxDirectMemorySize=512g -Djava.util.logging.config.file="/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/orientdb-client-log.properties" -Djava.awt.headless=true -Dclient.ssl.enabled=false -Dfile.encoding=utf-8 -Dorientdb.build.number=UNKNOWN@rf31f2e10de758cbdef4cee27451b4065b94d9ce2; 2017-03-04 00:50:53+0000 -cp /opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/orientdb-tools-2.2.18-SNAPSHOT.jar:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/lib/*:/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/plugins/* -Djavax.net.ssl.keyStore=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/cert/orientdb-console.ks -Djavax.net.ssl.keyStorePassword=password -Djavax.net.ssl.trustStore=/opt/orientdb/orientdb-community-2.2.18-SNAPSHOT/config/cert/orientdb-console.ts -Djavax.net.ssl.trustStorePassword=password com.orientechnologies.orient.graph.console.OGremlinConsole

Only when I "kill" that Java client program can the "jstat" command be launched again:

orientdb@orient2:~$ kill 38625
orientdb@orient2:~$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 9879 10s
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 27548.1  0.0   2359296.0 2073754.0 11534336.0 5994826.1  39040.0 37640.2 4224.0 3794.0   7322  279.558   1      1.629  281.187
393216.0 393216.0  0.0   23546.8 2359296.0 1878604.6 11534336.0 5995314.1  39040.0 37640.2 4224.0 3794.0   7323  279.605   1      1.629  281.233
...

but from the output of "jstat", it looks like there is plenty of heap space available, no?

@andrii0lomakin
Member

Hi @rdelangh, the OS limits the number of threads, not just the number of processes. Also, threads are created outside of heap memory. The way your jstat instance failed to start suggests to me that you hit this limit. Let's check the total number of threads currently running on your system. Could you execute ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
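
For reference, a sketch grouping the related checks (9879 is the server PID from the earlier ps output; adjust to your system):

$ ps -o nlwp -p 9879                                               # threads in the OrientDB server process
$ ps -eo nlwp | tail -n +2 | awk '{ s += $1 } END { print s }'     # total threads on the host
$ cat /proc/sys/kernel/threads-max                                 # system-wide thread limit
$ ulimit -u                                                        # per-user limit on processes/threads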

@rdelangh
Author

@Laa
At this moment, that number of threads is more or less stable at around 370.
Meanwhile, the system-wide max number of threads is about 1M:

$ cat /proc/sys/kernel/threads-max
1028888

@andrii0lomakin
Member

andrii0lomakin commented Mar 30, 2017

Hm, @rdelangh very strange.
Let's try another way.

Could you print the output of ulimit -u (the process limit for the current user) and of ps -eLf | grep 'my user' | wc -l (the total number of lightweight processes for that user)? If you have enough memory, then allocating a new heap should not be a problem, so we need to look more closely at the OS limits.

Could you also do it once you have got the error mentioned above?

@rdelangh
Author

hi @Laa
The number of processes will surely not be an issue, because we do not launch any client programs on this machine; all client programs run remotely on other servers.
So the number of processes remains static, independent of which clients are connecting, and VERY small:

$ ps -aef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Jan31 ?        00:00:11 /sbin/init
root        16     1  0 Jan31 ?        00:00:14 /lib/systemd/systemd-journald
syslog      55     1  0 Jan31 ?        00:00:05 /usr/sbin/rsyslogd -n
root        61     1  0 Jan31 ?        00:00:06 /usr/sbin/cron -f
root       129     1  0 Jan31 ?        00:00:03 /usr/sbin/sshd -D
root       131     1  0 Jan31 pts/1    00:00:00 /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt220
root       132     1  0 Jan31 pts/0    00:00:00 /sbin/agetty --noclear --keep-baud pts/0 115200 38400 9600 vt220
root       133     1  0 Jan31 pts/1    00:00:00 /sbin/agetty --noclear --keep-baud pts/1 115200 38400 9600 vt220
root       134     1  0 Jan31 pts/3    00:00:00 /sbin/agetty --noclear --keep-baud pts/3 115200 38400 9600 vt220
root       135     1  0 Jan31 pts/2    00:00:00 /sbin/agetty --noclear --keep-baud pts/2 115200 38400 9600 vt220
root      3327   129  0 10:56 ?        00:00:00 sshd: myuser  [priv]
myuser  3492  3327  0 10:56 ?        00:00:00 sshd: myuser @pts/4
myuser  3493  3492  0 10:56 pts/4    00:00:00 -bash
myuser  3573  3493  0 10:57 pts/4    00:00:00 ps -aef
orientdb  9879     1 99 Mar27 ?        8-19:49:58 java -server -Xms14G -Xmn3G -Xmx18G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpa

The only additional programs that I sometimes run on this server host are a "jstat" command or a "console.sh" (which in turn launches "java").

@andrii0lomakin
Member

andrii0lomakin commented Mar 30, 2017

@rdelangh, as I have already written before: it is called a number of processes, but in Linux terminology these are lightweight processes, aka threads. If you look at the command I provided, it uses the L flag, which according to the ps documentation adds an NLWP column (an acronym for "number of lightweight processes"). For example, here is a snippet of the result of running this command against a currently running benchmark.

andrey    3937     1  3937  0   31 Mar29 ?        00:00:00 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3939  0   31 Mar29 ?        00:00:00 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3941  0   31 Mar29 ?        00:00:00 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3942  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3944  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3945  0   31 Mar29 ?        00:00:00 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3946  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3947  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3948  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3949  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3950  0   31 Mar29 ?        00:00:02 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D
andrey    3937     1  3951  0   31 Mar29 ?        00:00:00 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx1024m -Dfile.encoding=UTF-8 -Duser.country=US -D

As you can see, there are 31 threads in that process; each line corresponds to a single thread of the process. Could you execute the script I provided once you hit this issue again?

@rdelangh
Author

hi @Laa, sorry for my ignorance; I know very well what LWPs are, but I was reading your comment too quickly.
So here are the outputs you asked for:

$ ulimit -u
514444
$ ps -eLf | grep 'my user' | wc -l 
390

@rdelangh
Author

hello,
any update on this?
I keep running into OOM situations despite increasing the heap size params; this really looks like a memory leak:

  • currently already with the following settings: -Xms14G -Xmn3G -Xmx20G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m -XX:-UseAdaptiveSizePolicy

  • outputs from "jstat" at this moment:

$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gcutil 41883
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  3.81   0.00  84.00  99.15  96.80  90.82 238565 11812.863    38  111.985 11924.848

$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 41883 10s
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 10329.7  0.0   2359296.0 2359296.0 12031488.0 11928461.2 39808.0 38532.6 4224.0 3836.2 238563 11812.730  38    111.985 11924.715
393216.0 393216.0  0.0   10553.7 2359296.0 104145.5 12031488.0 11928929.9 39808.0 38532.6 4224.0 3836.2 238564 11812.804  38    111.985 11924.789
393216.0 393216.0  0.0   10553.7 2359296.0 229703.2 12031488.0 11928929.9 39808.0 38532.6 4224.0 3836.2 238564 11812.804  38    111.985 11924.789
393216.0 393216.0  0.0   10553.7 2359296.0 329738.2 12031488.0 11928929.9 39808.0 38532.6 4224.0 3836.2 238564 11812.804  38    111.985 11924.789
...
  • output from "ps" shows that the resident size of the process (= RAM usage) is now 65GB !
$ ps -o pid,vsz,rss,args 41883
  PID    VSZ   RSS COMMAND
41883 435728852 68394480 java -server -Xms14G -Xmn3G -Xmx20G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=64m -XX:CompressedClassSpaceSize=12m ...
  • symptom: the dbase can run fine for a week or so, until garbage collections no longer seem to reclaim any space; the effect on the client programs is that their REST calls do not complete but time out. Another effect is that modifications via the console, such as creating an index, do not even start.

The only possible solution is a "shutdown", then a restart.

I am now raising the heap setting further to "-Xmx24G".

-> are you sure that this is not a case of memory leakage??

@cmassi

cmassi commented Apr 10, 2017

Your JVM has executed 38 full GCs and old is nearly full again. The last 3 lines show no increase of the gc (238564) or full-gc (38) counters, so nothing is cleaned (OU is stable because these are not full GCs), but Eden usage increases by about 100m each time.

If you enable the detailed GC log and you see something like "java.lang.OutOfMemoryError: GC overhead limit exceeded", then it could be that you do not have the hw resources to handle this huge heap cleanup (need more cpu), or there is nothing left to clean (so you need more heap, or there is a leak). From jstat I can only see the average time, which seems fine (111.985 secs / 38 = 2.94 secs).

You can check, and try to reduce, the promotion of objects to a small value by tuning the New generation with the best size of Survivors and Eden (add PrintGCDetails and PrintTenuringDistribution). The OU column shows promoted objects (I see only 500k promoted in the first line, but you should watch the jstat log over a long time).

If your JVM does not survive with 24 gb, it could be a leak, and you can collect histograms (jmap -histo:live) to see which objects continuously increase (note: the live collection of the histogram requires an additional full GC).
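
As a sketch (using PID 41883 from the ps output above), two live histograms taken some time apart can be compared to spot the growing classes:

$ jmap -histo:live 41883 > histo_before.txt    # forces a full GC, then dumps the class histogram
$ sleep 3600                                   # let the load run for a while
$ jmap -histo:live 41883 > histo_after.txt
$ head -n 30 histo_before.txt histo_after.txt  # classes whose instance counts/bytes only ever grow are leak suspects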

@rdelangh
Author

rdelangh commented Apr 12, 2017

@taburet : yes, extensively using Lucene indexes

@robfrank :

  • number of classes: currently in two databases: "cdrarch" (67 classes) and "mobile" (78 classes); indeed one class per week of data, in each of two databases
  • number of records: different classes for different purposes, containing a few K records till some 100's M records
  • number of indexes per class: in "cdrarch" dbase, 8 indexes/class, of which 6 Lucene; in "mobile" dbase, 2 indexes/class of which 1 Lucene
  • size of dbase: "cdrarch" 1.4 TB, "mobile" 680 GB
  • sizes of Lucene indexes: 160 GB in "cdrarch" dbase, 36 GB in "mobile" dbase
  • data structure is very basic: nearly 'flat', schema-full classes; mainly heavy and continuous loading of new records; counting of these records is done 4 times per week and the results are stored in (much) smaller classes; ad-hoc queries are launched by users for very specific records (mainly found via the Lucene indexes)

@robfrank
Contributor

AFAIU you create 6 different single-field Lucene indexes for each class.
So on cdrarch you have more than 360 indexes.
My suggestion is to create a single multi-field index per class; this will reduce the number of indexes to 60, and so the number of threads. BUT Lucene eats RAM, as does every other index.
By default we use the memory-mapped MMapDirectory, but you can configure the index to work with NIOFSDirectory:

http://orientdb.com/docs/last/Full-Text-Index.html#lucene-writer-fine-tuning-expert

I don't know if the latter would increase performance/decrease RAM occupation, but for sure a single index per class should be better. The only drawback is that you may need to rewrite some queries, just to qualify the field name(s) if you are using different analyzers; see the sketch below.
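
As a sketch of that suggestion (the class and field names below are purely illustrative, not taken from the actual schema), a single multi-field Lucene index and a query qualifying the fields would look something like:

CREATE INDEX Cdr.search ON Cdr (caller, callee, cell) FULLTEXT ENGINE LUCENE

SELECT FROM Cdr WHERE [caller, callee, cell] LUCENE "0032*"

instead of three separate single-field FULLTEXT ENGINE LUCENE indexes on the same class.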

@rdelangh
Author

hi @robfrank, thanks for your valuable feedback!
Issue #7220 does not reassure me that such multi-field Lucene indexes really work. Once that issue is resolved, I will know if and how they can be used, and I will certainly run some tests with them.

@rdelangh
Author

Moreover, apart from reducing memory usage by using a few multi-field Lucene indexes instead of several single-field ones, how will this prevent the Old heap from growing, as we see, until an OOM?

@rdelangh
Author

What I noticed in the past days is the following:

  • running only data-loading programs that do plain "INSERT" seems to keep the memory usage of ODB stable; these programs load approx 20M records per day into one database "cdrarch", plus some 22M records per day into the other database "mobile"; the pace of insertion for such records is then approx 1-3 msec per record
  • when I activate another program that does an "INSERT" of records plus a lookup/INSERT/UPDATE of another record, all within a so-called script, into the dbase "mobile" at a pace of some 100M records per day, it becomes way too much for ODB to absorb; the pace of absorbing records becomes 8-15 msec per record
  • at any time, ODB seems unable to make use of the multiple CPU cores in the machine: only a single core gets used, of course at approx 100% of that core's capacity, leading me to assume that the bottleneck is currently CPU-bound

I could try to raise the memory limits of the ODB server a little higher, from the current max of 24GB to, say, 30GB, but maybe this will cause very long garbage collections of a few/many seconds.

On top of all this, it is still not possible to spread the load over more than a single hardware server. See issue #6666 (open for months!)

@rdelangh
Author

I have this situation again where the server is freezing, although:
as I wrote above, this ODB server is dealing with two big databases, one named "cdrarch" and another named "mobile". Access to "cdrarch" is still fine; programs can stuff their records into this database at a normal pace.
A data-loading program into dbase "mobile" was also busy, although much slower, until I launched a query against that dbase in parallel. That query clearly slowed down the data-loading program to "mobile" very much, to the point where I interrupted the query. This did not allow the data-loading program to pick up its pace; even worse, it ground to a halt altogether! So I stopped that program as well.
Meanwhile the data-loading to dbase "cdrarch" was still running fine.

I have no clue what the ODB server engine is doing now, regarding this dbase "mobile":

  • the machine is not out of RAM (still 20GB of the installed 128GB is unused)
  • I can capture "jstat" info, which looks quite normal:
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0  0.0   57083.8 2359296.0 754055.5 11537408.0 4442292.2  38400.0 37226.9 4096.0 3742.7  40217 1727.860   6     14.630 1742.490
393216.0 393216.0 35303.4  0.0   2359296.0 1424549.3 11537408.0 4448764.3  38400.0 37226.9 4096.0 3742.7  40218 1727.903   6     14.630 1742.533
...
  • I can start "console.sh", and connect to dbase "mobile", and run a small query
  • still, none of the data-loading programs into "mobile" can send a single record to the ODB server; they can connect and send a batch of INSERT statements, but they hang waiting for a reply from ODB

-> what else can I do to find out why the ODB server is stuck inserting records into one database ("mobile"), while it happily allows inserting records into the other dbase?

@rdelangh
Author

And after some 20 minutes of total freeze, the data-loading program into dbase "mobile" finally started inserting its first records. Its speed is abnormally slow, however.
On the ODB server, system resources are fine (plenty of unused RAM, CPU fairly idle, disks not over-used).

@andrii0lomakin
Member

@rdelangh so right now, at this very moment, the loading speed is very slow?

@cmassi

cmassi commented Apr 24, 2017

Looking at your last jstat -gc report, you have had 6 full GCs with an FGCT of 14.630 secs.
Has this number increased, and is the time now up to 34 secs or more (maybe you have had more full GCs in the meantime)?

It would be better to confirm with a detailed gc log.

You can use the following to enable them without a restart:
$ jinfo -flag +PrintGCDateStamps pid_jvm;
$ jinfo -flag +PrintGCTimeStamps pid_jvm;
$ jinfo -flag +PrintGCID pid_jvm;
$ jinfo -flag +PrintGC pid_jvm;
$ jinfo -flag +PrintGCDetails pid_jvm;
And to disable use jinfo -flag -XX:-PrintGCDetails pid_jvm and so on

@rdelangh
Author

rdelangh commented Apr 24, 2017

Meanwhile I did not see any reasonably fast solution other than migrating this database "mobile" onto another, separate hardware server named "orient1".
At times, the data-loading still freezes while another program runs a query on the same data.

The "jstat" output at this moment:

$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 1057 10s
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0 143922.5  0.0   2359296.0 45404.0  11538432.0 6056022.2  37632.0 36230.9 4096.0 3653.8 232242 7553.991  13     52.214 7606.205
393216.0 393216.0 143922.5  0.0   2359296.0 1318014.1 11538432.0 6056022.2  37632.0 36230.9 4096.0 3653.8 232242 7553.991  13     52.214 7606.205
393216.0 393216.0 143922.5  0.0   2359296.0 1772914.5 11538432.0 6056022.2  37632.0 36230.9 4096.0 3653.8 232242 7553.991  13     52.214 7606.205
...

The commands you gave only partially run successfully:

$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag +PrintGCDateStamps 1057
$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag +PrintGCTimeStamps 1057
$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag -XX:+PrintGCID 1057
Exception in thread "main" com.sun.tools.attach.AttachOperationFailedException: flag 'XX:+PrintGCID' cannot be changed

        at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:229)
        at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:261)
        at sun.tools.attach.HotSpotVirtualMachine.setFlag(HotSpotVirtualMachine.java:234)
        at sun.tools.jinfo.JInfo.flag(JInfo.java:144)
        at sun.tools.jinfo.JInfo.main(JInfo.java:81)
$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag -XX:+PrintGCDetails 1057
Exception in thread "main" com.sun.tools.attach.AttachOperationFailedException: flag 'XX:+PrintGCDetails' cannot be changed

        at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:229)
        at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:261)
        at sun.tools.attach.HotSpotVirtualMachine.setFlag(HotSpotVirtualMachine.java:234)
        at sun.tools.jinfo.JInfo.flag(JInfo.java:144)
        at sun.tools.jinfo.JInfo.main(JInfo.java:81)

Where would I get these additional GC logging details? The output of "jstat" still shows the same columns:

$ /usr/lib/jvm/jdk1.8.0_92/bin/jstat -gc -h10 1057 10s
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
393216.0 393216.0  0.0   393193.3 2359296.0 1785915.9 11538432.0 6881507.2  37632.0 36231.5 4096.0 3653.8 232299 7558.596  13     52.214 7610.810
393216.0 393216.0 332867.2  0.0   2359296.0 1069934.1 11538432.0 7173228.7  37632.0 36231.5 4096.0 3653.8 232308 7559.410  13     52.214 7611.623
393216.0 393216.0 163408.9  0.0   2359296.0 1004431.1 11538432.0 7360261.5  37632.0 36231.5 4096.0 3653.8 232316 7559.936  13     52.214 7612.149
393216.0 393216.0 63640.0  0.0   2359296.0 1011200.5 11538432.0 7478940.9  37632.0 36231.5 4096.0 3653.8 232324 7560.389  13     52.214 7612.602
...

@cmassi

cmassi commented Apr 24, 2017

Sorry, my mistake, due to cut&paste of the complete options recommended before.
The syntax is without -XX when used with jinfo; only the name of the option should be used, and only for manageable options. jstat is a tool that only provides statistics from the start, always in the same format requested with the option (-gc).
The detailed GC log goes to stdout if it is not redirected at startup with -Xloggc:$ORIENTDB_HOME/log/gc_%p_%t.log.

Anyway, I see you have had FGC=13 and FGCT=52.214, which does not confirm that the 20-min pause was a sequence of slow full GCs, because you have had 7 more full GCs of about 5 sec each.
But you have a lot of minor GC time, (7612 - 1742) / 60 = 97 min, which with the default ParallelGC is stop-the-world too.

So it would be better to understand the detailed GC log, and add -XX:+PrintGCApplicationStoppedTime to get a simple row in the log to grep, showing for how long your threads have been stopped (e.g. Total time for which application threads were stopped: 0.0117340 seconds).

This is to rule GC out of your investigation.
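
Assuming the flag ends up in the startup options and the GC log is written to gc_*.log files, the stopped-time lines can be pulled out and sorted with something like:

$ grep "Total time for which application threads were stopped" gc_*.log | awk '{ print $(NF-1) }' | sort -n | tail   # longest safepoint pauses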

@cmassi

cmassi commented Apr 24, 2017

There is also the opposite option: -XX:+PrintGCApplicationConcurrentTime

See https://blogs.oracle.com/jonthecollector/the-unspoken-application-times

@rdelangh
Author

hi @cmassi, thanks a lot for your time! Highly appreciated, because we are really struggling with ODB under high load.
Both of my attempts to set this extra logging option are failing:

$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag -XX:+PrintGCApplicationStoppedTime 1057
Exception in thread "main" com.sun.tools.attach.AttachOperationFailedException: flag 'XX:+PrintGCApplicationStoppedTime' cannot be changed

        at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:229)
        at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:261)
        at sun.tools.attach.HotSpotVirtualMachine.setFlag(HotSpotVirtualMachine.java:234)
        at sun.tools.jinfo.JInfo.flag(JInfo.java:144)
        at sun.tools.jinfo.JInfo.main(JInfo.java:81)

$ /usr/lib/jvm/jdk1.8.0_92/bin/jinfo -flag +PrintGCApplicationStoppedTime 1057
Exception in thread "main" com.sun.tools.attach.AttachOperationFailedException: flag 'PrintGCApplicationStoppedTime' cannot be changed

        at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:229)
        at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:261)
        at sun.tools.attach.HotSpotVirtualMachine.setFlag(HotSpotVirtualMachine.java:234)
        at sun.tools.jinfo.JInfo.flag(JInfo.java:140)
        at sun.tools.jinfo.JInfo.main(JInfo.java:81)

It seems that these options can only be set on the command line when starting up the ODB server.
-> please confirm that this information is valuable enough to justify recycling my server process.

@cmassi

cmassi commented Apr 24, 2017

I've also added to the list above the flag to activate from jinfo (it does not need to be set at startup): jinfo -flag +PrintGC pid_jvm

For the stopped time, it is not a manageable option.

To see which options the JVM considers manageable: java -XX:+PrintFlagsFinal -version | grep manageable

@rdelangh
Author

I have not restarted the server process yet (I need to wait for a timeslot later today), but meanwhile I get the following GC log messages in the output, every 5 secs or so:

...
 [Times: user=0.91 sys=0.00, real=0.04 secs]
 [Times: user=1.39 sys=0.01, real=0.07 secs]
 [Times: user=0.80 sys=0.00, real=0.03 secs]
 [Times: user=0.89 sys=0.01, real=0.04 secs]
...

@cmassi

cmassi commented Apr 26, 2017

Please remember to add: jinfo -flag +PrintGC pid_jvm
Young collections also stop all the application threads, but only for a short time (see also https://blogs.oracle.com/jonthecollector/our-collectors )
If you have a lot of CPUs, you can add -XX:+UseParallelOldGC
If there are other applications running, you can decrease the number of GC threads used with -XX:ParallelGCThreads=N (see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html)
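
As a sketch, using a value that leaves CPU headroom for the loader programs (jmap -heap earlier reported "Parallel GC with 33 thread(s)" on this box; the value 16 is only an example):

java -server ... -XX:+UseParallelOldGC -XX:ParallelGCThreads=16 ...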

@rdelangh
Author

  1. done the "jinfo"
  2. I now see in the output of the server process, almost every second, a new line like these:
[GC (Allocation Failure)  6240876K->3881448K(14286848K), 0.0616845 secs]
[GC (Allocation Failure)  6240744K->3887007K(14286848K), 0.0433432 secs]
[GC (Allocation Failure)  6246303K->3877727K(14286848K), 0.0496305 secs]
[GC (Allocation Failure)  6236444K->3882282K(14286848K), 0.0267033 secs]
[GC (Allocation Failure)  6241578K->3887636K(14286848K), 0.0271547 secs]
...

@cmassi

cmassi commented Apr 26, 2017

The numbers are total_allocation_before->total_allocation_after (total capacity).

With the help of jstat -gc for the current capacity of each generation (in your configuration the generation capacities can change dynamically), you can see that every second the new area (eden+survivors) is cleaned and something is promoted to old, so the allocation throughput is more or less 2.4g/s, yet all these objects are cleaned in up to 0.02 secs, which is a good job. Promoted objects are only cleaned by a full GC in old, when it is nearly full; to avoid promotion into old as much as possible, you should study it with PrintTenuringDistribution and set the survivor ratio accordingly.
With the detailed GC log set up, you are interested in the time spent in each full GC, to verify they are not too slow and to rule out GC activity and configured sizes as the cause of your hang issue (20 minutes).
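
A sketch of how those knobs could be added to the current startup line (the SurvivorRatio and MaxTenuringThreshold values are only placeholders, to be tuned from the PrintTenuringDistribution output):

java -server -Xms14G -Xmn3G -Xmx20G \
     -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=10 \
     -XX:+PrintTenuringDistribution \
     ...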

@robfrank
Contributor

May I close this issue? Did the "laziness" of the Lucene indexes improve the situation?

@rdelangh
Author

hello Frank,

the "laziness" of the Lucene indexes has had an absolutely positive impact, many thanks for that!

We still encounter hanging server processes when, for example, queries have been launched that try to access too many records. I guess that causes an OOM situation, but the server logfiles do not mention anything about the fact that they are stuck: no more client processes (console, or REST API) can connect, a clean "shutdown" fails to get a connection, and a gentle "kill" signal is not trapped...
