Recently we upgraded our JDK from 1.5 to 1.6, and of course we went with the latest and greatest JDK from Oracle: 1.6.0 update 21. After that, something weird started happening. One of our components, which runs on WebLogic servers, crashes around twice a day, generating a 40G core dump file on disk (which causes a disk usage problem as well!)
At first it looks like a memory leak in our code, but wait – the maximum possible Java heap size is capped at 4G (by -Xmx). Add the thread stacks, the JVM itself and everything else, and the whole process should still come to only around 5-6G. So why do we have a 40G core dump? This has to come from some native code!
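As a quick sanity check (just a sketch, using the core file name from this incident and assuming the JVM arguments were captured in it), the configured heap ceiling can be compared against what actually landed on disk:

# pargs can read the saved command line out of a core file; grep for the heap setting
pargs core.2 | grep Xmx
# and compare with the size of the core itself
ls -lh core.2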
The good thing is that we have some very handy tools on Solaris – which IMHO is better than Linux in this respect; for example dtrace, pstack and pmap. Linux has similar tools, but not really as good.
Since we are debugging a memory issue, let's examine the memory first with pmap. A normal Java process looks like this:
$ pmap 3536 |more
3536:   /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dweblogic.Name=imdmapi -se
0000000000400000      52K r-x--  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000       4K rw---  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000  158048K rw---    [ heap ]
FFFFFD7ECBCD0000     152K rw--R    [ anon ]
FFFFFD7ECC2D6000     152K rw--R    [ stack tid=2729 ]
......
FFFFFD7EEA6F1000       8K rw---  /opt/middleware/jdk1.6.0_21/jre/lib/amd64/libnet.so
FFFFFD7EEA6F5000       8K r--s-  dev:32,0 ino:477791
FFFFFD7EEA6F8000      16K r--s-  dev:32,0 ino:8281559
FFFFFD7EEA7FB000      16K rw--R    [ stack tid=8 ]
FFFFFD7EEA800000 1048576K rw---    [ anon ]
FFFFFD7F2A800000 3145728K rw---    [ anon ]
FFFFFD7FEA800000  131072K rw---    [ anon ]
FFFFFD7FF2800000   88064K rw---    [ anon ]
....
FFFFFD7FFF3FD000       8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFC000      16K rw---    [ stack ]
Near the top is the native heap, while at the bottom are the thread stacks, as well as the large anonymous pages the JVM allocates as the Java heap. I don't really know why the Java heap ends up there, mixed in with the "stack" pages, but by using truss (strace on Linux) we can see that the JVM actually opens /dev/zero and then calls mmap() to map that "all zero" file into memory, so the memory starts out zero-filled as well.
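For the curious, something roughly like this is how that can be watched on the live JVM (the pid is the one from the pmap example above; truss needs sufficient privileges on the target process):

# trace only the open() and mmap() calls of the running JVM
truss -t open,mmap -p 3536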
Now the core file:
core 'core.2' of 12264: /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dserver.name=iip -Xrs -Xms
0000000000400000 52K r-x-- /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000 4K rw--- /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000 2227600K rw---
0000000088381000 2097152K rw---
0000000108381000 4194304K rw---
0000000208381000 8388608K rw---
0000000408381000 16777216K rw--- [ heap ]
FFFFFD7EDE76C000 88K rw--- [ stack tid=166 ]
FFFFFD7EDE86D000 88K rw--- [ stack tid=165 ]
……
FFFFFD7EF2800000 2048K rw---
FFFFFD7EF2A00000 2048K rw---
FFFFFD7EF2C00000 2048K rw---
FFFFFD7EF2E00000 2048K rw---
FFFFFD7EFAA00000 2797568K rw---
FFFFFD7FA5600000 1398784K rw---
FFFFFD7FFAC01000 4K r----
FFFFFD7FFAC03000 8K r----
Looking at the native heap segments near the top, we can see the native code is using (16+8+4+2+2) = 32G. Add the roughly 4G of Java heap further down and we are quite close to the 40G core size. That simply means some native code is leaking memory.
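If you do not feel like adding the segments up by hand, a rough cross-check is to let awk sum them (a sketch; it assumes the pmap column layout shown above, where the second column is the size in kilobytes):

# sum every mapped segment pmap reports for the core (skipping the "total" line) and print GB
pmap core.2 | awk '$1 != "total" && $2 ~ /K$/ { sub(/K$/, "", $2); sum += $2 } END { printf "%.1f GB\n", sum / 1024 / 1024 }'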
A funny thing to observe is that the chunks being allocated follow a pattern: 2, 2, 4, 8, 16... 2^n? Let's see whether the memory is really being allocated that way:
truss -t brk -p 1920
The truss output did show the memory being allocated in exactly that pattern. Note that brk() is used to grow the native (process) heap, while the JVM uses mmap() to allocate the Java heap, so this is definitely a native-code problem. But whose code? Is it WebLogic's native code, like the native I/O muxer?
One approach would be to use tools like valgrind, dmalloc etc. to trace the "native memory" leak, but that would be too hard. The good thing is that we have dtrace, which can print a stack trace whenever a given function is executed – and obviously, I want a stack trace whenever brk() is called!
#!/usr/sbin/dtrace -s

syscall::brk:entry
/pid == $1/
{
    ustack();
    printf("====>allocating to: %x\n", arg0);
}
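Saved under a name of your choosing (brk_stack.d below is made up), the script takes the pid of the target JVM as its argument, as in the truss example above:

# make the script executable and point it at the leaking JVM
chmod +x brk_stack.d
./brk_stack.d 1920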
The script prints the user-land stack, as well as the address the native heap is being extended to, every time brk() is called. dtrace did a good job this time as well:
  0  52693                      brk:entry
              libc.so.1`_brk_unlocked+0xa
              libc.so.1`sbrk+0x20
              libc.so.1`_morecore+0x12b
              libc.so.1`_malloc_unlocked+0x10d
              libc.so.1`malloc+0x2d
              libjvm.so`__1cCosGmalloc6FL_pv_+0x1e
              libjvm.so`__1cFChunk2n6FLL_pv_+0x49
              libjvm.so`__1cFArenaEgrow6ML_pv_+0x3a
              libjvm.so`__1cINodeHashEgrow6M_v_+0x87
              libjvm.so`__1cINodeHashQhash_find_insert6MpnENode__2_+0x140
              libjvm.so`__1cMPhaseIterGVNNtransform_old6MpnENode__2_+0x731
              libjvm.so`__1cMPhaseIterGVNIoptimize6M_v_+0xff
              libjvm.so`__1cHCompileIOptimize6M_v_+0x801
              ....
This apparently has nothing to do with WebLogic; it must be the JVM itself. If you look carefully through the mangled C++ method names, you'll see the code is executing NodeHash::grow(), and given the pattern of 2, 4, 8, 16... it is very likely that this NodeHash is rehashing its hashtable, doubling its table size each time!
The caller is Compile::Optimize, running on the compiler thread. After a quick pstack of the compiler thread of the buggy JVM, we can see the thread is stuck in an endless loop with the stack shown above, growing its hashtable over and over.
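For reference, this is roughly how to do that (the lwp id below is hypothetical; the compiler thread is the one whose stack is full of Compile::Optimize / PhaseIterGVN frames):

# dump the stacks of all threads in the live JVM and look for the compiler thread
pstack 1920
# once its lwp id is known (12 here is a made-up number), dump just that thread
# repeatedly to confirm it never leaves the loop
pstack 1920/12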
Based on the rule of "you won't be the first one to hit it", a quick Google search reveals that this is a known JVM bug, still open in JDK 1.6u21. It is triggered by certain specific classes – and we happened to have one! The fix? Well, wait for Oracle to release a new JDK.. (6u23, about a month later).