Recently we upgraded our JDK from 1.5 to 1.6 and of course we are using the latest greatest JDK from Oracle: 1.6.0 update 21. After that, some weird thing happened. One of our component which runs on Weblogic servers, crashes around twice a day, generating a 40G core dump file on the disk (which causes a disk usage problem as well!)

At the first thought, it looks like there is a memory leak in our code, but wait - the maximal possible Java heap size is defined as 4G (by -Xmx), plus all other stack, JVM itself, everything together should be something around 5-6G, why we have a core dump of 40G? This has to be from some native code!

The good thing is we have some very handful tools on Solaris - which IMHO is better than Linux; for example, dtrace, pstack, pmap. Linux has something similar, but not really as good.

Since we are debugging memory issue, let’s examine the memory first with pmap. A normal Java process should look like this:

$ pmap 3536 |more
3536:   /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dweblogic.Name=imdmapi -se
0000000000400000         52K r-x--  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000          4K rw---  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000     158048K rw---    [ heap ]
FFFFFD7ECBCD0000        152K rw--R    [ anon ]
FFFFFD7ECC2D6000        152K rw--R    [ stack tid=2729 ]
......
FFFFFD7EEA6F1000          8K rw---  /opt/middleware/jdk1.6.0_21/jre/lib/amd64/libnet.so
FFFFFD7EEA6F5000          8K r--s-  dev:32,0 ino:477791
FFFFFD7EEA6F8000         16K r--s-  dev:32,0 ino:8281559
FFFFFD7EEA7FB000         16K rw--R    [ stack tid=8 ]
FFFFFD7EEA800000    1048576K rw---    [ anon ]
FFFFFD7F2A800000    3145728K rw---    [ anon ]
FFFFFD7FEA800000     131072K rw---    [ anon ]
FFFFFD7FF2800000      88064K rw---    [ anon ]
....
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFC000         16K rw---    [ stack ]

The top is the heap of the native code. while the bottom is the stack, as well as the anonymous pages allocated by JVM as “Heap” for Java. I don’t really know why this is here since it seems to be mixed with those “stack” pages but by using truss (Linux: strace) we can see JVM actually opened /dev/zero then calls mmap() to map the “all zero” file to memory so the memory is also initialized to zero.

Now the core file:

core 'core.2' of 12264: /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dserver.name=iip -Xrs -Xms
0000000000400000         52K r-x--  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000          4K rw---  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000    2227600K rw---
0000000088381000    2097152K rw---
0000000108381000    4194304K rw---
0000000208381000    8388608K rw---
0000000408381000   16777216K rw---    [ heap ]
FFFFFD7EDE76C000         88K rw---    [ stack tid=166 ]
FFFFFD7EDE86D000         88K rw---    [ stack tid=165 ]
……
FFFFFD7EF2800000       2048K rw---
FFFFFD7EF2A00000       2048K rw---
FFFFFD7EF2C00000       2048K rw---
FFFFFD7EF2E00000       2048K rw---
FFFFFD7EFAA00000    2797568K rw---
FFFFFD7FA5600000    1398784K rw---
FFFFFD7FFAC01000          4K r----
FFFFFD7FFAC03000          8K r----

Look at the highlighted native heap, we can see the native code is using (16+8+4+2+2)=32G, plus the 4G memory for the java heap at the bottom, it is quite close to the 40G core size. That simply means, some native code is leaking memory.

A funny thing that can be observed is the chunk of the memory being allocated is having some pattern: 2,2,4,8,16..2^n? let’s see if it is really being allocated in this way by:

truss -t brk -p 1920

The truss result did show the memory is being allocated in that way. Note the brk() is used by system to allocate some memory on system heap, while JVM uses mmap() to allocate the java heap. This is definitely a native problem. But which one? is it a problem of weblogic native code like the native io muxer?

One approach is to use tools like valgrind, DMALLOC etc to trace the “native memory” leak, but that will be too hard. The good thing is we have dtrace, which can print a stack trace when some method is being executed - apparently, I want a stacktrace when brk() is being called!

#!/usr/sbin/dtrace -s
syscall::brk:entry
/pid == $1/
{
    stack();
    printf("====>allocating to: %x\n", arg0);
}

The script will print the stack, as well as the size of memory being requested. dtrace did a good job this time as well:

0  52693                        brk:entry
              libc.so.1`_brk_unlocked+0xa
              libc.so.1`sbrk+0x20
              libc.so.1`_morecore+0x12b
              libc.so.1`_malloc_unlocked+0x10d
              libc.so.1`malloc+0x2d
              libjvm.so`__1cCosGmalloc6FL_pv_+0x1e
              libjvm.so`__1cFChunk2n6FLL_pv_+0x49
              libjvm.so`__1cFArenaEgrow6ML_pv_+0x3a
              libjvm.so`__1cINodeHashEgrow6M_v_+0x87
              libjvm.so`__1cINodeHashQhash_find_insert6MpnENode__2_+0x140
              libjvm.so`__1cMPhaseIterGVNNtransform_old6MpnENode__2_+0x731
              libjvm.so`__1cMPhaseIterGVNIoptimize6M_v_+0xff
              libjvm.so`__1cHCompileIOptimize6M_v_+0x801
....

This apparently has nothing to do with weblogic, and must be the JVM itself. If you look carefully enough through the C++ mangled method name, you’ll see the code is doing NodeHash::grow(), and considering the pattern of 2,4,8,16…it is very likely to be that this NodeHash is rehashing it’s hashtable so doubling it’s table size each time!

The caller is from Compile::Optimize and it is running on the compiler thread. after a quick pstack of the compiler thread of the buggy JVM, we can see the thread is in a dead loop as shown above - keep growing its hashtable..

Based on the rule of “you won’t be the first one”, a quick google search reveals this is a JVM bug, and still open in JDK 1.6u21. It is triggered by some specific class - and we happened to have one! the fix? well, wait for Oracle to release a new JDK..(6u23 after one month).