Google Drive: Where are the files shared BY me?

I’m really surprised to see that google drive doesn’t have a central place to display “Shared by me”. It only has “Shared with me”. Well, a lot of times I want to see what files have been shared by me and I may want to revoke them so that they don’t get viewed by unwanted personnels. No, there is no way on the UI.

Lucky, Google provided a “scripting” functionality so you can write a piece of javascript to do this. I found a script from @labnol which supports 200 files only. And I modified it, for my own purpose, to support unlimited files. My script is here; open it your “google script” and run it. It will take a few minutes and send you an email.

Performance tuning with our server

It turns out no matter which company I join, I end up doing performance tuning work. Here’s some funny thing about what we encountered at Ariba.

Memory Issue

Our company is using its own ORM (Object-Rational-Mapping) technology and it seems to be a little out-dated. We were using a “framework” in our code to convert an object to another object (for the OData protocol, called OData-fication). It seems our framework is triggering a “reconstitution” of objects – which eagerly loads all the objects related to the current object being inspected. Considering the complex business logic of procurement system, usually it loads thousands of objects. This can cause a delay of a few seconds (since usually, if you get a list of objects, it will be multiplied). The solution was kind of simple, create a new Pojo and copy only the required fields of the object then pass the Pojo to the framework.

Apache/Tomcat Saturation Issue

I’m surprised to see we are still using an old version of apache since I assume most people would be using nginx etc as web server. We are running apache in *prefork* mode. These apache clusters do two things: terminate SSL, load balancing through mod_jk to tomcat. Each tomcat is configured with around 1000 max clients, while the tomcat at the backend are configured with max threads also around 1000. This sounds like we can handle at least 1000 users simultaneously per instance-pair. However, the real setting is we have 4 apache servers and 6 tomcat instances in our perf testing environment. The problem is, the mod_jk which talks to tomcat is configured to be “optimizing” the connection, means, the connection will be kept alive. Now, when a request comes to apache, it randomly choose on tomcat and send the request. With time goes, each apache process (since we are preforking) will have 6 mod_jk connections to each tomcat server. The funny thing is, each connection on the tomcat side will consume a thread but at one particular time, only 1 tomcat of these 6 connections will be working, other 5 will be just idling. That also means, 5/6 of the threads on the tomcat are doing nothing…so for a cluster of 6 tomcats, we can only serve 1000 users instead of 6000 or at least 4000 users (apache). It actually become worse since more tomcat get saturated, apache seems to be choking as well and the response time etc will be very very bad.

The solution was kind of stupid – don’t use keep alive for mod_jk. Although the mod_jk settings page says this will have strong performance impact, it is still much better than the ‘keep-alive’ behavior. We’ll seek for better options in the future :)

Cache synchronization

It’s a simple problem. Someone developed a cache by extending HashMap. He synchronized the get and put methods, and in each of the method, do two kinds of things: encryption/decryption and file IO. This is killing the app. 1/3 of the time, the threads are waiting to get a lock to access the cache. It is even worse than disabling it. and that was our solution.

Over Design

SOA is a good idea, however, in our case, it doesn’t make sense for us to make 3 http (over SSL) requests for each user’s call. It would be much simpler to just go to database and authenticate the user instead of make an HTTP call and go to database again, while the HTTP call is calling the same application itself!

Package drop and TCP retransmission

I’m surprised to see how much impact package dropping can bring to an app. We were having some small package drops in our production system only and it causes huge delay in the response time of the application. By huge I mean 15-60 seconds of delay. And when it comes to mobile, user will be keeping seeing a spinner (it is blue, we call it the death blue circle) quite often. The slowness was discovered by correlating the two requests/responses from one server to the other. (we reduced the HTTP call from 3 to 1) The response time on the backend server is something around 8-10ms, while at the frontend server, it shows the response comes back after a long delay (15-60s). We really did a tcpdump of production system while keep sending a request to the server using a simulator. It wasn’t too hard to figure out the TCP retransmission is the root cause of all these problems. I’m just surprised to see another problem due the age of the system. We should really give these kind of platform maintenance to PAAS/IAAS companies. PS. Wireshark is really the swissknife for network troubleshooting.

Reset a wrong push to gerrit

Someone did a very bad thing to our code system – merged a developing code to master branch. Luckily it is the weekend no one else is pulling the code. But to do a git hard reset on a remote gerrit is not as simple as I thought.

In fact, it is not complicated either – all we need to do is to delete the master branch and recreate it with the proper commit ID. From the “branch” tab of gerrit, we should be able to do that. If the master branch doesn’t have a checkbox, we need to edit the HEAD to be pointing to another branch, then you can delete your remote master branch. Of course, we need the proper SHA1 ID to recreate the master. I didn’t check Jenkins and I hope it will be fine.

I originally did some Google search and people were talking about forced push, history rewriting, push with ‘–delete’ flag etc. They are generic correct but I end up doing the ‘–delete’ command, which deletes the master branch — equivalent to the UI, and have to fix the ‘[remote rejected] master (branch is currently checked out)’ issue. I also granted myself push and force push permissions in gerrit access control.

Some ridiculous behavior of ios and android

There a few things that are really weird and undocumented which will cost you hours to figure out. Just to list a few:

  1. on android, 9 patch pictures are often used as background. However, if you set any padding to your view (e.g., <TextView padding=”4dp”>), then your content padding of the 9 patch will not be working.
  2. on iOS, for UICollectionView, if somehow your cell view size is greater than the property cell size, then none of your views will be shown. in fact, the cellForItemAtIndexPath: is not called at all even if you returned a non-zero value for the numberOfItemsInSection: method.
  3. on iOS, if you want to align the text in a label on the top, there no easy way. however, you can achieve it by adding extra “\n” to the text, like “\n\n\n” if your label is set to 3 lines, then your text will be top aligned.

10 seconds delay in custom NSURLProtocol

My new company is using something called cordova – not a very cool one but serves the purpose. The native experience of an app cannot be matched by cordova, not to mention the heavy javascript framework we are using.

One big challenge we are facing is how to debug the client-server issues. I come up with an idea of intercepting all the HTTP(s) calls by using a custom NSURLProtocol implementation such that I can print all the http request and response into log file. This sounds like a simple stupid idea but it works generally good until when we start to send POST request to server in the UIWebView. The time required to post anything to server is almost exact 10 seconds. Also, the request will be considered failing from the javascript perspective. This breaks most of our features and one of my engineer worked till 2AM then figure out my change is causing the problem and I have to back out my change.

I started to investigate the problem. It just doesn’t make sense. I added a lot of logs to the custom NSURLProtocol class, which immediately I noticed a lot of strange behaviors like the canInitWithRequest method is called many more times than others, and the pattern is also different when the POST is sent – it actually kind of create the request twice. I carefully examine the object IDs of all the request objects etc but didn’t find a clue. The startLoading method, however, is called only once. After that, there is exact 10 seconds delay before the connection:didReceiveResponse delegate method is called.

The bad part about iOS is that it is pretty hard to debug this kind of problem. You don’t have the source code of the system so you have no idea what is going on. I cannot even have a nice stack with a breakpoint. On the other hand, it seems the iOS networking library is fully asynchronous so you don’t see a socket read anywhere in any thread (unlike android, which usually one of the thread will be on socket operation for apache http client or URLConnection).

More carefully google search indicates that the multiple call to canInitWithRequest should be fine. It is the way the URL loading system works and it is trying to determine what is the best way to handle the request. It must be something wrong with the special POST method we were using. I decided to create a simple application with a simple cordova application to reproduce and isolate the problem. The funny thing is, with the simplified application, the problem disappears.

I only get a clue during my dream last night. Our javascript must be using some strange code, like, synchronous HTTP request. Tested an HTTP request with sync status set – yes, problem reproduced. The request cost exactly 10.134 seconds while the 134ms is the real request time.

I have not much idea about how apple handle this 10 seconds timeout. I only remembered some strange comment from one of apple’s own sample code of “CustomHTTPProtocol“. The comment said:


// Calculate our effective run loop modes.  In some circumstances (yes I'm looking at
// you UIWebView!) we can be called from a non-standard thread which then runs a
// non-standard run loop mode waiting for the request to finish.  We detect this
// non-standard mode and add it to the list of run loop modes we use when scheduling
// our callbacks.  Exciting huh?
//
// For debugging purposes the non-standard mode is "WebCoreSynchronousLoaderRunLoopMode"
// but it's better not to hard-code that here.

I put the exact same code as this example did and it fixes the problem. Basically, one should not run the NSURLConnection on the WebCoreSynchronousLoaderRunLoopMode. I may need dig further into this NSRunLoop and scheduleInRunLoop:forMode method, but for now, I’m satisfied. The big surprise is this problem has not been fixed by apple till today’s iOS 7.1.

Objective-C super simple notes

  1. pointers are generally objects. NSArray* array –> then array is an object. call object’s method by square brackets: [array size]
  2. id type means “Object” in java. id a = @”string” means Object a = “string” in java.
  3. nil is similar to null in java. however, it is safe to call [nil someMethod], it will not crash, but it will return null. may result in strange behavior.
  4. NSString can be initialized using @ character, and it is in Unicode. NSString* s = @"Some string"
  5. NSArray can be initialized using @ too: NSArray* str_arr = @[@”a”, @”b”, @”c”];
  6. selector is something like the “function pointers” in C. to refer a selector, use @selector(method_name:)
  7. #import is a better #include.
  8. for methods, + means “static” and – is “instance”.
  9. Method declaration:
    + (UIColor*) colorWithRed: (CGFloat) red green: (CGFloat) green
    blue: (CGFloat) blue alpha: (CGFloat) alpha

    colorWithRed:green:blue:alpha: is the method name.
    The (CGFloat) is param type, and the subsequenct red, green, blue and alpha are actually params.

  10. No method overloading is supported. (methods cannot have some name)
  11. CFTypeRefs can sometime be used to convert NS data type (objc) to CF data type (C). e.g., NSString* x; (CFStringRef)x will give CFString (CF= Core Foundation, NS=next step)
  12. Closure. (called Block) in objc. in the form of ^(id obj1, BOOL f){ … };
    NSArray* arr2 = [arr sortedArrayUsingComparator: ^(id obj1, id obj2) {}

    a __block qualifier will make a local variable accessible with a block. (like final for anonymous inner class in java)

  13. class structure: usually inherit from NSObject – even for allocation.
  14. class structure: @interface to declare the interface of the class (method prototype in C) and @implementation to write the real implementation.
    example:
  15. @interface MyClass : NSObject
        - (NSString*) sayGoodnightGracie;
    @end
    @implementation MyClass {
            // instance variable declarations go here
    }
    - (NSString*) sayGoodnightGracie {
            return @"Good night, Gracie!";
    } 
    @end
  16. Alloc-init: alloc is to alloc memory, init is something similar to a constructor. it must be called throughout the construction chain (designated initializer(default constructor?)). Sometime the new method in NSObject is equal to [[Object alloc] init]
  17. self = this; super = super

When the disk is slow…

In one of our product, we use tokyo tyrant, a variant of memcached to store data. The reason includes two folds, one is it supports persistence so that if the server dies, it can be recovered; second is it supports a master-master replication, instead of the master-slave replication of memcached. In general, it is a pretty amazon component, the performance is also superb in most cases – thousands TPS easily.

By the way, on the java side, we have the Staged Event Driven Architecture which separates the read and write to the tokyo tyrant database. By nature there will be queues attached to threads that containing the data to be processed.

Then all the sudden from one day, the operation team reports that during the busy hours, the system is dropping events due to the queue is full. We started from investigating the threads that writing to the database since that queue was the one that get full. A quick shell script does its work:

./jstack 1100 | grep -A2 submit | grep 'at '|sort|uniq -c|sort -nr
 39 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  8 at java.lang.Object.wait(Native Method)
...

The output indicates that all the threads are almost waiting for the response from the tokyo tyrant server (ttserver). We can do the same to the ttserver using the pstack command:

pstack 1135

By examining the output of this command, it shows all the ttserver threads are writing to disk using pwrite. The disk IO is saturated.

Well it’s  true since the write to the disk is quite “random”. The ttserver’s delete command will generates “empty holes” on disk and the write may choose to fill in those holes if the data size fits. This ends up with a totally random write and although the underlying disk is using RAID10, it is not fast enough. This can also be figured out using the iosnoop command on Solaris. Simply put, the random write to the traditional spinning disk is extremely slow (comparing to a sequential write).

So what could the solution? One approach is to use SSD disk. But the operation team doesn’t want to do that – one is because it is not “certified” to be used on carrier level systems, second there isn’t any slots available to install the disks and they don’t want to make drastic change to the disk configuration.

Another attempt was made to change the source code of ttserver so that we can optimize the write to the disk. Not a trivial work either.

A third thought is, can we just make a ram disk to store the database file? after all the data size is only 10G+, and that server is having 64G of memory (well, the traditional Solaris system with a lot of CPU and memories). But creating a ramdisk on Solaris is not trivial either.

Finally we come up with a solution using tmpfs. It is available on most Unix systems out of the box and it is basically a memory file system. Better off, it can swap out pages to disk when the memory is low. (of course, under that situation we’ll face the random write problem again, probably. never tested). We save the “redo log” of ttserver to the RAID10 disk which is a sequential write, and the database itself on the tmpfs which just created in memory:

mount -o tmpfs /tt_data

As long as the machine doesn’t crash (which is very rare), the file will be kept, and the performance is even better than before. When the machine also crashes, we can re-generate the database file using the redo log as well as our daily backup. Of course, the size of ram disk must be carefully monitored so it won’t be full.

Conclusion: RAM is the disk. Disk is the tape. period.

Export private key from java keystore (JKS)

I was working with one of my servers to add SSL support (PS, There is free SSL certificate being issued by CAs – and on the other hand, how dangerous the world is!). I generated the CSR using the java keytool with a JKS format. However, when I need to export the JKS file to a DER format for nginx, the keytool doesn’t have a way to export that. I have to develop something by myself and here’s the jython script.


import java
k=java.security.KeyStore.getInstance('jks')
k.load(java.io.FileInputStream('imdm.jks'),'weblogic')
p=k.getKey('imdm_key1','weblogic')
print ''.join([chr(0xff&x) for x in p.encoded]).encode('base64')

In case of 40G JVM core dump

Recently we upgraded our JDK from 1.5 to 1.6 and of course we are using the latest greatest JDK from Oracle: 1.6.0 update 21. After that, some weird thing happened. One of our component which runs on Weblogic servers, crashes around twice a day, generating a 40G core dump file on the disk (which causes a disk usage problem as well!)

At the first thought, it looks like there is a memory leak in our code, but wait – the maximal possible Java heap size is defined as 4G (by -Xmx), plus all other stack, JVM itself, everything together should be something around 5-6G, why we have a core dump of 40G? This has to be from some native code!

The good thing is we have some very handful tools on Solaris – which IMHO is better than Linux; for example, dtrace, pstack, pmap. Linux has something similar, but not really as good.

Since we are debugging memory issue, let’s examine the memory first with pmap. A normal Java process should look like this:

$ pmap 3536 |more
3536:   /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dweblogic.Name=imdmapi -se
0000000000400000         52K r-x--  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000          4K rw---  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000     158048K rw---    [ heap ]
FFFFFD7ECBCD0000        152K rw--R    [ anon ]
FFFFFD7ECC2D6000        152K rw--R    [ stack tid=2729 ]
......
FFFFFD7EEA6F1000          8K rw---  /opt/middleware/jdk1.6.0_21/jre/lib/amd64/libnet.so
FFFFFD7EEA6F5000          8K r--s-  dev:32,0 ino:477791
FFFFFD7EEA6F8000         16K r--s-  dev:32,0 ino:8281559
FFFFFD7EEA7FB000         16K rw--R    [ stack tid=8 ]
FFFFFD7EEA800000    1048576K rw---    [ anon ]
FFFFFD7F2A800000    3145728K rw---    [ anon ]
FFFFFD7FEA800000     131072K rw---    [ anon ]
FFFFFD7FF2800000      88064K rw---    [ anon ]
....
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFC000         16K rw---    [ stack ]

The top is the heap of the native code. while the bottom is the stack, as well as the anonymous pages allocated by JVM as “Heap” for Java. I don’t really know why this is here since it seems to be mixed with those “stack” pages but by using truss (Linux: strace) we can see JVM actually opened /dev/zero then calls mmap() to map the “all zero” file to memory so the memory is also initialized to zero.

Now the core file:

core 'core.2' of 12264: /opt/middleware/jdk1.6.0_21/bin/amd64/java -Dserver.name=iip -Xrs -Xms
0000000000400000         52K r-x--  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041C000          4K rw---  /opt/middleware/jdk1.6.0_21/bin/amd64/java
000000000041D000    2227600K rw---
0000000088381000    2097152K rw---
0000000108381000    4194304K rw---
0000000208381000    8388608K rw---
0000000408381000   16777216K rw---    [ heap ]
FFFFFD7EDE76C000         88K rw---    [ stack tid=166 ]
FFFFFD7EDE86D000         88K rw---    [ stack tid=165 ]
……
FFFFFD7EF2800000       2048K rw---
FFFFFD7EF2A00000       2048K rw---
FFFFFD7EF2C00000       2048K rw---
FFFFFD7EF2E00000       2048K rw---
FFFFFD7EFAA00000    2797568K rw---
FFFFFD7FA5600000    1398784K rw---
FFFFFD7FFAC01000          4K r----
FFFFFD7FFAC03000          8K r----

Look at the highlighted native heap, we can see the native code is using (16+8+4+2+2)=32G, plus the 4G memory for the java heap at the bottom, it is quite close to the 40G core size. That simply means, some native code is leaking memory.

A funny thing that can be observed is the chunk of the memory being allocated is having some pattern: 2,2,4,8,16..2^n? let’s see if it is really being allocated in this way by:

<tt>truss -t brk -p 1920</tt>

The truss result did show the memory is being allocated in that way. Note the brk() is used by system to allocate some memory on system heap, while JVM uses mmap() to allocate the java heap. This is definitely a native problem. But which one? is it a problem of weblogic native code like the native io muxer?

One approach is to use tools like valgrind, DMALLOC etc to trace the “native memory” leak, but that will be too hard. The good thing is we have dtrace, which can print a stack trace when some method is being executed – apparently, I want a stacktrace when brk() is being called!

#!/usr/sbin/dtrace -s
syscall::brk:entry
/pid == $1/
{
    stack();
    printf("====>allocating to: %x\n", arg0);
}

The script will print the stack, as well as the size of memory being requested. dtrace did a good job this time as well:

0  52693                        brk:entry
              libc.so.1`_brk_unlocked+0xa
              libc.so.1`sbrk+0x20
              libc.so.1`_morecore+0x12b
              libc.so.1`_malloc_unlocked+0x10d
              libc.so.1`malloc+0x2d
              libjvm.so`__1cCosGmalloc6FL_pv_+0x1e
              libjvm.so`__1cFChunk2n6FLL_pv_+0x49
              libjvm.so`__1cFArenaEgrow6ML_pv_+0x3a
              libjvm.so`__1cINodeHashEgrow6M_v_+0x87
              libjvm.so`__1cINodeHashQhash_find_insert6MpnENode__2_+0x140
              libjvm.so`__1cMPhaseIterGVNNtransform_old6MpnENode__2_+0x731
              libjvm.so`__1cMPhaseIterGVNIoptimize6M_v_+0xff
              libjvm.so`__1cHCompileIOptimize6M_v_+0x801
....

This apparently has nothing to do with weblogic, and must be the JVM itself. If you look carefully enough through the C++ mangled method name, you’ll see the code is doing NodeHash::grow(), and considering the pattern of 2,4,8,16…it is very likely to be that this NodeHash is rehashing it’s hashtable so doubling it’s table size each time!

The caller is from Compile::Optimize and it is running on the compiler thread. after a quick pstack of the compiler thread of the buggy JVM, we can see the thread is in a dead loop as shown above – keep growing its hashtable..

Based on the rule of “you won’t be the first one”, a quick google search reveals this is a JVM bug, and still open in JDK 1.6u21. It is triggered by some specific class – and we happened to have one! the fix? well, wait for Oracle to release a new JDK..(6u23 after one month).