It turns out no matter which company I join, I end up doing performance tuning work. Here’s some funny thing about what we encountered at Ariba.
Memory Issue
Our company is using its own ORM (Object-Rational-Mapping) technology and it seems to be a little out-dated. We were using a “framework” in our code to convert an object to another object (for the OData protocol, called OData-fication). It seems our framework is triggering a “reconstitution” of objects - which eagerly loads all the objects related to the current object being inspected. Considering the complex business logic of procurement system, usually it loads thousands of objects. This can cause a delay of a few seconds (since usually, if you get a list of objects, it will be multiplied). The solution was kind of simple, create a new Pojo and copy only the required fields of the object then pass the Pojo to the framework.
Apache/Tomcat Saturation Issue
I’m surprised to see we are still using an old version of apache since I assume most people would be using nginx etc as web server. We are running apache in *prefork* mode. These apache clusters do two things: terminate SSL, load balancing through mod_jk to tomcat. Each tomcat is configured with around 1000 max clients, while the tomcat at the backend are configured with max threads also around 1000. This sounds like we can handle at least 1000 users simultaneously per instance-pair. However, the real setting is we have 4 apache servers and 6 tomcat instances in our perf testing environment. The problem is, the mod_jk which talks to tomcat is configured to be “optimizing” the connection, means, the connection will be kept alive. Now, when a request comes to apache, it randomly choose on tomcat and send the request. With time goes, each apache process (since we are preforking) will have 6 mod_jk connections to each tomcat server. The funny thing is, each connection on the tomcat side will consume a thread but at one particular time, only 1 tomcat of these 6 connections will be working, other 5 will be just idling. That also means, 5/6 of the threads on the tomcat are doing nothing…so for a cluster of 6 tomcats, we can only serve 1000 users instead of 6000 or at least 4000 users (apache). It actually become worse since more tomcat get saturated, apache seems to be choking as well and the response time etc will be very very bad.
The solution was kind of stupid - don’t use keep alive for mod_jk. Although the mod_jk settings page says this will have strong performance impact, it is still much better than the ‘keep-alive’ behavior. We’ll seek for better options in the future :)
Cache synchronization
It’s a simple problem. Someone developed a cache by extending HashMap. He synchronized the get and put methods, and in each of the method, do two kinds of things: encryption/decryption and file IO. This is killing the app. 1/3 of the time, the threads are waiting to get a lock to access the cache. It is even worse than disabling it. and that was our solution.
Over Design
SOA is a good idea, however, in our case, it doesn’t make sense for us to make 3 http (over SSL) requests for each user’s call. It would be much simpler to just go to database and authenticate the user instead of make an HTTP call and go to database again, while the HTTP call is calling the same application itself!
Package drop and TCP retransmission
I’m surprised to see how much impact package dropping can bring to an app. We were having some small package drops in our production system only and it causes huge delay in the response time of the application. By huge I mean 15-60 seconds of delay. And when it comes to mobile, user will be keeping seeing a spinner (it is blue, we call it the death blue circle) quite often. The slowness was discovered by correlating the two requests/responses from one server to the other. (we reduced the HTTP call from 3 to 1) The response time on the backend server is something around 8-10ms, while at the frontend server, it shows the response comes back after a long delay (15-60s). We really did a tcpdump of production system while keep sending a request to the server using a simulator. It wasn’t too hard to figure out the TCP retransmission is the root cause of all these problems. I’m just surprised to see another problem due the age of the system. We should really give these kind of platform maintenance to PAAS/IAAS companies. PS. Wireshark is really the swissknife for network troubleshooting.