When the disk is slow...

In one of our products, we use Tokyo Tyrant, a memcached-compatible key-value server, to store data. There are two reasons: first, it supports persistence, so the data can be recovered if the server dies; second, it supports master-master replication, which stock memcached does not provide. In general it is a pretty amazing component, and the performance is superb in most cases: thousands of TPS, easily.

By the way, on the Java side we use a Staged Event-Driven Architecture (SEDA) that separates reads from writes to the Tokyo Tyrant database. By nature, each stage has a queue attached to its threads, holding the data waiting to be processed.

Then, all of a sudden, the operations team reported that during busy hours the system was dropping events because a queue was full. We started by investigating the threads that write to the database, since theirs was the queue that filled up. A quick shell pipeline does the job:

./jstack 1100 | grep -A2 submit | grep 'at '|sort|uniq -c|sort -nr
 39 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  8 at java.lang.Object.wait(Native Method)
...
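For reuse, the same pipeline can be wrapped in a small function. This is only a sketch: the name `top_frames` is ours, it assumes a grep that supports `-A` (as the command above already does), and it assumes the writer threads have "submit" in their names, as the original grep suggests.

```shell
# Summarize the topmost stack frame of the threads whose header line
# matches a pattern, reading a jstack thread dump from stdin.
top_frames() {
    pattern="$1"
    # -A2 keeps the two lines after each matching thread header:
    # the thread state line and the first "at ..." frame.
    grep -A2 -- "$pattern" |
        grep '^[[:space:]]*at ' |      # keep only stack frames
        sed 's/^[[:space:]]*//' |      # strip indentation so identical frames group
        sort | uniq -c | sort -nr      # count frames, most common first
}
```

Usage would be `./jstack 1100 | top_frames submit`. If one frame dominates the count, that is where the threads are stuck.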

The output indicates that almost all of the threads are waiting for a response from the Tokyo Tyrant server (ttserver). We can do the same to ttserver using the pstack command:

pstack 1135

The output of this command shows that all the ttserver threads are writing to disk using pwrite. The disk I/O is saturated.

That makes sense, since the writes to disk are quite “random”. The ttserver delete command leaves “empty holes” in the database file, and a later write may fill one of those holes if the data size fits. The result is a completely random write pattern, and although the underlying disks are in RAID10, they are not fast enough. This can also be confirmed with the iosnoop command on Solaris. Simply put, random writes to a traditional spinning disk are extremely slow compared to sequential writes.

So what could the solution be? One approach is to use SSDs. But the operations team didn’t want to do that: first, SSDs are not “certified” for use on carrier-grade systems; second, there are no free slots to install the disks, and they don’t want to make drastic changes to the disk configuration.

Another idea was to change the ttserver source code to optimize how it writes to disk. Not trivial work either.

A third thought: can we just create a RAM disk to hold the database file? After all, the data size is only a bit over 10 GB, and the server has 64 GB of memory (well, it is a traditional Solaris system with lots of CPUs and memory). But creating a ramdisk on Solaris is not trivial either.

Finally we came up with a solution using tmpfs. It is available on most Unix systems out of the box and is basically an in-memory file system. Better yet, it can swap pages out to disk when memory is low (of course, in that situation we would probably face the random write problem again; we never tested it). We keep the “redo log” of ttserver on the RAID10 disk, where it is written sequentially, and put the database itself on a tmpfs mount, which lives in memory:

mount -F tmpfs swap /tt_data
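To make the split concrete, here is a rough sketch of how ttserver could be started with this layout. What we call the “redo log” is what ttserver calls the update log, enabled with `-ulog` (which also needs a server ID via `-sid`). The function name, the `DRY_RUN` switch, and both paths are hypothetical and only for illustration; 1978 is the ttserver default port.

```shell
# Start ttserver with the update log on spinning disk and the database
# file on tmpfs. With DRY_RUN=1 the command is only printed, not executed.
start_ttserver() {
    db_file="${TT_DB:-/tt_data/casket.tch}"   # hypothetical file on the tmpfs mount
    ulog_dir="${TT_ULOG:-/raid10/ttulog}"     # hypothetical directory on the RAID10 disk
    # Build the command line in the function's positional parameters.
    set -- ttserver -port 1978 -ulog "$ulog_dir" -sid 1 "$db_file"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 start_ttserver` prints the command line so it can be checked before the server is actually started.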

As long as the machine doesn’t crash (which is very rare), the file is kept, and the performance is even better than before. If the machine does crash, we can regenerate the database file from the redo log together with our daily backup. Of course, the usage of the tmpfs must be carefully monitored so that it never fills up.
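The monitoring can be as simple as a cron job that parses df. Below is a minimal sketch, not what we actually run: the function name `check_usage` and the 80% default are made up for illustration, and it assumes POSIX-style `df -P` output, where the fifth column of the first data row is the capacity percentage.

```shell
# Read `df -P` output on stdin and compare the used-capacity percentage
# against a limit. Returns 0 if below the limit, 1 if at or above it,
# and 2 if the df output could not be parsed.
check_usage() {
    limit="${1:-80}"
    # Second line is the first data row; column 5 is the capacity, e.g. "85%".
    pct=$(awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    case "$pct" in
        ''|*[!0-9]*)
            echo "UNKNOWN: could not parse df output"
            return 2 ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: ${pct}% used (limit ${limit}%)"
        return 1
    fi
    echo "OK: ${pct}% used"
}
```

It would be called as `df -P -k /tt_data | check_usage 80`, with the exit status wired into whatever alerting is already in place.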

Conclusion: RAM is the disk. Disk is the tape. Period.