Thursday, January 1, 2009

VMware ESX NFS performance on openstorage

There are plenty of ESX discussions about the performance of NFS vs iSCSI on the web; believe me, I spent a lot of time reading them all to try and get a better handle on what I was seeing here.

To summarise the general feeling across all the articles out there: the performance of NFS is much the same as, or slightly better than, iSCSI (and in most cases FC wins overall).
This is pretty much in line with what I would have expected. iSCSI carries quite a lot of overhead and is a relatively new protocol with little in the way of speed optimisations present.
NFS has been around for decades and doesn't have much overhead at all, with many implementations measuring near wire speed (Fishworks, for example).

My environment here consists of an IBM x226 server (3GHz, 1GB RAM, onboard Broadcom 1Gb NIC) running Solaris Nevada build 95, with 8x 300GB SATA2 disks connected to a Supermicro 8-port SATA2 PCI-X card (AOC-SAT2-MV8).
This SATA card is a fairly cheap, dumb SATA board that simply provides connectivity to the 8 SATA disks. It's important to understand that it's NOT a RAID card and there is no onboard cache; think of it merely as an additional 8 SATA ports that the motherboard can see. ZFS makes ordinary storage like this... awesome, and in many cases faster than a hardware implementation. Many, but not all, as I discovered while testing ESX NFS performance.


The storage is laid out as follows:
[root@supernova ~]# zpool status Z
  pool: Z
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Z           ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0

errors: No known data errors

[root@supernova ~]# zpool list Z
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
Z     2.17T  1.91T   268G   87%  ONLINE  -

For those of you that haven't seen the light yet and moved to ZFS, and have no idea what that means: it's basically like an 8-disk RAID-5.
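For completeness, a pool like this is created with a single command (these are the disk names on my system; yours will obviously differ):

# create an 8-disk raidz pool called Z
zpool create Z raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0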

I knew from the beginning that the storage performance wasn't going to be out of this world, but this is just a home network and it should be "good enough" for what I'm wanting to work with.

I created a filesystem called Z/VMs, and this is what it looks like after putting a few VMs on there:

[root@supernova ~]# zfs get used,compress,sharenfs,shareiscsi Z/VMs
NAME   PROPERTY     VALUE  SOURCE
Z/VMs  used         27.3G  -
Z/VMs  compression  off    default
Z/VMs  sharenfs     on     inherited from Z
Z/VMs  shareiscsi   off    default
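Creating and NFS-sharing a filesystem like this only takes a couple of commands. This is just a sketch; in my case sharenfs is actually set on the top-level Z filesystem and inherited from there:

# create the filesystem and share the pool (and its children) over NFS
zfs create Z/VMs
zfs set sharenfs=on Z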

Results
This blog entry is being written after the fact, and I didn't bother recording exact figures as I went, since I wasn't intending for this to turn into the big investigation it became.

I immediately noticed that read performance across NFS was good, around 40MB/sec for sequential reads, which is about all I tend to see from supernova to my desktop too. This was in line with what I was expecting.
Write performance, though, was appalling: 4-5MB/sec, sometimes 6MB/sec tops. It really was awful! From my desktop I can easily do 30+MB/sec writes to supernova, so the bottleneck wasn't network or disk throughput on the file server. So why was ESXi having such a hard time with it?

Troubleshooting this from the ESXi server's end quickly proved near impossible. NFS doesn't show up under esxtop as disk activity at all; it's all just counted as network traffic, which doesn't give many clues. In addition there are pretty much no useful knobs to tune when it comes to NFS, in terms of setting or viewing NFS mount options/flags.

Often NFS performance issues come down to using a buffer size that is too small, or to a protocol (UDP vs TCP) issue.
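On a normal NFS client (a Solaris box, say) you'd check what was actually negotiated with something like the command below; ESXi gives you no equivalent visibility:

# show per-mount NFS options (rsize/wsize, proto, version) on a Solaris client
nfsstat -m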

Since ESXi won't tell you anything, even from the service console, I was forced to watch everything from the NFS server's end. Fortunately I'm using Solaris, which has some great observability tools.

nfsstat quickly showed me that ESXi was using NFSv3 over TCP. NFSv4 would have been nice, but this shouldn't be the problem. iostat wasn't hinting at any disk bottlenecks, and the CPU was just ticking over.
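For anyone following along at home, the server-side view of which protocol versions are actually being exercised comes from nfsstat as well:

# server-side RPC/NFS statistics, broken down by NFS version
nfsstat -s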
I ran snoop to watch the NFS traffic while doing some large writes with Iometer inside a WinXP VM living on the NFS datastore on the ESXi server.

[root@supernova ~]# snoop host esx1 and rpc nfs
Using device bge0 (promiscuous mode)
esx1.griffous.net -> supernova    NFS C WRITE3 FH=9A62 at 1917214208 for 65536 (FSYNC)
esx1.griffous.net -> supernova    NFS C WRITE3 FH=9A62 at 61346304 for 3072 (FSYNC)
esx1.griffous.net -> supernova    NFS C WRITE3 FH=9A62 at 117681664 for 1024 (FSYNC)
supernova -> esx1.griffous.net    NFS R WRITE3 OK 65536 (FSYNC)
esx1.griffous.net -> supernova    NFS C WRITE3 FH=9A62 at 1918262784 for 65536 (FSYNC)
supernova -> esx1.griffous.net    NFS R WRITE3 OK 3072 (FSYNC)
supernova -> esx1.griffous.net    NFS R WRITE3 OK 1024 (FSYNC)
esx1.griffous.net -> supernova    NFS C WRITE3 FH=9A62 at 117682688 for 4096 (FSYNC)
supernova -> esx1.griffous.net    NFS R WRITE3 OK 65536 (FSYNC)
supernova -> esx1.griffous.net    NFS R WRITE3 OK 4096 (FSYNC)

Two things are interesting to observe.

  1. It's using 64KB (65536 byte) writes, so the transfer size is already at its maximum.
  2. All WRITE3 operations are using FSYNC.

FSYNC

This was the real source of the "problem". I spent a lot of time googling this and came across a number of websites talking about ZFS+NFS and ZFS+databases with FSYNC, and the performance issues that come with it. It turned into a big exploration of the way writes are handled at the various layers by the different services involved, in particular the ZFS ZIL.

At the bottom of the stack is ZFS and its disks. By default ZFS batches writes up in memory and flushes them to disk every 5 seconds or so as one aggregated, optimised write. Synchronous writes additionally go through its journal, the ZFS Intent Log (ZIL), and a ZIL commit is an O_DSYNC write: it doesn't return until the disks themselves have confirmed that the data is on stable storage (remember that disks have their own caches too).
This is a time-consuming operation, but if it's critical that your data makes it to disk before the application moves on (think databases), then it's entirely appropriate.

In a RAID-5 layout a write operation spans several disks, and the existing stripe normally has to be read back so the parity can be recalculated before it is updated, which is why RAID-5 write performance is always going to be less than optimal. RAID-Z uses a variable-length stripe, which avoids that read-modify-write cycle and helps a lot, but the bottom line is that with a single RAID-Z vdev you only get roughly the IOPS of a single disk.

Putting this all together on a system with no DRAM-based disk write cache, as I have here, an O_DSYNC/FSYNC workload is going to deliver the net write performance of a single unbuffered disk, or less. Fortunately most applications don't use FSYNC.
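As a rough back-of-envelope sanity check (assuming a 7200rpm SATA disk is good for somewhere around 80 random IOPS, which is an assumption on my part rather than a measured figure), the numbers line up suspiciously well with what I was seeing:

  ~80 IOPS x 64KB per synchronous NFS write ≈ 5MB/sec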

NFS write operations can optionally have the fsync bit set, and that requirement works its way down through the layers: the filesystem (ZFS), its journal (the ZIL), and then the disks/array itself (my SATA disks).
When an NFS client requests an fsync write, the NFS server cannot acknowledge the request until the data has actually been confirmed as written to disk, and ZFS honours this.

Based on my snoop results, *all* ESX writes are using fsync, which means that for every I/O write request in a VM the NFS server has to flush that write to disk before moving on to the next request. Installing an OS, for example, issues hundreds of fsync write requests per second, which just annihilates the write throughput.
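If you want to watch this happening from the Solaris side, a DTrace one-liner along these lines will show how often ZFS is being asked to commit its intent log (this is just a sketch: zil_commit is the function in the ZFS code that flushes the ZIL, though probe availability can vary between builds):

# count ZIL commits, reported every 5 seconds
dtrace -n 'fbt:zfs:zil_commit:entry { @c = count(); }
           tick-5sec { printa("zil_commit calls in the last 5s: %@d\n", @c); clear(@c); }'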

Given the critical nature of the data running in virtual machines, I don't think ESXi is doing anything wrong here. Data being lost mid-transaction could have disastrous results for the VMs further up the stack, and the only way for ESXi to guarantee data integrity is to use fsync for all writes.

This all makes sense... so why is everyone else reporting that NFS performance is on par with iSCSI when, from what I've seen here, it's a disaster due to fsyncs? Funnily enough, my googling turned up a few other people asking the same question, also reporting around 4-5MB/sec on writes, with many of them talking about using Linux NFS storage.
Unfortunately no one had replied to those threads, so I had to do a bit more head scratching to get the answer.

I think the answer is this: I'm doing NFS against a commodity server with commodity disks, which means my writes are done in write-through mode. Most ESX(i) installations will be against a commercial SAN/NAS such as a NetApp appliance, which operates in write-back mode because it has a battery-backed cache of DRAM or NVRAM that can survive a power outage without data loss.

Write Caches

In write-through mode an FSYNC write has to reach the disks before a successful I/O is returned, and that cost is only exacerbated over networked storage such as NFS, where latency is higher than local storage.
In write-back mode, as soon as the data is in DRAM or NVRAM the array returns a successful I/O, allowing the client to keep hammering I/Os through even though they haven't actually made it to disk yet. This is still transactionally safe, because the battery backup ensures that any writes which didn't make it to disk are replayed from cache (kept alive by the battery) onto the disks as soon as power returns.
Naturally, write-back caching makes a huge difference to write latency, and therefore to write performance.

My home system is using simple disks with no additional write caching so I can't do write-back caching.....or can I?

ZFS is very configurable, and there is a "knob" you can change on the fly to stop fsync transactions being flushed to disk. By disabling the ZIL, ZFS will effectively ignore fsync/O_DSYNC requests, or put another way, it makes your zpool behave like write-back storage. Now, this is a very BAD idea and should never be used in production, as it can cause corruption for NFS clients in the event of a power outage. Don't do it! Really, don't. More information can be found here.

I wanted to confirm that my understanding of all this was in fact accurate, so while doing a large file copy within the VM I issued "echo zil_disable/W0t1 | mdb -kw" as root, which disables the ZIL globally. Straight away, mid-copy, the file copy performance rocketed up and I started seeing more like 25-35MB/second. Woohoo, so it's the lack of a write-back cache that's killing NFS performance in my environment.

Obviously leaving the ZIL disabled isn't a safe thing to do, and I don't want corruption, so I put it back again with "echo zil_disable/W0t0 | mdb -kw".
It proved my assumption was correct though: it was the write-through behaviour of ZFS + NFS that was killing my performance.
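If you're playing with this yourself, you can check the current value of the tunable before and after. This is a sketch using the old global zil_disable variable, which only exists on builds of this vintage:

# print the current value of zil_disable: 0 = ZIL enabled, 1 = ZIL disabled
echo "zil_disable/D" | mdb -k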

ZFS does have the ability to put the ZIL onto separate storage while keeping your data on the main zpool. Putting the ZIL onto a battery-backed RAM device or a solid state disk does wonders for this kind of load, so I could likely have solved this problem by putting my ZIL on a separate SSD.
Right at the moment there is a rather major bug with this functionality in ZFS: once a ZIL (slog) has been set up on a separate vdev it can't be removed again. I'll be keeping an eye on that bug, and once it's fixed I'll seriously look at going ahead with this.
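For reference, adding a dedicated log device is a one-liner. The device name below is just a placeholder for whatever SSD or battery-backed RAM device you'd use, and remember: for now it can't be removed again.

# attach a dedicated ZIL (slog) device to the pool
zpool add Z log c2t0d0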

Having the ZIL on very low latency storage makes the biggest difference for O_DSYNC/fsync operations, but it benefits a wide range of filesystem workloads.

What next? iSCSI testing

Having proved that transactionally safe NFS storage on supernova is going to be painfully slow until I add extra hardware, I turned to my other option for storage on supernova: iSCSI.

The short of it is that iSCSI is performing perfectly well on both reads and writes, which I'm very pleased to see. I haven't yet had a big dig into what's going on with iSCSI in terms of disk flushes on write, but whatever the differences are, the user-facing result is much better performance.
I still think NFS is far more flexible and would offer many great advantages for management, but until I can get its performance on par with iSCSI, I'll stick with iSCSI.
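For anyone wanting to run the same comparison, carving out a zvol and sharing it over iSCSI only takes a couple of commands. This is a sketch using the old shareiscsi/iscsitadm path of this era rather than COMSTAR, and the name and size are just examples:

# create a 100GB zvol and publish it as an iscsi target
zfs create -V 100G Z/esx-iscsi
zfs set shareiscsi=on Z/esx-iscsi
# confirm the target has been created
iscsitadm list target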

I think the best news is that OpenSolaris remains a great platform for use with ESX. With Project COMSTAR even FC is an option, though I don't have the hardware here to play with that :P
