Thursday, January 29, 2009

Solaris ldap naming - Part 2 concepts and preparation

Before we can move on to actually configuring and loading the DSEE with data, there are some concepts and terminology for the future LDAP clients that need to be understood.

Authentication and Encryption options

There are three basic methods to authenticate with the DS, and many methods to encrypt that conversation. This is a very important distinction, but it isn't made very clear in the official documentation that I've found to date. To make matters worse, the terminology used to describe these concepts is very counterintuitive.

Authentication to the Directory (Called "credential level" in solaris-ldap-speak)
This is who your ldap clients will connect to the DS as. There are three basic choices:
Anonymous
Pretty self explanatory really: clients won't connect as any specific user at all, which is called an anonymous bind. Bear in mind that this means everyone (computers or users) connects to your server at the same access level. LDAP clients need write access to the directory to perform tasks like password changes, which means you'd have to grant write access not just to everybody in your organisation, but more precisely to anybody with network access to your directory server. I really hope I don't have to explain how dire this is from a security standpoint!
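
To illustrate just how open this is, anyone on the network could run a search with no bind DN at all and get answers back. A minimal sketch against my test domain (the hostname and base DN are just my lab values):
/usr/bin/ldapsearch -h 192.168.10.4 -b "dc=solnet,dc=com" "uid=*" uid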

Proxy
This is not to be confused with directory proxy servers or http proxy servers. This method simply means that all LDAP clients will share a single username/password. The key difference compared with anonymous is that only those clients that already know this shared username/password can connect. Obviously only your known clients should have that password, not the evil hackers! The documentation refers to the ability to create and use multiple proxy accounts across your enterprise, which seems to be something a bit like a rudimentary implementation of group based access. This might be useful for segmenting your ldap domain into major business units with different access to different sections of your directory, however I expect that if your network is that big/complex then you'll want the next method.
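
For what it's worth, the proxy account that idsconfig will create later in this guide conventionally lives under ou=profile in your base DN as cn=proxyagent (treat the exact DN here as an assumption from my setup). A client binding with it looks something like this:
/usr/bin/ldapsearch -h 192.168.10.4 -D "cn=proxyagent,ou=profile,dc=solnet,dc=com" -w proxypassword -b "dc=solnet,dc=com" "uid=*" uid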

In short, you never, ever want to use anonymous; proxy is the more secure choice and the better one to start off with.

Per user (Kerberos)
This method assumes that each user/host will have their own account with which to access the directory server. This will of course be familiar ground for anyone that's run a Windows AD network. Each of your users will have their own account in your directory, so why not have them use this account to access the directory, rather than this silly shared proxy account?

It turns out that it doesn't work that way here. Instead, the concept of a kerberos user account is entirely separate from the LDAP record representing a user account's identity.
As we won't be installing kerberos (yet), proxy is the "credential level" that we will use across the Directory Servers.


Encryption (Called Authentication Method in solaris-ldap-speak)

This deals with how to connect with the directory server, rather than as whom.
By using proxy authentication we've stopped just anyone on the network connecting and accessing our directory server, but unfortunately in its native implementation the proxy user's username & password go across the wire in clear text. *shudder*

In this modern age of switched networks (rather than "hey, everyone watch my traffic" hubs), man-in-the-middle attacks are hopefully far less common and not as easy to pull off as they were when Solaris 8/9 were making the move to LDAP. However, encryption remains a very good idea and it's standard practice in most commercial implementations, such as the Windows AD world.

Unfortunately, based on my experiences setting it up for this blog, the Solaris implementation and ease of setup of SSL for LDAP is decades behind what is found in the Microsoft domain world.
I recommend starting off with encryption optional while you get everything up and running initially.

Of the 8 choices available, there are 3 main ones of interest. I expect the others are more for legacy implementations.
simple:
Doesn't use encryption at all. Enough said.

TLS:simple:
This method uses SSL encryption for all communication with the DS. Think https, but ldap.
This is the method that this guide will be using, and it's the method that all the other guides seemed to recommend based on my googling at the time.
As I mentioned earlier, it unfortunately does take quite a bit of work to get SSL up and working correctly on clients, but fortunately you can specify multiple methods and an order of preference. Therefore we are going to select tls:simple with the option to fall back to "simple" (no encryption, remember) while getting initially set up.
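
As a quick aside, once the server side certificate work is done you can sanity check the secure port before ever touching a client. Assuming the default LDAPS port of 636 (adjust if your install differs), something like this will at least show whether the server presents a certificate:
openssl s_client -connect 192.168.10.4:636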

sasl:GSSAPI:
This method is exclusively for use with kerberos.

Schema:

Another big surprise for first timers will be the fact that, out of the box, DSEE doesn't understand how to deal with Solaris clients at all. I'm just really dumbstruck here.
I would understand if the default implementation needed to be extended in some way to handle non-SUN clients, but in actual fact it can't do anything in its default configuration. There is no integration whatsoever. SUN - FIX THIS!


For a client to use ldap, the directory needs to be prepped with a Solaris client friendly schema. The second shock to newcomers will be that the DSEE server software doesn't actually come with any tools, wizards, or buttons to set this up for Solaris clients.

The 'tool' (a shell script in fact!) for setting up a Solaris ldap client friendly schema is found on client machines at /usr/lib/ldap/idsconfig, which comes with the SUNWnisu (Network Information System - usr) package.
This package is included as part of the base Solaris install.
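
The invocation itself couldn't be simpler; this is all I ran on a client (it then walks you through the prompts for server, base DN, credential level and so on, and from memory asks for the Directory Manager password so it can make its changes):
# /usr/lib/ldap/idsconfig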

Run this script as any user and follow the prompts. Most of the defaults will be OK. My configuration summary is below, with notes appended where important.

  1 Domain to serve : solnet.com
  2 Base DN to setup : dc=solnet,dc=com (my test domain)
  3 Profile name to create : default
  4 Default Server List : 192.168.10.4 (My DS server)
  5 Preferred Server List : 
  6 Default Search Scope : one
  7 Credential Level : proxy (Important. Make sure you don't use anonymous) 
  8 Authentication Method : tls:simple;simple (Important. SSL if possible, but fall back to no encryption while we get setup) 
  9 Enable Follow Referrals : FALSE
 10 iDS Time Limit : 
 11 iDS Size Limit : 
 12 Enable crypt password storage : FALSE
 13 Service Auth Method pam_ldap : 
 14 Service Auth Method keyserv : 
 15 Service Auth Method passwd-cmd: 
 16 Search Time Limit : 30
 17 Profile Time to Live : 43200
 18 Bind Limit : 10
 19 Service Search Descriptors Menu

Next I ran a bunch of index commands as instructed. In my case I had to change the paths and domains to look something like this for each line:
dsadm reindex -l -t solnet.com.getgrent /var/opt/SUNWdsee/dsins1/ dc=solnet,dc=com

This will create all the additional records and schema information in the directory to service Solaris LDAP clients.
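
If you want to convince yourself that it worked, the new client profile should now be visible in the directory. Something like this should return it (DUAConfigProfile is the standard objectclass for Solaris client profiles, but the exact DN layout here is just from my setup):
/usr/bin/ldapsearch -h 192.168.10.4 -b "ou=profile,dc=solnet,dc=com" "objectclass=DUAConfigProfile" cn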

At last you are ready to connect your first client machine, which we will do next.

Monday, January 26, 2009

Solaris ldap naming - Part 1 Directory Server Enterprise Edition

Now that we've decided to go with an LDAP directory server, the question is - which one?

The major players appear to be:
Apache Directory Server
OpenDS
OpenLDAP
Sun Java System Directory Server

I'm pretty sure that Solaris clients can be made to work with pretty much any LDAP server (even Windows 2003 r2 these days), but in the interests of K.I.S.S, I'll stick with the SUN product for this guide. I expect this will be the most common approach taken by most administrators.

The current full name for the Sun directory server is "Sun Java System Directory Server Enterprise Edition", a RIDICULOUS mouthful - come on SUN, seriously!
I shall refer to it simply as DSEE everywhere going forward.

DSEE needs to be installed in the global zone, or a full-root zone. A sparse-root zone doesn't work, unfortunately. Technically you could install just the common components in the global zone and the rest in a sparse zone, but that messes with my global zone and is just generally messy. For the installation of DSEE, I have created a full-root zone on persephone, my x86 SXCE 105 VMware ESX guest. The zone is called ds1.

Installation steps:

  1. Download and install the native pkg version of DSEE. The main product appears to be "java ES 5.0u1" which comes with DS 6.2, and there is an update patch which takes that up to 6.3. I have it on good authority that 6.2 is evil - make sure you download the 6.3 update too!
  2. I used ssh X11 forwarding to run the installer via ./installer. I assure you - this is much nicer than installing over the command line! I selected DSEE 6.2 and its subcomponents only.

  3. The shared components upgrade screen was a bit of a surprise, and it highlights a bug when installing on recent SXCE builds. It listed some upgrades to Cacao that are actually downgrades. On my system it wanted to "upgrade Cacao 2.2.0.1 to 2.0::PATHCHES:12384". You have to continue anyway, but the unfortunate reality is that you are going to end up with a broken Cacao that will need to be reinstalled from original media :( Next I failed the OS resources check because I didn't have SUNWpl5u installed, which turns out to be a perl package.
    This turned into an even bigger problem, as it turns out that sxcr 105 doesn't even ship with SUNWpl5u on the DVD any longer (that will be why the resource check was moaning!). It seems that in modern SXCE builds the SUNWperl584core pkg is used instead.
    Fortunately I still had an sxcr95 DVD around, which did have the package, but it turns out that SUNWpl5u can only be installed in a global zone... ack! Boy, this is fun, isn't it!
    So I had to install the sxcr95 SUNWpl5u package in the sxcr105 global zone to proceed with the install, all the while crossing my fingers and hoping that this won't break perl for all my existing and future zones on persephone.

  4. The install continued, and I chose to configure the server now, including creating a default DS instance. It used the default path "/var/opt/SUNWdsee/dsins1", which didn't seem all that intuitive to me, but I left it as is.
    I changed the suffix to dc=solnet,dc=com to suit my test environment here, but otherwise left everything as default, including running everything as root. (I'll try to come back to that!) The install proceeded smoothly after that.
  5. Next I tried to install the upgrade patch to get up to DS 6.3. As I'd almost come to expect by now, this too led to cursing. The patchadd folder contained two packages that I didn't have installed on the system, which in turn caused the whole patch to fail. I hadn't installed the ldap-proxy, but it appears the patch fails if it can't patch it regardless. This is actually my first time installing a proper Solaris patch, since I've always lived a bit on the edge running sxcr or higher. I found that if I deleted the two offending folders from the patchadd folder, it would proceed.
    rm -rf 125278-07/SUNWldap-proxy, and rm -rf 125278-07/SUNWldap-proxy-man did the trick, and the patch installed cleanly.
  6. Ok, actually on to configuring the product at last!
    I connected to the webconsole at https://ds1:6789. Normally I disable the webconsole by default due to its memory usage, so check that svc:/system/webconsole:console is enabled (see the SMF check after this list).
    You can connect with your standard non-root user initially, and the DSCC will give you the following (helpful) error: the Directory Service Control Center requires a one-time initialization process to be run before it can be used. You then log in as root, and after following the DSCC wizard it will helpfully tell you that
    Authentication As root No Longer Required.
  7. Once logged in again as my non-root user, I saw that I didn't have any directory servers registered. This came as a bit of a surprise, as I know we created one during the install in /var/opt/SUNWdsee/dsins1!

    After a quick bit of digging around, I realised that cacaoadm wasn't working and that I was hitting this bug again. This is the bit that I mentioned earlier about Cacao needing to be re-installed. The fix is to un-install SUNWcacaowsvr and SUNWcacaort and to revert back to the OS provided versions of these packages (2.2x rather than 2.0x).
    SUNWcacaomn doesn't come with ON, so I left that package at the older 2.0 version.

    After the re-install, the DSCC and Cacao still weren't talking properly:
    # /opt/SUNWdsee/dscc6/bin/dsccsetup status
    *** DSCC Application is registered in Sun Java (TM) Web Console
    *** DSCC Agent is not registered in Cacao
    *** DSCC Registry has been created
    Path of DSCC registry is /var/opt/SUNWdsee/dscc6/dcc/ads
    Port of DSCC registry is 3998

    You must re-register the Cacao agent in the DSCC using the cacao-reg command:
    # /opt/SUNWdsee/dscc6/bin/dsccsetup cacao-reg
    Registering DSCC Agent in Cacao...
    Checking Cacao status...
    Stopping Cacao...
    Enabling remote connections in Cacao ...
    Starting Cacao...
    DSCC agent has been successfully registered in Cacao.
    At this point I thought I'd check in on the resource usage of the zone. DSEE and the webconsole are real memory hogs! How much RAM is used by an idle DSEE server (zone) with no data yet?

    ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
    4 44 388M 568M 21% 0:01:07 0.4% ds1 
    Over half a gig already! cripes!

  8. For reasons beyond me, neither the DSCC registry/management service nor your DS instance will start at boot by default. You must use the dsadm command to register them both within SMF so that they start automatically, with the following command (see the quick SMF sanity check after this list):
    ds1:~# /opt/SUNWdsee/ds6/bin/dsadm enable-service --type SMF /var/opt/SUNWdsee/dsins1/
    Registering 'Directory Server' as 'application/sun/ds' in SMF ...
    Registering '/var/opt/SUNWdsee/dsins1' as 'ds--var-opt-SUNWdsee-dsins1' in SMF ...
    Instance /var/opt/SUNWdsee/dsins1 registered in SMF
    Use 'dsadm start '/var/opt/SUNWdsee/dsins1/'' to activate the service 

    And then run the suggested command to actually start your instance. Repeat for the DSCC registry:
    ds1:~# dsadm enable-service --type SMF /var/opt/SUNWdsee/dscc6/dcc/ads/

    Now your DS instance (and management service) are installed and ready for configuration! Phew! Note to Sun: This really needs some better polish in a BIG way!
  9. Bonus step: Have yourself a coffee, there is lots more to come.
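
One last sanity check, as referenced from steps 6 and 8 above. This is just a sketch of the sort of thing I run; the exact service names for the DS instances are whatever dsadm printed when it registered them:
ds1:~# svcadm enable svc:/system/webconsole:console
ds1:~# svcs -a | grep webconsole
ds1:~# svcs -a | grep -i SUNWdsee
Anything still showing as disabled can be kicked into life with svcadm enable.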

Sunday, January 11, 2009

zfs sparse provisioning with compression - results

I thought I'd quickly share an update with the results after a few days of using this solution.

My ESX servers are hosting 5 VMs, with 30GB,15GB,15GB,10GB,10GB disks respectively for a total of 90GB.
ESX is showing the iscsi volume as 93GB/500GB used. I have a couple of GB on the volume for the ESXi userworld swap, which explains the discrepancy.

ZFS on the other hand is showing 35.0GB/220GB used on the zvol.
What's more, it's also reporting 1.28x compression!

This is a great space saving, so what are we seeing here?

On the ESX/VMFS side, the vmdks are actually being sparsely provisioned onto the storage from what I can tell. VMFS makes sure that it accounts for the full space of each vmdk in its totals so that it doesn't overcommit itself, but it isn't actually zeroing/preallocating the contents of each vmdk as it makes them, despite what you may think when you hear the term "thick-provisioning".

I actually confirmed this by testing with another zvol with compression off to rule out zfs compression reclaiming any "000000....0" blocks that VMFS may have been writing. Same result as my production compression=on iscsi volume.

On the Solaris side, ZFS is COWing only the blocks that actually get touched. As you would expect most of my VMs are not using the full capacity of their file systems yet, which explains the big difference between ~90GB and the 35GB that ZFS is actually using.

It's also interesting to see that I'm getting 1.28x compression too. It was higher earlier on, but I've been doing quite a bit of work on a Solaris VM which uses ZFS root with compression=on in the guest, so we can't compress any of that VM's blocks a second time.
This means that this compression figure is coming from just 4 of the 5 VMs (the Windows ones), which is worth having in my book!

Scaling this out: 35GB-for-90GB is a space saving of a bit over 2.5x.
I thinly provisioned a 500GB volume with around 250GB of actual storage available, which was only a 2x overcommit. Maybe I should have aimed higher!

Snapshots

Just a quick note on ESX snapshots. They will burn through your space savings as they are COWing in a totally different area on the disk, raising the "high water mark" that I referenced in my earlier post.
That's not to say that you shouldn't use them, but just be aware that they really are quite costly both in terms of performance to the VM as we all know, but also to your thin provisioning savings :)

Sunday, January 4, 2009

zfs sparse provisioning with compression

As I was writing my last post, the possibility of zfs reclaiming/saving some space within the zvol by using compression kept nagging at me. In this post I've explored what happens when compression is used with a sparse-vmfs-iscsi-zvolume.

I'm not so much interested in whether the real, used VM data blocks would compress well (I don't really expect there is much to be gained here), but I did want to see if there were circumstances where compression would allow some unused or freed blocks above zfs to be reclaimed to the zpool by way of compressing empty blocks down to almost nothing.
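
For anyone wanting to reproduce this, the compressed test volume was created along these lines (the name and size are just what I used; shareiscsi is the sharing mechanism on this build):
#zfs create -s -V 200G Z/esxiscsi-compressed
#zfs set compression=on Z/esxiscsi-compressed
#zfs set shareiscsi=on Z/esxiscsi-compressed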

Testing

I performed the following tests to see what would happen:
Virtual machine FS blocks - Write 10GB of data to the file system in a VM, and then delete it again. Check zvol usage before/after.
VMFS blocks - Delete the 10GB virtual disk from the iscsi VMFS lun. Check zvol usage before/after.

I added a 10GB disk to a winXP VM, formatted it with NTFS and then ran an iometer benchmark on the volume. I did have an interesting result while iometer was creating its initial 10GB scratch file on the disk. Iometer must write zeros or something very compressible while populating the file, because I was getting fantastic compression. Iometer finished creating its 10GB file, but ZFS was reporting only 78MB used on the whole zvol. That's over 100x compression/deduplication!

With the scratch file fully created, the benchmark actually started, and at this point iometer must have been writing some real data across the disk, as the zfs usage worked its way back up to the full 10GB of disk usage that I would have expected.

After the benchmark had finished the zvol looked like this:
#zfs get used,compressratio,referenced Z/esxiscsi-compressed
NAME PROPERTY VALUE SOURCE
Z/esxiscsi-compressed used 10.9G -
Z/esxiscsi-compressed compressratio 1.00x -
Z/esxiscsi-compressed referenced 10.9G - 

I closed the benchmark, deleted the 10GB iobw.tst file and rebooted the VM to ensure that all NTFS/VM caches were flushed.
As I expected, the zvol used space remained at the full 10.9G despite the virtual disk being 'empty'.
The NTFS delete operation won't have zeroed the blocks; instead it will have simply updated its FS pointers/metadata to reflect that the blocks previously occupied by iobw.tst are now 'free'. There is no way for zfs to know that these blocks are suddenly unreferenced, so it's still faithfully storing them, unaware of what's happening up the stack at the NTFS level. It was worth a shot!

Next I deleted the virtual disk from the VM using the VIC. Would vmfs free/zero the blocks?
No, it didn't, and while I don't know as much about VMFS semantics, I expect this is a result of the same FS level operations/shortcuts as in the NTFS test, but with VMFS this time. The zvol usage at this point was still 10.9GB.

That's 0/2 for my compressibility tests.

Reusing blocks?

Next I wondered what would happen if I added and filled another 10GB disk on the now empty VMFS iscsi volume.
If VMFS were to reuse the same blocks that the previous 10GB virtual disk had used, then the zvol would simply update the contents of those blocks again, using no further disk space.

To my surprise, this is exactly what happened. After creating and filling another 10GB disk and performing the same iometer procedure, the zvol was still only using 10.9GB.
In actual fact it went from 10.9 down to around 100MB before climbing back to 10.9 again, due to the iometer scratch file/zeroing behaviour that we observed earlier.


We now know that if guests write zeroed blocks to their file systems, zfs compression will reclaim the space - though I should point out that this would be pretty unusual in a real environment!


VMFS block allocation policies

This most recent result made me even more curious about how VMFS allocates blocks on disks/LUNs. If the block allocator works in a simple first-to-last LBA fashion with a preference to reuse blocks at the start of a volume, then there may be some space savings to be realised if you churn VMs a lot, as I will in my test setup here.
Modern filesystems have fragmentation and wear levelling considerations to factor in, which typically results in blocks being allocated all over the volume. But since vmfs typically deals with a few very large files, rather than the millions of small files created by any modern OS, perhaps the engineers at VMware decided to fill their filesystem across a disk/LUN in much the same fashion as filling a glass from bottom to top. As files are deleted, new ones are created at the beginning again!

To test this I started afresh with an empty VMFS volume. I created a 10GB virtual disk, and filled it up. I then added a second 15GB disk and filled this too.
Here is the ZFS view of that:

root@supernova WinXP-ESX]#zfs get used,compressratio,referenced Z/esxiscsi-compressed Z/esxiscsi
NAME PROPERTY VALUE SOURCE
Z/esxiscsi used 27.3G -
Z/esxiscsi compressratio 1.00x -
Z/esxiscsi referenced 27.3G -
Z/esxiscsi-compressed used 27.2G -
Z/esxiscsi-compressed compressratio 1.00x -
Z/esxiscsi-compressed referenced 27.2G - 

Next I deleted both the 10, and 15GB disks, and created a new 20GB disk.
I then filled the 20GB disk, and checked the zfs usage again.

Sure enough, I was still using the previously allocated 25GB (27.3G as reported by zfs).

I could go on to test fragmentation and overlap, but I've learned all I wanted to find out today.

Takeaways
Zvol sparse provisioning + iscsi + VMFS works out to be a very efficient and scalable storage system. The total disk usage will only be as much as your highest utilisation point, and with sparse provisioning you don't have to worry about extending your vmfs volume each time you need to add another VM.

I should also point out that I've been testing a worst case scenario of a completely full disk that has had every single block written to. In the real world, free space within the VMs isn't initially going to take up space on the zvol until the OS's FS allocates those blocks with data. This will of course happen over time, but in the short term your disk savings will be even better than what I've tested here.

ZFS compression doesn't appear to have any obvious benefits in terms of storage savings, but I'll have a play with some real world VMs that don't contain synthetic data and I'll post an update in the future with my compression ratio results.

Saturday, January 3, 2009

ESX + iscsi + ZFS. Provisioning options

With my ESX server(s) up and running with acceptable performance using an iscsi shared zvol, it's time to think about how I'd like to actually best provision my storage for the ESX hosts.

Which model?

When setting up FC storage all VMware admins face the architecture decision of whether to go with 1 LUN per VM or a single large LUN for all [or at least large groups of] VMs.
1 LUN per VM requires a lot of extra provisioning each time you wish to add a VM or to grow a disk, but on the up side you don't have to worry about scsi lock contention as you're effectively not using VMFS as an active/active clustered file system at all. There are possibly some performance benefits to using 1 LUN/VM if you're hitting the storage very hard, as you have a dedicated queue depth per LUN too.

In my most recent deployment we opted for 1 shared LUN per RAID (2 in total) and we never looked back. Life was simpler and we didn't experience any service impacting slow downs with snapshots or other lock related activities.
Here I'll be using the same model.

ZFS iscsi volume options

Within the 1 iscsi volume shared by several VMs model, what are the options that zfs gives us for this volume?
I have around 250GB free on my zpool at the moment and there are three main options as I see it:

  1. Start small with a reasonable sized zvol (50GB?), and simply grow it as VMware needs more space.
  2. Make a large (200GB?) zvol that is unlikely to be filled in the near future.
  3. Thinly provision a very large (500GB) zvol that is unlikely to be filled.

1. A 50GB zvol, that I grow each time I need more space

Positives: Only uses around as much space as I actually need, and when more is needed I can grow the zvol simply by using "zfs set volsize=xxG" (see the one-liner at the end of this option).

Negatives: ESX can't directly/nicely use the additional space. When you add to the zvol size, ESX will see additional storage beyond the end of the partition that the vmfs is running within, however unlike most modern filesystems there doesn't appear to be a way to grow the vmfs partition/volume to make use of the extra space directly.
You can only add another vmfs volume/partition to the end, and then span between both vmfs volumes to make one large contiguous (virtual) volume, but that really seems pretty ugly to me. I really don't know that I feel too good about VMware taking on the role of logical volume manager; I'd rather keep that intelligence on the Solaris end where I have confidence in it.

If I add another 10GB and another VMFS extent each time I want to try another VM, I can easily see my vmfs spanning many partitions very quickly all within the one zvol.... this just seems stupid and the spanning extent feature feels to me more of a last resort to work around storage that can't expand volumes on the fly.
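
For completeness, the grow operation from option 1 really is a one-liner; the size here is just an example:
#zfs set volsize=100G Z/esxiscsi
The awkward part, as above, is then getting VMFS to actually make use of the new space without resorting to extents.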

2. A single 200GB zvol

Positives: I quickly create this once and I'll probably not need to worry about it for some time to come, with no mucky multi-extent volumes.

Negatives: I'm immediately kissing goodbye to 200GB of space on the file server, whether it's used or not. Initially I'm only looking to run maybe 3-4 VMs, each with disks of maybe 10GB each. I need 40GB but it costs me 200GB... really not optimal!

3. A sparse 500GB zvol

Positives: PLENTY of space for VM testing, more than I currently have available in my zpool in fact, but ZFS CAN DO!
By thinly provisioning I don't immediately write off disk space on the zpool, but ESX can format and start using what it sees as a fully featured 500GB lun. ZFS uses its COW technology so that only blocks containing newly written data are ACTUALLY written to the zpool, so my sparsely provisioned xxxGB volume only uses as much space as has been written to the LUN.

Negatives: This is almost perfect, except for the fact that ESX thickly provisions its disks. This means that when I create a 40GB VM and install a 5GB OS, 40GB is marked as used within the zvol. ZFS doesn't know that the blocks are actually "empty" 2 levels of virtualisation up the stack. An nfs mounted VM is thinly provisioned by default, so I'm not quite getting that level of optimum volume utilisation, but NFS isn't an option... I'm just still a bit bitter about that :p

Option 3, is by far the most efficient choice of the 3 options I've evaluated, and so this is what I chose to use:

root@supernova ~]#zpool list Z
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
Z 2.17T 1.92T 262G 88% ONLINE -
root@supernova ~]#zfs create -s -V 500G Z/esxiscsi
root@supernova ~]#zpool list Z
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
Z 2.17T 1.92T 262G 88% ONLINE -
root@supernova ~]#zfs set shareiscsi=on Z/esxiscsi
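
A quick check that the target really was created never hurts (iscsitadm is the target administration tool on this build; your target names and details will obviously differ):
root@supernova ~]#iscsitadm list target -v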

And...we're done!

Thursday, January 1, 2009

VMware ESX NFS performance on openstorage

There are plenty of ESX discussions about the performance of NFS vs iscsi on the web, believe me I spent a lot of time reading them all to try and get a better handle on what I was seeing here.

To summarise the general feeling over all the articles out there; the performance of NFS is much the same or slightly better than iscsi (and in most cases FC wins overall).
This is pretty much in-line with what I would have expected. Iscsi carries quite a lot of overhead and is a relatively new protocol with little in the way of speed optimisations present.
NFS has been around for decades and it doesn't have much overhead at all, with many implementations measuring near wirespeed (fishworks for example).

My environment here consists of an IBM x226 server (3GHz, 1GB RAM, onboard Broadcom 1Gb NIC) running Solaris Nevada build 95, with 8x 300GB SATA2 disks connected to a Supermicro 8-port SATA2 PCI-X card (AOC-SAT2-MV8).
This SATA card is a fairly cheap, dumb sata board that simply provides connectivity to the 8 SATA disks. It's important to understand that it's NOT a raid card and there is no onboard cache; think of it merely as an additional 8 onboard SATA ports that the motherboard can see. ZFS makes ordinary storage like this... awesome, and in many cases faster than a hardware implementation. Many, but not all, as I discovered in testing ESX nfs performance.


The storage is laid out as follows:
root@supernova ~]#zpool status Z
  pool: Z
 state: ONLINE
 scrub: none requested
config:

 NAME STATE READ WRITE CKSUM
 Z ONLINE 0 0 0
  raidz1 ONLINE 0 0 0
  c0t0d0 ONLINE 0 0 0
  c0t1d0 ONLINE 0 0 0
  c0t2d0 ONLINE 0 0 0
  c0t3d0 ONLINE 0 0 0
  c0t4d0 ONLINE 0 0 0
  c0t5d0 ONLINE 0 0 0
  c0t6d0 ONLINE 0 0 0
  c0t7d0 ONLINE 0 0 0

errors: No known data errors
root@supernova ~]#zpool list Z
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
Z 2.17T 1.91T 268G 87% ONLINE -

For those of you that haven't seen the light yet and moved to zfs, and have no idea what that means: it's basically like an 8 disk Raid5.

I knew from the beginning that the storage performance wasn't going to be out of this world, but this is just a home network and it should be "good enough" for what I'm wanting to work with.

I created a filesystem called Z/VMs, and this is what it looks like after putting a few VMs on there:

root@supernova ~]#zfs get used,compress,sharenfs,shareiscsi Z/VMs
NAME PROPERTY VALUE SOURCE
Z/VMs used 27.3G -
Z/VMs compression off default
Z/VMs sharenfs on inherited from Z
Z/VMs shareiscsi off default

Results
This blog entry is being done after the fact and I didn't bother recording the exact figures as I went as I wasn't intending for this to turn into the big investigation that it turned out to be.

I immediately noticed that read performance across nfs was good, around 40MB/sec for sequential reads which is about all I tend to see from supernova to my desktop too. This was inline with what I was expecting.
Write performance, though, was appalling: 4-5MB/sec, sometimes 6MB/sec tops. It really was awful! From my desktop I can easily do 30+MB/sec writes to supernova, so the bottleneck wasn't network or disk throughput on the file server. Why was ESXi having such a hard time with it?

Troubleshooting this from the ESXi server's end quickly proved near impossible. Nfs doesn't show up under esxtop as disk activity at all; it's all just counted as network traffic, which doesn't give many clues. In addition there are pretty much no useful knobs to tune when it comes to NFS, in terms of setting/viewing nfs mount options/flags.

Often nfs performance issues are down to using a buffer size that is too small, or a protocol (udp vs tcp) issue.

Since ESXi won't tell you anything, even from the service console, I was forced to watch everything from the nfs server's end, but fortunately I'm using Solaris, which has some great observability tools.

nfsstat quickly showed me that ESXi was using nfs3 over TCP. Nfs4 would have been nice, but this shouldn't be the problem. Iostat wasn't hinting at any disk bottlenecks and the cpu was just ticking over.
I ran snoop to watch the nfs traffic while doing some large writes with iometer in an nfs mounted WinXP vm on the ESXi server.

[root@supernova ~]#snoop host esx1 and rpc nfs
Using device bge0 (promiscuous mode)
esx1.griffous.net -> supernova NFS C WRITE3 FH=9A62 at 1917214208 for 65536 (FSYNC)
esx1.griffous.net -> supernova NFS C WRITE3 FH=9A62 at 61346304 for 3072 (FSYNC)
esx1.griffous.net -> supernova NFS C WRITE3 FH=9A62 at 117681664 for 1024 (FSYNC)
supernova -> esx1.griffous.net NFS R WRITE3 OK 65536 (FSYNC)
esx1.griffous.net -> supernova NFS C WRITE3 FH=9A62 at 1918262784 for 65536 (FSYNC)
supernova -> esx1.griffous.net NFS R WRITE3 OK 3072 (FSYNC)
supernova -> esx1.griffous.net NFS R WRITE3 OK 1024 (FSYNC)
esx1.griffous.net -> supernova NFS C WRITE3 FH=9A62 at 117682688 for 4096 (FSYNC)
supernova -> esx1.griffous.net NFS R WRITE3 OK 65536 (FSYNC)
supernova -> esx1.griffous.net NFS R WRITE3 OK 4096 (FSYNC)

Two things are interesting to observe.

  1. It's using 64k (65536) byte packets, so the window size is at the correct maximum
  2. All write(3) operations are using FSYNC.

FSYNC

This was the real source of the "problem". I spent a lot of time googling this and came across a number of websites talking about zfs+nfs and zfs+databases with FSYNC, and the performance issues that come with it. This turned into a big exploration of the way writes are handled at the various layers by the different services involved, in particular the ZFS ZIL.

At the bottom of the stack is zfs and its disks. By default zfs caches up writes in its journal, the ZFS intent log (ZIL), and it will flush these writes to disk every 5 seconds in an aggregated/optimised write. The ZIL flush operation is an O_DSYNC write, which waits until the disks themselves have confirmed that the write has completed before returning (remember that disks themselves also have caches).
This is a time consuming operation, but if it's critical that your data makes it to disk before the application moves on (think databases), then it's entirely appropriate.

In a RaidZ, a write operation occurs across several disks, and the entire stripe has to be read to calculate the parity before updating it. Raid-5 write performance is always going to be less than optimal because of this. RaidZ has a variable length stripe which helps a lot, but the bottom line is that with a single-width RaidZ you'll only get the IOPS of a single disk.

Putting this all together on a system with no DRAM-based disk write cache, as I have here, an O_DSYNC/FSYNC operation is going to give the net write performance of a single unbuffered disk, or less. Fortunately most applications don't use FSYNC.

NFS writes can optionally have the fsync bit set on write operations, which works its way down the layers to the filesystem (zfs), its journal (zil), and then the disks/array itself (my sata disks).
When an NFS client requests an fsync write, the nfs server cannot confirm the completion of the write request until that data has actually been confirmed as written to disk, and zfs honours this.

Based on my snoop results, *all* ESX writes use fsync, which means that for every IO write request in a VM the nfs server has to flush all writes to disk before continuing with the next request. Installing an OS, for example, will issue hundreds of fsync write requests per second, which just annihilates the write throughput.

Given the critical nature of the data running in virtual machines, I don't think ESXi is doing anything wrong here. Data being lost mid-transaction could have disastrous results for the VMs further up the stack, and the only way for ESXi to guarantee data integrity is to use fsync for all writes.

This all makes sense... so why is everyone else reporting that nfs performance is on par with iscsi when, from what I've seen here, it's a disaster due to fsyncs? Funnily enough, my googling turned up a few other people asking the same question, also reporting around 4-5MB/sec on writes, with many talking about using linux nfs storage.
Unfortunately no one had replied to these threads, so I had to do a bit more head scratching to get the answer.

I think the answer is this: I'm doing nfs against a commodity server with commodity disks, which means that my writes are done using write-through, while most ESX(i) installations will be against commercial SAN/NAS such as netapp appliances, which operate in write-back mode as they have a battery backed cache of DRAM or NVRAM that can survive a power outage without data loss.

Write Caches

In write-through mode an FSYNC write is written to disk before a successful IO is returned, which is only exacerbated over a higher-than-local-latency network storage system such as NFS.
In write-back mode, as soon as the data is in DRAM or NVRAM the array will return a successful IO, allowing the client to keep hammering those IOs through even though they haven't actually made it to disk yet. This is actually transactionally safe, because the battery backup ensures that those writes that didn't make it to disk are replayed from cache (still live thanks to the battery) onto the disks as soon as power returns.
Naturally, write-back caching makes a huge difference to write performance latency.

My home system is using simple disks with no additional write caching so I can't do write-back caching.....or can I?

ZFS is very configurable, and there is a "knob" that you can change on the fly to disable flushing fsync transactions to disk. By disabling the ZIL, zfs will effectively ignore fsync/O_DSYNC requests, or put another way, it changes your zpool to behave like write-back storage. Now, this is a very BAD idea and should never be used in production, as it will cause corruption for nfs clients in the event of a power outage. Don't do it! Really, don't. More information can be found here.

I wanted to confirm that my understanding of all this was in fact accurate, so while doing a large file copy within the VM, as root I issued "echo zil_disable/W0t1 | mdb -kw", which disables the ZIL globally. Straight away, mid-copy, the file copy performance rocketed up and I started seeing more like 25-35MB/second. Woohoo, so it is the lack of a write-back cache that's killing nfs performance in my environment.

Obviously leaving the ZIL disabled isn't a safe thing to do and I don't want corruption, so I put it back again with "echo zil_disable/W0t0 | mdb -kw".
It proved that I was correct in my assumption though: it's the write-through behaviour of zfs + nfs that was killing my performance.

ZFS does have the ability to put the zil onto alternate storage while keeping your data on the main zpool. Putting your zil onto a battery backed RAM device or a solid state disk will do wonders for this kind of load, so I could likely solve this problem by putting my zil on a separate SSD.
Right at the moment there is a rather major bug with this functionality in zfs: once a zil (slog) has been set up on an alternate vdev it can't be removed again. I'll be keeping an eye on that bug, and once it's fixed I'll seriously look into going ahead with this.
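
When that day comes, adding a dedicated log device should be a one-liner along these lines (the device name is purely a placeholder for whatever SSD I end up buying):
root@supernova ~]#zpool add Z log c2t0d0
Just remember the caveat above: until the removal bug is fixed, that log vdev becomes a permanent part of the pool.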

Having the zil on very low latency storage makes the most difference for O_DSYNC/fsync operations, but it will benefit a wide range of fs loads.

What next? Iscsi testing

Having proved that transactionally stable nfs storage on supernova is going to be painfully slow until I add additional hardware, I turned to my other option for storage on supernova: iscsi.

The short of it is that iscsi is performing perfectly well on both reads and writes, which I'm very pleased to see. I haven't yet had a big dig into what operations are going on with iscsi in terms of disk flushes on write, but whatever the differences are, the user facing result is much better performance.
I still think nfs is far more flexible and would offer many great advantages for management, but until I can get the performance on par with iscsi, I'll stick with iscsi.

I think the best news is that opensolaris remains a great platform for use with ESX. With project comstar even FC is an option, though I don't have the hardware here to play with that :P

Whitebox burn in and VMware

With everything assembled I went straight into the benchmarking and testing.
The BIOS on the XFX is very nice to use, with heaps of information and very specific voltage/FSB adjustments.

I quickly discovered that I could take my 2.83 up to 4GHz; I seem to remember getting to about 4.08 before it would stop posting, and I didn't want to push the voltage on the cores too high.

RAM

From here I went on to discover a number of very interesting things about burn-in testing and memory speed. I purchased brand name DDR2 1066MHz RAM, and with the motherboard supporting all the way up to DDR2 1200MHz, I should be able to run it at its "factory" 1066MHz settings, right?...

It turns out that Corsair advise running the RAM at a whopping 2.1V (1.8V is the default). I wouldn't have dreamed of pushing my RAM's voltage that high for fear of breaking it, but sure enough, I confirmed this information on their website.

Frustratingly, even at 2.1V memtest86 was still giving me RAM errors, even with the CPU/FSB running at factory defaults. Even running at 2.15V didn't cure the problem.
I thought it might just be a bad DIMM, so I pulled a pair out... no problem. I switched pairs to the presumably faulty pair... no problem there either.
Wait, what???!

Then I remembered discovering something similar back when I built my AMD desktop some years ago. The memory controller on the CPU (AMD, remember) just couldn't drive all 4 DIMMs at their uprated speeds, so if you wanted to run 4 DIMMs you had to drop the DDR speed down. With this system being an Intel with a dedicated off-CPU memory controller I never even gave this a thought, but here I was having the exact same problem symptoms. I'd already upped the SPP voltage, and the FSB (despite running at a stock 333MHz)... no dice.

I did explore running at DDR 1000, but after eventually getting another crash I finally conceded defeat and fell back to DDR800 with my DDR1066 RAM. Very disappointing, but I'm not sure that it's the RAM specifically that's at fault, and RMAing this would be very challenging.

CPU limits?

On the CPU side, I had settled on 3.83GHz, which seemed to be pretty stable with the CPU temps in the high 50s at idle. Unfortunately I couldn't easily monitor the CPU temps under load, save for thrashing it, really quickly rebooting, then jumping into the BIOS and checking the cpu temperatures. There are some windows tools to monitor the BIOS life signs, but I was using ubuntu/ESX at all times.

After an extended period of load at 3.83GHz I'd get weird failures, and even at 3.6GHz I'd have problems once I'd installed the servers in my server room. I guess there isn't the same airflow in there, so it's a little less forgiving. For testing here I was running memtest in a 2-CPU VM with a 2-CPU XP VM running prime95.
Prime95 has turned out to be extremely useful. It was forever picking up errors in its own calculations, hinting that there was a memory or CPU error; normally memory. What's interesting is that memtest wasn't picking anything up, so my guess is that it's something to do with the FPUs being hammered in addition to the RAM itself, while memtest is just testing the RAM.

Either way, prime95 was frequently telling me that I had issues even when everything else appeared to be running OK. I even installed Server 2008 in another VM on the same host while it was telling me there were issues. I'd really started to conclude that it was just prime95 having the issues, but sure enough a crash/purple screen would come along soon enough.

3.4GHz has been very stable, with prime95 & memtest running in VMs all night long without issues, so I've settled on that. The voltage in the bios for the CPU has been set to 1.250V, but it's actually getting around 1.19V after what I've since discovered is known as "voltage droop".

I think the default voltage is 1.15V, so the CPUs are hardly any warmer for the extra 500MHz/core. 2GHz is worth having in my book!

VMware ESXi

I did quite a bit of googling to try and work out if my XFX MG-V780-ISH9 motherboard would be supported by ESX, so for the benefit of any others that may hit this blog: the NICs both work using the forcedeth driver. IDE/PATA also works, though you'll have to hack the ESXi installer if you wish to install onto them, as ESXi actively ignores PATA disks during install weirdly enough, while it will let you use them for vmfs stores. I found a guide somewhere to bypass the IDE restriction at install time, but I've ultimately ended up using USB keys for booting anyway. (Less noise/heat... and maybe even faster?)

I'm not sure if the SATA ports work, I haven't tried them.

Having spent a bit more time hacking at ESXi now, I must say it's very annoying to use. We all know about the "u n s u p p o r t e d" hack to get a console, and yes, you can enable SSH access (for now at least), but the service console has very little in it. Yes, I know this is kinda the whole point, but boy, it makes troubleshooting stuff a pain in the ass!

Another ESXi-specific oddity is the networking for the service console. Under ESX you have 3 types of networks:
VM networks
Service console networks
VMkernel networks.

ESXi merges the Service Console and the VMkernel into one.
I've been using NFS as the backing storage to get things up and running quickly, and I discovered very early on that the performance metrics for disk usage simply don't exist with NFS; it all shows up as network traffic.
If you want to know how much disk load an individual VM is generating, you can't look at its disk performance information (there isn't even a drop down for disk), and furthermore the network metrics are literally just for the VM's actual network traffic (not its underlying nfs traffic).

This just leaves monitoring it at the host level, and with the service console network data mixed in with the nfs traffic, it all gets a bit muddled.

Rather a weird way of doing things, VMware!

VMware Whiteboxen

I've been in denial for a while, but the time finally came to shell out some cash on faster servers for my home test lab. It currently exists as a mess of old desktops and even older Compaq servers (from before HP rebranded them).

While everything has been keeping up with what I need it to do, I really want to start spending more time working with virtualisation technologies and bringing myself up to speed again with some of the later MS technology (Server 2008/SQL 2008/etc), and the fact is that 4-8 year old hardware just can't cut it any more.

I've been running Xen for many years now with great success, but I really wanted the chance to start testing other hypervisors such as VMware's, and to test some of the newer MS products in VMs. Windows VMs of course require hardware assistance, so my older hardware won't do (even slowly), regardless of the hypervisor used (with the possible exception of qemu, which isn't technically a hypervisor anyway).

Real Servers or whiteboxes again?

I had a quick look into the pricing and options for buying a cheap commercial server from the likes of Dell, having had good success with my last purchase of an entry level IBM x226 as my file server. The big killer when it comes to servers is the RAM pricing. VM hosts need lots of RAM and I wanted a minimum of 8GB per server. ECC RAM is not cheap, and that's the only kind you can get even with entry level servers, so having reviewed the options I ended up going down the DIY/whitebox path.

A friend from work builds PCs all the time and he was helpful enough to suggest a base config which I just tweaked a bit to come out with my final component list.

The final configuration (times two) was:

Intel Core 2 Quad Q9550 2.83Ghz CPU
XFX MG-N780-ISH9 motherboard
Coolermaster Elite 330 Black Case
Lite-on 20A4P PATA DVD Burner
Corsair XMS2 DDR2 1066MHz 2GB RAM pair (x2)
Silverstone Olympia OP700 700W Power Supply
ASUS 8400GS silent video card

The CPU was the fastest I could find readily, though I think there is a 3GHz version out there somewhere. The motherboard is very much a gaming motherboard, in fact it's triple SLI. Obviously the video prowess wasn't the goal, but given that it's built for high throughput and overclocking, it should be a very stable board. The power supply is bigger than I need, which should add to the reliability, and the case was pretty much the cheapest one I could find.
The video card and DVD burners will probably be repurposed and shuffled at a later time, but to get things started I needed both to initially build the systems. I have plans to use one of the video cards in my HTPC for h264 decoding as covered here. I'm actually using the exact same card they used, to be sure that it will work :)
I went for 8GB of RAM, which is actually the maximum that these motherboards support anyway. I also opted to buy the faster DDR1066 version for some possible overclocking and some extra margin above DDR800.

With everything "overspeced", they should be very reliable.

I haven't included any hard drives as initially I'll be running ESXi, and attempting to use it from a USB key. I purchased a pair of "high speed" 4GB usb keys for this.

File Server RAM

Finally, I also ordered another 4GB of RAM for my file server. It's been humbly chugging along with 1GB serving up nfs/cifs while also running 4 zones, but once I start throwing VM traffic at it too (iscsi/nfs) it won't have the RAM to cache anything, and the performance is going to suffer big time. Being a server, I had to pay way too much for the ECC RAM. I did find some Kingston aftermarket DIMMs that are guaranteed to work, rather than the stupidly priced IBM OEM RAM. The documentation on the RAM configuration for the x226 is very confusing, so I still don't actually know if I'll be able to use my current pair of 512MB DIMMs in conjunction with the 2x2GB DIMMs that I've ordered. Worst case, I'll have 4GB of RAM; best case, 5GB. I can live with that.

Hardware delivery

While most of my new "server" hardware arrived before christmas, the CPUs were on backorder; especially frustrating as they WERE in stock when I placed my order, specifically so that I could work on this over the holidays.
This meant that I couldn't actually start on anything until the 29th of December.
It turns out that my server RAM has been delayed too, which is annoying, but I can start testing everything with only 1GB.