With my ESX server(s) up and running with acceptable performance using an iscsi shared zvol, it's time to think about how best to provision my storage for the ESX hosts.
When setting up FC storage all VMware admins face the architecture decision of whether to go with 1 LUN per VM or a single large LUN for all [or at least large groups of] VMs.
1 LUN per VM requires a lot of extra provisioning each time you wish to add a VM or to grow a disk, but on the up side you don't have to worry about scsi lock contention as you're effectively not using VMFS as an active/active clustered file system at all. There are possibly some performance benefits to using 1 LUN/VM if you're hitting the storage very hard, as you get a dedicated queue depth per LUN too.
In my most recent deployment we opted for 1 shared LUN per RAID (2 in total) and we never looked back. Life was simpler and we didn't experience any service impacting slow downs with snapshots or other lock related activities.
Here I'll be using the same model.
ZFS iscsi volume options
Sticking with the model of one iscsi volume shared by several VMs, what options does zfs give us for this volume?
I have around 250GB free on my zpool at the moment and there are three main options as I see it:
- Start small with a reasonable sized zvol (50GB?), and simply grow it as VMware needs more space.
- Make a large (200GB?) zvol that is unlikely to be filled in the near future.
- Thinly provision a very large (500GB) zvol that is unlikely to be filled.
1. A 50GB zvol, that I grow each time I need more space
Positives: Only uses around as much space as I actually need; when more is needed I can grow the zvol simply with "zfs set volsize=xxG".
Negatives: ESX can't directly/nicely use the additional space. When you add to the zvol size, ESX will see additional storage beyond the end of the partition that the vmfs is running within, however unlike most modern filesystems there doesn't appear to be a way to grow the vmfs partition/volume to make use of the extra space directly.
You can only add another vmfs volume/partition to the end, and then span between both vmfs volumes to make one large contiguous (virtual) volume, but that really seems pretty ugly to me. I really don't know that I feel too good about VMware taking on the role of logical volume manager, I'd rather keep that intelligence on the Solaris end where I have confidence in it.
If I add another 10GB and another VMFS extent each time I want to try another VM, I can easily see my vmfs spanning many partitions very quickly, all within the one zvol... this just seems stupid, and the spanning extent feature feels to me more like a last resort to work around storage that can't expand volumes on the fly.
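For reference, growing the zvol itself really is that simple on the Solaris side; a sketch, where the pool/volume name is just an example:

```shell
# Grow an existing zvol from 50GB to 60GB.
# "Z/esxvol" is a hypothetical pool/volume name -- substitute your own.
zfs set volsize=60G Z/esxvol

# Confirm the new logical size of the volume
zfs get volsize Z/esxvol
```

The pain is entirely on the ESX side, where the extra space beyond the existing vmfs partition can only be consumed by adding another extent.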
2. A single 200GB zvol
Positives: I quickly create this once and I'll probably not need to worry about it for some time to come, with no mucky multi-extent volumes.
Negatives: I'm immediately kissing goodbye to 200GB of space on the file server, whether it's used or not. Initially I'm only looking to run maybe 3-4 VMs, each with disks of maybe 10GB each. I use 40GB but it costs me 200GB... really not optimal!
3. A sparse 500GB zvol
Positives: PLENTY of space for VM testing, more than I currently have available in my zpool in fact, but ZFS CAN DO!
By thinly provisioning I don't immediately write off disk space on the zpool, but ESX can format and start using what it sees as a full featured 500GB lun. ZFS uses its COW technology so that only blocks containing newly written data are ACTUALLY written to the zpool, so my sparsely provisioned xxxGB volume only uses as much space as has been written to the LUN.
Negatives: This is almost perfect except for the fact that ESX thickly provisions its disks. This means that when I create a 40GB VM and install a 5GB OS, 40GB is marked as used within the zvol. ZFS doesn't know that the blocks are actually "empty" 2 levels of virtualisation up the stack. An NFS mounted VM is thinly provisioned by default, so I'm not quite getting that level of optimum volume utilisation, but NFS isn't an option... I'm just still a bit bitter about that :p
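To keep an eye on how much a sparse zvol is actually costing the pool, the standard ZFS properties tell the story; a sketch, with the volume name as an example:

```shell
# Compare the logical size against actual pool consumption for a
# sparse zvol. "volsize" is what ESX sees as the LUN size;
# "used"/"referenced" reflect what the zpool has actually spent.
# "Z/esxvol" is a hypothetical volume name -- substitute your own.
zfs get volsize,used,referenced Z/esxvol
```

As ESX thickly provisions its vmdk files, expect "used" to track the total size of the disks you've created, not the data actually written inside the guests.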
Option 3 is by far the most efficient choice of the 3 options I've evaluated, and so this is what I chose to use:
root@supernova ~]# zpool list Z
NAME   SIZE    USED    AVAIL   CAP   HEALTH   ALTROOT
Z      2.17T   1.92T   262G    88%   ONLINE   -
root@supernova ~]# zfs create -s -V 500G Z/esxiscsi
root@supernova ~]# zpool list Z
NAME   SIZE    USED    AVAIL   CAP   HEALTH   ALTROOT
Z      2.17T   1.92T   262G    88%   ONLINE   -
root@supernova ~]# zfs set shareiscsi=on Z/esxiscsi
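Note that the second zpool list shows identical USED/AVAIL figures: the sparse 500GB zvol consumed essentially nothing at creation. With shareiscsi=on set, the Solaris iscsi target daemon should now be exporting the zvol; a quick sanity check, assuming the old shareiscsi-era iscsitadm tooling:

```shell
# List the iscsi target that shareiscsi=on created for the zvol,
# including its IQN and any active connections
iscsitadm list target -v
```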