Sunday, September 14, 2008

zoneadm & zfs clones

Having spent nearly 2 hours chasing my tail on this, it's time to blog about it for future safekeeping - that, and Google didn't turn up any good pages discussing this problem for me, so perhaps it will benefit someone else once Google crawls this entry.

Firstly, my environment:
I'm using Solaris Express Community Edition, build 95, on a machine named supernova.
Supernova has 2 zpools: rpool is the default ZFS root pool that we all know about; the second is an 8-disk raidz zpool, named (very creatively) Z.

Background:
I was trying to clone a zone that I'd [mostly] set up earlier, so that my latest zone didn't waste a whole bunch of disk space on duplicate data.
Once/if ZFS de-duplication comes along, perhaps this will become a non-issue, but for now I really like the idea of creating new zones in under 1 second and not wasting disk space doing it. As an added bonus, there is a chance performance will be slightly better, due to shared caching of data blocks.

I already have 2 zones set up on supernova - dns and proxy - and tonight it was time to make a start on a mail server zone to replace my SUSE Linux, Xen-based mail server.

Unfortunately it's now the end of the night and I'm still no closer to actually setting up the new mail server, but I did at least succeed in getting a ZFS-cloned "mail" zone running and configured. And here it is, captured for all time.

The problem was that zoneadm was simply refusing to actually clone my zone. Instead it was copying all the data across.

Here is what I started off with:

NAME                     USED   AVAIL  REFER  MOUNTPOINT
Z                        1.30T   567G  43.0G  /Z
Z/backups                64.7G   567G  34.8K  none
Z/backups/angelous       64.7G   567G  44.4G  /Z/backups/angelous
Z/backups/angelous/home  20.3G   567G  1.77G  /Z/backups/angelous/home
Z/backups/supernova      33.0K   567G  33.0K  /Z/backups/supernova
Z/media                  1.15T   567G  1.15T  /Z/media
Z/storage                45.3G   567G  45.3G  /Z/storage
Z/zones                  1.79G   567G  36.5K  none
Z/zones/dns              15.7M   567G   374M  /zones/dns
Z/zones/proxy            1.70G   567G  1.69G  /zones/proxy
rpool                    6.66G  30.0G    36K  /rpool
rpool/ROOT               5.65G  30.0G    18K  legacy
rpool/ROOT/snv_95        5.65G  30.0G  5.36G  /
rpool/export             66.5K  30.0G    28K  /export
rpool/export/home        38.5K  30.0G  38.5K  /export/home

We'll be focusing on the Z/zones lines and on rpool/ROOT/snv_95, which provides /.
First, let's look at which directories map to which ZFS datasets and, for that matter, which zpools.

rpool provides /, and most of Z is mounted under /Z/.
The zones, however, are under /zones/*, which seemed logical enough to me.
I mainly use my raidz for storage of media and backups, whereas my zones, as part of the core OS, are mounted directly off root as you would expect.
I wanted to make use of the 8-disk raidz for both redundancy and performance for the zones, so putting them on rpool was not the goal.

Z/zones was set with no mountpoint.
I did this because it doesn't actually contain any data, and since every zone will have its own ZFS dataset, it never needs to. I've been caught out before when I've created a series of nested datasets, copied data into an "admin" parent directory, and then "lost" it when a child dataset was mounted over the same namespace. Sure, in unix-speak everything just mounts on top, but that's also a curse sometimes, and things can get hidden!
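For what it's worth, recreating a layout like this by hand would look roughly like the following (just a sketch to illustrate the idea - the zone datasets themselves are normally created by zoneadm, not manually):

# admin parent with no mountpoint, so it never holds data of its own
zfs create -o mountpoint=none Z/zones
# children with explicit mountpoints under /zones
zfs create -o mountpoint=/zones/dns Z/zones/dns
zfs create -o mountpoint=/zones/proxy Z/zones/proxy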

So, Z/zones has no mountpoint, but its children are mounted at /zones/dns and /zones/proxy.
Where exactly is /zones coming from, if not from Z/zones?

When I first started, it was coming from /, which is rpool - also a ZFS filesystem, but the wrong zpool, which is important, as I'll come to later.

I'd been running quite happily like this ever since upgrading supernova to snv_95, and dns is actually a clone of proxy - so I know that clones do work.
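For reference, a ZFS clone done by hand boils down to something like this (the snapshot name here is made up - zoneadm picks its own):

# clones are created from a snapshot and initially share all of its blocks,
# so they appear near-instantly and consume almost no extra space
zfs snapshot Z/zones/proxy@mysnap
zfs clone Z/zones/proxy@mysnap Z/zones/dns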

Yet tonight, when I tried doing a zoneadm -z mail clone proxy, it did two things wrong: it wasn't creating a dedicated ZFS dataset for mail, and it wasn't cloning, it was copying... it even told me so.
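For completeness, the sequence I ran was roughly this (configuration details from memory):

# copy the proxy zone's configuration and give the new zone its own path
zonecfg -z mail create -t proxy
zonecfg -z mail set zonepath=/zones/mail
# then the clone itself... which copied instead of cloning
zoneadm -z mail clone proxy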

Why? I'm running ZFS root, so I was sure that the prerequisite of the zonepath being on ZFS was being met!

The zonepath for mail, according to zonecfg -z mail, was as follows:
zonename: mail
zonepath: /zones/mail

...which was exactly the same layout as dns and proxy.
The issue was all around the /zones directory, which technically didn't exist on my Z zpool at all - it was coming from /.

Now clearly ZFS isn't going to be able to clone across zpools, so that explains why the clone failed. A warning/explanation would have been nice though!

But what about simply creating a zfs dataset on rpool then?
I think it wasn't doing that because zoneadm is being quite clever about its use of naming and ZFS mountpoint inheritance.

Based on my mucking around with this, it seems that zoneadm looks at the zonepath you've given it, takes the parent dataset from that, and then attempts to create a child dataset based on the path you've provided.

I set the zonepath to /zones/mail, and I think the code then tries to access a /zones ZFS dataset. In my case, /zones wasn't a dataset... it was in a dataset, but it wasn't itself one.
I guess zoneadm likes zones to sit directly below dedicated ZFS datasets, as it can then inherit everything it needs. One has to remember that while ZFS datasets have a concept of hierarchy, it doesn't necessarily match the logical directory layout on disk.
In fact, I do just this already with Z.
The root of the Z zfs is mounted at /Z, yet Z/zones/dns is mounted at /zones/dns - up a directory from /Z and then down another, into a location that now also happens to be ZFS, but which was UFS until my upgrade to snv_95.

ZFS and the unix mount-anywhere model are almost too flexible sometimes - it gets confusing!

So zoneadm checked the provided path for a parent ZFS dataset, but because /zones wasn't a dataset, it had no way of knowing that I meant for it to create a child of Z/zones, since that wasn't mounted there.
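An easy way to spot this sort of mismatch is to compare which filesystem actually backs the directory with where the datasets think they should be mounted:

# what really backs /zones right now (in my case: rpool's root filesystem)
df -h /zones
# versus the mountpoints Z/zones and its children claim
zfs list -r -o name,mountpoint Z/zones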

So in the end I was having a double failure, which produced my very weird results. At least I now understand why it was happening.

The fix was easy: I shut down my zones, zfs umounted them, and then deleted the /zones directory from / (rpool).
With that out of the way, I set the mountpoint of Z/zones to /zones, issued a zfs mount -a, and repeated my zoneadm -z mail clone proxy.
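For the record, the whole fix was roughly this (from memory, using my zone and dataset names):

# halt the zones so their datasets can be unmounted
zoneadm -z dns halt
zoneadm -z proxy halt
zfs umount Z/zones/dns
zfs umount Z/zones/proxy

# remove the stray /zones directory that lived on rpool's root filesystem
rm -rf /zones

# let Z/zones own /zones, remount everything, and try the clone again
zfs set mountpoint=/zones Z/zones
zfs mount -a
zoneadm -z mail clone proxy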

This time around we were away. Fast-forwarding a bit, here is where I'm at now:

NAME                     USED   AVAIL  REFER  MOUNTPOINT
Z                        1.30T   567G  43.0G  /Z
Z/backups                64.7G   567G  34.8K  none
Z/backups/angelous       64.7G   567G  44.4G  /Z/backups/angelous
Z/backups/angelous/home  20.3G   567G  1.77G  /Z/backups/angelous/home
Z/backups/supernova      33.0K   567G  33.0K  /Z/backups/supernova
Z/media                  1.15T   567G  1.15T  /Z/media
Z/storage                45.3G   567G  45.3G  /Z/storage
Z/zones                  1.79G   567G  36.5K  /zones
Z/zones/dns              15.7M   567G   374M  /zones/dns
Z/zones/mail             83.1M   567G  1.72G  /zones/mail
Z/zones/proxy            1.70G   567G  1.69G  /zones/proxy
rpool                    6.66G  30.0G    36K  /rpool
rpool/ROOT               5.65G  30.0G    18K  legacy
rpool/ROOT/snv_95        5.65G  30.0G  5.36G  /
rpool/export             67.5K  30.0G    28K  /export
rpool/export/home        39.5K  30.0G  39.5K  /export/home
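If you want to confirm that the new zone's dataset really is a clone rather than a copy, the origin property gives it away - a clone reports the snapshot it was created from, while an ordinary dataset shows "-":

zfs get origin Z/zones/mail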


Lesson to be learned: if you wish to have ZFS clone-based zones, the parent directory of the zonepath MUST itself be a ZFS dataset, or zoneadm will get very confused.

First Post!!!!111

Well, it's high time that I started dumping some of this stuff out of my head.
One of the most frustrating things for me is hitting a problem and getting stuck, whilst simultaneously remembering that this is a challenge I've faced before - one I've already done the hard yards solving - yet I can't remember the answer and end up having to figure it out all over again.

I've also read from many sources that blogging is good for the psyche (and who doesn't like a good vent now and then... and then... and then too!), and also that it improves wordpower, as more practice goes into higher quality writing.

Consider this a well-rounded experiment then, to see if any of this turns out to be true, and to see if I end up referring back to older posts in search of the ever-elusive "ah HA!" moments.

Largely I expect this to be a private blog, without readership. I've chosen blogger.com on a whim; it was just quick and easy. Perhaps I'll self-host it in the future, although I can't imagine there being many benefits to doing so, and frankly I have enough work already!