Thursday, January 1, 2009

Whitebox burn-in and VMware

With everything assembled, I went straight into benchmarking and testing.
The BIOS on the XFX is very nice to use, with heaps of information and very fine-grained voltage/FSB adjustments.

I quickly discovered that I could take my 2.83GHz chip up to 4GHz; I seem to remember getting to about 4.08GHz before it would stop POSTing, and I didn't want to push the core voltage too high.

RAM

From here I went on to discover a number of very interesting things about burn-in testing and memory speed. I had purchased brand-name DDR1066 RAM, and with the motherboard supporting all the way up to DDR1200, I should be able to run it at its "factory" DDR1066 settings, right?...

It turns out that Corsair advise running the RAM at a whopping 2.1V (1.8V is the default). I wouldn't have dreamed of pushing my RAM's voltage that high for fear of breaking it, but sure enough I confirmed this on their website.

Frustratingly, memtest86 was still giving me RAM errors at 2.1V, even with the CPU/FSB running at their factory defaults. Raising it to 2.15V didn't cure the problem either.
I thought it might just be a bad DIMM, so I pulled a pair out... no problem. I switched to the presumably faulty pair... no problem there either.
Wait, what???!

Then I remembered discovering something similar back when I built my AMD desktop some years ago. The memory controller on that CPU (on-die, this being AMD) just couldn't drive all 4 DIMMs at their uprated speeds, so if you ran 4 DIMMs you had to drop the DDR speed down. With this system being an Intel with a dedicated off-CPU memory controller I never gave it a thought, but here I was with exactly the same symptoms. I'd already upped the SPP and FSB voltages (despite the FSB running at a stock 333MHz)... no dice.

I did explore running at DDR1000, but after eventually getting another crash I finally conceded defeat and fell back to DDR800 with my DDR1066 RAM. Very disappointing, but I'm not sure the RAM specifically is at fault, and RMAing it would be very challenging.

CPU limits?

On the CPU side, I had settled on 3.83GHz, which seemed pretty stable, with CPU temperatures in the high 50s (°C) at idle. Unfortunately I couldn't easily monitor the temperatures under load, save for thrashing the CPU, rebooting as quickly as possible and jumping into the BIOS to check the readings. There are some Windows tools for monitoring the motherboard's sensors, but I was running Ubuntu/ESX the whole time.
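(For what it's worth, on a bare-metal Linux install the kernel's hwmon layer will expose whatever sensors it has a driver for, and a few lines of Python will dump them; a sketch is below. Whether this board's sensor chip actually has a driver is another question, and it's no help at all once ESXi is on the box.)

```python
# Sketch: dump whatever temperature sensors Linux's hwmon layer exposes.
# Only useful on a bare-metal Linux install, and only if the board's
# sensor chip has a kernel driver -- no help under ESXi.
import glob

paths = glob.glob("/sys/class/hwmon/hwmon*/temp*_input")
paths += glob.glob("/sys/class/hwmon/hwmon*/device/temp*_input")

for path in sorted(paths):
    with open(path) as f:
        millidegrees = int(f.read().strip())  # values are in 1/1000 degC
    print(f"{path}: {millidegrees / 1000:.1f} C")
```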

After an extended period of load at 3.83GHz I'd get weird failures, and even at 3.6GHz I'd have problems once I'd installed the servers in my server room. I guess there isn't the same airflow in there, so it's a little less forgiving. For this testing I was running memtest in a 2-vCPU VM alongside a 2-vCPU XP VM running prime95.
Prime95 has turned out to be extremely useful. It was forever picking up errors in its own calculations, hinting at a memory or CPU fault; usually memory. What's interesting is that memtest wasn't picking anything up, so my guess is that it comes down to the FPUs being hammered in addition to the RAM itself, whereas memtest only exercises the RAM.
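To illustrate the principle (a toy only, nothing like a substitute for prime95): a self-checking stress test just repeats a deterministic FP-heavy computation and compares results, since on healthy hardware every pass should come out bit-identical.

```python
import math
import sys

def workload(seed: float, n: int = 200_000) -> float:
    # Deterministic, FP-heavy work that also touches a few MB of RAM.
    data = [math.sin(seed + i) * math.sqrt(i + 1.0) for i in range(n)]
    return math.fsum(data)

reference = workload(1.2345)
passes = 0
while True:
    passes += 1
    result = workload(1.2345)
    if result != reference:
        # On healthy hardware this never fires: same inputs, same FP ops,
        # same answer. A mismatch points at the CPU/FPU or the memory.
        sys.exit(f"mismatch on pass {passes}: {result!r} != {reference!r}")
    if passes % 50 == 0:
        print(f"{passes} passes OK")
```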

Either way, prime95 was frequently telling me I had issues even when everything else appeared to be running fine. I even installed Server 2008 in another VM on the same host while prime95 was reporting errors. I'd really started to conclude that prime95 itself was the problem, but sure enough a crash or purple screen would come along soon enough.

3.4GHz has been very stable, with prime95 and memtest running in VMs all night long without issues, so I've settled on that. The CPU voltage in the BIOS is set to 1.250V, but the CPU actually gets around 1.19V thanks to what I've since discovered is known as "voltage droop".

I think the default voltage is 1.15V, so the CPU is hardly any warmer for the extra ~570MHz per core. Across four cores that's over 2GHz of extra aggregate clock, which is worth having in my book!

VMware ESXi

I did quite a bit of googling to try to work out whether my XFX MG-V780-ISH9 motherboard would be supported by ESX, so for the benefit of anyone else who hits this blog: both NICs work using the forcedeth driver. IDE/PATA also works, though you'll have to hack the ESXi installer if you want to install onto a PATA disk; weirdly, ESXi actively ignores PATA disks during install while happily letting you use them for VMFS stores afterwards. I found a guide somewhere to bypass the IDE restriction at install time, but I've ultimately ended up using USB keys for booting anyway (less noise/heat... and maybe even faster?).
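From memory, the guide amounted to editing the installer's Python disk-filter so that it stops excluding IDE devices. Something along these lines, though the file and function names below are approximate, reconstructed from memory rather than lifted from a real installer image:

```python
from collections import namedtuple

# Illustrative only: the ESXi installer is written in Python, and the hack
# boils down to neutering a predicate like this one. The names here are
# made up for the sketch, not taken from the actual installer source.
Lun = namedtuple("Lun", "name interface_type")

def ide_filter(lun):
    # Approximation of the stock behaviour: keep a LUN only if it is NOT
    # IDE, which is what hides PATA disks from the install-target list.
    return lun.interface_type != "IDE"

def ide_filter_patched(lun):
    # The hack: accept everything, so PATA disks show up as targets.
    return True

disks = [Lun("vmhba0:0:0", "IDE"), Lun("vmhba32:0:0", "SCSI")]
print([d.name for d in disks if ide_filter(d)])          # SCSI disk only
print([d.name for d in disks if ide_filter_patched(d)])  # both disks
```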

I'm not sure if the SATA ports work; I haven't tried them.

Having spent a bit more time hacking at ESXi now, I must say it's very annoying to use. We all know about the "u n s u p p o r t e d" hack to get a console, and yes, you can enable SSH access (for now at least), but the service console has very little in it. I know that's kind of the whole point, but boy does it make troubleshooting a pain in the ass!
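For anyone who hasn't seen the SSH trick: once you're at the console it boils down to uncommenting the ssh entry in /etc/inetd.conf and then getting inetd to re-read it (or rebooting). A throwaway sketch of that edit, with the path and line format as I remember them from ESXi 3.5, so verify on your own box first:

```python
# Sketch: strip the leading '#' from the commented-out ssh entry in
# /etc/inetd.conf. Path and line format are from memory (ESXi 3.5 era);
# check your own file before trusting this.
INETD_CONF = "/etc/inetd.conf"

with open(INETD_CONF) as f:
    lines = f.readlines()

with open(INETD_CONF, "w") as f:
    for line in lines:
        if line.startswith("#ssh"):
            line = line[1:]  # un-comment the ssh service line
        f.write(line)

# afterwards: restart inetd (e.g. send it a HUP) or reboot the host
```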

Another ESXi-specific oddity is the networking for the service console. Under ESX you have three types of network:
VM networks
Service console networks
VMkernel networks.

ESXi merges the Service Console and VMkernel networks into one.
I've been using NFS as the backing storage to get things up and running quickly, and I discovered very early on that disk-usage performance metrics simply don't exist with NFS; it all shows up as network traffic.
If you want to know how much disk load an individual VM is generating, you can't look at its disk performance information (there isn't even a drop-down for disk), and the network metrics are literally just the VM's own network traffic (not its underlying NFS traffic).

That just leaves monitoring at the host level, and with the service console's network data sharing the same stats as the NFS traffic, it all gets rather muddled.

Rather a weird way of doing things, VMware!
