This is a big one to try to tackle in a single post, but the question comes up often enough to try. To answer it well, it helps to understand what each does:
VMware HA
What it does: VMware HA detects host and VM failures (VM heartbeat, etc.). On a failure, it attempts to restart the VM or group of VMs on another node in the VMware cluster.
What it doesn’t: VMware HA will not detect application or OS level failures (excepting VM heartbeat, etc.). What this means: Your SQL VM will only fail over from its host to another cluster node after a catastrophic failure: Someone sticks a screwdriver into the ESX host, etc.
MSCS (SQL Clustering)
What it does: MSCS will detect the failure of any one of its cluster resources and take the defined action. What does that mean? Each cluster resource can have any number of dependencies and a failure action such as Move Resource Group. This setup also protects you from the catastrophic failure of a host, in that the SQL services will fail over to a VM that is running on the other node.
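To make the dependency idea a bit more concrete, here is a toy Python sketch. It is not MSCS code and does not call any Microsoft API; the resource names, node names, and both functions are made up for illustration. It only models the logic described above: a failed resource takes down everything that depends on it, and a "move resource group" action hands the whole group to another node.

```python
def failed_resources(dependencies, broken):
    """Return the broken resources plus everything that depends on them,
    directly or indirectly. `dependencies` maps resource -> list of
    resources it needs in order to run."""
    down = set(broken)
    changed = True
    while changed:
        changed = False
        for resource, needs in dependencies.items():
            if resource not in down and any(n in down for n in needs):
                down.add(resource)
                changed = True
    return down


def move_resource_group(dependencies, broken, current_node, nodes):
    """Model a 'move resource group' failure action: if anything in the
    group is down, the whole group moves to another node.
    Returns (owning node, sorted list of affected resources)."""
    down = failed_resources(dependencies, broken)
    if not down:
        return current_node, []
    others = [n for n in nodes if n != current_node]
    new_owner = others[0] if others else current_node
    return new_owner, sorted(down)


# Hypothetical SQL resource group (names are assumptions for this sketch).
sql_group = {
    "SQL IP Address": [],
    "Shared Disk": [],
    "SQL Network Name": ["SQL IP Address"],
    "SQL Server": ["SQL Network Name", "Shared Disk"],
}

owner, affected = move_resource_group(
    sql_group, {"SQL IP Address"}, "node1", ["node1", "node2"]
)
print(owner, affected)
```

In this example the IP address resource fails, which also takes down the network name and the SQL service that depend on it, so the group ends up owned by node2.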
What you lose
You lose VMware HA for the 2-4 MSCS VMs. Knowing what MSCS does, that may not be a problem, so long as you set up appropriate DRS rules to keep the MSCS VMs from running on the same host.
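The "keep them apart" rule is easy to sanity check. Below is a minimal Python sketch, assuming you can export a simple VM-to-host mapping from your inventory; it does not talk to vCenter or DRS, and the VM names, host names, and function name are all hypothetical.

```python
from collections import defaultdict


def find_anti_affinity_violations(placement, clustered_vms):
    """Return {host: [vms]} for every host running more than one of the
    VMs that are supposed to be kept on separate hosts."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm in clustered_vms:
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical inventory: VM name -> ESX host name.
placement = {
    "sql-node-a": "esx01",
    "sql-node-b": "esx01",
    "web-01": "esx02",
}

violations = find_anti_affinity_violations(
    placement, {"sql-node-a", "sql-node-b"}
)
print(violations)  # {'esx01': ['sql-node-a', 'sql-node-b']}
```

An empty result means each MSCS VM sits on its own host; anything else is a single host failure waiting to take out both cluster nodes at once.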
What you DON’T lose
This one is so critical that I used caps in the section header! Really! Why? Because while you give up some of the more advanced features (HA, etc.) for the MSCS VMs, you DON’T (there go them caps again) lose the ability to have HA for the remainder of your VMs. That’s right, your VM web heads and that accounting VM will still have the advanced features available to them (that is, if you have vCenter and are licensed for them).
Other Considerations
Cost. There are associated costs with either method. For instance, VMware HA requires a vCenter license and a vCenter server to make it work. MSCS requires Windows to be licensed appropriately for both nodes. Both solutions require some form of shared storage medium.
Supportability. While it can be done, MSCS on ESX adds some complexity to the design that would not otherwise be present. Is it a SAN issue? A VM issue? Heartbeat networking? Each piece that changes from your standard method adds complexity to the solution, and makes it more ‘interesting’ to troubleshoot.
Which is best?
This is really up to you, and what your environment requires. After all, who knows the complexity and requirements of your design better than you? Well… perhaps that Leprechaun from down the street, but alas. The notes above should help clear up the choice.
Questions? Comments? Other issues I missed? Drop me a note in the comments or via Twitter
 
									
Hi
Good points, but you forgot MS licensing costs as one of the reasons.
To cluster SQL you require Windows 2003/Windows 2008 Enterprise edition or above. Also, in most cases you will want to give your virtualised SQL server more than 1 vCPU, which again puts the costs up if you buy MS SQL 2005 Std or above.
MS SQL 2005 Enterprise edition gives you unlimited SQL 2005 VMs per virtual host, but at a premium price of course.
Sometimes it is cheaper to have the 2-way physical SQL cluster, using 1 quad core CPU per cluster node (and memory of course), which still gives you the availability, reduces hardware and software costs, and gives you the performance you need.
Cheers
David
David,
Licensing was mentioned:
“Cost. There are associated costs with either method, for instance, VMware
HA requires a vCenter license, and vCenter server to make it work. MSCS
requires Windows to be licensed appropriately for both nodes. Both solutions
require some form of share storage medium.”
But not in as much detail. I'll update the post accordingly in just a bit.
Thanks again for the heads up.
Or you could simplify your life and get much higher availability by putting this all on a Stratus ftServer. You only need one server, so you save the cost of the second or third (or more) machines. You don’t need an SQL enterprise license (~$25,000) to run on ftServer; it’s a single server with a standard license (~$2,300), which saves more than the cost of the ftServer, not to mention the simplicity of the ftServer vs. either one of these options, and greater than 5-nines uptime to boot…
Phil,
Good comment. I've not heard of ftServer prior. I'll look into it more, but
they appear to be essentially a hardware solution, which while excellent, may
not be an option for currently existing environments. It is however, another
option, and options are always great to have!
Thanks!
-Cody
Completely silly and flawed logic there with Stratus ftServer. You will never build a server that cannot fail. There will always be elements that are vulnerable, like cabling, power, fiber, switches, etc. You need to build your systems in such a way that the failure of an individual component does not matter. Servers are a cheap commodity; fault tolerance simply adds cost. This is where VMware adds value.
With the introduction of VMWare-FT and Distributed vSwitch, I think MSCS required for vCenter SQL.
Not too flawed: the ftServer has totally duplicated hardware running in lockstep, so there is no hardware failure that cannot be recovered from. In any datacentre all cabling and power will be redundant, as well as switching and data networks. The only downside of ftServers is their cost. As you point out, normal servers are relatively cheap, and thus two servers in a cluster can be a better option than the ftServer (which is essentially like having two servers in one box).