This is a big one to tackle in a single post, but the question comes up often enough that it’s worth a try. To answer it, it helps to understand what each option does:
VMware HA
What it does: VMware HA detects host and VM failures (VM heartbeat, etc.). On a failure, it attempts to restart the affected VM or group of VMs on another node in the VMware cluster.
What it doesn’t: VMware HA will not detect application- or OS-level failures (VM heartbeat, etc. excepted). What this means: your SQL VM will only fail over from its host to another cluster node after a catastrophic failure, such as someone sticking a screwdriver into the ESX host.
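To make that distinction concrete, here is a toy Python model of the behavior described above. It is only a sketch and not VMware's API: the `VM` class, the `ha_restart` function, and the host and VM names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    host: str
    app_healthy: bool = True  # state of the application inside the guest


def ha_restart(vms, failed_hosts, surviving_hosts):
    """Toy model of HA: restart VMs from failed hosts on surviving hosts.

    Note that app_healthy is never consulted. In this model, as in the
    description above, HA reacts to host failures, not application failures.
    """
    affected = [vm for vm in vms if vm.host in failed_hosts]
    for i, vm in enumerate(affected):
        # Spread restarted VMs round-robin across the surviving hosts.
        vm.host = surviving_hosts[i % len(surviving_hosts)]
    return [vm.name for vm in affected]


vms = [VM("sql01", "esx1"), VM("web01", "esx1"), VM("acct01", "esx2")]

# The SQL service dies inside the guest, but no host has failed:
vms[0].app_healthy = False
nothing_happened = ha_restart(vms, failed_hosts=set(), surviving_hosts=["esx2"])

# The screwdriver scenario: esx1 itself goes down.
restarted = ha_restart(vms, failed_hosts={"esx1"}, surviving_hosts=["esx2"])
```

In the first call nothing is restarted, even though SQL is down. Only the host failure in the second call moves `sql01` and `web01` over to `esx2`.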
MSCS (SQL Clustering)
What it does: MSCS will detect the failure of any one of its cluster resources and take the defined action. What does that mean? Each cluster resource can have any number of dependencies and a failure action such as Move Resource Group. This setup will also protect you from the catastrophic failure of a host, in that the SQL services will fail over to a VM that is running on the other node.
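The resource, dependency, and failure-action idea can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not the MSCS API: the `Resource` and `ResourceGroup` classes, the resource names, and the "move to the first other node" policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    depends_on: list = field(default_factory=list)  # names of other resources
    healthy: bool = True


@dataclass
class ResourceGroup:
    name: str
    owner: str   # node currently hosting the group
    nodes: list  # all nodes that are allowed to own the group
    resources: list = field(default_factory=list)

    def online_order(self):
        """Order resources so that every dependency comes before its dependent."""
        by_name = {r.name: r for r in self.resources}
        order, seen = [], set()

        def visit(res):
            if res.name in seen:
                return
            seen.add(res.name)
            for dep in res.depends_on:
                visit(by_name[dep])
            order.append(res.name)

        for res in self.resources:
            visit(res)
        return order

    def check(self):
        """Failure action 'move group': if any resource failed, change owner."""
        failed = [r.name for r in self.resources if not r.healthy]
        others = [n for n in self.nodes if n != self.owner]
        if not failed or not others:
            return None
        self.owner = others[0]
        for r in self.resources:
            r.healthy = True  # assume everything comes online on the new owner
        return failed


sql = Resource("sql", depends_on=["disk", "netname"])
group = ResourceGroup(
    name="SQL Group",
    owner="node1",
    nodes=["node1", "node2"],
    resources=[sql, Resource("netname", depends_on=["ip"]),
               Resource("ip"), Resource("disk")],
)

sql.healthy = False     # the SQL service fails; the host is fine
moved = group.check()   # the whole group moves to node2
```

Unlike the host-level view, a failure of just the `sql` resource is enough to trigger the move here, which is the point of clustering at the application level.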
What you lose
You lose VMware HA for the 2-4 MSCS VMs. Knowing what MSCS does, that may not be a problem, so long as you set up appropriate DRS rules to keep the MSCS VMs from running on the same host.
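What such a rule enforces is easy to check by hand. The sketch below is plain Python, not DRS or any VMware tooling: `anti_affinity_violations` is a hypothetical helper, and the VM and host names are invented. It simply flags hosts that run more than one VM from the MSCS group.

```python
from collections import defaultdict


def anti_affinity_violations(placement, group):
    """Return {host: [vm, ...]} for every host running 2+ VMs of the group.

    placement maps VM name -> host name; group lists the MSCS VM names.
    """
    by_host = defaultdict(list)
    for vm in group:
        by_host[placement[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}


mscs_vms = ["sql-a", "sql-b"]

# Both cluster nodes on one host: a single host failure takes out SQL entirely.
bad = {"sql-a": "esx1", "sql-b": "esx1", "web01": "esx1"}
bad_result = anti_affinity_violations(bad, mscs_vms)

# Nodes separated: losing either host leaves one MSCS node running.
good = {"sql-a": "esx1", "sql-b": "esx2", "web01": "esx1"}
good_result = anti_affinity_violations(good, mscs_vms)
```

The first placement reports `esx1` as a violation; the second reports none. Non-MSCS VMs such as `web01` are ignored, since the rule only covers the clustered pair.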
What you DON’T lose
This one is so critical that I used caps in the section header! Really! Why? Because while you give up some of the more advanced features (HA, etc.) for the MSCS VMs by going with MSCS, you DON’T (there go them caps again) lose the ability to have HA for the remainder of your VMs. That’s right, your VM web heads and that accounting VM will still have the advanced features available to them (that is, if you have vCenter and are licensed for them).
Cost. There are costs associated with either method. For instance, VMware HA requires a vCenter license and a vCenter server to make it work. MSCS requires Windows to be licensed appropriately for both nodes. Both solutions require some form of shared storage medium.
Supportability. While it can be done, MSCS on ESX adds some complexity to the design that would not otherwise be present. Is it a SAN issue? A VM issue? Heartbeat networking? Each piece that changes from your standard method adds complexity to the solution, and makes it more ‘interesting’ to troubleshoot.
Which is best?
This is really up to you and what your environment requires. After all, who knows the complexity and requirements of your design better than you? Well… perhaps that Leprechaun from down the street, but alas. The notes above should help clear up the choice.
Questions? Comments? Other issues I missed? Drop me a note in the comments or via Twitter.