Help! I Snapshotted My Datastore Into Oblivion

This post combines two of my past posts to solve a ‘common’ problem. Common, that is, if you often leave snapshots running and have no other method of checking on them. The situation is this:

It is 3:45 PM on Friday (what good issue doesn’t happen on a Friday?), you are closing up shop for the weekend, and the phone rings. You let the first call go to voicemail; after all, it is Friday, and it can wait. Moments later, the “Bat Phone” rings, and you know what that means: a long Friday night, and likely a long weekend. After a short conversation you find that one of your VM administrators left a snapshot running on their Counter-Strike server, and it has filled the datastore. The problem, however, is that it happens to be running on the same datastore as the Payroll web server, the same Payroll web server that makes sure you get paid. Story sound interesting yet? So what do you do when this occurs?

Step 1) Kill the bunk VM. Why? It will likely be useless by the time you get to it anyway, but if it is still running and there is no space left, we need to power it off so that it does not interfere with our attempts to fix things in step 2.

Step 2) Remove the snapshot with extreme prejudice. This is critical as well; it is what got you into this situation in the first place, isn’t it? Removing the snapshot cleanly will free up the space on the datastore that you need, not only to get the Counter-Strike server back online but also to power the remaining crashed VMs back on.

Step 3) Remediate the VM Admin using whatever policy best suits you. I prefer public humiliation, and death by firing squad.
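
Steps 1 and 2 assume you already know which VM's snapshots are eating the datastore. Here is a minimal Python sketch of that triage, assuming you have exported a snapshot inventory by whatever means you have. The record fields, VM names, datastore names and sizes are all made up for illustration and do not come from any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One snapshot record; every field here is hypothetical inventory data."""
    vm_name: str
    datastore: str
    size_gb: float


def snapshot_usage_by_vm(snapshots, datastore):
    """Return (vm_name, total_snapshot_gb) pairs for one datastore, largest first."""
    totals = {}
    for snap in snapshots:
        if snap.datastore == datastore:
            totals[snap.vm_name] = totals.get(snap.vm_name, 0.0) + snap.size_gb
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative inventory only; these values are invented for the example.
inventory = [
    Snapshot("counter-strike", "datastore1", 180.0),
    Snapshot("counter-strike", "datastore1", 45.5),
    Snapshot("payroll-web", "datastore1", 2.0),
    Snapshot("test-box", "datastore2", 30.0),
]

for vm_name, used_gb in snapshot_usage_by_vm(inventory, "datastore1"):
    print(f"{vm_name}: {used_gb:.1f} GB in snapshots")
```

The VM at the top of the list is the one to power off and clean up first.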

6 thoughts on “Help! I Snapshotted My Datastore Into Oblivion”

  • While I assume this is a “what-if” example crafted to demonstrate a point, it is also a good intro into a discussion about a company's VM governance and policy:

    * Why aren't your VMs (including the datastores) sequestered based on role and SLA? You should never see “pet VMs” on the same (virtual) hardware as production systems, just as you wouldn't have them on the same vnet.
    * Unless your customer is a CS hosting provider, why are you running CS on your production equipment? (Put it under your desk on that P3 they haven't scrapped yet.)
    * Even if you are hosting for a customer, wouldn't it still be wise to sequester your internal systems (billing, payroll, etc.) from your hosted systems?

    Aside from the questions regarding VM governance, could you touch more on why you're critical of snapshots? They give you a good way to start with a vetted canned image of (something) and differentiate it as necessary. We've used them in addition to strong backup policies to facilitate recovery on systems rated for developers and testers who frequently need to step back to a golden version when things go south.

  • Excellent comment! Indeed this was a 'crafted' example, to demonstrate a
    point. The example was a bit overboard, but not one that is completely
    uncommon, as there are plenty of posts on the VMware communities forums, as
    well as in the #vmware irc channels where this comes up. In this particular
    instance, had they planned properly, and had an appropriate policy in place,
    the risk would have been minimized. As you said, it does highlight the need
    for both planning and policy:

    1) Proper planning is key!
    2) Policies are part of proper planning.
    3) Policies should include appropriate measures for segregating internal and
    external systems (resource pools, or separate gear).

    As to being critical of snapshots… I'm not really. They are part of what
    makes the entire virtualization experience magical. However, like with
    anything, without proper planning and having measures in place to ensure you
    stay within 'best practices' limits… you can find yourself in a world of
    hurt.

  • I will second the case for not preserving snapshots for a lengthy period of time.

    1 – The degradation of performance your VM will experience because of this is not worth the “advantage” of keeping the snapshot for a lengthy period of time.
    2 – The additional disk space that you will need to accommodate these snapshots is – as Cody said – something you need to plan for in advance.
    3 – In no way should snapshots be used for long-term backups. That is what you have backup software for. If you need to preserve your VM state before a change, by all means do so. But after your change has been made, validated and confirmed to be working, there is no reason to keep the snapshot for a lengthy period of time.
    4 – The longer you keep your snapshot, the larger it will grow, and the longer it will take to commit.

    My policy is to keep snapshots for at most 7 days.
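
A seven-day limit like the one in the comment above is easy to check automatically. Here is a minimal Python sketch, assuming you can get each snapshot's name and creation time from your own tooling. The tuple layout, snapshot names and dates are invented for the example.

```python
from datetime import datetime, timedelta

MAX_SNAPSHOT_AGE = timedelta(days=7)  # the seven-day limit from the comment above


def overdue_snapshots(snapshots, now, max_age=MAX_SNAPSHOT_AGE):
    """Return names of snapshots older than max_age, oldest first.

    `snapshots` is a list of (name, created_at) tuples; that shape is an
    assumption made for this sketch.
    """
    overdue = [(created, name) for name, created in snapshots if now - created > max_age]
    return [name for created, name in sorted(overdue)]


# Illustrative data only.
now = datetime(2024, 3, 15, 15, 45)
snapshots = [
    ("pre-patch", datetime(2024, 3, 14, 9, 0)),
    ("before-upgrade", datetime(2024, 2, 1, 12, 0)),
    ("cs-map-test", datetime(2024, 3, 1, 18, 30)),
]

print(overdue_snapshots(snapshots, now))
```

Run on a schedule, a check like this surfaces forgotten snapshots before they fill a datastore.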

Comments are closed.