Tag Archive for vSphere

Downtime sucks! Designing Highly Available Applications on a Budget

Downtime sucks.

I write this whilst sitting in an airport lounge, having been disembarked from my plane due to a technical fault. I don’t really begrudge the airline in question; it was a plumbing issue! This is a physical failure of the aircraft in question and just one of those things (unless I find out later they didn’t do the appropriate preventative maintenance of course)! Sometimes failures just happen and I would far rather it was just a plumbing issue, not an engine issue!

What is not excusable, however, is if the downtime is easily preventable; for example, if you are designing a solution which has no resilience!

This is obviously more common with small and medium sized businesses, but even large organisations can be guilty of it! I have had many conversations in the past with companies who have architected their solutions with significant single points of failure. More often than not, this is due to the cost of providing an HA stack. I fully appreciate that most IT departments are not swimming in cash, but there are many ways to work around a budgetary constraint and still provide more highly available, or at least “Disaster Resistant” solutions, especially in the cloud!

Now obviously there is High Availability (typically within a single region or Data Centre), and Disaster Recovery (across DCs or regions). An ideal solution would achieve both, but for many organisations it can be a choice between one and the other!

Budgets are tight, what can we do?

Typically HA can be provided at either the application level (preferred), or if not, then at the infrastructure level. Many solutions to improve availability are relatively simple and inexpensive. For example:

  • Building on a public cloud platform (and assuming that the application supports load balancing), why not test running twice as many instances with half the specification each? In most cases, unless there are significant storage quantities in each instance, the cost of scaling out this way is minimal.
    If there is a single instance, split it out into two instances, so that the loss of one node no longer takes the whole application offline. If there are two instances, what about splitting into 4? The impact of a node loss is then only 25% of the overall throughput capacity for the application, and this can even bring down the cost of HA for applications where the +1 in N+1 is expensive!
  • Again in cloud, if there are more than two availability zones in a region (e.g. on AWS), then take advantage of them! If an application can handle 2 AZs, then the latency of adding a third shouldn’t make much, if any, difference, and costs will only increase slightly with a small amount of extra inter-AZ bandwidth or per-AZ services (e.g. NAT gateways).
    Again, in this scenario the loss of an AZ will only take out 33% of the application servers, not 50%, so it is possible to reduce the number of servers which are effectively there for failover only.
  • If you can’t afford to run an application as multi-AZ or multi-node, consider putting it in an auto-scaling group or scale-set with a minimum and maximum of 1 server. That way, if the instance fails or, in the case of AWS, an entire AZ goes down, a replacement instance will automatically be launched, in an alternative AZ if need be (provided the group is allowed to span more than one AZ).
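To put some rough numbers on the scale-out argument above, here is a minimal sketch of the arithmetic. It assumes equally sized nodes (or AZs) and evenly balanced load, which real applications only approximate, and the function names are purely illustrative.

```python
def capacity_lost_on_failure(units):
    """Fraction of total capacity lost when one of `units` equal nodes (or AZs) fails."""
    if units < 1:
        raise ValueError("need at least one unit")
    return 1 / units


def provisioned_capacity_for_n_plus_1(units):
    """Total capacity to provision, as a multiple of peak load, so that the
    surviving units can still carry the full load after losing one of them."""
    if units < 2:
        raise ValueError("N+1 needs at least two units")
    return units / (units - 1)


for units in (2, 3, 4):
    lost = capacity_lost_on_failure(units)
    provisioned = provisioned_capacity_for_n_plus_1(units)
    print(f"{units} units: lose {lost:.0%} on a failure, "
          f"provision {provisioned:.0%} of peak load to ride it out")
```

With 2 units you lose 50% on a failure and need to provision 200% of peak load to survive it; with 3 units that drops to 33% and 150%, and with 4 units to 25% and 133%. That is where the saving on the "+1" comes from.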

What if my app doesn’t like load balancers?

If you have an application which cannot be load balanced, you probably shouldn’t be thinking about running it in the cloud (not if you have any serious availability requirements anyway!). It amazes me how many business critical applications and services are still running on single servers all over the world!

  • If your organisation is dead set on using cloud for a SPoF app, then making it as ephemeral as possible can help. Start by splitting the DBs from the apps, as these can almost always be made HA by some means (e.g. master/slave replication, mirroring, log shipping, etc). Failover nodes also often don’t attract a license fee from many vendors (e.g. MS SQL), so always check your license documentation to see what you can achieve on the cheap.
  • Automate! If you can deploy application server(s) from a script, even if the worst happens, the application can be redeployed very quickly, in a consistent fashion.
    The trend at the moment is moving towards a more agile deployment process and automated CI/CD pipelines. This enables companies to recover from an outage by rebuilding their environments and redeploying code rapidly (as long as they have a replica of the data or a highly available datastore!).
  • If it’s not possible to script or image the code deployment, then taking regular backups (and snapshots where possible) of application servers, and testing them often, is an option! If you don’t want to go through the inflexible, unreliable and painful nightmare of doing system state restores, then take image-based backups (supported by the vast majority of backup vendors nowadays). You could even sync application data to a warm standby server which can be brought online reasonably swiftly, or use an inexpensive DR service such as Azure Site Recovery, to provide an avenue of last resort!
  • If maybe cloud isn’t the best place to locate your application, then provide HA at the infrastructure layer by utilising the HA features of your favourite hypervisor!
    For example, VMware vSphere will have an instance back up and running within a minute or two of the failure of a host using the vSphere HA feature (which comes with every edition except Essentials!). On the assumption/risk that the power cycle does not corrupt OS, applications or data, you minimise exposure to hardware outages.
  • If the budget is not enough to buy shared storage and all VMs are running on local storage in the hypervisor hosts (I have seen this more than you might imagine!), then consider using something like vSphere Replication or Hyper-V Replica to copy at least one of each critical VM role to another host, and if there are multiple instances, then spread them around the hosts.
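As a rough illustration of "spread them around the hosts", the sketch below deals VMs out to hosts round-robin, one role at a time, so that instances of the same role land on different hosts wherever there are enough hosts to go round. It is only a planning aid under that assumption; the host and VM names are made up and it does not talk to any hypervisor API.

```python
from itertools import cycle


def spread_vms(vms_by_role, hosts):
    """Return a {vm_name: host_name} plan that places VMs of the same role
    on consecutive hosts, so they only share a host when a role has more
    instances than there are hosts."""
    if not hosts:
        raise ValueError("at least one host is required")
    placement = {}
    next_host = cycle(hosts)
    for role in sorted(vms_by_role):
        for vm in vms_by_role[role]:
            placement[vm] = next(next_host)
    return placement


# Hypothetical inventory, for illustration only
plan = spread_vms(
    {"web": ["web1", "web2"], "db": ["db1", "db2"]},
    ["host-a", "host-b", "host-c"],
)
for vm, host in sorted(plan.items()):
    print(f"{vm} -> {host}")
```

The same idea is what anti-affinity rules give you on platforms that support them; the script just makes the intent explicit when you are placing VMs by hand.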

Finally, make sure whatever happens there is some form of DR, even if it is no more than a holding page or application notification and a replica or off-site backup of critical data! Customers and users would rather see something telling them that you’re working to resolve the problem, than getting a spinning wheel and a timeout! If you can provide something which is of limited functionality or performance, then it’s better than nothing!

TL;DR: High Availability on a Budget

There are a million and one ways to provide more highly available applications; these are just a few. The point is that providing highly available applications is not as expensive as you might initially think.

With a bit of elbow grease, a bit of scripting and regular testing, even on the smallest budgets you can cobble together more highly available solutions for even the crummiest applications! 🙂

Now go forth and HA!

NanoLab – Index of Tekhead.it Posts on Intel NUC VMware vSphere Homelabs

Having started in January 2013, my Intel NUC series has reached the heady heights of double digits over the past few years, so I figured it might be handy to make the posts a bit easier to find!

Nanolab Posts

More posts coming soon… 🙂

Just in case I forget to keep this page updated:

Intel NUC Nanolab blog posts

Quick Fix for “The task was canceled by a user” when deploying OVA in vCenter 6

So I came across a very odd vCenter bug today when trying to deploy an OVA file on vSphere 6.0, specifically the latest CoreOS image.

The import was repeatedly failing with the same error message.

What was more frustrating was the fact that the error message was “The task was canceled by a user”, which it blatantly was not!

Error log example below:

OVA Import Errors

A quick bit of testing and Googling and I came across an article by my good friend Ather Beg from the LonVMUG, who had a very simple fix for the same issue in vSphere 5.5.

  1. Install 7-zip or a similar archiving tool
  2. Extract the OVA file using 7-zip into its component parts
  3. Import into vCenter, selecting the OVF file for the import target

That’s it – simples!
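If you would rather not install 7-zip, the extraction step can also be scripted, because an OVA is just a tar archive containing the OVF descriptor, the disk images and (usually) a manifest. The following is a minimal sketch using only the Python standard library; the function name and the file names in the usage note are my own, not anything vCenter provides.

```python
import shutil
import tarfile
from pathlib import Path


def extract_ova(ova_path, dest_dir):
    """Unpack an OVA (a tar archive) into dest_dir and return the path
    of the OVF descriptor found inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    ovf_path = None
    with tarfile.open(ova_path, "r") as archive:
        for member in archive.getmembers():
            # Only plain files are expected; skip links, directories etc.
            if not member.isfile():
                continue
            # Drop any directory part so nothing is written outside dest.
            target = dest / Path(member.name).name
            with archive.extractfile(member) as source, open(target, "wb") as out:
                # Copy in chunks, as disk images can be large.
                shutil.copyfileobj(source, out)
            if target.suffix.lower() == ".ovf":
                ovf_path = target
    if ovf_path is None:
        raise ValueError(f"no .ovf descriptor found in {ova_path}")
    return ovf_path
```

Calling something like `extract_ova("coreos.ova", "coreos_extracted")` (hypothetical file names) returns the path of the .ovf file, which is the one to select as the import target in step 3.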



What’s really weird is that after importing the OVF successfully, I then went back and imported the OVA, and it worked fine!

Very strange indeed…

VCP6-DCV Delta Exam (2V0-621D) Study Guide and Exam Experience

Having successfully completed the VCP6-DCV Delta Exam (2V0-621D) this week, I thought it would be worthwhile jotting down a few thoughts on the exam, and noting the resources I used to prepare for it.

I’ve previously completed the VCP3, VCP4 and VCP5 “DCV” exams, however being specifically a delta exam, this one was a little different. The exam primarily covers the differences between vSphere 5 and vSphere 6, with a handful of seemingly more general questions.

For summary impressions of the exam (i.e. the TLDR), jump to the end of this article! 🙂

I used the following resources in prep for the exam:


The Exam
The exam itself was different to any previous VCP exam I’ve done. I would say that because the scope of the exam was much narrower, the depth of the questions seemed to me to be significantly greater, with a few really tricky ones thrown in there.

Overall, if I were to do it again (and when it comes time to do the VCP7 in a few years) I would probably just do the full VCP exam, rather than the delta. That way you can be sure of a decent number of the easy peasy questions which will probably be on stuff you’ve been doing for years, as well as the new stuff you may not know quite as well.

Obviously having not done the full VCP6 exam I can’t say this for sure, but I would say it’s a pretty good bet.
