Tag Archive for DR

Scale-Out Doesn’t Just Mean Applications

Scale Out

A couple of months ago I wrote a post entitled Scale-Out. Distributed. Whatever the Name, it’s the Future of Computing.

Taking the concept a step further, I recently started thinking about other elements in IT which are moving in that direction; not just applications and storage, but underlying infrastructure and management elements too.

Then it dawned on me that this really is not a new thing… we’ve been taking this approach for years! Technologies like VMware vSphere, have enabled us to become trusting, almost presumptuous, that we can add resources as we need them; increasing the shared pool transparently and enabling us to continue to service requirements, whilst eliminating downtime. (You can even use them to scale up on-the-fly if you really have to!)

The current breed of infrastructure engineers and startups have grown up in this era and the great thing is that this has now become part of their DNA! Typically, no longer are solutions designed from scratch to be scale-up in nature; hitting some artificial limit in capacity or having to scale specific elements of a solution to avoid nasty bottlenecks.

Instead, infrastructure is being designed to scale-out natively; distributed architectures, balancing workloads and metadata evenly across platforms. This has the added benefit, of course, of making them more resilient to failure of individual components.Distributed Systems

Backup isn’t Sexy, but it’s Necessary

One great example of this new architecture paradigm (drink!), is Rubrik, a startup in the backup space who we met at Tech Field Day 12. Their home-grown distributed file system, distributed metadata, built in off-site replication and global namespace, provide a massively scalable and resilient backup system.

All of the roles from a traditional backup solution (such as backup proxies/media servers/metadata servers, etc) are now rolled into a single, scale-out platform. As I seem to find myself saying more and more often these days, KISS personified!kiss - Keep it simple stupid EFS

With shrinking IT teams, I commonly find that companies are willing to trade budget for time savings. Utilising a simple, policy-driven management interface and enabling off-site replication to be done over-the-wire, has a lot of benefits to operational time!

As an added bonus, it can even replicate out to S3, Blob and NFS targets, to give even more options for off-site replication. Of course, a big fat pipe to the internet will cost you more each month; though you’re probably investing in that anyway, to meet your employee’s peak lunchtime demand for facebook and youtube! 🙂

Much like any complex machine, under the hood, Rubrik is pretty impressive. There is a masterless cluster management solution, multi-tier flash and disk for performance, and a clever redirect-on-write snapshot chain algorithm, which minimises capacity utilisation whilst providing very granular restores.

The key thing here, though, is we don’t really care; we are a consumer society who just wants things to work, as we have more exciting things than backup to worry about!

rubrik

TLDR;

We have enough complexity in IT these days without having to worry about backup. I would say that the simple to manage, scale-out solution from Rubrik is certainly worth considering as part of any PoC or RFP! 🙂

Further Info

You can catch the full Rubrik session at the link below:
Rubrik Presents at Tech Field Day 12

Further Reading

Some of the other TFD delegates had their own takes on the presentation we saw. Check them out here:

Disclaimer: My flights, accommodation, meals, etc at Tech Field Day 12 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors products or services.

What’s your definition of Cloud DR, and how far down do the turtles go?

Dr Evil Disaster Recovery

WARNING – Opinion piece! No Cloud Holy Wars please!

DR in IT can mean many different things to different people. To a number of people I have spoken to in the past, it’s simply HA protection against the failure of a physical host (yikes!)! To [most] others, it’s typically protection against failure of a data centre. As we discovered this week, to AWS customers, a DR plan can mean needing to protect yourself against a failure impacting an entire cloud region!

But how much is your business willing to pay for peace of mind?

When I say pay, I don’t just mean monetarily, I also mean in terms of technical flexibility and agility as well.

What are you protecting against?

What if you need to ensure that in a full region outage you will still have service? In the case of AWS, a great many customers are comfortable that the Availability Zone concept provides sufficient protection for their businesses without the need for inter-region replication, and this is perfectly valid in many cases. If you can live with a potential for a few hours downtime in the unlikely event of a full region outage, then the cost and complexity of extending beyond one region may be too much.

That said, as we saw from the failure of some AWS capabilities this week, if we take DR in the cloud to it’s most extreme, some organisations may wish to protect their business against not only a DC or region outage, but even a global impacting incident at a cloud provider!

This isn’t just technical protection either (for example against a software bug which hits multiple regions); what if a cloud provider goes under due to a financial issue? Even big businesses can disappear overnight (just ask anyone who used to work for Barings Bank, Enron, Lehman Brothers, or even 2e2!).

Ok, it’s true that the likelihood of your cloud provider going under is pretty teeny tiny, but just how paranoid are your board or investors?

Cloud DR

Ultimate Cloud DR or Ultimate Paranoia?

For the ultimate in paranoia, some companies consider protecting themselves against the ultimate outage, by replicating between multiple clouds. In doing so, however, they must stick to using the lowest common denominator between clouds to avoid incompatibility, or indeed any potential for the dreaded “lock-in”.

At that point, they have then lost the ability to take advantage of one of the key benefits of going to cloud; getting rid of the “undifferentiated heavy lifting” as Simon Elisha always calls it. They then end up less agile, less flexible, and potentially spend their time on things which fail to add value to the business.

What is best for YOUR business?

These are all the kinds of considerations which the person responsible for an organisation’s IT DR strategy needs to consider, and it is up to each business to individually decide where they draw the line in terms of comfort level vs budget vs “lock-in” and features.

I don’t think anyone has the right answer to this problem today, but perhaps one possible solution is this:

No cloud is going to be 100% perfect for every single workload, so why not use this fact to our advantage? Within reason, it is possible to spread workloads across two or more public clouds based on whichever is best suited to those individual workloads. Adopting a multi-cloud strategy which meets business objectives and technical dependencies, without going crazy on the complexity front, is a definite possibility in this day and age!

(Ok, perhaps even replicating a few data sources between them, for the uber critical stuff, as a plan of last resort!).

The result is potentially a collection of smaller fault domains (aka blast radii!), making the business more resilient to significant outages from major cloud players, as only some parts of their infrastructure and a subset of applications are then impacted, whilst still being able to take full advantage of the differentiating features of each of the key cloud platforms.replication photocopierOf course, this is not going to work for everyone, and plenty of organisations struggle to find talent to build out capability internally on one cloud, never mind maintaining the broad range of skills required to utilise many clouds, but that’s where service providers can help both in terms of expertise and support.

They simply take that level of management and consulting a little further up the stack, whilst enabling the business to get on with the more exciting and value added elements on top. Then it becomes the service provider’s issue to make sure they are fully staffed and certified on your clouds of choice.

*** Full Disclosure *** I work for a global service provider who does manage multiple public clouds, and I’m lucky enough to have a role where I get to design solutions across many types of infrastructure, so I am obviously a bit biased in this regard. That doesn’t make the approach any less valid! 🙂

The Tekhead Take

Whatever your thoughts on the approach above are, it’s key to understand what the requirements are for an individual organisation, and where their comfort levels lie.

An all-singing, all-dancing, multi-cloud, hybrid globule of agnostic cloudy goodness is probably a step too far for most organisations, but perhaps a failover physical host in another office isn’t quite enough either…

I would love to hear your thoughts? Don’t forget to comment below!

Amazon AWS Tips and Gotchas – Part 8 – AWS EC2 Reserved Instances

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS, including AWS EC2 Reserved Instances.

For the first post in this series with a bit of background on where it all originated from, see here:
Amazon #AWS Tips and Gotchas – Part 1

For more posts in this series, see here:
Index of AWS Tips and Gotchas

AWS Tips and Gotchas – Part 8

Reserved Instances are a great way to save yourself some money for instances you know you will require for a significant period of time (from 12-36 months). One really cool fact which AWS don’t announce enough, in my opinion, is that reserved instances can actually be shared across consolidated billing accounts!

If you wanted to, you could purchase all of your reserved instances from your primary consolidated billing accounts, however, it doing this has some potentially unexpected results:

  1. Reserved instances don’t just provide you with a better price, they also provide you with guaranteed ability to spin up an instance of your chosen type, regardless of how busy the AZ in question actually is.
    If there is an AZ outage, other AWS customers will scramble to spin up additional instances in other AZs in the same region, either manually or via ASGs, and this has the potential to starve the compute resources for one or more instance types!
    Yes, that’s right, even AWS do not have an infinite compute resources!AWS Infinity Reserved InstancesBy using reserved instances, you are still guaranteed to be able to run yours regardless of available capacity for on-demand instances. They are truly reserved.
    If however, you centralise your reserved instances into your CB account, you will get the reservation pricing benefits at the top of the account tree, but you don’t get the capacity reservations as these are account specific.
  2. Reserved instances are specific to individual Availability Zones, so ensure you spread these evenly across your AZs to avoid wasting them (you are of course designing your apps to be resilient across AZs, right?) and give you maximum reserved coverage in the unlikely event of a full AZ outage.
  3. And finally… Reserved instances are a commercial tool applied after-the-fact, not against a specific instance. When using consolidated billing for reserved instances, the reservations are therefore effectively split evenly across all accounts. If you actually want to report back to each business unit / account owner on their billing including reserved instance, this could be tricky.

Find more posts in this series here:
Index of AWS Tips and Gotchas

Amazon AWS Tips and Gotchas – Part 9 – Scale-Up Patching

Amazon AWS Tips and Gotchas – Part 1 – AWS Intro, EBS and EC2

Although I have been very much aware of AWS for many years and understood it at a high level, I have never had the time to get deep down and dirty with the AWS platform… that is until now!

I have spent the past three weeks immersing myself in AWS via the most excellent ACloud.Guru Solution Architect Associate training course, followed by a one week intensive AWS instructor-led class from QA on AWS SA Associate and Professional.

While the 100 hours or so I have spent labbing and interacting with AWS is certainly not 10,000, it has given me some valuable insights on both how absolutely AWSome (sorry – had to be done!) the platform is, as well as experiencing a few eye openers which I felt were worth sharing.

It would be very easy for me to extoll the virtues of AWS, but I don’t think there would be much benefit to that. Everyone knows it is a great platform (but maybe I’ll do it later anyway)! In the meantime, I thought it would be worthwhile taking a bit more of a “warts and all” view of a few features. Hopefully, this will avoid others stepping into the potential traps which have come up directly or indirectly through my recent training materials, as well as being a memory aid to myself!

pretty cloud AWS EC2 EBS

The key thing is with all of these “gotchas”, they are not irreparable, and can generally be worked around by tweaking your infrastructure design. In addition, with the rate that AWS develop and update features on their platforms, it is likely that many of them will improve over the coming months / years anyway.

The general feeling around many of these “features” is that AWS are indirectly and gently encouraging you to avoid building your solutions on EC2 and other IaaS services, Instead, pushing you more towards using their more managed services such as RDS, Lambda, Elastic Beanstalk etc.

This did originally start off as a single “Top 10” post but realised quickly that there are a lot more than 10 items and some of them are pretty deep dive! As such, I have split the content into easily consumable chunks, with a few lightweight ones to get us started… keep your eyes open for a few whoppers later in the series!

The full list of posts will be available here:
Index of AWS Tips and Gotchas

AWS Tips and Gotchas – Part 1
  1. Storage for any single instance may not exceed 20,000 IOPS and 320MB/sec per EBS volume. This is really only something which will impact very significant workloads. The current “recommended” workaround for this is to do some pretty scary things such as in-guest RAID / striping!

    Doing this with RAID0 means you then immediately risk loss of the entire datastore if a single EBS volume in the set goes offline for even a few seconds. Alternatively, you can buy twice as much storage and waste compute resources doing RAID calculations. In addition, you then have to do some really kludgy things to get consistent snapshots from your volume, such as taking your service offline. 
    In reality, only the most extreme workloads hit this kind of scale up. The real answer (which is probably better in the long term) is to refactor your application or database for scale-out, a far more cloudy design.
    amazon AWS EBS
  2. The internet gateway service does not provide a native method for capping of outbound bandwidth. It doesn’t take a genius to work out that when outbound bandwidth is chargeable, you could walk away with a pretty significant bandwidth bill should something decide to attack your platform with a high volume of traffic. One potential method to work around this would be to use NAT instances. You can then control the bandwidth using 3rd party software in the NAT instance OS.
  3. There is no SLA for EC2 instances unless you run them across multiple Availability Zones. Of course with typical RTTs of a few milliseconds at most, there is very little reason not to stretch your solutions across multiple AZs. The only time you might keep in one AZ is if you have highly latency sensitive applications, or potentially the type of app which requires a serialised string of DB queries to generate a response to the end user.

    In a way I actually quite like this SLA requirement as it pushes customers who might otherwise have accepted the risk of a single DC, into designing something more robust and accepting the (often minor) additional costs. With the use of Auto Scaling and Elastic Load Balancing there is often no reason you can’t have a very highly available application split across two or more AZs, whilst using roughly the same number of servers as a single site solution.

    For example the following solution would be resilient to a single AZ failure, whilst using no more infrastructure than a typical resilient on-premises single site solution:Teahead AWS Simple HA Web Configuration
    No DR replication required, no crazy metro clustering setup, nothing; just a cost effective, scalable, highly resilient and simple setup capable of withstanding the loss of an entire data centre (though not a region, obviously).

Find more posts in this series here:
Index of AWS Tips and Gotchas

Amazon AWS Tips and Gotchas – Part 2 – AWS EBS & RDS MS SQL

 

%d bloggers like this: