Archive for Storage

Amazon AWS Tips and Gotchas – Part 2 – AWS EBS & RDS MS SQL

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS, including EBS and MS SQL on RDS.

For the first post in this series with a bit of background on where it all originated from, see here:
http://tekhead.it/blog/2016/02/amazon-aws-tips-and-gotchas-part-1/

For more posts in this series, see here:
Index of AWS Tips and Gotchas

AWS Tips and Gotchas – Part 2 – EBS & RDS
  1. You cannot increase the size of EBS volumes without stopping the instance. If you are designing a scale-out / high availability solution then this is not a big issue, as you should be able to take some downtime on any individual node. That downtime is going to be fairly significant, though, and the larger the volume, the more downtime you will incur. In summary, the process looks like this:
    • Stop the instance
    • Snapshot the volume
    • Create a new volume from the snapshot, with your new larger size
    • Detach the old volume
    • Attach the new volume and start the instance back up
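
The five steps above can be sketched roughly as follows. This is a minimal sketch rather than a production script: it assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), and the function name and its parameters are hypothetical names chosen for illustration. It also assumes the new volume is created in the same Availability Zone as the instance.

```python
def resize_ebs_volume(ec2, instance_id, volume_id, device, new_size_gib, availability_zone):
    """Swap an attached EBS volume for a larger copy and return the new volume ID.

    `ec2` is assumed to be a boto3 EC2 client; all other names are illustrative.
    """
    # 1. Stop the instance and wait until it has fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Snapshot the existing volume (the bigger the volume, the longer this takes).
    snapshot = ec2.create_snapshot(VolumeId=volume_id, Description="pre-resize copy")
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 3. Create a new, larger volume from the snapshot.
    new_volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        Size=new_size_gib,
        AvailabilityZone=availability_zone,
    )
    new_volume_id = new_volume["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])

    # 4. Detach the old volume.
    ec2.detach_volume(VolumeId=volume_id, InstanceId=instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # 5. Attach the new volume on the same device name and start the instance.
    ec2.attach_volume(VolumeId=new_volume_id, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return new_volume_id
```

Every `wait` call is time the node spends offline, which is where the downtime comes from. The file system inside the guest would still need extending afterwards to make use of the extra space.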

    This is one of those features which is bread and butter for a vSphere or Hyper-V admin, and could be done online in seconds with the vast majority of guest operating systems.

    I think it really highlights the key difference between designing for AWS Cloud, and a traditional enterprise virtual infrastructure. In a solution where most of your hosts are ephemeral, this should not be a big issue. If you try to take a traditional enterprise approach, you may find yourself in hot water, having to take service downtime to make simple changes.

    I suggest where possible / appropriate, avoid using EBS and use alternative options such as S3 which can scale on demand.

    UPDATE 13th Feb 2017: Amazon have just released Elastic Volumes, which allow you to scale up EBS volumes on demand! Yay! More info here:
    Amazon EBS Update – New Elastic Volumes Change Everything

  2. Similar to resizing EBS volumes, you cannot hot-resize an instance, or indeed resize it / change its type in place. In order to change instance type you need to detach any EBS volumes (including root volumes if you wish to maintain them too), terminate the instance, create a new one and re-attach your volumes.
    Obviously you cannot re-attach a root volume if you are using instance storage (ephemeral) for this, so make sure you use EBS-backed volumes if you want to maintain your root volumes for any scale-up elements of your solutions which cannot simply be re-created from a bootstrap script.
  3. If your application depends on Microsoft SQL, you are going to be in for a fairly unpleasant surprise! It is not currently possible to resize MS SQL volumes on Amazon RDS once they have been deployed! At all. Full stop. Nada.

    The recommendation from AWS is to deploy your estimated future capacity requirement from day one! Not very cloudy at all…

    Your only growth option when you hit your initial capacity limit is to migrate all the data to a new RDS instance and take some application downtime to fail over.

    This can be minimised by using things like log shipping from the source instance to get the target as close to up-to-date as possible, but you will still need to shut down and swing your applications, and frankly it’s a risky headache which would be better avoided if possible, and certainly not something you want to be doing on a regular basis.

    Probably best to design for your estimated growth, and add a percentage on top.
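
To put a rough number on "estimated growth plus a percentage on top" for the RDS case, a back-of-an-envelope calculation along these lines may help. It is only a sketch: the function name is hypothetical, it assumes simple compound annual growth, and the figures in the example are made up rather than taken from any real workload.

```python
import math


def provisioned_storage_gib(current_gib, annual_growth_rate, years, headroom):
    """Estimate the storage to provision on day one, rounded up to a whole GiB.

    Assumes compound growth at `annual_growth_rate` (0.30 means 30% a year)
    for `years`, plus a safety margin of `headroom` (0.20 means 20%) on top.
    """
    projected = current_gib * (1 + annual_growth_rate) ** years
    return math.ceil(projected * (1 + headroom))


# Hypothetical example: 200 GiB today, 30% growth a year, a three-year horizon
# and 20% headroom on top of the projection.
print(provisioned_storage_gib(200, 0.30, 3, 0.20))  # 528
```

In that made-up example you would be paying for well over double your day-one requirement from the start, which is exactly why this one stings.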


Amazon AWS Tips and Gotchas – Part 1 – AWS Intro, EBS and EC2

Although I have been very much aware of AWS for many years and understood it at a high level, I have never had the time to get deep down and dirty with the AWS platform… that is until now!

I have spent the past three weeks immersing myself in AWS via the most excellent ACloud.Guru Solution Architect Associate training course, followed by a one week intensive AWS instructor-led class from QA on AWS SA Associate and Professional.

While the 100 hours or so I have spent labbing and interacting with AWS is certainly not 10,000, it has given me some valuable insights on both how absolutely AWSome (sorry – had to be done!) the platform is, as well as experiencing a few eye openers which I felt were worth sharing.

It would be very easy for me to extol the virtues of AWS, but I don’t think there would be much benefit to that. Everyone knows it is a great platform (but maybe I’ll do it later anyway)! In the meantime, I thought it would be worthwhile taking a bit more of a “warts and all” view of a few features. Hopefully, this will help others avoid stepping into the potential traps which have come up directly or indirectly through my recent training materials, as well as being a memory aid to myself!


The key thing with all of these “gotchas” is that they are not irreparable, and can generally be worked around by tweaking your infrastructure design. In addition, with the rate at which AWS develop and update features on their platforms, it is likely that many of them will improve over the coming months / years anyway.

The general feeling around many of these “features” is that AWS are indirectly and gently encouraging you to avoid building your solutions on EC2 and other IaaS services, and are instead pushing you towards their more managed services such as RDS, Lambda, Elastic Beanstalk etc.

This did originally start off as a single “Top 10” post, but I quickly realised that there are a lot more than 10 items and some of them are pretty deep dive! As such, I have split the content into easily consumable chunks, with a few lightweight ones to get us started… keep your eyes open for a few whoppers later in the series!

The full list of posts will be available here:
Index of AWS Tips and Gotchas

AWS Tips and Gotchas – Part 1
  1. Storage for any single instance may not exceed 20,000 IOPS and 320MB/sec per EBS volume. This is really only something which will impact very significant workloads. The current “recommended” workaround for this is to do some pretty scary things such as in-guest RAID / striping!

    Doing this with RAID0 means you then immediately risk loss of the entire datastore if a single EBS volume in the set goes offline for even a few seconds. Alternatively, you can buy twice as much storage and waste compute resources doing RAID calculations. In addition, you then have to do some really kludgy things to get consistent snapshots from your volume, such as taking your service offline. 
    In reality, only the most extreme workloads hit this kind of scale up. The real answer (which is probably better in the long term) is to refactor your application or database for scale-out, a far more cloudy design.
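
To get a feel for how many volumes a striped set would need, here is a rough bit of arithmetic using the per-volume limits quoted above. It is a sketch only: the function name is hypothetical, it assumes striping scales linearly with no overhead (which real-world RAID will not quite manage), and it treats the "buy twice as much storage" option as simple mirroring of every data volume.

```python
import math

# Per-volume limits as quoted in this post; check current AWS figures before relying on them.
MAX_IOPS_PER_VOLUME = 20_000
MAX_MBPS_PER_VOLUME = 320


def volumes_needed(target_iops, target_mbps, mirrored=False):
    """Return how many EBS volumes a striped set needs to hit both targets.

    Assumes ideal linear scaling across the stripe. With mirrored=True every
    data volume is doubled up, so losing one volume does not lose the set.
    """
    by_iops = math.ceil(target_iops / MAX_IOPS_PER_VOLUME)
    by_throughput = math.ceil(target_mbps / MAX_MBPS_PER_VOLUME)
    data_volumes = max(by_iops, by_throughput, 1)
    return data_volumes * 2 if mirrored else data_volumes


# Hypothetical workload: 50,000 IOPS at 500MB/sec.
print(volumes_needed(50_000, 500))                 # 3
print(volumes_needed(50_000, 500, mirrored=True))  # 6
```

Whichever limit you hit first drives the volume count, and the mirrored figure shows how quickly the protected option doubles the bill.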
  2. The internet gateway service does not provide a native method for capping of outbound bandwidth. It doesn’t take a genius to work out that when outbound bandwidth is chargeable, you could walk away with a pretty significant bandwidth bill should something decide to attack your platform with a high volume of traffic. One potential method to work around this would be to use NAT instances. You can then control the bandwidth using 3rd party software in the NAT instance OS.
  3. There is no SLA for EC2 instances unless you run them across multiple Availability Zones. Of course with typical RTTs of a few milliseconds at most, there is very little reason not to stretch your solutions across multiple AZs. The only time you might keep in one AZ is if you have highly latency sensitive applications, or potentially the type of app which requires a serialised string of DB queries to generate a response to the end user.

    In a way I actually quite like this SLA requirement as it pushes customers who might otherwise have accepted the risk of a single DC, into designing something more robust and accepting the (often minor) additional costs. With the use of Auto Scaling and Elastic Load Balancing there is often no reason you can’t have a very highly available application split across two or more AZs, whilst using roughly the same number of servers as a single site solution.

    For example, the following solution would be resilient to a single AZ failure, whilst using no more infrastructure than a typical resilient on-premises single site solution:
    [Diagram: Tekhead AWS Simple HA Web Configuration]
    No DR replication required, no crazy metro clustering setup, nothing; just a cost effective, scalable, highly resilient and simple setup capable of withstanding the loss of an entire data centre (though not a region, obviously).


Looking Forward to Storage Field Day 9 (#SFD9)


So for those of you who love to nerd out on storage like I do, you have probably already heard of the awesome streaming events put on by Stephen Foskett and the crew from Tech Field Day, otherwise known as Storage Field Day. These have grown so popular that Stephen is having to put on extra events just to cater for demand, which I think speaks volumes as to their efficacy and indeed quality!

For those not yet indoctrinated, these events involve taking a group of around a dozen storage and technology delegates to visit a number of different startups (think Pure, NexGen, Coho, etc) and more established companies (think Intel!) to talk about the latest things going on both at those organisations and in the industry in general. Each session lasts a couple of hours, but is generally broken down into several bite sized chunks for consumption at your leisure.

As a stream viewer you get the opportunity to learn about your favourite vendors’ latest funky stuff and watch them answer questions about all the things you probably wanted to know but never got the chance to ask. It is also a great way to get your head around an unfamiliar technology or vendor. Lastly, if you watch live, you can also ask questions via Twitter for the delegates to ask of the presenters.

As a delegate this goes to a whole new level as you get to spend almost an entire week mahoossively geeking out on tech, learning from some of the smartest people in the tech industry, and meeting with the senior people at some of the industry’s best-known companies. I find it generally safest just to wear multiple layers to avoid any embarrassing nerdgasms! 😉

So with that in mind I am really chuffed to have been invited back to attend Storage Field Day 9, next month (16th-18th March) in San Jose!

Not all of the companies have been announced as yet, but we already know that the likes of Cohesity, Intel, VMware & Violin Memory will be in attendance. More will be confirmed over the next couple of weeks, and having seen the provisional list I can tell you it is definitely going to be a great event!


Needless to say the lineup of delegates is awesome as usual, with many well known bloggers from the EU, US and APAC. Make sure you check them out and follow the crew on Twitter if you are so inclined. Most delegates post their opinions on the vendors and tech both during and after the event, so make sure you check out their blog feeds. For example, here is mine:

http://www.tekhead.org/blog/feed/

If you want to tune in live, simply go to http://techfieldday.com from 16th-18th March (PST) or catch up with the recordings on YouTube later.

Finally, be warned my Twitter stream does get rather busy during the event, so feel free to temporarily mute me if need be! 😉

Why are storage snapshots so painful?

Have you ever wondered why we don’t use snapshots more often than about every 5-15 minutes in most solutions, and in many others, a lot less often than that?

It’s pretty simple to be honest… The biggest problem with taking snapshots is quiescing the data stream to complete the activity. At a LUN level, this usually involves some form of locking mechanism to pause all IO while any metadata updates or data redirections are made, after which the IO is resumed.

For small machines and LUNs with minimal IO load this is generally such a quick operation that it has virtually no effect on the application user, and is pretty much transparent. For busy applications, however, data can be changing at such a massive rate that disrupting that IO stream, even for a few seconds, can have a significant impact on performance and user experience. In addition, the larger the number of snapshots in the snap tree, the more performance tends to degrade through the management of all those snapshots, copy on write activities, and, of course, lots of locking.

This problem is then multiplied several times over when you want to get consistency across multiple machines, for example when you want to get point-in-time consistency for an entire application stack (Web / App / DB, etc).

So what do we typically do? We reduce the regularity at which we take these snaps in order to minimise the impact, whilst still having to meet the (usually near zero because all data is critical, right?) RPO set by the business.

At SFD8, we had a very well received presentation from INFINIDAT, a storage startup based in Israel and founded by industry legend Moshe Yanai (the guy who brought you EMC Symmetrix / VMAX, and subsequently XIV). Moshe’s “third generation” enterprise class storage system comes with one particular feature in which I was really interested: snapshots! Yes, I know it sounds like a boring “checkbox in an RFP” feature, but when I found out how it worked I was really impressed.

For every single write stripe which goes to disk, a checksum and a timestamp (from a high precision clock) are written. This forms the base on which the snapshot system is built (something they call InfiniSnap™).

If you have a microsecond-accurate clock and timestamps on every write, then in order to achieve a snapshot you simply have to pick a date and time! Anything written earlier than this is not included in the current snap, and anything on or after the time is. This means no locking or pausing of IO during a snap, making the entire process a near zero time and zero impact operation! A volume with or without snapshots therefore has indistinguishable performance. Wow!
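
To make the idea a bit more concrete, here is a toy model of timestamp-based snapshots. It is very much a sketch of the concept and not INFINIDAT’s implementation: the class and method names are hypothetical, timestamps are plain integers, and it assumes writes to a given block arrive in timestamp order. The cut-off time is treated as a boundary, so writes stamped before it make up the frozen point-in-time view, while writes stamped at or after it only show up in the live volume.

```python
import bisect
from collections import defaultdict


class TimestampedVolume:
    """Toy volume which keeps every write, tagged with its timestamp."""

    def __init__(self):
        # block number -> list of (timestamp, data), oldest first
        self._versions = defaultdict(list)

    def write(self, block, data, timestamp):
        versions = self._versions[block]
        if versions and timestamp < versions[-1][0]:
            raise ValueError("this sketch assumes writes arrive in timestamp order")
        versions.append((timestamp, data))

    def read_live(self, block):
        """Return the most recent data for a block, or None if never written."""
        versions = self._versions.get(block)
        return versions[-1][1] if versions else None

    def read_snapshot(self, block, cutoff):
        """Return the block as it stood just before `cutoff`.

        Taking the snapshot needs no locking or copying; the snapshot is
        simply the cut-off timestamp itself.
        """
        versions = self._versions.get(block, [])
        timestamps = [ts for ts, _ in versions]
        older = bisect.bisect_left(timestamps, cutoff)  # count of writes before cutoff
        return versions[older - 1][1] if older else None


volume = TimestampedVolume()
volume.write(0, b"first", timestamp=100)
volume.write(0, b"second", timestamp=200)
snap = 150                                  # "taking" a snapshot is just picking a time
volume.write(0, b"third", timestamp=300)    # IO carries on regardless

print(volume.read_snapshot(0, snap))  # b'first'
print(volume.read_live(0))            # b'third'
```

Note that `write` never has to know whether any snapshots exist, which is the property that keeps the snap itself free of IO pauses.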


It sounds so simple it shouldn’t work, but according to INFINIDAT they can easily support up to 100,000 snaps per system, and even this isn’t a real number. They made it up as it was a double figure percentage bigger than the next closest array on the market. They will also happily support more than this if you ask; they said that they just need to test it first. In addition, each snap group will support up to 25 snaps per second, and they guarantee an RPO of as low as 4 seconds, based on snapshots alone. You can then use point in time replication to create an asynchronous copy on another array if needed. Now that’s granular! 🙂

The one caveat I would add is that this does not yet appear to have a fix for ye old faithful crash consistent vs application consistent issue, but it’s a great start. Going back to the application stack “consistency group” concept, in theory, you generally only need to VSS the database VM, and as such it will be much easier and simpler to have a consistent snap across an app stack with minimal overhead. As we move more towards applications using NoSQL databases etc, this will also become less of an issue.

The above was just one of the cool features they covered in their presentation, from which the general consensus was very positive indeed! A couple of weeks ago I was also able to spend a little time with one of INFINIDAT’s customers who just so happened to be attending the same UKVMUG event. Their impressions in terms of the quality of the array build (with a claimed 99.99999% availability!), the management interface, general performance during initial testing, the compelling pricing, and of course, their very funky matrix-like chassis, were all very positive too.

If you want to see the INFINIDAT presentation from SFD8, make sure you have your thinking hat on and a large jug of coffee! Their very passionate CTO, Brian Carmody, was a very compelling speaker and was more than happy to get stuck into the detail of how the technology works. I definitely felt that I came away a little smarter having been a part of the audience! He also goes into some fascinating detail about genome sequencing, the concept of cost per genome and its likely massive impact on the storage industry and our lives in general! The video is worth a watch for this section alone…

Further Reading
Some of the other SFD8 delegates have their own takes on the presentation we saw. Check them out here:

Dan Frith – INFINIDAT – What exactly is a “Moshe v3.0”?
Enrico Signoretti’s blog Juku.it – Infinidat: awesome tech, great execution
Enrico Signoretti writing on El Reg – Has the next generation of monolithic storage arrived?
Ray Lucchesi – Mobile devices as a cache for cloud data
Vipin V.K. – Infinibox – Enterprise storage solution from Infinidat
GreyBeards on Storage Podcast – Interview with Brian Carmody

Disclaimer/Disclosure: My flights, accommodation, meals, etc, at Storage Field Day 8 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors’ products or services and I was not compensated in any way for my time at the event.
