VulcanCast Follow Up – A few thoughts on 60TB SSDs

vulcancast1

So last week I was kindly invited to share a ride in Marc Farley‘s car (not as dodgy as it sounds, I promise!).

The premise was to discuss the recent announcements around Seagate’s 60TB SSD, Samsung’s 30TB SSD, their potential use cases, and how on earth we can protect the quantities of data which will end up on these monster drives?!

Performance

As we dug into a little in the VulcanCast, many use cases will present themselves for drives of this type, but the biggest challenge is that the IOPS density of the drives not actually very high. On a 60TB drive with 150,000 read IOPS (and my guess but not confirmed is ~100,000 or fewer write IOPS), the average IOPS per GB is actually only a little higher than that of SAS 15K drives. When you start adding deduplication and compression into the mix, if you are able to achieve around 90-150TB per drive, you could easily be looking at IOPS/GB performance approaching smaller 10K SAS devices!

flashfastThe biggest benefit of course if that you achieve this performance in a minuscule footprint by comparison to any current spindle type. Power draw is orders of magnitude lower than 10/15K, and at least (by my estimates) at least 4x lower than using NL-SAS / SATA at peak, and way more at idle. As such, a chunk of the additional cost of using flash for secondary tier workloads, could be soaked up by your space and power savings, especially in high-density environments.

In addition, the consistency of the latency will open up some interesting additional options…

SAS bus speeds could also end up being a challenge. Modern storage arrays often utilise 12GB SAS to interconnect the shelves and disks, which gives you multiple SAS channels over which to transfer data. With over half a PB of usable storage in just a dozen drives, which could be 1PB with compression and dedupe, and that’s a lot of storage to stick on a single channel! In the long term, faster connectivity methods such as NVMe will help, but in the short-term we may even have to see some interesting scenarios with one controller (and channel) for every few drives, just to ensure we don’t saturate bandwidth too easily.

dogedataUse Cases

For me, the biggest use cases for this type of drive are going to be secondary storage workloads which require low(ish) latency, a reasonable number of predominantly Read IOPS, and consistent performance even when a little bit bursty. For example:

  • Unstructured data stores, such as file / NAS services where you may access data infrequently, possibly tiered with some faster flash for cache and big write bursts.
  • Media storage for photo and video sites (e.g. facebook, but there are plenty of smaller ones such as Flickr, Photobox, Funky Pigeon, Snapfish, etc. Indeed the same types of organisations we discussed at the Storage Field Day roundtable session on high performance object storage. Obviously one big disadvantage here, would be the inability to dedupe / compress very much as you typically can’t expect high ratios for media content, which then has the effect of pushing up the cost per usable GB.
  • Edge cache nodes for large media streaming services such as NetFlix where maximising capacity and performance in a small footprint to go in other providers data centres is pretty important,whilst being able to provide a consistent performance for many random read requests.

For very large storage use cases, I could easily see these drives replacing 10K and if the price can be brought down sufficiently, for highly dedupable (is that a word?) data types, starting to edge into competing with NL SAS / SATA in a few years.

Data Protection

Here’s where things start to get a little tricky… we are now talking about protecting data at such massive quantities, failure of just two drives within a short period, has the potential to cause the loss of many hundreds of terabytes of data. At the same time, adding additional drives for protection (at tens of thousands of dollars each) comes with a pretty hefty price tag!

dataprotectionUnless you are buying a significant number of drives, the cost of your “N+1”, RAID, erasure coding, etc is going to be so exorbitant, you may as well buy a larger number of small drives so you don’t waste all of that extra capacity. As such, I can’t see many people using these drives in quantities of less than 12-24 per device (or perhaps per RAIN set in a hyper-converged platform), which means even with a conservatively guestimated cost of $30k per drive, you’re looking at the best part of $350-$700k for your disks alone!

Let’s imagine then, the scenario where you have a single failed drive, and 60TB of your data is now hanging in the balance. Would you want to replace that drive in a RAID set, and based on the write rates suggested so far, wait 18-24 hours for it to resync? I would be pretty nervous to do that myself…

In addition, we need to consider the rate of change of the data. Let’s say our datastore consists of 12x60TB drives. We probably have about 550TB or more of usable capacity. Even with a rate of change of just 5%, we need to be capable of backing up 27TB from that single datastore per night just to keep up with the incrementals! If we were to use a traditional backup solution against something like this, to achieve this in a typical 10-hour backup window will generate a consistent 6Gbps, never mind any full backups!

Ok, let’s say we can achieve these kinds of backup rates comfortably. Fine. Now, what happens if we had failure of a shelf, parity group or pool of disks? We’ve probably just lost 250+TB of data (excluding compression or dedupe) which we now need to restore from backup. Unless you are comfortable with an RTO measured in days to weeks, you might find that the restore time for this, even over a 10Gbps network, is not going to meet your business requirements!!!

This leaves us with a conundrum of wondering how we increase the durability of the data against disk failures, and how do we minimise the rebuild time in the event of data media failure, whilst still keeping costs reasonably low.

durablenokiaToday, the best option seems to me to be the use of Erasure Coding. In the event of the loss of a drive, the data is then automatically rebuilt and redistributed across many or all of the remaining drives within the storage device. Even with say 12-24 drives in a “small” system, this would mean data being rebuilt back up to full protection in 30-60 minutes, instead of 18-24 hours! That said, this assumes the connectivity on the array bus / backplane is capable of handling the kinds of bandwidth generated by the rebuilds, and that this doesn’t have a massive adverse impact on the array processors!

The use of “instant restore” technologies, where you can mount data direct from the backup media to get up and running asap, then move the data transparently in the background also seems to me to be a reasonable mitigation. In order to maintain a decent level of performance, this will likely also drive the use of flash more in the data protection storage tiers as well as production.

The Tekhead Take

Whatever happens, the massive quantities of data we are beginning to see, and the drives we plan to store them on are going to need to lead us to new (as yet, not even invented) forms of data protection. We simply can’t keep up with the rates of growth without them!

VulcanCast

Catch the video here:

The video and full transcript are also available here:
Huge SSDs will force changes to data protection strategies – with @alexgalbraith

Amazon #AWS Tips and Gotchas – Part 7

amazon-web-services-logo

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS.

For the first post in this series with a bit of background on where it all originated from, see here:
Amazon #AWS Tips and Gotchas – Part 1

For more posts in this series, see here:
Index of AWS Tips and Gotchas

Tips and Gotchas – Part 7

  1. As detailed in the EMR FAQ, EMR does not support multi-master config, only one master node per EMR cluster (plus of course, multiple slaves). If that master node goes offline, you lose your cluster and all data which is being processed at the time. The AWS recommended workaround for this is to checkpoint your EMR cluster regularly, which allows resuming of the cluster from the last checkpoint in the event of a failure.
    EMR
  2. Spot instances and sticky sessions do not play well together!!! If you use spot instances as a method for providing cheap burst resources, make sure your application is not dependent on sticky sessions.
    If it is, you risk losing user sessions when the spot instances are terminated with only 2 minutes notice.
    There are a couple of mitigation methods for this, the best of which is simply to not use sticky sessions, and store your session data in another system such as ElastiCache or DynamoDB (or both!).
    Alternatively, you could setup a script within the EC2 guest OS to monitor the Spot Instance Termination Notifications (http://169.254.169.254/latest/meta-data/spot/termination-time) and devise a method to cleanly migrate off any remaining sessions from your instance and remove it from the load balancer.
    NOTE: It is best to avoid terminating your spot instances yourself, as AWS will not charge you for the hour in which they terminate your instance, so you can save some budget over shutting your own instances down. 
  3. Placement groups were designed specifically for high bandwidth applications, which require low latency, 10Gbps connectivity between instances.
    pgsIf you do not start all instances in a placement group at the same time, you cannot guarantee that they will end up optimally close to each other later. Indeed, as stated in the placement groups KB “If you try to add more instances to the placement group later, or if you try to launch more than one instance type in the placement group, you increase your chances of getting an insufficient capacity error”.
    If you do want to add more instances to your placement group later, the best thing to do is stop and restart all of your instances concurrently.

Find more posts in this series here:
Index of AWS Tips and Gotchas

Amazon #AWS Tips and Gotchas – Part 6 – Dedicated VPCs

amazon-web-services-logo

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS.

For the first post in this series with a bit of background on where it all originated from, see here:
Amazon #AWS Tips and Gotchas – Part 1

For more posts in this series, see here:
Index of AWS Tips and Gotchas

Tips and Gotchas – Part 6

12. Dedicated VPCs

Just a quick one this week, specifically something to watch out for otherwise you risk running up a scary bill very quickly!

When you create a new VPC, you have the option to create it as Default or Dedicated as per the screenshot below:

vpc1

Now here’s the rub… if you select dedicated VPC, this will actually cause every single EC2 instance from then on to be created on dedicated hardware (what AWS call single-tenant hardware, i.e. dedicated physical servers!) by default, within that VPC.

Also note that as per the Dedicated Instances KB article, “You can’t change the instance tenancy of a VPC after you create it”.

In other words, if you find you have created your VPC as a dedicated one, you will have to destroy and re-create everything within that VPC to get it back to default (i.e. multi-tenant/shared compute).

awsinvoiceAnyhoo, I said it was just a quick one this week…

Find more posts in this series here:
Index of AWS Tips and Gotchas

Amazon #AWS Tips and Gotchas – Part 7

Amazon #AWS Tips and Gotchas – Part 5 – Managing Multiple VPCs

amazon-web-services-logo

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS.

For the first post in this series with a bit of background on where it all originated from, see here:
Amazon #AWS Tips and Gotchas – Part 1

For more posts in this series, see here:
Index of AWS Tips and Gotchas

Tips and Gotchas – Part 5

11. Managing Multiple VPCs & Accounts

Following on from the previous post, let us assume that instead of just talking about public services endpoints (e.g. S3, Glacier, etc), and instead we are talking about environments with multiple VPCs, possibly multiple accounts, and the potential addition of Direct Connect on top.

vpc1.png

Why would you do this? Well, there are numerous reasons for logically separating things such as your dev/test and production environments from a security and compliance perspective. The one that people sometimes get hung up on is why would I want more than one account? As it goes, some AWS customers run many tens or even hundreds of accounts! Here are a few examples:

  • The simplest answer to this is so that you can avoid being “CodeSpaced” by keeping copies of your data / backups in a second account with separate credentials!
  • Separation of applications which have no direct interaction, or perhaps minimal dependencies, to improve security.
  • Running separate applications for different business units in their own accounts to make for easier LoB billing.
  • Allowing different development teams to securely work on their own applications without risking impact to any other applications or data.
  • With the mergers and acquisitions growth strategy which many companies adopt, it is fairly common these days for companies to be picked up and bring their AWS accounts and resources with them.
  • Lastly, a very common design pattern for compliance is to use a separate account to gather all of your CloudTrail and other audit logs in a single account, inaccessible to anyone except your security team, and therefore secure from tampering.

The great thing is that with consolidated billing, you can have as many accounts as you like whilst still receiving a single monthly bill for your organisation!

We will now look at a few examples of ways to hang together your VPCs and accounts, and in the majority of cases, you can effectively consider the two as interchangeable in so far as the scope of this post.

Scenario A – Lots of Random VPC Peering and a Services VPC

This option is ok for small solutions but definitely does NOT scale and is also against best practice recommendations from AWS. As mentioned in the previous section, transitive peering is also not possible unless you are somehow proxying the connections, so if you are looking to add Direct Connect to this configuration, this just simply isn’t going to fly.

Imagine that all of the blue dotted arrows in the following diagram were VPC peering connections! Aaaaargh!

vpc2.png

Option B – Bastion Server in Services VPC

If each of your VPCs is independent, and you only need to manage them remotely (i.e. you are not passing significant traffic between many different VPCs, or from AWS to your MPLS, then a services VPC with a bastion server may be a reasonable option (hub and spoke):

http://docs.aws.amazon.com/AmazonVPC/latest/PeeringGuide/peering-configurations-full-access.html

In this example, you could push a Direct Connect VIF into VPC A and via your bastion server, manage servers in each of your other VPCs. This would NOT be appropriate if your other servers / clients on premises wanted to access those resources directly, however, and is more likely in the scenario where each VPC hosts some form of production or dev/test platform which is internet facing, and this is effectively your management connection in the back door.

You might also potentially aggregate all of your security logs etc into the bastion VPC.

vpc3.png

Scenario C – Full Mesh

This is like a neater version of Scenario A. Holy moly! Can you imagine trying to manage, support or troubleshoot this?

vpc4.png

Even something as simple as managing your subnets and route tables would become a living, breathing nightmare! Then what happens every time you want to add another VPC? shudder

If you require this level of inter-VPC communication, then my first question would be why are you splitting the workloads across so many dependent VPCs, and where is the business benefit to doing so? Better to look at rationalising your architecture than try to maintain something like this.

Scenario D – Lollipop Routing

If you absolutely must allow every VPC to talk to most or even every other VPC, and the quantity of VPCs is significant then it may be worthwhile looking at something more scalable and easy to manage.

This one is more scalable from a management perspective, but if I am honest, I am not massively keen on it! It feels a bit like AWS absolving themselves of all responsibility when it comes to designing and supporting more complex network configurations. It could potentially also work out rather expensive as you could end up needing a fairly hefty amount of Direct Connect bandwidth to support the potential quantity of traffic at this scale, as well as adding a load of unnecessary latency.

I would prefer that AWS simply allowed some form of auto configured mesh with a simple tag/label assigned to each VPC to allow traffic to route automatically. If only such a technology existed or could be used as a design template!?! (sarcasm mode off – MPLS anyone?)

I am confident that at the rate AWS are developing new services, providing automation of VPC peering won’t be miles off (as suggested by the word “presently” in the following slide from an AWS presentation available on slideshare from last July (2015):

vpc5.png

In the meantime, we are left with something that looks a bit like this:

vpc6.png

When reaching this kind of scale, there are also a few limitations you want to be aware of:

vpc7.png

And Finally… NOTE: Direct Connect is per-Region

When you procure a direct connect, you are not procuring a connection to “AWS”, you are procuring a connection to a specific region. If you want to be connected to multiple AWS regions, you will need to procure connections to each region individually.

To an extent I can see that this makes some logical sense. Let’s say they allowed access through one region to others, if you have connections to a single region and that region has a major issue, you could end up losing access to all regions.

What would be good though would be the ability to connect to two regions, which would then provide you with region resilient access to the entire AWS network of regions. Whether this will become a reality is yet to be seen, but I have heard rumblings that there may be some movement on this in the future.

Wrapping Things Up

As you can see, getting your VPC peering and Direct Connect working appropriately, especially at scale, is a bit of a minefield.

I would suggest that if you are seriously looking at using Direct Connect, and need some guidance you could do worse than have a chat with your ISP, MSP or hosting provider of choice. They can help you to work out a solution which is best for your businesses requirements!

Find more posts in this series here:
Index of AWS Tips and Gotchas

Further Reading

Here are links to a few resources used in the writing of this post, worthwhile reading if you want to understand the subject more thoroughly:

Amazon #AWS Tips and Gotchas – Part 6 – Dedicated VPCs

%d bloggers like this: