Tag Archive for performance

I Like Big Files and I Cannot Lie

You other vendors, can’t deny,
When an array walks in with an itty bitty waste [-ed capacity],
And many spindles in your face
You get sprung, want to pull up tough,
‘Cause you notice that storage was stuffed!

Ok… I’ll stop now! I’m just a bit sad and always wanted an excuse to use that as a post opener! 🙂

There is a certain, quite specific type of customer whose main requirements revolve around the storage of large data sets consisting of thousands to millions of huge files. Think media / TV / movie companies, video surveillance or even PACS imaging and genomic sequencing. Ultimately we’re talking petabyte-scale capacities – more than your average enterprise needs to worry about!

How you approach storage of this type of data is worlds apart from your average solution!

The Challenges of “Chunky” Data

Typical challenges involve having multiple silos of your data across multiple locations, with different performance and workload characteristics. Then you have different storage protocols for different applications or phases in their data processing and delivery. Each of those silos then requires different skills to manage, and different capacity management regimes.


On top of that, for the same reason as we moved away from parity groups in arrays to wide striping, these silos then have IO and networking hotspots, wasted capacity (sometimes referred to as trapped white space) and wasted performance, which cannot be shared across multiple systems.

Finally (and arguably most importantly), how do you ensure the integrity, resilience, and durability of this data, as by its very nature, it typically requires long-term retention?

Ideal Solution

What you really need is a single storage system which can not only scale to multi-petabyte capacities with multiple protocols, but is reasonably easy to manage, even when each admin is responsible for a very large amount of capacity.

You then need to ensure that data can also be protected against accidental or malicious file modification or deletion.

Finally, you need the system to be able to replicate additional copies to remote sites, as backing up petabytes of data is simply unrealistic! Similarly, you may want multiple replicas or additional pools outside of your central repository which all replicate back to the mothership, for example for ROBO or multi-site solutions where editing large files needs to be done locally.

As my good friend Josh De Jong said recently:

Of course, the biggest drawback of using this approach is that you have one giant failure domain. If something somehow manages to proverbially poison your “data lake”, that’s a hell of a lot of data to lose in one go!

DellEMC Isilon

During our recent Tech Field Day 12 session at DellEMC, I was really interested to see how the DellEMC Isilon scale-out NAS system was capable of meeting many of these requirements, especially as this is a product which can trace its heritage all the way back to 2001! In fact, the average Isilon customer stores around 1PB, and their largest customer is running 144PB! Scalability, check!

The Isilon team also confirmed that around 70% of their 8,000+ customers trust the solution sufficiently not to use any external backup solution at all, relying instead on SnapshotIQ, SyncIQ and in some cases SmartLock to protect their data. That’s a pretty significant number!

One thing I am not so keen on with the Isilon (and to be fair, with many other “traditional” / old guard storage vendor offerings) is the complexity and breadth of the licensing; almost every interesting feature requires its own license. If the main benefit of the data lake is simplicity, then I would far rather have a single price with perhaps one or two uplift options than an a la carte menu.

In addition, the limit of 50 security domains provides some flexibility for service providers, but then limits the size of your “data lake” to 50 customers. It would be great to see this limit increased in future.

The Tekhead Take

Organisations looking to retain data in these quantities need to weigh up the relative risks of using a single system for all storage versus the costs and complexity of multiple silos. Ultimately it is down to each individual organisation to work out what most closely matches their requirements, but for the convenience of a single large repository for all of your data, the DellEMC Isilon remains a really interesting proposition.

Further Info

You can catch the full Isilon session at the link below:
Dell EMC Presents at Tech Field Day 12

Further Reading

Some of the other TFD delegates had their own takes on the presentation we saw. Check them out here:

Disclaimer: My flights, accommodation, meals, etc at Tech Field Day 12 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors products or services.

Site Performance and Downtime


This is just a quick note to apologise for any downtime and/or slow performance at the moment.

The primary web server at my shared web host (Evohosting) has been delivering shocking performance for the past 24 hours, with no sign of it being fully resolved any time soon. Even a classic “turn it off and on again” does not seem to have resolved it.

Ironic really; as I described in my previous post, I enabled CloudFlare a few days back and the performance boost was awesome.

#SadPanda

Needless to say, I am currently considering alternative hosting options. Please bear with me…

UPDATE 9th October 2016

Evohosting have moved my site to a new server. Please let me know if you experience any further issues, via Twitter.

Amazon AWS Tips and Gotchas – Part 3 – S3, Tags and ASG

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS, including AWS S3, Tags / Tagging as well as ASG (Auto-Scaling Groups).

For the first post in this series with a bit of background on where it all originated from, see here:
http://tekhead.it/blog/2016/02/amazon-aws-tips-and-gotchas-part-1/

For more posts in this series, see here:
Index of AWS Tips and Gotchas

AWS Tips And Gotchas – Part 3
  1. Individual S3 buckets are initially soft-limited to roughly 100 concurrent write requests per second and 300 reads per second, and only partition behind the scenes as request volumes grow over time. This sounds like a lot, but when you consider that the average web page probably consists of 30-60 objects, it would not take a huge number of concurrent users hitting an application at the same time of day to start running into these limits.

    The first recommendation here, especially for read-intensive workloads, is to cache the content from S3 using a service like CloudFront. This immediately means that, within a given object TTL, you would only expect each object to be requested from S3 a maximum of around 50 times (once per global edge location), assuming a global user base. Far fewer than that if all of your users are in a small number of geographic regions.
    Second, do not use sequentially named S3 objects. Prefix each object key with a short random string of characters; behind the scenes this means S3 will shard the data across multiple partitions rather than putting it all in one (a minimal sketch of this appears at the end of this tip). This is very effectively explained here:
    http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

    Third, effectively shard your data across multiple S3 buckets in some logical fashion, ensuring you are also spreading the read and write requests roughly equally between them, thereby increasing your maximum IO linearly with every additional S3 bucket. You would then potentially need some form of service to keep track of where your content lives; a common method for this is to store the S3 object locations in a DynamoDB table for resilient and fast retrieval.

    For extra-fast retrieval you could also cache these S3 locations in memory using ElastiCache (Memcached/Redis).
    If you go down this route, and assuming older data is less frequently accessed, I suggest you rebalance your data when new S3 buckets are added, otherwise you risk having hot and cold buckets, which defeats the objective of sharding them in the first place!

    Even better, just start with a decent number of S3 buckets anyway, as the buckets themselves are free; you are only charged for the content stored inside them! This, of course, adds some complexity for management and maintenance, so make sure you account for this in your designs!

    Lastly, use a CDN! That way your object access hit counts will be far lower, and your users will get improved performance from delivery of content from local PoPs! 🙂
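
    To make the key-naming advice above concrete, here is a minimal sketch in Python with boto3 of how you might generate randomised key prefixes before uploading. The bucket name and helper functions are purely hypothetical, and the hash-prefix approach simply illustrates the guidance linked above, so treat this as a starting point rather than a definitive implementation:

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def randomised_key(filename: str) -> str:
    # Derive a short, evenly distributed prefix from a hash of the filename,
    # so object keys do not share a common sequential prefix and S3 can
    # spread them across partitions behind the scenes.
    prefix = hashlib.md5(filename.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/{filename}"

def upload(bucket: str, filename: str, body: bytes) -> str:
    key = randomised_key(filename)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    # Store the returned key somewhere queryable (e.g. a DynamoDB table),
    # so the randomised prefix does not stop you finding the object later.
    return key

# Hypothetical usage:
# key = upload("my-media-bucket-01", "episode-042.mp4", open("episode-042.mp4", "rb").read())
```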

  2. If you are using Tags as a method to assign permissions to users, or even to prevent accidental deletion of content or objects (something I’m not 100% convinced is bulletproof, but hey!), make sure you then deny the ability for users to modify those tags (duh!).

    For example, if you set a policy which states that any instance tagged with “PROD” may not be deleted without either MFA or elevated permissions, make sure you deny all ability for your users to edit said tags, otherwise they just need to change the tag from PROD to BLAH and they can terminate the instance. A rough sketch of such a deny policy follows below.
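
    As a rough illustration of locking those tags down, here is a minimal sketch in Python with boto3 of an IAM policy that denies EC2 tag creation and deletion. The policy name is hypothetical and the statement is deliberately broad; a real policy would need scoping and testing against your own permission model:

```python
import json
import boto3

iam = boto3.client("iam")

# Deny the ability to create or delete EC2 tags, so a user cannot simply
# re-tag a "PROD" instance to something else and then terminate it.
# Attach this to your standard user groups/roles; an admin role would
# simply not have it attached.
deny_tag_changes = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyTagModification",
            "Effect": "Deny",
            "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="deny-ec2-tag-changes",  # hypothetical name
    PolicyDocument=json.dumps(deny_tag_changes),
)
```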

  3. This is a configuration point which can cost you a wee chunk of change if you make this error and don’t spot it quickly! When configuring your Auto-Scaling Group make sure the Grace Period is set sufficiently long to ensure your instances have time to start and complete all of their bootstrap scripts.

    If you don’t, the first time you start up your group it will boot an instance, start health checking it, decide the instance has failed, terminate that instance and boot a new one, start health checking it, decide the instance has failed, etc (ad infinitum).

    If your grace period is too low, this could mean spinning up as many as 60 or more instances in an hour, each with a minimum charge of an hour!

    Instead, work out your estimated Grace Period and consider adding an extra 20% wiggle room (see the sketch at the end of this tip). Similarly, if your bootstrap script has a typo in it (as mine did in one test) which causes your health checks to fail, Auto-Scaling will keep terminating and instantiating new instances until you stop it. Make sure you have thoroughly tested your current bootstrap script prior to using it in an Auto-Scaling group!

    Update: One last point to highlight here is some sound advice from James Kilby: be aware that as your environment changes, a grace period which was sufficient on day one may not be enough later on! Don’t set and forget this stuff, or you may find you come in one day to a big bill and a load of lost revenue because your site needed to scale up and couldn’t!
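
    To show where the Grace Period setting actually lives, here is a minimal sketch in Python with boto3 of creating an Auto-Scaling group with a generous health check grace period. The group, launch configuration and subnet names are hypothetical, and the numbers assume a measured bootstrap time of around ten minutes, so size them against your own environment:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Assume OS boot plus bootstrap scripts take ~10 minutes (600 seconds);
# add ~20% wiggle room so a slow boot is not marked unhealthy, terminated
# and endlessly replaced.
measured_bootstrap_seconds = 600
grace_period = int(measured_bootstrap_seconds * 1.2)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",                       # hypothetical
    LaunchConfigurationName="web-launch-config",          # hypothetical, must already exist
    MinSize=2,
    MaxSize=10,
    HealthCheckType="ELB",
    HealthCheckGracePeriod=grace_period,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # hypothetical subnets
)
```

    As per the advice above, revisit this value whenever your bootstrap process changes, rather than setting it once and forgetting it.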

Find more posts in this series here:
Index of AWS Tips and Gotchas

Amazon AWS Tips and Gotchas – Part 4 – Direct Connect & Public / Private VIFs

How often do you upgrade your storage array software?

Upgrades are scary!

Having managed and implemented upgrades on highly available systems such as the old Sun StorageTek line of rebranded HDS USP/VSP arrays back in the day, I can tell you that we did not take upgrades lightly!

Unless there was a very compelling reason for an upgrade, the line taken was always “if it ain’t broke, don’t fix it”, but then we were looking after storage in a massively high security environment where even minor changes were taken very seriously indeed. When it came to storage we didn’t have or need anything very fancy at all, just some high performance LUNs cut from boatloads of small capacity 15K drives, a bit of copy on write snappage to a set of 3rd party arrays and some dual site synchronous replication. Compared to some of the features and configurations of today, that’s actually pretty minimal!

Updates

Now this approach meant that the platform was very stable. Great! It also meant that because we only did upgrades once in a blue moon, the processes were not what you might call streamlined, and the changes made by each upgrade were typically numerous, thereby running a pretty decent risk of something breaking. It was also key to check the compatibility matrix for every release, to ensure that the 3rd party arrays would continue to function.

They say that software is eating the world. I’d say the same could reasonably be said of the hardware storage vendors we saw at Storage Field Day 8, as most of them seem to be moving towards more Agile development models. Little and often means lower risk with every upgrade, as there are fewer changes. New features and improvements can be released on a more regular basis (especially those taking advantage of flash technologies, which are changing by the minute!). A significant number of the vendors we saw had internal release cycles of between 2 and 4 weeks, and public release cycles of 2-8 weeks!

In the case of one vendor, Pure Storage, they are not only releasing code every couple of weeks, but customers have obviously taken this new approach on board with vigour! Around 91% of Pure’s customer base is currently using an array software version 8 months old or less. An impressive stat indeed!


This is Hardware. Software runs on it…

This sounds like a relatively risky approach, but they mitigate it to a great extent by using the telemetry data uploaded every 30 seconds from customer arrays to their Pure1 SaaS management platform, building up a picture of both individual customers and the customer base as a whole. They then use their fingerprint engine to proactively pre-check every customer array and find out which may be susceptible to any potential defect in a new software release. Arrays which pass this pre-check have the upgrades rolled out remotely by Pure Storage engineers on a group-by-group basis to minimise risk. Obviously this is also done in conjunction with customers and in agreement with their change windows, etc. You wouldn’t expect your controllers to start failing over without any notice! 🙂

If I’m honest I am torn in two about this approach. The ancient storage curmudgeon in me says an array should just sit in the corner of the room quietly ticking away with minimal risk to availability and data durability (at least to known bugs anyway!). This new style of approach means that it doesn’t matter how many redundant bits of that rusty tin you have, as Scott D Lowe said last week:

That said, we need to be realistic: we don’t live in ye olde world any more. Every part of the industry is moving towards more agile development techniques, driven largely by customer and consumer demand. If the “traditional” storage industry doesn’t follow suit, it risks being left behind by newer technologies such as SDS and hyperconvergence.

There is one other key benefit to this deployment method which I haven’t mentioned, of course: those big scary upgrades of the past become minor updates, and the processes we, as fleshy sacks of water, wrap around them become mundane. That does sound quite tempting!

Perhaps upgrades aren’t that scary any more?

I’d love to hear your opinions either way, feel free to fire me a comment on twitter!

Further Reading
Some of the other SFD8 delegates have their own takes on the presentation we saw. Check them out here:

Dan Frith - http://www.penguinpunk.net/blog/pure-storage-orange-is-the-new-black-now-what/

Scott D. Lowe - http://www.enterprisestorageguide.com/overcoming-new-vendor-risk-pure-storages-techniques

Pure1 Overview at SFD8

 
Disclaimer/Disclosure: My flights, accommodation, meals, etc, at Storage Field Day 8 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors products or services and I was not compensated in any way for my time at the event.
