Tag Archive for SMB

Long Term Data Retention – What do I do?

One of the more common requirements I come across on a day-to-day basis, working with organisations across a broad spectrum of industries, is the question of how to manage long-term data retention.

Frankly, I have massively oversimplified the question, as there are many more nuances to it than this! Some of the questions, discussion points and potential solutions I see when trying to scope out and define a long-term data retention strategy are below. We assume in this case that we are talking about backing up application data, but the same can apply to file data, such as from a file server.

Long Term Data Retention – Questions, questions, questions?!

Like beautiful snowflakes, ultimately it always comes back to gathering the requirements for the individual business.

What are the regulatory and compliance requirements for long-term retention of data, and what are the consequences for loss of that data? In the new world, this could be pretty serious, especially with things like GDPR right around the corner. Escalating this up the business hierarchy can get buy-in from other parts of the business to provide additional budget outside of IT for a solution that meets the actual requirements, not just a botch job which will likely fail when put to the test.

How long is the actual data retention required? Looking at most current applications, if we are relying on being able to read back data in 7 years' time, current or future backup software may still work, but will we have the kit to read the tapes or data? If using spinning rust as a storage medium, do we expect to be able to migrate data from one disk system to another easily in future, and if so, how does that impact things like encryption, capacity, deduplication and compression of that data?

What is it that we are trying to protect against? Deliberate or accidental deletion, total destruction of a server, array or DC, or perhaps we just need to be able to prove what our data looked like at a specific date / time?

How granular does the data need to be? For example, do we need to be able to pull a file version from a specific week in the past X years? The more granular we need to get, the more expensive the solution potentially becomes. If we have controls in place to protect archive data against accidental / deliberate deletion, then we may not actually need to keep more than a few days or weeks of backups (as an example).

The use of FIM (File Integrity Monitoring) tooling can be very helpful in this regard, especially for flat file structures. It can track all changes to your file system, and if something is removed or updated, it can alert your server teams to investigate why and restore the data from a recent backup.
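For illustration only (this is not any particular FIM product, and the archive path is hypothetical), a minimal sketch of the hash-and-compare approach might look something like this:

```python
import hashlib
import json
import os

def snapshot(root):
    """Walk a directory tree and record a SHA-256 hash per file."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            state[path] = h.hexdigest()
    return state

def compare(baseline, current):
    """Report files removed or modified since the baseline was taken."""
    removed = set(baseline) - set(current)
    changed = {p for p in baseline if p in current and baseline[p] != current[p]}
    return removed, changed

# Example usage: persist a baseline once, then diff against it on a schedule.
baseline = snapshot("/data/archive")          # hypothetical archive path
with open("baseline.json", "w") as f:
    json.dump(baseline, f)

removed, changed = compare(baseline, snapshot("/data/archive"))
if removed or changed:
    print("Alert the server team:", removed, changed)
```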

Can the application or server prevent deliberate or accidental data deletion? If the application can be treated as, or write to, WORM storage (Write Once Read Many times), then the risk of data loss is further reduced, especially if that storage can be replicated off site. This doesn’t really help much with things like SQL databases, however!
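Where the storage does support WORM semantics, retention can be enforced at write time rather than by policy alone. Purely as a hedged sketch (assuming an S3 bucket that was created with Object Lock enabled, a hypothetical bucket name and key, the boto3 library, and the 7-year retention period discussed above), a compliance-mode retention might be applied like so:

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

# Assumes the bucket was created with Object Lock enabled. Once written in
# COMPLIANCE mode, the object cannot be deleted or overwritten by anyone,
# including administrators, until the retain-until date passes.
s3.put_object(
    Bucket="example-archive-bucket",              # hypothetical bucket
    Key="exports/2017/q2/orders.csv",             # hypothetical key
    Body=open("orders.csv", "rb"),
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=7 * 365),
)
```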

Where is the archive data for the application or solution actually held? Is it within the live system (e.g. the live DB), or can it be exported onto a tertiary archive system where it becomes Read Only to all parties, including administrators? Even better, can the application export the data into a generic format more likely to be readable in 25+ years' time (such as CSV, plain text, etc.)? This provides quite a bit more flexibility in terms of future access and recovery options.

Does the application or server provide RBAC (Role-Based Access Control), and has it actually been implemented yet? If we minimise the number of people who could update or delete data (maliciously or accidentally), we minimise the risk of data loss.

What is the budget for the solution? All singing, all dancing, physical or software solutions can be great, but you may not be able to afford them.

Are we looking for an appliance-based solution which includes storage, replication, backup plugins, etc, or do you already have the hardware and just need some software? This often, but not always, comes down to a time vs budget question. Do you want to spend your team's time managing clunky backup software, or just buy an appliance which does half the work for you and is policy-based?

What are your sovereignty requirements for the data, and would a cloud-based service be appropriate for your business? It can be very cheap to store data in something like S3 or blob storage, if the business accepts this and you don’t need to pull any of the data back very often (if at all).
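As a rough sketch of how cheap that can be made in practice (again assuming boto3, a hypothetical bucket name, sign-off on the sovereignty question, and the 7-year retention period from earlier), a lifecycle rule can tier archive data down to Glacier and expire it once the retention period has passed:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix: transition objects to Glacier after 30 days
# in standard storage, then expire them after the assumed 7-year retention.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 7 * 365},
            }
        ]
    },
)
```

The trade-off, as noted above, is retrieval: pulling data back out of a cold tier takes hours and attracts egress costs, so this only works if access requests are rare and not time-critical.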

How quickly is the data required when requested, how large is a typical access request, and how often are they needed? If this can be hours or days, then an offline or cloud solution may be appropriate, but anything where immediate access is required is a different story.

Similarly, will we want to restore or access this data in the event of a DR scenario, and does this solution form part of our DR strategy? Perhaps it's only required for access to much older data because you are replicating the most recent data to a DR facility!

As we can see, there are many, many, [many!] things to think about when considering long-term retention of data in a backup or archive solution.

What brought this up Alex?…

… I hear you ask!

I recently attended Storage Field Day 13, where we had a presentation from a backup vendor, StorageCraft, who has been in the SMB and mid-market space for many years, and it got me thinking!

The latest iteration of their backup software provides a local cache with cloud integration, and the added ability to spin up a DR environment in the event of an outage to your primary DC. A pretty nifty feature if you are legally able to store your data outside of your local environment (they currently have DCs in the US and EU only).

They can also create backups using their proprietary SPF file format, which has apparently not changed since its inception around 15 years ago. There is also no concept of a media server, as each server manages its own backups (albeit with the ability to use a central scheduler tool). This gets around the issue of backup compatibility, though it may limit their ability to provide additional data services for the backup files, such as encryption, dedupe or compression, beyond those provided by the storage targets they reside on.

This is what tickled my mental matrix into deploying my keyboard! 🙂

Want to Know More?

The session was recorded and is now available to stream online:

StorageCraft Presents at Storage Field Day 13

Some of the other SFD13 delegates had their own thoughts on the session and StorageCraft in general. You can find them here:

Dan Frith – StorageCraft Are In Your Data Centre And In The Cloud

Scott Lowe – Backup and Recovery in the Cloud: Simplification is Actually Really Hard

Disclaimer/Disclosure: My flights, accommodation, meals, etc, at Storage Field Day 13 were provided by Tech Field Day / Gestalt IT, but there was no expectation or request for me to write about any of the vendors products or services and I was not compensated in any way for my time at the event.

Pure Storage Diversity – Time for the All Flash Vendor to go All File

It was only a couple of weeks ago that I was saying to some colleagues that, now Pure have finished with the whole IPO business, I thought they needed to diversify their portfolio a bit beyond the straightforward AFA.

I am very pleased to say they must have pre-read my mind and that’s exactly what they’ve announced today! 🙂

Not only is their new Pure FlashBlade platform designed to provide pretty much every file type (and object) you might require for your applications and users, it is also Scale Out, which is a key feature I am looking for more and more these days when researching new products for my customers.

[Image: FlashBlade]

Not only is this a really interesting change in direction for Pure, but I see it as a pretty nifty bit of kit in and of itself! You would hope so, as Pure have been working on it in secret for the past two and a half years… 😮

For starters Pure have mixed and matched both Intel and ARM chips on every single blade, with different computational tasks being assigned to different chips, and a bit of FPGA technology thrown in for good measure. The latter being primarily used as a programmable data mover between the different elements of the blade, so as future flash technology becomes available, the FPGA can simply be re-coded instead of requiring total redesign / replacement with every generation. This will enable Pure to change out their flash as often as every 6 months in their production plants, taking maximum advantage of the falling prices in the NAND market.

This chip design was created to use the ARM processors as embedded and linked to the FPGAs, which effectively gives you a software overlay / management function, along with other low intensity, multi-threaded processes. The significant computational power of the Intel chips, particularly for single threaded workloads, rounds out the compute. From a nerdy technologists standpoint, all I can say is schweeeet!

The numbers they are suggesting are pretty impressive too! Each 4u appliance is capable of scaling out linearly with the following stats:

  • Up to 15x 8TB or 52TB blades, for a maximum of 1.6PB per 4u chassis
  • Up to 15GB/sec throughput per chassis, though I believe this is 4K 100% read, and real numbers might be around 1/3 of this.
  • 40Gbps Ethernet out, with 2x 10Gbps per blade, connected to a Broadcom-based, custom, resilient backplane / switch layer within each chassis. Scaling to multiple chassis would require you to provide ToR switch ports for east-west traffic between chassis.
  • Overlaying this is Pure’s custom SDN code, which securely separates internal and external traffic, and uses multicast for auto-discovery of new devices.
  • Integrated DRAM and NV-RAM on every blade, along with PCIe access to the NAND.

The blades themselves look something like this:

[Image: FlashBlade blade]

In terms of protocols, it will support NFSv3 out of the box at GA, with SMB and object storage over S3 supported shortly afterward. My understanding is that initial S3 support will be limited to basic commands (PUT, GET, etc.), with more advanced feature support in the pipeline. The initial release seems to be primarily targeted at the filer market, with object being the underlying architecture, but not the main event. As this support is built out later, the object offering could become more compelling.

The data itself is distributed and protected through the use of N+2 erasure coding, using however many blades are in the chassis. For example, an 8-blade system would be configured as EC 6+2. As the number of blades in the system increases, garbage collection cycles are used to redistribute data across the new capacity, though I am still not 100% sure how this will work when your existing blades are almost full. The compute within each blade, however, acts independently of the storage and can access data resources across the chassis, so from the moment an additional blade is added, you have immediate access to its compute capacity for data processing.
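To put some rough numbers on the overhead of N+2 (my own back-of-envelope arithmetic, not figures from Pure): with two parity blades per chassis, the usable fraction of raw capacity grows as blades are added, so small configurations pay proportionally more for protection.

```python
def usable_fraction(total_blades, parity_blades=2):
    """Fraction of raw capacity left for data under N+2 erasure coding."""
    data_blades = total_blades - parity_blades
    return data_blades / total_blades

for blades in (8, 11, 15):
    print(f"{blades} blades -> EC {blades - 2}+2, "
          f"{usable_fraction(blades):.0%} of raw capacity usable")
# 8 blades  -> EC 6+2,  75% usable
# 11 blades -> EC 9+2,  82% usable
# 15 blades -> EC 13+2, 87% usable
```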

My only query on this would be why Pure did not offer the ability to choose between erasure coding, which is ideal for lower performance requirements, and replicas, which would be handier for very low latency use cases. If you are putting it all on flash in the first place, instead of a hybrid model, there may be times when you want to keep that latency as low as possible.

The software platform they have designed to manage this is called Elasticity, and to reduce the need to learn yet another interface, it looks very similar to the existing Pure management interfaces:

[Image: Elasticity management interface]

A metadata engine with search functionality will be coming later, which will allow you to gain insights into the types of data you hold, and may potentially be able to delve into the content of that data to find things such as social security numbers, etc. There are few details available on this at the time of writing.

As with the other Pure platforms, telemetry data is sent back to base on a regular basis, and Pure take care of all of the proactive maintenance and alerting for you. All of this data is presented through their Pure1 portal, which is pretty fully featured and intuitive.

I have to say I am genuinely surprised to see Pure come out with a solution with such completely bespoke hardware, when the entire industry is going in the direction of commodity + software, but the end result looks really promising. The sooner they can get SMB (not CIFS!) into the product the better, as this will allow them to begin competing properly with the likes of NetApp on the filer front.

As with many new products we tend to see on the market, the data services are not all there yet, but at the rate Pure do code releases, I don’t imagine it will be long before many of those RFP check boxes will be getting checked!

GA is expected during the second half of 2016.

Disclaimer/Disclosure: My accommodation, meals and event entry to Pure Accelerate were provided by Pure Storage, and my flights were provided by Tech Field Day, but there was no expectation or request for me to write about any of the products or services and I was not compensated in any way for my time at the event.

Where and why is my data growing?…

I’ve written recently about issues of data gravity and data inertia, and about how important analytics are to managing your data “stockpile”, but one thing I haven’t gone into is the constant challenge of actually understanding your data composition, i.e. what the hell am I actually storing?!

Looking back to my days as a Windows admin maintaining what were, for the time, some massive multi-terabyte filers and shared document storage systems (ooer – it was 10 years ago to be fair), we had little to tell us what the DNA of those file shares was, how much of it was documents and other business-related content, and how much was actually people storing their entire MP3 collections and “family photos” on their work shared drives (yes, 100% true!).

Back then, our only method of combating these issues was to run TreeSize to see who was using the most space, then do Windows searches for specific file types and manually clear out the crud; an unenviable task which turned up a few surprising finds I won't go into just now (ooer for the second time)! The problem was that we just didn't know what we had!
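Ten years on, that first-pass triage is a few lines of script; a minimal sketch that totals the space consumed per file extension across a share (the UNC path is hypothetical) might look like this:

```python
import os
from collections import Counter

def size_by_extension(root):
    """Sum bytes consumed per file extension under a directory tree."""
    totals = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "<none>"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # skip files that vanish or can't be read mid-scan
    return totals

# Top offenders first -- .mp3 and .jpg tend to tell their own story.
for ext, size in size_by_extension(r"\\fileserver\shared").most_common(10):
    print(f"{ext:10s} {size / 2**30:8.1f} GiB")
```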

Ten years later I have spoken to customers who are consuming data at very significant rates, but don’t have a grip on where it’s all going…

With that in mind, I was really interested in what the chaps at Qumulo had come up with when they presented at SFD8 recently. As they said at the time, the management of storage is getting easier, but the management of data is getting very much harder! Their primary vision is therefore quite succinctly described as “Build visible data and make storage invisible”.

Their “Data Aware” scale-out NAS solution is based around providing near-realtime analytics on the metadata, and was designed to meet the requirements of the 600 companies and individuals they interviewed before they even came up with their idea!

The product is designed to be software-only and subscription-based, though they also provide scale-out physical 1u / 4u appliances as well. I guess the main concept there is “have it your way”; there are still plenty of customers out there who want to buy a software solution which is pre-qualified and supported on specific hardware (which sounds like an oxymoron, but each to their own I say)! Most of Qumulo’s customers today actually buy the appliances.

The coolest thing about their solution is definitely their unique file system (QSFS – Qumulo Scalable File System). It uses a very clever, proprietary method to track changes within the filesystem based on the aggregate of child attributes in the tree (see their SFD8 presentation for more info). This means you don’t necessarily need to walk the entire tree to get an answer to a query (it should be noted the query would need to be one specifically catered for by Qumulo, though), and statistics based on those attributes can be presented in near-realtime.
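Conceptually (and this is purely my own toy illustration of the idea, not Qumulo's actual implementation), keeping pre-computed aggregates at every directory node means a question like "how big is this subtree?" is answered from the node itself, rather than by walking all of its children:

```python
class DirNode:
    """Toy directory node that keeps an aggregate size up to date."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.total_bytes = 0  # aggregate size of all descendants

    def add_file(self, size):
        # Propagate the delta up the tree, so every ancestor's aggregate
        # stays correct without ever re-walking its children.
        node = self
        while node is not None:
            node.total_bytes += size
            node = node.parent

root = DirNode("/")
projects = DirNode("projects", parent=root)
media = DirNode("media", parent=root)
projects.add_file(10 * 2**20)   # 10 MiB document
media.add_file(700 * 2**20)     # 700 MiB of "family photos"

# Answering "who is using my storage?" is now a constant-time lookup per node.
print(root.total_bytes, projects.total_bytes, media.total_bytes)
```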

Whiteboard Dude approves!

I would have killed for this level and speed of insight back in my admin days, and frankly I have a few customers right now who would really benefit!

Taking this a step further, the analytics can also provide performance statistics based on file path and type, so for example it could show you where the hotspots are in your filesystem, and which clients are generating them.

Who’s using my storage?

Stuff I would like to see in future versions (though I know they don’t chase the Service Provider market) would be things like the ability to present storage to more than one Active Directory domain, straightforward RBAC (Role-Based Access Control) at the management layer, and more of the standard data services you see from most vendors (the RFP tick-box features). Being able to mix and match the physical appliance types would also be useful as you scale and your requirements change over time, but I guess if you need flexibility, go with the software-only solution.

At a non-feature level, it would be sensible if they could rename their aggregate terminology as I think it just confuses people (aggregates typically mean something else to most storage bods).

Capacity Visualisation

Overall though, I think the Qumulo system is impressive, as are the founders’ credentials. Their CEO/CTO team of Peter Godman and Aaron Passey, with whom we had a good chinwag outside of the SFD8 arena, both played a big part in building the Isilon storage system. As an organisation they already regularly work with customers with over 10 billion files and up to 4PB of storage today.

If their system is capable of handling this kind of scalability having only come out of stealth 8 months ago, they’re definitely one to watch…

Further Reading
Some of the other SFD8 delegates have their own takes on the presentation we saw. Check them out here:

Dan Frith – Qumulo – Storage for people who care about their data

Scott D. Lowe – Data Awareness Is Increasingly Popular in the Storage Biz

Disclaimer/Disclosure: My flights, accommodation, meals, etc, at Storage Field Day 8 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors products or services and I was not compensated in any way for my time at the event.

 

Software Defined Storage Virtualisation – How useful is that then?

Ignoring the buzzword bingo post title, storage virtualisation is not a new thing (and for my American cousins, yes, it should be spelt with an s! 🙂 ).

NetApp, for example, have been doing a V-Series controller for many years, which could virtualise pretty much any storage you stick in the back of it, then present it as NFS and layer on all of the standard ONTAP features.

The big advantage was that you could use features which might otherwise be missing from your primary or secondary storage tiers, as well as being able to mix and match different tiers of storage from the same platform.

In a previous role, we had an annual process to take a full backup of a 65TB Oracle database and restore it at another site over a rather slow link, using an ageing VTL that could just about cope with incrementals and not much more on a day-to-day basis. End to end, this process took a month!

Then one year we came up with a plan to use virtualised NFS storage to take compressed RMAN backups, replicate the data using SnapMirror and restore it on the other side. It took us 3 days; an order of magnitude improvement!

That was 4 years ago, when the quantity of data globally was about a quarter of what it is now; the problem of data inertia is only going to get worse as the world’s storage consumption doubles roughly every two years!

What businesses need is the flexibility to use a heterogeneous pool of storage, across different tiers, vendors and locations, to move data around as required to meet the current IT strategy, without having to change paths to data or take downtime (especially for non-virtualised workloads which don’t have the benefit of Storage vMotion, etc.). These tiers need to provide the consistent performance defined by individual application requirements.

It’s for this reason that I was really interested in the presentation from Primary Data at Storage Field Day 8. They were founded just two years ago, came out of stealth at VMworld 2015, and plan to go GA with their first product in less than a month’s time. They also have some big technical guns in the form of their Chief Scientist, the inimitable Steve Wozniak!

One of the limitations of the system I used in the past was that it was ultimately a physical appliance, with all the usual drawbacks thereof. Primary Data are providing the power to abstract data services based on software only, presented in the most appropriate format for the workload at hand (e.g. for vSphere, Windows, Linux etc), so issues with data gravity and inertia are effectively mitigated. I immediately see three big benefits:

  • Not only can we decouple the physical location of the data from its logical representation and therefore move that data at will, we can also very quickly take advantage of emerging storage technologies such as VVOLs.
    Some companies who shall remain nameless (and happen to have just been bought by a four letter competitor) won’t have support for VVOLs for up to another 12 months on some of their products, but with the “shim” layer of storage virtualisation from Primary Data, we could do it today on virtually any storage platform, whether it is VVOL compliant or not. Now that is cool!
  • By virtualising the data plane and effectively using the underlying storage as object storage / chains of blocks, they enable additional data services which may either not be included with the current storage, or may be an expensive add-on license. A perfect example of this is sync and async replication between heterogeneous devices.
    Perhaps then you could spend the bulk of your budget on fast and expensive storage in your primary DC from vendor A, then replicate to your DR site asynchronously onto cheaper storage from vendor B, or even a hyper-converged storage environment using all local server media. The possibilities are broad to say the least!
  • The inclusion of policy-based Quality of Service from day one. In Primary Data parlance, they call them SLOs – Service Level Objectives – for applications with specific IOPS, latency, etc. (a purely hypothetical sketch of what such a policy might look like follows this list).
    QoS does not even exist as a concept on many recent storage devices, much to the chagrin of many service providers, so being able to retrofit it would protect the ROI on existing spend whilst keeping the platform services up to date.
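I have no visibility of Primary Data's actual policy format, but as the purely hypothetical sketch promised above, an SLO could be as simple as a set of targets plus a check against observed metrics, with the result feeding whatever moves the data:

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """Hypothetical SLO: targets an application must receive from its storage."""
    name: str
    min_iops: int
    max_latency_ms: float

def meets_slo(slo, observed_iops, observed_latency_ms):
    """True if the observed metrics satisfy the objective."""
    return observed_iops >= slo.min_iops and observed_latency_ms <= slo.max_latency_ms

oltp = ServiceLevelObjective("oltp-db", min_iops=20_000, max_latency_ms=2.0)

# A data mover could use this result to decide whether to migrate the
# workload's data to a faster tier.
if not meets_slo(oltp, observed_iops=15_000, observed_latency_ms=3.5):
    print(f"{oltp.name}: SLO breached, consider promoting to a faster tier")
```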

There are however still a few elements which to me are not yet perfect. Access to SMB requires a filter driver in Windows in front of the SMB client, so the client thinks it’s talking to an SMB server but it’s actually going via the control plane to route the data to the physical block chains. A bit of a pain to retrofit to any large legacy environment.

vSphere appears to be a first class tenant in the Primary Data solution, with VASA and NFS-VAAI supported out of the “virtual” box, however it would be nice to have Primary Data as a VASA Client too, so it could read and then surface all capabilities from the underlying storage straight through to the vSphere hosts.

You will still have to do some basic administration on your storage back end to present it through to Primary Data before you can start carving it up in their “Single Pane of Glass”. If they were to create array plugins which would allow you to remotely manage many common arrays, this would really make that SPoG shine! (Yes, I have a feverish, unwavering objection to saying that acronym!)

I will certainly be keeping an eye on Primary Data as they come to market. Their initial offering would have solved a number of issues for me in previous roles if it had been available a few years earlier, and I can definitely see opportunities where it would work well in my current infrastructure. I guess it now becomes up to the market to decide whether they see the benefits too!

Further Reading
Some of the other SFD8 delegates have their own takes on the presentation we saw. Check them out here:

Ray Lucchesi – Primary data’s path to better data storage presented at SFD8

Dan Frith – Primary Data – Because we all want our storage to do well

Disclaimer/Disclosure: My flights, accommodation, meals, etc, at Storage Field Day 8 were provided by Tech Field Day, but there was no expectation or request for me to write about any of the vendors products or services and I was not compensated in any way for my time at the event.
