How do you know which flash is right for you?

Skip Bacon –

On Monday, The Register published a story, “SPLASSHH! Tech giant IBM DIVE-BOMBS into all-flash array pool,” in which IBM and its competitors commented on IBM’s announcement. It was quite interesting to see the vendors duke it out over their solutions. I wanted to jump into the conversation and share a perspective on what this means for customers.

All-flash arrays undoubtedly represent a major evolution in storage technology, and the finger-pointing and name-calling by their vendors is a source of cheap entertainment (likely the only cheap part of the whole story!). But, as with so many past technology evolutions, the vendor focus on device features and functions distracts from the much bigger customer challenge: how to design, deploy, and support compute systems that combine multiple types and flavors of devices from multiple vendors while optimizing availability, performance, and cost. The emergence of faster flash arrays doesn’t magically solve this “systems integrator dilemma.” On the contrary, some of the inherent technology characteristics, especially at this relatively early stage of market maturity, make these challenges harder, not easier.

Customers looking to adopt all-flash arrays have to address three primary challenges:

1. Developing an accurate cost/benefit analysis. One of the primary benefits of flash storage is performance, and the all-flash array (AFA) vendors love their “vanity” benchmarks, which universally tout eye-poppingly high IOPS numbers to prove that benefit. The reality, however (as reflected in many of the vendors’ comments), is that those benchmark numbers are typically generated using exchange sizes much smaller than those seen in the real world, and with the read/write balance, sequentiality, level of data compressibility, and other factors most favorable to that vendor’s product. (Which, of course, is in no way intended to suggest that a technology vendor would ever “cook” a benchmark!)

Your real-world workload doesn’t look much like the vendor’s benchmark workload. Your mileage not only *may* vary, it’s going to vary – in all likelihood very considerably – as a result. Having an accurate expectation about achievable performance for your specific workload is critical to being an informed consumer with a realistic cost/benefit justification. This is especially critical given the price premium that many vendors are asking for those performance benefits.
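To make the effect of I/O size concrete, here is a minimal arithmetic sketch. It assumes the array is limited by a fixed bandwidth ceiling, which is a simplification, and every figure in it (the 4000 MB/s ceiling, the $500,000 price, the function names) is hypothetical and not taken from any vendor.

```python
def bandwidth_bound_iops(bandwidth_mb_s, io_size_kib):
    """Upper bound on IOPS when a fixed bandwidth limit is the bottleneck."""
    bytes_per_second = bandwidth_mb_s * 1_000_000
    return bytes_per_second / (io_size_kib * 1024)


def cost_per_iops(price, iops):
    """Purchase price divided by the IOPS actually achievable."""
    return price / iops


# Hypothetical figures for illustration only; not taken from any vendor.
ARRAY_BANDWIDTH_MB_S = 4000
ARRAY_PRICE_USD = 500_000

for io_size_kib in (4, 8, 64):
    iops = bandwidth_bound_iops(ARRAY_BANDWIDTH_MB_S, io_size_kib)
    print(f"{io_size_kib:>3} KiB I/O: {iops:>9,.0f} IOPS ceiling, "
          f"${cost_per_iops(ARRAY_PRICE_USD, iops):.2f} per IOPS")
```

Under this simple model, a workload issuing 64 KiB I/Os can reach at most one sixteenth of the IOPS of a 4 KiB benchmark on the same array, so the cost per achieved IOPS is sixteen times higher. Real arrays have other limits as well (controller CPU, latency, write amplification), so treat this as a lower bound on the gap, not a prediction.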

2. Validating vendor claims. Given the early state of this market segment and the often-conflicting claims thrown around by the vendors (as amply demonstrated in the Register article), it’s understandable that many customers want to put those claims to the test in their own labs before committing to purchase.

Gathering benchmark data to support an expensive purchase decision sounds like a great idea in theory. In practice, however, executing a high-IOPS benchmark accurately and repeatably across multiple vendor offerings is a distinctly non-trivial undertaking, involving multiple hardware and software components set up with a whole bunch (to use the correct technical descriptor) of parameters. The test configuration and execution have to be closely monitored to ensure that the load test setup actually generates the desired I/O pattern, that the infrastructure is optimally tuned, that conditions don’t vary from run to run, and of course that the vendor-reported performance numbers are accurate.
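One small piece of that monitoring, checking that results don’t vary from run to run, can be sketched as follows. This is only an illustration: the 5% threshold, the function names, and the sample IOPS figures are all made up for the example, and a real test plan would track many more variables than a single IOPS number.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_repeatable(samples, max_cv=0.05):
    """True if run-to-run spread is within the chosen (assumed) threshold."""
    return coefficient_of_variation(samples) <= max_cv


# Made-up IOPS results from three repeated runs per array; illustration only.
runs = {
    "array_a": [412_000, 405_500, 409_800],
    "array_b": [398_000, 310_000, 455_000],
}

for name, samples in runs.items():
    cv = coefficient_of_variation(samples)
    verdict = "repeatable" if is_repeatable(samples) else "investigate setup"
    print(f"{name}: CV = {cv:.1%} -> {verdict}")
```

A high coefficient of variation does not say which array is faster. It says the test setup is not yet stable enough for the comparison to mean anything.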

Inaccuracy and variability in this process serve no one’s interests. Customers certainly don’t want their decisions guided by flawed data, and vendors want their products to reliably exhibit the best possible performance. However, until the market matures to the point where vendors can and will provide defensible performance numbers for the broad mix of real-world workloads, benchmarking will continue to be a prerequisite for many purchase decisions.

3. Delivering production performance. Bolting an all-flash array into the existing production environment and achieving anything approaching the performance measured in the pristine, isolated lab environment is an even less trivial undertaking. There is a plethora of host and fabric conditions that have to be satisfied to realize high I/O performance across the entire customer-integrated system, literally from the flash cells in the array to RAM on the host.

Monitoring, tuning, and controlling these conditions as the entire system changes around them is fundamental to ensuring consistent system performance. Yet in many of the production environments that we have assessed, these conditions are so imbalanced that the addition of significantly faster storage (which, ironically, is often the storage vendor’s universal prescription for “whatever ails the SAN”) would actually have made the entire system significantly slower. As Fibre Channel has only rudimentary flow control, the SAN as a system works best when request and response rates are well balanced between hosts and storage. Making any one part of the system much faster can upset that balance and cause conditions such as buffer-credit starvation and head-of-line blocking that back up across inter-switch links (ISLs) and badly degrade overall performance and availability.
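A rough way to see why credit-based flow control is sensitive to imbalance: a sender can have only as many frames in flight as it holds credits, so sustained throughput is bounded by credits × frame size ÷ the time it takes for a credit to come back. The sketch below is a simplified model under that assumption; the credit count, frame size, link rate, and return times are hypothetical round numbers, not measurements from any real fabric.

```python
def credit_limited_throughput(credits, frame_bytes,
                              credit_return_time_s, link_rate_bytes_s):
    """Throughput bound for a credit-windowed link, capped at the link rate."""
    window_rate = credits * frame_bytes / credit_return_time_s
    return min(window_rate, link_rate_bytes_s)


# Hypothetical values for illustration only.
CREDITS = 8                    # credits granted to the sender
FRAME_BYTES = 2048             # assumed payload per frame
LINK_RATE = 800_000_000        # assumed usable link rate in bytes/second

# A receiver that returns credits promptly versus one that drains slowly.
for label, return_time_s in (("prompt receiver", 10e-6),
                             ("slow-draining receiver", 100e-6)):
    rate = credit_limited_throughput(CREDITS, FRAME_BYTES,
                                     return_time_s, LINK_RATE)
    print(f"{label}: {rate / 1_000_000:,.0f} MB/s")
```

In this model, a receiver that is slow to return credits throttles the link well below its rated speed, and on a shared ISL that throttling affects every flow queued behind it. That is the intuition behind the starvation and head-of-line blocking described above, not a substitute for measuring the actual fabric.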

The only customer outcome worse than buying an expensive array that proves to be much slower than expected would be for that array to degrade the performance and/or availability of the entire production environment into which it is deployed.