
Saturday, July 7, 2018

Thoughts on overload and redundancy

Term        Definition
Overload    To load something to excess.
Redundancy  The inclusion of extra components which are not strictly necessary to functioning, in case of failure of other components.


Introduction


Earlier in the year, I read an article on the many different types of failures encountered during the building of the Humber Bridge in the UK. This article sent me in search of other bridge-related failures, but this time I looked for examples where overload was a contributing factor in the collapse of the bridge itself. Some examples go into great depth, detailing how particular bridges were sized, the materials, systems and processes used to build them, what caused them to fail, and the various governmental and engineering innovations that followed failure events to mitigate future risk. There is one bridge failure that I explored over a number of books and various online materials, as this event proved pivotal: it acted as a catalyst for governments and engineers to come up with new approaches to the design, regulation and lifecycle management of bridges in the United States.

The Silver Bridge Collapse

At approximately 5pm on Friday, December 15th 1967 the Silver Bridge collapsed into the Ohio River – tragically killing forty-six people.


At the time the bridge was built in 1928, the typical car was a Ford Model T that weighed around 1,600 lbs, and local law prohibited any truck that weighed more than 20,000 lbs. The bridge was designed with a safety factor of 1.5, so it was capable of supporting cars up to 2,400 lbs and trucks up to 30,000 lbs.


As the years passed, vehicles got heavier, and by the sixties, a typical car now weighed 4,000 lbs, with trucks weighing up to 60,000 to 70,000 lbs.  The loads had almost tripled.
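The figures above can be checked with a little arithmetic. The sketch below is illustrative only: it uses the weights quoted in this post, takes the upper end of the quoted truck range, and compares the 1960s loads against the safety-factored capacity. The function and variable names are my own framing, not anything taken from the original bridge calculations.

```python
# Weights in lbs, taken from the figures quoted in this post.
SAFETY_FACTOR = 1.5

design_1928 = {"car": 1_600, "truck": 20_000}
# 1960s loads; the truck figure is the upper end of the quoted range (assumption).
load_1960s = {"car": 4_000, "truck": 70_000}


def capacity(design_weight, safety_factor=SAFETY_FACTOR):
    """Weight the design can carry once the safety factor is applied."""
    return design_weight * safety_factor


def growth(vehicle):
    """How many times heavier the 1960s vehicle is than its 1928 equivalent."""
    return load_1960s[vehicle] / design_1928[vehicle]


def overload(vehicle):
    """1960s load expressed as a multiple of the safety-factored capacity."""
    return load_1960s[vehicle] / capacity(design_1928[vehicle])


for vehicle in design_1928:
    print(
        f"{vehicle}: capacity {capacity(design_1928[vehicle]):,.0f} lbs, "
        f"growth {growth(vehicle):.2f}x, overload {overload(vehicle):.2f}x"
    )
```

Even with the safety factor applied, the heavier vehicles of the sixties sit well beyond what the 1928 design allowed for.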




In the years before its collapse, back-to-back traffic was typical, and the tools and processes used for bridge inspection at the time were not capable of identifying a fracture that had appeared in one of the eye-bars. Stress corrosion and corrosion fatigue were listed as the probable cause, with loading being a major factor that helped collapse the bridge. The following year, then-President Lyndon Johnson and Congress brought into law the National Bridge Inspection Standards, and the United States Department of Transportation enforced tighter regulations around weight limits. The regulations also enforced regular inspection and brought with them the start of a national database holding construction and inspection information for all federal bridges.


Inspection, loading and redundancy are key elements associated with bridge engineering, and indeed they are also crucial elements of modern information technology infrastructure systems.

Taking Dell EMC's VMAX & PowerMax product range as an example (the other products have similar tools), some great tools support the presales sizing function and post-sale performance management.

  • Sizing: VMAX Sizer gives subject matter experts from Dell EMC and selected partners the ability to size a solution based on your business objectives, capacity and workload requirements.
  • Performance Management: Unisphere offers real-time component health, response time alerting, component performance utilization and capacity utilization, and enables proactive decision making so you can make informed decisions about when it's time to scale up or scale out.
  • Maintenance: Secure Remote Services (formerly ESRS) gives subject matter experts from Dell EMC secure access to systems during support activities and regular maintenance.

Talk to your local sales team!

Thanks for reading and please feel free to leave comments or feedback.

Wednesday, May 9, 2018

The Pipersbytes Library: Practical Storage Area Networking

When I got a date for an interview with EMC back in late 2006 I started reading this book I got from Amazon called “Practical Storage Area Networking” by Daniel Pollack, published in 2002. Pollack was a systems administrator with America Online at that time and I only bought the book as I had glanced at the acknowledgements page online and spotted that “EMC Corp.” was mentioned. I thought I should really read this as I’ve got an interview in a few weeks’ time with EMC!



Pollack’s book turned out to be one of the best sources of information I had on storage and workload prior to joining EMC and it still sits on my desk to this day if I need a gentle reminder of the principles he described. They are as useful now as they were then. At the time, I hadn’t read a lot of performance books or articles and storage was only something I’d tinkered with in the form of a few DAS boxes and an occasional login to a CX300. Therefore, what Pollack described in his book was new knowledge and I loved every word of it.

If I were to summarize the salient pieces of chapters two and three from his book, it would go something like this: storage workloads aren’t simply a number of IOPS; these IOPS also have a size, which Pollack called bandwidth (years later, I now prefer the term throughput as opposed to bandwidth when describing the size of moving data). Combined, the number and throughput of IOPS, in conjunction with their read/write ratio, random-to-sequential ratio and cache hit ratio, determine the type and configuration of solution required to move the customer’s workload within a certain response time objective.

By the time I joined EMC in 2007, many companies including Compellent and EMC were well on the way to developing technologies that would take advantage of two other key workload characteristics not covered by Pollack: IO density and skew. I've added these to the summary table below.

Term
Definition
IOPS
  • Workload requests sent to the array, measured as input/output operations per second (IOPS).
  • The number, size and different characteristics of IOPS put pressure on different components of a storage system and, through innovation, we have learned to use some of these characteristics to our advantage.

Throughput (MB/s or GB/s)
(Pollack called it bandwidth)
  • Throughput is the collective size of all IOPS at a point in time, typically measured as megabytes per second (MB/s), or on today's larger platforms gigabytes per second (GB/s).
  • The size of an individual IO is known as its block size, typically measured in kilobytes (kB).
  • I like to call it “the size of moving data”.
  • When modelling a solution the challenge is to find a series of pipes with a large enough diameter to let the moving data through.

Bandwidth

  • The maximum throughput of a device or component; think of this as the maximum diameter of a tube or pipe.
  • Be careful, as this can be measured in megabytes per second (MB/s) or megabits per second (Mb/s). The latter is typically used to describe the maximum capability of a port, lane or bus.


Read-write ratio
  • The proportion of reads to writes.

Random sequential ratio
  • The proportion of random IOs to sequential IOs.

Cache hit ratio
  • The proportion of IOs that are served from cache.

Response time
  • The amount of time it takes a storage system to respond to a request.

Skew
  • A measurement of how capacity is split between active and non-active data.
  • Enables placement of data on different drive technologies for cost and performance optimization, based on the changing temperature of active and non-active data. Data is typically grouped into extent groups or chunks whose size varies depending on the technology used on the primary storage system.

IO density
  • Workload to capacity ratio, typically measured as the number of IOPS to either front-end or back-end capacity.
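To make a few of the definitions in the table concrete, here is a minimal sketch of how throughput, the MB/s versus Mb/s distinction and IO density relate to each other. It assumes a single uniform block size (real workloads are a mix of sizes), uses decimal units throughout, and the workload figures and function names are hypothetical values chosen purely for illustration.

```python
def throughput_mb_s(iops, block_size_kb):
    """Throughput in MB/s (decimal) for an IOPS rate at one uniform block size in kB."""
    return iops * block_size_kb / 1_000


def mb_s_to_mbit_s(mb_s):
    """Convert megabytes per second (MB/s) to megabits per second (Mb/s)."""
    return mb_s * 8


def io_density(iops, capacity_tb):
    """Workload-to-capacity ratio expressed as IOPS per TB."""
    return iops / capacity_tb


# Hypothetical workload: 50,000 IOPS at 8 kB against 100 TB of capacity.
iops, block_kb, capacity_tb = 50_000, 8, 100

tput = throughput_mb_s(iops, block_kb)
print(f"Throughput: {tput:,.0f} MB/s ({mb_s_to_mbit_s(tput):,.0f} Mb/s)")
print(f"IO density: {io_density(iops, capacity_tb):,.0f} IOPS per TB")
```

The same throughput looks eight times larger when quoted in megabits, which is why it matters to check which unit a port or bus specification is using.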


Numeral systems, measurement units and capacity

On September 23rd, 1999, NASA lost contact with its Mars Climate Orbiter (MCO) as it burned up unexpectedly on the day that should have ended in celebration of it entering Mars’ orbit. The failure was due to one team using English units (e.g. inches, feet, and pounds) while the other used metric units (e.g. centimeters, meters, and kilograms) for key spacecraft operations that steered the MCO through space. Instead of putting the MCO into Mars’ orbit, the failure put it on a trajectory too close to the planet, with the result that it burned up in the Martian atmosphere.

This may be one of the most radical examples of what can happen when organizations do not clarify which measurement units are being used but it serves the purpose of highlighting the importance of understanding this critical factor. To draw a parallel with IT, organizations that don’t understand which measurement units have been expressed could end up with too much or too little capacity for data in transit, storage or both.

Numeral and measurement systems have had internationally defined standards as far back as the 19th century. Organizations such as the International Organization for Standardization (ISO) and the Bureau International des Poids et Mesures (BIPM), known in English as the International Bureau of Weights and Measures, maintain and update these in conjunction with many other global organizations. Other organizations such as the International Electrotechnical Commission (IEC) and the Institute of Electrical and Electronics Engineers (IEEE) have worked with the ISO and the BIPM to define specifics for the field of IT.




The standards used in IT for decimal multiples are called the International System of Units (SI) prefixes and, at the time of writing (December 4th, 2017), are documented in the 8th edition (2006) of the International System of Units brochure, available at the following link: https://www.bipm.org/utils/common/pdf/si_brochure_8.pdf#page=127
This edition recognizes that the SI prefixes should not be used when expressing binary multiples. Instead, the prefixes for binary multiples as defined in IEC 60027-2 and by the IEEE should be used in the field of IT to avoid the incorrect usage of the SI prefixes. IEC 60027-2 has since been superseded by IEC 80000-13:2008 (see IEC - SI Zone > Prefixes for binary multiples).

Tables 1 & 2 detail the SI Unit (Decimal) Prefixes and IEC (Binary) Prefixes, including their names, factors, and origins.


Value (No. bytes)          Prefix  Symbol  Name      Factor  Origin
1,000                      kilo    kB      Kilobyte  10^3    thousand
1,000,000                  mega    MB      Megabyte  10^6    million
1,000,000,000              giga    GB      Gigabyte  10^9    billion
1,000,000,000,000          tera    TB      Terabyte  10^12   trillion
1,000,000,000,000,000      peta    PB      Petabyte  10^15   quadrillion
1,000,000,000,000,000,000  exa     EB      Exabyte   10^18   quintillion
Table 1 SI Unit Prefixes (Decimal)




Value (No. bytes)          Prefix  Symbol  Name      Factor  Origin
1,024                      kibi    KiB     Kibibyte  2^10    Kilobinary ~thousand
1,048,576                  mebi    MiB     Mebibyte  2^20    Megabinary ~million
1,073,741,824              gibi    GiB     Gibibyte  2^30    Gigabinary ~billion
1,099,511,627,776          tebi    TiB     Tebibyte  2^40    Terabinary ~trillion
1,125,899,906,842,624      pebi    PiB     Pebibyte  2^50    Petabinary ~quadrillion
1,152,921,504,606,846,976  exbi    EiB     Exbibyte  2^60    Exabinary ~quintillion

Table 2 IEC Prefixes (Binary)


Using the data from Table 1 and Table 2, let’s say we express a new capacity requirement as 512 TB, as opposed to the actual capacity requirement of 512 TiB. 512 TB equals 512,000,000,000,000 bytes, whereas 512 TiB equals 562,949,953,421,312 bytes.

Just like the error that led to the failure of the MCO, in this example using the incorrect measurement unit would result in a shortfall of 50,949,953,421,312 bytes, or 46.34 TiB.
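The 512 TB versus 512 TiB example is easy to reproduce. The following is a small sketch of that calculation; the constant and variable names are mine, and Python's arbitrary-precision integers mean the byte counts are exact rather than rounded.

```python
TB = 10**12   # terabyte: SI (decimal) prefix
TIB = 2**40   # tebibyte: IEC (binary) prefix

requested = 512 * TB    # what was written on the requirement
required = 512 * TIB    # what was actually needed

shortfall = required - requested

print(f"Requested: {requested:,} bytes")
print(f"Required:  {required:,} bytes")
print(f"Shortfall: {shortfall:,} bytes ({shortfall / TIB:.2f} TiB)")
```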
Now, there is some good news for the capacity ranges that are common on today’s storage platforms that may not result in something as catastrophic as the Mars example!  Typical primary storage requirements can be measured in or around trillions of bytes (TB or TiB) with a growing number creeping into the quadrillions (PB or PiB).

At these ranges, the margin of error if the units are mistaken is that you’ll lose or gain roughly 9% to 10% at the tera scale (1 TiB is 9.95% larger than 1 TB) and roughly 11% to 12.6% at the peta scale (1 PiB is 12.59% larger than 1 PB). On its own, this may not necessarily represent a problem, as some (but not all) solutions will provide marginally more than what has been requested to cover things like pool reserved capacities or drive protection overheads. However, if you’re on the side of the shortfall and this gets compounded with an assumption on data reduction technologies that is beyond what the system can do for the workload in question, you may find the solution short on capacity right out of the gate.
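The gap between a decimal prefix and its binary counterpart grows with each step up the scale. As a rough sketch (the function name and the choice to report both a "gain" relative to the decimal value and a "loss" relative to the binary value are my own), the divergence at each level can be computed like this:

```python
# Decimal/binary prefix pairs, from kilo/kibi (step 1) up to exa/exbi (step 6).
PREFIXES = ["kB/KiB", "MB/MiB", "GB/GiB", "TB/TiB", "PB/PiB", "EB/EiB"]


def divergence(step):
    """Return (gain %, loss %) for the prefix pair at the given step.

    gain: how much larger the binary unit is, relative to the decimal unit.
    loss: how much smaller the decimal unit is, relative to the binary unit.
    """
    decimal = 10 ** (3 * step)
    binary = 2 ** (10 * step)
    gain = (binary - decimal) / decimal * 100
    loss = (binary - decimal) / binary * 100
    return gain, loss


for step, label in enumerate(PREFIXES, start=1):
    gain, loss = divergence(step)
    print(f"{label}: gain {gain:.2f}%, loss {loss:.2f}%")
```

Running this shows why the unit matters more as capacities grow: the discrepancy is a couple of percent at the kilo level and keeps widening with every prefix after that.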