Redundancy vs Armageddon
![Redundancy vs Armageddon](http://prozesstec.com/cdn/shop/articles/iStock-1224241179.jpg?v=1721440039&width=1100)
Yoooooo, hello there my Redundancy Rangers! My apologies for the long break between posts, but I was on sabbatical diving deep into AI. Today's happenings, though, brought me out of my dark cave, because I legitimately thought that World War III had erupted...let me 'splain!
For my clients I have all sorts of alerts reporting downtime across critical infrastructure providers (think Microsoft, Google, Amazon, etc.), critical communication players (think Verizon, AT&T, T-Mobile, etc.), critical internet tools providers (think Network Solutions, Cloudflare, etc.), and critical non-IT infrastructure (airlines, railways, etc.)...and...well, I could keep going, but essentially if things go boom, I can be at the forefront of what's going on and be a reliable and authoritative source of data for my clients. My reporting systems straddle both the public side of the internet and parts of the Dark Web...and no, I won't show you the architecture or my methodology on the darknet...Nope. Made that mistake once, and people in suits visited my squire who didn't know how to protect himself, and he lost a few grand from getting hacked. Anywho...back to the story!
Early in the AM I started seeing signs of mass American Airlines flights being grounded...I was giddy to see what was happening. If it was just American's systems, it would have been a good read in and of itself...but then I started seeing Delta have the exact same issue...which was a little unnerving. Two major airlines seeing massive server outages at the same time, that's what got my attention. As I was booting my darknet exploring box, I started seeing many other airlines having the same issue.
Oh.
My.
GAWD!
I was certain World War III had just started. My darknet box could not boot fast enough. I was suddenly more awake in the middle of the night than I have ever been. From a security perspective, we know State-sponsored bad actors have attacked our businesses, and we know they have the capability to attack our infrastructure...and if I were a State-funded hacking group and my boss said "it's time"...well, let's just say that in every war since the dawn of time, attacking infrastructure has been a go-to tactic.
Long story short, when I started seeing similar outages in Australia and some other countries, I cooled off a bit. Knowing the hacking groups that have the ability to wage that level of attack, the scale of this one was just too large. Even the best of the best could not pull that off...not unless they all managed to work together, and that won't be happening for a long, long time...This had to be an inside job!
So let's get back to what happened. This issue was multifaceted...truly many people were involved in this worldwide mess...but let's start at the head!
Firstly, this was an operational error by CrowdStrike, a leading security provider. In a nutshell, a bad update was pushed to production. All production systems...all customer-owned landscapes. Let's face it, whoever made cuts to the QA team at CrowdStrike really needs to hire some of those team members back.
We can guess as to how this happened...a lack of QA testing, plus a push from management to "get it out the door"...maybe a senior project manager citing the Pareto Principle, sprinkled with a dash of IT leadership failing to push back when something might not be ready for prime time. The only thing that would make this worse is if someone cut the QA budget proclaiming that AI was the answer and this was the result. Whatever it was ultimately, we can be assured it was a lack of process and policy that allowed that update to leave the door half-baked.
The second major flaw I have seen, and the one that determined how long the issue affected people, is a lack of proper backups. This again can come down to a lack of resources from the business, but we really need to focus on this part a little more. Having a decent backup process and actually testing those backups made the difference between well-funded IT teams who were back up in a couple of hours, those who are still fighting it as I write this article, and those who will be fighting to get their systems up over the weekend.
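To make "testing your backups" concrete, here is a minimal sketch of the idea. Everything in it is a placeholder assumption on my part: the backup path, the sandbox name, and the `restore-tool` CLI all stand in for whatever your backup vendor actually ships. The point is the shape of the exercise, not the exact commands.

```python
import hashlib
import subprocess
from pathlib import Path

# Hypothetical backup image and restore target -- swap in your own tooling and paths.
BACKUP_IMAGE = Path("/backups/nightly/app-server-latest.img")
RESTORE_SANDBOX = "restore-test-vm"  # an isolated sandbox, never production

def checksum(path: Path) -> str:
    """Hash the backup image so we know it wasn't truncated or corrupted at rest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_and_verify() -> bool:
    """Restore into a sandbox and smoke-test it -- the only real proof a backup works."""
    print(f"backup checksum: {checksum(BACKUP_IMAGE)}")
    try:
        # 'restore-tool' is a stand-in for whatever CLI your backup vendor provides.
        restore = subprocess.run(
            ["restore-tool", "--image", str(BACKUP_IMAGE), "--target", RESTORE_SANDBOX],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        print("restore tool not found -- wire in your vendor's CLI here")
        return False
    if restore.returncode != 0:
        print("restore FAILED:", restore.stderr)
        return False
    # Minimal smoke test: does the restored box even answer on the network?
    smoke = subprocess.run(["ping", "-c", "3", RESTORE_SANDBOX], capture_output=True)
    return smoke.returncode == 0

if __name__ == "__main__":
    ok = restore_and_verify()
    print("backup is restorable" if ok else "backup is NOT restorable -- fix it before you need it")
```

If a script like that runs on a schedule and somebody actually reads the result, you find out your backups are broken on a quiet Tuesday instead of during an Armageddon weekend.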
The third major flaw is architecture. Long story short, every layer should be replaceable from a properly executed backup. I have heard of some rather large enterprises, still fighting this issue, who have the DB on the same functional layer as the OS...come on now. Doc Holligray is sad...If you can't quickly recover the OS from backup without worrying about your data, then you have failed in your architecture planning. Every layer should be able to be handled safely on its own without affecting the other layers.
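Here is a toy sketch of what "layers you can handle on their own" means in practice. All of the names and restore actions are made up for illustration; in a real environment they would call your imaging, config-management, and database tooling. The thing to notice is that rebuilding the OS layer never touches the data layer.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Layer:
    name: str
    restore: Callable[[], None]  # each layer knows how to rebuild itself, alone

# Hypothetical restore actions for each layer of the stack.
def restore_os() -> None:
    print("re-imaging OS from golden image (data volumes untouched)")

def restore_app() -> None:
    print("redeploying application from the artifact repository")

def restore_data() -> None:
    print("restoring the database volume from its own backup chain")

STACK = [
    Layer("os", restore_os),
    Layer("application", restore_app),
    Layer("data", restore_data),
]

def rebuild(only: set[str] | None = None) -> None:
    """Rebuild the whole stack, or just the layers named in 'only'."""
    for layer in STACK:
        if only is None or layer.name in only:
            layer.restore()

if __name__ == "__main__":
    # The CrowdStrike scenario: a kernel-level update broke the OS layer,
    # so we rebuild that one layer and leave the data alone.
    rebuild(only={"os"})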
The fourth major flaw is a lack of update rings...nuff said. Too much faith is put in SaaS providers, and let's face it, until there are no more humans working at SaaS providers, there will be errors. Let's get back to the days when we tested things before rolling them out to our own systems. Offloading all IT responsibilities to outside parties leads directly to this!
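For anyone who has never set up update rings, here is a minimal sketch of the concept. The ring names, host names, version string, and the soak period are all hypothetical, and the deploy and health-check functions are placeholders for your real tooling. The idea is simply that an update hits a small canary ring first, has to survive a soak period, and only then moves outward.

```python
import time

# Hypothetical rings -- a handful of lab boxes first, production last.
RINGS = {
    "ring0-canary": ["it-lab-01", "it-lab-02"],
    "ring1-early": ["branch-app-01", "branch-app-02"],
    "ring2-broad": ["prod-app-01", "prod-app-02", "prod-app-03"],
}
SOAK_SECONDS = 5  # in real life this is hours or days, not seconds

def deploy(host: str, version: str) -> None:
    print(f"pushing {version} to {host}")  # placeholder for your real deployment call

def healthy(host: str) -> bool:
    print(f"health-checking {host}")       # placeholder for your real monitoring query
    return True

def staged_rollout(version: str) -> None:
    """Roll out ring by ring; stop the moment any ring fails validation."""
    for ring, hosts in RINGS.items():
        for host in hosts:
            deploy(host, version)
        time.sleep(SOAK_SECONDS)  # let the update soak before judging it
        if not all(healthy(h) for h in hosts):
            print(f"halting rollout: {ring} failed validation")
            return
        print(f"{ring} looks good, expanding to the next ring")
    print("rollout complete")

if __name__ == "__main__":
    staged_rollout("agent-7.16.1")  # made-up version string
```

Had a ring like ring0 existed between the vendor and the whole fleet, a bad update would have taken down two lab boxes instead of every machine you own.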
I can keep going, but let's start discussing remediation. First off, here is a link to the fix from CrowdStrike:
How to fix CrowdStrike Blue Screen error on Windows 11 | Windows Central
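If you just want the gist of the manual workaround that guide walks through, it comes down to booting into Safe Mode (or the recovery environment) and removing the faulty channel file from the CrowdStrike driver folder. Here is a rough sketch of that cleanup step; the path and file pattern come from the publicly documented workaround, but treat this as illustration only and follow CrowdStrike's official guidance for your environment.

```python
from pathlib import Path

# Directory and file pattern from the publicly documented workaround;
# confirm against CrowdStrike's official guidance before deleting anything.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_CHANNEL_PATTERN = "C-00000291*.sys"

def remove_bad_channel_files(dry_run: bool = True) -> None:
    """List (and optionally delete) the faulty channel files. Run from Safe Mode
    with admin rights; keep dry_run=True until you are sure what it will remove."""
    matches = list(DRIVER_DIR.glob(BAD_CHANNEL_PATTERN))
    if not matches:
        print("no matching channel files found -- nothing to do")
        return
    for f in matches:
        if dry_run:
            print(f"would delete: {f}")
        else:
            f.unlink()
            print(f"deleted: {f}")

if __name__ == "__main__":
    remove_bad_channel_files(dry_run=True)
```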
This fixes the current issue, but going forward, what should we do?
First off, let's all put down our pitchforks for a second and realize that this is not entirely a CrowdStrike problem. It's an industry-wide problem driven by capitalism's constant search to squeeze more and more productivity out of fewer and fewer producers. Until we solve that conundrum, we will keep seeing this problem...CrowdStrike was just the next victim in a long line of victims of corporations pushing unsafe practices to save a penny.
Getting rid of CrowdStrike as a knee-jerk reaction is an act for the greatest of fools...all you are doing is moving the risk to a player that hasn't learned its lesson yet...and worse, the core issue will still be there. Until we can convince capitalism to change its stripes, all we can do is minimize the risk as much as we can. Given that we can only affect our own systems and have zero control over our vendors' SaaS offerings, the way I have solved this in systems I have designed that can accept zero downtime is stolen from an ISO 26262 functional safety requirement: redundancy.
In cars we have redundancy upon redundancy...what some non-automotive folk see as waste is actually a safety feature. If one system fails, the goal is to still allow the car to safely complete its journey...having all your communications travel over one communication method is short-sighted, removes that redundancy, and only egomaniacs would think otherwise.
I have worked on some rather large systems, and I have on my resume something I call Hyper Redundant Architecture. I only call it that because my wife made fun of the first name I gave it, "I.DARE.ANYONE.TO.TRY.AND.BREAK.THIS.THING.com", as too stupid...What Hyper Redundant Architecture means is having your compute stacks on different technologies. At a simple level, let's say you have all your eggs in Amazon's basket...well, put some of those eggs into Microsoft's or Google's basket as well. You can use global load balancers to control the flow of data and sync databases on the back end...and in this case, diversify your software stacks by using different vendors.
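To make that a little less abstract, here is a toy sketch of the front-door piece: a client that health-checks the same service hosted on two independent providers and falls back when the primary stops answering. The URLs are made-up placeholders, and in a real deployment a global load balancer or DNS failover would do this job rather than application code.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints for the same service hosted on two different providers.
ENDPOINTS = [
    "https://api.primary-cloud.example.com/health",
    "https://api.secondary-cloud.example.com/health",
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """A provider counts as healthy if its health endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

def pick_endpoint() -> str | None:
    """Walk the list in priority order and return the first provider still standing."""
    for url in ENDPOINTS:
        if is_healthy(url):
            return url
    return None  # every provider is down -- now it really is Armageddon

if __name__ == "__main__":
    target = pick_endpoint()
    print(f"routing traffic to: {target}" if target else "no healthy provider found")
```

The same thinking applies at every layer: two clouds, two database replicas kept in sync, two security vendors across different parts of the fleet, so that no single vendor's bad day becomes your bad day.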
Long story short, if you have a mission-critical system that has to be up 100% of the time, build in not only hardware redundancy, but also redundancy in compute layers, SaaS providers, and as many other layers as can be achieved. You will increase cost a little, as well as overall complexity...but you will also rarely go down.
I don't recommend this for smaller companies, but if you are at risk of losing hundreds of thousands of dollars a minute, then maybe spending a few bucks here and there for a little more redundancy, adhering to a more robust ISO 26262 functional safety standard normally used for mission-critical things like the safety of your family in a car, is worth it...the added cost might just one day pay for itself. You can ask those who are still dealing with this how they feel about it once they are done slamming down Redbulls/Monsters and can get a nap after all the firefighting practice they will get this weekend.
Good luck all of you still fighting this...My heart is with you.