Enterprise Performance: over budget, over extended, under prepared. Sound familiar?

Rafal Los and I are always referring to one another as the step-brothers of the non-functional testing world, meaning that Performance and Security most often don't fare well when compared to our more popular sibling, Functional Testing. I once challenged Raf that we could take just about any of his blog entries, replace the word "security" with the word "performance," and it would have almost exactly the same meaning, call to action, and value as his original post. (And if the title of this blog entry rings a bell, then you're probably also following Raf's blog The Wh1t3 Rabbit.)
Here is one of Raf's recent blogs on Enterprise Security, which I have changed just slightly so it reads as a blog entry about Enterprise Performance:
Earlier today at the Cleveland, Ohio Information Performance Summit my friend and colleague Jim Smith (the CPO for Diebold Corp) presented on a topic that seemed to validate many of the things I've been saying lately ...quite frankly, that was a bit of a relief.  Jim's a bit of a superstar in the System Performance world ...one of the youngest executives I know, with a razor-sharp wit.  You see, Jim's presentation today was on how the majority of enterprises are simply not built to thwart performance outages.
While the average Enterprise Performance organization keeps increasing capital and operational spending on performance... all those blinking lights in the closets and that vast trove of tools generate lots of interesting data, but what gets done with that data?  Furthermore, the more agents we drop onto desktops, laptops, and mobile devices, the slower they get and the louder the user backlash.  In reality - how much additional benefit does an enterprise get from each newly deployed agent?  It's tough to tell ...but the value doesn't increase the way we'd like.
So now we're faced with a precarious situation.  The technology investments performance organizations are making aren't returning the kind of benefit they expect, and the user backlash is growing, all while it's getting harder and harder to manage all these consoles, dashboards, boxes, and tools ...what to do?!
I think we can all agree the answer isn't more randomly placed technology, for sure.  So what then? 
I've been talking and starting conversations about Enterprise Performance and exactly what it means to organizations in this position, and concepts like EPI (Enterprise Performance Intelligence) ... all of which contribute to what I think is a higher state of performance awareness and responsiveness.  It's not how well you fortify the (virtual) castle walls anymore, since those walls have all but disappeared ...but rather how prepared you are for when the performance issue shows up inside your keep and starts degrading your precious response times.
Jim and I diverge slightly, though: I think post-outage is one of the worst times to think about how you're going to build up a performance program for your enterprise ...as the decisions made at that time tend to be hasty, poor, and often forced.  When your organization's house is on fire, the pressure's on to put out the fire immediately, rather than to worry about long-term sustainability and strategic thinking.  I think the best time to formulate a strategy is pre-outage, when you've got a rational, clear-thinking head on your shoulders.  Unfortunately, this is often the time when you probably won't have the funds ...details, details.
So let's get back to your enterprise, and what you can actually do to protect the company, customers, and partners from inferior performance ...
  • As with any battle plan, segment your defensive strategy into risk categories.  General performance issues (some of us call this background noise) should get one type of defensive strategy, while focused, targeted attacks need their own.  If you don't have both, you're only defending against someone running a background job ...or that "cloud-based compliance scanning service" which crawls your IP space during a regularly scheduled maintenance window.  These aren't serious threats, much like the 'random batch job' you're stopping with that ancient piece of APM software.
  • Treat your peak-hour performance issues differently by focusing more of your attention there.  The mitigation strategy for the peak outage (or spike issue) is quite different from the one for nominal threats, and involves first knowing your assets, then being able to clearly understand data movements, users, and business processes.  This is often quite difficult - but necessary.
  • Prepare for failure in the face of the inevitable performance issue.  Once you've made your peace with the fact that an outage will eventually bring down your critical assets, you actually feel better and can think clearly about what happens next.  How does your performance incident response organization mobilize, what resources and data do you have available at a moment's notice, and how prepared are you to disrupt critical business services to stop the bleeding?  These are all things that you must not only plan out but also try at regular intervals, like disaster recovery drills.  Anyone who's been through this will agree that planning for incident response and actually surviving a critical situation are completely different animals.
  • Correlate your performance technologies, as much as possible, as often as possible.  Taking information from your application logs, server diagnostics, network monitoring, anti-malware, and other systems and plugging it into a central system to perform advanced analytics - I'm talking way, way beyond traditional PIEM here - is what will save you.  Think outside the performance and scalability bubble and incorporate things like access control, application logging and behavioral analysis, and system and network logging to track how users and systems interact with each other, and when deviations occur which warrant your immediate mobilization.
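The "know your assets" advice in the second bullet is easier to act on with even a trivial inventory. Here is a minimal sketch, purely illustrative: the registry, the asset names, and the `peak_watchlist` helper are all hypothetical, not anything from Jim's or Raf's material.

```python
from datetime import time

# Hypothetical asset registry: knowing what you run, when it peaks, and how
# critical it is, is the prerequisite for a peak-hour mitigation strategy.
ASSETS = [
    {"name": "checkout-api", "criticality": "high",
     "peak": (time(11, 0), time(14, 0))},
    {"name": "batch-report", "criticality": "low",
     "peak": (time(2, 0), time(4, 0))},
]

def in_peak(asset, now):
    """True if `now` falls inside the asset's declared peak window."""
    start, end = asset["peak"]
    return start <= now <= end

def peak_watchlist(now):
    """High-criticality assets currently inside their peak window."""
    return [a["name"] for a in ASSETS
            if a["criticality"] == "high" and in_peak(a, now)]
```

Even a registry this small forces the conversation the bullet calls for: which services matter, and when.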
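The central-analytics idea in the last bullet can be sketched very simply: pull readings from several sources into one place and flag deviations against each source's own baseline. This is a toy illustration, not a real PIEM; the source names, sample values, and the z-score-style threshold are all assumptions of mine.

```python
from statistics import mean, stdev

# Hypothetical response-time history (ms) per telemetry source; in a real
# deployment these would stream in from app logs, server diagnostics,
# network monitors, and so on.
BASELINES = {
    "app_log": [110, 115, 108, 112, 109, 111, 113, 107],
    "network": [95, 97, 96, 94, 98, 96, 95, 97],
}

def deviations(baseline, current, threshold=3.0):
    """Return readings more than `threshold` std-devs above the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return []
    return [r for r in current if (r - mu) / sigma > threshold]

def correlate(latest):
    """Check each source's latest readings against its own baseline,
    keeping only sources that show a deviation worth mobilizing for."""
    return {src: hits
            for src, readings in latest.items()
            if (hits := deviations(BASELINES[src], readings))}

alerts = correlate({"app_log": [114, 640], "network": [96, 97]})
```

A real system would correlate across sources and time windows rather than per-source, but the shape is the same: one place where all the telemetry lands, and an explicit definition of "deviation" that triggers mobilization.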

Remember, bad things will happen.  You will experience an outage, catastrophic performance incident, or failure of some type ...unless you're simply too ignorant to know the difference, in which case I can't help you.  The incident is important, but more important than the cause of the performance issue is how you, your organization, and your business respond.  Be prepared for the outage, test your response plan, and get real with your Enterprise Performance.
___________________________________________________
Out of respect for the original author and composition, here is the link to Raf's excellent blog post on Enterprise Security: "Enterprise Security - Over Budget, Over Extended, Under Prepared".