On-call support sucks; here’s how to make it suck less

    
It was the week after Thanksgiving, around 1 a.m., and the company was suffering the equivalent of a heart attack in its order-processing queue. 

There were already more than 100,000 backed-up orders, and every 10 minutes, the systems dumped roughly 1,000 more orders on top of the already clogged arteries. 

As ridiculously bad luck would have it, Jacob Mages-Haskins — a seasoned software engineer who went on to join Contrast Security as Staff Software Engineer in March 2022 — was the poor sod who wound up on rotation for on-call support that night.

This was only one of countless nightmares that make on-call support utterly suck for engineers. In this particular case, it turned out that some worker processes were generating queries that locked the database, which led to the orders blocking each other from finishing.

Getting stuck on on-call support can be a nightmare for engineers for a number of reasons, such as the engineers never having seen a particular issue before — a typical scenario for technicians who get plucked out of their comfort zone, pulled from their day-to-day focus and plugged into “Fix This, Fix That, Fix Anything and Everything!” mode. 

In a Code Patrol podcast chat, Mages-Haskins recounts how staff finally figured out that it was the post-holiday increase in online traffic that brought the underlying issue to light and caused the systems to spaz out.  

Please do not sweat. Before you start reliving your own support-call PTSD, be assured that things don’t have to be this bad. There are valuable processes that can help the luckless engineering bods who wind up on-call when the systems go bananas and the customers flood the business with their “WTH??” and “Help me!!!!” cries of despair. 

Obliterate on-call nightmares with instrumentation

These processes, which Mages-Haskins was kind enough to outline in our podcast chat, aren’t just meant to save support staffers’ sanity (though that’s a big plus). They’re meant to save the business from swirling down the drain. That post-Thanksgiving, early morning incident is a case in point: It had what Mages-Haskins described as a “real negative impact on the business.” 

If he and his team couldn't figure out an engineering solution to the problem, “it was going to turn into a customer service nightmare,” he said. “It was a tough day for me.”

The Log4Shell remote code execution (RCE) vulnerability disclosed in the Log4j Java logging library was a similar nightmare, he notes: “We had to scramble that day to assess how exposed were our systems to this risk, and we needed to triage: ‘What changes do we need to do to protect ourselves from it?’ And how quickly could we make those changes?”

Some of the practices that Mages-Haskins has seen help in these often frenzied situations:  

  • System instrumentation. Instrumenting products and production systems can provide real-time information and real-time alerts. Instrumentation might watch website traffic, database queries or resource usage, such as CPU usage. “The alerts can give engineers advance notice that, ‘Hey, this system is getting into danger,’” Mages-Haskins says.
  • Quality gates. Tools at this level try to catch problems in the source code before they ever get to production. They analyze the source code for quality metrics such as complexity or duplication, he explains. They can also scan code for issues such as poor formatting or for code blocks that are known to perform poorly. Left uncaught, such problems can turn into incidents and cause headaches for the engineers who are on on-call duty, Mages-Haskins says.
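
To make the first practice concrete, here is a minimal sketch in Python of the kind of check an alerting rule might run against a queue-depth metric. It is an illustration only, not something described in the podcast: the `Sample` type, the `check_backlog` function and both thresholds are hypothetical, and a real setup would pull samples from whatever monitoring system the team already uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    minute: int        # minutes since monitoring started (hypothetical unit)
    queue_depth: int   # orders waiting to be processed


def check_backlog(samples: List[Sample],
                  max_depth: int = 5_000,
                  max_growth_per_min: float = 50.0) -> Optional[str]:
    """Return an alert message if the queue is too deep or growing too fast.

    Both thresholds are made-up defaults; tune them to the real system.
    """
    if not samples:
        return None
    latest = samples[-1]
    # Absolute threshold: the backlog is already dangerously large.
    if latest.queue_depth >= max_depth:
        return f"CRITICAL: queue depth {latest.queue_depth} >= {max_depth}"
    # Trend threshold: the backlog is still small but climbing quickly.
    if len(samples) >= 2:
        first = samples[0]
        elapsed = latest.minute - first.minute
        if elapsed > 0:
            growth = (latest.queue_depth - first.queue_depth) / elapsed
            if growth >= max_growth_per_min:
                return f"WARNING: queue growing by {growth:.0f} orders/min"
    return None


# A backlog growing by 1,000 orders every 10 minutes trips the trend check
# long before it reaches the absolute limit.
print(check_backlog([Sample(0, 200), Sample(10, 1_200)]))
```

Checking the growth rate as well as the absolute depth is what gives the "advance notice" the quote describes: the warning fires while the backlog is still small enough to deal with calmly.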

That second layer is, in fact, where Contrast Security’s product suite “really, really shines,” he notes. Cybersecurity vulnerabilities are “definitely a source of engineering on-call incidents,” Mages-Haskins said. Log4Shell was a case in point, as any on-call support staffer who suffered through remediation-intensive weekends around that December 2021 incident will tell you.

“With [The Contrast Secure Code Platform] … we can scan project libraries [such as the relevant Log4j library] or even your project's custom code for these cybersecurity risks,” he said. 

“And there's a few ways that an engineering team can accomplish this [by implementing] these tools,” he added. “The team, for example, could use GitHub Actions to add Contrast to a link for their [Continuous Integration/Continuous Deployment (CI/CD)] pipelines. Or they could get pre-commit hooks that scan their code before their code changes or even [gets] committed to the remote source code repository.

“And honestly, with Contrast’s CodeSec [free code scanning tool], developers can start using all of these enterprise-grade tooling resources to find cybersecurity vulnerabilities for free,” he continued. “It's amazing to me that we offer these things for free because in my day-to-day work, I use CodeSec. It’s actually helped my team catch problems with some of the third-party libraries that we use, and it catches them before the problems are introduced to our production code base.”
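
For a feel of what a library check in a pre-commit hook or CI step does, here is a deliberately tiny Python sketch. It is not how Contrast's tools work and it uses none of their APIs. It is a hand-rolled illustration that flags `log4j-core` versions older than 2.15.0, the first release to address Log4Shell (CVE-2021-44228). The `group:artifact:version` input format and the function names are assumptions made for this example.

```python
import re
from typing import Iterable, List, Tuple

# First log4j-core release that addressed CVE-2021-44228 (Log4Shell).
# Simplification: this ignores backported patch releases and the follow-up
# CVEs fixed in later versions; a real gate should use a maintained
# vulnerability database rather than a hard-coded number.
FIRST_FIXED = (2, 15, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.14.1' into (2, 14, 1); suffixes such as '-beta9' are dropped."""
    return tuple(int(n) for n in re.findall(r"\d+", text.split("-")[0]))


def find_vulnerable(dependencies: Iterable[str]) -> List[str]:
    """Return 'group:artifact:version' entries that pin an affected log4j-core."""
    flagged = []
    for dep in dependencies:
        parts = dep.strip().split(":")
        if len(parts) != 3:
            continue  # skip lines this sketch does not understand
        _group, artifact, version = parts
        if artifact == "log4j-core" and parse_version(version) < FIRST_FIXED:
            flagged.append(dep.strip())
    return flagged


def gate_exit_code(dependencies: Iterable[str]) -> int:
    """Return 1 (block the commit or build) if anything is flagged, else 0."""
    flagged = find_vulnerable(dependencies)
    for dep in flagged:
        print(f"blocked: {dep} is affected by CVE-2021-44228")
    return 1 if flagged else 0


deps = [
    "org.apache.logging.log4j:log4j-core:2.14.1",
    "org.apache.logging.log4j:log4j-core:2.17.1",
    "com.example:widgets:1.0.0",  # hypothetical library
]
print(gate_exit_code(deps))
```

A hook or pipeline step would pass the project's resolved dependency list to `gate_exit_code` and use the result as its exit status, since a non-zero exit is what stops a commit or fails a build. Dedicated scanners cover far more than one hard-coded rule, which is the point of reaching for them instead.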

Have a listen to the podcast to hear more from our eating-our-own-dogfood engineer, and check out a demo of Contrast Secure Code Platform while you’re at it. 

 

Lisa Vaas, Senior Content Marketing Manager, Contrast Security

Lisa Vaas is a content machine, having spent years churning out reporting and analysis on information security and other flavors of technology. She’s now keeping the content engines revved to help keep secure code flowing at Contrast Security.