April is here, and the end of the semester will be upon us before we know it. It was great to see so many of you in person at our spring all-hands meeting in March—I look forward to doing it again this fall.

More than impending finals, it’s the April 8 total solar eclipse that has captured my attention for the past few weeks. I’m hoping that it won’t be overcast the whole time, but it promises to be a spectacular event regardless. If you plan on going outside for some viewing, make sure to do so safely! If you plan on driving to get inside the path of totality, the Washington Post created a great playlist for the road trip (and yes, Bonnie Tyler is on it).

In last month’s newsletter, we discussed the last of the three themes I outlined for our year: process improvement through automation. This month, we’ll follow that up by considering a closely related idea: blame-free culture.

Blame-free Culture

Last month we talked about the importance of process improvement and automation to achieve our goals and advance our major projects. Both concepts fit inside the broader idea of a culture of continuous improvement, and we learned that experimentation is the way that improvements are identified and selected. But what do we do when those experiments fail?

Failures in our professional lives are tough to swallow. Whether it’s a failed experiment, a failure in our production systems and services, or a failure in a project deliverable, we are trained to expect that failures are damaging to our professional reputation, and possibly even fatal to our careers (every CISO or CIO has heard jokes about a “career-ending event”, trust me). Because these cultural expectations around failure surround us, many of us simply assume that failure is shameful. As a result, when failures happen, our tendency is to misdirect, obfuscate, and hide the impact as much as possible.

Ironically, that behavior is exactly the opposite of what will produce a culture of continuous improvement and learning. Failures are opportunities for growth, but we cannot learn from our mistakes if we don’t have a full accounting of what happened during the failure. A full and complete timeline of the events that led up to and caused the failure is critical to ensure that the team learns the right lessons. That means that we need openness and transparency after a failure, not evasion and concealment. For this to happen, each employee must feel truly safe enough to give a complete accounting of their actions, even if a mistake was made. This is the basis of a blame-free culture.

The reason behind adopting a blame-free approach is simple: it is more important for the team to grow than to punish an individual mistake. That’s it. I believe that everyone in our team wants to do their best work—they are not trying to sabotage our efforts. When mistakes happen, nine times out of ten no one will feel worse about it than the engineer or analyst who made the mistake, so there’s no benefit in punishing them further. On the contrary, that individual is probably the least likely person in the entire organization to make that same mistake a second time.

Improvements require experiments and a willingness to try new things. Experiments mean there will inevitably be some failures along the way, or unintended consequences we didn’t foresee. But failures should be viewed as opportunities for learning and growth, not occasions for blame. Each failure contains a lesson to make the next experiment better. Here’s the takeaway:

  • Culture of continuous improvement → culture of experimentation
  • Culture of experimentation → accepting failed experiments
  • Culture of continuous improvement is a culture that embraces failure as evidence of positive growth

Technology Services has implemented a Root Cause Analysis Board (RCAB) that regularly reviews the after-effects of outages and process failures. The goal is to ensure that, as an organization, we learn from our mistakes and disseminate that learning back throughout the org, so we don’t make the same mistakes more than once. The basic principles of blame-free culture are baked directly into this process: learning is the primary objective.

If you want to read more about blame-free culture and why failures happen in complex systems, I recommend Beyond Blame by Dave Zwieback.

IT Staff Advisory Committee

You might recall this new group being mentioned at the March Ed Talks, when the elected members were announced. The Information Technology Staff Advisory Committee (ITSAC) is a new group that was created by ITAC to “foster open communication between Technology Services staff and leadership to enhance the overall work environment and culture”. The group was set up to have proportional representation from each of the major verticals in Technology Services, along with two at-large representatives. We have two of our team members on the committee:

  • Kyle Levenick is the representative for IT Security & Risk
  • Shem Miller was elected into one of the at-large seats

Reach out directly to Kyle or Shem if you have ideas about how to improve communication or culture, or if you see any broader issues that you think need to be addressed by leadership.

Wins & Successes

  • We continue growing the number of systems sending telemetry into Elastic, and the numbers are truly staggering: 2.25 billion log entries per day are currently collected into our SIEM. That’s about 94 million events per hour, 1.6 million per minute, and 26 thousand per second!

  • An AWS network firewall was deployed in February with policies that block threat signatures, network traffic from Belarus and Russia, and a list of over 250 prohibited domains. This is a significant step towards bringing our cloud resources into compliance with the boundary protection controls required by NIST SP 800-53.

  • Working with the AIP team inside IT Engineering, we successfully soft-launched Kion in March. This means that more cloud accounts are continually monitored for common security misconfigurations, which in some cases can be automatically remediated. Kion also provides cloud customers with a single dashboard for cloud spending and budgeting.

  • A case study written by the Axonius marketing team on Texas A&M’s implementation of Axonius for IT asset management was published on the Axonius website: How Texas A&M Gained Insight and Clarity with Axonius. Kyle Levenick and Adam Mikeal presented on the same topic at this year’s State of Texas Information Security Forum.

Security by the Numbers

📈 Just in the last month:

  • 96.2% of all network connections from the internet blocked at the firewall
  • 42.3B cyber attacks and malware blocked
  • 143 petabytes of network data scanned
  • 30k computers monitored, with 4.6B endpoint processes analyzed
  • 98.2M mail messages scanned for spam, phishing, viruses; 58.7M messages blocked at gateway
  • 2.5M auth events with Duo recorded across 289k active NetIDs
  • 291k devices tracked in the IT asset management system

Wrapping Up & Reminders

I’m sure that you have all read about the supply-chain attack that compromised xz Utils with a backdoor. Ars Technica has a phenomenal article by Dan Goodin that breaks it down and shows just how sophisticated and planned this attack was (it started three years ago). There’s an infographic in the article that really shows the scale, and illustrates why this had the potential to be so much worse than the SolarWinds event from 2020. It seems almost miraculous that it was discovered at all: the Microsoft engineer who uncovered it only noticed because his SSH connections initiated from the command line went from taking about 0.2 seconds to 0.7 seconds. Details matter!

As always, thank you all for your hard work and dedication. I depend on you to share your ideas and suggestions with me, and I encourage you to schedule a meeting with me at any time if you want to talk (it doesn’t have to be about work!).

 

Adam Mikeal

Associate Vice President and Chief Information Security Officer