YOW! Takeaways: Cultivating Production Excellence

Presenter: Liz Fong-Jones (方禮真) (@lizthegrey)
Seen: 2019-12-12, at YOW! Melbourne 2019
Recording:
- QCon London 2019
- YOW! Melbourne 2019

Many organizations order the “Alphabet Soup” when embracing DevOps, and it rarely works. Our systems are made of computers and humans, so it’s important to ask “what do we want our tools to do”, instead of attempting to fit the process to the tool. A much healthier way to approach DevOps is to focus on investing in people, culture and process.

One of Liz’s key points is that our goal shouldn’t be Production Ownership, it should be Production Excellence: understanding production systems and feeling empowered to improve them. This means designing systems for the humans who operate them, as well as for reliability. It’s important to involve all the stakeholders, and to keep track of how people feel about the systems. Eventually, people should feel confident about:

- Knowing when things are too broken,
- Debugging and restoring systems,
- Collaborating to resolve complex outages, and
- Removing unnecessary complexity.

Understanding when things are too broken is a complex art. If one request in 1,000 fails, does the user even notice? Does a mobile user’s network coverage fail more frequently than that? Our systems are always failing, even at microscopic levels. Define expectations and objectives at the level that only just keeps a user happy. This unlocks smarter prioritisation. Once you have an objective, you can reframe it as a budget: the amount of failure you can tolerate while still meeting the objective. That budget can:

- Define whether something should wake an engineer up at 2am,
- Help pick times to run risky experiments, and
- Signal when the systems need more investment in reliability.
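
The arithmetic of turning an objective into a budget can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the talk: the function names, the 99.9% target (one failure in 1,000), and the traffic figures are all assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests allowed in the window while still meeting the objective."""
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("this objective leaves no room for failure")
    return 1 - failed_requests / budget


def burn_rate(budget_spent: float, window_elapsed: float) -> float:
    """Spend speed relative to an even spend across the window.

    1.0 means the budget runs out exactly when the window ends.
    """
    if window_elapsed <= 0:
        raise ValueError("window_elapsed must be positive")
    return budget_spent / window_elapsed


# Hypothetical window: 1,000,000 expected requests, 250 failures so far,
# measured halfway through the window.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
rate = burn_rate(budget_spent=1 - remaining, window_elapsed=0.5)
print(allowed, remaining, rate)
```

Under these assumptions, a burn rate far above 1.0 is the kind of signal that might justify a 2am page, while plenty of remaining budget suggests room for a risky experiment. The thresholds themselves are a team decision.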

Once you have a view of whether things are broken, the remaining issue is whether people can investigate the systems effectively. Outages are never exactly identical, and failure modes can’t be predicted. The longer it takes to understand the system, or to answer questions you have, the longer it will take to resolve an incident.

For optimal confidence, our services must be “observable”. People should be able to examine events in context, and explain the variance in different scenarios, without attaching debuggers or pushing new code. Can you economically store that data, and have humans investigate it effectively? Can they revert the changes and investigate the issue during business hours?
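
As a rough illustration of examining events in context, the sketch below groups per-request events by one of their fields to see where failures cluster. The event shape, the field names (`build`, `region`, `status`) and the helper `failure_rate_by` are hypothetical, invented for this example rather than taken from the talk or from any particular tool.

```python
from collections import defaultdict


def failure_rate_by(events: list[dict], field: str) -> dict[str, float]:
    """Failure rate per distinct value of `field`, treating HTTP 5xx as failure."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for event in events:
        key = str(event.get(field, "unknown"))
        totals[key] += 1
        if event["status"] >= 500:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hypothetical events, one per request, each carrying its own context.
events = [
    {"build": "v1", "region": "au", "status": 200},
    {"build": "v1", "region": "us", "status": 200},
    {"build": "v2", "region": "au", "status": 200},
    {"build": "v2", "region": "us", "status": 503},
]

by_build = failure_rate_by(events, "build")
by_region = failure_rate_by(events, "region")
print(by_build, by_region)
```

Because each event keeps its context, a new question ("is it the build or the region?") can be answered by grouping on a different field, without pushing new code.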

To plan effectively, it’s important to know the risk factors of your systems. If you know a database will fail at some unknown time in the next year, ask:

- How many users will be affected?
- How much impact?
- Will this cost 50% of the error budget? 0.5%?
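
One way to put numbers on these questions is to estimate the expected "bad minutes" a risk costs per year and compare that with the yearly error budget. The sketch below assumes a time-based budget and a simple cost model (frequency × duration × share of users affected). The function names and all figures are hypothetical, chosen only to show the arithmetic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def budget_minutes(slo_target: float) -> float:
    """Minutes of full outage per year allowed by the objective."""
    return (1 - slo_target) * MINUTES_PER_YEAR


def risk_cost_minutes(
    incidents_per_year: float,
    minutes_to_resolve: float,
    fraction_users_affected: float,
) -> float:
    """Expected user-weighted bad minutes per year for one risk."""
    return incidents_per_year * minutes_to_resolve * fraction_users_affected


def share_of_budget(cost_minutes: float, slo_target: float) -> float:
    """Fraction of the yearly error budget this risk is expected to consume."""
    return cost_minutes / budget_minutes(slo_target)


# Hypothetical: the database fails once a year and takes two hours to restore.
all_users = risk_cost_minutes(1, 120, 1.0)
quarter_of_users = risk_cost_minutes(1, 120, 0.25)

print(f"{share_of_budget(all_users, 0.999):.1%} of a 99.9% budget")
print(f"{share_of_budget(quarter_of_users, 0.999):.1%} of a 99.9% budget")
```

Ranking risks by their share of the budget makes it easier to see which ones deserve engineering effort and which would cost more to prevent than to tolerate.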

If you don’t understand the risk, there’s a chance you’ll spend a lot of effort engineering against something that would have had little impact. Then, there are “higher order” risks. Without enough observability, each incident will take minutes, hours or days longer to resolve. This multiplies every other risk factor! Also, if team collaboration is poor, important information may stop reaching you. For example, if your customer support staff dread reaching out to the engineers, they’ll stop passing information on.

To quote Liz’s closing statement:

Success isn’t about heroism, or all the tools. It’s about the right tools, with the right people/knowledge, and general production excellence.