The DevOps movement has not only influenced the tools we use in modern development and operations engineering, but also how we work. As part of how we work, DevOps has changed how we respond when systems inevitably stop working or don't work as expected. Although "blameless" post-mortems are commonly espoused in the DevOps community, they often only tell us what not to do (blame), rather than focussing on what we should do. This session will provide methods and techniques for gathering information and effectively using that information to avoid and mitigate failure in the future.
I'll cover best practices for gathering systems-related data, including monitoring and logging. This session will also cover practices for gathering and recording people-related data; including methods we can adopt from police, accident investigators, and other safety management professions to learn the most from incidents.
After discussing how to gather data, I'll discuss how we can use the data to formulate actionable response plans and how to adjust existing organizational practices to avoid repeating failure.
I plan to keep the technical portions of this talk at a novice level so that it's accessible to both developers/engineers and those in non-technical roles who will be involved in incident response.