Building Site Reliability Engineering: A Crash Course
From Wikipedia: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.
Over the past year, Acquia has built its own SRE team to help our products and services scale with demand from our growing number of customers. We want to share our experience so that others can do the same and reap the rewards.
This presentation will discuss how the SRE team came about at Acquia, what we have achieved so far, and the lessons we have learned along the way. We will then walk through the steps for introducing SRE at your workplace so you can deliver more reliable and scalable services to your customers. We will specifically cover:
- SRE's basic concepts and its origins at Google
- The management support you will need to get started
- Introducing the idea of service level objectives and error budgets
- Operational Responsibility Assessments as a tool to measure risk
- Creating a Launch Readiness Checklist to standardize and improve product launches
- Finding ideal candidates for your SRE team
The intended audience is software engineers, system administrators, and managers who want to improve how they work and how their products and services perform.
Site Reliability Engineering: How Google Runs Production Systems, and The Practice of Cloud System Administration, Volume 2