Automation and Machine Learning with Site Reliability Engineering

ricardoamaro

"Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems."

In this session we focus on three of the main open questions:

  • How do we automate the repetitive tasks that only generate toil and that no one wants to do?
  • How do we look at data and predict what is going to happen to our systems in the future?
  • How do we reinforce "applying software engineering to an operations function"?

The automation of operational processes is a critical target we pursue. As artificial intelligence (AI) and machine learning improve, the number of tasks that can be automated increases. By keeping historical data, we can programmatically react to something new, fix the issue, and be alerted to what is going to happen, instead of having someone manually analyze past results and try to predict the future.
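As a minimal sketch of what "alerting on what is going to happen" could look like, the Python below fits a straight line to historical usage samples and estimates how many hours remain before a threshold is crossed. It is not taken from the presentation's code; the function names `hours_until_threshold` and `should_alert`, the sample data, and the 72-hour alert horizon are all hypothetical, and a plain least-squares trend stands in for whatever model a real system would use.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    hours: Sequence[float],
    usage_pct: Sequence[float],
    threshold: float = 100.0,
) -> Optional[float]:
    """Estimate hours left until usage reaches `threshold`.

    Fits an ordinary least-squares line to (hours, usage_pct) and
    extrapolates it. Returns None when the trend is flat or falling,
    because such a line never reaches the threshold from below.
    """
    n = len(hours)
    if n < 2 or n != len(usage_pct):
        raise ValueError("need at least two samples of equal length")

    mean_x = sum(hours) / n
    mean_y = sum(usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, usage_pct))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None

    crossing_time = (threshold - intercept) / slope
    # Time remaining is measured from the most recent sample.
    return max(0.0, crossing_time - hours[-1])


def should_alert(remaining: Optional[float], horizon_hours: float = 72.0) -> bool:
    """Alert when the predicted crossing falls inside the horizon (assumed value)."""
    return remaining is not None and remaining <= horizon_hours


if __name__ == "__main__":
    # Hypothetical disk usage: one sample per hour, growing 2% per hour.
    sample_hours = [0, 1, 2, 3, 4]
    sample_usage = [60, 62, 64, 66, 68]
    remaining = hours_until_threshold(sample_hours, sample_usage)
    print(remaining, should_alert(remaining))
```

A linear trend is the simplest possible forecast and will mislead on seasonal or bursty metrics; it is used here only because it keeps the idea of reacting to historical data visible in a few lines.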

"I've just picked up a fault in the AE-35 unit. It's going to go 100% failure within 72 hours."
HAL 9000, 2001: A Space Odyssey

That gives us the chance to use our time for more innovative tasks and feature development.

While this certainly is not an overnight achievement, lately we have seen the line between the work of machines and humans grow thin. Through the advances that machine learning and automation can offer, we can enable greater productivity among teams and businesses.

What level of knowledge should attendees have before walking into this session?
This session is aimed at people who want to learn the basics of improving automated response on critical systems and reducing the amount of manual work done in operations.

What should I expect from this session?
We will go over the basics of AI implementations, give examples, and inspire change by using machine learning techniques, behavioral analytics, statistics, and specific tools for the scoring/evaluation process to make complex systems smarter, faster, and always on.

Why are companies pursuing this?
Companies are adopting AI and machine learning technology for its usefulness in augmenting human understanding of complex interactions and data sets by uncovering the unknowns. This helps prevent avoidable chaotic situations and also frees resources to be more innovative and creative, bringing value to the business.

"I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate..."
Roy Batty, Blade Runner

 

Code used in this presentation: https://github.com/ricardoamaro/MachineLearning4SRE

Session Track

DevOps

Experience Level

Beginner

Drupal Version

When & Where

Time: 
Thursday, 28 September, 2017 - 13:35 to 14:00
Room: 
Strauss