When I first heard about SRE, I pictured the chaotic environment of an operations support team firefighting issues and the pain of keeping a deployed application reliable. My bad, I was wrong. In this blog, I share some insights into Site Reliability Engineering (SRE) that should help operational support engineers, companies, or anyone who feels the pain of managing operations.
What exactly is SRE?
Site Reliability Engineering is a set of principles and practices that engineers follow to solve the problems of the infrastructure running their applications, and of operating it, by means of software engineering. Basically, it treats operations as a software problem, starting from core principles such as Eliminating Toil and Simplicity. Other core principles include Embracing Risk, Service Level Objectives, Monitoring Distributed Systems, Release Engineering, and a lot more. My main aim in this blog is to encourage companies to shift from traditional technical support towards the SRE approach to reliability engineering, which will make life easier for reliability engineers.
Choni’s approach towards an operational issue in his organization.
Let's start our SRE journey with Choni, an illustrated technical support engineer!
Choni works in the operations team of his company, where he is responsible for the reliability of the company’s main website. One day he got an alert by email: “Mysqld is not running…Fix this issue immediately …” The service recovered after 5 minutes. After 15 minutes he got one more alert. This time the problem was real, and it was not recovering.
Since this was the second consecutive alert, he immediately checked its severity and realized that it was a P0 issue affecting many of his users.
The first instinct is to find the root cause of the problem and fix it. But he could not troubleshoot at leisure while the website was down, waiting to reconnect to Mysqld. The best approach would have been to switch the connection to a DR server and save the website. Unfortunately, there was no DR plan and no runbook. In the end, Choni took around 56 hours to fix the problem because of inadequate backup systems and the absence of a DR plan. The cause: some of the Mysqld tables had been rewritten by an unauthorised person, which forced the Mysql engine to stop. How chaotic it was!
He started with automation and runbooks: he defined his multi-region, multi-node Mysql deployment as infrastructure-as-code and kept it in a code repository. Any configuration change must now be made through a commit.
Now he runs Mysqld with more focus on security and on distributed-system monitoring using Prometheus exporters, with Grafana for visualisation. Remember how Choni got the first alert? He received it as an email. So he wanted something better, such as integration with a messaging platform channel (e.g. Mattermost) or an app like Opsgenie that keeps alerting until you have responded. This really helps the life of an SRE: be on call and respond to these bots.
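To make the "keeps alerting until you have responded" idea concrete, here is a minimal Python sketch of that re-notification loop. It is not the Opsgenie or Mattermost API: `page_until_acknowledged`, `notify` and `is_acknowledged` are hypothetical names, and the two callbacks stand in for whatever integration you actually use.

```python
import time
from typing import Callable


def page_until_acknowledged(
    notify: Callable[[int], None],
    is_acknowledged: Callable[[], bool],
    max_attempts: int = 5,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-send an alert until someone acknowledges it.

    notify(attempt) delivers the alert (hypothetical hook, e.g. a chat message).
    is_acknowledged() reports whether the on-call engineer has responded.
    Returns True if the alert was acknowledged, False if all attempts were
    used up and the caller should escalate to someone else.
    """
    for attempt in range(1, max_attempts + 1):
        notify(attempt)
        # Give the on-call engineer some time to respond before checking.
        sleep(interval_seconds)
        if is_acknowledged():
            return True
    return False
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real waiting. A `False` return is the point at which a real setup would typically escalate to the next person on the rota.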
Next, he wanted to test the reliability of his new Mysql setup, so he introduced Chaos Engineering: randomly and intentionally creating failures. This let him fine-tune the new Mysql setup for better reliability and scalability. You can get more details of Chaos engineering from the references. He also developed a Disaster Recovery runbook.
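As a toy illustration of "randomly and intentionally creating failures", the sketch below simulates one chaos round against an in-memory model of a Mysql cluster: it takes down a random healthy node and checks whether a primary can still be chosen. Everything here is an assumption made for illustration (the node names, the `pick_primary` rule, the dictionary model); a real experiment would act on actual infrastructure with a dedicated chaos tool.

```python
import random
from typing import Dict, Optional, Tuple


def pick_primary(nodes: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy node in name order, or None if all are down."""
    for name in sorted(nodes):
        if nodes[name]:
            return name
    return None


def run_chaos_round(
    nodes: Dict[str, bool], rng: random.Random
) -> Tuple[str, Optional[str]]:
    """Mark one random healthy node as failed and report who is primary now.

    nodes maps a (hypothetical) node name to True if healthy, False if down.
    Returns (victim, new_primary); new_primary is None if nothing survived.
    """
    healthy = sorted(name for name, up in nodes.items() if up)
    if not healthy:
        raise ValueError("no healthy nodes left to fail")
    victim = rng.choice(healthy)
    nodes[victim] = False  # the injected failure
    return victim, pick_primary(nodes)


if __name__ == "__main__":
    cluster = {"mysql-a": True, "mysql-b": True, "mysql-c": True}
    victim, primary = run_chaos_round(cluster, random.Random())
    print(f"failed {victim}; primary is now {primary}")
```

The useful part of such an experiment is the assertion you make afterwards, for example that a primary still exists after any single node failure. When that assertion fails, you have found a gap to close before a real outage finds it.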
Choni can now see how SQL queries are performing, how the Mysql engine behaves while backups are taken, and many more such scenarios. All of this can be tracked with the help of metrics, logs and traces from the Mysql service. This is basically setting up the pillars of observability for a system. With observability in place, he is able to define proper Service Level Indicators and Service Level Objectives, and to commit to Service Level Agreements, for his new system.
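To show how an SLI and an SLO relate in practice, here is a small Python sketch that computes an availability SLI from request counts and compares it with an SLO target, including how much of the error budget has been used. The names (`evaluate_slo`, `SloReport`) and the numbers in the usage note are illustrative assumptions, not figures from Choni's system.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good events
    allowed_bad: float      # bad events the SLO tolerates in this window
    budget_consumed: float  # fraction of the error budget already used
    slo_met: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a measured SLI against an SLO target such as 0.999."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is whatever the SLO leaves over: (1 - target) of all events.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return SloReport(
        sli=sli,
        allowed_bad=allowed_bad,
        budget_consumed=actual_bad / allowed_bad,
        slo_met=sli >= slo_target,
    )
```

For example, with 10,000 queries of which 9,995 succeeded and a 99.9% target, the SLO tolerates about 10 failures, so roughly half of the error budget has been spent and the SLO is still met. An SLA is then the external commitment, usually set looser than the internal SLO.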
The example above is meant to show a beginner’s journey towards building an SRE-focused team. It is only the outermost layer of the onion; the layers go deeper and deeper as you get further into SRE implementation in your organization. Wishing you a happy SRE journey.
> SRE is treating operations as if it’s a software problem.
> Core principles include Embracing Risk, Service Level Objectives, Eliminating Toil, and Monitoring Distributed Systems.
> Logs, Traces and Metrics are the pillars of observability.
> SLAs, SLIs and SLOs are the fundamentals of SRE.
If you want to reach out to me: https://www.linkedin.com/in/jeswinkninan/
Choni did not want this kind of problem to happen again, and he learned that many gaps needed to be filled to make his website more reliable in future. So he started learning about SRE approaches and redesigned the way Mysqld was run.