Site Reliability Engineering, or SRE for short, is a term that originated at Google during the mid-2000s. It defines a set of principles and practices that Google uses to run its systems at scale. Having seen how effective an SRE team can be, organisations from diverse industries have begun to incorporate these ideas into their own engineering culture. SREs are responsible for a wide range of duties, including assuring the availability of the service, setting up observability and monitoring platforms for the service they look after, and setting up deployment pipelines so that code can be released faster, to name a few. One of the key distinctions between an SRE and a DevOps engineer is that an SRE focuses more on the reliability of the service. Reliability is defined as a service’s capacity to deliver its promised performance within a predetermined window of time. In this blog, we discuss an important part of an SRE’s job: defining the SLIs, SLOs, SLAs and Error Budget that specify how reliable the product or service is.
To help the reader better understand these metrics, we use the DevOps Malayalam community website, https://devopsmalayalam.io/, as an example. In most organisations, SRE teams are structured by service. Organisations that have an SRE team mostly run applications built on a microservice architecture, where each piece of functionality within an app is split into its own service. The DevOps Malayalam website is a static site, but large-scale production systems such as Gmail, Twitter or WhatsApp are made up of multiple microservices that handle a variety of functions. For example, if a service handled payments made by customers on the https://devopsmalayalam.io/ website, an SRE (or SRE team) would be primarily responsible for the availability and reliability of just that service. Likewise, another SRE (or SRE team) would be allocated to the service that handles customer data for the website. So SRE teams are usually assigned to a service within an application, based on its functionality.
So imagine you are assigned as an SRE to the team that manages the reliability of the DevOps Malayalam website. Users from across the world access the website through the endpoint (URL) https://devopsmalayalam.io/, which in this case is hosted with the provider FlexiCloud. I’ll explain in simple terms how we define SLI/O/A for this service.
So the key ideas that we will cover in this blog post are listed below:
- First, we define what SLI, SLO, SLA, and Error Budget mean.
- Then, we show an example of defining these metrics for the DevOps Malayalam website.
- Finally, a demo implementation of a simple method to configure an SLI based on this website’s HTTP response status code.
A few things for the reader to be aware of before we continue further:
- This blog draws on the Google SRE book and the Google SRE Workbook. I’ve tried to explain SLI/O/A based on my understanding of these metrics, and I share a simple demo implementation of an SLI that anyone can build as an SRE.
- SLIs, SLOs and SLAs are usually defined with the business objective they help to achieve in mind. If we cannot connect them to the business and its customers in a real scenario, there is no point in defining them for a service.
- The numbers for these metrics are normally agreed after a discussion between the Product, Development, and SRE teams, and are justified by the business teams inside the company. The numbers in this example are ones I picked for my own context, to help the reader understand the metrics better.
- The target audience for this blog is mainly beginners getting started in the SRE domain who want to understand what these terms mean.
“So what is this SLI/O/A and Error Budget all about?”
Service Level Indicators (SLIs)
An SLI is a measurement of some characteristic of the service. By characteristics I mean, for example, the 4 Golden Signals that shape the customer experience:
- Latency: The time it takes to service a request.
- Traffic: A measure of how much demand is being placed on your system.
- Error rate: The rate of requests that fail.
- Saturation: How “full” your service is.
To understand these characteristics better, you can read about them in chapter 6, Monitoring Distributed Systems, of the Google SRE book. To explain an SLI in simple words based on our website: if we take latency as the metric, it means how much time it took for the user to get a response after calling the endpoint https://devopsmalayalam.io/. This is a quantitative metric. In an actual scenario, an SLI would look like this, for example:
- The time taken to return a successful response for a request should be less than 100 milliseconds (ms).
What this means is that, depending on where the user is located, the endpoint may load in 99 milliseconds for a user in the USA and take 105 milliseconds for a user in Russia. As you can see, it is a measurement, here the time taken to load the URL successfully. SLIs are therefore defined as quantitative metrics: an SRE can measure the latency, or response time, to determine how long a user waits for a response from the endpoint.
SLI can be represented as a formula:
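Following the Google SRE Workbook’s recommendation to treat an SLI as the ratio of good events to total events, the formula can be written as a percentage:

```latex
\[
\text{SLI} = \frac{\text{good events}}{\text{total events}} \times 100\%
\]
```

For the latency example, a good event would be a request answered in under 100 ms, and the total would be every request received in the measurement window.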
Service Level Objectives (SLOs)
An SLO defines a specific target for the SLI we have set. For instance, for the latency example above, an SLO could look like the following once it is defined.
- 99% of all requests should be served under 100 ms over a period of 1 hour or
- 99.9% of all requests should be served between 50 ms and 100 ms for a period of 1 month.
SLOs are a way to measure customer happiness and expectations: when the SLIs are tracked against a target, problems can be caught and reported before the customer notices them. For our website, an SLO would read something like this: imagine the site was accessed by 100 users in an hour through the endpoint https://devopsmalayalam.io/. If 99 of those users were able to access the service within 100 ms, the SLO we defined has been fulfilled.
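The 99-out-of-100 example above can be checked in a few lines of Go. This is only a minimal sketch: the function names `LatencySLI`, `MeetsSLO` and `sampleHour` and the latency values are made up for illustration, and a real system would read latencies from logs or a metrics store.

```go
package main

import (
	"fmt"
	"time"
)

// LatencySLI returns the percentage of requests served faster than threshold.
// An empty sample returns 100, since no request has violated the threshold.
func LatencySLI(latencies []time.Duration, threshold time.Duration) float64 {
	if len(latencies) == 0 {
		return 100
	}
	good := 0
	for _, l := range latencies {
		if l < threshold {
			good++
		}
	}
	return float64(good) * 100 / float64(len(latencies))
}

// MeetsSLO reports whether the measured SLI reaches the target percentage.
func MeetsSLO(sli, target float64) bool {
	return sli >= target
}

// sampleHour builds a hypothetical hour of traffic:
// 99 requests at 80 ms and one slow request at 105 ms.
func sampleHour() []time.Duration {
	latencies := make([]time.Duration, 0, 100)
	for i := 0; i < 99; i++ {
		latencies = append(latencies, 80*time.Millisecond)
	}
	return append(latencies, 105*time.Millisecond)
}

func main() {
	sli := LatencySLI(sampleHour(), 100*time.Millisecond)
	fmt.Printf("SLI: %.1f%%, SLO (99%%) met: %t\n", sli, MeetsSLO(sli, 99))
}
```

With this sample the SLI works out to 99%, so a 99% target is just met; one more slow request in the hour would miss it.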
Service Level Agreements (SLAs)
SLAs are based on SLOs. An SLA states the consequences when availability or another user expectation is not met. In an actual scenario, SLAs may look like the below.
- 90% of all requests per day should be served under 100 ms; otherwise, $10 of the daily subscription fee will be refunded to the customer, or
- The service should be available with an uptime commitment of 99.9% in a 30-day period; otherwise, 4 hours of free usage of the service in question will be added for the customer.
Most of the time, SLAs with customers don’t exist, or they are merely informal agreements between the Product, Development, and SRE teams. For example, Gmail and Google Maps are used for free by customers across the world, and Google has no SLA with those customers saying that if Gmail is down for 1 hour in a month it will pay, say, $10 to every customer affected by the outage. Instead, there would be an agreement between the Gmail Product, Development, and SRE teams: for example, if the agreed level is not met for a time period, no major deployments are made to the service until the SLOs are being met again.
Error Budget
The error budget indicates the level of unreliability permitted for our service; it is essentially the complement of the SLO target. If your SLO says that 90% of requests should be successful in 1 hour, then your error budget allows 10% of requests to fail. That means that out of 100 requests to the https://devopsmalayalam.io/ website, 10 are allowed to fail, for example by not getting a response within the 100 ms limit.
The Error Budget can be calculated as below:
Error Budget = 100% − Success%
In our example, as already mentioned, it is 10% (100% − 90% success rate).
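The same arithmetic can be written as a small Go sketch. The names `ErrorBudgetPercent` and `AllowedFailures` are illustrative, not part of any library, and the request count of 100 is the hypothetical figure used in this post.

```go
package main

import "fmt"

// ErrorBudgetPercent returns the share of requests allowed to fail:
// 100% minus the SLO target (both expressed as percentages).
func ErrorBudgetPercent(sloTarget float64) float64 {
	return 100 - sloTarget
}

// AllowedFailures converts the budget into a whole number of requests
// for a window with totalRequests requests, rounding down.
func AllowedFailures(totalRequests int, sloTarget float64) int {
	return int(float64(totalRequests) * ErrorBudgetPercent(sloTarget) / 100)
}

func main() {
	fmt.Printf("error budget: %.0f%%\n", ErrorBudgetPercent(90))
	fmt.Println("allowed failures out of 100 requests:", AllowedFailures(100, 90))
}
```

Rounding down is a deliberate, conservative choice here: a fractional failure cannot happen, so the budget never allows more failures than the SLO target permits.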
Let’s move on to the second point in this blog, explaining these with an example.
Keeping the above in mind, I’m going to show how to implement a simple SLI and how to define an SLO and SLA for a metric like the HTTP response code of https://devopsmalayalam.io/. Typically, these measures are calculated from logs, metrics, or another source of past performance, which requires configuring a data source and some complex calculations. Here we check the response code because availability is the first and most important feature that a service should offer. Availability, also called uptime, is a service’s capacity to function whenever it is needed; a system that is not running cannot serve anyone.
Here we calculate only one SLI, as an example to help anyone get started: a periodic health check on the service (a ping or an HTTP GET). In an actual case it differs, and requires calculations based on the formula mentioned above in the SLI definition.
SLI/O/A metrics and Error Budget for DevOps Malayalam website based on response code.
So, as an SRE who cares about the reliability of the website https://devopsmalayalam.io/, we can define the metrics as follows.
SLI: The response code on calling the URL of the website https://devopsmalayalam.io/ should be a success, that is HTTP 200.
To explain this in simple terms: whenever the website is accessed, the request should succeed. Below is a simple example of calling the website endpoint with a tool like curl to check its status code; it returns a 200 status code, which means success.
Now let’s see how we can define the SLO for this service.
SLO: 90% of the HTTP responses, measured over a period of 1 hour, should be successful. That is, if we access https://devopsmalayalam.io/ 100 times in an hour, it should succeed (return HTTP 200) at least 90 times.
SLA: if fewer than 90% of requests return successfully within a period of 1 hour, we would refund, say, $1 to the user’s account.
As already mentioned, an SLA is often an agreement between the Product, Development, and SRE teams rather than with the users of the service; Gmail is an example.
Error Budget: in this example the error budget is 10%, which means 10 requests out of 100 within an hour are allowed to fail, that is, to return an HTTP response code other than 200.
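To make these definitions concrete, here is a minimal Go sketch that takes one hour of observed status codes and reports the SLI, whether the 90% SLO is met, and how much of the error budget is left. The helper names and the sample of 7 failed checks are hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// AvailabilitySLI returns the percentage of checks that returned HTTP 200.
func AvailabilitySLI(statusCodes []int) float64 {
	if len(statusCodes) == 0 {
		return 100 // no checks yet, so nothing has failed
	}
	ok := 0
	for _, code := range statusCodes {
		if code == http.StatusOK {
			ok++
		}
	}
	return float64(ok) * 100 / float64(len(statusCodes))
}

// BudgetRemaining returns how many more failed checks the window can absorb
// before the SLO target is missed. A negative value means the SLO is breached.
func BudgetRemaining(statusCodes []int, sloTarget float64) int {
	allowed := int(float64(len(statusCodes)) * (100 - sloTarget) / 100)
	failed := 0
	for _, code := range statusCodes {
		if code != http.StatusOK {
			failed++
		}
	}
	return allowed - failed
}

// sampleHour builds a hypothetical hour of 100 checks in which the first
// `failures` checks returned HTTP 503 and the rest returned HTTP 200.
func sampleHour(failures int) []int {
	codes := make([]int, 100)
	for i := range codes {
		codes[i] = http.StatusOK
		if i < failures {
			codes[i] = http.StatusServiceUnavailable
		}
	}
	return codes
}

func main() {
	codes := sampleHour(7)
	sli := AvailabilitySLI(codes)
	fmt.Printf("SLI: %.1f%%, SLO (90%%) met: %t, failures left in budget: %d\n",
		sli, sli >= 90, BudgetRemaining(codes, 90))
}
```

With 7 failed checks out of 100, the SLI is 93% and 3 failures remain in the budget; at 11 or more failures the remaining budget goes negative and the SLO is missed.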
Moving on to the final topic in this blog, which is implementing a simple SLI for a service.
To implement a sample SLI and keep the explanation simple, we’ll use a script that periodically checks the health of the website https://devopsmalayalam.io/ and plots it in Grafana. This way of calculating an SLI is derived from an example given by the authors of the SRE Workbook; refer to the snapshot below.
Please note that at this point, in the demo part we are only implementing the SLI and not moving on to the SLO, SLA or Error Budget.
Before starting, it’s better to have an idea about the below.
For plotting the SLI, I use a simple Golang script that checks the response code of the URL https://devopsmalayalam.io/. The logic is simple: we check the HTTP response code and push the result to Graphite. Graphite is a time-series metrics collection tool, so we cannot push string values; instead we push the integer 1 if the service is up (that is, if the URL returns HTTP 200) and the integer 0 if it is down (for all other response codes). A snippet of the code that calculates the SLI for this availability metric is given below. For the full code, please check here.
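The logic described above can be sketched roughly as follows. This is not the original script, only a minimal illustration under a few assumptions: the metric path `sli.devopsmalayalam.availability` and the `GRAPHITE_ADDR` environment variable are hypothetical names, and the sample is sent using Graphite’s plaintext protocol (one `path value timestamp` line per sample over TCP).

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// StatusToMetric maps an HTTP status code to the integer pushed to Graphite:
// 1 when the site is up (HTTP 200), 0 for every other response code.
func StatusToMetric(code int) int {
	if code == http.StatusOK {
		return 1
	}
	return 0
}

// Probe sends one GET request and returns 1 or 0.
// A network error or timeout is counted as "down".
func Probe(url string) int {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	return StatusToMetric(resp.StatusCode)
}

// GraphiteLine formats one sample in Graphite's plaintext protocol:
// "<metric path> <value> <unix timestamp>\n".
func GraphiteLine(path string, value int, unixTime int64) string {
	return fmt.Sprintf("%s %d %d\n", path, value, unixTime)
}

// Send writes one line to a Graphite plaintext listener over TCP.
func Send(addr, line string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}

func main() {
	value := Probe("https://devopsmalayalam.io/")
	line := GraphiteLine("sli.devopsmalayalam.availability", value, time.Now().Unix())
	fmt.Print(line)

	// GRAPHITE_ADDR is a hypothetical setting, for example "localhost:2003".
	// When it is unset, the sample is only printed.
	if addr := os.Getenv("GRAPHITE_ADDR"); addr != "" {
		if err := Send(addr, line); err != nil {
			fmt.Fprintln(os.Stderr, "push to Graphite failed:", err)
		}
	}
}
```

Note that Go’s default HTTP client follows redirects, so a redirect that ends in a 200 response is counted as up in this sketch.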
In Grafana, we use Graphite as the data source and plot the status as a dashboard in the user interface, attached below for reference.
This is a simple representation of an SLI. In reality it will be more complex and calculated from logs and other metrics; the purpose here is to help new SREs get started by implementing something.
Further, the above script can be run as a cron job at a fixed frequency so that it calculates the SLI regularly. In my case, I run it as a Lambda function and set a trigger in AWS CloudWatch Events to fire every 5 minutes or so and push the status. A similar graph of the status code can also be achieved, to an extent, with the Blackbox exporter, Prometheus and Grafana. I explained it in code because as an SRE you are expected to spend roughly half of your time on engineering project work, which includes writing code.
We haven’t calculated the SLO or the Error Budget with the script; I have only shown how to implement an SLI, based on a suggestion in the Google SRE Workbook.
We have come to the end of this blog. I hope it helps anyone getting started as an SRE to understand what SLI/O/A are all about. Here are the key takeaways from the blog.
- We defined what SLIs, SLOs, SLAs, and Error Budgets mean.
- We showed an example of how to define a single SLI/O/A based on the https://devopsmalayalam.io/ website’s response code.
- We walked through a demo implementation of a very basic SLI for the DevOps Malayalam website, based on its HTTP response code.
Additionally, sharing a few helpful links if you want to read more about any of the technical terms I’ve used here.
- Microservice Architecture
- DORA Metrics for DevOps Engineers
- How SRE relates to DevOps
- REST API
- Monitoring Distributed Systems
- SRE vs DevOps with Liz Fong-Jones and Seth Vargo
For developing the blog, the following articles are cited here:
If you would like to connect with me, I am on LinkedIn.