Topics Discussed:
- What is Chaos Engineering
- How is it practiced
- Introduction to LitmusChaos
- How you can practice chaos with LitmusChaos
- LitmusChaos 2.0
- Future roadmap
Purpose of Quality Assurance:
> QA needs to be in place to check and verify the quality of the delivered product.
> A quality assurance system is meant to increase customer confidence and a company’s credibility, while also improving work processes and efficiency, and it enables a company to better compete with others.
Business continuity:
> It means that, whatever happens, the business should not go down.
> Business continuity planning and disaster recovery are key responsibilities of a product development team.
> Data loss and data unavailability are both problematic for an application, and data loss is unacceptable for a business.
> Chaos engineering helps you get through disaster situations and so supports business continuity.
> A key benefit of DevOps practice is faster releases: a release cycle that used to take three months is now about 10 times faster.
Cloud native deployment:
> In an application you develop, only about 5% of the code is your own; the rest belongs to someone or something else, such as the platform, k8s, etc.
> So chaos engineering is necessary to test and verify what can go wrong in a disaster situation.
> What is chaos engineering? Anything can go wrong with applications in today's world.
> From principlesofchaos.org: it is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
History:
> Chaos engineering was introduced by Netflix in 2008.
> Netflix introduced Chaos Monkey, which randomly takes down nodes hosted in a data center to test resilience.
> But when an entire AWS region went down for a few hours, Netflix suffered an outage again. So Netflix started the Chaos Kong project and introduced the Simian Army. In 2014 the chaos engineer role was introduced.
> Another approach was injecting latency into existing solutions such as storage. Even a 0.01 millisecond latency can cause instability in a storage solution, which leads to an outage in the applications on servers attached to that storage.
> Reference: the Chaos Engineering book by Casey Rosenthal and Nora Jones
> A resilient system is one that can tolerate a certain amount of failure.
What is a game day: on a predefined date, we create artificial failures in the production system within a defined blast radius, and then we fix the issues that occur. The artificial failures act like a vaccine.
How is chaos engineering done?
1. Define a steady state
2. You should be able to measure the steady state of the system
3. Use measures such as SLO, SLI, MTTF, and MTTR
SLO: Service level objective – the target, set at the CEO/CTO level
SLI: Service level indicator – what is actually measured; we need to define it before the tests
MTTF: Mean time to failure
MTTR: Mean time to recovery – how long the system takes to recover after it goes down; this time needs to be reduced
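As a rough illustration of how these measures relate, here is a minimal Python sketch. The `Incident` record, the 1000-hour window, the sample outages, and the 99% target are all made-up assumptions for the example, not figures from the talk. MTTF is computed here as total uptime divided by the number of failures, which is one common convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """One outage, in hours since the start of the observation window (hypothetical data)."""
    failed_at: float
    recovered_at: float


def total_downtime(incidents: List[Incident]) -> float:
    """Sum of the time spent down across all incidents."""
    return sum(i.recovered_at - i.failed_at for i in incidents)


def mttr(incidents: List[Incident]) -> float:
    """Mean time to recovery: average downtime per incident."""
    return total_downtime(incidents) / len(incidents)


def mttf(incidents: List[Incident], window_hours: float) -> float:
    """Mean time to failure: total uptime divided by the number of failures."""
    return (window_hours - total_downtime(incidents)) / len(incidents)


def availability_sli(incidents: List[Incident], window_hours: float) -> float:
    """The measured indicator (SLI): fraction of the window the system was up."""
    return (window_hours - total_downtime(incidents)) / window_hours


def slo_met(sli: float, slo: float) -> bool:
    """The objective (SLO) is met when the measured SLI reaches the target."""
    return sli >= slo


# Assumed sample data: three outages in a 1000-hour window, 99% availability target.
WINDOW_HOURS = 1000.0
SAMPLE = [Incident(100, 102), Incident(500, 501), Incident(900, 905)]

print("MTTR (h):", mttr(SAMPLE))
print("MTTF (h):", mttf(SAMPLE, WINDOW_HOURS))
print("SLI:", availability_sli(SAMPLE, WINDOW_HOURS))
print("SLO met:", slo_met(availability_sli(SAMPLE, WINDOW_HOURS), 0.99))
```

Reducing MTTR raises the availability SLI directly, which is why recovery time is the number chaos experiments usually try to improve.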
> The blast radius needs to be defined for each test.
> We also need to decide which component to take down first.
> Once we have fixed the issues found by the experiments, we need to automate them, and then run the experiments with multiple criteria.
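The flow above (check the steady state, inject a failure within a blast radius, measure again, roll back, then repeat with different criteria) can be sketched in Python as follows. This is a toy simulation under stated assumptions: `ToyService`, its replica names, and the 0.5 threshold are hypothetical stand-ins, and no real chaos tool or cluster is involved.

```python
import random


class ToyService:
    """A pretend service made of replicas; the SLI is the fraction that are healthy."""

    def __init__(self, replicas):
        self.healthy = {name: True for name in replicas}

    def kill(self, name):
        self.healthy[name] = False

    def restore(self, name):
        self.healthy[name] = True

    def sli(self):
        return sum(self.healthy.values()) / len(self.healthy)


def run_experiment(service, blast_radius, slo, rng):
    """Run one experiment: verify steady state, inject, measure, always roll back."""
    # Never start an experiment on a system that is already unhealthy.
    if service.sli() < slo:
        return {"verdict": "skipped", "victims": [], "sli_during": None}

    # The blast radius caps how many replicas may be taken down at once.
    names = sorted(service.healthy)
    victims = rng.sample(names, k=min(blast_radius, len(names)))
    try:
        for name in victims:
            service.kill(name)
        sli_during = service.sli()
    finally:
        # Roll back even if the measurement step raised an error.
        for name in victims:
            service.restore(name)

    verdict = "pass" if sli_during >= slo else "fail"
    return {"verdict": verdict, "victims": victims, "sli_during": sli_during}


# Automating the run with multiple criteria: widen the blast radius step by step.
service = ToyService(["a", "b", "c", "d", "e"])
rng = random.Random(0)  # seeded so repeated runs pick the same victims
for radius in (1, 2, 3):
    print(radius, run_experiment(service, radius, slo=0.5, rng=rng))
```

Starting with the smallest blast radius and widening it only after a pass keeps the damage from a failed experiment contained.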
Chaos tool examples: Gremlin, Chaos Toolkit, Chaos Mesh, Litmus, ChaosBlade
Gremlin – SaaS and a closed product
Chaos Mesh and Litmus – CNCF projects
Litmus is growing: about 50 predefined experiments are available, and about 1000 installations happen per day.
Advantages of Litmus over SaaS-based chaos tools:
> Litmus is open source
> Community collaboration
Why did we start Litmus?
OpenEBS, a MayaData project and a leading project in CNCF, was started in 2016. Litmus was born in 2017 and was introduced at KubeCon. All contributed experiments are available at hub.litmuschaos.io.
The life cycle of a project in CNCF is as below:
1. Sandbox
2. Incubating
3. Graduated
Litmus 2.2 was released on August 15, 2021.
Tools available in Litmus 2.2:
> ChaosCenter – an orchestration platform that helps us integrate Litmus with multiple k8s clusters. Experiments can be composed into a workflow: we define sequential failures or a set of parallel failures.
> Chaos agent – an agent that runs in a Kubernetes cluster. litmusctl is used to register a k8s cluster with ChaosCenter.
> GitOps integrated with Litmus: with a GitOps approach, we can even trigger chaos experiments when we run Continuous Delivery.
> Chaos experiments can be installed in the k8s cluster and run against the application release itself, so Litmus can be used as an additional testing tool in the release process.
> Custom hook or probe – to check the application during a fault such as pod deletion, and then to derive a resilience score.
> To practice Litmus as a beginner, we can use Podtato Head (a sample app provided) for learning and experimenting. We can also use test apps such as Bank of Anthos.
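To make the workflow and resilience-score ideas above more concrete, here is a small Python sketch. It is not Litmus code and does not call any Litmus API: the `Experiment` class, the weights, and the probe success percentages are invented for illustration, and the score is assumed to be a weighted average of probe success percentages. Steps run one after another, and the experiments inside a step run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Experiment:
    """A hypothetical experiment: a name, a weight, and a function that runs it."""
    name: str
    weight: int                # how much this experiment matters to the score
    run: Callable[[], float]   # returns the probe success percentage (0 to 100)


def run_workflow(steps: List[List[Experiment]]) -> Dict[str, Tuple[int, float]]:
    """Run steps sequentially; experiments within one (non-empty) step run in parallel."""
    results: Dict[str, Tuple[int, float]] = {}
    for step in steps:
        with ThreadPoolExecutor(max_workers=len(step)) as pool:
            percentages = pool.map(lambda exp: exp.run(), step)
            for exp, pct in zip(step, percentages):
                results[exp.name] = (exp.weight, pct)
    return results


def resilience_score(results: Dict[str, Tuple[int, float]]) -> float:
    """Assumed formula: sum(weight * probe success %) / sum(weights)."""
    total_weight = sum(weight for weight, _ in results.values())
    return sum(weight * pct for weight, pct in results.values()) / total_weight


# Illustrative workflow: one fault first, then two faults injected together.
workflow = [
    [Experiment("pod-delete", 10, lambda: 100.0)],
    [
        Experiment("network-latency", 5, lambda: 50.0),
        Experiment("cpu-hog", 5, lambda: 100.0),
    ],
]

results = run_workflow(workflow)
print("resilience score:", resilience_score(results))
```

With these made-up numbers the heavier-weighted pod deletion dominates the score, which is the point of assigning weights: a failed probe on a critical experiment should cost more than one on a minor experiment.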
Chaos engineering practice needs three roles:
> Master of disaster
> Team members to run experiments
> Viewer to observe the effects
Litmus also provides stop and abort functionality if we need to stop a run partway through.
Litmus vs. other chaos tools:
> There are more than 40 tools on the market.
> Litmus and Chaos Mesh are both in incubation in CNCF and are competing tools. Both are k8s native, and the two projects also collaborate.
> Community growth is a major difference when comparing with other tools like Gremlin.
Do we have a chaos engineering standard?
> The chaos engineering working group started one month ago in CNCF. It will take some more months to form a standard. Until then, principlesofchaos.org is the set of principles to follow.
> Some companies are already using Litmus in their release cycles, e.g. Pravega (Dell’s open source storage wing).