Wednesday, August 25, 2021

Implementing “Andon (行灯)” in DevOps

 

Let’s first look at what the word “Andon (行灯)” means in Japanese.

Andon (行灯) means: a fixed paper-enclosed lantern; a paper-covered wooden stand housing an (oil) lamp.

Dictionary — https://jisho.org/search/行灯

The Toyota Production System (TPS) brought the word “Andon” into manufacturing. The “Andon Cord” is a Lean manufacturing tool used to notify management, maintenance, and other workers of a quality or process problem. The concept revolves around a device with signal lights that show which assembly-line workstation has a problem. Alerts are normally raised manually by a worker using a pull cord (the Andon cord) or a button, or they may be raised automatically by the production equipment itself. The idea is that stopping the system gives you an immediate opportunity to improve or to find a root cause, instead of letting the defect move further down the line unresolved.

What happens when such signals are ignored? In “The High-Velocity Edge”, Steven Spear describes a horrifying story of missed opportunities leading up to the 2003 NASA Columbia space shuttle disaster. The short version is that the thermal protection system on the left wing was damaged just after launch, but the damage did not become an issue until reentry 16 days later. After the disaster, the investigation board reviewing the accident found at least eight attempted signals asking for someone to “go and see” the damage, but nothing was done, and that led to the disaster.

Now, what does the Andon Cord mean in software development?

During development, bugs are talked about openly, but the project moves forward regardless, and the bugs continue down the process.

This is where DevOps comes in and can incorporate the Andon Cord concept. Two specific areas come to mind. First, if an organization has truly flipped the testing pyramid and put full automated testing in place, in conjunction with Continuous Integration, Specification by Example, and a tool such as SonarQube, this is the first place the Andon Cord concept can be employed. By forcing all new code to run the gauntlet of full automated testing (driven by Specification by Example) and SonarQube’s quality gates, you make sure it meets the behaviors specified by the customer and that the code quality is in line with expected standards. A failed test or quality gate stops the pipeline, just as a pulled cord stops the line.
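
The “stop the line” behavior of such a pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gate names and their hard-coded results are hypothetical, and nothing here calls a real CI server or the SonarQube API.

```python
from typing import Callable, List, Optional, Tuple

# A gate is a (name, check) pair; check() returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first failing gate, or None.

    Returning at the first failure is the "cord pull": later gates never run,
    so a defect cannot travel further down the pipeline.
    """
    for name, check in gates:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical gates with hard-coded results, for illustration only.
    pipeline = [
        ("unit tests", lambda: True),
        ("acceptance specs", lambda: False),
        ("quality gate", lambda: True),
    ]
    failed = run_gates(pipeline)
    print(f"stopped at: {failed}" if failed else "all gates passed")
```

In a real pipeline each check would wrap a test run or a quality-gate query; the point of the sketch is only that the first failure halts everything after it.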

The second area where DevOps can function as a sort of Andon Cord is A/B testing. When an organization has put a fully automated delivery pipeline in place, it can quickly get code out to subsets of its customers (often the same day the code was developed) and create feedback loops that confirm it is building what its clients want.
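
One common way to send a change to only a subset of customers is to hash each user into a stable bucket. The sketch below assumes that approach; the function name, the experiment label, and the 100-bucket scheme are illustrative choices, not something taken from a particular A/B testing product.

```python
import hashlib


def in_rollout(user_id: str, experiment: str, percent: int) -> bool:
    """Return True if the user falls into the first `percent` of 100 buckets.

    The bucket is derived from a hash of the experiment name and user id, so
    the same user always gets the same answer for the same experiment.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the bucket does not change when `percent` grows, a user who saw the new code at 10% still sees it at 50%, which keeps the feedback from that group consistent while the rollout widens.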

Are any industry giants really using it? The answer is yes.

Amazon and the Cord:

The Andon Cord has become a metaphor for some modern-day Web-Scale organizations as well. Jeff Bezos, then CEO of Amazon, described in a 2013 letter to Amazon’s shareholders a practice he called the Customer Service Andon Cord. This was an established practice of metaphorically pulling an Andon Cord when they noticed a customer was overpaying or had overpaid for a service. Amazon would heuristically scan its systems looking for these kinds of potential customer service mismatches. These were considered defects at Amazon because the company had a vision of always being customer-centric. It would automatically refund a customer, without the customer even asking, if the service delivery was suboptimal. This has happened to me on a few occasions when watching a movie on Amazon Prime: the next day I received an email telling me they had refunded my rental cost because of poor quality.

Netflix and the Chaos Cord:

Another example of an Andon Cord metaphor used in Web-Scale businesses is at Netflix. Netflix has an interesting way of exercising their Andon Cord, although they don’t actually call it an Andon Cord.

At Netflix, they inject failure into their systems on purpose by intentionally trying to break systems in production. They have developed what is now famously called Chaos Monkey, a process that randomly kills live running production servers. This behavior is known by everyone who works at Netflix. It’s part of their culture. There are no surprises about this practice. Developers plan their code and systems accordingly. According to Adrian Cockcroft, one of the primary architects behind Netflix’s IT infrastructure, not knowing about Chaos Monkey when coming into a job interview at Netflix was pretty much an immediate no-hire decision.
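
The core idea of randomly picking a live server to kill can be illustrated with a toy selection function. This is not Netflix’s implementation; the instance names and the `protected` parameter are assumptions made for the example, and the function only chooses a name without terminating anything.

```python
import random
from typing import Iterable, Optional, Sequence


def pick_instance_to_kill(
    instances: Sequence[str],
    rng: random.Random,
    protected: Iterable[str] = (),
) -> Optional[str]:
    """Pick one random instance to terminate, skipping protected ones.

    Returns None when there is nothing eligible to kill.
    """
    excluded = set(protected)
    candidates = [name for name in instances if name not in excluded]
    if not candidates:
        return None
    return rng.choice(candidates)
```

Passing in the random generator makes the choice reproducible in tests, while in production any eligible server can be hit, which is why developers have to design for it.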

Summary,

  • NO permission is needed to pull the cord (open to everyone)

  • DO NOT bury an issue in useless paperwork and never-ending meetings (act with priority)

  • NO defect is too small

  • Even if the cord is pulled by mistake, the response should never be negative (build trust)

  • It is not the tool that matters but the culture and behavior behind it (build culture)

  • Solving the problem is NOT the goal; understanding how to solve the problem is the goal

Furthermore, the process of solving the issues can be guided by a practice described by Dr. W. Edwards Deming called Plan-Do-Check-Act (PDCA). In the PDCA loop you plan a countermeasure (P), implement the countermeasure (D), check or study the results (C), and act on the results (A): either the problem is fixed or you start the next countermeasure.
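
Read as a loop, PDCA can be sketched as follows. The function and parameter names are hypothetical, and the `do` and `check` callbacks stand in for whatever a team actually does to apply and verify a countermeasure.

```python
from typing import Callable, Iterable, Optional


def pdca(
    countermeasures: Iterable[str],
    do: Callable[[str], None],
    check: Callable[[], bool],
) -> Optional[str]:
    """Try countermeasures in order; return the one that fixed the problem.

    Returns None if every planned countermeasure was tried without success.
    """
    for plan in countermeasures:  # Plan: the next countermeasure to try
        do(plan)                  # Do: implement it
        if check():               # Check: study the result
            return plan           # Act: the problem is fixed, adopt the change
        # Act: not fixed, so move on to the next countermeasure
    return None
```

The return value records which countermeasure worked, which fits the point above that understanding how the problem was solved matters as much as solving it.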

Another point here is that implementing an Andon Cord in an organization is not something you do overnight. It takes a continuous improvement roadmap to get there, with behavior reinforcement built into the process. It takes a fierce commitment to the practice of improvement and an equally skilled leadership coaching approach. If you want to investigate the concept more deeply, I recommend Mike Rother’s book “Toyota Kata”.

Last but not least,

Do we need an actual physical device? Maybe not. Email or Slack can work as a virtual “Andon Cord”. Personally, I would like to do a small DIY project, “Andon using Raspberry Pi 3”: building a small, simple Andon cord on a Raspberry Pi 3 that the team can reach through the web to signal problems. I will work on this and share the implementation details in another blog post.


Tuesday, August 17, 2021

Overview of GitOps

 

What is GitOps? Guide to GitOps — Continuous Delivery for Cloud Native applications

GitOps is a way to do Kubernetes cluster management and application delivery. It works by using Git as a single source of truth for declarative infrastructure and applications, together with tools ensuring the actual state of infrastructure and applications converges towards the desired state declared in Git. With Git at the center of your delivery pipelines, developers can make pull requests to accelerate and simplify application deployments and operations tasks to your infrastructure or container-orchestration system (e.g. Kubernetes).

The core idea of GitOps is having a Git repository that always contains declarative descriptions of the infrastructure currently desired in the production environment and an automated process to make the production environment match the described state in the repository. If you want to deploy a new application or update an existing one, you only need to update the repository — the automated process handles everything else. It’s like having cruise control for managing your applications in production.

Modern software development practices assume support for reviewing changes, tracking history, comparing versions, and rolling back bad updates; GitOps applies the same tooling and engineering perspective to managing the systems that deliver direct business value to users and customers.

Pull-based Deployments

more info @ https://gitops.tech

The pull-based deployment strategy uses the same concepts as the push-based variant but differs in how the deployment pipeline works. Traditional CI/CD pipelines are triggered by an external event, for example when new code is pushed to an application repository. The pull-based approach introduces an operator. It takes over the role of the pipeline by continuously comparing the desired state in the environment repository with the actual state in the deployed infrastructure. Whenever differences are noticed, the operator updates the infrastructure to match the environment repository. Additionally, the image registry can be monitored to find new versions of images to deploy.

Just like the push-based deployment, this variant updates the environment whenever the environment repository changes. However, with the operator, changes can also be noticed in the other direction. Whenever the deployed infrastructure changes in any way not described in the environment repository, these changes are reverted. This ensures that all changes are made traceable in the Git log, by making all direct changes to the cluster impossible.
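
The comparison the operator performs can be sketched as a small reconciliation function. This is a simplified model, not how Flux or any other operator is implemented: state is reduced to a hypothetical mapping of workload name to replica count, and real operators differ in details such as whether unknown resources are deleted.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]  # workload name -> replica count


def plan_changes(desired: State, actual: State) -> List[Tuple[str, str]]:
    """List the (action, workload) steps needed to make actual match desired."""
    changes: List[Tuple[str, str]] = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != replicas:
            changes.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            changes.append(("delete", name))
    return changes


def reconcile(desired: State, actual: State) -> State:
    """Return the cluster state after applying the planned changes."""
    new_state = dict(actual)
    for action, name in plan_changes(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state
```

If someone scales a workload to zero by hand, `plan_changes` reports an update for it, and `reconcile` brings the state back to what the repository declares. An operator simply repeats this comparison on a timer.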

The Kubernetes ecosystem has an overwhelming number of tools for GitOps. Here are some of them:

Tools

  • ArgoCD: A GitOps operator for Kubernetes with a web interface
  • Flux: The GitOps Kubernetes operator by the creators of GitOps — Weaveworks
  • Gitkube: A tool for building and deploying docker images on Kubernetes using git push
  • JenkinsX: Continuous Delivery on Kubernetes with built-in GitOps
  • Terragrunt: A wrapper for Terraform for keeping configurations DRY, and managing remote state
  • WKSctl: A tool for Kubernetes cluster configuration management based on GitOps principles
  • Helm Operator: An operator for using GitOps on K8s with Helm

Also check out Weaveworks’ Awesome-GitOps.

Benefits of GitOps

  1. Faster development
  2. Better Ops
  3. Stronger security guarantees
  4. Easier compliance and auditing

Demo time 😄 — We will be using Flux

https://github.com/fluxcd/flux

Prerequisites: You must have a running Kubernetes cluster.

1. Install “fluxctl”. I used Ubuntu 18.04 for the demo.

sudo snap install fluxctl

2. Create a new namespace called “flux”

kubectl create ns flux

3. Set up Flux with your environment repo. We are using the repo “flux-get-started”.

export GHUSER="YOURUSER"
fluxctl install \
--git-user=${GHUSER} \
--git-email=${GHUSER}@users.noreply.github.com \
--git-url=git@github.com:${GHUSER}/flux-get-started \
--git-path=namespaces,workloads \
--namespace=flux | kubectl apply -f -

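4. Add the deploy key in GitHub. You will need the public key that Flux generated, which the command below prints. Add it to your repository as a deploy key with write access.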

fluxctl identity --k8s-fwd-ns flux

5. At this point the following pods and services should be running on your cluster, in the “flux” and “demo” namespaces.

namespace: flux
namespace: demo

6. Let’s test what we have deployed.

kubectl -n demo port-forward deployment/podinfo 9898:9898 &
curl localhost:9898

7. Now, let’s make a small change in the repo and commit it to the master branch.

By default, Flux git pull frequency is set to 5 minutes. You can tell Flux to sync the changes immediately with:

fluxctl sync --k8s-fwd-ns flux

😃 Wow ㊗️ The changes from our repo have been successfully applied to the cluster.

Let’s do one more test. Assume that someone has mistakenly scaled down or deleted your pods on the production cluster.

Flux restores the declared state on its next sync. As before, you can trigger the sync immediately instead of waiting for the 5-minute interval:

fluxctl sync --k8s-fwd-ns flux

You have successfully restored your cluster the GitOps way. No kubectl required!

As described above, whenever the deployed infrastructure changes in any way not described in the environment repository, the operator reverts the change.


Thank you for reading.