Devops Archives - Site Reliability Engineer Blog

The Chaos Monkey Controversy: Why Overuse is Killing Developer Productivity

In the world of Site Reliability Engineering (SRE), few tools have sparked as much debate as Netflix’s Chaos Monkey. What started as an innovative approach to building resilient systems has become a lightning rod for controversy, dividing the engineering community into passionate camps. After years of watching teams struggle with its implementation, I’m convinced that the tool’s widespread misuse has created more problems than it solved.

The Promise vs. Reality

Netflix introduced Chaos Monkey in 2014 with a simple premise: randomly terminate instances in production to force teams to build more resilient systems. The idea was sound – if your system can’t handle the unexpected failure of a single component, it’s not truly robust. But somewhere between Netflix’s careful, strategic implementation and the broader industry adoption, something went wrong.

The evidence is everywhere. Reddit discussions in r/sre are filled with horror stories of cascading failures triggered by poorly timed Chaos Monkey deployments. HackerNews threads document teams spending more time recovering from simulated failures than actually improving their systems. As one frustrated developer put it: “Chaos Monkey causes chaos, it does not fix it.”

The Case Against Overuse

The critics’ arguments are compelling and backed by real-world evidence:

Reduced Developer Productivity: Constant, unpredictable deployments disrupt development workflows. Teams report spending excessive time rolling back changes and restoring services instead of building new features. When your developers are in perpetual recovery mode, innovation suffers.

False Sense of Security: Perhaps more dangerously, frequent disruptions create complacency. Teams become so accustomed to recovering from simulated failures that they lose focus on preventative measures. The perception of resilience becomes skewed – surviving artificial chaos doesn’t guarantee handling real-world incidents.

Operational Overhead Explosion: Every Chaos Monkey deployment requires investigation, analysis, and often rollback procedures. This consumes significant resources that could be better spent on proactive improvements and strategic technical debt reduction.

Misinterpretation of Failure Data: Controlled, short-lived interruptions bear little resemblance to sustained, complex outages. The data collected during Chaos Monkey tests cannot reliably inform long-term architectural decisions because the test environment fundamentally differs from genuine failure scenarios.

The Netflix Model: What Actually Works

Netflix’s original implementation wasn’t the blanket deployment strategy many teams have adopted. Their approach was surgical, strategic, and tied to specific maturity milestones. They deployed Chaos Monkey on reduced scales, focused on services that were already stable, and closely monitored the results.

The key difference? Netflix treated Chaos Monkey as a precision instrument, not a blunt force tool. Their engineers understood that the goal wasn’t to cause disruption – it was to observe system behavior under stress and identify specific weaknesses.

The Middle Ground: Disciplined Chaos

This doesn’t mean Chaos Monkey is inherently flawed. When used correctly, it can provide valuable insights. The critical factors for success include:

Service Maturity Requirements: Deploy only on stable, well-understood services. Using Chaos Monkey on immature systems is like stress-testing a house of cards – you’ll learn it falls down, but not much else.

Controlled Frequency: Deployments should be carefully scheduled and spaced, not constant background noise. Teams need time to implement improvements between tests.

Clear Success Metrics: Define specific learning objectives before each deployment. What exactly are you trying to discover or validate?

Comprehensive Monitoring: Ensure you can observe and analyze the complete impact of each test, not just surface-level metrics.

My Take: Strategy Over Chaos

After witnessing both spectacular failures and genuine successes with Chaos Monkey, I believe the tool’s value lies in its strategic application, not its frequency of use. The teams that succeed with chaos engineering treat it like any other engineering discipline – with careful planning, clear objectives, and disciplined execution.

The controversy surrounding Chaos Monkey ultimately reflects a broader issue in our industry: the tendency to adopt powerful tools without fully understanding their appropriate context and limitations. Netflix built Chaos Monkey for their specific needs, scale, and organizational maturity. The problems arise when teams apply it blindly without considering their own unique circumstances.

Looking Forward

The evolution toward tools like Chaos Gorilla and more sophisticated chaos engineering frameworks shows the industry is learning. We’re moving away from random disruption toward targeted, hypothesis-driven resilience testing. This represents a maturation of the chaos engineering discipline – from “break things and see what happens” to “systematically validate our assumptions about system behavior.”

The lesson isn’t to abandon chaos engineering, but to approach it with the same rigor we apply to any other engineering practice. Used strategically, tools like Chaos Monkey can strengthen systems. Used carelessly, they strengthen nothing but frustration levels.

The choice is ours: embrace disciplined chaos or remain victims of chaotic discipline.

Collecting Datadog APM traces using Grafana Alloy and Tempo

The Problem

Hybrid cloud is the future, but monitoring remains stuck in the past. Many organizations embrace hybrid infrastructure, yet struggle with fragmented observability tools. Why? Because monitoring providers still operate in silos.

One of the primary reasons hybrid monitoring isn’t as prevalent is the challenge of instrumentation. Many cloud providers offer their own monitoring solutions. Instrumentation libraries are often incompatible with one another, making cross-platform integration difficult.

The good news? With OpenTelemetry, Grafana, and Datadog, hybrid monitoring is becoming easier and more flexible. 🚀

The Solution

One promising development is the rise of open-source, vendor-neutral instrumentation frameworks like OpenTelemetry.

In essence, open standards are reducing incompatibility issues and allowing “a vendor-agnostic approach to get data from the sources you need to the observability service of your choice.”

A step toward solving the hybrid cloud monitoring challenge came when Grafana introduced the otelcol.receiver.datadog component.

Now, with otelcol.receiver.datadog, Grafana users can ingest and process Datadog telemetry directly within OpenTelemetry pipelines, unlocking several advantages:

Expanding Grafana’s Reach to Datadog Customers
Seamless Integration with OpenTelemetry Pipelines
Avoiding Vendor Lock-in While Retaining Datadog’s Strengths
Cost Optimization by Centralizing Hybrid Monitoring

How it works together?

Requirements

Grafana – for web UI
Grafana Alloy – to receive, process and export telemetry data
Grafana Tempo – to collect and visualize traces from Datadog instrumented apps
Datadog agent – with enabled APM feature
Instrumented application with Datadog trace library

Quick check list:

you have running: Grafana, Alloy and Tempo services
You have running Datadog agents
You have instrumented applications with Datadog trace library
You add Tempo as Datasource in Grafana

Grafana Alloy configuration

The core of the solution is to set up Datadog receiver in Alloy config:

receivers:
  datadog:
    endpoint: "0.0.0.0:9126"
    output:
      traces: [otelcol.exporter.otlp.tempo.input]

# I am skipping extra steps which you might want to use to pre-proces data

otelcol:
  exporter:
    otlp:
      tempo:
        client:
          endpoint: "https://tempo-distributor.example.com:443"
          tls:
            insecure: false
            insecure_skip_verify: true

To avoid ports conflict with Datadog agent we choose 9126 as a port for Alloy Datadog receiver.

Running Alloy as Daemonset with hostNetwork access allow agent to be present on each Node.

alloy:
  extraPorts:
    - name: "datadog"
      port: 9126
      targetPort: 9126
      protocol: "TCP"

controller:
  hostNetwork: true
  hostPID: true

service:
  internalTrafficPolicy: "Local"

Application set up

Application has to be configured to send APM traces to Node IP:1926.

The Node IP can be extracted form Kubernetes meta information and passed as environment variable:

env:
- name: NODE_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

Datadog Agent configuration

With Datadog Agent we have two options:

Send traces to Alloy Datadog receiver as addition to Datadog host
Send traces only to Alloy Datadog receiver

With option 1 we assume you are using both Datadog and Grafana solutions in hybrid mode. In that case Datadog agent has to have following configuration:

agents:
  containers:
    agent:
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
    traceAgent:
      enabled: true
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'

For option 2 where traces must go only to Alloy Datadog receiver following configuration might work:

datadog:
  apm:
    socketEnabled: true                    # Use the Unix Domain Socket (default). Can be true even if port is enabled.
    portEnabled: true                      # Enable TCP port 8126 for traces.
    useLocalService: false 

  env:
    - name: DD_APM_DD_URL
      value: "http://alloy-service:9126"  # URL for OTel collector’s Datadog trace receiver
    - name: DD_APM_NON_LOCAL_TRAFFIC
      value: "true"

After all done and agents restarted you can navigate to Explore page in Grafana, select Tempo as datasource and get recent Datadog traces.

Conclusions

Kudos to Grafana and Datadog – Their collaboration on the otelcol.receiver.datadog makes transitioning between monitoring platforms smoother than ever, reducing friction for hybrid observability.
Hybrid Monitoring is the New Normal – Applications no longer rely on a single monitoring provider throughout their lifespan. As infrastructure evolves, businesses will inevitably switch or integrate multiple observability tools.
Stay Agile with Open Standards – Using OpenTelemetry ensures flexibility, allowing teams to adapt their monitoring stack without vendor lock-in, keeping observability seamless across hybrid and multi-cloud environments.

By embracing open standards, organizations can future-proof their monitoring strategies while ensuring complete visibility across their hybrid infrastructure. 🚀

Grafana a new look

As much as l like managed monitoring solutions it’s hard not to be excited about setting up your own platform based on Prometheus Grafana stack and save some bucks for the company.

After 5 years of using proprietary solutions I am again on the way to get into Grafana world.

Here I want to wrap up what I learnt so far about Grafana.

Big ecosystem of tools

The biggest change for me is ecosystem of projects which supports almost all imaginable monitoring needs.

For Logs: Loki and Promtail
For Traces: Tempo
For Profiling: Peryscope
For open telemetry collecting: Alloy
For eBPF: Beyla
For Synthetics: Prometheus BlackBox and K6 API
and many more

Automated Deployment process

A lot was done in the deployment process where using Helm charts you can get it up and running in a minutes.

All that comes with scalability and high availability in mind.

Though you would need to connect and customize each component, it’s still can be just a Helm configuration change.

Huge ecosystem of Components to simplify instrumentation

It’s clear your code has to be instrumented to be observable. Grafana and their Open Source community developed a number of tools to make it easy and simple:

Compatible receivers for big providers like Datadog
New discovery methods like eBPF
No code way of instrumentation like sidecar container

And a lot more see the Alloy components section to get started.

New fresh UI

Grafana Logs and Metrics get new look.

You can see multiple logs and metrics on the same page.

Summary

It’s cool, it’s fresh and it’s a lot of fun to use!

Brief Summary of Site Reliability Engineering best practices

The goal of SRE is to accelerate product development teams and keep services running in reliable and continuous way.

This article is a collection practical notes on explaining what is SRE, what kind of work SREs does and what type of processes they develop. The practices are based on Google SRE workbook.

This is a long article and If you will make it to the end I applaud to you!

But, don’t stop here, go and read both Google SRE books which mentioned in References. Then learn about Prometheus and ELK stacks – they are open source tools that help to implement SRE practices.
That should keep your busy for at least a year. I wish you best of luck!

SRE practices include

SLOs and SLIs
Monitoring
Alerting
Toil reduction
Simplicity

SLO and SLI

SLO is Service Level Objective is a goal that service provider wants to reach.

Practicality of SLO: SLOs are tools to help determine what engineering work to prioritize. SLOs define the concept of error budget.

Talking about error budget, lets see next how error budget have to be approached.

Error budget approach

There are SLOs which all stakeholder in org. approved
It is possible to meet SLOs needs under normal conditions
Org. is committed to using error budget for decision making and prioritizing

This are essentials steps to have error budget approach in place, your work is to go as close as possible to fulfill them.

What to measure using SLIs

SLIs is the ration between two numbers: the good and the total:

Number of successful HTTP request / total HTTP requests
Number of consumed jobs in a queue / total number of jobs in a queue

SLI divided on specification and implementation. For example:

Specification: ration of requests loaded in < 100 ms
Implementation based on: a) server logs b) Javascript on client

SLO and SLI in practice

The strategy to implement SLO, SLI in your company is to start small. Consider following aspects when working on your first SLO.

Choose one application for which you want to define SLOs
Decide on few keys SLIs specs that matter to your service and users
Consider common ways and tasks through which your users interact with service
Draw a high-level architecture diagram of your system
Show key components. The requests flow. The data flow

The result is narrow and focused prove of concept that would help to make benefits of SLO, SLI concise and clear.

Type of SLIs

SLI is Service Level Indicator is a measurement the service provider uses for the SLO goal.

There are several types of measurement you might choose from depend on type of your service:

Availability – The proportion of request which result in successful state
Latency – The proportion of request below some time threshold
Freshness – The proportion of data transferred to some time threshold. Replication or Data pipeline
Correctness – The proportion of input that produce correct output
Durability – The proportion of records written that can be successfully read

Summary of first actions toward SLOs and SLIs

Setup white box monitoring: prometheus, datadog, newrelic
Develop key SLOs and SLOs response process like incident management
Error budget enforcement decisions. Written error budget policy. Priority to reliability when error budget is spent.
Continuous improvement of SLOs target. Monthly SLOs review.
Count outages and measure user happiness
Create a training program to train developers on SLOs and other reliability concepts
Create SLOs dashboard

This is essential starting points to implement SLOs in your company. It will bring more confidence and better decision making in your services.

Monitoring

How SRE define monitoring

Alert on condition that require attention
Investigate and diagnose issues
Display information about the system visually
Gain insight into system health and resource usage for long-term planning
Compare the behavior of the system before and after a change, or between two control groups

Features of monitoring you have to know and tune

Speed. Freshness of data.
Data retention and calculations
Interfaces: graphs, tables, charts. High level or low level.
Alerts: multiple categories, notifications flow, suppress functionality.

Source of monitoring

Metrics
Logs

That is high level overview. Details depend on your tools and platform. For the ones who are only starting I recommend to look for open source projects such as Prometheus, Grafana and ElasticSearch(ELK) monitoring stack.

Alerting

Alerting considerations

Precision – The proportion of events detected that were significant
Recall – The proportion of significant events detected
Detection time – How long it takes to send notification in various conditions
Reset time – How long alerts fire after an issue is resolved

Ways to alerts

There are several strategies on how alerts could be setup. Recommendation is to combine several strategies to enhance your alerts quality from different directions.

First and simple one:

Target error rate ≥ SLO threshold.
- Example: For 10 minutes window error rate exceeds the SLO
- Upsides: Fast recall time
- Downsides: Precision is low
Increased alert windows.
- Example: if an event consumes 5% of the 30-day error budget – a 36-hour window.
- Upsides: good detection time
- Downside: poor reset time
Increment alert duration. For how long alert should be triggered to be significant.
- Upsides: Alerts can be higher precision.
- Downside: poor recall and poor detection time
Alert on burn rate. How fast, relative to SLO, the service consume error budget.
- Example: 5% error budget over 1 hour period.
- Upside: Good precision, short time window, good detection time.
- Downside: low recall, long reset time
Multiple burn rate alerts. Depend on burn rate determine severity of alert which lead to page notification or a ticket
- Upsides: good recall, good precision
- Downsides: More parameters to manage, long reset time
Multi window, multi burn alerts.
- Upsides: Flexible alert framework, good precision, good recall
- Downside: even more harder to manage, lots of parameters to specify

The same as for Monitoring an open source tool to be alert on metrics is combination of Prometheus and Alertmanager. As for logs look for Kibana project which has Alerting on log patterns.

Toil reduction

Definition: toils is repetitive, predictable, constant stream of tasks related to maintaining a service

What is toil

Manual. When the tmp directory on a web server reach 95% utilization, you need to login and find a space to clean up
Repetitive. A full tmp director is unlikely to be a one time event
Automatable. If the instructions are well defined then it’s better to automate the problem detection and remediation
Reactive. When you receive too many alerts of “disks full”, they distract more than help. So, potentially high-severity alerts could be missed
Lacks enduring value. The satisfaction of completed tasks is short term, because it is to prevent the issue in the future
Grow at least as fast as it’s source. Growing popularity of the service will require more infrastructure and more toil work

Potential benefits of toil automation

Engineering work might reduce toil in the future
Increased team morale
Less context switching for interrupts, which raises team productivity
Increased process clarity and standardization
Enhanced technical skills
Reduced training time
Fewer outages attributable to human errors
Improved security
Shorter response times for user requests

How to measure toil

Identify it.
Measure the amount of human effort applied to this toil
Track these measurements before, during and after toil reduction efforts

Toil categorization

Business processes. Most common source of toil.
Production interrupts. The key tasks to keep system running.
Product releases. Depending on the tooling and release size they could generate toil.(release requests, rollbacks, hot fixes and repetitive manual configuration changes)
Migrations. Large scale migration or even small database structure change likely done manually as one time effort. Such thinking is a mistake, because this work is repetitive.
Cost engineering and capacity planning. Ensure a cost-effective baseline. Prepare for critical high traffic events.
Troubleshooting

Toil management strategies in practices

Basics:

Identify and measure
Engineer toil out of the system
Reject the toil
Use SLO to reduce toil

Organizational:

Start with human-backed interfaces. For complex business problems start with partially automated approach.
Get support from management and colleagues. Toil reduction is worthwhile goal.
Promote toil reduction as a feature. Create strong business case for toil reduction.
Start small and then improve

Standardization and automation:

Increase uniformity. Lean to standard tools, equipment and processes.
Access risk within automation. Automation with admin-level privileges should have safety mechanism which checks automation actions against the system. It will prevent outages caused by bugs in automation tools.
Automate toil response. Think how to approach toil automation. It shouldn’t eliminate human understanding of what’s going on.
Use open source and third-party tools.

In general:

Use feedback to improve. Seek for feedback from users who interact with your tools, workflows and automation.

Simplicity

Measure complexity

Training time. How long it take for newcomer engineer to get on full speed.
Explanation time. The time it takes to provide a view on system internals.
Administrative diversity. How many ways are there to configure similar settings
Diversity of deployed configuration
Age. How old is the system

SRE work on simplicity

SRE understand the systems end to prevent and fix source of complexity
SRE should be involved in design, system architecture, configuration, deployment processes, or elsewhere.
SRE leadership empower SRE teams to push for simplicity, and to explicitly reward these efforts.

Conclusions

SRE practices require significant amount of time and skilled SRE people to implement right
A lot of tools are involved in day to day SRE work
SRE processes is one of a key to success of tech company

References

Google SRE Workbook
Google SRE book

Filebeats configuration for Kubernetes

filebeat.autodiscover:
  providers:
  - type: kubernetes
    node: ${NODE_NAME}
    hints.enabled: true
    hints.default_config:
      type: container
      paths:
        - /var/log/containers/*${data.kubernetes.container.id}.log

What’s so cool about above configuration

Filebeat Autodiscover

When you run applications on containers, they become moving targets to the monitoring system. Autodiscover allows you to track them and adapt settings as changes happen.

The Kubernetes autodiscover provider watches for Kubernetes nodes, pods, services to start, update, and stop.
As well it recognise a lot of additional labels and statuses related to Kubernetes objects.

Hints based autodiscover

Filebeat supports autodiscover based on hints from the provider. The hints system looks for hints in Kubernetes Pod annotations or Docker labels that have the prefix co.elastic.logs. As soon as the container starts, Filebeat will check if it contains any hints and launch the proper config for it. Hints tell Filebeat how to get logs for the given container.

Type Container

Use the container input to read containers log files.

This input searches for container logs under the given path, and parse them into common message lines, extracting timestamps too. Everything happens before line filtering, multiline, and JSON decoding, so this input can be used in combination with those settings.

Conclusions

Kubernetes logs autodiscovery and JSON decoding provide very good visibility into log stream. Labels and JSON log fields are properly named and parsed. Using ES and Kibana we can search through logs with easy queries and filter by fields.

References

Kubernetes autodiscover provider

Integrating Flask with Jaeger tracing on Kuberentes

Distributed applications and microservices required high level of observability. In this article we will integrate a Flask micro framework with Jaeger tracing tool. All code will be deployed to Kubernetes minikube cluster.

Flask

Let’s build a simple task manager service using Flask framework.

Code

tasks.py

from flask import Flask, jsonify
app = Flask(__name__)

@app.route('/')

tasks = {"tasks":[
        {"name":"task 1", "uri":"/task1"},
        {"name":"task 2", "uri":"/task2"}
    ]}

def  root():
	"Service root"
	return  jsonify({"url":"/tasks")
                     
@app.route('/tasks')
def  tasks():
	"Tasks list"
	return  jsonify(tasks)

if __name__ == '__main__':
  "Start up"
  app.run(debug=True, host='0.0.0.0',port=5000)

Disaster recovery of single node Kubernetes control plane

Overview

There are many possible root causes why control plane might become unavailable. Lets review most common scenarios and mitigation steps.

Mitigation steps in this article build around AWS public cloud features, but all popular public cloud offerings have similar functionality.

Apiserver VM shutdown or apiserver crashing

Results

unable to stop, update, or start new pods, services, replication controller
existing pods and services should continue to work normally, unless they depend on the Kubernetes API

AWS best practices for Lambda functions in Production

Hey folks,

just a month ago I have been involved in AWS project based on Lambda functions. In this article I will explain what I learned so far and how to create production Lambda AWS environment with best practices in mind.

I will start from top level and will explain everything you need to have basic infrastructure supporting your Lambda functions and other applications in your cloud.

VPC

First, you need to created dedicated VPC and reserve range of IPs which doesn’t conflict with your other networks in case you would need to pair them together. As a general rule you should never use default VPC for production needs.
Create a security group which only allow 80 and 443 incoming traffic.

Subnets

You would need at least 4 subnets, two private and two public. Each type of subnet have to split in at least two different availability zones.
Public subnet have to contain AWS services endpoints and your servers which needs to have direct connection to internet like ELB, API gateway endpoints or bastion host (your ssh jump server).
Private subnet have to contain all your infrastructure servers like web servers, database server or backend applications.

Note that You should never place your infrastructure servers in public subnets.

Internet gateway and NAT

To function properly your VPC have to be attached to internet gateway and your private subnet should have NAT service enabled.

MySQL

For the database I use MySQL RDS. You need to disable public access to the instance and deploy it into private subnet. In security group add port 3306 for incoming connections and only from internal IP range. So, we have double protection here with security group and internal dns name for database.
There are a lot of best practices of how to setup production ready mysql instance, so I will skip most of it, but what you definitely need is to have read replica and shadow copy enabled. Make sure you set maintenance window which is right for you.

Lambda functions

To have access to our private database Lambda functions needs to be deployed inside the same VPC in private subnets. To setup https endpoints for lambda functions you would need to attach API gateway. In Lambda security groups add ports 80 and 443 for incoming connections.

That’s pretty much it, but very often you will have other web applications running in your vpc and to route traffic properly between Lambda and other apps you would need some web proxy like nginx.

Nginx

To have common entry point for your web applications and Lambda function Nginx is the best way to go. There is a new possibility to use ELB for that, but it isn’t good enough yet.

To have reliable and secure setup of nginx you would need to use common pattern of AWS which include: ELB, Autoscaling group, Launch configuration and security groups.

On the configuration side nginx will proxy traffic to Lambda functions through API gateway.

Elastic load balancer

Here you need to decide what kind of ELB suits your needs. I choose ELB with HTTPS support which provide SSL termination. In the ELB security group I added ports 80 and 443 for all incoming traffic.

Launch configuration

Within Launch configuration you need to define what kind of instance you want to launch when autoscaling is trigger in.

Autoscaling group

ASG define what is desired number of instance you want to run at any given moment. Using metrics such as CPU you can setup it to scale up or down to desired maximum or minimum number of instances.

Almost there!

Last step is to connect ELB with ASG and with Launch configuration!

Note I have skipped setting up of Target group and health checks, but they are pretty much basics.

That’s it!

Now you have a good start to develop with AWS Lambda in conjunction with general approach of web tier architecture.

What’s next?

Second part of the topic is to setup CI and automation. Next time I will write how to code infrastructure with terraform, create nginx image with packer and run configuration management with ansible.

Continuous integration and deployment with Google Cloud Builder and Kubernetes

Pipeline of Continuous Integration(CI) for containers has several basic steps. Lets see what they are:

Setup a trigger

Listen to a change in repositories(github, bitbucket) such like pull request, new tag or new branch.

It is basic step for any CI/CD tool and for google cloud builder it is pretty trivial task to setup. Check out Container Registry – Build Triggers tool in google cloud console.

Build an image

When change to repository occur we want to start build of new Docker container image for a change. Good practice is to tag new image with branch name and git reference hash. E.g. master-00covfefe

With cloud builder you face two choices: use a Dockerfile or cloudbuild.yaml file. With Dockerfile option steps are predetermined and don’t give you too much flexibility.
With cloudbuild.yaml you can customise every step of your pipeline.
In the following example first command is doing a build step using Dockerfile and second command tag new image with branch-revision pattern(remember master-00covfefe):

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', '.' ]

- name: 'gcr.io/cloud-builders/docker'
  args: [ 'tag', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

Push new image to Container Registry

One important note that cloudbuild.yaml file has special directive “image” which publish image to registry, but that directive only executed at the end of all steps. So, in order to perform deployment step you need to publish image as a separate step.

- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

Deploy new image to Kubernetes

When new image is in registry it’s time to trigger deployment step. In this example it is deployment to Kubernetes cluster.
This step require Google Cloud Builder user to have Edit permissions to kubernetes cluster. In google cloud it is a user with “@cloudbuild.gserviceaccount.com” domain. You need to give that user Edit access to kubernetes using IAM console.
Second requirement is to specify zone and cluster cloudbuild.yaml using env variables. That will tell kubectl command to which cluster to deploy.

- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/my-nodejs-app-deployment', 'my-nodejs-app=eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=europe-west1-d'
  - 'CLOUDSDK_CONTAINER_CLUSTER=staging-cluster'

What next

At this point the CI/CD job is done. Possible next steps to improve your pipeline can be:

Send notification to Slack or Hipchat to let everyone know about new version deployment.
Run user acceptance tests to check that all functions perform well.
Run load tests and stress tests to check that new version has no degradation in performance.

Full cloudbuild.yaml file example

steps:
#build steps
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', '.' ]

- name: 'gcr.io/cloud-builders/docker'
  args: [ 'tag', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

#deployment step
- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/my-nodejs-app-deployment', 'my-nodejs-app=eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=europe-west1-d'
  - 'CLOUDSDK_CONTAINER_CLUSTER=staging-cluster'

#image update steps(two tags: latest and branch-revision)
images:
- 'eu.gcr.io/$PROJECT_ID/my-nodejs-app'
- 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID'

#tags for container builder
tags:
  - "frontend"
  - "nodejs"
  - "dev-team-1"

Prepare Application Launch Checklist

Introduction

Application Launch checklist is aimed for Devops, Sysops and anyone whois job to make website available and reliable.
The checklist better works for applications which are going to be Live in near feature, but also useful to validate your Devops processes for already running applications.

This checklist is a complied notes from Launch Checklist for Google Cloud Platform. It is mostly targeted on Devops work routines and in a nutshell explain first and necessary Devops steps into launching applications.

Software Architecture Documentation

Create an Architectural Summary. Include an overall architectural diagram, a summary of the process flows, detail the service interaction points.
List and describe how each service is used. Include use of any 3rd-party APIs.
Make it easy accessible and available – the best as wiki pages.

Builds and Releases

Document your build and release, configuration, and security management processes.
Automate build process. Include automated testing and packaging.
Automate release process to provision package between environments. Include rollback functionality.
Version your configuration and put it into Configuration Management system like Saltstack, Puppet or Ansible.
Simulate build and release failures. Are you able to roll back effectively? Is the process documented?

Disaster recovery

Document your routine backup, regular maintenance, and disaster recovery processes.
Test your restore process with real data. Determine time required for a full restore and reflect this in the disaster recovery processes.
Automate as much as possible.
Simulate major outages and test your Disaster Recovery processes
Simulate individual services failure to test your incidents recovery process

Monitoring

Document and define your system monitoring and alerting processes.
Validate that your system monitoring and alerting are sufficient and effective.

Final thoughts

I can not overstate how much final outcome depend on the level of interaction between Developers, Sysops and Devops teams in your organisation.

After application will go live convert it to a training program for every new devop before she will start site support.