Devops Archives - Site Reliability Engineer Blog

Cost saving strategy for Kubernetes platform

Working on cost saving strategy involves looking at the problem from very different dimensions.
Overall Kubernetes costs can be split on compute costs, networking costs, storage costs, licensing cost and SaaS costs.
In this part I will cover: Right-size infrastructure & use autoscaling

Right-size infrastructure & use autoscaling

Kubernetes Nodes autoscaler

In a cloud environment Kubernetes Nodes autoscaler plays important role in delivering just enough resources for your cluster. In a nutshell:

it adds new nodes according to demand – up scale
it consolidates under utilized nodes – down scale

It directly depends on Pod resource requests. Choosing the right requests is the key component for cost-effective strategy.
For self hosted solutions you might want to look into project like Karpenter.
Alternative scaling strategy is to add or remove nodes by schedule for a specific time plan. Using that approach you can simple specify the time of the day when you need to upscale and the time when you downscale.

Multi provider DNS management with Terraform and Pulumi

The Problem

Every DNS provider is very specific how they create DNS records. Using Terraform or Pulumi don’t guarantee multi provider support out of the box.

One example where AWS Route53 support values for multiple IP binding to the same name record. Where Cloudflare must have a dedicated record for each IP.

Theses API difference make it harder to write code which will work for multiple providers.

For AWS Route53 a single record can be created like this:

mydomain.local: IP1, IP2, IP3

For Cloudflare it would be 3 different records:

mydomain.local: IP1
mydomain.local: IP2
mydomain.local: IP3

The Solution 1: Use flexibility of programming language available with Pulumi

Pulumi has a first hand here since you can use the power of programming language to handle custom logic.

DNS data structure:

mydomain1.com: 
 - IP1
 - IP2 
 - IP3
mydomain2.com:
 - IP4
 - IP5 
 - IP6
mydomain3.com: 
 - IP7
 - IP8 
 - IP9

Using Python or Javascript we can expand this structure for Cloudflare provider or keep as is for AWS Route53.

In Cloudflare case we will create new record for each new IP

import pulumi
import pulumi_cloudflare as cloudflare
import yaml

# Load the configuration from a YAML file
yaml_file = "dns_records.yaml"
with open(yaml_file, "r") as file:
    dns_config = yaml.safe_load(file)

# Cloudflare Zone ID (Replace with your actual Cloudflare Zone ID)
zone_id = "your_cloudflare_zone_id"

# Iterate through domains and their associated IPs to create A records
for domain, ips in dns_config.items():
    if isinstance(ips, list):  # Ensure it's a list of IPs
        for ip in ips:
            record_name = domain
            cloudflare.Record(
                f"{record_name}-{ip.replace('.', '-')}",
                zone_id=zone_id,
                name=record_name,
                type="A",
                value=ip,
                ttl=3600,  # Set TTL (adjust as needed)
            )

# Export the created records
pulumi.export("dns_records", dns_config)

and since AWS Route53 support IPs list, so the code would look like:

for domain, ips in dns_config.items():
    if isinstance(ips, list) and ips:  # Ensure it's a list of IPs and not empty
        aws.route53.Record(
            f"{domain}-record",
            zone_id=hosted_zone_id,
            name=domain,
            type="A",
            ttl=300,  # Set TTL (adjust as needed)
            records=ips,  # AWS Route 53 supports multiple IPs in a single record
        )

Solution 2 – Using Terraform for each loop

It’s quite possible to achieve the same using Terraform starting with version 0.12 which introduce dynamic block.

Same data structure:

mydomain1.com: 
  - 192.168.1.1
  - 192.168.1.2
  - 192.168.1.3
mydomain2.com:
  - 10.0.0.1
  - 10.0.0.2
  - 10.0.0.3
mydomain3.com: 
  - 172.16.0.1
  - 172.16.0.2
  - 172.16.0.3

Terraform example for AWS Route53

provider "aws" {
  region = "us-east-1"  # Change this to your preferred region
}

variable "hosted_zone_id" {
  type = string
}

variable "dns_records" {
  type = map(list(string))
}

resource "aws_route53_record" "dns_records" {
  for_each = var.dns_records

  zone_id = var.hosted_zone_id
  name    = each.key
  type    = "A"
  ttl     = 300
  records = each.value
}

Quite simple using for_each loop, but will not work with Cloudflare, because of the mentioned compatibility issue. So, we need new record for each IP.

Terraform example for Cloudflare

# Create multiple records for each domain, one per IP
resource "cloudflare_record" "dns_records" {
  for_each = { for k, v in var.dns_records : k => flatten([for ip in v : { domain = k, ip = ip }]) }

  zone_id = var.cloudflare_zone_id
  name    = each.value.domain
  type    = "A"
  value   = each.value.ip
  ttl     = 3600
  proxied = false  # Set to true if using Cloudflare proxy
}

Conclusions

Pulumi: Flexible and easy to start. Data is separate from code, making it easy to add providers or change logic.
Terraform: Less complex and easier to support long-term but depends on data format
Both solutions require programming skills or expertise in Terraform language.

Brief Summary of Site Reliability Engineering best practices

The goal of SRE is to accelerate product development teams and keep services running in reliable and continuous way.

This article is a collection practical notes on explaining what is SRE, what kind of work SREs does and what type of processes they develop. The practices are based on Google SRE workbook.

This is a long article and If you will make it to the end I applaud to you!

But, don’t stop here, go and read both Google SRE books which mentioned in References. Then learn about Prometheus and ELK stacks – they are open source tools that help to implement SRE practices.
That should keep your busy for at least a year. I wish you best of luck!

SRE practices include

SLOs and SLIs
Monitoring
Alerting
Toil reduction
Simplicity

SLO and SLI

SLO is Service Level Objective is a goal that service provider wants to reach.

Practicality of SLO: SLOs are tools to help determine what engineering work to prioritize. SLOs define the concept of error budget.

Talking about error budget, lets see next how error budget have to be approached.

Error budget approach

There are SLOs which all stakeholder in org. approved
It is possible to meet SLOs needs under normal conditions
Org. is committed to using error budget for decision making and prioritizing

This are essentials steps to have error budget approach in place, your work is to go as close as possible to fulfill them.

What to measure using SLIs

SLIs is the ration between two numbers: the good and the total:

Number of successful HTTP request / total HTTP requests
Number of consumed jobs in a queue / total number of jobs in a queue

SLI divided on specification and implementation. For example:

Specification: ration of requests loaded in < 100 ms
Implementation based on: a) server logs b) Javascript on client

SLO and SLI in practice

The strategy to implement SLO, SLI in your company is to start small. Consider following aspects when working on your first SLO.

Choose one application for which you want to define SLOs
Decide on few keys SLIs specs that matter to your service and users
Consider common ways and tasks through which your users interact with service
Draw a high-level architecture diagram of your system
Show key components. The requests flow. The data flow

The result is narrow and focused prove of concept that would help to make benefits of SLO, SLI concise and clear.

Type of SLIs

SLI is Service Level Indicator is a measurement the service provider uses for the SLO goal.

There are several types of measurement you might choose from depend on type of your service:

Availability – The proportion of request which result in successful state
Latency – The proportion of request below some time threshold
Freshness – The proportion of data transferred to some time threshold. Replication or Data pipeline
Correctness – The proportion of input that produce correct output
Durability – The proportion of records written that can be successfully read

Summary of first actions toward SLOs and SLIs

Setup white box monitoring: prometheus, datadog, newrelic
Develop key SLOs and SLOs response process like incident management
Error budget enforcement decisions. Written error budget policy. Priority to reliability when error budget is spent.
Continuous improvement of SLOs target. Monthly SLOs review.
Count outages and measure user happiness
Create a training program to train developers on SLOs and other reliability concepts
Create SLOs dashboard

This is essential starting points to implement SLOs in your company. It will bring more confidence and better decision making in your services.

Monitoring

How SRE define monitoring

Alert on condition that require attention
Investigate and diagnose issues
Display information about the system visually
Gain insight into system health and resource usage for long-term planning
Compare the behavior of the system before and after a change, or between two control groups

Features of monitoring you have to know and tune

Speed. Freshness of data.
Data retention and calculations
Interfaces: graphs, tables, charts. High level or low level.
Alerts: multiple categories, notifications flow, suppress functionality.

Source of monitoring

Metrics
Logs

That is high level overview. Details depend on your tools and platform. For the ones who are only starting I recommend to look for open source projects such as Prometheus, Grafana and ElasticSearch(ELK) monitoring stack.

Alerting

Alerting considerations

Precision – The proportion of events detected that were significant
Recall – The proportion of significant events detected
Detection time – How long it takes to send notification in various conditions
Reset time – How long alerts fire after an issue is resolved

Ways to alerts

There are several strategies on how alerts could be setup. Recommendation is to combine several strategies to enhance your alerts quality from different directions.

First and simple one:

Target error rate ≥ SLO threshold.
- Example: For 10 minutes window error rate exceeds the SLO
- Upsides: Fast recall time
- Downsides: Precision is low
Increased alert windows.
- Example: if an event consumes 5% of the 30-day error budget – a 36-hour window.
- Upsides: good detection time
- Downside: poor reset time
Increment alert duration. For how long alert should be triggered to be significant.
- Upsides: Alerts can be higher precision.
- Downside: poor recall and poor detection time
Alert on burn rate. How fast, relative to SLO, the service consume error budget.
- Example: 5% error budget over 1 hour period.
- Upside: Good precision, short time window, good detection time.
- Downside: low recall, long reset time
Multiple burn rate alerts. Depend on burn rate determine severity of alert which lead to page notification or a ticket
- Upsides: good recall, good precision
- Downsides: More parameters to manage, long reset time
Multi window, multi burn alerts.
- Upsides: Flexible alert framework, good precision, good recall
- Downside: even more harder to manage, lots of parameters to specify

The same as for Monitoring an open source tool to be alert on metrics is combination of Prometheus and Alertmanager. As for logs look for Kibana project which has Alerting on log patterns.

Toil reduction

Definition: toils is repetitive, predictable, constant stream of tasks related to maintaining a service

What is toil

Manual. When the tmp directory on a web server reach 95% utilization, you need to login and find a space to clean up
Repetitive. A full tmp director is unlikely to be a one time event
Automatable. If the instructions are well defined then it’s better to automate the problem detection and remediation
Reactive. When you receive too many alerts of “disks full”, they distract more than help. So, potentially high-severity alerts could be missed
Lacks enduring value. The satisfaction of completed tasks is short term, because it is to prevent the issue in the future
Grow at least as fast as it’s source. Growing popularity of the service will require more infrastructure and more toil work

Potential benefits of toil automation

Engineering work might reduce toil in the future
Increased team morale
Less context switching for interrupts, which raises team productivity
Increased process clarity and standardization
Enhanced technical skills
Reduced training time
Fewer outages attributable to human errors
Improved security
Shorter response times for user requests

How to measure toil

Identify it.
Measure the amount of human effort applied to this toil
Track these measurements before, during and after toil reduction efforts

Toil categorization

Business processes. Most common source of toil.
Production interrupts. The key tasks to keep system running.
Product releases. Depending on the tooling and release size they could generate toil.(release requests, rollbacks, hot fixes and repetitive manual configuration changes)
Migrations. Large scale migration or even small database structure change likely done manually as one time effort. Such thinking is a mistake, because this work is repetitive.
Cost engineering and capacity planning. Ensure a cost-effective baseline. Prepare for critical high traffic events.
Troubleshooting

Toil management strategies in practices

Basics:

Identify and measure
Engineer toil out of the system
Reject the toil
Use SLO to reduce toil

Organizational:

Start with human-backed interfaces. For complex business problems start with partially automated approach.
Get support from management and colleagues. Toil reduction is worthwhile goal.
Promote toil reduction as a feature. Create strong business case for toil reduction.
Start small and then improve

Standardization and automation:

Increase uniformity. Lean to standard tools, equipment and processes.
Access risk within automation. Automation with admin-level privileges should have safety mechanism which checks automation actions against the system. It will prevent outages caused by bugs in automation tools.
Automate toil response. Think how to approach toil automation. It shouldn’t eliminate human understanding of what’s going on.
Use open source and third-party tools.

In general:

Use feedback to improve. Seek for feedback from users who interact with your tools, workflows and automation.

Simplicity

Measure complexity

Training time. How long it take for newcomer engineer to get on full speed.
Explanation time. The time it takes to provide a view on system internals.
Administrative diversity. How many ways are there to configure similar settings
Diversity of deployed configuration
Age. How old is the system

SRE work on simplicity

SRE understand the systems end to prevent and fix source of complexity
SRE should be involved in design, system architecture, configuration, deployment processes, or elsewhere.
SRE leadership empower SRE teams to push for simplicity, and to explicitly reward these efforts.

Conclusions

SRE practices require significant amount of time and skilled SRE people to implement right
A lot of tools are involved in day to day SRE work
SRE processes is one of a key to success of tech company

References

Google SRE Workbook
Google SRE book

First class incident response process

If you don’t have incident response process the chances are that sooner than later you will need one.

To understand where to start check out this first class incident response process documentation. It cover:

how to prepare to incident
what actions to take during incident
what are post incident steps

Continuous integration and deployment with Google Cloud Builder and Kubernetes

Pipeline of Continuous Integration(CI) for containers has several basic steps. Lets see what they are:

Setup a trigger

Listen to a change in repositories(github, bitbucket) such like pull request, new tag or new branch.

It is basic step for any CI/CD tool and for google cloud builder it is pretty trivial task to setup. Check out Container Registry – Build Triggers tool in google cloud console.

Build an image

When change to repository occur we want to start build of new Docker container image for a change. Good practice is to tag new image with branch name and git reference hash. E.g. master-00covfefe

With cloud builder you face two choices: use a Dockerfile or cloudbuild.yaml file. With Dockerfile option steps are predetermined and don’t give you too much flexibility.
With cloudbuild.yaml you can customise every step of your pipeline.
In the following example first command is doing a build step using Dockerfile and second command tag new image with branch-revision pattern(remember master-00covfefe):

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', '.' ]

- name: 'gcr.io/cloud-builders/docker'
  args: [ 'tag', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

Push new image to Container Registry

One important note that cloudbuild.yaml file has special directive “image” which publish image to registry, but that directive only executed at the end of all steps. So, in order to perform deployment step you need to publish image as a separate step.

- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

Deploy new image to Kubernetes

When new image is in registry it’s time to trigger deployment step. In this example it is deployment to Kubernetes cluster.
This step require Google Cloud Builder user to have Edit permissions to kubernetes cluster. In google cloud it is a user with “@cloudbuild.gserviceaccount.com” domain. You need to give that user Edit access to kubernetes using IAM console.
Second requirement is to specify zone and cluster cloudbuild.yaml using env variables. That will tell kubectl command to which cluster to deploy.

- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/my-nodejs-app-deployment', 'my-nodejs-app=eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=europe-west1-d'
  - 'CLOUDSDK_CONTAINER_CLUSTER=staging-cluster'

What next

At this point the CI/CD job is done. Possible next steps to improve your pipeline can be:

Send notification to Slack or Hipchat to let everyone know about new version deployment.
Run user acceptance tests to check that all functions perform well.
Run load tests and stress tests to check that new version has no degradation in performance.

Full cloudbuild.yaml file example

steps:
#build steps
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', '.' ]

- name: 'gcr.io/cloud-builders/docker'
  args: [ 'tag', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']

#deployment step
- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/my-nodejs-app-deployment', 'my-nodejs-app=eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=europe-west1-d'
  - 'CLOUDSDK_CONTAINER_CLUSTER=staging-cluster'

#image update steps(two tags: latest and branch-revision)
images:
- 'eu.gcr.io/$PROJECT_ID/my-nodejs-app'
- 'eu.gcr.io/$PROJECT_ID/my-nodejs-app:$BRANCH_NAME-$REVISION_ID'

#tags for container builder
tags:
  - "frontend"
  - "nodejs"
  - "dev-team-1"

Undocumented locust load API calls

Hey folks,

today I found that locust load do support API calls, so you can grab statistics and manage it’s execution.

If you like me hear about it first time here is few examples:

Start locust task

curl -XPOST $LOCUST_URL/swarm -d"locust_count=100&hatch_rate=10"

Get stats

curl $LOCUST_URL/stats/requests

Stop locust task

curl $LOCUST_URL/stop

All requests API endpoints can be found in locust source code on github

Another useful feature which is worth to mention is locust extension ability.
Using it you can add your own routes to API

from locust import web

@web.app.route("/added_page")
def my_added_page():
    return "Another page"

Unlike API calls extensions is documented.

Happy coding everyone!

Prepare Application Launch Checklist

Introduction

Application Launch checklist is aimed for Devops, Sysops and anyone whois job to make website available and reliable.
The checklist better works for applications which are going to be Live in near feature, but also useful to validate your Devops processes for already running applications.

This checklist is a complied notes from Launch Checklist for Google Cloud Platform. It is mostly targeted on Devops work routines and in a nutshell explain first and necessary Devops steps into launching applications.

Software Architecture Documentation

Create an Architectural Summary. Include an overall architectural diagram, a summary of the process flows, detail the service interaction points.
List and describe how each service is used. Include use of any 3rd-party APIs.
Make it easy accessible and available – the best as wiki pages.

Builds and Releases

Document your build and release, configuration, and security management processes.
Automate build process. Include automated testing and packaging.
Automate release process to provision package between environments. Include rollback functionality.
Version your configuration and put it into Configuration Management system like Saltstack, Puppet or Ansible.
Simulate build and release failures. Are you able to roll back effectively? Is the process documented?

Disaster recovery

Document your routine backup, regular maintenance, and disaster recovery processes.
Test your restore process with real data. Determine time required for a full restore and reflect this in the disaster recovery processes.
Automate as much as possible.
Simulate major outages and test your Disaster Recovery processes
Simulate individual services failure to test your incidents recovery process

Monitoring

Document and define your system monitoring and alerting processes.
Validate that your system monitoring and alerting are sufficient and effective.

Final thoughts

I can not overstate how much final outcome depend on the level of interaction between Developers, Sysops and Devops teams in your organisation.

After application will go live convert it to a training program for every new devop before she will start site support.

Making simple Splunk Nginx dashboard

As a DevOps guy I often do incident analysis, post deployment monitoring and usual logs checks. If you also is using Splunk as me when let me show for you few effective Splunk commands for Nginx logs monitoring.

Extract fileds

To make commands works Nginx log fields have to be extracted into variables.
Where are 2 ways to extract fields:

By default Splunk recognise “access_combined” log format which is default format for Nginx. If it is your case congratulations nothing to do for you!
For custom format of logs you will need to create regular expression. Splunk has built in user interface to extract fields or you can provide regular expression manually.

Website traffic over time and error rate

Unexpected spike in traffic or in error rate are always first thing to look for. Following command build a time chart with response codes. Codes 200/300 is your normal traffic and 400/500 is errors.

timechart count(status) span=1m by status

Response time

How do you know if your website running slowly?
For response time I suggest to use 20, 85 and 95 percentile as metrics.
You also can think of average response time metric, but low average response time doesn’t show that website is OK, so I am not using that metric in the query.

timechart perc20(request_time), perc85(request_time), perc95(request_time) span=1m

Traffic by IP

Checking which IPs are most popular is a good way to spot bad guys or misbehaving bot.

top limit=20 clientip

Top of error page

Looking for pages which produce most errors like 500 Internal Server Error or not found pages like 404? Following two queries give you exactly that information.
Top error pages

search status >= 500 | stats count(status) as cnt by uri, status | sort cnt desc

Top 40x error pages

search status >= 400 AND status < 500 | stats count(status) by uri, status | sort cnt desc

Number of timeouts(>30s) per upstream

When you are using Nginx as a proxy server it is very useful to see if any of upstreams are getting timeouts.
Timeouts could be a symptom for: slow application performance, not enough system resources or just upstream server is down.

search upstream_response_time >= 30 | stats count(upstream_response_time) as upstreams by upstream

Most time consuming upstreams

Most time consuming upstreams showing which of servers are already overloaded by requests and giving you a hint when application needs to be scaled

stats sum(upstream_response_time), count(upstream) by upstream

In conclusion

Splunk functions like timechart, stats and top is your best friends for data aggregation. They are like unix tools – the more tools you know the more easier is to build powerful commands.