Can ArgoCD replace Helm managers?

ArgoCD is a tool that provides a GitOps way of deployment. It supports various application formats, including Helm charts.

Helm is one of the most popular packaging formats for Kubernetes applications. It has given rise to various tools for managing Helm charts, such as helmfile, helmify and others.

Why is ArgoCD replacing Helm managers?

For a long time, Helm managers have helped deploy Helm charts via different deployment strategies. They served well, but the big disadvantage was the configuration drift that accumulated over time.

One solution to avoid drift was to schedule a periodic job that applies the latest changes to the Helm charts. Sometimes this works, but most often, if one of the charts hits an install issue, all the other charts stop being updated as well.
So it wasn't an ideal way to handle it.

In that situation ArgoCD comes to the rescue. ArgoCD controls the update of each chart as a dedicated process, so an issue in one chart does not block updates of the other charts. Helm charts are first-class citizens in ArgoCD, and most Helm features are supported.

What does it look like in ArgoCD?

To try it out, let's start with a simple Application that installs the Prometheus Helm chart with a custom configuration.

The Prometheus Helm chart can be found at https://prometheus-community.github.io/helm-charts

The custom configuration will be stored in our internal project https://git.example.com/org/value-files.git

Thanks to multi-source support in Application, we can use the official chart and our custom configuration together, as in the following example:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus                            # example name
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc    # example destination cluster
    namespace: monitoring                     # example target namespace
  sources:
  - repoURL: 'https://prometheus-community.github.io/helm-charts'
    chart: prometheus
    targetRevision: 15.7.1
    helm:
      valueFiles:
      - $values/charts/prometheus/values.yaml
  - repoURL: 'https://git.example.com/org/value-files.git'
    targetRevision: dev
    ref: values

Here `$values` is a reference to the root folder of the value-files repository. That values file overrides the default Prometheus settings, so we can configure it for our needs.
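For illustration, the override file might look like the snippet below. The keys are only plausible examples for the community Prometheus chart; check the chart's default values.yaml for your version.

# charts/prometheus/values.yaml (illustrative overrides, not from the original post)
server:
  retention: 30d          # keep metrics longer than the chart default
  persistentVolume:
    size: 50Gi            # example volume size
alertmanager:
  enabled: true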

What’s next?

There are several distinct features that Helm managers support, so your next step is to check whether ArgoCD covers all your needs.

One notable feature is secrets support. Many managers support the SOPS format for secrets, while ArgoCD doesn't give you a solution out of the box, so it's up to you how to manage secrets.

Another important feature is the order of execution, which can be an important part of your Helm manager setup. ArgoCD doesn't have a built-in replacement, so you have to rely on the Helm format to express dependencies. One way is to build umbrella charts for complex applications, as sketched below.
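A minimal sketch of such an umbrella chart's Chart.yaml, with hypothetical child charts, could look like this:

apiVersion: v2
name: my-platform            # hypothetical umbrella chart
version: 0.1.0
dependencies:
  - name: prometheus
    version: 15.7.1
    repository: https://prometheus-community.github.io/helm-charts
  - name: my-app             # hypothetical internal chart deployed alongside it
    version: 1.2.3
    repository: https://charts.example.com

Helm resolves the dependencies together, so the child charts are installed as part of one release.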

The most important change is the GitOps style of deployment. ArgoCD doesn't run in your CI/CD pipeline, so you don't have a feature like helm diff to preview changes before applying; it applies changes as soon as they become available in the linked Git repository.

Conclusions

  1. ArgoCD can replace Helm managers, but it strongly depends on your project needs.
  2. ArgoCD introduces new challenges, such as secrets management and order of execution.
  3. It introduces a GitOps deployment style and replaces the usual CI/CD pipeline, so new “quality gates” need to be built to be ready for a production environment.

Path to Staff Engineer role

At some point in your career you take time to think: what's next for me? What is my next challenge to grow on the career ladder? If your answer is Staff Engineer, Tech Lead, Team Lead or Principal Engineer, then you are on the way to a Staff-plus Engineer role.

As a Staff-plus Engineer myself, I want to share a few tips with you.

Start working on your Staff Engineer package

  1. What's your Staff-level project?
    • What did you do and what was the impact?
    • What behaviors did you demonstrate and how complex were the projects?
  2. Link your Design Documents, RFCs and Proposals to support your package with your design and architecture contributions.
  3. Can you quantify the impact of your projects?
    • Did it help to increase revenue?
    • Did it save costs?
  4. What glue work did you do for the organization?
    • What's the impact of that glue work?
  5. Who have you mentored and through what accomplishments?

Sharpen your soft skills

  1. Communication.
    • Keep people informed and connected.
  2. Negotiation
    • Be ready to resolve difficult situations
  3. Presentation
    • Know your audience. Learn one or two presentation frameworks.
  4. Networking
    • Don't be shy about getting in touch with other Staff-plus engineers in your company.
    • They are the best people to help you navigate the role inside the company.

Learn Staff Engineer tools

  1. Leadership
    • Become a problem solver. Be a visionary in your area and stay up to date with its technologies.
  2. Planning and Goal orientation
    • Have a vision of what to do next in one area or another.
    • Take an active part in planning meetings.
  3. Collaboration and contribution
    • Propose new Decision docs, Proposals and RFCs. Make sure they are reviewed and discussed.
    • Implement a POC to demo the idea.
  4. Team work and mentorship
    • Help your peers. Be a problem solver. Be visible. Become a go-to person.
    • Try on a mentorship role.

Summary

Getting a Staff Engineer role could take months or years to accomplish. However, it's crucial to view this milestone not as the ultimate destination, but rather as a guiding roadmap for your career development. It's not the title itself that holds the most significance, but rather the daily challenges we face and the positive impact we make as we progress in our journey.

Resources

  1. About glue work.
  2. Staff Engineer book

Certified Kubernetes Administrator

The CKA exam was a real challenge and a great experience overall.

My first impression: now I am officially part of the Kubernetes world and the CNCF.

But then, I just love Kubernetes and its ecosystem of open-source projects, and I had a lot of fun during my preparation and the CKA exam. When you combine work with fun, you can achieve really great results.

My advice for all future CKA participants is to have 80% real Kubernetes practice; the other 20% you can learn along the way during the exam year.

From a practical aspect, I used Minikube to tackle Deployments and other Kubernetes API related tasks. As for cluster administration experience, I mostly used Vagrant boxes and Google Cloud VMs to get a multi-node setup and network capabilities.

My last advice is to build your Kubernetes knowledge like a layered system. Start with the foundation and build it block by block. Finish with a production-ready Kubernetes cluster and your exam will become an easy walk in the park.

Have fun!

Implementing RED method with Istio

In this article I describe how to quickly get started with the SLO and SLI practice using Istio. If you are new to SLOs and SLIs, read the Brief Summary of Site Reliability Engineering best practices first.

RED, or Rate, Errors, Duration

The RED method is an easy-to-understand monitoring methodology. For every service, monitor:

Rate – how many operations per second are placed on a service
Errors – what percentage of the traffic returns in an error state
Duration – the time it takes to serve a request or process a job

Rate, Errors and Duration are good Service Level Indicators to start with, because an undesirable change in any of them directly impacts your users.
Examples of Service Level Objectives for the RED indicators:

  • The number of requests is above 1000 operations per minute, or the number of requests doesn't drop by more than 10% over a 10-minute period
  • The error rate is below 0.5%
  • Request latency for 99% of operations is less than 100 ms

For a microservices platform, RED is a perfect set of metrics to start with. But how do you collect and visualize them for a large number of services?

Enter Istio.

Deploying Kubernetes, Istio and demo Apps

Tools for a demo

  • minikube – local Kubernetes cluster to test our setup (optional if you have a demo cluster)
  • skaffold – an application for building and deploying demo microservices
  • istioctl – istio command line tool
  • Online Boutique – a cloud-native microservices demo application

Kubernetes and Istio

For our demo we need the following cluster specs and a default Istio setup:

minikube start --cpus=4 --memory 4096 --disk-size 32g
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.6.4
bin/istioctl install --set profile=demo
kubectl label namespace default istio-injection=enabled

Demo applications

For the test applications I am using Online Boutique, which is a cloud-native microservices demo from the Google Cloud Platform team.

git clone https://github.com/GoogleCloudPlatform/microservices-demo.git
cd microservices-demo
skaffold run # builds and deploys the applications; takes about 20 minutes

Verify that applications are installed:

kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
adservice-85cb97f6c8-fkbf2               2/2     Running   0          96s
cartservice-6f955b89f4-tb6vg             2/2     Running   2          96s
checkoutservice-5856dcfdd5-s7phn         2/2     Running   0          96s
currencyservice-9c888cdbc-4lxhs          2/2     Running   0          96s
emailservice-6bb8bbc6f7-m5ldg            2/2     Running   0          96s
frontend-68646cffc4-n4jqj                2/2     Running   0          96s
loadgenerator-5f86f94b89-xc7hf           2/2     Running   3          95s
paymentservice-56ddc9454b-lrpsm          2/2     Running   0          95s
productcatalogservice-5dd6f89b89-bmnr4   2/2     Running   0          95s
recommendationservice-868bc84d65-cgj2j   2/2     Running   0          95s
redis-cart-b55b4cf66-t29mk               2/2     Running   0          95s
shippingservice-cd4c57b99-r8bl7          2/2     Running   0          95s

Notice the 2/2 ready containers. One of the containers is the Istio sidecar proxy, which does all the work of collecting the RED metrics.

RED metrics Visualization

Set up a proxy to the Grafana dashboard:

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=grafana -o jsonpath='{.items[0].metadata.name}') 3000:3000 &

Open the Grafana Istio Service Mesh dashboard for a global look at the RED metrics.

A detailed per-service board shows the same RED metrics for each individual service.
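Under the hood, these dashboards are built on Istio's standard metrics. As a rough sketch, you can also query them directly from the Prometheus instance bundled with the demo profile (the prometheus service name and the frontend workload are assumptions based on this demo setup):

# Port-forward the bundled Prometheus (assumption: installed by the demo profile)
kubectl -n istio-system port-forward svc/prometheus 9090:9090 &
# Rate: requests per second to the frontend workload over the last 5 minutes
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_workload="frontend"}[5m]))'
# Errors: share of 5xx responses
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_workload="frontend",response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{destination_workload="frontend"}[5m]))'
# Duration: 99th percentile latency in milliseconds
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="frontend"}[5m])) by (le))'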

Conclusions

Istio provides Rate, Errors and Duration metrics out of the box, which is a big leap toward an SLO and SLI practice for all services.

Incorporating Istio in your platform is a big step toward observability, security and control of your service mesh.

Brief Summary of Site Reliability Engineering best practices

The goal of SRE is to accelerate product development teams and keep services running in a reliable and continuous way.

This article is a collection of practical notes explaining what SRE is, what kind of work SREs do and what types of processes they develop. The practices are based on the Google SRE workbook.

This is a long article, and if you make it to the end, I applaud you!

But don't stop here: go and read both Google SRE books mentioned in the References. Then learn about the Prometheus and ELK stacks – they are open-source tools that help to implement SRE practices.
That should keep you busy for at least a year. I wish you the best of luck!

SRE practices include

  • SLOs and SLIs
  • Monitoring
  • Alerting
  • Toil reduction
  • Simplicity

SLO and SLI

An SLO (Service Level Objective) is a goal that the service provider wants to reach.

The practicality of SLOs: they are tools to help determine what engineering work to prioritize. SLOs define the concept of an error budget.

Speaking of error budgets, let's see next how the error budget has to be approached.

Error budget approach

  • There are SLOs that all stakeholders in the organization have approved
  • It is possible to meet the SLOs under normal conditions
  • The organization is committed to using the error budget for decision making and prioritization

These are the essential steps to have an error budget approach in place; your job is to get as close as possible to fulfilling them.
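To make the error budget concrete, here is a tiny illustrative calculation (the numbers are assumed, not taken from the workbook):

# Error budget for an assumed 99.9% availability SLO over a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43200 minutes in 30 days
error_budget_minutes = window_minutes * (1 - slo)
print(round(error_budget_minutes, 1))          # 43.2 minutes of allowed downtime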

What to measure using SLIs

An SLI is the ratio between two numbers: the good and the total:

  • Number of successful HTTP request / total HTTP requests
  • Number of consumed jobs in a queue / total number of jobs in a queue

An SLI is divided into specification and implementation. For example:

  • Specification: the ratio of requests served in < 100 ms
  • Implementation based on: a) server logs, b) JavaScript on the client

SLO and SLI in practice

The strategy for implementing SLOs and SLIs in your company is to start small. Consider the following aspects when working on your first SLO.

  • Choose one application for which you want to define SLOs
  • Decide on a few key SLI specs that matter to your service and its users
  • Consider common ways and tasks through which your users interact with the service
  • Draw a high-level architecture diagram of your system
  • Show key components, the request flow and the data flow

The result is a narrow and focused proof of concept that helps to make the benefits of SLOs and SLIs concise and clear.

Types of SLIs

An SLI (Service Level Indicator) is a measurement the service provider uses for the SLO goal.

There are several types of measurement you might choose from, depending on the type of your service:

  • Availability – The proportion of requests that result in a successful state
  • Latency – The proportion of requests served below some time threshold
  • Freshness – The proportion of data more recent than some time threshold. Applies to replication and data pipelines
  • Correctness – The proportion of input that produces correct output
  • Durability – The proportion of records written that can be successfully read

Summary of first actions toward SLOs and SLIs

  • Set up white-box monitoring: Prometheus, Datadog, New Relic
  • Develop key SLOs and an SLO response process, such as incident management
  • Make error budget enforcement decisions: a written error budget policy, priority to reliability when the error budget is spent
  • Continuously improve SLO targets, with a monthly SLO review
  • Count outages and measure user happiness
  • Create a training program to train developers on SLOs and other reliability concepts
  • Create SLO dashboards

These are the essential starting points for implementing SLOs in your company. They will bring more confidence and better decision making to your services.

Monitoring

How SREs define monitoring

  • Alert on conditions that require attention
  • Investigate and diagnose issues
  • Display information about the system visually
  • Gain insight into system health and resource usage for long-term planning
  • Compare the behavior of the system before and after a change, or between two control groups

Features of monitoring you have to know and tune

  • Speed. Freshness of data.
  • Data retention and calculations
  • Interfaces: graphs, tables, charts. High level or low level.
  • Alerts: multiple categories, notifications flow, suppress functionality.

Sources of monitoring

  • Metrics
  • Logs

That is a high-level overview. The details depend on your tools and platform. For those who are only starting, I recommend looking at open-source projects such as the Prometheus, Grafana and Elasticsearch (ELK) monitoring stacks.

Alerting

Alerting considerations

  • Precision – The proportion of events detected that were significant
  • Recall – The proportion of significant events detected
  • Detection time – How long it takes to send a notification in various conditions
  • Reset time – How long alerts fire after an issue is resolved

Ways to alert

There are several strategies for how alerts can be set up. The recommendation is to combine several strategies to improve your alert quality from different directions.

Starting with the simplest one:

  • Target error rate ≥ SLO threshold.
    • Example: over a 10-minute window the error rate exceeds the SLO
    • Upside: fast recall time
    • Downside: precision is low
  • Increased alert windows.
    • Example: if an event consumes 5% of the 30-day error budget – a 36-hour window.
    • Upside: good detection time
    • Downside: poor reset time
  • Increased alert duration. For how long an alert should be triggered to be significant.
    • Upside: alerts can be higher precision.
    • Downside: poor recall and poor detection time
  • Alert on burn rate. How fast, relative to the SLO, the service consumes the error budget (a small calculation sketch follows this list).
    • Example: 5% of the error budget over a 1-hour period.
    • Upside: good precision, short time window, good detection time.
    • Downside: low recall, long reset time
  • Multiple burn rate alerts. Depending on the burn rate, determine the severity of the alert, which leads to a page notification or a ticket.
    • Upsides: good recall, good precision
    • Downsides: more parameters to manage, long reset time
  • Multi-window, multi-burn-rate alerts.
    • Upsides: flexible alert framework, good precision, good recall
    • Downside: even harder to manage, lots of parameters to specify
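As a small sketch of the burn-rate idea (illustrative numbers only):

# Burn rate: how fast the service consumes its error budget relative to the SLO.
# A burn rate of 1 means the 30-day budget is used up in exactly 30 days.
slo = 0.999                        # assumed 99.9% availability SLO
budget_fraction = 1 - slo          # 0.1% of requests may fail
observed_error_rate = 0.01         # assume 1% of requests are failing right now
burn_rate = observed_error_rate / budget_fraction
print(round(burn_rate, 1))         # 10.0 -> the 30-day budget would be gone in ~3 days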

As with monitoring, an open-source way to alert on metrics is the combination of Prometheus and Alertmanager. As for logs, look at the Kibana project, which supports alerting on log patterns.

Toil reduction

Definition: toil is a repetitive, predictable, constant stream of tasks related to maintaining a service.

What is toil

  • Manual. When the tmp directory on a web server reaches 95% utilization, you need to log in and find space to clean up
  • Repetitive. A full tmp directory is unlikely to be a one-time event
  • Automatable. If the instructions are well defined, then it's better to automate the problem detection and remediation
  • Reactive. When you receive too many “disk full” alerts, they distract more than help, so potentially high-severity alerts could be missed
  • Lacks enduring value. The satisfaction from completing the task is short-term, because the work does not permanently improve the service
  • Grows at least as fast as its source. The growing popularity of the service will require more infrastructure and more toil

Potential benefits of toil automation

  • Engineering work might reduce toil in the future
  • Increased team morale
  • Less context switching for interrupts, which raises team productivity
  • Increased process clarity and standardization
  • Enhanced technical skills
  • Reduced training time
  • Fewer outages attributable to human errors
  • Improved security
  • Shorter response times for user requests

How to measure toil

  • Identify it.
  • Measure the amount of human effort applied to this toil
  • Track these measurements before, during and after toil reduction efforts

Toil categorization

  • Business processes. Most common source of toil.
  • Production interrupts. The key tasks to keep system running.
  • Product releases. Depending on the tooling and release size, they can generate toil (release requests, rollbacks, hot fixes and repetitive manual configuration changes).
  • Migrations. A large-scale migration, or even a small database structure change, is likely done manually as a one-time effort. Such thinking is a mistake, because this work is repetitive.
  • Cost engineering and capacity planning. Ensure a cost-effective baseline. Prepare for critical high traffic events.
  • Troubleshooting

Toil management strategies in practice

Basics:

  • Identify and measure
  • Engineer toil out of the system
  • Reject the toil
  • Use SLO to reduce toil

Organizational:

  • Start with human-backed interfaces. For complex business problems, start with a partially automated approach.
  • Get support from management and colleagues. Toil reduction is a worthwhile goal.
  • Promote toil reduction as a feature. Create a strong business case for toil reduction.
  • Start small and then improve

Standardization and automation:

  • Increase uniformity. Lean toward standard tools, equipment and processes.
  • Assess risk within automation. Automation with admin-level privileges should have safety mechanisms that check automation actions against the system. This prevents outages caused by bugs in automation tools.
  • Automate toil response. Think about how to approach toil automation; it shouldn't eliminate human understanding of what's going on.
  • Use open source and third-party tools.

In general:

  • Use feedback to improve. Seek feedback from users who interact with your tools, workflows and automation.

Simplicity

Measure complexity

  • Training time. How long it takes for a newcomer engineer to get up to full speed.
  • Explanation time. The time it takes to provide a view of the system's internals.
  • Administrative diversity. How many ways there are to configure similar settings.
  • Diversity of deployed configurations.
  • Age. How old the system is.

SRE work on simplicity

  • SREs understand the systems end to end in order to prevent and fix sources of complexity
  • SREs should be involved in design, system architecture, configuration, deployment processes and elsewhere
  • SRE leadership empowers SRE teams to push for simplicity and explicitly rewards these efforts

Conclusions

  • SRE practices require a significant amount of time and skilled SRE people to implement right
  • A lot of tools are involved in day-to-day SRE work
  • SRE processes are one of the keys to the success of a tech company

References

Simple LRU cache implementation on Python 3

What is LRU Cache?

This is a cache item replacement policy where the least recently used items are discarded first.

Problem

The LRU Cache algorithm requires keeping track of what was used when, which is expensive if one wants to make sure the algorithm always discards the least recently used item.

Solution

Approaching the problem, I was thinking of two data structure capabilities: a FIFO queue and a hash table.

The FIFO queue is responsible for evicting the least used items. The hash table is responsible for getting cached items. Using these data structures makes both operations O(1) in time complexity.

Python's collections.OrderedDict combines both of these capabilities:

  • Queue: dict items are ordered as a FIFO queue, so inserts and evictions are done in O(1)
  • Hash table: dict keys provide access to data in O(1) time
import collections
LRUCache = collections.OrderedDict()

Insert operation:

LRUCache[key] = value

It adds the key to the end of the dict, so the position of new items is always fixed.

Check for hit:

if key in LRUCache:
  LRUCache.move_to_end(key)

Here we are doing two things: 1) checking if the key exists; 2) if it exists, we move the key to the end of the dict, so keys that get a hit always update their position to become the newest key.

Discard or evict operation:

if len(LRUCache) > CACHE_SIZE:
  evict_key, evict_val = LRUCache.popitem(last=False)

The evict operation pops an item from the beginning of the dict, removing the oldest key.

Python 3 implementation

"""Simple LRUCache implementation."""

import collections
import os
import random
from math import factorial

CACHE_SIZE = 100
if os.getenv('CACHE_SIZE'):
    CACHE_SIZE = int(os.getenv('CACHE_SIZE'))

SAMPLE_SIZE = 100
if os.getenv('SAMPLE_SIZE'):
    SAMPLE_SIZE = int(os.getenv('SAMPLE_SIZE'))

LRUCache = collections.OrderedDict()


def expensive_call(number):
    """Calculate factorial. Example of expensive call."""
    return factorial(number)


if __name__ == '__main__':

    test_cases = random.choices(
        [x for x in range(SAMPLE_SIZE*3)],
        [x for x in range(SAMPLE_SIZE*3)],
        k=SAMPLE_SIZE
    )

    for test in test_cases:
        if test in LRUCache:
            print("hit:", test, LRUCache[test])
            # Move the hit item to the end (most recently used position).
            LRUCache.move_to_end(test, last=True)
        else:
            LRUCache[test] = expensive_call(test)
            print("miss:", test, LRUCache[test])
        if len(LRUCache) > CACHE_SIZE:
            evict_key, evict_val = LRUCache.popitem(last=False)
            print("evict:", evict_key, evict_val)

As a use case, I used the LRU cache to cache the output of an expensive function call like factorial.

The sample size and cache size are controllable through environment variables. Try to run it with small numbers to see how it behaves:

CACHE_SIZE=4 SAMPLE_SIZE=10 python lru.py

Next steps:

  • Encapsulate the business logic into a class (a rough sketch follows this list)
  • Add Python magic methods to provide a Pythonic way of dealing with class objects
  • Add unit tests
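A rough sketch of that class-based version, built on the same OrderedDict idea (the class and method names here are my own choice, not from the original post):

"""Sketch of a class-based LRUCache."""

import collections


class LRUCache:
    """LRU cache backed by collections.OrderedDict."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._data = collections.OrderedDict()

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)

    def get(self, key):
        """Return the cached value and mark the key as most recently used."""
        value = self._data[key]
        self._data.move_to_end(key)
        return value

    def put(self, key, value):
        """Insert a value and evict the least recently used key when over capacity."""
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)


cache = LRUCache(capacity=4)
cache.put("a", 1)
print("a" in cache, len(cache))    # True 1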

Filebeat configuration for Kubernetes

filebeat.autodiscover:
  providers:
  - type: kubernetes
    node: ${NODE_NAME}
    hints.enabled: true
    hints.default_config:
      type: container
      paths:
        - /var/log/containers/*${data.kubernetes.container.id}.log

What's so cool about the above configuration?

Filebeat Autodiscover

When you run applications in containers, they become moving targets for the monitoring system. Autodiscover allows you to track them and adapt settings as changes happen.

The Kubernetes autodiscover provider watches Kubernetes nodes, pods and services as they start, update and stop.
It also recognizes a lot of additional labels and statuses related to Kubernetes objects.

Hints based autodiscover

Filebeat supports autodiscover based on hints from the provider. The hints system looks for hints in Kubernetes Pod annotations or Docker labels that have the prefix co.elastic.logs. As soon as the container starts, Filebeat will check if it contains any hints and launch the proper config for it. Hints tell Filebeat how to get logs for the given container.
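For example, a Pod could carry hints like these (the annotation keys follow the co.elastic.logs convention; the concrete Pod and values are just an illustration):

apiVersion: v1
kind: Pod
metadata:
  name: my-app                                   # hypothetical application pod
  annotations:
    co.elastic.logs/enabled: "true"
    co.elastic.logs/json.keys_under_root: "true" # parse JSON log lines
    co.elastic.logs/multiline.pattern: '^\['     # join stack traces into one event
    co.elastic.logs/multiline.negate: "true"
    co.elastic.logs/multiline.match: after
spec:
  containers:
  - name: my-app
    image: my-app:latest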

Type Container

Use the container input to read container log files.

This input searches for container logs under the given path and parses them into common message lines, extracting timestamps too. Everything happens before line filtering, multiline, and JSON decoding, so this input can be used in combination with those settings.

Conclusions

Kubernetes log autodiscovery and JSON decoding provide very good visibility into the log stream. Labels and JSON log fields are properly named and parsed. Using Elasticsearch and Kibana we can search through logs with simple queries and filter by fields.

Integrating Flask with Jaeger tracing on Kubernetes

Distributed applications and microservices require a high level of observability. In this article we will integrate the Flask micro-framework with the Jaeger tracing tool. All code will be deployed to a Kubernetes minikube cluster.

Flask

Let's build a simple task manager service using the Flask framework.

Code

tasks.py

from flask import Flask, jsonify

app = Flask(__name__)

# static task data
TASKS = {"tasks": [
    {"name": "task 1", "uri": "/task1"},
    {"name": "task 2", "uri": "/task2"}
]}


@app.route('/')
def root():
    "Service root"
    return jsonify({"url": "/tasks"})


@app.route('/tasks')
def tasks():
    "Tasks list"
    return jsonify(TASKS)


if __name__ == '__main__':
    # Start up
    app.run(debug=True, host='0.0.0.0', port=5000)

Backup and restore of Etcd cluster

A Kubernetes disaster recovery plan usually consists of backing up the etcd cluster and having infrastructure as code to provision a new set of servers in the cloud. Let's see how to do the first part – back up etcd in two basic and easy ways.

Etcd backup

The only stateful component of a Kubernetes cluster is the etcd server. The etcd server is where Kubernetes stores all API objects and configuration.
Backing up this storage is sufficient for a complete recovery of the Kubernetes cluster state.

Backup with etcdctl

etcdctl is a command-line tool for managing the etcd server and its data.
The commands to make and restore a backup are below.

Making a backup

ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db

The command to restore a snapshot is:

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db

Note: for HTTPS endpoints you might need to specify paths to certificates and keys in order to access the etcd server API with etcdctl.
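On a kubeadm-style control plane, the call could look like this (the certificate paths are typical kubeadm defaults; adjust them for your cluster):

ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  snapshot save snapshot.db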

Store backup at remote storage

It's important to back up the data to remote storage like S3. This guarantees that a copy of the etcd data will be available even if the control plane volume is inaccessible or corrupted.

  • Make an S3 bucket
  • Copy snapshot.db to S3 with a new filename
  • Set up S3 object expiration to clean up old backup files
# new s3 bucket for etcd backups
aws s3 mb s3://etcd-backup
# define a backup filename based on current date and time
filename=`date +%F-%H-%M`.db
aws s3 cp ./snapshot.db s3://etcd-backup/etcd-data/$filename
# set bucket lifecycle configuration for backup file rotation
aws s3api put-bucket-lifecycle-configuration --bucket etcd-backup --lifecycle-configuration file://lifecycle.json

Example of lifecycle.json which transitions backups to S3 Glacier:

{
  "Rules": [
    {
      "ID": "Move rotated backups to Glacier",
      "Prefix": "etcd-data/",
      "Status": "Enabled",
      "Transitions": [
        {
          "Date": "2015-11-10T00:00:00.000Z",
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "ID": "Move old versions to Glacier",
      "Prefix": "",
      "Status": "Enabled",
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 2,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}

Simplify etcd backup with Velero

Velero is a powerful Kubernetes backup tool that simplifies many operational tasks.
With Velero it's easier to:

  • Choose what to back up (objects, volumes or everything)
  • Choose what NOT to back up (e.g. secrets)
  • Schedule cluster backups
  • Store backups on remote storage
  • Run a fast disaster recovery process

Install and configure Velero

1) Download the latest version from the Velero GitHub page.

2) Create an AWS credentials file:

[default]
aws_access_key_id=<your AWS access key ID>
aws_secret_access_key=<your AWS secret access key>

3) Create an S3 bucket for the backups:

aws s3 mb s3://kubernetes-velero-backup-bucket

4) Install Velero into the Kubernetes cluster:

velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.0.0 --bucket kubernetes-velero-backup-bucket --secret-file ./aws-iam-creds --backup-location-config region=us-east-1 --snapshot-location-config region=us-east-1

Note: we use the S3 plugin to access remote storage. Velero supports many different storage providers; see which one works best for you.

Schedule automated backups

1) Schedule daily backups:

velero schedule create <SCHEDULE NAME> --schedule "0 7 * * *"

2) Create a backup manually:

velero backup create <BACKUP NAME>

Disaster Recovery with Velero

Note: you might need to reinstall Velero in case of a full etcd data loss.

When Velero is up, the disaster recovery process is simple and straightforward:

1) Update your backup storage location to read-only mode:

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'

By default, <STORAGE LOCATION NAME> is expected to be named default; however, the name can be changed by specifying --default-backup-storage-location on the velero server.

2) Create a restore from your most recent Velero backup:

velero restore create --from-backup <SCHEDULE NAME>-<TIMESTAMP>

3) When ready, revert your backup storage location to read-write mode:

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
   --namespace velero \
   --type merge \
   --patch '{"spec":{"accessMode":"ReadWrite"}}'

Conclusions

  • A Kubernetes cluster with infrequent changes to the API server is a good fit for a single control plane setup.
  • Frequent backups of the etcd cluster minimize the time window of potential data loss.

Having fun with Kubernetes deployment

Install deployment

Install nginx 1.12.2 with 2 pods

If you need to have 2 pods from the start, it can be done in three easy steps:

  • Create a deployment template with nginx version 1.12.2
  • Edit nginx.yaml to update the replicas count
  • Apply the deployment template to the Kubernetes cluster
# step 1
kubectl create deployment nginx --save-config=true --image=nginx:1.12.2 --dry-run=client -o yaml > nginx.yaml
# step 2
vi nginx.yaml   # set spec.replicas: 2
# step 3
kubectl apply --record=true -f nginx.yaml

Notice the use of --record=true to save what caused the deployment change.
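For step 2, the part of the generated nginx.yaml to change is the replica count (excerpt; the rest of the generated manifest stays as-is):

# nginx.yaml (excerpt)
spec:
  replicas: 2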

Auto-scaling deployment

Deployments can be scaled manually or automatically. Let's see how it can be done with a few simple commands.

Scaling manually up to 4 pods

kubectl scale deployment nginx --replicas=4 --record=true

Scaling manually down to 2 pods

kubectl scale deployment nginx --replicas=2 --record=true

Automatically scale up and down

Automatically scale up to 4 pods and down to 2 pods based on CPU usage

kubectl autoscale deployment nginx --min=2 --max=4

You can adjust when to scale up/down using the --cpu-percent flag (e.g. --cpu-percent=80).
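The declarative equivalent of the autoscale command is a HorizontalPodAutoscaler manifest; a minimal sketch (the 80% CPU target is an assumed example value):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80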
