🚀 Supercharge Your Python Code Reviews: Automate with GPT and OpenAI Endpoints

A common approach to getting code reviews with ChatGPT is by posting a prompt along with a code snippet. While effective, this process can be time-consuming and repetitive. If you want to streamline, customize, or fully automate your code review process, meet gptscript.

Gptscript is a powerful automation tool designed to build tools of any complexity on top of GPT models.

At the core of gptscript is a script defining a series of steps to execute, along with available tools such as Git operations, file reading, web scraping, and more.

In this guide, we’ll demonstrate how to perform a Python code review using a simple script (python-code-review.gpt).

tools: sys.read

You are an expert Python developer, your task is to review a set of pull requests.
You are given a list of filenames and their partial contents, but note that you might not have the full context of the code.

Only review lines of code which have been changed (added or removed) in the pull request. The code looks similar to the output of a git diff command. Lines which have been removed are prefixed with a minus (-) and lines which have been added are prefixed with a plus (+). Other lines are added to provide context but should be ignored in the review.

Begin your review by evaluating the changed code using a risk score similar to a LOGAF score but measured from 1 to 5, where 1 is the lowest risk to the code base if the code is merged and 5 is the highest risk which would likely break something or be unsafe.

In your feedback, focus on highlighting potential bugs, improving readability if it is a problem, making code cleaner, and maximising the performance of the programming language. Flag any API keys or secrets present in the code in plain text immediately as highest risk. Rate the changes based on SOLID principles if applicable.

Do not comment on breaking functions down into smaller, more manageable functions unless it is a huge problem. Also be aware that there will be libraries and techniques used which you are not familiar with, so do not comment on those unless you are confident that there is a problem.

Use markdown formatting for the feedback details. Also do not include the filename or risk level in the feedback details.

Ensure the feedback details are brief, concise, accurate. If there are multiple similar issues, only comment on the most critical.

Include brief example code snippets in the feedback details for your suggested changes when you're confident your suggestions are improvements. Use the same programming language as the file under review.
If there are multiple improvements you suggest in the feedback details, use an ordered list to indicate the priority of the changes.

Format the response in a valid Markdown format as a list of feedbacks, where the value is an object containing the filename ("fileName"), risk score ("riskScore") and the feedback ("details"). The schema of the Markdown feedback object must be:

## File: filename
Risk: riskScore

Details: details


The content for review is provided as input file.

Testing time

You will need:

  1. gptscript installed.
  2. The prompt from above saved as a python-code-review.gpt file.
  3. A file to review. I am using the following Python code (code.py):

colors = {
    "apple": "red",
    "banana": "yellow",
    "cherry": "red",
    "mango": "red",
    "lemon": "yellow",
    "plum": "purple"
}

common = {}
for k, v in colors.items():
    if v in common:
        common[v] += 1
    else:
        common[v] = 1

common = sorted(common.items(), key=lambda x: x[1], reverse=True)
print(common[0][0])

Run gptscript with the prompt file and the code under review as its two arguments:

gptscript python-code-review.gpt code.py

Output:

## File: code.py
Risk: 2

Details:
1. The sorting of the `common` dictionary could be optimized by using the `collections.Counter` class, which is specifically designed for counting hashable objects. This would make the code more readable and efficient.

```python
from collections import Counter

common = Counter(colors.values())
most_common_color = common.most_common(1)[0][0]
print(most_common_color)
```

2. Consider using more descriptive variable names to improve readability, such as `color_counts` instead of `common`.

The result is in Markdown syntax, which is easy for a human to read.

But if you want to add automation, I would prefer to change the output to JSON or another format that suits your tools.

Let’s refactor the prompt to request JSON output:

Format the response in a valid JSON format as a list of feedbacks, where the value is an object containing the filename ("fileName"),  risk score ("riskScore") and the feedback ("details"). 
The schema of the JSON feedback object must be:
{
  "fileName": {
    "type": "string"
  },
  "riskScore": {
    "type": "number"
  },
  "details": {
    "type": "string"
  }
}


The content for review is provided as input file.

Re-run the script and you will get something like this:

[
  {
    "fileName": "code.py",
    "riskScore": 2,
    "details": "1. Consider using a `defaultdict` from the `collections` module to simplify the counting logic. This will make the code cleaner and more efficient.\n\nExample:\n```python\nfrom collections import defaultdict\n\ncommon = defaultdict(int)\nfor v in colors.values():\n common[v] += 1\n```\n\n2. The sorting and accessing the first element can be improved for readability by using `max` with a key function.\n\nExample:\n```python\nmost_common_color = max(common.items(), key=lambda x: x[1])[0]\nprint(most_common_color)\n```"
  }
]

Remember, LLM output is not deterministic: you can get a different result for the same request.
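With JSON output in hand, the review can be wired into automation. Below is a minimal sketch of a CI gate, assuming you saved the gptscript output to a hypothetical review.json and picked a risk threshold of 3:

import json
import sys

MAX_RISK = 3  # hypothetical threshold; tune for your team

# Load the review produced by gptscript (saved to review.json beforehand)
with open("review.json") as f:
    feedbacks = json.load(f)

for feedback in feedbacks:
    print(f"{feedback['fileName']}: risk {feedback['riskScore']}")
    print(feedback["details"])
    if feedback["riskScore"] >= MAX_RISK:
        sys.exit(1)  # fail the pipeline on high-risk changes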

Now let’s review what actually happens when you run gptscript with the prompt.

What It Does in a Nutshell

  1. Extracts code for review: Captures content from the input file (code.py).
  2. Sets context for the LLM: Instructs the LLM to act as an expert Python developer tasked with providing a detailed and sophisticated code review.
  3. Defines structured output: Returns results in two fields:
    • Risk: A score indicating the potential risk associated with the changes.
    • Details: A comprehensive explanation of the changes and their implications.

Conclusion

Gptscript is a powerful tool to kickstart your journey into automation using OpenAI models. By defining custom scripts, you can streamline complex workflows, such as automated code reviews, with minimal effort.

This example just scratches the surface—there’s much more you can achieve with gptscript. Explore additional examples from the gptscript project to discover more possibilities and enhance your automation capabilities.

Happy automating!

Tech Team Leaders’ Guide to Strategy

Building a tech strategy is a core responsibility of the CTO, VP of Engineering, or Head of Engineering. Involving team leaders in this process ensures a more grounded and effective approach.

Tech team leaders play a crucial role by defining roadmaps for their teams, which, in turn, provide the foundation for an effective high-level strategy. To achieve the best results, continuous collaboration between leadership and team leaders is essential.

Let’s explore how to create a roadmap that is both practical and aligned with the company’s overall vision.

Building an Effective Team Roadmap

A team roadmap is a strategic document that outlines product needs, infrastructure requirements, modernization efforts, and compliance and security considerations, among other critical aspects.

An effective roadmap goes beyond listing high-level initiatives or goals. It expands on each goal using the Diagnosis, Policy, and Actions framework, helping to answer the Why, What, and How of every initiative. This approach fosters trust, alignment, and transparency with top-level leadership.

The Diagnosis, Policy, and Actions framework, developed by Richard Rumelt, consists of:

  • Diagnosis – Defining the problem that needs to be addressed
  • Policy – Establishing guiding principles and constraints for the solution
  • Actions – Defining concrete steps to implement the solution within the given policy

Let’s explore a few examples.

Example 1: Modernizing the Infrastructure

Diagnosis: Our current infrastructure relies on outdated and proprietary components, leading to scalability challenges, high maintenance costs, and slow adoption of new technologies.

Policy: Prioritize open-source and cloud-native solutions for new developments. Maintain legacy systems where necessary but avoid further expansion of proprietary technologies.

Actions:

  1. Identify and replace critical proprietary components with open-source or cloud-native alternatives.
  2. Standardize infrastructure automation and provisioning to improve scalability and maintainability.
  3. Update internal documentation and on-boarding materials to reflect new infrastructure standards.

Example 2: Upgrade the Database

Diagnosis: The current database version has reached end-of-life and is no longer receiving security updates or feature enhancements. An upgrade is necessary to maintain security, stability, and performance.

Policy: The database upgrade must be performed with zero downtime to avoid service disruptions.

Actions:

  1. Test new database version in the QA environment to ensure compatibility
  2. Create a full backup of the existing database.
  3. Implement a Blue-Green deployment strategy to minimize risk during the upgrade.
  4. Communicate the upgrade plan and schedule a rollout window.

Example 3: Improve Cloud Cost Efficiency

Diagnosis: Cloud expenses represent a significant portion of overall costs. Unused or underutilized resources contribute to unnecessary costs.

Policy: Optimize cloud usage by right-sizing instances, using auto-scaling, and enforcing cost-control policies.

Actions:

  1. Conduct an audit of cloud resources to identify inefficiencies.
  2. Implement auto-scaling policies for workloads with variable demand.
  3. Use reserved or spot instances for predictable workloads.
  4. Set up monitoring and alerts for unexpected cost spikes.

Conclusion

  1. Structuring your team’s roadmap using the Diagnosis, Policy, and Actions framework ensures clear prioritization and alignment with the company’s overall strategy.
  2. This approach facilitates productive discussions with top-level leadership, leading to better decision-making.
  3. It improves transparency, trust and accountability across all levels of the organization.

Have you faced challenges when implementing a strategic roadmap? How did you overcome them? Drop a comment below and let’s learn from each other!

Collecting Datadog APM traces using Grafana Alloy and Tempo

The Problem

Hybrid cloud is the future, but monitoring remains stuck in the past. Many organizations embrace hybrid infrastructure, yet struggle with fragmented observability tools. Why? Because monitoring providers still operate in silos.

One of the primary reasons hybrid monitoring isn’t as prevalent is the challenge of instrumentation. Many cloud providers offer their own monitoring solutions. Instrumentation libraries are often incompatible with one another, making cross-platform integration difficult.

The good news? With OpenTelemetry, Grafana, and Datadog, hybrid monitoring is becoming easier and more flexible. 🚀

The Solution

One promising development is the rise of open-source, vendor-neutral instrumentation frameworks like OpenTelemetry.

In essence, open standards are reducing incompatibility issues and allowing “a vendor-agnostic approach to get data from the sources you need to the observability service of your choice.”

A step toward solving the hybrid cloud monitoring challenge came when Grafana introduced the otelcol.receiver.datadog component.

Now, with otelcol.receiver.datadog, Grafana users can ingest and process Datadog telemetry directly within OpenTelemetry pipelines, unlocking several advantages:

  1. Expanding Grafana’s Reach to Datadog Customers
  2. Seamless Integration with OpenTelemetry Pipelines
  3. Avoiding Vendor Lock-in While Retaining Datadog’s Strengths
  4. Cost Optimization by Centralizing Hybrid Monitoring

How does it work together?

Requirements

  1. Grafana – for the web UI
  2. Grafana Alloy – to receive, process, and export telemetry data
  3. Grafana Tempo – to collect and visualize traces from Datadog-instrumented apps
  4. Datadog agent – with the APM feature enabled
  5. An application instrumented with the Datadog trace library

Quick checklist:

  1. You have Grafana, Alloy, and Tempo services running
  2. You have Datadog agents running
  3. You have applications instrumented with the Datadog trace library
  4. You have added Tempo as a datasource in Grafana (see the provisioning sketch below)
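If you provision Grafana datasources from files, adding Tempo can look like the following sketch (the URL is an assumption and depends on how Tempo is exposed in your cluster):

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo-query-frontend.tempo.svc:3100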

Grafana Alloy configuration

The core of the solution is to set up the Datadog receiver in the Alloy config:

otelcol.receiver.datadog "default" {
  endpoint = "0.0.0.0:9126"
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// I am skipping extra steps which you might want to use to pre-process data

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "https://tempo-distributor.example.com:443"
    tls {
      insecure             = false
      insecure_skip_verify = true
    }
  }
}

To avoid a port conflict with the Datadog agent, we chose 9126 as the port for the Alloy Datadog receiver (the agent’s default APM port is 8126).

Running Alloy as a DaemonSet with hostNetwork access allows the receiver to be present on each node:

alloy:
  extraPorts:
    - name: "datadog"
      port: 9126
      targetPort: 9126
      protocol: "TCP"

controller:
  hostNetwork: true
  hostPID: true

service:
  internalTrafficPolicy: "Local"

Application setup

The application has to be configured to send APM traces to the node IP on port 9126.

The node IP can be extracted from Kubernetes meta information and passed as an environment variable:

env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
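The tracer then needs to point at that IP. For a Datadog-instrumented app this is typically done via the standard tracer environment variables; a sketch:

env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: DD_AGENT_HOST
    value: "$(NODE_IP)"  # tracer sends traces to the node-local receiver
  - name: DD_TRACE_AGENT_PORT
    value: "9126"        # the Alloy Datadog receiver port chosen above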

Datadog Agent configuration

With the Datadog Agent we have two options:

  1. Send traces to the Alloy Datadog receiver in addition to the Datadog host
  2. Send traces only to the Alloy Datadog receiver

With option 1 we assume you are using both the Datadog and Grafana solutions in hybrid mode. In that case the Datadog agent has to have the following configuration:

agents:
  containers:
    agent:
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
    traceAgent:
      enabled: true
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'

For option 2, where traces must go only to the Alloy Datadog receiver, the following configuration might work:

datadog:
  apm:
    socketEnabled: true   # Use the Unix Domain Socket (default). Can be true even if the port is enabled.
    portEnabled: true     # Enable TCP port 8126 for traces.
    useLocalService: false

  env:
    - name: DD_APM_DD_URL
      value: "http://alloy-service:9126"  # URL of the OTel collector’s Datadog trace receiver
    - name: DD_APM_NON_LOCAL_TRAFFIC
      value: "true"

After everything is done and the agents are restarted, you can navigate to the Explore page in Grafana, select Tempo as the datasource, and get recent Datadog traces.

Conclusions

  1. Kudos to Grafana and Datadog – Their collaboration on the otelcol.receiver.datadog makes transitioning between monitoring platforms smoother than ever, reducing friction for hybrid observability.
  2. Hybrid Monitoring is the New Normal – Applications no longer rely on a single monitoring provider throughout their lifespan. As infrastructure evolves, businesses will inevitably switch or integrate multiple observability tools.
  3. Stay Agile with Open Standards – Using OpenTelemetry ensures flexibility, allowing teams to adapt their monitoring stack without vendor lock-in, keeping observability seamless across hybrid and multi-cloud environments.

By embracing open standards, organizations can future-proof their monitoring strategies while ensuring complete visibility across their hybrid infrastructure. 🚀

Cost saving strategy for Kubernetes platform

Working on a cost saving strategy involves looking at the problem from very different dimensions.
Overall Kubernetes costs can be split into compute, networking, storage, licensing, and SaaS costs.
In this part I will cover: right-sizing infrastructure and using autoscaling.

Right-size infrastructure & use autoscaling

Kubernetes Nodes autoscaler

In a cloud environment the Kubernetes Nodes autoscaler plays an important role in delivering just enough resources for your cluster. In a nutshell:

  • it adds new nodes according to demand – scaling up
  • it consolidates underutilized nodes – scaling down

It directly depends on Pod resource requests, so choosing the right requests is the key component of a cost-effective strategy; see the sketch below.
For self-hosted solutions you might want to look into projects like Karpenter.
An alternative scaling strategy is to add or remove nodes on a schedule for a specific time plan. Using that approach you can simply specify the time of day when you need to scale up and the time when you scale down.
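For illustration, explicit resource requests on a container might look like this (the numbers are hypothetical; base yours on observed usage):

resources:
  requests:
    cpu: 250m       # reserved by the scheduler; drives autoscaler decisions
    memory: 256Mi
  limits:
    memory: 512Mi   # cap memory to protect neighbors on the node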


Multi provider DNS management with Terraform and Pulumi

The Problem

Every DNS provider is very specific about how it creates DNS records. Using Terraform or Pulumi doesn’t guarantee multi-provider support out of the box.

One example: AWS Route53 supports multiple IP values bound to the same name record, whereas Cloudflare must have a dedicated record for each IP.

These API differences make it harder to write code which works for multiple providers.

For AWS Route53 a single record can be created like this:

mydomain.local: IP1, IP2, IP3

For Cloudflare it would be 3 different records:

mydomain.local: IP1
mydomain.local: IP2
mydomain.local: IP3

Solution 1: Use the flexibility of a programming language available with Pulumi

Pulumi has the upper hand here, since you can use the power of a programming language to handle custom logic.

DNS data structure:

mydomain1.com: 
 - IP1
 - IP2 
 - IP3
mydomain2.com:
 - IP4
 - IP5 
 - IP6
mydomain3.com: 
 - IP7
 - IP8 
 - IP9

Using Python or JavaScript we can expand this structure for the Cloudflare provider or keep it as is for AWS Route53.

In the Cloudflare case we will create a new record for each IP:

import pulumi
import pulumi_cloudflare as cloudflare
import yaml

# Load the configuration from a YAML file
yaml_file = "dns_records.yaml"
with open(yaml_file, "r") as file:
    dns_config = yaml.safe_load(file)

# Cloudflare Zone ID (Replace with your actual Cloudflare Zone ID)
zone_id = "your_cloudflare_zone_id"

# Iterate through domains and their associated IPs to create A records
for domain, ips in dns_config.items():
    if isinstance(ips, list):  # Ensure it's a list of IPs
        for ip in ips:
            record_name = domain
            cloudflare.Record(
                f"{record_name}-{ip.replace('.', '-')}",
                zone_id=zone_id,
                name=record_name,
                type="A",
                value=ip,
                ttl=3600,  # Set TTL (adjust as needed)
            )

# Export the created records
pulumi.export("dns_records", dns_config)

And since AWS Route53 supports a list of IPs, the code would look like this (reusing the same dns_config loaded above):

import pulumi_aws as aws

hosted_zone_id = "your_route53_zone_id"  # Replace with your actual hosted zone ID

for domain, ips in dns_config.items():
    if isinstance(ips, list) and ips:  # Ensure it's a non-empty list of IPs
        aws.route53.Record(
            f"{domain}-record",
            zone_id=hosted_zone_id,
            name=domain,
            type="A",
            ttl=300,  # Set TTL (adjust as needed)
            records=ips,  # AWS Route 53 supports multiple IPs in a single record
        )

Solution 2: Using Terraform’s for_each loop

It’s quite possible to achieve the same using Terraform, starting with version 0.12 which introduced rich expressions and, shortly after, the for_each meta-argument for resources.

Same data structure:

mydomain1.com: 
  - 192.168.1.1
  - 192.168.1.2
  - 192.168.1.3
mydomain2.com:
  - 10.0.0.1
  - 10.0.0.2
  - 10.0.0.3
mydomain3.com: 
  - 172.16.0.1
  - 172.16.0.2
  - 172.16.0.3

Terraform example for AWS Route53

provider "aws" {
  region = "us-east-1"  # Change this to your preferred region
}

variable "hosted_zone_id" {
  type = string
}

variable "dns_records" {
  type = map(list(string))
}

resource "aws_route53_record" "dns_records" {
  for_each = var.dns_records

  zone_id = var.hosted_zone_id
  name    = each.key
  type    = "A"
  ttl     = 300
  records = each.value
}
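The YAML data shown earlier maps directly onto the dns_records variable; for example, a terraform.tfvars sketch:

dns_records = {
  "mydomain1.com" = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
  "mydomain2.com" = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
  "mydomain3.com" = ["172.16.0.1", "172.16.0.2", "172.16.0.3"]
}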

Quite simple using a for_each loop, but it will not work with Cloudflare because of the compatibility issue mentioned above. So, we need a new record for each IP.

Terraform example for Cloudflare

# Flatten the map into one (domain, ip) pair per record
locals {
  dns_record_pairs = flatten([
    for domain, ips in var.dns_records : [
      for ip in ips : { domain = domain, ip = ip }
    ]
  ])
}

# Create multiple records for each domain, one per IP
resource "cloudflare_record" "dns_records" {
  for_each = { for pair in local.dns_record_pairs : "${pair.domain}-${pair.ip}" => pair }

  zone_id = var.cloudflare_zone_id
  name    = each.value.domain
  type    = "A"
  value   = each.value.ip
  ttl     = 3600
  proxied = false # Set to true if using the Cloudflare proxy
}

Conclusions

  1. Pulumi: Flexible and easy to start with. Data is separate from code, making it easy to add providers or change logic.
  2. Terraform: Less complex and easier to support long-term, but depends on the data format.
  3. Both solutions require programming skills or expertise in the Terraform language.

Grafana a new look

As much as I like managed monitoring solutions, it’s hard not to be excited about setting up your own platform based on the Prometheus and Grafana stack and saving some bucks for the company.

After 5 years of using proprietary solutions I am again on the way into the Grafana world.

Here I want to wrap up what I have learnt so far about Grafana.

Big ecosystem of tools

The biggest change for me is the ecosystem of projects which supports almost all imaginable monitoring needs.

  • For Logs: Loki and Promtail
  • For Traces: Tempo
  • For Profiling: Pyroscope
  • For OpenTelemetry collection: Alloy
  • For eBPF: Beyla
  • For Synthetics: Prometheus Blackbox and the K6 API
  • and many more

Automated Deployment process

A lot was done in the deployment process, where using Helm charts you can get it up and running in minutes.

All that comes with scalability and high availability in mind.

Though you would need to connect and customize each component, it can still be just a Helm configuration change.

Huge ecosystem of Components to simplify instrumentation

It’s clear your code has to be instrumented to be observable. Grafana and its open source community developed a number of tools to make it easy and simple:

  1. Compatible receivers for big providers like Datadog
  2. New discovery methods like eBPF
  3. No-code ways of instrumentation like a sidecar container

And a lot more; see the Alloy components section to get started.

New fresh UI

Grafana Logs and Metrics got a new look.

You can see multiple logs and metrics on the same page.

Summary

It’s cool, it’s fresh and it’s a lot of fun to use!

Can ArgoCD replace Helm managers?


ArgoCD is a tool which provides a GitOps way of deployment. It supports various formats of applications, including Helm charts.

Helm is one of the most popular packaging formats for Kubernetes applications. It gave rise to various tools to manage Helm charts, like helmfile, helmify, and others.

Why is ArgoCD replacing Helm managers?

For a long time Helm managers helped to deploy Helm charts via different deploy strategies. They served well, but the big disadvantage was code drift which accumulated over time.

To avoid code drift, one solution was to schedule a periodic job which applies the latest changes to the Helm charts. Sometimes it works, but most often if one of the charts hits an install issue, all other charts stop being updated as well.
So, it wasn’t an ideal way to handle it.

In that situation ArgoCD comes to the rescue. ArgoCD allows controlling the update of each chart as a dedicated process, where an issue in one chart will not block updates of other charts. Helm charts are first-class citizens in ArgoCD, and most of their features are supported.

How does it look in ArgoCD?

To try it out, let’s start with a simple Application which installs the Prometheus Helm chart with a custom configuration.

Prometheus helm chart can be found at https://prometheus-community.github.io/helm-charts

Custom configuration will be stored in our internal project https://git.example.com/org/value-files.git

Thanks to multi source support in Application we can use official chart and custom configuration together like in the following example:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  sources:
  - repoURL: 'https://prometheus-community.github.io/helm-charts'
    chart: prometheus
    targetRevision: 15.7.1
    helm:
      valueFiles:
      - $values/charts/prometheus/values.yaml
  - repoURL: 'https://git.example.com/org/value-files.git'
    targetRevision: dev
    ref: values

Where `$values` is a reference to the root folder of the values-files repository. That values file overrides the default Prometheus settings, so we can configure it for our needs.

What’s next?

There are several distinct features which Helm managers support, so your next step is to check whether ArgoCD covers all your needs.

One of the notable features is secrets support. Many managers support the SOPS format of secrets, whereas ArgoCD doesn’t give you any solution, so it’s up to you how to manage secrets.

Another important feature is order of execution, which can be an important part of your Helm manager setup. ArgoCD doesn’t have a built-in replacement, so you have to rely on the Helm format to support dependencies. One way is to build umbrella charts for complex applications, as sketched below.
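For example, a hypothetical umbrella chart can pin its sub-charts as dependencies in Chart.yaml, so they install together in a Helm-managed order:

apiVersion: v2
name: platform-stack
version: 0.1.0
dependencies:
  - name: prometheus
    version: 15.7.1
    repository: https://prometheus-community.github.io/helm-charts
  - name: custom-exporter
    version: 0.2.0
    repository: https://charts.example.com  # hypothetical internal repo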

The most important change is the GitOps style of deployment. ArgoCD doesn’t run a CI/CD pipeline, so you don’t have a feature like helm diff to preview changes before applying. It will apply changes as soon as they become available in the linked git repository.

Conclusions

  1. ArgoCD can replace Helm managers, but it strongly depends on your project needs.
  2. ArgoCD introduces new challenges like secrets management and order of execution.
  3. It introduces a GitOps deployment style and replaces the usual CI/CD pipeline, so new “quality gates” need to be built to be ready for a production environment.

Path to Staff Engineer role


Someday in your career you take time to think: what’s next for me? What is my next challenge to grow on the career ladder? If your answer is Staff Engineer, Tech Lead, Team Lead or Principal Engineer, then you are on the way to a Staff-plus Engineer role.

As a Staff-plus Engineer myself, I want to share a few tips with you.

Start working on your Staff Engineer package

  1. What’s your Staff-level project?
    • What did you do and what was the impact?
    • What behaviors did you demonstrate, and how complex were the projects?
  2. Link your Design Documents, RFCs and Proposals to support your package with your design and architecture contributions.
  3. Can you quantify the impact of your projects?
    • Did it help to increase revenue?
    • Save costs?
  4. What glue work did you do for the organization?
    • What’s the impact of the glue work?
  5. Who have you mentored and through what accomplishments?

Sharpen your soft skills

  1. Communication
    • Keep people informed and connected.
  2. Negotiation
    • Be ready to resolve difficult situations.
  3. Presentation
    • Know your audience. Learn one or two presentation frameworks.
  4. Networking
    • Don’t be ashamed to get in touch with other Staff-plus engineers in your company.
    • They are the best people to help you navigate the role inside the company.

Learn Staff Engineer tools

  1. Leadership
    • Become a problem solver. Be a visionary in your area. Stay up to date with technologies in your area.
  2. Planning and goal orientation
    • Have a vision of what you will do next in one or another area.
    • Take an active part in planning meetings.
  3. Collaboration and contribution
    • Propose new decision docs, proposals and RFCs. Make sure they are reviewed and discussed.
    • Implement a POC to demo the idea.
  4. Teamwork and mentorship
    • Help your peers. Be a problem solver. Be visible. Become a go-to person.
    • Try on a mentorship role.

Summary

Getting a Staff Engineer role could take months or years to accomplish. However, it’s crucial to view this milestone not as the ultimate destination, but rather as a guiding roadmap for your career development. It’s not the title itself that holds the most significance, but rather the daily challenges we face and the positive impact we make as we progress in our journey.

Resources

  1. About glue work.
  2. Staff Engineer book

Certified Kubernetes Administrator

The CKA exam was a real challenge and a great experience overall.

My first impression: now I am officially part of the Kubernetes world and the CNCF.

But I just love Kubernetes and its ecosystem of open source projects, and I had a lot of fun during my preparation and the CKA exam. When you combine work with fun you can achieve really great results.

My advice for all future CKA participants is to get 80% of your preparation from real Kubernetes practice; the other 20% you can learn along the way.

From the practical side, I used Minikube to tackle Deployments and other Kubernetes API related tasks. For cluster administration experience I mostly used Vagrant boxes and Google Cloud VMs to get multi-node setups and network capabilities.

My last advice is to build your Kubernetes knowledge like a layered system. Start with the foundation and build it block by block. Finish with a production-ready Kubernetes cluster and your exam will become an easy walk in the park.

Have fun!

Implementing RED method with Istio

In this article I describe how to quickly get started with the SLO and SLI practice using Istio. If you are new to SLO and SLI, read a Brief Summary of SRE best practices first.

RED: Rate, Errors, Duration

The RED method is an easy-to-understand monitoring methodology. For every service, monitor:

Rate – how many operations per second are placed on a service
Errors – what percentage of the traffic returns in an error state
Duration – the time it takes to serve a request or process a job

Rate, Errors, and Duration are good Service Level Indicators to start with, because an undesirable change in any of them directly impacts your users.
Examples of Service Level Objectives for RED indicators:

  • The number of requests is above 1000 operations per minute, or the number of requests doesn’t drop by more than 10% over a 10-minute period
  • The error rate is below 0.5%
  • Request latency for 99% of operations is less than 100ms

For a microservices platform, RED metrics are perfect to start with. But how do you collect and visualize them for a big number of services?

Enter Istio.

Deploying Kubernetes, Istio and demo Apps

Tools for a demo

  • minikube – a local Kubernetes cluster to test our setup (optional if you have a demo cluster)
  • skaffold – an application for building and deploying the demo microservices
  • istioctl – the Istio command line tool
  • Online Boutique – a cloud-native microservices demo application

Kubernetes and Istio

For our demo we need the following cluster spec and the default Istio setup:

minikube start --cpus=4 --memory 4096 --disk-size 32g
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.6.4
bin/istioctl install --set profile=demo
kubectl label namespace default istio-injection=enabled

Demo applications

For the test applications I am using Online Boutique, which is a cloud-native microservices demo from the Google Cloud Platform team.

git clone https://github.com/GoogleCloudPlatform/microservices-demo.git
cd microservices-demo
skaffold run # this will build and deploy the applications; takes about 20 minutes

Verify that applications are installed:

kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
adservice-85cb97f6c8-fkbf2               2/2     Running   0          96s
cartservice-6f955b89f4-tb6vg             2/2     Running   2          96s
checkoutservice-5856dcfdd5-s7phn         2/2     Running   0          96s
currencyservice-9c888cdbc-4lxhs          2/2     Running   0          96s
emailservice-6bb8bbc6f7-m5ldg            2/2     Running   0          96s
frontend-68646cffc4-n4jqj                2/2     Running   0          96s
loadgenerator-5f86f94b89-xc7hf           2/2     Running   3          95s
paymentservice-56ddc9454b-lrpsm          2/2     Running   0          95s
productcatalogservice-5dd6f89b89-bmnr4   2/2     Running   0          95s
recommendationservice-868bc84d65-cgj2j   2/2     Running   0          95s
redis-cart-b55b4cf66-t29mk               2/2     Running   0          95s
shippingservice-cd4c57b99-r8bl7          2/2     Running   0          95s

Notice the 2/2 ready containers. One of the containers is the Istio sidecar proxy, which does all the work of collecting RED metrics.
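Under the hood the sidecars expose Istio’s standard Prometheus metrics, so RED queries can be built on top of them. A rough sketch in PromQL, assuming the default Istio metric names:

# Rate: requests per second, per destination service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)

# Errors: share of 5xx responses
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service)
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)

# Duration: p99 request latency in milliseconds
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service))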

RED metrics Visualization

Set up a proxy to the Grafana dashboard:

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=grafana -o jsonpath='{.items[0].metadata.name}') 3000:3000 &

Open the Grafana Istio Service Mesh dashboard for a global look at the RED metrics.

A detailed board with RED metrics is also available for each individual service.

Conclusions

Istio provides Rate, Errors and Duration metrics out of the box, which is a big leap toward the SLO and SLI practice for all services.

Incorporating Istio in your platform is a big step toward observability, security and control of your service mesh.