Solving Top-K problem – Leaderboard in Python

What is the Top-K problem?

Think about a real-time gaming leaderboard with a rank system. Determining the rank of a player is the problem we are trying to solve.
This is an implementation of the real-time leaderboard solution from the System Design Interview book.
Check out the full source code in the Leaderboard – solving Top K Problem with Python GitHub project.

Leaderboard requirements

  1. Get the top 10 ranked players
  2. Show the current rank for each player
  3. Show a slice of players around a specific rank

Solution 1: using SQL database

The direct approach is to use a SQL database and compute the rank in real time by counting how many players have a score greater than or equal to the given player's score. This solution falls short as soon as we need to scale to a high number of players.

Solution 2: using Valkey/Redis sorted set

A Valkey sorted set is a data type that keeps all records ordered by a "score". It was made for exactly this kind of problem, and it's what we are going to use.

Implementation

The Top-K problem (Leaderboard) project is built using FastAPI, SQLModel and Valkey.

API Design

The API design was borrowed from the System Design Interview book. According to the requirements, it covers CRUD operations and a get-rank operation.

1. Create Score
POST /scores
{
"name": "AwesomeGamer"
}

2. Update Score
PUT /scores
{
"id": 1,
"points": 10
}

3. Get TOP Scores
GET /scores

4. Get user rank
GET /scores/{id}

Players data model and API: SQLModel + FastAPI

I am using a SQLite table to store the user profile and the current score of a player. Several SQLModel classes help to customize the API endpoints.

class ScoreBase(SQLModel):
    name: str | None = Field(default=None)

class Score(ScoreBase, table=True):
    id: int | None = Field(default=None, primary_key=True)
    points: int = Field(default=0)
    rank: int = Field(default=0)

class ScoreCreate(ScoreBase):
    pass

class ScorePublic(ScoreBase):
    id: int
    points: int
    name: str
    rank: int

And the API endpoints using FastAPI parameters and response models:

@app.post("/scores", response_model=ScorePublic)
def create_score(data: ScoreCreate, session: SessionDep):

@app.put("/scores", response_model=ScorePublic)
def update_scores(id: int, points: int, session: SessionDep):

@app.get("/scores", response_model=list[ScorePublic])
def get_top_scores(session: SessionDep, 
                   offset: int = 0,
                   limit: Annotated[int, Query(le=10)] = 10):

@app.get("/scores/{id}", response_model=ScorePublic)
def get_score(id: int, session: SessionDep):

Both libraries provide a very flexible and powerful way to design APIs.
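
As an illustration, here is a minimal sketch of what the create_score handler could look like. This is not the project's exact code; it assumes a recent SQLModel release and the imports, engine setup and SessionDep dependency from the FastAPI/SQLModel documentation:

@app.post("/scores", response_model=ScorePublic)
def create_score(data: ScoreCreate, session: SessionDep):
    # Validate the request body against the table model and persist it.
    score = Score.model_validate(data)
    session.add(score)
    session.commit()
    session.refresh(score)  # load the generated primary key
    return score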

Rank solution based on Valkey sorted set

From the Valkey documentation:

A Sorted Set is a collection of unique strings (members) ordered by an associated score. When more than one string has the same score, the strings are ordered lexicographically. Some use cases for sorted sets include:

  • Leaderboards. For example, you can use sorted sets to easily maintain ordered lists of the highest scores in a massive online game.
  • Rate limiters. In particular, you can use a sorted set to build a sliding-window rate limiter to prevent excessive API requests.

ZREVRANGE helps to get the top 10 players for our leaderboard, sorted in descending order by score.

ZREVRANGE leaderboard 0 9

ZRANK helps to get a specific player's rank.

ZRANK leaderboard id

Finally, using ZRANK followed by ZREVRANGE we can also provide a list of players who are near a specific player in the ranking.

ZRANK leaderboard 77
# rank 255
ZREVRANGE leaderboard 250 254
# 5 players before
ZREVRANGE leaderboard 256 260
# 5 players after
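
For reference, the same operations can be issued from Python. This is a minimal sketch using the redis-py client (the valkey-py client exposes the same commands); the key name leaderboard and the player id are just example values, and it uses ZREVRANK so the "players around a rank" slice lines up with ZREVRANGE indices:

import redis  # the valkey-py client offers the same command methods

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Record a player's points (the member is the player id as a string).
r.zadd("leaderboard", {"77": 150})

# Top 10 players, highest score first (ZREVRANGE leaderboard 0 9).
top10 = r.zrevrange("leaderboard", 0, 9, withscores=True)

# The player's position counted from the highest score (ZREVRANK).
rank = r.zrevrank("leaderboard", "77")

# A slice of players around that position.
window = r.zrevrange("leaderboard", max(rank - 5, 0), rank + 5, withscores=True)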

Conclusions

  1. FastAPI and SQLModel provide great interfaces for designing APIs with nice domain model integration
  2. A Valkey sorted set is a built-in solution for the Top-K problem
  3. It’s not Python-native, but if you are looking for a more LeetCode-style solution in Python, check out this awesome video

LLM-Powered Unit Testing: Effortlessly Generate Domain Model Data

Traditional approaches to generating model data for unit tests often involve setting up complex fixtures or writing cumbersome mock classes. This becomes especially challenging when your domain model is tightly coupled with your data model.

class Order(Base):
    # ...
    product_name = Column(String, index=True)
    quantity = Column(Integer)

In more advanced scenarios, your Domain model is decoupled from the Data model, allowing for cleaner architecture and improved maintainability.

class Order(BaseModel):
    product_name: str
    quantity: int


class TestOrder(unittest.TestCase):
    def test_order(self):
        order = Order(product_name="Milk", quantity=2)
        self.assertEqual(order.quantity, 2)

Using hard-coded data can be a sub-optimal approach. Every modification to the Domain model requires updates to the test suite.

LLM-generated data comes to the rescue. By leveraging models like GPT-4o, you can dynamically generate test data that adheres to your Domain model’s schema.

In the following example, we’ll use GPT-4o to generate data for an Electronics Domain model class, ensuring the structure and content align with the model’s expected schema.

# Imports are skipped to demo the core of the solution

class Electronics(BaseModel):
    """ Domain model for Electronics """
    model: str
    price: float = Field(gt=0, description="Price must be greater than 0")

@llm.call(provider="openai", model="gpt-4o", response_model=List[Electronics])
def list_electronics(limit: int) -> List[Electronics]:
    """ Generate N items of products defined as Electronics """
    return f"Generate {limit} number of electronic products"

And the unit tests:

class TestElectronics(unittest.TestCase):
    """ Test class using LLM to generate domain data """

    @classmethod
    def setUpClass(cls):
        """ Generating test data once """
        cls._products = list_electronics(5)

    def test_unique_models(self):
        """ Test product models are unique """
        models = [product.model for product in self._products]
        self.assertEqual(len(models), len(set(models)))

Running the LLM can take a few seconds, so to optimize performance, I moved the data generation process to the setUpClass method.

Big kudos to mirascope, llm, smartfunc and other libraries for making LLMs much more accessible to everyday applications!

Feel free to try it yourself; all source code is available in the GitHub project.

If you are interested in command-line LLM integration, check out How to automate review with gptscript.

Conclusions:

  1. LLMs have become much more accessible in day-to-day programming.
  2. Unit test data generation is just one example, but it demonstrates how easy it is to embed LLMs in real-life scenarios.
  3. It is still expensive and time-consuming to call an LLM on every unit test invocation, so a smarter approach is to generate and cache the test data, as sketched below.
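
A minimal sketch of that caching idea, reusing the Electronics model and list_electronics function from above; the cache file name and helper are hypothetical:

import json
from pathlib import Path

CACHE_FILE = Path("electronics_fixture.json")  # hypothetical cache location

def load_or_generate_products(limit: int = 5) -> list[Electronics]:
    """ Reuse previously generated LLM data if a cache file exists """
    if CACHE_FILE.exists():
        items = json.loads(CACHE_FILE.read_text())
        return [Electronics(**item) for item in items]
    products = list_electronics(limit)
    CACHE_FILE.write_text(json.dumps([p.model_dump() for p in products]))
    return products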

🚀 Supercharge Your Python Code Reviews: Automate with GPT and OpenAPI Endpoints

A common approach to getting code reviews with ChatGPT is by posting a prompt along with a code snippet. While effective, this process can be time-consuming and repetitive. If you want to streamline, customize, or fully automate your code review process, meet gptscript.

Gptscript is a powerful automation tool designed to build tools of any complexity on top of GPT models.

At the core of gptscript is a script defining a series of steps to execute, along with available tools such as Git operations, file reading, web scraping, and more.

In this guide, we’ll demonstrate how to perform a Python code review using a simple script (python-code-review.gpt).

tools: sys.read

You are an expert Python developer, your task is to review a set of pull requests.
You are given a list of filenames and their partial contents, but note that you might not have the full context of the code.

Only review lines of code which have been changed (added or removed) in the pull request. The code looks similar to the output of a git diff command. Lines which have been removed are prefixed with a minus (-) and lines which have been added are prefixed with a plus (+). Other lines are added to provide context but should be ignored in the review.

Begin your review by evaluating the changed code using a risk score similar to a LOGAF score but measured from 1 to 5, where 1 is the lowest risk to the code base if the code is merged and 5 is the highest risk which would likely break something or be unsafe.

In your feedback, focus on highlighting potential bugs, improving readability if it is a problem, making code cleaner, and maximising the performance of the programming language. Flag any API keys or secrets present in the code in plain text immediately as highest risk. Rate the changes based on SOLID principles if applicable.

Do not comment on breaking functions down into smaller, more manageable functions unless it is a huge problem. Also be aware that there will be libraries and techniques used which you are not familiar with, so do not comment on those unless you are confident that there is a problem.

Use markdown formatting for the feedback details. Also do not include the filename or risk level in the feedback details.

Ensure the feedback details are brief, concise, accurate. If there are multiple similar issues, only comment on the most critical.

Include brief example code snippets in the feedback details for your suggested changes when you're confident your suggestions are improvements. Use the same programming language as the file under review.
If there are multiple improvements you suggest in the feedback details, use an ordered list to indicate the priority of the changes.

Format the response in a valid Markdown format as a list of feedbacks, where the value is an object containing the filename ("fileName"), risk score ("riskScore") and the feedback ("details"). The schema of the Markdown feedback object must be:

## File: filename
Risk: riskScore

Details: details


The content for review is provided as input file.

Testing time

You will need:

  1. gptscript installed.
  2. The prompt above saved as a python-code-review.gpt file.
  3. A file to review. I am using the following Python code (code.py):

colors = {
    "apple": "red",
    "banana": "yellow",
    "cherry": "red",
    "mango": "red",
    "lemon": "yellow",
    "plum": "purple"
}

common = {}
for k, v in colors.items():
    if v in common:
        common[v] += 1
    else:
        common[v] = 1

common = sorted(common.items(), key=lambda x: x[1], reverse=True)
print(common[0][0])

Run gptscript with the prompt file and the code file as its first two arguments:

gptscript python-code-review.gpt code.py

Output:

## File: code.py
Risk: 2

Details:
1. The sorting of the `common` dictionary could be optimized by using the `collections.Counter` class, which is specifically designed for counting hashable objects. This would make the code more readable and efficient.

```python
from collections import Counter

common = Counter(colors.values())
most_common_color = common.most_common(1)[0][0]
print(most_common_color)
```

2. Consider using more descriptive variable names to improve readability, such as `color_counts` instead of `common`.

The result is in Markdown syntax, which is easy for a human to read.

But if you want to add automation, I would prefer to change the output to JSON or another format that suits your tools.

Let’s refactor the prompt to request JSON output:

Format the response in a valid JSON format as a list of feedbacks, where the value is an object containing the filename ("fileName"),  risk score ("riskScore") and the feedback ("details"). 
The schema of the JSON feedback object must be:
{
  "fileName": {
    "type": "string"
  },
  "riskScore": {
    "type": "number"
  },
  "details": {
    "type": "string"
  }
}


The content for review is provided as input file.

Re-run the script and you will get something like this:

[
  {
    "fileName": "code.py",
    "riskScore": 2,
    "details": "1. Consider using a `defaultdict` from the `collections` module to simplify the counting logic. This will make the code cleaner and more efficient.\n\nExample:\n```python\nfrom collections import defaultdict\n\ncommon = defaultdict(int)\nfor v in colors.values():\n common[v] += 1\n```\n\n2. The sorting and accessing the first element can be improved for readability by using `max` with a key function.\n\nExample:\n```python\nmost_common_color = max(common.items(), key=lambda x: x[1])[0]\nprint(most_common_color)\n```"
  }
]

Remember, LLM output is not deterministic; you can get a different result for the same request.
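
From here it is easy to wire the review into CI: a small wrapper can parse the JSON and block a merge above a chosen risk threshold. A hedged sketch, assuming the script prints only the JSON array to stdout and using a hypothetical threshold:

import json
import subprocess
import sys

RISK_THRESHOLD = 4  # hypothetical cut-off for blocking a merge

# Run the review and capture its JSON output.
result = subprocess.run(
    ["gptscript", "python-code-review.gpt", "code.py"],
    capture_output=True, text=True, check=True,
)
feedbacks = json.loads(result.stdout)

for feedback in feedbacks:
    print(f"{feedback['fileName']} (risk {feedback['riskScore']}):")
    print(feedback["details"])

if any(f["riskScore"] >= RISK_THRESHOLD for f in feedbacks):
    sys.exit("Risk score too high, blocking the merge.")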

Now let’s review what actually happens when you run gptscript with the prompt.

What It Does in a Nutshell

  1. Extracts code for review: Captures content from the input file (code.py).
  2. Sets context for the LLM: Instructs the LLM to act as an expert Python developer tasked with providing a detailed and sophisticated code review.
  3. Defines Structured Output: Returns results in two fields:
    Risk: A score indicating the potential risk associated with the changes.
    Details: A comprehensive explanation of the changes and their implications.

Conclusion

Gptscript is a powerful tool to kickstart your journey into automation using OpenAI models. By defining custom scripts, you can streamline complex workflows, such as automated code reviews, with minimal effort.

This example just scratches the surface—there’s much more you can achieve with gptscript. Explore additional examples from the gptscript project to discover more possibilities and enhance your automation capabilities.

Happy automating!

Tech Team Leaders’ Guide to Strategy

Building a tech strategy is a core responsibility of the CTO, VP of Engineering, or Head of Engineering. Involving team leaders in this process ensures a more grounded and effective approach.

Tech team leaders play a crucial role by defining roadmaps for their teams, which, in turn, provide the foundation for an effective high-level strategy. To achieve the best results, continuous collaboration between leadership and team leaders is essential.

Let’s explore how to create a roadmap that is both practical and aligned with the company’s overall vision.

Building an Effective Team Roadmap

A team roadmap is a strategic document that outlines product needs, infrastructure requirements, modernization efforts, and compliance and security considerations, among other critical aspects.

An effective roadmap goes beyond listing high-level initiatives or goals. It expands on each goal using the Diagnosis, Policy, and Actions framework, helping to answer the Why, What, and How of every initiative. This approach fosters trust, alignment, and transparency with top-level leadership.

The Diagnosis, Policy, and Actions framework, developed by Richard Rumelt, consists of:

  • Diagnosis – Defining the problem that needs to be addressed
  • Policy – Establishing guiding principles and constraints for the solution
  • Actions – Defining concrete steps to implement the solution within the given policy

Let’s explore a few examples.

Example 1: Modernizing the Infrastructure

Diagnosis: Our current infrastructure relies on outdated and proprietary components, leading to scalability challenges, high maintenance costs, and slow adoption of new technologies.

Policy: Prioritize open-source and cloud-native solutions for new developments. Maintain legacy systems where necessary but avoid further expansion of proprietary technologies.

Actions:

  1. Identify and replace critical proprietary components with open-source or cloud-native alternatives.
  2. Standardize infrastructure automation and provisioning to improve scalability and maintainability.
  3. Update internal documentation and on-boarding materials to reflect new infrastructure standards.

Example 2: Upgrade the Database

Diagnosis: The current database version has reached end-of-life and is no longer receiving security updates or feature enhancements. An upgrade is necessary to maintain security, stability, and performance.

Policy: The database upgrade must be performed with zero downtime to avoid service disruptions.

Actions:

  1. Test new database version in the QA environment to ensure compatibility
  2. Create a full backup of the existing database.
  3. Implement a Blue-Green deployment strategy to minimize risk during the upgrade.
  4. Communicate the upgrade plan and schedule a rollout window.

Example 3: Improve Cloud Cost Efficiency

Diagnosis: Cloud expenses represent a significant portion of overall costs. Unused or underutilized resources contribute to unnecessary costs.

Policy: Optimize cloud usage by right-sizing instances, using auto-scaling, and enforcing cost-control policies.

Actions:

  1. Conduct an audit of cloud resources to identify inefficiencies.
  2. Implement auto-scaling policies for workloads with variable demand.
  3. Use reserved or spot instances for predictable workloads.
  4. Set up monitoring and alerts for unexpected cost spikes.

Conclusion

  1. Structuring your team’s roadmap using the Diagnosis, Policy, and Actions framework ensures clear prioritization and alignment with the company’s overall strategy.
  2. This approach facilitates productive discussions with top-level leadership, leading to better decision-making.
  3. It improves transparency, trust and accountability across all levels of the organization.

Have you faced challenges when implementing a strategic roadmap? How did you overcome them? Drop a comment below and let’s learn from each other!

Collecting Datadog APM traces using Grafana Alloy and Tempo

The Problem

Hybrid cloud is the future, but monitoring remains stuck in the past. Many organizations embrace hybrid infrastructure, yet struggle with fragmented observability tools. Why? Because monitoring providers still operate in silos.

One of the primary reasons hybrid monitoring isn’t as prevalent is the challenge of instrumentation. Many cloud providers offer their own monitoring solutions. Instrumentation libraries are often incompatible with one another, making cross-platform integration difficult.

The good news? With OpenTelemetry, Grafana, and Datadog, hybrid monitoring is becoming easier and more flexible. 🚀

The Solution

One promising development is the rise of open-source, vendor-neutral instrumentation frameworks like OpenTelemetry.

In essence, open standards are reducing incompatibility issues and allowing “a vendor-agnostic approach to get data from the sources you need to the observability service of your choice.”

A step toward solving the hybrid cloud monitoring challenge came when Grafana introduced the otelcol.receiver.datadog component.

Now, with otelcol.receiver.datadog, Grafana users can ingest and process Datadog telemetry directly within OpenTelemetry pipelines, unlocking several advantages:

  1. Expanding Grafana’s Reach to Datadog Customers
  2. Seamless Integration with OpenTelemetry Pipelines
  3. Avoiding Vendor Lock-in While Retaining Datadog’s Strengths
  4. Cost Optimization by Centralizing Hybrid Monitoring

How does it work together?

Requirements

  1. Grafana – for web UI
  2. Grafana Alloy – to receive, process and export telemetry data
  3. Grafana Tempo – to collect and visualize traces from Datadog instrumented apps
  4. Datadog agent – with enabled APM feature
  5. Instrumented application with Datadog trace library

Quick checklist:

  1. You have Grafana, Alloy and Tempo services running
  2. You have Datadog agents running
  3. You have applications instrumented with the Datadog trace library
  4. You have added Tempo as a data source in Grafana

Grafana Alloy configuration

The core of the solution is to set up the Datadog receiver in the Alloy configuration:

otelcol.receiver.datadog "default" {
  endpoint = "0.0.0.0:9126"

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// I am skipping extra steps which you might want to use to pre-process data

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "https://tempo-distributor.example.com:443"

    tls {
      insecure             = false
      insecure_skip_verify = true
    }
  }
}

To avoid a port conflict with the Datadog agent, we chose 9126 as the port for the Alloy Datadog receiver.

Running Alloy as a DaemonSet with hostNetwork access allows the agent to be present on each node.

alloy:
  extraPorts:
    - name: "datadog"
      port: 9126
      targetPort: 9126
      protocol: "TCP"

controller:
  hostNetwork: true
  hostPID: true

service:
  internalTrafficPolicy: "Local"

Application set up

The application has to be configured to send APM traces to the Node IP on port 9126.

The Node IP can be extracted from the Kubernetes metadata and passed as an environment variable:

env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

Datadog Agent configuration

With Datadog Agent we have two options:

  1. Send traces to the Alloy Datadog receiver in addition to the Datadog backend
  2. Send traces only to the Alloy Datadog receiver

With option 1, we assume you are using both the Datadog and Grafana solutions in hybrid mode. In that case the Datadog agent needs the following configuration:

agents:
  containers:
    agent:
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
    traceAgent:
      enabled: true
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'

For option 2, where traces must go only to the Alloy Datadog receiver, the following configuration might work:

datadog:
  apm:
    socketEnabled: true   # Use the Unix Domain Socket (default). Can be true even if port is enabled.
    portEnabled: true     # Enable TCP port 8126 for traces.
    useLocalService: false

  env:
    - name: DD_APM_DD_URL
      value: "http://alloy-service:9126"  # URL for the OTel collector's Datadog trace receiver
    - name: DD_APM_NON_LOCAL_TRAFFIC
      value: "true"

After everything is done and the agents are restarted, you can navigate to the Explore page in Grafana, select Tempo as the data source and see recent Datadog traces.

Conclusions

  1. Kudos to Grafana and Datadog – Their collaboration on the otelcol.receiver.datadog makes transitioning between monitoring platforms smoother than ever, reducing friction for hybrid observability.
  2. Hybrid Monitoring is the New Normal – Applications no longer rely on a single monitoring provider throughout their lifespan. As infrastructure evolves, businesses will inevitably switch or integrate multiple observability tools.
  3. Stay Agile with Open Standards – Using OpenTelemetry ensures flexibility, allowing teams to adapt their monitoring stack without vendor lock-in, keeping observability seamless across hybrid and multi-cloud environments.

By embracing open standards, organizations can future-proof their monitoring strategies while ensuring complete visibility across their hybrid infrastructure. 🚀

Cost saving strategy for Kubernetes platform

Working on a cost saving strategy involves looking at the problem from very different dimensions.
Overall Kubernetes costs can be split into compute, networking, storage, licensing and SaaS costs.
In this part I will cover: Right-size infrastructure & use autoscaling

Right-size infrastructure & use autoscaling

Kubernetes Nodes autoscaler

In a cloud environment, the Kubernetes node autoscaler plays an important role in delivering just enough resources for your cluster. In a nutshell:

  • it adds new nodes according to demand – scale up
  • it consolidates underutilized nodes – scale down

It directly depends on Pod resource requests, so choosing the right requests is a key component of a cost-effective strategy.
For self-hosted solutions you might want to look into projects like Karpenter.
An alternative scaling strategy is to add or remove nodes on a schedule. Using that approach you can simply specify the time of day when you need to scale up and the time when you scale down.


Multi provider DNS management with Terraform and Pulumi

The Problem

Every DNS provider is very specific in how it creates DNS records. Using Terraform or Pulumi doesn't guarantee multi-provider support out of the box.

One example: AWS Route53 supports binding multiple IP values to the same name record, whereas Cloudflare must have a dedicated record for each IP.

These API differences make it harder to write code that works for multiple providers.

For AWS Route53 a single record can be created like this:

mydomain.local: IP1, IP2, IP3

For Cloudflare it would be 3 different records:

mydomain.local: IP1
mydomain.local: IP2
mydomain.local: IP3

Solution 1: Using the flexibility of a programming language with Pulumi

Pulumi has the upper hand here, since you can use the power of a programming language to handle custom logic.

DNS data structure:

mydomain1.com: 
 - IP1
 - IP2 
 - IP3
mydomain2.com:
 - IP4
 - IP5 
 - IP6
mydomain3.com: 
 - IP7
 - IP8 
 - IP9

Using Python or JavaScript we can expand this structure for the Cloudflare provider or keep it as is for AWS Route53.

In the Cloudflare case we will create a new record for each IP:

import pulumi
import pulumi_cloudflare as cloudflare
import yaml

# Load the configuration from a YAML file
yaml_file = "dns_records.yaml"
with open(yaml_file, "r") as file:
    dns_config = yaml.safe_load(file)

# Cloudflare Zone ID (Replace with your actual Cloudflare Zone ID)
zone_id = "your_cloudflare_zone_id"

# Iterate through domains and their associated IPs to create A records
for domain, ips in dns_config.items():
    if isinstance(ips, list):  # Ensure it's a list of IPs
        for ip in ips:
            record_name = domain
            cloudflare.Record(
                f"{record_name}-{ip.replace('.', '-')}",
                zone_id=zone_id,
                name=record_name,
                type="A",
                value=ip,
                ttl=3600,  # Set TTL (adjust as needed)
            )

# Export the created records
pulumi.export("dns_records", dns_config)

And since AWS Route53 supports a list of IPs, the code would look like this:

for domain, ips in dns_config.items():
    if isinstance(ips, list) and ips:  # Ensure it's a list of IPs and not empty
        aws.route53.Record(
            f"{domain}-record",
            zone_id=hosted_zone_id,
            name=domain,
            type="A",
            ttl=300,  # Set TTL (adjust as needed)
            records=ips,  # AWS Route 53 supports multiple IPs in a single record
        )

Solution 2: Using Terraform's for_each loop

It’s quite possible to achieve the same using Terraform, starting with version 0.12, which introduced for expressions and the for_each meta-argument.

Same data structure:

mydomain1.com: 
  - 192.168.1.1
  - 192.168.1.2
  - 192.168.1.3
mydomain2.com:
  - 10.0.0.1
  - 10.0.0.2
  - 10.0.0.3
mydomain3.com: 
  - 172.16.0.1
  - 172.16.0.2
  - 172.16.0.3

Terraform example for AWS Route53

provider "aws" {
  region = "us-east-1"  # Change this to your preferred region
}

variable "hosted_zone_id" {
  type = string
}

variable "dns_records" {
  type = map(list(string))
}

resource "aws_route53_record" "dns_records" {
  for_each = var.dns_records

  zone_id = var.hosted_zone_id
  name    = each.key
  type    = "A"
  ttl     = 300
  records = each.value
}

Quite simple using a for_each loop, but it will not work with Cloudflare because of the compatibility issue mentioned above, so we need a separate record for each IP.

Terraform example for Cloudflare

# Create multiple records for each domain, one per IP
locals {
  dns_record_pairs = flatten([
    for domain, ips in var.dns_records : [
      for ip in ips : { domain = domain, ip = ip }
    ]
  ])
}

resource "cloudflare_record" "dns_records" {
  for_each = { for pair in local.dns_record_pairs : "${pair.domain}-${pair.ip}" => pair }

  zone_id = var.cloudflare_zone_id
  name    = each.value.domain
  type    = "A"
  value   = each.value.ip
  ttl     = 3600
  proxied = false  # Set to true if using Cloudflare proxy
}

Conclusions

  1. Pulumi: Flexible and easy to start. Data is separate from code, making it easy to add providers or change logic.
  2. Terraform: Less complex and easier to support long-term, but it depends on the data format.
  3. Both solutions require programming skills or expertise in the Terraform language.

Grafana a new look

As much as I like managed monitoring solutions, it’s hard not to be excited about setting up your own platform based on the Prometheus and Grafana stack and saving some bucks for the company.

After 5 years of using proprietary solutions, I am again on the way into the Grafana world.

Here I want to wrap up what I have learnt so far about Grafana.

Big ecosystem of tools

The biggest change for me is the ecosystem of projects which supports almost all imaginable monitoring needs.

  • For Logs: Loki and Promtail
  • For Traces: Tempo
  • For Profiling: Pyroscope
  • For OpenTelemetry collection: Alloy
  • For eBPF: Beyla
  • For Synthetics: Prometheus Blackbox exporter and k6
  • and many more

Automated Deployment process

A lot has been done in the deployment process: using Helm charts you can get it up and running in minutes.

All that comes with scalability and high availability in mind.

Though you would need to connect and customize each component, it can still be just a Helm configuration change.

Huge ecosystem of Components to simplify instrumentation

It’s clear your code has to be instrumented to be observable. Grafana and their open source community have developed a number of tools to make it easy and simple:

  1. Compatible receivers for big providers like Datadog
  2. New discovery methods like eBPF
  3. No-code ways of instrumentation, like a sidecar container

And a lot more; see the Alloy components section to get started.

New fresh UI

Grafana Logs and Metrics get a new look.

You can see multiple logs and metrics on the same page.

Summary

It’s cool, it’s fresh and it’s a lot of fun to use!

Can ArgoCD replace Helm managers?


ArgoCD is a tool which provides a GitOps way of deployment. It supports various application formats, including Helm charts.

Helm is one of the most popular packaging formats for Kubernetes applications. It gave rise to various tools for managing Helm charts, such as helmfile, helmify and others.

Why is ArgoCD replacing Helm managers?

For a long time, Helm managers have helped to deploy Helm charts via different deployment strategies. They served well, but the big disadvantage was the code drift which accumulates over time.

To avoid code drift, one solution was to schedule a periodic job which applies the latest changes to the Helm charts. Sometimes it works, but most often, if one of the charts hits an install issue, all other charts stop being updated as well.
So, it wasn’t an ideal way to handle it.

In that situation ArgoCD comes to the rescue. ArgoCD allows controlling the update of each chart as a dedicated process, where an issue in one chart will not block updates of other charts. Helm charts are first-class citizens in ArgoCD, and most of their features are supported.

How does it look in ArgoCD?

To try it out, let’s start with a simple Application which installs the Prometheus Helm chart with a custom configuration.

The Prometheus Helm chart can be found at https://prometheus-community.github.io/helm-charts

The custom configuration is stored in our internal project https://git.example.com/org/value-files.git

Thanks to multi-source support in Application, we can use the official chart and our custom configuration together, as in the following example:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  sources:
  - repoURL: 'https://prometheus-community.github.io/helm-charts'
    chart: prometheus
    targetRevision: 15.7.1
    helm:
      valueFiles:
      - $values/charts/prometheus/values.yaml
  - repoURL: 'https://git.example.com/org/value-files.git'
    targetRevision: dev
    ref: values

Here `$values` is a reference to the root folder of the values-files repository. That values file overrides the default Prometheus settings, so we can configure it for our needs.

What’s next?

There are several distinct features which Helm managers support, so your next step is to check whether ArgoCD covers all your needs.

One notable feature is secrets support. Many managers support the SOPS format for secrets, whereas ArgoCD doesn’t give you any solution, so it’s up to you how to manage secrets.

Another important feature is the order of execution, which can be an important part of your Helm manager setup. ArgoCD doesn’t have a built-in replacement, so you have to rely on the Helm format to support dependencies. One way is to build umbrella charts for complex applications.

The most important change is the GitOps style of deployment. ArgoCD doesn’t run a CI/CD pipeline, so you don’t have a feature like helm diff to preview changes before applying them. It applies them as soon as they become available in the linked git repository.

Conclusions

  1. ArgoCD can replace Helm managers, but it strongly depends on your project needs.
  2. ArgoCD introduces new challenges, like secrets management and order of execution.
  3. It introduces a GitOps deployment style and replaces the usual CI/CD pipeline, so new “quality gates” need to be built to be ready for a production environment.

Path to Staff Engineer role


Someday in your career you take time to think: what’s next for me? What is my next challenge to grow on the career ladder? If your answer is Staff Engineer, Tech Lead, Team Lead or Principal Engineer, then you are on the way to a Staff-plus Engineer role.

As a Staff-plus Engineer myself, I want to share a few tips with you.

Start working on your Staff Engineer package

  1. What’s your Staff-level project?
    • What did you do and what was the impact?
    • What behavior did you demonstrate and how complex were the projects?
  2. Link your Design Documents, RFCs and Proposals to support your package with your design and architecture contributions.
  3. Can you quantify the impact of your projects?
    • Did it help to increase revenue?
    • Save costs?
  4. What glue work did you do for the organization?
    • What’s the impact of the glue work?
  5. Who have you mentored and through what accomplishments?

Sharpen your soft skills

  1. Communication.
    • Keep people informed and connected.
  2. Negotiation
    • Be ready to resolve difficult situations
  3. Presentation
    • Know your audience. Learn one or two presentation frameworks.
  4. Networking
    • Don’t be shy about getting in touch with other Staff-plus engineers in your company.
    • They are the best people to help you navigate the role inside the company.

Learn Staff Engineer tools

  1. Leadership
    • Become a problem solver. Be a visionary in your area. Stay up to date with technologies in your area.
  2. Planning and Goal orientation
    • Have a vision of what you do next in one or another area.
    • Take an active part in planning meetings.
  3. Collaboration and contribution
    • Propose new Decision docs, Proposals and RFCs. Make sure they are reviewed and discussed.
    • Implement a POC to demo the idea.
  4. Team work and mentorship
    • Help your peers. Be a problem solver. Be visible. Become a go to person.
    • Try on a mentorship role.

Summary

Getting a Staff Engineer role can take months or years to accomplish. However, it’s crucial to view this milestone not as the ultimate destination, but rather as a guiding roadmap for your career development. It’s not the title itself that holds the most significance, but rather the daily challenges we face and the positive impact we make as we progress on our journey.

Resources

  1. About glue work.
  2. Staff Engineer book