Solving Top-K problem – Leaderboard in Python

What is the Top-K problem?

Think about a real-time gaming leaderboard with a rank system. Determining the rank of a player is the problem we are trying to solve.
This is an implementation of the real-time leaderboard solution from the System Design Interview book.
Check out the full source code in the Leaderboard – solving Top K Problem with Python GitHub project.

Leaderboard requirements

  1. Get the top 10 ranked players
  2. Show the current rank for each player
  3. Show a slice of players around a specific rank

Solution 1: using SQL database

The direct approach is to use a SQL database and compute the rank in real time by counting how many players have a score greater than or equal to the given player's score. This solution falls short as soon as we need to scale to a high number of players.

Solution 2: using Valkey/Redis sorted set

A Valkey sorted set is a data type that keeps all records ordered by a "score". It was made for exactly this kind of problem, and it's what we are going to use.

Implementation

The Top-K problem (Leaderboard) project is built using FastAPI, SQLModel and Valkey.

API Design

The API design was borrowed from the System Design Interview book. According to the requirements, it covers CRUD operations and a get-rank operation.

1. Create Score
POST /scores
{
"name": "AwesomeGamer"
}

2. Update Score
PUT /scores
{
"id": 1,
"points": 10
}

3. Get TOP Scores
GET /scores

4. Get user rank
GET /scores/{id}

Players data model and API: SQLModel + FastAPI

I am using a SQLite table to store the user profile and the current score of a player. Several SQLModel classes help to customize the API endpoints.

class ScoreBase(SQLModel):
    name: str | None = Field(default=None)

class Score(ScoreBase, table=True):
    id: int | None = Field(default=None, primary_key=True)
    points: int = Field(default=0)
    rank: int = Field(default=0)

class ScoreCreate(ScoreBase):
    pass

class ScorePublic(ScoreBase):
    id: int
    points: int
    name: str
    rank: int

And the API endpoints using FastAPI parameters and response models:

@app.post("/scores", response_model=ScorePublic)
def create_score(data: ScoreCreate, session: SessionDep):

@app.put("/scores", response_model=ScorePublic)
def update_scores(id: int, points: int, session: SessionDep):

@app.get("/scores", response_model=list[ScorePublic])
def get_top_scores(session: SessionDep, 
                   offset: int = 0,
                   limit: Annotated[int, Query(le=10)] = 10):

@app.get("/scores/{id}", response_model=ScorePublic)
def get_score(id: int, session: SessionDep):

Both libraries provide a very flexible and powerful way to design APIs.
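
As an illustration, here is a minimal sketch of what the create_score handler could look like. This is not the project's exact code; it assumes a recent SQLModel release and the imports, engine setup and SessionDep dependency from the FastAPI/SQLModel documentation:

@app.post("/scores", response_model=ScorePublic)
def create_score(data: ScoreCreate, session: SessionDep):
    # Validate the request body against the table model and persist it.
    score = Score.model_validate(data)
    session.add(score)
    session.commit()
    session.refresh(score)  # load the generated primary key
    return score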

Rank solution based on Valkey sorted set

From the Valkey documentation:

A Sorted Set is a collection of unique strings (members) ordered by an associated score. When more than one string has the same score, the strings are ordered lexicographically. Some use cases for sorted sets include:

  • Leaderboards. For example, you can use sorted sets to easily maintain ordered lists of the highest scores in a massive online game.
  • Rate limiters. In particular, you can use a sorted set to build a sliding-window rate limiter to prevent excessive API requests.

ZREVRANGE helps to get the top 10 players for our leaderboard, sorted in descending order by score.

ZREVRANGE leaderboard 0 9

ZRANK helps to get a specific player's rank.

ZRANK leaderboard id

Finally, using ZRANK followed by ZREVRANGE we can also provide a list of players who are near a specific player in the ranking.

ZRANK leaderboard 77
# rank 255
ZREVRANGE leaderboard 250 254
# 5 players before
ZREVRANGE leaderboard 256 260
# 5 players after
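
For reference, the same operations can be issued from Python. This is a minimal sketch using the redis-py client (the valkey-py client exposes the same commands); the key name leaderboard and the player id are just example values, and it uses ZREVRANK so the "players around a rank" slice lines up with ZREVRANGE indices:

import redis  # the valkey-py client offers the same command methods

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Record a player's points (the member is the player id as a string).
r.zadd("leaderboard", {"77": 150})

# Top 10 players, highest score first (ZREVRANGE leaderboard 0 9).
top10 = r.zrevrange("leaderboard", 0, 9, withscores=True)

# The player's position counted from the highest score (ZREVRANK).
rank = r.zrevrank("leaderboard", "77")

# A slice of players around that position.
window = r.zrevrange("leaderboard", max(rank - 5, 0), rank + 5, withscores=True)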

Conclusions

  1. FastAPI and SQLModel provide great interfaces for designing APIs with nice domain model integration
  2. A Valkey sorted set is a built-in solution for the Top-K problem
  3. It’s not Python-native, but if you are looking for a more LeetCode-style solution in Python, check out this awesome video

LLM-Powered Unit Testing: Effortlessly Generate Domain Model Data

Traditional approaches to generating model data for unit tests often involve setting up complex fixtures or writing cumbersome mock classes. This becomes especially challenging when your domain model is tightly coupled with your data model.

class Order(Base):
    # ...
    product_name = Column(String, index=True)
    quantity = Column(Integer)

In more advanced scenarios, your Domain model is decoupled from the Data model, allowing for cleaner architecture and improved maintainability.

class Order(BaseModel):
    product_name: str
    quantity: int


class TestOrder(unittest.TestCase):
    def test_order(self):
        order = Order(product_name="Milk", quantity=2)
        self.assertEqual(order.quantity, 2)

Using hard-coded data can be a sub-optimal approach. Every modification to the Domain model requires updates to the test suite.

LLM-generated data comes to the rescue. By leveraging models like GPT-4o, you can dynamically generate test data that adheres to your Domain model’s schema.

In the following example, we’ll use GPT-4o to generate data for an Electronics Domain model class, ensuring the structure and content align with the model’s expected schema.

# Imports are skipped to demo the core of the solution

class Electronics(BaseModel):
    """ Domain model for Electronics """
    model: str
    price: float = Field(gt=0, description="Price must be greater than 0")

@llm.call(provider="openai", model="gpt-4o", response_model=List[Electronics])
def list_electronics(limit: int) -> List[Electronics]:
    """ Generate N items of products defined as Electronics """
    return f"Generate {limit} number of electronic products"

And the unit tests:

class TestElectronics(unittest.TestCase):
    """ Test class using LLM to generate domain data """

    @classmethod
    def setUpClass(cls):
        """ Generating test data once """
        cls._products = list_electronics(5)

    def test_unique_models(self):
        """ Test product models are unique """
        models = [product.model for product in self._products]
        self.assertEqual(len(models), len(set(models)))

Running the LLM can take a few seconds, so to optimize performance, I moved the data generation process to the setUpClass method.

Big kudos to mirascope, llm, smartfunc and other libraries for making LLMs much more accessible to everyday applications!

Feel free to try it yourself; all source code is available in the GitHub project.

If you are interested in command-line LLM integration, check out How to automate review with gptscript.

Conclusions:

  1. LLMs have become much more accessible in day-to-day programming.
  2. Unit test data generation is just one example, but it demonstrates how easy it is to embed LLMs in real-life scenarios.
  3. It is still expensive and time-consuming to call an LLM on every unit test invocation, so a smarter approach is to generate and cache the test data, as sketched below.
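
A minimal sketch of that caching idea, reusing the Electronics model and list_electronics function from above; the cache file name and helper are hypothetical:

import json
from pathlib import Path

CACHE_FILE = Path("electronics_fixture.json")  # hypothetical cache location

def load_or_generate_products(limit: int = 5) -> list[Electronics]:
    """ Reuse previously generated LLM data if a cache file exists """
    if CACHE_FILE.exists():
        items = json.loads(CACHE_FILE.read_text())
        return [Electronics(**item) for item in items]
    products = list_electronics(limit)
    CACHE_FILE.write_text(json.dumps([p.model_dump() for p in products]))
    return products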

🚀 Supercharge Your Python Code Reviews: Automate with GPT and OpenAPI Endpoints

A common approach to getting code reviews with ChatGPT is by posting a prompt along with a code snippet. While effective, this process can be time-consuming and repetitive. If you want to streamline, customize, or fully automate your code review process, meet gptscript.

Gptscript is a powerful automation tool designed to build tools of any complexity on top of GPT models.

At the core of gptscript is a script defining a series of steps to execute, along with available tools such as Git operations, file reading, web scraping, and more.

In this guide, we’ll demonstrate how to perform a Python code review using a simple script (python-code-review.gpt).

tools: sys.read

You are an expert Python developer, your task is to review a set of pull requests.
You are given a list of filenames and their partial contents, but note that you might not have the full context of the code.

Only review lines of code which have been changed (added or removed) in the pull request. The code looks similar to the output of a git diff command. Lines which have been removed are prefixed with a minus (-) and lines which have been added are prefixed with a plus (+). Other lines are added to provide context but should be ignored in the review.

Begin your review by evaluating the changed code using a risk score similar to a LOGAF score but measured from 1 to 5, where 1 is the lowest risk to the code base if the code is merged and 5 is the highest risk which would likely break something or be unsafe.

In your feedback, focus on highlighting potential bugs, improving readability if it is a problem, making code cleaner, and maximising the performance of the programming language. Flag any API keys or secrets present in the code in plain text immediately as highest risk. Rate the changes based on SOLID principles if applicable.

Do not comment on breaking functions down into smaller, more manageable functions unless it is a huge problem. Also be aware that there will be libraries and techniques used which you are not familiar with, so do not comment on those unless you are confident that there is a problem.

Use markdown formatting for the feedback details. Also do not include the filename or risk level in the feedback details.

Ensure the feedback details are brief, concise, accurate. If there are multiple similar issues, only comment on the most critical.

Include brief example code snippets in the feedback details for your suggested changes when you're confident your suggestions are improvements. Use the same programming language as the file under review.
If there are multiple improvements you suggest in the feedback details, use an ordered list to indicate the priority of the changes.

Format the response in a valid Markdown format as a list of feedbacks, where the value is an object containing the filename ("fileName"), risk score ("riskScore") and the feedback ("details"). The schema of the Markdown feedback object must be:

## File: filename
Risk: riskScore

Details: details


The content for review is provided as input file.

Testing time

You will need:

  1. gptscript installed.
  2. The prompt above saved as a python-code-review.gpt file.
  3. A file to review. I am using the following Python code (code.py):

colors = {
    "apple": "red",
    "banana": "yellow",
    "cherry": "red",
    "mango": "red",
    "lemon": "yellow",
    "plum": "purple"
}

common = {}
for k, v in colors.items():
    if v in common:
        common[v] += 1
    else:
        common[v] = 1

common = sorted(common.items(), key=lambda x: x[1], reverse=True)
print(common[0][0])

Run gptscript with the prompt file and the code file as its first two arguments:

gptscript python-code-review.gpt code.py

Output:

## File: code.py
Risk: 2

Details:
1. The sorting of the `common` dictionary could be optimized by using the `collections.Counter` class, which is specifically designed for counting hashable objects. This would make the code more readable and efficient.

```python
from collections import Counter

common = Counter(colors.values())
most_common_color = common.most_common(1)[0][0]
print(most_common_color)
```

2. Consider using more descriptive variable names to improve readability, such as `color_counts` instead of `common`.

The result is in Markdown syntax, which is easy for a human to read.

But if you want to add automation, I would prefer to change the output to JSON or another format that suits your tools.

Let’s refactor the prompt to request JSON output:

Format the response in a valid JSON format as a list of feedbacks, where the value is an object containing the filename ("fileName"),  risk score ("riskScore") and the feedback ("details"). 
The schema of the JSON feedback object must be:
{
  "fileName": {
    "type": "string"
  },
  "riskScore": {
    "type": "number"
  },
  "details": {
    "type": "string"
  }
}


The content for review is provided as input file.

Re-run the script and you will get something like this:

[
  {
    "fileName": "code.py",
    "riskScore": 2,
    "details": "1. Consider using a `defaultdict` from the `collections` module to simplify the counting logic. This will make the code cleaner and more efficient.\n\nExample:\n```python\nfrom collections import defaultdict\n\ncommon = defaultdict(int)\nfor v in colors.values():\n common[v] += 1\n```\n\n2. The sorting and accessing the first element can be improved for readability by using `max` with a key function.\n\nExample:\n```python\nmost_common_color = max(common.items(), key=lambda x: x[1])[0]\nprint(most_common_color)\n```"
  }
]

Remember, LLM output is not deterministic; you can get a different result for the same request.
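
From here it is easy to wire the review into CI: a small wrapper can parse the JSON and block a merge above a chosen risk threshold. A hedged sketch, assuming the script prints only the JSON array to stdout and using a hypothetical threshold:

import json
import subprocess
import sys

RISK_THRESHOLD = 4  # hypothetical cut-off for blocking a merge

# Run the review and capture its JSON output.
result = subprocess.run(
    ["gptscript", "python-code-review.gpt", "code.py"],
    capture_output=True, text=True, check=True,
)
feedbacks = json.loads(result.stdout)

for feedback in feedbacks:
    print(f"{feedback['fileName']} (risk {feedback['riskScore']}):")
    print(feedback["details"])

if any(f["riskScore"] >= RISK_THRESHOLD for f in feedbacks):
    sys.exit("Risk score too high, blocking the merge.")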

Now let’s review what actually happens when you run gptscript with the prompt.

What It Does in a Nutshell

  1. Extracts code for review: Captures content from the input file (code.py).
  2. Sets context for the LLM: Instructs the LLM to act as an expert Python developer tasked with providing a detailed and sophisticated code review.
  3. Defines Structured Output: Returns results in two fields:
    Risk: A score indicating the potential risk associated with the changes.
    Details: A comprehensive explanation of the changes and their implications.

Conclusion

Gptscript is a powerful tool to kickstart your journey into automation using OpenAI models. By defining custom scripts, you can streamline complex workflows, such as automated code reviews, with minimal effort.

This example just scratches the surface—there’s much more you can achieve with gptscript. Explore additional examples from the gptscript project to discover more possibilities and enhance your automation capabilities.

Happy automating!

Tech Team Leaders’ Guide to Strategy

Building a tech strategy is a core responsibility of the CTO, VP of Engineering, or Head of Engineering. Involving team leaders in this process ensures a more grounded and effective approach.

Tech team leaders play a crucial role by defining roadmaps for their teams, which, in turn, provide the foundation for an effective high-level strategy. To achieve the best results, continuous collaboration between leadership and team leaders is essential.

Let’s explore how to create a roadmap that is both practical and aligned with the company’s overall vision.

Building an Effective Team Roadmap

A team roadmap is a strategic document that outlines product needs, infrastructure requirements, modernization efforts, and compliance and security considerations, among other critical aspects.

An effective roadmap goes beyond listing high-level initiatives or goals. It expands on each goal using the Diagnosis, Policy, and Actions framework, helping to answer the Why, What, and How of every initiative. This approach fosters trust, alignment, and transparency with top-level leadership.

The Diagnosis, Policy, and Actions framework, developed by Richard Rumelt, consists of:

  • Diagnosis – Defining the problem that needs to be addressed
  • Policy – Establishing guiding principles and constraints for the solution
  • Actions – Defining concrete steps to implement the solution within the given policy

Let’s explore a few examples.

Example 1: Modernizing the Infrastructure

Diagnosis: Our current infrastructure relies on outdated and proprietary components, leading to scalability challenges, high maintenance costs, and slow adoption of new technologies.

Policy: Prioritize open-source and cloud-native solutions for new developments. Maintain legacy systems where necessary but avoid further expansion of proprietary technologies.

Actions:

  1. Identify and replace critical proprietary components with open-source or cloud-native alternatives.
  2. Standardize infrastructure automation and provisioning to improve scalability and maintainability.
  3. Update internal documentation and on-boarding materials to reflect new infrastructure standards.

Example 2: Upgrade the Database

Diagnosis: The current database version has reached end-of-life and is no longer receiving security updates or feature enhancements. An upgrade is necessary to maintain security, stability, and performance.

Policy: The database upgrade must be performed with zero downtime to avoid service disruptions.

Actions:

  1. Test new database version in the QA environment to ensure compatibility
  2. Create a full backup of the existing database.
  3. Implement a Blue-Green deployment strategy to minimize risk during the upgrade.
  4. Communicate the upgrade plan and schedule a rollout window.

Example 3: Improve Cloud Cost Efficiency

Diagnosis: Cloud expenses represent a significant portion of overall costs. Unused or underutilized resources contribute to unnecessary costs.

Policy: Optimize cloud usage by right-sizing instances, using auto-scaling, and enforcing cost-control policies.

Actions:

  1. Conduct an audit of cloud resources to identify inefficiencies.
  2. Implement auto-scaling policies for workloads with variable demand.
  3. Use reserved or spot instances for predictable workloads.
  4. Set up monitoring and alerts for unexpected cost spikes.

Conclusion

  1. Structuring your team’s roadmap using the Diagnosis, Policy, and Actions framework ensures clear prioritization and alignment with the company’s overall strategy.
  2. This approach facilitates productive discussions with top-level leadership, leading to better decision-making.
  3. It improves transparency, trust and accountability across all levels of the organization.

Have you faced challenges when implementing a strategic roadmap? How did you overcome them? Drop a comment below and let’s learn from each other!

Collecting Datadog APM traces using Grafana Alloy and Tempo

The Problem

Hybrid cloud is the future, but monitoring remains stuck in the past. Many organizations embrace hybrid infrastructure, yet struggle with fragmented observability tools. Why? Because monitoring providers still operate in silos.

One of the primary reasons hybrid monitoring isn’t as prevalent is the challenge of instrumentation. Many cloud providers offer their own monitoring solutions. Instrumentation libraries are often incompatible with one another, making cross-platform integration difficult.

The good news? With OpenTelemetry, Grafana, and Datadog, hybrid monitoring is becoming easier and more flexible. 🚀

The Solution

One promising development is the rise of open-source, vendor-neutral instrumentation frameworks like OpenTelemetry.

In essence, open standards are reducing incompatibility issues and allowing “a vendor-agnostic approach to get data from the sources you need to the observability service of your choice.”

A step toward solving the hybrid cloud monitoring challenge came when Grafana introduced the otelcol.receiver.datadog component.

Now, with otelcol.receiver.datadog, Grafana users can ingest and process Datadog telemetry directly within OpenTelemetry pipelines, unlocking several advantages:

  1. Expanding Grafana’s Reach to Datadog Customers
  2. Seamless Integration with OpenTelemetry Pipelines
  3. Avoiding Vendor Lock-in While Retaining Datadog’s Strengths
  4. Cost Optimization by Centralizing Hybrid Monitoring

How does it work together?

Requirements

  1. Grafana – for web UI
  2. Grafana Alloy – to receive, process and export telemetry data
  3. Grafana Tempo – to collect and visualize traces from Datadog instrumented apps
  4. Datadog agent – with enabled APM feature
  5. Instrumented application with Datadog trace library

Quick checklist:

  1. You have Grafana, Alloy and Tempo services running
  2. You have Datadog agents running
  3. You have applications instrumented with the Datadog trace library
  4. You have added Tempo as a data source in Grafana

Grafana Alloy configuration

The core of the solution is to set up the Datadog receiver in the Alloy configuration:

otelcol.receiver.datadog "default" {
  endpoint = "0.0.0.0:9126"

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// I am skipping extra steps which you might want to use to pre-process data

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "https://tempo-distributor.example.com:443"

    tls {
      insecure             = false
      insecure_skip_verify = true
    }
  }
}

To avoid a port conflict with the Datadog agent, we chose 9126 as the port for the Alloy Datadog receiver.

Running Alloy as a DaemonSet with hostNetwork access allows the agent to be present on each node.

alloy:
  extraPorts:
    - name: "datadog"
      port: 9126
      targetPort: 9126
      protocol: "TCP"

controller:
  hostNetwork: true
  hostPID: true

service:
  internalTrafficPolicy: "Local"

Application set up

The application has to be configured to send APM traces to the Node IP on port 9126.

The Node IP can be extracted from the Kubernetes metadata and passed as an environment variable:

env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

Datadog Agent configuration

With Datadog Agent we have two options:

  1. Send traces to the Alloy Datadog receiver in addition to the Datadog backend
  2. Send traces only to the Alloy Datadog receiver

With option 1, we assume you are using both the Datadog and Grafana solutions in hybrid mode. In that case the Datadog agent needs the following configuration:

agents:
  containers:
    agent:
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
    traceAgent:
      enabled: true
      env:
        - name: DD_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'
        - name: DD_APM_ADDITIONAL_ENDPOINTS
          value: '{"http://alloy-service:9126": ["datadog-receiver"]}'

For option 2, where traces must go only to the Alloy Datadog receiver, the following configuration might work:

datadog:
  apm:
    socketEnabled: true   # Use the Unix Domain Socket (default). Can be true even if port is enabled.
    portEnabled: true     # Enable TCP port 8126 for traces.
    useLocalService: false

  env:
    - name: DD_APM_DD_URL
      value: "http://alloy-service:9126"  # URL for the OTel collector's Datadog trace receiver
    - name: DD_APM_NON_LOCAL_TRAFFIC
      value: "true"

After everything is done and the agents are restarted, you can navigate to the Explore page in Grafana, select Tempo as the data source and see recent Datadog traces.

Conclusions

  1. Kudos to Grafana and Datadog – Their collaboration on the otelcol.receiver.datadog makes transitioning between monitoring platforms smoother than ever, reducing friction for hybrid observability.
  2. Hybrid Monitoring is the New Normal – Applications no longer rely on a single monitoring provider throughout their lifespan. As infrastructure evolves, businesses will inevitably switch or integrate multiple observability tools.
  3. Stay Agile with Open Standards – Using OpenTelemetry ensures flexibility, allowing teams to adapt their monitoring stack without vendor lock-in, keeping observability seamless across hybrid and multi-cloud environments.

By embracing open standards, organizations can future-proof their monitoring strategies while ensuring complete visibility across their hybrid infrastructure. 🚀

Cost saving strategy for Kubernetes platform

Working on a cost saving strategy involves looking at the problem from very different dimensions.
Overall Kubernetes costs can be split into compute, networking, storage, licensing and SaaS costs.
In this part I will cover: Right-size infrastructure & use autoscaling

Right-size infrastructure & use autoscaling

Kubernetes Nodes autoscaler

In a cloud environment, the Kubernetes node autoscaler plays an important role in delivering just enough resources for your cluster. In a nutshell:

  • it adds new nodes according to demand – scale up
  • it consolidates underutilized nodes – scale down

It directly depends on Pod resource requests, so choosing the right requests is a key component of a cost-effective strategy.
For self-hosted solutions you might want to look into projects like Karpenter.
An alternative scaling strategy is to add or remove nodes on a schedule. Using that approach you can simply specify the time of day when you need to scale up and the time when you scale down.


Multi provider DNS management with Terraform and Pulumi

The Problem

Every DNS provider is very specific in how it creates DNS records. Using Terraform or Pulumi doesn't guarantee multi-provider support out of the box.

One example: AWS Route53 supports binding multiple IP values to the same name record, whereas Cloudflare must have a dedicated record for each IP.

These API differences make it harder to write code that works for multiple providers.

For AWS Route53 a single record can be created like this:

mydomain.local: IP1, IP2, IP3

For Cloudflare it would be 3 different records:

mydomain.local: IP1
mydomain.local: IP2
mydomain.local: IP3

Solution 1: Using the flexibility of a programming language with Pulumi

Pulumi has the upper hand here, since you can use the power of a programming language to handle custom logic.

DNS data structure:

mydomain1.com: 
 - IP1
 - IP2 
 - IP3
mydomain2.com:
 - IP4
 - IP5 
 - IP6
mydomain3.com: 
 - IP7
 - IP8 
 - IP9

Using Python or JavaScript we can expand this structure for the Cloudflare provider or keep it as is for AWS Route53.

In the Cloudflare case we will create a new record for each IP:

import pulumi
import pulumi_cloudflare as cloudflare
import yaml

# Load the configuration from a YAML file
yaml_file = "dns_records.yaml"
with open(yaml_file, "r") as file:
    dns_config = yaml.safe_load(file)

# Cloudflare Zone ID (Replace with your actual Cloudflare Zone ID)
zone_id = "your_cloudflare_zone_id"

# Iterate through domains and their associated IPs to create A records
for domain, ips in dns_config.items():
    if isinstance(ips, list):  # Ensure it's a list of IPs
        for ip in ips:
            record_name = domain
            cloudflare.Record(
                f"{record_name}-{ip.replace('.', '-')}",
                zone_id=zone_id,
                name=record_name,
                type="A",
                value=ip,
                ttl=3600,  # Set TTL (adjust as needed)
            )

# Export the created records
pulumi.export("dns_records", dns_config)

And since AWS Route53 supports a list of IPs, the code would look like this:

for domain, ips in dns_config.items():
    if isinstance(ips, list) and ips:  # Ensure it's a list of IPs and not empty
        aws.route53.Record(
            f"{domain}-record",
            zone_id=hosted_zone_id,
            name=domain,
            type="A",
            ttl=300,  # Set TTL (adjust as needed)
            records=ips,  # AWS Route 53 supports multiple IPs in a single record
        )

Solution 2: Using Terraform's for_each loop

It’s quite possible to achieve the same using Terraform, starting with version 0.12, which introduced for expressions and the for_each meta-argument.

Same data structure:

mydomain1.com: 
  - 192.168.1.1
  - 192.168.1.2
  - 192.168.1.3
mydomain2.com:
  - 10.0.0.1
  - 10.0.0.2
  - 10.0.0.3
mydomain3.com: 
  - 172.16.0.1
  - 172.16.0.2
  - 172.16.0.3

Terraform example for AWS Route53

provider "aws" {
  region = "us-east-1"  # Change this to your preferred region
}

variable "hosted_zone_id" {
  type = string
}

variable "dns_records" {
  type = map(list(string))
}

resource "aws_route53_record" "dns_records" {
  for_each = var.dns_records

  zone_id = var.hosted_zone_id
  name    = each.key
  type    = "A"
  ttl     = 300
  records = each.value
}

Quite simple using a for_each loop, but it will not work with Cloudflare because of the compatibility issue mentioned above, so we need a separate record for each IP.

Terraform example for Cloudflare

# Create multiple records for each domain, one per IP
locals {
  dns_record_pairs = flatten([
    for domain, ips in var.dns_records : [
      for ip in ips : { domain = domain, ip = ip }
    ]
  ])
}

resource "cloudflare_record" "dns_records" {
  for_each = { for pair in local.dns_record_pairs : "${pair.domain}-${pair.ip}" => pair }

  zone_id = var.cloudflare_zone_id
  name    = each.value.domain
  type    = "A"
  value   = each.value.ip
  ttl     = 3600
  proxied = false  # Set to true if using Cloudflare proxy
}

Conclusions

  1. Pulumi: Flexible and easy to start. Data is separate from code, making it easy to add providers or change logic.
  2. Terraform: Less complex and easier to support long-term, but it depends on the data format.
  3. Both solutions require programming skills or expertise in the Terraform language.

Grafana a new look

As much as I like managed monitoring solutions, it’s hard not to be excited about setting up your own platform based on the Prometheus and Grafana stack and saving some bucks for the company.

After 5 years of using proprietary solutions, I am again on the way into the Grafana world.

Here I want to wrap up what I have learnt so far about Grafana.

Big ecosystem of tools

The biggest change for me is the ecosystem of projects which supports almost all imaginable monitoring needs.

  • For Logs: Loki and Promtail
  • For Traces: Tempo
  • For Profiling: Pyroscope
  • For OpenTelemetry collection: Alloy
  • For eBPF: Beyla
  • For Synthetics: Prometheus Blackbox exporter and k6
  • and many more

Automated Deployment process

A lot has been done in the deployment process: using Helm charts you can get it up and running in minutes.

All that comes with scalability and high availability in mind.

Though you would need to connect and customize each component, it can still be just a Helm configuration change.

Huge ecosystem of Components to simplify instrumentation

It’s clear your code has to be instrumented to be observable. Grafana and their open source community have developed a number of tools to make it easy and simple:

  1. Compatible receivers for big providers like Datadog
  2. New discovery methods like eBPF
  3. No-code ways of instrumentation, like a sidecar container

And a lot more; see the Alloy components section to get started.

New fresh UI

Grafana Logs and Metrics get a new look.

You can see multiple logs and metrics on the same page.

Summary

It’s cool, it’s fresh and it’s a lot of fun to use!

Can ArgoCD replace Helm managers?


ArgoCD is a tool which provides a GitOps way of deployment. It supports various application formats, including Helm charts.

Helm is one of the most popular packaging formats for Kubernetes applications. It gave rise to various tools for managing Helm charts, such as helmfile, helmify and others.

Why is ArgoCD replacing Helm managers?

For a long time, Helm managers have helped to deploy Helm charts via different deployment strategies. They served well, but the big disadvantage was the code drift which accumulates over time.

To avoid code drift, one solution was to schedule a periodic job which applies the latest changes to the Helm charts. Sometimes it works, but most often, if one of the charts hits an install issue, all other charts stop being updated as well.
So, it wasn’t an ideal way to handle it.

In that situation ArgoCD comes to the rescue. ArgoCD allows controlling the update of each chart as a dedicated process, where an issue in one chart will not block updates of other charts. Helm charts are first-class citizens in ArgoCD, and most of their features are supported.

How does it look in ArgoCD?

To try it out, let’s start with a simple Application which installs the Prometheus Helm chart with a custom configuration.

The Prometheus Helm chart can be found at https://prometheus-community.github.io/helm-charts

The custom configuration is stored in our internal project https://git.example.com/org/value-files.git

Thanks to multi-source support in Application, we can use the official chart and our custom configuration together, as in the following example:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  sources:
  - repoURL: 'https://prometheus-community.github.io/helm-charts'
    chart: prometheus
    targetRevision: 15.7.1
    helm:
      valueFiles:
      - $values/charts/prometheus/values.yaml
  - repoURL: 'https://git.example.com/org/value-files.git'
    targetRevision: dev
    ref: values

Here `$values` is a reference to the root folder of the values-files repository. That values file overrides the default Prometheus settings, so we can configure it for our needs.

What’s next?

There are several distinct features which Helm managers support, so your next step is to check whether ArgoCD covers all your needs.

One notable feature is secrets support. Many managers support the SOPS format for secrets, whereas ArgoCD doesn’t give you any solution, so it’s up to you how to manage secrets.

Another important feature is the order of execution, which can be an important part of your Helm manager setup. ArgoCD doesn’t have a built-in replacement, so you have to rely on the Helm format to support dependencies. One way is to build umbrella charts for complex applications.

The most important change is the GitOps style of deployment. ArgoCD doesn’t run a CI/CD pipeline, so you don’t have a feature like helm diff to preview changes before applying them. It applies them as soon as they become available in the linked git repository.

Conclusions

  1. ArgoCD can replace Helm managers, but it strongly depends on your project needs.
  2. ArgoCD introduces new challenges, like secrets management and order of execution.
  3. It introduces a GitOps deployment style and replaces the usual CI/CD pipeline, so new “quality gates” need to be built to be ready for a production environment.

Path to Staff Engineer role


Someday in your career you take time to think: what’s next for me? What is my next challenge to grow on the career ladder? If your answer is Staff Engineer, Tech Lead, Team Lead or Principal Engineer, then you are on the way to a Staff-plus Engineer role.

As a Staff-plus Engineer myself, I want to share a few tips with you.

Start working on your Staff Engineer package

  1. What’s your Staff-level project?
    • What did you do and what was the impact?
    • What behavior did you demonstrate and how complex were the projects?
  2. Link your Design Documents, RFCs and Proposals to support your package with your design and architecture contributions.
  3. Can you quantify the impact of your projects?
    • Did it help to increase revenue?
    • Save costs?
  4. What glue work did you do for the organization?
    • What’s the impact of the glue work?
  5. Who have you mentored and through what accomplishments?

Sharpen your soft skills

  1. Communication.
    • Keep people informed and connected.
  2. Negotiation
    • Be ready to resolve difficult situations
  3. Presentation
    • Know your audience. Learn one or two presentation frameworks.
  4. Networking
    • Don’t be shy about getting in touch with other Staff-plus engineers in your company.
    • They are the best people to help you navigate the role inside the company.

Learn Staff Engineer tools

  1. Leadership
    • Become a problem solver. Be a visionary in your area. Stay up to date with technologies in your area.
  2. Planning and Goal orientation
    • Have a vision of what you do next in one or another area.
    • Take an active part in planning meetings.
  3. Collaboration and contribution
    • Propose new Decision docs, Proposals and RFCs. Make sure they are reviewed and discussed.
    • Implement a POC to demo the idea.
  4. Team work and mentorship
    • Help your peers. Be a problem solver. Be visible. Become a go to person.
    • Try on a mentorship role.

Summary

Getting a Staff Engineer role can take months or years to accomplish. However, it’s crucial to view this milestone not as the ultimate destination, but rather as a guiding roadmap for your career development. It’s not the title itself that holds the most significance, but rather the daily challenges we face and the positive impact we make as we progress on our journey.

Resources

  1. About glue work.
  2. Staff Engineer book