Grafana: a new look

As much as I like managed monitoring solutions, it’s hard not to be excited about setting up your own platform on the Prometheus and Grafana stack and saving the company some money.

After 5 years of using proprietary solutions, I am on my way back into the Grafana world.

Here I want to wrap up what I have learned about Grafana so far.

Big ecosystem of tools

The biggest change for me is the ecosystem of projects, which covers almost every imaginable monitoring need:

  • For Logs: Loki and Promtail
  • For Traces: Tempo
  • For Profiling: Pyroscope
  • For OpenTelemetry collection: Alloy
  • For eBPF auto-instrumentation: Beyla
  • For Synthetics: the Prometheus Blackbox exporter and k6
  • and many more

Automated deployment process

A lot has been done on the deployment side: using Helm charts, you can get the stack up and running in minutes.

All that comes with scalability and high availability in mind.

Though you still need to connect and customize each component, that can often be just a Helm configuration change.

Huge ecosystem of components to simplify instrumentation

It’s clear that your code has to be instrumented to be observable. Grafana and its open source community have developed a number of tools to make that easy and simple:

  1. Compatible receivers for big providers like Datadog
  2. New discovery methods like eBPF
  3. No-code instrumentation options, such as sidecar containers

And a lot more: see the Alloy components section to get started.
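
To make the instrumentation point concrete, here is a minimal sketch of manual OpenTelemetry instrumentation in Python that exports traces over OTLP to a collector such as Alloy. It is only an illustration, not code from any Grafana component: the endpoint alloy:4317 and the service name demo-app are assumptions, and Alloy receives OTLP only if its OTLP receiver is configured to listen there.

# Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/gRPC; "alloy:4317" is an assumed collector address
provider = TracerProvider(resource=Resource.create({"service.name": "demo-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="alloy:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-request"):
    pass  # application work that will show up as a span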

A fresh new UI

Grafana Logs and Metrics get a new look.

You can see multiple logs and metrics on the same page.

Summary

It’s cool, it’s fresh and it’s a lot of fun to use!

Filebeat configuration for Kubernetes

filebeat.autodiscover:
  providers:
  - type: kubernetes
    node: ${NODE_NAME}
    # Use pod annotations (co.elastic.logs/*) as configuration hints
    hints.enabled: true
    # Fallback configuration for pods that carry no hints
    hints.default_config:
      type: container
      paths:
        - /var/log/containers/*${data.kubernetes.container.id}.log

What’s so cool about the above configuration

Filebeat Autodiscover

When you run applications in containers, they become moving targets for the monitoring system. Autodiscover allows you to track them and adapt settings as changes happen.

The Kubernetes autodiscover provider watches for Kubernetes nodes, pods, and services to start, update, and stop.
It also recognizes many additional labels and statuses related to Kubernetes objects.

Hints-based autodiscover

Filebeat supports autodiscover based on hints from the provider. The hints system looks for hints in Kubernetes Pod annotations or Docker labels that have the prefix co.elastic.logs. As soon as the container starts, Filebeat will check if it contains any hints and launch the proper config for it. Hints tell Filebeat how to get logs for the given container.

The container input

Use the container input to read container log files.

This input searches for container logs under the given path and parses them into common message lines, extracting timestamps too. Everything happens before line filtering, multiline, and JSON decoding, so this input can be used in combination with those settings.

Conclusions

Kubernetes log autodiscovery and JSON decoding provide very good visibility into the log stream. Labels and JSON log fields are properly named and parsed. Using Elasticsearch and Kibana, we can search through logs with simple queries and filter by fields.


Integrating Flask with Jaeger tracing on Kubernetes

Distributed applications and microservices require a high level of observability. In this article we will integrate the Flask micro framework with the Jaeger tracing tool. All code will be deployed to a Kubernetes minikube cluster.

Flask

Let’s build a simple task manager service using the Flask framework.

Code

tasks.py

from flask import Flask, jsonify

app = Flask(__name__)

# Static task list served by the API
tasks = {"tasks": [
    {"name": "task 1", "uri": "/task1"},
    {"name": "task 2", "uri": "/task2"}
]}

@app.route('/')
def root():
    "Service root"
    return jsonify({"url": "/tasks"})

@app.route('/tasks')
def task_list():
    "Tasks list"
    return jsonify(tasks)

if __name__ == '__main__':
    # Start up
    app.run(debug=True, host='0.0.0.0', port=5000)
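
The excerpt stops before the Jaeger part, so here is a rough sketch, not the article’s code, of how a Flask app is typically wired to Jaeger using the jaeger-client and Flask-OpenTracing packages. The agent host jaeger-agent and the always-on sampler are assumptions for a minikube-style demo; the snippet is meant to be appended to tasks.py, where app is defined.

# Sketch only: assumes `pip install jaeger-client Flask-OpenTracing` and a
# Jaeger agent reachable in the cluster as "jaeger-agent" (an assumption).
from jaeger_client import Config
from flask_opentracing import FlaskTracing

def init_tracer(service_name='tasks'):
    # const sampler = report every span; fine for a demo, not for production
    config = Config(
        config={
            'sampler': {'type': 'const', 'param': 1},
            'local_agent': {'reporting_host': 'jaeger-agent'},
            'logging': True,
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()

tracer = init_tracer()
# Trace every incoming HTTP request handled by the Flask app defined above
tracing = FlaskTracing(tracer, True, app)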
Continue reading Integrating Flask with Jaeger tracing on Kubernetes

Making a simple Splunk Nginx dashboard

As a DevOps guy I often do incident analysis, post-deployment monitoring, and routine log checks. If you are also using Splunk like me, then let me show you a few effective Splunk commands for Nginx log monitoring.

Extract fields

For the commands to work, the Nginx log fields have to be extracted into variables.
There are two ways to extract fields:

  1. By default Splunk recognizes the “access_combined” log format, which is the default format for Nginx. If that is your case, congratulations: there is nothing for you to do!
  2. For a custom log format you will need to create a regular expression. Splunk has a built-in field extractor UI, or you can provide the regular expression manually.

Splunk field extractor

Website traffic over time and error rate

An unexpected spike in traffic or in the error rate is always the first thing to look for. The following command builds a time chart of response codes. Codes 200/300 are your normal traffic, while 400/500 are errors.

timechart count(status) span=1m by status

Website traffic in Splunk

Response time

How do you know if your website is running slowly?
For response time I suggest using the 20th, 85th, and 95th percentiles as metrics.
You could also consider an average response time metric, but a low average response time doesn’t prove the website is OK, so I am not using it in the query.

timechart perc20(request_time), perc85(request_time), perc95(request_time) span=1m

Response time in Splunk

Traffic by IP

Checking which IPs are the most active is a good way to spot bad guys or a misbehaving bot.

top limit=20 clientip

Traffic by IP with splunk

Top error pages

Looking for the pages that produce the most errors, like 500 Internal Server Error, or for 404 Not Found pages? The following two queries give you exactly that information.
Top 5xx error pages

search status >= 500 | stats count(status) as cnt by uri, status | sort cnt desc

Top 4xx error pages

search status >= 400 AND status < 500 | stats count(status) as cnt by uri, status | sort cnt desc

TOP nginx error urls with Splunk

Number of timeouts (>30s) per upstream

When you are using Nginx as a proxy server, it is very useful to see whether any of the upstreams are hitting timeouts.
Timeouts can be a symptom of slow application performance, insufficient system resources, or simply an upstream server being down.

search upstream_response_time >= 30 | stats count(upstream_response_time) as upstreams by upstream

Splunk get timeout nginx upstreams

Most time consuming upstreams

The most time-consuming upstreams show which servers are already overloaded with requests and give you a hint that the application needs to be scaled.

stats sum(upstream_response_time), count(upstream) by upstream

Most time consuming upstreams

In conclusion

Splunk commands like timechart, stats, and top are your best friends for data aggregation. They are like Unix tools: the more of them you know, the easier it is to build powerful queries.