Multi-provider DNS management with Terraform and Pulumi

The Problem

Every DNS provider is very specific about how it creates DNS records. Using Terraform or Pulumi doesn't guarantee multi-provider support out of the box.

One example: AWS Route53 supports multiple IP values bound to the same record name, whereas Cloudflare requires a dedicated record for each IP.

These API differences make it harder to write code that works across multiple providers.

For AWS Route53 a single record can be created like this:

mydomain.local: IP1, IP2, IP3

For Cloudflare it would be 3 different records:

mydomain.local: IP1
mydomain.local: IP2
mydomain.local: IP3

Solution 1: Use the flexibility of a programming language with Pulumi

Pulumi has the upper hand here, since you can use the power of a programming language to handle custom logic.

DNS data structure:

mydomain1.com: 
 - IP1
 - IP2 
 - IP3
mydomain2.com:
 - IP4
 - IP5 
 - IP6
mydomain3.com: 
 - IP7
 - IP8 
 - IP9

Using Python or JavaScript we can expand this structure for the Cloudflare provider or keep it as-is for AWS Route53.

In the Cloudflare case we create a new record for each IP:

import pulumi
import pulumi_cloudflare as cloudflare
import yaml

# Load the configuration from a YAML file
yaml_file = "dns_records.yaml"
with open(yaml_file, "r") as file:
    dns_config = yaml.safe_load(file)

# Cloudflare Zone ID (Replace with your actual Cloudflare Zone ID)
zone_id = "your_cloudflare_zone_id"

# Iterate through domains and their associated IPs to create A records
for domain, ips in dns_config.items():
    if isinstance(ips, list):  # Ensure it's a list of IPs
        for ip in ips:
            record_name = domain
            cloudflare.Record(
                f"{record_name}-{ip.replace('.', '-')}",
                zone_id=zone_id,
                name=record_name,
                type="A",
                value=ip,
                ttl=3600,  # Set TTL (adjust as needed)
            )

# Export the created records
pulumi.export("dns_records", dns_config)

Since AWS Route53 supports a list of IPs, the equivalent code looks like this:

import pulumi_aws as aws

# Route53 hosted zone ID (replace with your actual hosted zone ID)
hosted_zone_id = "your_route53_hosted_zone_id"

for domain, ips in dns_config.items():
    if isinstance(ips, list) and ips:  # Ensure it's a non-empty list of IPs
        aws.route53.Record(
            f"{domain}-record",
            zone_id=hosted_zone_id,
            name=domain,
            type="A",
            ttl=300,  # Set TTL (adjust as needed)
            records=ips,  # AWS Route 53 supports multiple IPs in a single record
        )

Solution 2: Use the Terraform for_each loop

It's quite possible to achieve the same with Terraform, starting with version 0.12, which introduced for_each for resources and dynamic blocks.

Same data structure:

mydomain1.com: 
  - 192.168.1.1
  - 192.168.1.2
  - 192.168.1.3
mydomain2.com:
  - 10.0.0.1
  - 10.0.0.2
  - 10.0.0.3
mydomain3.com: 
  - 172.16.0.1
  - 172.16.0.2
  - 172.16.0.3

Terraform example for AWS Route53

provider "aws" {
  region = "us-east-1"  # Change this to your preferred region
}

variable "hosted_zone_id" {
  type = string
}

variable "dns_records" {
  type = map(list(string))
}

resource "aws_route53_record" "dns_records" {
  for_each = var.dns_records

  zone_id = var.hosted_zone_id
  name    = each.key
  type    = "A"
  ttl     = 300
  records = each.value
}

Quite simple using a for_each loop, but this will not work for Cloudflare because of the difference mentioned above: we need a new record for each IP.

Terraform example for Cloudflare

# Expand each domain into one (domain, IP) pair per IP, then create one record per pair
locals {
  cloudflare_records = flatten([
    for domain, ips in var.dns_records : [
      for ip in ips : { domain = domain, ip = ip }
    ]
  ])
}

resource "cloudflare_record" "dns_records" {
  for_each = { for r in local.cloudflare_records : "${r.domain}-${r.ip}" => r }

  zone_id = var.cloudflare_zone_id
  name    = each.value.domain
  type    = "A"
  value   = each.value.ip
  ttl     = 3600
  proxied = false  # Set to true if using Cloudflare proxy
}

Conclusions

  1. Pulumi: Flexible and easy to start. Data is separate from code, making it easy to add providers or change logic.
  2. Terraform: Less complex and easier to support long-term, but depends on the data format.
  3. Both solutions require programming skills or expertise in the Terraform language.

Build Kubernetes control plane image with Packer

The steps to prepare a single control plane image are quite simple:

  • Prepare Docker and Kubernetes packages and settings
  • Execute the kubeadm bootstrap script when the EC2 instance starts up for the first time
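
A minimal Packer HCL sketch of such an image build, assuming an Ubuntu base AMI and a hypothetical install-kubernetes.sh script that adds the Docker and Kubernetes package repositories, could look like this:

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "control_plane" {
  region        = "us-east-1"
  instance_type = "t3.medium"
  ami_name      = "k8s-control-plane-${local.timestamp}"
  ssh_username  = "ubuntu"

  source_ami_filter {
    filters = {
      name = "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"
    }
    owners      = ["099720109477"]  # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.control_plane"]

  # Bake Docker and Kubernetes packages into the image;
  # the kubeadm bootstrap itself runs later from EC2 user data
  provisioner "shell" {
    script = "./install-kubernetes.sh"
  }
}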

One unanswered question remains: how to add additional control plane nodes and worker nodes, which require tokens and certificates to be present when joining the cluster?

Continue reading Build Kubernetes control plane image with Packer

Practical guide to Kubernetes Certified Administration exam

I have published a practical guide to the Kubernetes Certified Administration exam: https://github.com/vorozhko/practical-guide-to-kubernetes-administration-exam

Covered topics so far are:

Share your efforts

If you are also working on preparation for the Kubernetes Certified Administration exam, let's combine our efforts by sharing the practical side of the exam.

Disaster recovery of single node Kubernetes control plane

Overview

There are many possible root causes for a control plane becoming unavailable. Let's review the most common scenarios and mitigation steps.

Mitigation steps in this article build around AWS public cloud features, but all popular public cloud offerings have similar functionality.

Apiserver VM shutdown or apiserver crashing

Results

  • unable to stop, update, or start new pods, services, or replication controllers
  • existing pods and services should continue to work normally, unless they depend on the Kubernetes API
Continue reading Disaster recovery of single node Kubernetes control plane

120 Days of AWS EKS in Staging

My journey with Kubernetes started with Google Kubernetes Engine, continued a year later with self-managed Kubernetes, and then with a migration to Amazon EKS.

EKS as a managed Kubernetes cluster is not 100% managed. Core tools didn't work as expected, and customer expectations were not aligned with the functions provided. Here I have summarized the experience we gained by running an EKS cluster in Staging.

To run EKS you still have to:

  • Prepare network layer: VPC, subnets, firewalls…
  • Install worker nodes
  • Periodically apply security patches on worker nodes
  • Monitor worker node health by installing the node problem detector and a monitoring stack
  • Set up security groups and NACLs
  • and more
Continue reading 120 Days of AWS EKS in Staging

Terraform remote state and state locking

Terraform remote state and state locking are an important part of team collaboration. The challenges when working on Terraform in a team are:
1. how to synchronize Terraform state between people
2. how to avoid collisions from running Terraform at the same time

Terraform remote state

Terraform remote state is a mechanism to share the state file by hosting it on a shared resource like an AWS S3 bucket or a Consul server.

Example of storing state in an S3 bucket:

terraform {
  backend "s3" {
    bucket         = "mybucket-terraform-state-file"
    key            = "example/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
  }
}

The bucket has to be created beforehand. Use the key to separate the states of different modules and projects.

Terraform state locking

Terraform state locking isolates state changes. As soon as a lock is acquired by terraform plan or apply, no other terraform plan/apply command will succeed until the lock is released.

To store the lock in a DynamoDB table you need to:
– create a DynamoDB table in your AWS account, in the same region as specified in your Terraform backend configuration (us-east-1 in our case)
– name the primary key LockID; without it, locking will not work
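
As a sketch, the table itself can also be managed with Terraform (the table name is only an example and must match dynamodb_table in the backend block):

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform_example_lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  # The primary key must be named LockID, otherwise locking will not work
  attribute {
    name = "LockID"
    type = "S"
  }
}

Like the S3 bucket, this table is usually created ahead of time, outside of the state it protects.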

terraform {
  backend "s3" {
    bucket         = "mybucket-terraform-state-file"
    key            = "example/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform_example_lock"
  }
}

Note that Terraform provides a way to disable locking from the command line using the -lock=false flag, but it is not recommended.

One improvement is to derive the "key" from the module or project name, so you don't have to set it manually. Backend blocks cannot interpolate variables, so in practice this is done with partial backend configuration and the -backend-config option of terraform init. One caveat with that approach is making sure the key stays unique across projects.

Best,
Iaroslav

AWS best practices for Lambda functions in Production

Hey folks,

Just a month ago I got involved in an AWS project based on Lambda functions. In this article I will explain what I have learned so far and how to create a production Lambda AWS environment with best practices in mind.

I will start from the top level and explain everything you need for a basic infrastructure supporting your Lambda functions and other applications in your cloud.

VPC

First, you need to create a dedicated VPC and reserve a range of IPs which doesn't conflict with your other networks, in case you ever need to peer them together. As a general rule, you should never use the default VPC for production needs.
Create a security group which only allows incoming traffic on ports 80 and 443.
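
A minimal Terraform sketch of that starting point (the CIDR range and resource names are just examples):

resource "aws_vpc" "main" {
  cidr_block           = "10.20.0.0/16"  # pick a range that won't clash with networks you may peer with
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Only HTTP/HTTPS traffic is allowed in from the internet
resource "aws_security_group" "web" {
  name   = "web-ingress"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}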

Subnets

You will need at least four subnets: two private and two public. Each type of subnet has to be split across at least two different availability zones.
Public subnets host AWS service endpoints and the servers which need a direct connection to the internet, like the ELB, API Gateway endpoints or a bastion host (your SSH jump server).
Private subnets host all your infrastructure servers, like web servers, database servers or backend applications.

Note that you should never place your infrastructure servers in public subnets.
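
A sketch of the four subnets, continuing from the VPC above (availability zone names and CIDR math are examples):

locals {
  azs = ["us-east-1a", "us-east-1b"]
}

# Two public subnets, one per availability zone
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  availability_zone       = local.azs[count.index]
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  map_public_ip_on_launch = true
}

# Two private subnets for web, database and backend servers
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
}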

Internet gateway and NAT

To function properly, your VPC has to be attached to an internet gateway and your private subnets should route outbound traffic through a NAT gateway.
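
Continuing the sketch, the internet gateway and NAT gateway might look like this (the public route table towards the internet gateway is omitted for brevity):

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.main.id
}

# The NAT gateway lives in a public subnet and needs an Elastic IP
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
}

# Private subnets send outbound traffic through the NAT gateway
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}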

MySQL

For the database I use MySQL on RDS. You need to disable public access to the instance and deploy it into the private subnets. In the security group, allow incoming connections on port 3306 only from the internal IP range. So we have double protection here: the security group and the internal DNS name for the database.
There are a lot of best practices for setting up a production-ready MySQL instance, so I will skip most of them, but what you definitely need is a read replica and backups (shadow copies) enabled. Make sure you set a maintenance window which is right for you.
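
A hedged sketch of such an instance (instance size, engine details and credentials handling are illustrative; var.db_password is a placeholder):

# MySQL is only reachable on port 3306 from inside the VPC
resource "aws_security_group" "mysql" {
  name   = "mysql-internal"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 3306
    to_port     = 3306
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }
}

resource "aws_db_subnet_group" "private" {
  name       = "db-private"
  subnet_ids = aws_subnet.private[*].id
}

resource "aws_db_instance" "mysql" {
  identifier              = "app-mysql"
  engine                  = "mysql"
  instance_class          = "db.t3.medium"
  allocated_storage       = 20
  username                = "admin"
  password                = var.db_password  # keep secrets out of code
  db_subnet_group_name    = aws_db_subnet_group.private.name
  vpc_security_group_ids  = [aws_security_group.mysql.id]
  publicly_accessible     = false
  multi_az                = true
  backup_retention_period = 7   # enables automated backups
  skip_final_snapshot     = false
}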

Lambda functions

To have access to our private database, Lambda functions need to be deployed inside the same VPC, in the private subnets. To set up HTTPS endpoints for Lambda functions you need to attach an API Gateway. In the Lambda security group, add ports 80 and 443 for incoming connections.
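
A sketch of a VPC-attached function (handler, runtime, packaging and the aws_iam_role.lambda_exec execution role are placeholders):

resource "aws_security_group" "lambda" {
  name   = "lambda-web"
  vpc_id = aws_vpc.main.id

  # Allow outbound traffic to the database and, via NAT, to the internet
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lambda_function" "api" {
  function_name = "my-api-handler"
  runtime       = "python3.12"
  handler       = "app.handler"
  filename      = "lambda.zip"
  role          = aws_iam_role.lambda_exec.arn  # assumed to exist

  # Attaching the function to the private subnets gives it access to the database
  vpc_config {
    subnet_ids         = aws_subnet.private[*].id
    security_group_ids = [aws_security_group.lambda.id]
  }
}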

That's pretty much it, but very often you will have other web applications running in your VPC, and to route traffic properly between Lambda and the other apps you will need a web proxy like nginx.

Nginx

To have a common entry point for your web applications and Lambda functions, nginx is the best way to go. There is a newer option to use an ELB for that, but it isn't good enough yet.

To have a reliable and secure nginx setup, you need to use a common AWS pattern which includes: an ELB, an Auto Scaling group, a Launch Configuration and security groups.

On the configuration side, nginx will proxy traffic to the Lambda functions through the API Gateway.

Elastic load balancer

Here you need to decide what kind of ELB suits your needs. I chose an ELB with HTTPS support, which provides SSL termination. In the ELB security group I allowed ports 80 and 443 for all incoming traffic.

Launch configuration

Within the Launch Configuration you define what kind of instance you want to launch when autoscaling kicks in.

Autoscaling group

The ASG defines the desired number of instances you want to run at any given moment. Using metrics such as CPU, you can set it up to scale up or down between a desired minimum and maximum number of instances.

Almost there!

The last step is to connect the ELB with the ASG and the Launch Configuration!
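
A sketch of how these three pieces tie together (the AMI, certificate ARN and instance sizes are placeholders):

resource "aws_elb" "nginx" {
  name            = "nginx-elb"
  subnets         = aws_subnet.public[*].id
  security_groups = [aws_security_group.web.id]

  # SSL termination happens on the load balancer
  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 80
    instance_protocol  = "http"
    ssl_certificate_id = var.certificate_arn
  }
}

resource "aws_launch_configuration" "nginx" {
  name_prefix   = "nginx-"
  image_id      = var.nginx_ami_id  # e.g. an image baked with Packer
  instance_type = "t3.small"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "nginx" {
  min_size             = 2
  max_size             = 4
  desired_capacity     = 2
  vpc_zone_identifier  = aws_subnet.private[*].id
  launch_configuration = aws_launch_configuration.nginx.name

  # Connect the ASG to the ELB so new instances register automatically
  load_balancers = [aws_elb.nginx.name]
}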

Note that I have skipped setting up the target group and health checks, but they are pretty basic.

That’s it!

Now you have a good starting point to develop with AWS Lambda in conjunction with a general web-tier architecture.

What’s next?

The second part of the topic is to set up CI and automation. Next time I will write about how to code the infrastructure with Terraform, create an nginx image with Packer and run configuration management with Ansible.