120 Days of AWS EKS in Staging

Felix Georgii wakeboarding at Wake Crane Project in Pula, Croatia on September 25, 2016

My journey with Kubernetes started with Google Kubernetes Engine then one year later with self managed kuberntes and then with migration to Amazon EKS.

EKS as a managed kubernetes cluster is not 100% managed. Core tools didn’t work as expcted. Customers expectation was not aligned with functions provided. Here I have summarized all our experience we gained by running EKS cluster in Staging.

To run EKS you still have to:

  • Prepare network layer: VPC, subnets, firewalls…
  • Install worker nodes
  • Periodically apply security patches on workers nodes
  • Monitor worker nodes health by install node problem detector and monitoring stack
  • Setup security groups and NACLs
  • and more

EKS Staging how to?

EKS Setup

  • Use terraform EKS module or eksctl to make installation and maintenance easier.

EKS Essentials

  • Install node problem detector to monitor for unforeseen kernel or docker issues
  • Scale up kube-dns to two or more instances
  • See more EKS core tips in 90 Days EKS in Production

EKS Autoscaling

  • Kubernetes cluster autoscaling is no doubt must have addition to EKS toolkit. Scale your cluster up and down to 0 instances if you wish. Base your scaling on cluster state of Pending/Running pods to get maximum from it.
  • Kubernetes custom metrics, node exporter and kube state metrics is must have to enable horizonal pod autoscaling based on build in metrics like cpu/memory and as well on application specific metrics like request rate or data throughput.
  • Prometheus and cadvisor is another addition you would need to enable metrics collection

Ingress controller

  • Istio one of the most advanced, but breaking changes and beta status might introduce hard to debug bugs
  • Contour looks like good replacement to Istio. It didn’t have that good community support as istio, but stable enough and has quite cool CRD IngressRoute which makes Ingress fun to use
  • Nginx ingress is battle tested and has the best support from community. Have huge number of features, so is a good choice to setup the most stable environment

Statefull applications

  • Ensure you have enough nodes in each AZ where data volumes are. Good start is to create dedicated node group for each AZ with minimum number of nodes needed.
  • Ensure persistent volume claim(PVC) is created in desired AZ. Create dedicated storage class for specific AZ you need PVC to be in. See allowedTopologies in following example.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-eu-west1-a
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - eu-west1-a

Summary

EKS is a good managed Kubernetes service. Some of mentioned tasks are common for all Kubernetes platforms, but there is a lot of space to grow for the better service. The burden for maintenance is still quite high, but fortunately Kubernetes ecosystem has a lot of opensource tools to easy it.

Have fun!

Terraform remote state and state locking

Terraform remote state and state locking is important part in team collaboration. What are challenges when working on Terraform in a team:
1. how to synchronize terraform state between people
2. how to avoid collisions of running terraform at the same time

Terraform remote state

Terraform remote state is a mechanism to share state file by hosting it on a shared resource like aws s3 bucket or consul server.

Example of storing state in s3 bucket.

terraform {
  backend "s3" {
    bucket         = "mybucket-terraform-state-file"
    key            = "example/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
  }
}

Bucket have to be created beforehand. You the key to separate states from difference modules and projects.

Terraform state locking

Terraform locking state isolate state changes. As soon as lock is acquired by terraform plan or apply no other terraform plan/apply command will succeed until lock is released.

To store lock in dynamodb table you need:
– Create dynamodb table in your aws account in the same region as specified in your terramform backend configuration (us-east-1 in our case)
– primary key must have a name LockID without it locking will not work

terraform {
  backend "s3" {
    bucket         = "mybucket-terraform-state-file"
    key            = "example/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform_example_lock"
  }
}

Note that terraform provide a way to disable locking from command line using -lock=false flag, but it is not recommended.

One improvement is to set “key” keyword as variable and base it on module or project, so you don’t have to set it manually. One caveat with that approach is to make sure they key is unique across projects.

Best,
Iaroslav

AWS best practices for Lambda functions in Production

Hey folks,

just a month ago I have been involved in AWS project based on Lambda functions. In this article I will explain what I learned so far and how to create production Lambda AWS environment with best practices in mind.

I will start from top level and will explain everything you need to have basic infrastructure supporting your Lambda functions and other applications in your cloud.

VPC

First, you need to created dedicated VPC and reserve range of IPs which doesn’t conflict with your other networks in case you would need to pair them together. As a general rule you should never use default VPC for production needs.
Create a security group which only allow 80 and 443 incoming traffic.

Subnets

You would need at least 4 subnets, two private and two public. Each type of subnet have to split in at least two different availability zones.
Public subnet have to contain AWS services endpoints and your servers which needs to have direct connection to internet like ELB, API gateway endpoints or bastion host (your ssh jump server).
Private subnet have to contain all your infrastructure servers like web servers, database server or backend applications.

Note that You should never place your infrastructure servers in public subnets.

Internet gateway and NAT

To function properly your VPC have to be attached to internet gateway and your private subnet should have NAT service enabled.

MySQL

For the database I use MySQL RDS. You need to disable public access to the instance and deploy it into private subnet. In security group add port 3306 for incoming connections and only from internal IP range. So, we have double protection here with security group and internal dns name for database.
There are a lot of best practices of how to setup production ready mysql instance, so I will skip most of it, but what you definitely need is to have read replica and shadow copy enabled. Make sure you set maintenance window which is right for you.

Lambda functions

To have access to our private database Lambda functions needs to be deployed inside the same VPC in private subnets. To setup https endpoints for lambda functions you would need to attach API gateway. In Lambda security groups add ports 80 and 443 for incoming connections.

That’s pretty much it, but very often you will have other web applications running in your vpc and to route traffic properly between Lambda and other apps you would need some web proxy like nginx.

Nginx

To have common entry point for your web applications and Lambda function Nginx is the best way to go. There is a new possibility to use ELB for that, but it isn’t good enough yet.

To have reliable and secure setup of nginx you would need to use common pattern of AWS which include: ELB, Autoscaling group, Launch configuration and security groups.

On the configuration side nginx will proxy traffic to Lambda functions through API gateway.

Elastic load balancer

Here you need to decide what kind of ELB suits your needs. I choose ELB with HTTPS support which provide SSL termination.  In the ELB security group I added ports 80 and 443 for all incoming traffic.

Launch configuration

Within Launch configuration you need to define what kind of instance you want to launch when autoscaling is trigger in.

Autoscaling group

ASG define what is desired number of instance you want to run at any given moment. Using metrics such as CPU you can setup it to scale up or down to desired maximum or minimum number of instances.

Almost there!

Last step is to connect ELB with ASG and with Launch configuration!

Note I have skipped setting up of Target group and health checks, but they are pretty much basics.

That’s it!

Now you have a good start to develop with AWS Lambda in conjunction with general approach of web tier architecture.

What’s next?

Second part of the topic is to setup CI and automation. Next time I will write how to code infrastructure with terraform, create nginx image with packer and run configuration management with ansible.