AWS best practices for Lambda functions in Production

Hey folks,

just a month ago I have been involved in AWS project based on Lambda functions. In this article I will explain what I learned so far and how to create production Lambda AWS environment with best practices in mind.

I will start from top level and will explain everything you need to have basic infrastructure supporting your Lambda functions and other applications in your cloud.

VPC

First, you need to created dedicated VPC and reserve range of IPs which doesn’t conflict with your other networks in case you would need to pair them together. As a general rule you should never use default VPC for production needs.
Create a security group which only allow 80 and 443 incoming traffic.

Subnets

You would need at least 4 subnets, two private and two public. Each type of subnet have to split in at least two different availability zones.
Public subnet have to contain AWS services endpoints and your servers which needs to have direct connection to internet like ELB, API gateway endpoints or bastion host (your ssh jump server).
Private subnet have to contain all your infrastructure servers like web servers, database server or backend applications.

Note that You should never place your infrastructure servers in public subnets.

Internet gateway and NAT

To function properly your VPC have to be attached to internet gateway and your private subnet should have NAT service enabled.

MySQL

For the database I use MySQL RDS. You need to disable public access to the instance and deploy it into private subnet. In security group add port 3306 for incoming connections and only from internal IP range. So, we have double protection here with security group and internal dns name for database.
There are a lot of best practices of how to setup production ready mysql instance, so I will skip most of it, but what you definitely need is to have read replica and shadow copy enabled. Make sure you set maintenance window which is right for you.

Lambda functions

To have access to our private database Lambda functions needs to be deployed inside the same VPC in private subnets. To setup https endpoints for lambda functions you would need to attach API gateway. In Lambda security groups add ports 80 and 443 for incoming connections.

That’s pretty much it, but very often you will have other web applications running in your vpc and to route traffic properly between Lambda and other apps you would need some web proxy like nginx.

Nginx

To have common entry point for your web applications and Lambda function Nginx is the best way to go. There is a new possibility to use ELB for that, but it isn’t good enough yet.

To have reliable and secure setup of nginx you would need to use common pattern of AWS which include: ELB, Autoscaling group, Launch configuration and security groups.

On the configuration side nginx will proxy traffic to Lambda functions through API gateway.

Elastic load balancer

Here you need to decide what kind of ELB suits your needs. I choose ELB with HTTPS support which provide SSL termination.  In the ELB security group I added ports 80 and 443 for all incoming traffic.

Launch configuration

Within Launch configuration you need to define what kind of instance you want to launch when autoscaling is trigger in.

Autoscaling group

ASG define what is desired number of instance you want to run at any given moment. Using metrics such as CPU you can setup it to scale up or down to desired maximum or minimum number of instances.

Almost there!

Last step is to connect ELB with ASG and with Launch configuration!

Note I have skipped setting up of Target group and health checks, but they are pretty much basics.

That’s it!

Now you have a good start to develop with AWS Lambda in conjunction with general approach of web tier architecture.

What’s next?

Second part of the topic is to setup CI and automation. Next time I will write how to code infrastructure with terraform, create nginx image with packer and run configuration management with ansible.

Continuous integration and deployment with Google Cloud Builder and Kubernetes

Pipeline of Continuous Integration(CI) for containers has several basic steps. Lets see what they are:

Setup a trigger

Listen to a change in repositories(github, bitbucket) such like pull request, new tag or new branch.

It is basic step for any CI/CD tool and for google cloud builder it is pretty trivial task to setup. Check out Container Registry – Build Triggers tool in google cloud console.

Build an image

When change to repository occur we want to start build of new Docker container image for a change. Good practice is to tag new image with branch name and git reference hash. E.g. master-00covfefe

With cloud builder you face two choices: use a Dockerfile or cloudbuild.yaml file. With Dockerfile option steps are predetermined and don’t give you too much flexibility.
With cloudbuild.yaml you can customise every step of your pipeline.
In the following example first command is doing a build step using Dockerfile and second command tag new image with branch-revision pattern(remember master-00covfefe):

Push new image to Container Registry

One important note that cloudbuild.yaml file has special directive “image” which publish image to registry, but that directive only executed at the end of all steps. So, in order to perform deployment step you need to publish image as a separate step.

Deploy new image to Kubernetes

When new image is in registry it’s time to trigger deployment step. In this example it is deployment to Kubernetes cluster.
This step require Google Cloud Builder user to have Edit permissions to kubernetes cluster. In google cloud it is a user with “@cloudbuild.gserviceaccount.com” domain. You need to give that user Edit access to kubernetes using IAM console.
Second requirement is to specify zone and cluster cloudbuild.yaml using env variables. That will tell kubectl command to which cluster to deploy.

What next

At this point the CI/CD job is done. Possible next steps to improve your pipeline can be:

  1. Send notification to Slack or Hipchat to let everyone know about new version deployment.
  2. Run user acceptance tests to check that all functions perform well.
  3. Run load tests and stress tests to check that new version has no degradation in performance.

Full cloudbuild.yaml file example

Prepare Application Launch Checklist

Introduction

Application Launch checklist is aimed for Devops, Sysops and anyone whois job to make website available and reliable.
The checklist better works for applications which are going to be Live in near feature, but also useful to validate your Devops processes for already running applications.

This checklist is a complied notes from Launch Checklist for Google Cloud Platform. It is mostly targeted on Devops work routines and in a nutshell explain first and necessary Devops steps into launching applications.

Software Architecture Documentation

  • Create an Architectural Summary. Include an overall architectural diagram, a summary of the process flows, detail the service interaction points.
  • List and describe how each service is used. Include use of any 3rd-party APIs.
  • Make it easy accessible and available – the best as wiki pages.

Builds and Releases

  • Document your build and release, configuration, and security management processes.
  • Automate build process. Include automated testing and packaging.
  • Automate release process to provision package between environments. Include rollback functionality.
  • Version your configuration and put it into Configuration Management system like Saltstack, Puppet or Ansible.
  • Simulate build and release failures. Are you able to roll back effectively? Is the process documented?

Disaster recovery

  • Document your routine backup, regular maintenance, and disaster recovery processes.
  • Test your restore process with real data. Determine time required for a full restore and reflect this in the disaster recovery processes.
  • Automate as much as possible.
  • Simulate major outages and test your Disaster Recovery processes
  • Simulate individual services failure to test your incidents recovery process

Monitoring

  • Document and define your system monitoring and alerting processes.
  • Validate that your system monitoring and alerting are sufficient and effective.

Final thoughts

I can not overstate how much final outcome depend on the level of interaction between Developers, Sysops and Devops teams in your organisation.
After application will go live convert it to a training program for every new devop before she will start site support.

Making simple Splunk Nginx dashboard

As a DevOps guy I often do incident analysis, post deployment monitoring and usual logs checks. If you also is using Splunk as me when let me show for you few effective Splunk commands for Nginx logs monitoring.

Extract fileds

To make commands works Nginx log fields have to be extracted into variables.
Where are 2 ways to extract fields:

  1. By default Splunk recognise “access_combined” log format which is default format for Nginx. If it is your case congratulations nothing to do for you!
  2. For custom format of logs you will need to create regular expression. Splunk has built in user interface to extract fields or you can provide regular expression manually.

field_extractor

Website traffic over time and error rate

Unexpected spike in traffic or in error rate are always first thing to look for. Following command build a time chart with response codes. Codes 200/300 is your normal traffic and 400/500 is errors.

Website traffic in Splunk

Response time

How do you know if your website running slowly?
For response time I suggest to use 20, 85 and 95 percentile as metrics.
You also can think of average response time metric, but low average response time doesn’t show that website is OK, so I am not using that metric in the query.

Response time in Splunk

Traffic by IP

Checking which IPs are most popular is a good way to spot bad guys or misbehaving bot.

Traffic by IP with splunk

Top of error page

Looking for pages which produce most errors like 500 Internal Server Error or not found pages like 404? Following two queries give you exactly that information.
Top error pages

Top 40x error pages

TOP nginx error urls with Splunk

Number of timeouts(>30s) per upstream

When you are using Nginx as a proxy server it is very useful to see if any of upstreams are getting timeouts.
Timeouts could be a symptom for: slow application performance, not enough system resources or just upstream server is down.

Splunk get timeout nginx upstreams

Most time consuming upstreams

Most time consuming upstreams showing which of servers are already overloaded by requests and giving you a hint when application needs to be scaled

Most time consuming upstreams

In conclusion

Splunk functions like timechart, stats and top is your best friends for data aggregation. They are like unix tools – the more tools you know the more easier is to build powerful commands.