Check out my Quora answer on the topic “As a software engineer, how do I shift my career to devops?”
The Application Launch checklist is aimed at Devops, Sysops and anyone whose job is to keep a website available and reliable.
The checklist works best for applications that are going live in the near future, but it is also useful for validating your Devops processes for applications that are already running.
This checklist is compiled from notes on the Launch Checklist for Google Cloud Platform. It mostly targets Devops work routines and, in a nutshell, explains the first necessary Devops steps in launching an application.
Software Architecture Documentation
- Create an Architectural Summary. Include an overall architectural diagram, a summary of the process flows, and details of the service interaction points.
- List and describe how each service is used. Include use of any 3rd-party APIs.
- Make it easily accessible and available, ideally as wiki pages.
Builds and Releases
- Document your build and release, configuration, and security management processes.
- Automate the build process. Include automated testing and packaging.
- Automate the release process to move packages between environments. Include rollback functionality.
- Version your configuration and manage it with a configuration management system such as Saltstack, Puppet or Ansible.
- Simulate build and release failures. Are you able to roll back effectively? Is the process documented?
- Document your routine backup, regular maintenance, and disaster recovery processes.
- Test your restore process with real data. Determine the time required for a full restore and reflect this in the disaster recovery processes.
- Automate as much as possible.
- Simulate major outages and test your disaster recovery processes.
- Simulate failures of individual services to test your incident recovery process.
- Document and define your system monitoring and alerting processes.
- Validate that your system monitoring and alerting are sufficient and effective.
There are two ways to extract fields:
- By default, Splunk recognises the “access_combined” log format, which is the default format for Nginx. If that is your case, congratulations: there is nothing for you to do!
- For a custom log format you will need to create a regular expression. Splunk has a built-in user interface for extracting fields, or you can provide the regular expression manually.
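If you go the manual route, it helps to prototype the expression outside Splunk first. Below is a minimal Python sketch, assuming a log line in the standard combined layout. The sample line and its values are made up, and apart from clientip, uri and status (which the queries in this post rely on) the group names are only illustrative. Note that Python writes named groups as `(?P<name>...)`, while Splunk's PCRE syntax is `(?<name>...)`.

```python
import re

# Made-up sample line in the nginx/Apache "combined" layout.
SAMPLE = ('203.0.113.7 - - [10/Oct/2016:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 612 "-" "curl/7.47.0"')

# Each named group becomes one extracted field.
COMBINED = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

def extract_fields(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

fields = extract_fields(SAMPLE)
print(fields["clientip"], fields["uri"], fields["status"])
```

A custom log format would need its own pattern, for example extra groups for request_time or upstream_response_time if your format logs them.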
Website traffic over time and error rate
timechart count(status) span=1m by status
For response time I suggest using the 20th, 85th and 95th percentiles as metrics.
You could also consider an average response time metric, but a low average response time does not prove that the website is OK, because a few very slow requests can hide behind many fast ones, so I am not using that metric in the query.
timechart perc20(request_time), perc85(request_time), perc95(request_time) span=1m
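To see why percentiles say more than the average, here is a small Python sketch with made-up request times. It uses a simple nearest-rank percentile, which is an assumption for illustration; Splunk's perc functions may compute the value differently, especially on large data sets.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, so there are no floating point surprises.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical request times in seconds: 90 fast requests and 10 very slow ones.
request_times = [0.1] * 90 + [5.0] * 10

mean = sum(request_times) / len(request_times)
p20 = percentile(request_times, 20)
p85 = percentile(request_times, 85)
p95 = percentile(request_times, 95)

print("mean:", round(mean, 2))
print("p20, p85, p95:", p20, p85, p95)
```

With this sample the mean stays well under a second, while the 95th percentile lands on the 5 second requests that one in ten visitors actually experienced.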
Traffic by IP
top limit=20 clientip
Top error pages
Top 50x error pages
search status >= 500 | stats count(status) as cnt by uri, status | sort cnt desc
Top 40x error pages
search status >= 400 AND status < 500 | stats count(status) as cnt by uri, status | sort cnt desc
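If the stats syntax is new to you, the two error queries above boil down to "filter by status range, count per (uri, status) pair, sort by count". A rough Python equivalent, using made-up events and a hypothetical helper name, looks like this. Unlike the first query, the sketch also puts an upper bound of 600 on server errors.

```python
from collections import Counter

# Hypothetical parsed events; only the uri and status fields matter here.
events = [
    {"uri": "/checkout", "status": 502},
    {"uri": "/checkout", "status": 502},
    {"uri": "/search", "status": 500},
    {"uri": "/missing", "status": 404},
    {"uri": "/missing", "status": 404},
    {"uri": "/missing", "status": 404},
    {"uri": "/", "status": 200},
]

def top_errors(events, low, high):
    """Count events with low <= status < high per (uri, status), most frequent first."""
    counts = Counter(
        (e["uri"], e["status"]) for e in events if low <= e["status"] < high
    )
    return counts.most_common()

server_errors = top_errors(events, 500, 600)
client_errors = top_errors(events, 400, 500)
print(server_errors)
print(client_errors)
```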
Number of timeouts (>30s) per upstream
Timeouts can be a symptom of slow application performance, insufficient system resources, or simply an upstream server that is down.
search upstream_response_time >= 30 | stats count(upstream_response_time) as upstreams by upstream
Most time consuming upstreams
stats sum(upstream_response_time), count(upstream) by upstream
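The two upstream queries can be read as one pass over the events that keeps a running total per upstream. Here is a hedged Python sketch of that idea; the upstream addresses and timings are invented, and the 30 second threshold is taken from the timeout query above.

```python
from collections import defaultdict

TIMEOUT_SECONDS = 30  # same threshold as the timeout query

# Hypothetical parsed events.
events = [
    {"upstream": "10.0.0.1:8080", "upstream_response_time": 0.2},
    {"upstream": "10.0.0.1:8080", "upstream_response_time": 30.0},
    {"upstream": "10.0.0.2:8080", "upstream_response_time": 1.5},
    {"upstream": "10.0.0.2:8080", "upstream_response_time": 2.5},
]

def summarize(events):
    """Per upstream: total response time, number of requests, number of timeouts."""
    totals = defaultdict(lambda: {"total_time": 0.0, "requests": 0, "timeouts": 0})
    for e in events:
        row = totals[e["upstream"]]
        row["total_time"] += e["upstream_response_time"]
        row["requests"] += 1
        if e["upstream_response_time"] >= TIMEOUT_SECONDS:
            row["timeouts"] += 1
    return dict(totals)

summary = summarize(events)
# Most time consuming upstreams first.
for upstream, row in sorted(summary.items(), key=lambda kv: kv[1]["total_time"], reverse=True):
    print(upstream, row)
```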
Splunk commands like timechart, stats and top are your best friends for data aggregation. They are like Unix tools: the more tools you know, the easier it is to build powerful queries.