Scalable, Available, Stable, Performant, and Intelligent System Design Patterns
Updated Sep 15, 2018
A curated list of Site Reliability and Production Engineering resources.
Updated Sep 19, 2018
A curated list of Chaos Engineering resources.
Updated Aug 27, 2018
A collection of postmortem templates
Google Site Reliability Engineering book converted to audio
Updated Mar 22, 2017
A collection of SRE tools
Updated Jan 26, 2018
Calculate how much downtime should be permitted in your SLA
HTML
Updated Jun 9, 2018
A collection of ported Markdown templates from the SRE Workbook
Updated Aug 24, 2018
Role Playing Game for Incident Management Training
A combined introduction to operating systems and computer networks
Updated Feb 2, 2017
The agent for Komlog, a PaaS that helps observability teams better understand their systems.
Python
Updated Nov 14, 2017
Terraform provider for Arachnys' Cabot. Create, manage, and manipulate status checks and alerts for services.
Go
Updated Sep 15, 2017
Go
Updated Apr 24, 2018
Endpoint monitoring and DNS failover agent written in Go
Go
Updated Dec 8, 2017
Control health checks and toggle upstream node status in load balancers with ease.
Go
Updated May 22, 2017
Deterministic Subsetting as defined in the SRE book
Python
Updated May 29, 2018
Calculate the tolerable downtime of your service
HTML
Updated Jun 21, 2018
Resume of M. Adam Kendall, Software Engineer
Updated Aug 30, 2018
🔖 Daily-updated reading list for designing High Scalability 🍒, High Availability 🔥, High Stability 🗻 back-end system…
A curated list of awesome Site Reliability and Production Engineering resources.
Dev environment for SRE
Shell
Updated Sep 7, 2017
SRE Sandbox
Go
Updated Aug 1, 2017
Projects focused on learning Bash scripting, Linux, Vagrant, and Vim. See the README inside for more.
Shell
Updated Jun 29, 2018
Great resources for learning Software and Site Reliability Engineering.
Updated May 1, 2018
External Node Classifier written in Go
Go
Updated Jul 2, 2018