Dan Sinker & Dylan Richard
Basic premise: We saw a lot of big websites go down. This is a really stupid problem for news to have. It is a major problem if your site goes down when you’re the single source.
Need to approach this problem strategically rather than by throwing servers at it.
Building an Infrastructure
Had to build an infrastructure that would keep working for an entire day. It had to be fault tolerant, e.g. by using a cache.
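One way to read "fault tolerant, e.g. using a cache" is a cache that keeps serving the last good copy when the backend fails. The sketch below is only an illustration under that assumption, not the speakers' implementation; `StaleOnErrorCache`, `fetch`, and `ttl_seconds` are hypothetical names.

```python
import time


class StaleOnErrorCache:
    """Cache that serves the last good value when the backend fails."""

    def __init__(self, fetch, ttl_seconds=60, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: key -> value, may raise
        self._ttl = ttl_seconds  # how long an entry counts as fresh
        self._clock = clock      # injectable so drills and tests can control time
        self._store = {}         # key -> (value, stored_at)

    def get(self, key):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]      # fresh hit, backend not touched
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # backend is down: serve the stale copy
            raise                # nothing cached, so the failure surfaces
        self._store[key] = (value, now)
        return value
```

The design choice here is that a stale article is better than an error page; whether that holds for every kind of content is a judgment call for the newsroom.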
Had to test for resiliency, so created “Game Days” to practice failure. If very terrible things are happening, what needs to still work?
Set aside 2 weeks to build in all of the fault lines, then 2 days to finish preparing. Had a schedule for turning various things off one at a time, e.g. the master database.
Big Things We Learned about
What breaks: Identity provider wasn’t using the code
Holes in communication: Ended up having to route problems to different places to discuss them. Two failures can and will happen at the same time, rather than just one failure point.
What Failure Really Looks like: Usually what it looks like to the end user. Things broke in unpredictable ways, but you can only build in failures that you can predict.
How to Fix Things: Built collective reference on how to fix different errors. Any dev/engineer could troubleshoot
Went Through a Use Case Together: How to simulate failure, what the failure looks like, what we can learn, and how to sell practicing this inside your organization.
e.g. Chicago Sun-Times going down when the Roger Ebert obituary was posted
How do you simulate the failure?
- Traffic e.g. Bees with Machine Guns (AWS instances bombard the site), Selenium (emulate a user in the browser), Chaos Monkey (randomly kill servers)
- slow or turn off database load
- turn off cache
- taking down search cluster
- kill scaling
- turn off communication channels (e.g. HR)
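Several of the items above (slowing or turning off the database, turning off the cache) can be rehearsed in code by routing calls through a small fault-injection wrapper. This is a minimal sketch under that assumption; `FaultInjector`, `DependencyDown`, and the dependency names are hypothetical and do not come from the session or from any of the tools listed.

```python
import time


class DependencyDown(Exception):
    """Raised when a dependency has been switched off for the drill."""


class FaultInjector:
    """Wraps calls to named dependencies so a drill can slow or kill them."""

    def __init__(self, sleep=time.sleep):
        self._sleep = sleep  # injectable so tests do not really wait
        self._off = set()    # names of dependencies currently "down"
        self._delays = {}    # name -> extra latency in seconds

    def turn_off(self, name):
        self._off.add(name)

    def slow_down(self, name, seconds):
        self._delays[name] = seconds

    def restore(self, name):
        self._off.discard(name)
        self._delays.pop(name, None)

    def call(self, name, func, *args, **kwargs):
        if name in self._off:
            raise DependencyDown(name)  # simulate a hard outage
        delay = self._delays.get(name, 0)
        if delay:
            self._sleep(delay)          # simulate a slow dependency
        return func(*args, **kwargs)
```

During a game day, something like `faults.turn_off("database")` would then be run at the scheduled time and `faults.restore("database")` afterwards, so each failure is switched on one at a time.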
What does failure look like?
- Turn off external services (e.g. Facebook)
- Turn off dynamic functionality (e.g. comments)
- Helpful Error Pages (e.g. changing the error page to the article)
- Redirect people into static page on S3
- Varnish: serve from cache
- monitor what users are doing/saying about it
- turn off ads
Serve what people need instead of failing completely.
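The idea of serving what people need instead of failing completely can be sketched as a render function that treats the article as essential and everything else as optional. All names here (`render_article`, `load_article`, `load_comments`, `static_copies`) are hypothetical stand-ins, and the in-memory dict is only a placeholder for something like static pages on S3.

```python
FALLBACK_MESSAGE = "Sorry, this page is temporarily unavailable."


def render_article(article_id, load_article, load_comments,
                   static_copies, comments_enabled=True):
    """Return the best page we can still build for this article."""
    try:
        body = load_article(article_id)
    except Exception:
        # Primary store is down: fall back to a pre-rendered static copy
        # and skip all dynamic functionality.
        return static_copies.get(article_id, FALLBACK_MESSAGE)

    parts = [body]
    if comments_enabled:  # switch for turning dynamic functionality off
        try:
            parts.append(load_comments(article_id))
        except Exception:
            pass          # comments are optional; the article still ships
    return "\n".join(parts)
```

Setting `comments_enabled=False` corresponds to deliberately turning off dynamic functionality, while the two `except` branches cover the cases where a dependency fails on its own.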
What can you learn from simulating failure this way?
- what is important
- what your fallbacks are (how do we continue to communicate with users)
- learn weak spots
- strength of your team and teammates
- quality of communication channels and documentation
- how well third-party services work
- how to fix it faster
- be proactive in monitoring
How do we get buy in?
Takes a lot of time and ties up constrained resources, so you need to convince others that the exercise has value.
- go down, lose revenue
- point to other examples
- effect on reputation e.g. as trustworthy