The original title: Don’t Shave That Yak. Why you might not need Kubernetes.
by Adam Serediuk @aserediuk
Once upon a time….
The product ran on multiple platforms, with complex business rules, security requirements and high visibility. It had to iterate quickly and scale to millions.
IT incident management at xMatters.
Started as monolithic software that turned into microservices.
DevOps is DIFFICULT
Still solving similar problems.
Imagine having to deploy a service. Assuming you already have a GCP project: create a cluster, re-create the VPC networks, then create the cluster for real. You can use Google Groups for cluster RBAC auth, but only if you’re on G Suite.
Production readiness requires a lot of considerations (deployable, security, monitoring, logging, networking, HA, testing, backup/recovery), and it can take a long time depending on the type of system:
- Managed service: 1-2 weeks
- Distributed system (stateless): 2-4 weeks
- Distributed system (stateful): 2-4 months
- Entire cloud architecture: 6-24 months
Journey
Had 12 months to do it.
Needed to move to microservices, establish service ownership, and adopt DevOps and SRE practices.
Had 6 data centres, > 500 servers, > 4000 VMs.
Tried doing a private cloud, but it didn’t make sense and it failed.
Not all the work was useless.
Logically broke up services: interfaces, event processing, notification processing, data processing. Composable services mean a configurable service matrix per customer, environment, etc.
This allowed developers to work on one service at a time, which mattered because the full stack was too large to run on one laptop.
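A minimal sketch of what a configurable service matrix could look like, purely as an illustration. The four service groups come from the talk; the override table, the `acme` customer, the `laptop` environment and the `enabled_services` function are hypothetical names invented for this example, not the speaker's implementation.

```python
# Service groups named in the talk.
SERVICE_GROUPS = (
    "interfaces",
    "event-processing",
    "notification-processing",
    "data-processing",
)

# Assumption: every group is enabled unless an override says otherwise.
DEFAULTS = {group: True for group in SERVICE_GROUPS}

# Hypothetical overrides keyed by (customer, environment); "*" matches anything.
OVERRIDES = {
    # On a laptop, run only what the developer is working on.
    ("*", "laptop"): {
        "event-processing": False,
        "notification-processing": False,
        "data-processing": False,
    },
    # A made-up customer that does not use data processing.
    ("acme", "*"): {"data-processing": False},
}


def enabled_services(customer, environment):
    """Return the sorted list of service groups enabled for this combination."""
    matrix = dict(DEFAULTS)
    # Apply overrides from least specific to most specific.
    for key in (("*", "*"), ("*", environment), (customer, "*"), (customer, environment)):
        matrix.update(OVERRIDES.get(key, {}))
    return sorted(name for name, enabled in matrix.items() if enabled)
```

Calling `enabled_services("acme", "prod")` in this sketch yields every group except data processing, while any customer in the `laptop` environment gets only `interfaces`.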
Ended up on GCP: 6 regions, 400 services, 1200 pods. Keep the data close to the customer.
Less to manage and more cost effective.
Lessons learned
Establish common ground between developers, and provide good tools that people want to use, so that everyone shares a similar mindset.
Software and libraries are interchangeable, but work culture is difficult to change.
New and shiny is not always the best. Choose things that make sense for you. Keep it simple, easy, boring.
All code is technical debt.
You need to continue the investment. If it’s not your core business, you shouldn’t be writing it.
Evaluating software is a very different thing from writing code. One of the biggest indicators is whether an organization needs its own team to support the software; if it does, the software needs a lot of maintenance.
Built a PaaS
Wanted to be able to move to another service (e.g. Amazon).
Lots of layers.
Docs weren’t that great.
Didn’t get the adoption that it needed.
Things started to get worse and worse. Components that were supposed to have HA didn’t.
Embrace failure: unreliable hardware is good, because reliable hardware gives you a false sense of security. Expecting infrastructure to always work creates artificially suppressed volatility (see Antifragile by Nassim Taleb).
Should incentivise mean time to recovery (MTTR). If you have a reliable system, you need to inject failure through chaos testing (see Cthulhu chaos testing) to find gaps and test events. Prevent surprises and make sure things are quick and easy to get back up.
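To make the two ideas above concrete, here is a small sketch under stated assumptions: it computes mean time to recovery from a list of incidents and picks a random subset of instances to fail in a chaos experiment. The function names, the incident format (pairs of failure and recovery timestamps in seconds) and the `fraction` parameter are all hypothetical; this is not how Cthulhu works, and actually terminating instances is left out.

```python
import random
from statistics import mean


def mean_time_to_recovery(incidents):
    """Average recovery time, given (failed_at, recovered_at) pairs in seconds."""
    return mean(recovered_at - failed_at for failed_at, recovered_at in incidents)


def pick_victims(instances, fraction, seed=None):
    """Choose a random subset of instances to fail during a chaos experiment.

    At least one instance is chosen whenever the list is not empty.
    A seed can be supplied to make an experiment repeatable.
    """
    if not instances:
        return []
    rng = random.Random(seed)
    count = max(1, round(len(instances) * fraction))
    return rng.sample(instances, min(count, len(instances)))


# Hypothetical example data: two incidents and four instances.
incidents = [(0, 60), (100, 220)]
victims = pick_victims(["pod-a", "pod-b", "pod-c", "pod-d"], fraction=0.5, seed=1)
```

Tracking the first number over time shows whether recovery is getting quicker; running the second regularly is one way to surface the gaps before a real outage does.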