DevOps Days Victoria 2019: Yak shaving and lessons learned while scaling

The original title: Don’t Shave That Yak. Why you might not need Kubernetes.

by Adam Serediuk @aserediuk

Once upon a time….

Product on multiple platforms, complex business rules, security, high visibility, had to iterate quickly, and scale to millions.

IT incident management at xMatters.

Started as monolithic software that turned into microservices.

DevOps is DIFFICULT

Still solving similar problems.

Imagine having to deploy a service. Assuming you have a GCP project: create a cluster, re-create the VPC networks, then create a cluster for real. You can use Google Groups for RBAC auth on the cluster, but only if you’re on G Suite.

Production readiness requires a lot of considerations (deployable, security, monitoring, logging, networking, HA, testing, backup/recovery), and it can take a long time depending on the type of system.

  • Managed service: 1-2 weeks
  • Distributed system (stateless): 2-4 weeks
  • Distributed system (stateful): 2-4 months
  • Entire cloud architecture: 6-24 months

Journey

Had 12 months to do it.

Needed microservices, service ownership, DevOps practice, and SRE practices.

Had 6 data centres, > 500 servers, > 4000 VMs.

Tried doing a private cloud, but it didn’t make sense and it failed.

Not all the work was useless.

Logically broke up services: interfaces, event processing, notification processing, data processing. Composable services means configurable service matrix by customer, environment, etc.
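A configurable service matrix could be sketched roughly as follows. This is only an illustration of the idea, not the system described in the talk: the four group names come from the notes above, while the individual service names, the `Deployment` type, and the `services_for` helper are all hypothetical.

```python
from dataclasses import dataclass

# The four groups are from the talk notes; the services inside them
# are made-up placeholders for illustration.
SERVICE_GROUPS = {
    "interfaces": ["web-ui", "api"],
    "event-processing": ["event-intake"],
    "notification-processing": ["notifier"],
    "data-processing": ["reporting"],
}


@dataclass(frozen=True)
class Deployment:
    """One cell of the matrix: which groups a customer gets in an environment."""
    customer: str
    environment: str
    groups: tuple


def services_for(deployment):
    """Expand a deployment's enabled groups into a sorted list of services."""
    services = []
    for group in deployment.groups:
        services.extend(SERVICE_GROUPS[group])
    return sorted(services)


# A hypothetical customer that only needs two of the four groups in dev.
dev = Deployment("acme", "dev", ("interfaces", "notification-processing"))
print(services_for(dev))  # ['api', 'notifier', 'web-ui']
```

Keeping the matrix as plain data means a developer can bring up just the groups they are working on instead of the whole stack.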

Allowed developers to work on one service at a time, especially since the full stack was too large to run on one laptop.

Ended up on GCP: 6 regions, 400 services, 1200 pods. Keep the data close to the customer.

Less to manage and more cost effective.

Lessons learned

Have common ground between developers, and have good tools that people want to use, where you can have a similar mindset.

Software and libraries are interchangeable, but work culture is difficult to change.

New and shiny is not always the best. Choose things that make sense for you. Keep it simple, easy, boring.

All code is technical debt.

You need to continue the investment. If it’s not your core business, you shouldn’t be writing it.

Evaluating software is a very different thing from writing code. One of the biggest indicators: if an organization has its own team to support a piece of software, that means it needs a lot of maintenance.

Built a PaaS

Wanted to be able to move to another service (e.g. Amazon).

Lots of layers.

Docs weren’t that great.

Didn’t get the adoption that it needed.

Things started to get worse and worse. Things that were supposed to have HA didn’t.

Embrace failure: unreliable hardware is good, because reliable hardware gives you a false sense of security. You come to expect infrastructure to work, which creates artificially suppressed volatility (see Antifragile by Nassim Taleb).

Should incentivise mean time to recovery. If you have a reliable system, you need to inject failure through chaos testing (see Cthulhu chaos testing) to find gaps and test events. Prevent surprises and make sure things are quick and easy to get back up.
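Mean time to recovery is the average time between a failure and the service being back up. A minimal sketch of the calculation, assuming incidents are recorded as (failed at, recovered at) timestamp pairs; the function name and the sample incidents are hypothetical, not data from the talk:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average the recovery duration over (failed_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered_at - failed_at for failed_at, recovered_at in incidents),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(incidents)


# Made-up incidents: one 10-minute outage and one 30-minute outage.
incidents = [
    (datetime(2019, 5, 1, 9, 0), datetime(2019, 5, 1, 9, 10)),
    (datetime(2019, 5, 2, 14, 0), datetime(2019, 5, 2, 14, 30)),
]
print(mean_time_to_recovery(incidents))  # 0:20:00
```

Tracking this number across chaos tests is one way to see whether recovery is actually getting quicker.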

Yak by Lutrus. CC BY-NC-ND 2.0

Author: Cynthia

Technologist, Librarian, Metadata and Technical Services expert, Educator, Mentor, Web Developer, UXer, Accessibility Advocate, Documentarian
