Welcome to another episode of Modern Digital Business, the podcast that helps you navigate the ever-changing landscape of modernizing your applications and digital business. In this episode, we continue our exploration of modern operations with our special guest, Beth Long. Today's discussion is all about operational ownership and how it plays a crucial role in the success of modern organizations. We dive into the importance of service ownership, the measurement of SLAs, and the need for specific, measurable, attainable, relevant, and time-bound goals. Join us as we unravel the complexities of modern ops with Beth Long in this enlightening episode of Modern Digital Business. Let's dive in!
Today on Modern Digital Business
Thank you for tuning in to Modern Digital Business. We typically release new episodes on Thursdays. We also occasionally release short-topic episodes on Tuesdays, which we call Tech Tapas Tuesdays.
If you enjoy what you hear, will you please leave a review on Apple Podcasts, Podchaser, or directly on our website at mdb.fm/reviews?
If you'd like to suggest a topic for an episode or you are interested in being a guest, please contact me directly by sending me a message at mdb.fm/contact.
And if you’d like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode!
To ensure you get every new episode as soon as it becomes available, please subscribe from your favorite podcast player. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com.
Thank you for listening, and welcome to the modern world of the modern digital business!
Useful Links
Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.
Take a look at Lee's many books, courses, and articles by going to leeatchison.com.
Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com and other retailers.
Subscribe here to catch each new episode as it becomes available.
Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books, and courses from Lee. Don't worry, we won't send you spam, and you can unsubscribe anytime.
Mentioned in this episode:
Architecting for Scale
What does it take to operate a modern organization running a modern digital application? Read more in my O’Reilly Media book Architecting for Scale, now in its second edition. Go to: leeatchison.com/books or mdb.fm/afs.
Speaker: Modern applications require modern operations, and modern operations require a new definition of ownership, one that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.
Speaker: This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier, with actionable insights and recommendations as well as thoughtful interviews with industry experts: Lee Atchison.

Speaker: In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.
Speaker: In a previous episode, we talked about STOSA, and this fits very much into that idea: how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform. And the definition of what it means to perform is what an SLA is all about.
Speaker: Yeah. So what does a good SLA look like, Beth? Let's get to the measurement.

Speaker: That's a great question, and it does get into measurement. That is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals of error rate, latency, saturation. You have these things that allow you to say: okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is trying to distill the customer experience down into things that can be measured and put on a dashboard.

Speaker: The term SMART goals comes to mind, right? That, I think, is a good measure. I know the idea of SMART goals really hasn't been tied too closely to SLAs, but I think there are a lot of similarities here. SMART goals have five specific criteria: they're specific, measurable, attainable, relevant, and time-bound. And I think all five of those actually apply here as well. When you create your SLAs, they have to be specific. You can't say, yeah, we'll meet your needs. That's not a good experience. In my mind, a good measurement is something like: we will maintain five milliseconds latency on average for 90% of all requests that come in. And I also like to put in an "assuming": assuming you meet these criteria, such as the traffic load being less than X requests, or whatever the criteria are. So in my mind, it's a specific measurement with bounds for what that means, under assumptions, and these are the assumptions. Something like five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also have: assuming the request rate is at least 100 per second, because warming caches can have an effect there too, and things like that. So you can have bounds on both sides. Something like that is very specific, and it's measurable: all of the numbers I specified are things you could measure, something you could see. Specific, measurable.
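To make that concrete, here is a minimal sketch (not from the episode) of how the SLA Lee describes might be checked. The function name, the window size, and the reading of "five milliseconds for 90% of requests" as a 90th-percentile check are all illustrative assumptions:

```python
def sla_met(latencies_ms: list[float], window_seconds: float) -> bool:
    """True if this window meets the SLA, or if its traffic assumptions don't hold."""
    rate = len(latencies_ms) / window_seconds
    if not (100 <= rate <= 5000):
        return True  # outside the agreed traffic bounds, the SLA doesn't apply
    ranked = sorted(latencies_ms)
    p90 = ranked[max(int(0.9 * len(ranked)) - 1, 0)]  # 90th-percentile latency
    return p90 <= 5.0

# Example: 1,200 requests over 10 seconds (120 req/s), all under 5 ms
print(sla_met([3.2] * 1200, window_seconds=10.0))  # True
```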
Speaker: You want to make sure they're attainable within the service. That's your responsibility as the owner of a service. If another team says, I need this level of performance, it is your responsibility as the owner, before you accept that, to say: yes, I can do that. So they have to be attainable to you.

Speaker: And this actually gets at something very important in implementing these sorts of things, which is to make sure that you are starting with goals that are near what you're currently, actually doing, and step your way towards improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was so far outside their ability.

Speaker: Oh, absolutely. There are two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other is when it's so difficult that it's never hit. In the case of SLAs, you need to hit the SLA 100% of the time, but it can't be set at three times what you're ever going to see, giving you plenty of room to have all sorts of problems, because then it isn't relevant to the consumer of the goal. They need something better than that. That's where attainable, and that's where relevant, come in.

Speaker: And relevant is so important, because it's so tempting: this is where, when it's the engineers that set those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable, but not relevant, right? I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed, I can promise you that, right? And the consumer will say: well, I'm sorry, I need ten milliseconds. 37 seconds sounds like an absurd number, but you and I have both heard numbers like that, right? Numbers so far out of bounds they're totally irrelevant, not worth even discussing.
Speaker: Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly to the benefit to the customer. Say you own a web service that is serving directly to end users, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for when you're troubleshooting, when there is a problem, but that's not indicating the health of the system.

Speaker: Right. So: specific, measurable, attainable, relevant. Relevant means the consumer of your service has to find them useful. Attainable means that you, as provider of the service, need to be able to meet them. Measurable means they need to be measurable. And specific means they can't be general purpose and ambiguous; they have to be very specific. So all those make sense. Does time-bound really apply here?

Speaker: I think it does, but in the sense that when you're setting these agreements, you tend to say, this is my commitment, and you tend to measure over a span of time, and there is a sense of the clock getting reset.
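That measurement-over-a-window idea can be sketched in a few lines; the window granularity (calendar months) and the numbers here are invented for illustration:

```python
# Compliance is judged per calendar month, and each month starts fresh.
from collections import defaultdict

# (month, window_met_sla) for each measurement window
samples = [("2024-05", True), ("2024-05", False),
           ("2024-06", True), ("2024-06", True)]

by_month = defaultdict(list)
for month, ok in samples:
    by_month[month].append(ok)

for month, results in sorted(by_month.items()):
    pct = 100 * sum(results) / len(results)
    print(f"{month}: {pct:.0f}% of windows met the SLA")  # the clock resets monthly
```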
Speaker: That's true. We'll handle this much traffic over this period of time. You're right, that's a form of time-bound. I think when people talk about SMART goals, they're really talking about the time by which you'll accomplish the goal. And what we're saying is that the time you accomplish the goal is now. It's not really a goal, it's an agreement; it's a habit rather than a goal.

Speaker: And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to. And you can change them and improve them over time. You can have a goal that says: I'm going to improve my SLA over time and make my SLA twice as good by this date. That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five-millisecond latency with less than 10,000 requests, and you can say: that's great, I have a goal to make it a two-millisecond latency with 5,000 requests by this time next quarter. And at that point in time, your SLA is then two milliseconds. But the SLA is what it is: what you're agreeing to, committing to, now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving towards.

Speaker: Yeah, towards completing something. Right.
Speaker: One well-known anecdote that I think is interesting to talk about here is the example that Google gave, in the SRE book, of actually overshooting and having a service that was too reliable. I can't remember which service it was off the top of my head, but they had a service that they did not want to guarantee 100% uptime for, and they ended up over-delivering on reliability for a while. And when that service did fail, users were incensed, because there was sort of an implicit SLA: well, it's been performing so well. And what I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high of a performance level. What this underscores is how much this is ultimately about the experience of whatever person needs to use your service. This is not a purely technical problem. This is very much about understanding how your system can be maximally healthy and maximally serve whoever it is that's using it.

Speaker: So I love that story. I didn't know that story before, but it plays very well into the Netflix Chaos Monkey approach to testing. And that is the idea that the way you ensure your system as a whole keeps performing is to keep causing it to fail on a regular basis, to make sure that you can handle those failures.
Speaker: So what the Chaos Monkey does, and I'm sure at some point in time we're going to do an episode on Chaos Monkey (matter of fact, we should add it to our list), what Chaos Monkey is all about is the idea that you intentionally insert faults into your system at irregular times so that you can verify that the self-healing responses your application is supposed to have around those problems actually occur. Now, you don't do this in staging, you don't do this in dev, you do it in production. But you do it in production during times when people are around, so that if it does cause a real problem, if you turn off the service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by happenstance at 2:00 in the morning, when everyone's drowsy and sleeping and not knowing what's going on. You can address the problem right there, right then, as opposed to later on.

And the other thing it helps with is this problem that you were addressing, which is getting too used to things working. So if you deploy a new change: let's say I own Service A, and I call Service B, and I need to expect that Service B will fail occasionally. Well, I'm going to write code into Service A to do different things if Service B doesn't work. Now, what if I introduce an error in that code that I'm not aware of, and then I deploy my code? It's going to function, it's going to work, everything's going to be fine, until Service B fails and Service A also fails. But if Service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix that problem: roll it back if necessary, or roll forward with a fix, to get the situation resolved. The more chaos you put code into, the more stable the code is going to be. It's a weird thought, but the more chaotic a system, the more stable the code in that system behaves over the long term.
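As a rough illustration of that idea (a hypothetical sketch, not Netflix's actual Chaos Monkey code), a fault-injecting wrapper around calls to Service B might look like this:

```python
import random

class InjectedFault(Exception):
    """A deliberate failure inserted by the chaos wrapper."""

def chaotic_call(service_call, fault_rate=0.01):
    """Call a downstream service, occasionally failing on purpose.

    Forcing "Service B" to fail at a small random rate means Service A's
    fallback code runs regularly, so a bug in that fallback shows up soon
    after deployment instead of during a real 2:00 a.m. outage.
    """
    if random.random() < fault_rate:
        raise InjectedFault("chaos: simulated Service B failure")
    return service_call()

def call_service_b():
    try:
        return chaotic_call(lambda: "response from Service B")
    except InjectedFault:
        # Service A's self-healing path: degrade gracefully instead of crashing
        return "cached/default response"

print(call_service_b())
```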
Speaker: I'm so glad you bring this up. And what I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system, we have a sociotechnical system: this intertwined human-and-technology system. And so with DevOps, one of the advantages is that it changes the behavior of the people who are creating the system itself. Because again, if you're going to deploy code, and you know that if something goes wrong it's going to wake up that person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that Service B is going to fail, absolutely, in the coming week, you're just going to say, well, I may as well deal with this now. As opposed to: well, I'm under deadline, Service B is usually stable, I'm just going to run the risk and we'll deal with it later. So it really drives the behavior that gets inserted into systems.

Speaker: Right.

Speaker: And the other thing I love about how you unpacked chaos engineering is that it works on this very counterintuitive idea that you should be running towards incidents and problems instead of running away from them; you should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you.
Speaker: Yeah, that's absolutely great. That's great. Yeah, you're right. We're not talking about coding, we're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. And the vast majority of incidents that happen have a social component to them, not just a code problem. It's someone who said, this is good enough, or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. Right? And these aren't bad people doing bad things. These are good people making mistakes that are caused by the environment in which they're working. And that's why environment, and systems of people, and how they're structured and how they're organized, is so important. I keep hearing people say that how you organize your company is irrelevant, right? It shouldn't matter. Nothing could be further from the truth. It matters, the way you organize a company. I hate saying it this way, because I don't always live up to it, but how clean your desk is is a good indication of how clean the system is. And I don't mean that literally, because I've had dirty desks too, but it really is a good indication here. How well you organize your environment, how well you organize your team, how well you organize your organization, gives an indication of how well you're going to perform as a company.
Speaker: Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive, every tech company knows that they are probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. And so incidents are really high stakes, and that drives us to look for certainty and look for clarity. So we look to a lot of these things that people have been talking about for years around incident metrics: you've got your mean-time metrics, what's your mean time to resolution, or your mean time between failures. It's this attempt to bring some kind of order and sense to this very scary and chaotic world of incidents. But so many of those, what are now often called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system. The technology itself is very complex. The sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible, make them sensible, without falling back to some of these shallow metrics. Niall Murphy, going back to SRE, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean-time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is: how do you make decisions about what work you're going to do, and how much you invest in reliability versus new features? And incident follow-up is so much about: what decisions do we make based on what we learned in this event?
Speaker: Yeah, you add a whole new dimension to the metric discussion here, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure. Oops. But there's a lot of data in the oops. And you're right: things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth that you can get. And I'm not talking about "Joe's team caused five incidents last week; that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.
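One way to see why those mean-time metrics are shallow: incident durations are typically heavily skewed, so a mean tracks the outliers rather than the typical incident. An illustrative sketch with invented numbers:

```python
import statistics

durations_min = [12, 9, 15, 11, 8, 14, 10, 480]  # one 8-hour outlier

mttr = statistics.mean(durations_min)
median = statistics.median(durations_min)

print(f"MTTR:   {mttr:.1f} minutes")    # ~69.9, dominated by the outlier
print(f"Median: {median:.1f} minutes")  # 11.5, closer to a typical incident
```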
Speaker: Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews? If you'd like to suggest a topic for an episode or you are interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player or check out our website at mdb.fm. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.