Welcome back to another episode of Modern Digital Business! In today's episode, we delve deeper into the world of modern operations with our special guest, Beth Long. We explore the essential role of service level agreements (SLAs) in managing complex, multi-service modern applications.
As we unravel the differences between DevOps and SREs (Site Reliability Engineers), Beth sheds light on the origins and practices behind these two distinct approaches. We also discuss the significance of SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs in ensuring the stability and reliability of large-scale web operations.
Join us as we navigate the complexities of modern operations and gain valuable insights and recommendations from Beth, a seasoned SRE engineer and Operations manager. Stay tuned for an enlightening conversation on SLAs in our quest to modernize your applications and thrive in the digital business revolution. Let's dive in!
Today on Modern Digital Business
Thank you for tuning in to Modern Digital Business. We typically release new episodes on Thursdays. We also occasionally release short-topic episodes on Tuesdays, which we call Tech Tapas Tuesdays.
If you enjoy what you hear, will you please leave a review on Apple Podcasts, Podchaser, or directly on our website at mdb.fm/reviews?
If you'd like to suggest a topic for an episode or you are interested in being a guest, please contact me directly by sending me a message at mdb.fm/contact.
And if you’d like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode!
To ensure you get every new episode when they become available, please subscribe from your favorite podcast player. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com.
Thank you for listening, and welcome to the modern world of the modern digital business!
Useful Links
Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.
Take a look at Lee's many books, courses, and articles by going to leeatchison.com.
Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers.
Subscribe here to catch each new episode as it becomes available.
Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books, and courses from Lee. Don't worry, we won't send you spam, and you can unsubscribe anytime.
Mentioned in this episode:
Architecting for Scale
What does it take to operate a modern organization running a modern digital application? Read more in my O’Reilly Media book Architecting for Scale, now in its second edition. Go to: leeatchison.com/books or mdb.fm/afs.
Speaker:
What are service level agreements and why are they absolutely
Speaker:
essential to managing complex, multi-service
Speaker:
modern applications? Today I continue my discussion on
Speaker:
modern ops with Beth Long. Are you ready? Let's
Speaker:
go. This is the Modern
Speaker:
Digital Business podcast, the technical leader's guide to
Speaker:
modernizing your applications and digital business. Whether you're a
Speaker:
business technology leader or a small business innovator, keeping
Speaker:
up with the digital business revolution is a must. Here to help make
Speaker:
it easier with actionable insights and recommendations, as well as
Speaker:
thoughtful interviews with industry experts: Lee Atchison.
Speaker:
In this episode of Modern Digital Business, I continue my conversation on
Speaker:
modern operations with my good friend, SRE
Speaker:
engineer and operations manager Beth Long. So, Beth,
Speaker:
great to see you again today. And today we wanted to
Speaker:
talk about SRE terminology and
Speaker:
measurements, and it's fantastic that we have an
Speaker:
SRE in our midst in order to do that. So I'm glad you're
Speaker:
here. Great. Let's get started on
Speaker:
this. So SRE, site reliability engineer,
Speaker:
is tied very closely to the concept of DevOps,
Speaker:
but they're really not the same thing. Can you start out
Speaker:
by telling us what's the difference between DevOps and SREs?
Speaker:
I love this question. I've talked about this a number of times. And I'm going
Speaker:
to get back at you for asking me this by flipping it around and asking
Speaker:
you the same thing in a minute, but I'll take a stab at it. So
Speaker:
SRE, site reliability engineering, originated out of
Speaker:
Google, gee, almost 20 years ago now, I
Speaker:
guess. Yeah. And it was
Speaker:
really a discipline that was a
Speaker:
response to the
Speaker:
pressures of managing technology
Speaker:
at Google scale. And so
Speaker:
a lot of the practices that are associated with site reliability
Speaker:
engineering are the things that Google
Speaker:
developed internally to help them manage their scale as
Speaker:
they grew and then began to evangelize out to the wider
Speaker:
community. And so now a lot of those practices have been adopted more
Speaker:
widely and have been iterated upon. But
Speaker:
that's the origins of site reliability engineering
Speaker:
and the origins of DevOps are
Speaker:
a little bit more cross
Speaker:
cutting, a little bit more democratic, I
Speaker:
guess, and came out of
Speaker:
people around the same time
Speaker:
realizing that the
Speaker:
siloing of development and operations
Speaker:
was leading to unhealthy
Speaker:
patterns in the software engineering.
Speaker:
So people like John Allspaw, who we'll talk about a
Speaker:
little bit later probably, if we touch on incidents at all, were
Speaker:
prominent in kind of saying, let's rethink how we're doing
Speaker:
the software engineering practice. So DevOps really focused
Speaker:
on integrating development and operations so
Speaker:
that those functions were
Speaker:
shared more as opposed to completely siloed. And
Speaker:
site reliability engineering was a set of practices
Speaker:
around maintaining stability and reliability
Speaker:
of large scale web operations. And so there
Speaker:
are some foundational topics that I'd like to ask you about actually,
Speaker:
around things like service level indicators, objectives and
Speaker:
agreements and a wide number of other
Speaker:
practices. So this is a very wandering answer to say
Speaker:
that the major difference between the two I think is really kind
Speaker:
of one of ancestry and how they started and
Speaker:
SRE sort of being a set of
Speaker:
practices and DevOps being more of a
Speaker:
philosophy and an approach to the development
Speaker:
environment. Yeah, it's almost like the
Speaker:
SRE is a practice that occurs within a DevOps model, but
Speaker:
it exists independently as well too. But it's a role
Speaker:
within DevOps. But not the only role within DevOps. Right?
Speaker:
Yeah. Now what's interesting is you
Speaker:
hear both DevOps and SRE talked about as
Speaker:
practices but also you'll hear about SREs talked
Speaker:
about as a profession, but yet
Speaker:
you don't talk about DevOps as a profession.
Speaker:
And in fact people do, but usually it's considered a
Speaker:
negative. I'm a DevOps engineer. No, there's no such thing as a
Speaker:
DevOps engineer. So does that
Speaker:
also come in part from the historical
Speaker:
nature of where it came from, or is there really a difference there that
Speaker:
matters? This is a great question and something I hoped we
Speaker:
would touch on because I still kind of cringe a little bit when I see
Speaker:
DevOps engineer, but I've come to understand why
Speaker:
that job title has meaning. Because
Speaker:
there are organizations that for a number of reasons,
Speaker:
including the size of the organization, the history
Speaker:
of it, its composition, sometimes it does make
Speaker:
sense to have people focus
Speaker:
on the kinds of things that happen at the
Speaker:
boundary of development and operations.
Speaker:
And so you'll get DevOps engineers who focus on internal
Speaker:
tooling, build and deploy pipelines,
Speaker:
that sort of
Speaker:
activity. Yeah, I always hate the word DevOps engineer applying
Speaker:
to that as opposed to like infrastructure engineer or tooling
Speaker:
engineer. But you're right, you do hear that, you hear that
Speaker:
apply there. What it almost seems like though is you
Speaker:
hear a large organization say DevOps is good,
Speaker:
we need to go to DevOps. Okay, you and you are now DevOps
Speaker:
engineers. Exactly. And that's not the way it's
Speaker:
done, of course. And often they become the ones then that
Speaker:
focus on the tooling and kind of become those tooling engineers and keep
Speaker:
the DevOps title. And it's not always a good
Speaker:
history that brings you to that situation.
Speaker:
Yeah. And to answer your original question, I
Speaker:
think
Speaker:
there's a little bit of a crisper definition around
Speaker:
what a site reliability engineer does,
Speaker:
but there's still a lot of fuzz in the definition and there's a lot
Speaker:
of range in if someone says they're an SRE, what they actually do. It's
Speaker:
still going to be quite a wide range of options. But
Speaker:
the origins of site reliability engineering go back to
Speaker:
bringing the software engineering discipline
Speaker:
into the operations realm. And so again you see this sort
Speaker:
of both SRE and DevOps are really about
Speaker:
crossing that boundary, but it's
Speaker:
almost in the opposite direction. Exactly. Yeah, exactly.
Speaker:
DevOps is more about bringing Ops into dev and
Speaker:
SRE is more about bringing the processes
Speaker:
of development into operations. Right. And so you are much more
Speaker:
likely to end up with an SRE group
Speaker:
that is sort of helping the whole organization level up with those
Speaker:
things. Whereas a DevOps organization,
Speaker:
at least in the way that I tend to use DevOps and I think you
Speaker:
and I are similar in this, a DevOps organization is going to
Speaker:
be "you're on call for your own services" rather than having
Speaker:
an operations center and some of those things that are more at the
Speaker:
organizational scale as opposed to SRE tending
Speaker:
to be more likely that you're going to have a group of
Speaker:
people that are bringing those things to the organization. Yeah,
Speaker:
in that manner, SRE group or SRE
Speaker:
engineers is more akin to like an architecture group and
Speaker:
architects, they're assigned to individual parts of the
Speaker:
project, but they also have some global responsibilities as
Speaker:
well and shared knowledge. And whether they're in
Speaker:
one group or distributed is a
Speaker:
much more fluid question. That depends on the
Speaker:
organization versus a clear cut who should be in which group.
Speaker:
Sort of a model that is more akin to what
Speaker:
happens in DevOps. I like that distinction. So now
Speaker:
we know SRE is not the same as DevOps and we understand the difference
Speaker:
between them. That's great. So you were going to get me back on something,
Speaker:
you had said. I'm not really looking
Speaker:
forward to that, whatever that is.
Speaker:
If there's one thing that's iconically associated with
Speaker:
SRE, I think it's fair to say that it's service level indicators
Speaker:
and service level objectives and service level agreements.
Speaker:
SLIs, SLOs, SLAs. The acronyms
Speaker:
confuse everybody, even those who have been using them for years.
Speaker:
And I know that you have a very pragmatic approach
Speaker:
to kind of tackling some of these questions. So I'd love first for folks that
Speaker:
aren't deeply aware of those, maybe kind of set the scene and then I'd
Speaker:
love to kind of hear your take on how you can implement those.
Speaker:
Well, sure. I even confuse SLIs and
Speaker:
SLOs. And so I'm going to need help with the definition if we're going
Speaker:
to define what the three are. But I'd almost prefer to avoid the
Speaker:
definitions and talk about what the problem is that's going on there. What the
Speaker:
problem is, is what all of them are trying to indicate
Speaker:
is the health of something, the health of a code base,
Speaker:
the health of a service, the health of an application.
Speaker:
Now historically the term SLA, service
Speaker:
level...
Speaker:
Agreement. Agreement? Yeah.
Speaker:
SLA, service level agreement, comes
Speaker:
from inter customer connections.
Speaker:
So you have a provider of a service, of an
Speaker:
application that has a customer, and that customer says, we'll buy
Speaker:
your service, but I need an SLA, a service level
Speaker:
agreement that specifies how well or
Speaker:
what your service is going to do for me. And often those
Speaker:
agreements are around things like uptime,
Speaker:
latency, how fast the application will work,
Speaker:
how many users can be connected to it. There could be a
Speaker:
thousand different dimensions on how it's measured, but it's usually some form of
Speaker:
measurement of a guarantee to the customer
Speaker:
of what the provider of that service or
Speaker:
application will guarantee in
Speaker:
exchange for usually money in the case of a customer
Speaker:
relationship. So an SLA has a very long history.
Speaker:
It's been around for a long time. The word SLA probably goes back
Speaker:
long before either one of us were born, because it applies
Speaker:
to contract work in general, not just software or
Speaker:
computer work. And so it's been around for a long time. But
Speaker:
what's happened is, I believe Google
Speaker:
is the one who started the SLO or the SLI model. I believe they're
Speaker:
the ones that did part of the SRE revolution. It included with
Speaker:
them. But what was decided was
Speaker:
we need some way to at a smaller scale as we
Speaker:
take this large application and now internally
Speaker:
divide it into services and into microservices and into
Speaker:
its various components. And especially in DevOps
Speaker:
models, we needed a way to say this part of the
Speaker:
service has requirements that it must
Speaker:
perform to. It has obligations it
Speaker:
needs to be able to handle in order to serve the needs
Speaker:
of the other services around it.
Speaker:
And so Google created new terms called SLIs and SLOs
Speaker:
in order to distinguish them from
Speaker:
SLAs for how you
Speaker:
measure those parts of the application. And the
Speaker:
idea is SLIs and SLOs are internal
Speaker:
measurements for internal customers, and SLAs were
Speaker:
external measures for external customers. That's where I have
Speaker:
my problem, because in my mind, in a service oriented
Speaker:
architecture, in a service oriented
Speaker:
team model, if you own a service
Speaker:
and other services depend on you, those other teams
Speaker:
are your customers. The fact that they sit down the hall from
Speaker:
you or right next to you, or on another floor, but in the same
Speaker:
company is irrelevant. They're still your customers.
Speaker:
Whether they're an internal customer or an external customer doesn't matter.
Speaker:
They're your customers. You need to keep them happy for
Speaker:
your application to perform as expected. So when
Speaker:
you provide a service that someone else is depending
Speaker:
on, and you specify what the requirements are for running that
Speaker:
service, those are service level agreements. Those are the
Speaker:
agreements that you have with the other service owners
Speaker:
of how your service will behave. There's no
Speaker:
difference between those SLAs and the external ones. So
Speaker:
don't call them something different, because that implies there's something less.
Speaker:
Right? An SLO implies an
Speaker:
internal agreement, which of course, internal agreements aren't
Speaker:
official agreements. Well, they're not as important, right?
Speaker:
SLA implies an external agreement, which is important because we're
Speaker:
talking about customers here. They're all customers. They're all
Speaker:
external. They're all SLAs. When you make an
Speaker:
agreement that your application performs a certain way, there's
Speaker:
no difference in whether or not that agreement is made with another team within your
Speaker:
organization or to an external customer. They're just as important
Speaker:
because guess what? If you break your agreement
Speaker:
for how your service performs with another team, that's not going to
Speaker:
just affect the other team. That's going to affect all the teams that they depend
Speaker:
on, and ultimately it's going to affect the customer. So it all
Speaker:
matters. They're all just as important. So let's not invent
Speaker:
new terms to describe them. In my mind, they're all
Speaker:
SLAs. So if you have 100 service teams
Speaker:
within your organization and they have their
Speaker:
criteria for how they are expected to perform
Speaker:
to support the other service owners, those
Speaker:
are SLAs. Those expectations are
Speaker:
service level agreements. They need to be treated at the same level of
Speaker:
importance as the customer level service level
Speaker:
agreements. I find that really interesting because you're getting
Speaker:
at the fact that words matter and what we
Speaker:
call things matter, because I think there are a lot of
Speaker:
really interesting organizational challenges with implementing
Speaker:
SLOs and SLAs effectively. And one of them sort of on the
Speaker:
flip side, is that when teams
Speaker:
talk about service level objectives,
Speaker:
they often sort of set them arbitrarily based
Speaker:
on, okay, these are the things that I can measure and these are
Speaker:
my objectives as the owner. And
Speaker:
what you're getting at is the fact that these really need to be
Speaker:
agreements. They need to be hashed out with product
Speaker:
owners and technical leads and people who are
Speaker:
deeply familiar with the customer, whether that customer is internal or
Speaker:
external. Stay tuned for our next Modern Ops segment
Speaker:
when Beth and I continue our discussion on modern application
Speaker:
operations by talking about ownership in a modern
Speaker:
operations world. Thank you for
Speaker:
tuning in to Modern Digital Business. This podcast exists
Speaker:
because of the support of you, my listeners. If you enjoy what you
Speaker:
hear, will you please leave a review on Apple Podcasts or
Speaker:
directly on our website at mdb.fm/reviews?
Speaker:
If you'd like to suggest a topic for an episode or
Speaker:
you're interested in becoming a guest, please contact me directly by
Speaker:
sending me a message at mdb.fm/contact.
Speaker:
And if you'd like to record a quick question or comment, click the
Speaker:
microphone icon in the lower right hand corner of our website.
Speaker:
Your recording might be featured on a future episode. To
Speaker:
make sure you get every new episode when they become available, click
Speaker:
subscribe in your favorite podcast player, or check out our website at
Speaker:
mdb.fm. If you want to learn more from me,
Speaker:
then check out one of my books, courses or articles by going to
Speaker:
leeatchison.com. All of these links are included in the show
Speaker:
notes. Thank you for listening and welcome to the world of the
Speaker:
Modern Digital Business.
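For readers who want to see the idea from this episode in concrete form, here is a minimal Python sketch of a service level agreement expressed as measurable thresholds on uptime-style availability and latency, the two dimensions mentioned in the conversation. Everything in it is an assumption made for illustration: the class name `ServiceLevelAgreement`, the field names, the 95th-percentile latency measure, and the example numbers do not come from the episode. Note that the `consumer` field does not record whether the consumer is internal or external, which mirrors Lee's point that the agreement carries the same weight either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Hypothetical agreement between a service and one consumer."""
    service: str
    consumer: str               # another team or a paying customer; treated alike
    min_availability: float     # fraction of requests that must succeed, e.g. 0.999
    max_p95_latency_ms: float   # ceiling on 95th-percentile latency


def availability(successes: int, total: int) -> float:
    """Fraction of successful requests; no traffic counts as fully available."""
    if total == 0:
        return 1.0
    return successes / total


def p95(latencies_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_met(sla: ServiceLevelAgreement, successes: int, total: int,
           latencies_ms: list[float]) -> bool:
    """True when both the availability and latency thresholds hold."""
    return (availability(successes, total) >= sla.min_availability
            and p95(latencies_ms) <= sla.max_p95_latency_ms)


# Illustrative values only: a made-up service and consumer.
sla = ServiceLevelAgreement("checkout", "cart-team", 0.999, 200.0)
samples = [100.0] * 95 + [500.0] * 5
print(is_met(sla, successes=9995, total=10000, latencies_ms=samples))
```

In practice the thresholds would be negotiated with the consuming team or customer rather than picked by the service owner alone, which is the point Beth raises near the end of the segment.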
Product Manager
I write stories for humans and code for machines. I'm preoccupied with the entire ecosystem of modern technology: code, data, infrastructure, and the clever, perplexed humans who make it all work.