All your things will change: Build evolvable cloud infrastructure to make it easy – Kief Morris


My name is Kief. I’ve been with ThoughtWorks for about nine years now, and my role is around infrastructure engineering, cloud, platforms and those kinds of things. When I joined, it was off the back of the Continuous Delivery book, which was just coming out at that point, and it really clicked with me. I’d been working in the industry for about 12 years by then, across roles in development and systems administration, and I was always interested in that in-between point. So when DevOps emerged as a thing, and cloud and infrastructure as code were emerging alongside it, that really clicked with me, and the continuous delivery concept was great.
What I and people like me would tend to do on ThoughtWorks projects, when we went to clients who said “we want to do this continuous delivery thing, we want to make our release process easier”, was engage with the operations side of the organisation, the people managing the infrastructure and the environments, and ask: what do we need to do to make this work more smoothly? How can we make environments more consistent all the way through to production? Because that consistency is one of the keys to continuous delivery. We did it by leveraging tools like Puppet and Chef, which were out at the time, and cloud, which was just getting started but not necessarily getting much adoption in the enterprise space at that point. So I was working with these people, trying to figure out how to make the best use of these technologies, because the technologies themselves weren’t enough.
Cloud was, and still is, an enabler for a lot of change, and organisations want to become digital. A few years ago, what that meant was a separate digital department. Organisations would say “we want to do stuff online, but it’s a new thing for us”, and they’d set up a separate department, the digital team or digital department or whatever. And what that team would often say is “we’re going to use cloud, because our existing IT department is obstructive. They get in the way, they’re too slow, they’re too old-fashioned. We’re just going to go to the cloud and do whatever we want, how we want to do it, and not worry about all those old things.” But it turned out that a lot of the things those traditional operations teams were concerned about actually did matter. Maybe the ways they went about them didn’t exploit the new technologies or align with new ways of working, but security, managing data responsibly, scaling and risk are all things we still need to care about. And especially as digital moves into the centre of the business, it stops being something you can keep running as a little cowboy outfit off to the side. Once the real business is taking place and growing on this platform, we have to take it more seriously, and we have to fold it into the whole business: the whole business has to adopt this way of working, these ways of thinking, these new technologies. So this whole question of risk, alongside moving fast, becomes really important.
How many people here have come across the State of DevOps report, or the Accelerate book? OK, a fair number. I think this has been a really transformative thing in the world of DevOps over the past couple of years, because they’ve done research into how organisations are working and what kind of outcomes they’re getting with those practices. Previously we would advocate agile ways of working, DevOps and so on, on the basis that it seemed like a good idea from our experience, that it seemed to be more effective if we did it. This research puts numbers on it, puts facts behind it. They looked first at practices: what kind of agile practices and techniques organisations are using, how they organise their teams, those kinds of things. And then they looked at outcomes, and by outcomes they mean: as a business, what have you declared as your objectives, and how well are you doing at meeting them, whatever they might be? It’s not just technical things like delivering features; it’s also financial performance, even share price. They built correlations between these things.
Of all the things they measured, they found four metrics that are the most indicative, the ones with the highest correlation to whether an organisation performs well against the objectives it’s looking to achieve, what they call high-performing organisations. They fall into two groups. First there’s throughput: how frequently you make deployments to your production systems, and the lead time for making a change. That is, when you identify a need, whether it’s a bug to fix, a new feature to introduce, or a whole new product, how long does it take for that to get into customers’ hands so you can start finding out how well it works? Those are the throughput indicators. Then there are the stability indicators: the change fail rate, meaning when you make one of those changes, one of those deploys to production, how frequently does it go wrong and need to be remediated, whether by rolling it back or doing a quick fix; and how quickly you can restore service if something does go wrong, if something falls over, how quickly can you get it back up and running? What they found is that positive numbers on these metrics correlate with the overall success of the organisation.
And here’s the interesting thing. Instinctively we feel we have to trade these off: throughput or stability. We have to decide, as an organisation, whether we’re going to focus on speed and sacrifice quality, the startup mentality of “move fast and break things”, or whether we’re going to choose quality and stability, in which case we have to sacrifice throughput and speed. We instinctively think these are things we have to trade off against one another: either throughput leads to instability, or stability leads to poor throughput. What the research actually found is that this is not a trade-off. Organisations that perform well against whatever metrics they’re looking to achieve commercially are good at both of those things. You don’t find organisations that are good at one or the other; they’re good at both, or bad at both.
And I think this makes a certain kind of sense when you think about it. If you try to go fast, you can’t do it when your system is poor quality, because things break, they slow you down, and you end up going slow anyway; that’s how you end up being bad at both even though you tried to be good at the speed thing. And if you try to be good at the quality thing and say “we’re going to go slow and careful, with lots of process, evaluating every change and doing things really rigorously”, you find that actually reduces quality as well, because it costs too much to make fixes. Yes, when something falls over and it’s a catastrophe, you stop everything and go fix it, and often you do that by breaking your process, doing the hot emergency fix outside the rules. But for the routine things that keep the technical debt load down and the quality of the system high, if the barriers to making small fixes are very high and there’s a lot of process involved, you won’t do it as often. So overall you end up with a lot of technical debt and long lists of known issues. You see this a lot in large financial organisations and the like, which are very focused on heavyweight processes they hope will make their systems higher quality; when you actually go in and look at how they run things and the state of things, it’s not that great.
This is what Neal was referring to earlier with evolutionary architecture, and it’s what we need to do with our infrastructure as well. It’s about seeing change as something we can’t avoid. Change isn’t something we can minimise and limit in order to give us stability, because that doesn’t help our quality. We have to figure out how to exploit change, how to turn change into an advantage. So the goal as an organisation is to optimise for change, and by optimise for change I don’t just mean speed, figuring out how to go as fast as possible, but also how to make changes reliable and easy and safe.
The three practices I’m going to talk about today for doing this with our infrastructure are these. One is, obviously, defining things as code, which does a few things to help us optimise for change. That in turn helps us do continuous validation, making sure we test each change and build quality into the system as we make changes. And the third thing is building things in small pieces, so that all of this is easier to do.
So, the first one: defining all of our things as code. This is the basic stuff of infrastructure as code. We define all of our things as code so that they’re visible: everybody in the team and the organisation can look at the code and evaluate it, and auditors can come and look at it too. It makes it easier to reuse things where that makes sense, it gives you consistency across things, and it also makes things testable.
Some of the base pieces of doing infrastructure as code, the enablers, are these. The foundation is the infrastructure platform, some kind of cloud, which gives you resources: compute, storage and networking, in various configurations. Then you take that and use your code to create stacks. What do I mean by a stack? How many people here are familiar with or use Terraform? OK. How many CloudFormation? A little bit fewer. There isn’t a common term across all of these tools for the unit they work with, like a Terraform project. CloudFormation does have a stack as a concept, but I use the term “stack” generically, just to have something to talk about patterns with regardless of which tool you’re using. You can use Terraform, you can use Ansible, you can use Azure Resource Manager templates, all those kinds of things; it doesn’t matter at this level, because we’re talking about how to put these things to use in good ways. So: an infrastructure stack is a collection of infrastructure resources that we manage as a unit. Referring again to the evolutionary architectures book, they talk about a quantum of change, which is the piece that you make changes to. If you change one thing in the stack, you redeploy the stack. That’s why it’s a unit of work.
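To make the idea concrete, here’s a minimal sketch of what a stack might look like as a small Terraform project; the resource names, region and AMI ID are illustrative placeholders, not taken from the talk:

    # main.tf -- one stack: a handful of related resources managed and applied as a unit
    provider "aws" {
      region = "eu-west-1"
    }

    resource "aws_vpc" "app" {
      cidr_block = "10.1.0.0/16"
    }

    resource "aws_subnet" "app" {
      vpc_id     = aws_vpc.app.id
      cidr_block = "10.1.1.0/24"
    }

    resource "aws_instance" "app_server" {
      ami           = "ami-12345678"   # placeholder AMI ID
      instance_type = "t3.small"
      subnet_id     = aws_subnet.app.id
    }

Change any one of these resources and you re-run Terraform against the whole project; the stack is the quantum of change.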
Then, on top of the infrastructure platform and the stacks you create with it, you have the application runtime: what you actually deploy your applications onto. The common thing these days is virtual machines, EC2 instances or what have you. In many cases people still use physical hardware; there are still organisations that need that kind of performance and deploy straight to metal, but that can still be automated and done in an infrastructure as code way, with bare-metal clouds and those kinds of things. Container clusters are obviously becoming super common; Kubernetes is the best known and getting the most traction, but there are others, like Nomad and so on. And then there are serverless runtimes. The point is that these are all different varieties of things you might deploy your applications onto, but you still want the same kinds of things around them. You want to be able to provision the thing itself, whether it’s a cluster or whatever, as code; I’ll talk a bit more about that later. Even with serverless you’re going to have things around your code. You don’t necessarily have infrastructure that you’re aware of and have to manage, but your code is going to have to connect with things: you’ll probably have network connections coming in, maybe going out, and you’ll have to store data somewhere. So there’s still a need to define the things around it, and to be able to test how the whole thing behaves in a non-production context before you deploy it into production. All of that is still quite relevant.
That leads us to the question: how do we manage multiple environments with this infrastructure? There are a couple of patterns people use. The first, naive thing to do is to say: I’ve got Terraform, and I want a test environment, a production environment, maybe a staging environment, so I’m going to define them all in a single project, with the code for each of those environments, the servers in each environment, all in one Terraform project, one stack. The problem with this is the blast radius of that stack. When you’re trying to make a change to the test environment, you change the code for it and apply it, and you might accidentally break the production environment; things can happen that have unintended consequences across the environments. Most people who try this find it painful after a while.
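As a rough sketch of that naive approach (illustrative resource names only, not from the talk), every environment ends up defined in the same Terraform project:

    # one project containing every environment -- the blast radius problem
    resource "aws_instance" "app_server_test" {
      ami           = "ami-12345678"   # placeholder
      instance_type = "t3.small"
    }

    resource "aws_instance" "app_server_prod" {
      ami           = "ami-12345678"   # placeholder
      instance_type = "t3.large"
    }

A terraform apply intended to tweak the test server still evaluates, and can accidentally modify, the production server sitting in the same stack.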
The next thing people do is say: fine, I’ll make a separate project for each of my environments, each with its own code, and I’ll copy the code from one environment to the next. This solves the blast radius problem: now when I make a change to my test environment, I change only the code for that environment and run Terraform, and I’m not going to accidentally mess with any of my other environments, which is a good thing. But now you make changes by editing the code for the test environment and, if you’re happy with it, copying that change to the next environment, and then the next. This creates the opportunity for errors, whether it’s making a mistake while copying, forgetting to copy some part of it, or variables that need to differ between environments, like names and IDs. It gets very messy. Think about application code: this is not what you do with application code. You don’t take the codebase for, say, a Java application, copy it into a separate GitHub repo for each environment, and then copy and paste code changes across those environments. You have a single codebase that you use.
For infrastructure, the equivalent of that is a template stack. The idea is that you go back to a single project, but rather than that project containing separate code for each environment, it contains the code that defines an environment in an abstract sense: this is what my application’s environment looks like, it has some application servers and web servers, a database, networking rules, whatever it may be. Then you apply that to one environment at a time. You can version it, and you can promote it, just as you would with, say, a Java artefact. So that’s the groundwork: we’re defining things as code, and we’re able to reuse that code and promote it across environments.
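A sketch of the template stack idea, with assumed variable names rather than anything from the talk: one abstract environment definition, parameterised and then applied once per environment, each with its own state:

    # environment.tf -- one definition of what "an environment" looks like
    variable "environment" {}

    variable "instance_type" {
      default = "t3.small"
    }

    resource "aws_instance" "app_server" {
      ami           = "ami-12345678"   # placeholder
      instance_type = var.instance_type
      tags = {
        Name        = "app-server-${var.environment}"
        Environment = var.environment
      }
    }

You then run something like terraform apply -var environment=test for the test environment and terraform apply -var environment=production for production, keeping a separate state file (or workspace) per environment, and promote a tagged version of the project from one environment to the next.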
The second big thing we want to do is continuous validation: automated testing, and whatever other kinds of testing we need, on the infrastructure itself; not just our application code, but our infrastructure code. Every time you make a change to your infrastructure code, whether it’s Terraform or CloudFormation, or server-level code like Puppet, Chef or Ansible, you’d like tests to run automatically when you commit the change to source control, to give you feedback on whether it’s going to work as you expect. Then we can build a pipeline for our infrastructure, just as we have for applications, and promote the code through our environments. I’ll get to some of the more complex patterns in a moment, but the basic principle is that you promote your infrastructure code, ideally across the same environments your applications go through, so you’re testing that it works before you put it into production.
with doing this with testing
infrastructure code, especially at that
level of stuff that manages resources in the cloud. Is it takes a long
time to run, especially if you’re creating
network structures or so on your building services
setting them up you know could take 10 – 20 minutes or more
before you kind of find out you know before it builds
the environment, then you run the
test against it. And you find out, oh, I had a
kind of fairly simple mistake that I need to go
back and fix. This is where we get into that kind
This is where we get to the third core practice for infrastructure as code, which is building things in small, independently releasable pieces: components, or what have you. It’s much the same as the microservices idea: rather than having a monolithic infrastructure, we have smaller pieces. To think it through: we start out with a stack, a Terraform project or CloudFormation template or whatever, that builds some servers and some other things. Then we say: let’s pull out the code for what goes on those servers. If I’m installing a Tomcat application server, I need Java installed; in this example we’re using Chef cookbooks to do that. We want to pull those out and test them on their own. We don’t want to spin up whole environments, building application servers and databases and networks, just to find out whether my cookbook installed Java properly or configured Tomcat properly. Instead you pull those pieces out, organise them at the role level, and treat each one as code that’s independent and can be tested on its own. We can test them separately, on virtual machines or even in containers. There are tools like Test Kitchen, which orchestrates this testing for you: it sets up the environment, often using a Docker image because it’s really fast, with a really pared-down operating system, spins it up, applies the cookbook to install Java and Tomcat, and runs a couple of tests to assert things about the result. There are testing frameworks for this; I use RSpec-based things like Serverspec and awspec a lot, and InSpec, which the Chef people came out with recently, is something I quite like for this. That lets you run the tests very quickly, and the orchestration is handled for you. It’s something you can run locally on your own machine without having to spin up cloud resources, and it can run on your build agents, on GoCD or Jenkins or whatever you have.
Then you can extract all these pieces and create more complex pipelines, ones that give you faster feedback by running the lower-level tests first. You start getting that test pyramid effect: I can test server configuration very quickly, and the things I can only test once I’ve actually spun up a bunch of servers, like checking that network connections work across all of your subnets or what have you, come a bit later, when you pull things together. Another really key thing is testing the stack definitions themselves. This is about testing that your Terraform code actually builds the cloud resources correctly; you can pull those out and test them separately too.
The other way of breaking your infrastructure apart into smaller pieces, to make it easier and more practical to test, is to break the overall stack up. I’ve been at a couple of clients recently who have very large Terraform projects; at one of them it takes about an hour to run Terraform on an environment, because the project has just grown over time and everything for the environment is in there. This is the same concept as moving from monolithic applications to microservices: it’s easier to deal with if we can break it into smaller pieces. Each of these can be a separate stack, a separate Terraform project, which we then integrate together in a more loosely coupled way. Each stack has its own pipeline, and you have to define your contracts and pay attention to where two things depend on each other across these stacks.
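One common way to express that kind of contract between Terraform stacks, sketched here with made-up names, is for one stack to publish outputs and for a dependent stack to read them, for example through remote state:

    # networking stack: publishes its side of the contract as outputs
    output "app_subnet_id" {
      value = aws_subnet.app.id
    }

    # application stack: consumes the networking stack's outputs
    data "terraform_remote_state" "networking" {
      backend = "s3"
      config = {
        bucket = "example-terraform-state"       # placeholder bucket
        key    = "networking/terraform.tfstate"
        region = "eu-west-1"
      }
    }

    resource "aws_instance" "app_server" {
      ami           = "ami-12345678"             # placeholder
      instance_type = "t3.small"
      subnet_id     = data.terraform_remote_state.networking.outputs.app_subnet_id
    }

Each stack can then change and be applied through its own pipeline, as long as the published outputs, the contract, stay stable.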
The trick in this, the art of it, is working out where to draw the boundaries between these different stacks. There are different ways and different techniques, but it really comes down to optimising for making changes. What types of changes are the typical ones you see in your teams? What parts of the system do they touch? If a typical change tends to touch, say, some networking stuff and some server configuration, then put those together, so you can change them together, rather than having to change two different things and then try to coordinate those changes.
Now, a couple of other things. One thing I mentioned that I want to talk about a bit more is container clusters, because this is increasingly becoming an issue: we’re using Kubernetes clusters, so how do we manage that? How do we define it? I think one of the anti-patterns people fall into is letting the clusters become snowflakes. We think about continuous delivery for our applications, we build them into containers and microservices, and that’s really nice, but then the clusters we’re deploying them onto are basically snowflakes that somebody built by hand. There’s always the need to do security updates, updates to the software and so on, and those can become very painful and very disruptive to the organisation: we’re going to do this upgrade now, it’s going to impact all the teams, we have to coordinate what a good time to do it is. What that means is that we do it less often. It’s riskier to make those changes, because when you change the cluster you change everything, and you can potentially break everything running on it, so the change is more work and involves more coordination. What tends to happen then is that your cluster configuration ends up out of date, maybe with known security vulnerabilities that you haven’t had time to patch. Again, it lowers the overall level of quality of your system.
One pattern for dealing with this is to break it down by environment and have separate clusters for production and for non-production. In some cases teams take this to the extreme of giving every environment its own cluster, and what that gives you is testing. If you’re defining your cluster as code, if your Kubernetes deployment and how it’s set up is code, and how the nodes are built is all code, then you can apply those code changes in test environments. Ideally you’d have an environment before you even inflict changes on the development environments: you spin up a throwaway test cluster that nobody cares about except the people maintaining the cluster code, run tests on it, and make sure it’s OK before you roll the change out to the development and test environments, and then on to production once all the tests have run and everything is good. So you have a pipeline for this too: a pipeline for your infrastructure, for your cluster code, as well.
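As a sketch of what “a cluster per environment, defined as code” can look like, assuming a hypothetical local Terraform module called k8s-cluster (not something from the talk), the same definition gets stamped out once per environment:

    # clusters/test/main.tf -- the throwaway cluster the cluster team tests against first
    module "cluster" {
      source      = "../../modules/k8s-cluster"   # hypothetical module
      environment = "test"
      node_count  = 2
    }

    # clusters/production/main.tf -- same module, production-sized parameters
    module "cluster" {
      source      = "../../modules/k8s-cluster"   # hypothetical module
      environment = "production"
      node_count  = 6
    }

A change to the module flows through the pipeline in the same order as any other infrastructure code: the test cluster first, then the development and test environments, then production.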
Another way to divide up clusters is by teams or by departments. Again, it’s about defining boundaries based on change and ownership. You really want the group of people who own an application or a set of services to also have ownership over the infrastructure and the clusters they run on, even if there’s governance involved and other people are part of making those changes. What you don’t want is: our department wants to make a change, but it isn’t a good time for another department, so we have to wait and suffer. Dividing by team avoids big-bang updates that roll out across everything, and it lets you have different configurations for different teams, which can be very useful so you avoid those battles or conflicts between teams. It can also support governance boundaries. For instance, some parts of your system might handle data that has particular regulations applying to it; you can put those parts onto their own cluster and manage it separately, so other teams aren’t held back by regulations that don’t apply to what they’re doing.
So, the big questions. One is: where do we start? Do we build our thing first and add the automation later, or do we build the automation in from the start? This is a difficult trade-off, because doing the automation is a lot of work. Building automation for your infrastructure, building the tests, building pipelines and all of that takes a considerable amount of work to get going in the first place. What happens a lot of the time is that we start on a project, we start building this stuff out, and things aren’t moving as fast as people thought they would, because they underestimated how much work it all was. So they say: can we just leave the automation for later? The problem is that the automation itself is an enabler for speed. It helps you do experiments, it helps you learn and test and prove that you’re doing the right thing, and it helps you make changes. Building a system is a series of changes, so you want a reliable way to make those changes as you’re building. And adding the automation afterwards tends not to happen; people say they’ll come back to it, but they don’t.
How many people have tried to add unit tests to an existing codebase that didn’t have them? How many found that really easy to do? OK, one person. Good. It tends to be really hard. If you actually end up doing it, you end up rewriting a lot of your code to make it work, because testability affects the design. Just as unit tests require you to be able to instantiate an object, or a small collection of objects, independently and run tests against it, infrastructure is the same kind of thing. If you built the infrastructure as one big thing, if we did the quick thing and went in and used the console or whatever to build it by hand, then automating it probably means rebuilding it from scratch, with a significant amount more work.
So the key here is really to do it incrementally. Don’t try to build all the automation for all the things first, but also don’t try to build the whole system first. Ask: what’s the quickest bit of value we can deliver to our users and get into their hands? What’s the simplest thing we can do to make that happen? And how can we build just enough of the tooling, just enough of the infrastructure, and just enough of the pipelines and tests to get that in, so that you can see some results? And you see results on both sides: the code that generates direct business value gets into people’s hands and you start testing it, and you also learn how your systems and your automation work, especially if this level of automation isn’t something you’ve been doing up to this point. You start working out a good way to do it, what works for you and what doesn’t, and that helps you with the evolution of it.
