Zero to data infrastructure – Brad Urani | #LeadDevAustin 2018

So my name is Brad Urani, I'm a staff engineer at Procore. Thank you for that lovely introduction. This is a talk about data. Specifically, do you or do you not want to build your own data infrastructure, take control of your own data, build your own solution for piping and collecting and storing this data? The world of data is so replete with cheesy quotes and case studies that I'm sure you've all seen this.
With the power of big data, American Express can now identify 24% of accounts that will
close … we all have heard things like this. I think this must be a British person, data
are becoming the new raw materials of business. This must be a Texan? Right? Should I do a
Texas accent? Information is the oil of — sorry, I won't do that. Big data is the secret to prevent … you get the idea. Of course when we talk about data, we have to talk about what kind of data, right? We've got a million things we can measure; we're inundated with data. It's all over the place, and the keyword in figuring out what we want to measure, what we want to report on, is value. That is our word of the day: value. Data is how we measure how much value we're delivering, and collecting and analyzing it is how we identify it. As the lead developer, it is your job to make the best use of the resources at your disposal, and almost invariably, the most precious resource you have is the time of your developers. And you can't make those choices if you don't know what your users want and if you don't know what they're doing. OK, one more cheesy quote, last one I promise: without data, you're just another person with an opinion.
So if you build features iteratively, in an agile manner, you face a choice of what to work on almost every time you fire up your editor. Measuring engagement, deciding what to focus on, how to interpret these numbers, how to drive your decision-making: that's a complicated subject that everyone in the software development business, whether it's lead developers, product managers, or QAs, could benefit from studying. But before we get into all that, we have to collect the data and get it to where we need it so we can report on it. So we're talking about data infrastructure, of course. Yeah, I work at Procore. We make construction management technology. We are based in Carpinteria, California, and our second office is right here in Austin. So if you are building a skyscraper or a shopping mall or an opera house, you're probably using our software to do it. This is an example of what our iPad app looks like. And when I joined
many, many years ago, our data infrastructure looked kind of like this: we have an app, it's powered by a database, and all our analytics go to Google Analytics. I'm going to make a bet here. I'm going to bet that some of you have infrastructure that looks a lot like this, possibly with a bit of this: a little direct SQL on your production database, right? The reason I make that bet, and I know that, is because it has been the case at every single tech company I've ever worked at. And honestly, this is not bad. This is perfectly fine. Google Analytics is a great tool. In some cases it is probably the best analytics tool that I know of. But at some point you may get to a point where it doesn't do everything you need, and you may have to figure out a solution to build yourself. Brad's first law of data infrastructure,
right, is: if you don't need it, don't build it. If you have not unlocked all the value of what you have here, you do not need to embark down this path. If you're thinking of building an end-to-end data pipeline, before you start, be very clear about what your use cases are, what value it delivers, and what you can get from your custom solution that you can't buy off the shelf, right? Because I know everyone here understands that software development is best done in agile ways, in units of incremental delivery, but you'd be surprised how many people way over-engineer and go waterfall. Maybe it's the lingo, "big data," "data warehouse," that makes people think they have to undergo this. So why would you need to build your own? Well,
there are some shortcomings to using a third-party hosted solution, the Google Analytics of the world and their various competitors. It is immutable. You send off your data and you cannot change it. You oftentimes cannot delete things. You cannot update things, right? If you forgot to send additional fields, you cannot go back and add them. You're out of luck if somebody changes a company name or an email address; you miss out on a lot of these really, really valuable things if your data is locked up in somebody else's solution. It cannot be joined. Chances are you have more than one source of really valuable data, perhaps in a marketing system, or chat systems, or A/B testing platforms. When your data is all siloed in separate places, you cannot unlock the value that it can bring to you. There are data sovereignty laws and security requirements, perhaps foisted on you by customers, government agencies, regulatory parties and the like, that may prevent you from using a third-party solution like that at all. And some people are really averse to cookie tracking
and things like that, so it may not be an option for you. A little side note: deleting is hard. It is very easy to collect data. Sometimes it is nearly impossible to delete it when it gets caught up in message brokers, multiple databases, buckets in S3. So if you have a requirement that someday you may have to delete someone's data, you need to be very, very careful how you plan and build these things and make sure you can do that. And
there’s more, right? If you take control of your own data, you can use it to train machine
learning models. You can drive custom UX experiences, right? You can give the data back to your
customers and let them see a dashboard of their own usage. All very, very super-valuable
things. So if we decide we like all those things, we’re going to embark on this journey,
you’ve got your super team of superheroes. Does anyone have kids who watches this show,
Shera? This is so awesome. My. The plot is very complicated, but it is really, really
cool. It’s fantastic. So if you have this crack team of superheroes, right, developers,
QA, someone who knows Linux, and somebody whose sole job is to write IM possibilities.
Then a by all means jump in and do this. For instance, we have to choose some technologies,
and this is complicated, right? This is Matt Turck's big data landscape, and they compile it every year. These tools are immensely complicated and overlap in all kinds of confusing ways. For instance, there's Apache Spark, but Kafka also has an embedded database that can do SQL on your stream, and your actual database can double as a batch-processing system. Also, they all have some kind of half-baked machine learning somewhere, and good luck figuring out what is best for what, right? But I will
say that there are a few pieces that you will need to know a little bit about. The first
one being streaming. Streaming is my favorite. So what I'm talking about here when I talk about streaming… ah, the stream, isn't it beautiful? We don't have streams like this in California. I mean we do, there's just no water in them.
[laughter]
Also, the trees are all on fire, right? What I'm talking about are these kinds of things. The most popular is Kafka. I love it; it's what we use. All of the major cloud providers have their own built-in streaming solutions, which are mostly similar, which do have some differences, and which, to be quite honest, are mostly pretty good. We pay for managed hosting on Heroku, and it's great. And there are others. What these services
do is move data from one place to another. So you've got producers: these are applications that write records into a stream. They write to a topic, a related group of data. It could be your mobile app data or your user data or something like that, and on the other end we have consumer applications that read it. I'm mainly discussing Kafka here, but all the cloud-native ones basically work the same way as this. So back to our
example app, right? We've got our app servers, we're serving browsers and phones, we're powered off this database. What you need is a way to reliably deliver the data into your downstream systems in a way that is fault-tolerant, right, so you don't miss things and you don't block requests coming into your servers, and this is where the stream comes in very, very useful. Among its superpowers is that it multiplexes. When you put a message into most message brokers, pub/sub systems, event buses, all that sort of thing, it is gone once it is consumed. A solution like Kafka does not delete the message right away. A Kafka cluster has what's
called a retention period. Ours is set at two weeks, so the message stays there on disk for two weeks. That means you can attach multiple consumers who all consume it, and they do not interfere with each other, unlike other messaging systems, your RabbitMQs, or Sidekiq if you're a Rubyist, and those kinds of things. The message stays in the stream. You may want to send it to some kind of third-party dashboard.
You know, just because you have other things doesn't mean you want to get rid of that. You'll probably write it to your enterprise data warehouse, more on that in a minute. And likely you want to put it in some sort of permanent storage, mainly just as a permanent backup, for peace of mind, because storage is cheap and it's easy. You might want to write it to a search index, and if you have it backed up in S3 and whatnot, you can always go back and rewrite it, right? So we can write this data to all the downstream places we want. You
can even call this your data lake if you want fancy buzzwords. But it also gives you the ability to replay. So once the data is in the stream and has this two-week retention period, you can always replay it from the start of the stream. If you make a mistake downstream and you do not capture all that data, you can go back and replay it and get your data back. Some people use permanent retention, where they just keep adding disks to their cluster, and they can always go back to the beginning of time.
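To make the multiplexing and replay ideas concrete, here's a toy in-memory model of a Kafka-style log. This is an illustration of the semantics only, not a real client: messages are retained rather than deleted on read, each consumer tracks its own offset, and replay is just resetting that offset.

```python
# Toy model of a Kafka-style log: an append-only topic read by multiple
# consumers at their own offsets, with replay by resetting the offset.

class Topic:
    def __init__(self):
        self.log = []                  # retained messages; reads never delete

    def produce(self, message):
        self.log.append(message)       # producers append to the end

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                # each consumer tracks its own position

    def poll(self):
        if self.offset < len(self.topic.log):
            message = self.topic.log[self.offset]
            self.offset += 1
            return message
        return None                    # caught up with the stream

    def seek_to_beginning(self):
        self.offset = 0                # replay everything still retained

events = Topic()
for e in ["signup", "login", "purchase"]:
    events.produce(e)

warehouse = Consumer(events)           # multiplexing: two consumers read
backup = Consumer(events)              # the same stream independently

assert [warehouse.poll() for _ in range(3)] == ["signup", "login", "purchase"]
assert [backup.poll() for _ in range(3)] == ["signup", "login", "purchase"]

backup.seek_to_beginning()             # downstream mistake? replay it
assert backup.poll() == "signup"
```

Note that reading with one consumer never advances the other's offset; that independence is exactly why you can bolt a new downstream system onto an existing stream without disturbing anything.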
It's very cool. The other thing is it acts like a buffer. I searched for cheesy clip art and couldn't find one, so I just used the word. The buffer means the downstream systems don't need to be up for you. If you're writing analytics out of a web app and they go into the stream, all of your downstream stuff can totally break, and it will, and the data is all there, collected in the stream, until you get your consumers back. So it works as an important buffer of your data coming out of an application or data source that you're using. And we've got some guarantees that go along with this. The semantics, or guarantees: the promises
they keep. You may be familiar: streams and these things have semantics, these sorts of guarantees that they offer you. For instance Kafka, my favorite, has this thing called at-least-once delivery. The data is partitioned and read in serial, so you may not move past one message until you have processed the one before it, right? That way, if your consumer ever goes down, if it partially processes a message, or you don't know whether it was partially processed, it will pick up where you left off and make sure you have got that last message. So you never skip a message. You never lose data. And that is very, very, very valuable, and that is not true of certain other systems.
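A small sketch of why that works: the consumer commits its offset only after processing, so a crash between processing and commit means the message is re-read on restart, possibly processed twice but never skipped. This is a simulation of the semantics, not real Kafka consumer code.

```python
# At-least-once delivery, sketched: commit the offset only AFTER processing.
# A crash between the two replays the message on restart, so messages may
# repeat but are never lost.

log = ["receipt-1", "receipt-2", "receipt-3"]
committed_offset = 0                        # durable; survives crashes
processed = []                              # side effects of processing

def consume(crash_after_processing_index=None):
    global committed_offset
    offset = committed_offset               # resume from last committed spot
    while offset < len(log):
        processed.append(log[offset])       # 1. process the message
        if offset == crash_after_processing_index:
            return                          # simulated crash before commit!
        offset += 1
        committed_offset = offset           # 2. only then commit the offset

consume(crash_after_processing_index=1)     # dies after handling receipt-2
consume()                                   # restart: re-reads receipt-2

assert processed == ["receipt-1", "receipt-2", "receipt-2", "receipt-3"]
assert committed_offset == 3                # nothing skipped; one duplicate
```

The duplicate is the price of the guarantee, which is why at-least-once consumers are usually written to be idempotent.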
Other systems have other semantics. The other thing it gives you is guaranteed ordering. This is different from, for instance, your message queues and various other technologies, in that the order things go in is the order they come out. And that can be really important, too. Imagine you're writing data to a database and you have to send a sales receipt and its line item. The receipt better come in first and the line item second. This becomes valuable when you're doing things like mobile push notification systems, to make sure your users get the messages that they are supposed to get in the same order that they were sent in. This is one of the guarantees that it gives you, and Kafka is very, very powerful in that respect. Being a good data engineer is about picking a solution that matches your use case and the semantics that you need.
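In Kafka specifically, that ordering guarantee is per partition, and producers typically route records by key so that related records, like a receipt and its line items, land on the same partition in write order. A rough sketch of that routing, with Python's built-in `hash` standing in for the client's real partitioner:

```python
# Per-key ordering, sketched: order is guaranteed within a partition, so
# keying by receipt id keeps a receipt ahead of its own line items.

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    # Real clients hash the key (e.g. murmur2 in the Java client); the
    # principle is the same: same key -> same partition -> order preserved.
    p = hash(key) % NUM_PARTITIONS
    partitions[p].append(value)

produce("receipt-42", "receipt header")
produce("receipt-42", "line item 1")
produce("receipt-42", "line item 2")

p = hash("receipt-42") % NUM_PARTITIONS
assert partitions[p] == ["receipt header", "line item 1", "line item 2"]
```

Records with different keys may land on different partitions and interleave freely; the guarantee you buy is ordering within a key, which is usually the ordering you actually need.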
Caveats, right? I will tell you, running your own Kafka cluster is very hard. We don't do it, in fact. We pay for managed hosting, even though it's expensive. A few other things I wish I had known when I got started with this: not all tooling is created equal. Data engineering is not a place to be a programming language fanboy, right, or fangirl. Particularly if you're into PHP, JavaScript, or Ruby, the libraries may not be as good as the ones in your solution's sort-of-native language: Java and Scala for Kafka, Python for just about everything else. We discovered this the hard way. We used an open source Ruby library and discovered that it was a little bit behind the times; we were having offsets fail, and we were reconsuming entire partitions, reconsuming weeks' worth of data. So be careful which client libs you use. There's sort of a tower of Babel going on. So pillar No. 2 here, the second little thing,
kind of major technology that, if you're building your data infrastructure, you really need to learn at some point, is the cloud EDW: the cloud enterprise data warehouse. Honestly, I really got into engineering because I loved databases. They're these boxes of algorithms and applied computer science, and you can look at logs and query plans and see how they work. I'm such a nerd, I'd say my favorite page on the entire internet is the one about locking. So come talk to me at the cocktail party if you want to talk about databases and stuff like that. But what I'm talking about here are these kinds of databases. These are hosted
cloud databases, and they might be a little different from what you're used to. Amazon Redshift is what most people go to, and of course Google has BigQuery; all the providers have their solutions. Snowflake is a third-party one that actually runs on Amazon or Azure but is hosted, and it's really great. These are cool databases that offer some fantastic features. For instance, you can take instant snapshots, kind of like a commit log in Git, a directed acyclic graph of file updates. You can take a snapshot every minute, just have your snapshots sitting there, and go back in time if you need them. Snowflake has a cool thing: its separation
of storage and compute. So you've got your data, and it is actually backed by S3, but if different teams or different individuals are going to be using this data, you can assign them different levels of compute. You can actually provision everyone their own compute nodes, and their queries will not cancel each other out or impede each other's performance. Google has machine learning models built right into the database, and they're so good, honestly, that they've started to supplant batch-processing systems, in my opinion. With a good database like that, there's more and more they can do and less and less need for that kind of thing. But I want to talk briefly about how they work.
These are column stores, as opposed to row stores. So in your Postgreses and your MySQLs, right, data is sort of laid out like this. You see the theme of my talk? She-Ra. You see my theme here? Glimmer. She's She-Ra's sidekick. She's kind of cool. If you need to write one row, you only need to write one row, in contiguous blocks. Enterprise data warehouses actually split this up into multiple files. They have a file per column, all with rich metadata, so that if the data you want is not in that particular file, the database engine will not even open it, because it already knows from the metadata what's in there. Notice on the bottom: where we have two identical values next to each other, you can compress them, and in that way save scanning and storage.
What this architecture allows is fast aggregation. If you're doing massive aggregation queries, this is going to be the fastest architecture for you, and it's going to deliver that much faster than a Postgres (maybe not always, and I'll get to that in a second). But that's really sophisticated. What it is not good at is, for instance, fetching a single row, because that data is not next to each other, right? And that is why you typically just batch load. If you only need to select one row, the performance is awful. It's one of those classic tradeoffs, right? Big aggregations are why you use these databases.
Also, you don't need indexes. I know what you're thinking: the plural of index is indices. But the Postgres docs say indexes. Now, that means you can't get these microsecond response times for fetching things, but realize that with the Postgreses and the MySQLs of the world, right, you need to know what queries are going to be run, which you usually do, because they're built into the application. That's how you know what indexes to add. But that does not work if you don't know the queries that are coming. If you have a reporting solution, you don't know what your users will ask for, so you do not have the opportunity to tune things and add the indexes you want. Enterprise data warehouses don't require indexes, and they're much better at that kind of flexible, ad hoc thing where you don't know what's coming. They are expensive, just a warning. Most of them charge by the byte scanned, right? Sometimes you have to sign a contract up front, but the days of tiered pricing and massive up-front contracts are coming to an end. You pay per query with a lot of these, but be careful. That can get expensive really fast.
So kind of the third pillar of data engineering, the third sort of lesson that I learned doing all of this, that I really wish somebody had told me beforehand because it would have saved me a lot of grief, is quality control: making sure that the data that you are giving to people is correct. And it turns out this requires a great deal of care. First of all, quality control is not the same as quality assurance. QA is a role, often a person, maybe on your team; that's about testing software before it goes into production. When I say QC, I'm talking about ongoing, continuous checking of data, such that it will alert and flag you if something breaks, right? Data development is not like feature development. If you are developing a mobile app or a web app and it breaks and doesn't work, someone's going to complain. Someone is probably going to tell you, or they'll just stop using it, right? And eventually you're going to find out and you're going to fix it. With data development, if you make a mistake, it's that somebody in your marketing department got the answer 0.2 when it should have been 0.3. They misallocated funds or something like that, and there's a good chance that no one will ever know. So that means you must be the steward of the data that you send. If it's not QC'd, it's wrong, and you should just assume it's wrong. Back to my Kafka thing: we actually found out we were losing messages and we didn't know it. We were losing about eight percent of the rows that we were writing. And that means that the reports were wrong, and as the lead developer, I realized that was on me. So make sure you validate what you're doing.
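One way to catch silent loss like that is a simple count reconciliation between what the producer sent and what landed downstream. A hypothetical sketch; the counts, tolerance, and function name are all made up for illustration:

```python
# Count reconciliation, sketched: compare how many events the app produced
# against how many actually landed in the warehouse, and alert past a
# tolerance, instead of trusting the pipeline blindly.

def reconcile(produced_count, landed_count, tolerance=0.001):
    """Return an alert string if loss exceeds tolerance, else None."""
    if produced_count == 0:
        return None
    loss = (produced_count - landed_count) / produced_count
    if loss > tolerance:
        missing = produced_count - landed_count
        return f"ALERT: lost {loss:.1%} of rows ({missing} missing)"
    return None

assert reconcile(10_000, 10_000) is None
assert reconcile(10_000, 9_200) == "ALERT: lost 8.0% of rows (800 missing)"
```

Run on a schedule against real source and destination counts, a check like this would have surfaced our eight-percent loss the first day instead of letting wrong reports go out.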
Corollary to Brad's second law: poop goes in, poop goes out. Meri told me not to curse, so that's it. If you put bad data in, you get bad data out the other side. You are the gatekeeper. If a data mistake happens despite your best efforts, despite your double- and triple-checking, no one is to blame. But if it happens because you shipped something, because you put data in someone else's hands that you weren't 100 percent sure about, then you are to blame. You know the source and the pipeline. You and your team have the hunches, intuitions and doubts. You're the only one who truly knows where all these numbers came from, and you are the only one who can vouch that they're accurate. As lead developer, you need to make sure that everyone on the team is part of the QC, and everyone knows it is their sacred duty to only publish true data. No data is better than bad data. I should have said that was Brad's third law.
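Being that steward usually means automated checks. A hedged sketch of two common ones, duplicate detection and spike detection; the key name, thresholds, and sample numbers are invented, and the Slack reporting is left out:

```python
# Two small QC checks: duplicates by key, and a spike detector that flags
# a daily count too many standard deviations from its recent history.

from collections import Counter
from statistics import mean, stdev

def duplicate_keys(rows, key="event_id"):
    """Return {key_value: count} for any key appearing more than once."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

def is_spike(prior_daily_counts, todays_count, max_deviations=3):
    """Flag today's count if it sits outside the normal band."""
    mu = mean(prior_daily_counts)
    sigma = stdev(prior_daily_counts)
    return sigma > 0 and abs(todays_count - mu) > max_deviations * sigma

rows = [{"event_id": 1}, {"event_id": 2}, {"event_id": 2}]
assert duplicate_keys(rows) == {2: 2}   # event 2 landed twice: investigate

history = [100, 104, 98, 101, 99, 102]  # a week of normal daily counts
assert is_spike(history, 500)           # sudden surge: alert
assert not is_spike(history, 103)       # within normal variation
```

In practice these run as scheduled queries against the warehouse, and anything they flag gets posted to a channel where a human has to look at it.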
Monitoring. Once the data lands in the database, you need to run checks. Run checks that look for duplicate records and report the counts in Slack or something like that. Look for new types of metrics that have arrived and make sure you were expecting them to arrive. Run queries automatically and report the results to Slack, so that if there was a metric that used to be reported and one day it's gone, you can get in there and investigate why, right? Look for spikes: get alerted for events that go above a certain number of deviations, so you can figure out why that is. Otherwise it's too easy for things to slip by that you didn't notice. A culture of checks and balances is
what I'd like to call it, I think. And this extends beyond your immediate squad, because eventually, if you're a big enough company, the metrics are going to go to someone in another department. I always want to make sure that the quality control process does not end with us. It is the responsibility of the downstream analysts and business people and whoever else is consuming this data to also approach it a little bit skeptically and understand where the numbers came from, too. A culture of checks and balances is required for your whole organization. Any time you use data, it's great to instill a sort of shared ownership right across your organization. Sometimes people hear all this, right, these quality control checks, and they revert to waterfall kind of things, I've noticed.
But don't let this happen to you. I believe that data solutions can be delivered in an iterative, agile fashion, and that the best dashboard is the first one, which is literally just one metric and one SQL query on the other end. Do not let the need to do real, thorough QC turn this into a waterfall process for you. You've lost your way if that's what it becomes. Overall, I kind of boil data engineering down into three segments: culture, fundamentals, and technology. The fundamentals are things like the semantics: are they at-least-once, exactly-once, or could they possibly skip messages? Different solutions have different promises, different guarantees, different semantics. And finally there's the technology itself. You have to get it running, learn about the security protocols, learn how to set it up, how to monitor it, how to make it function. If you did all this... oh, a good question is: do you need experience on your
team if you're going to undertake data engineering? Do you need someone who's done it before? It will definitely save you some grief, but every day these cloud platforms make it a little easier, every time there's another re:Invent conference or something. I would say, if you're brave and bold and know what kind of value you're after, it is not out of reach for teams that don't have experience with this kind of tech to jump right in and do it. A few books that I really like: Designing Data-Intensive Applications by Martin Kleppmann is fantastic. I really cannot rave enough about this book. Agile Data Warehouse Design is a fantastic book; it was really, really wonderful. I'm Brad. I work at Procore in California, and oh, I've got a surprise announcement for you. For the first time ever, I get to announce that we are now hiring engineers in Austin: SRE, DevOps, mobile, web, machine learning, you name it. I'd love to chat with you if you are interested. Thank you very much.
