Charity Majors is the founder and CTO of Honeycomb, provider of observability tooling for modern engineering teams to build resilient production software that delights customers and reduces toil. Charity tells Beyang about how Honeycomb derives its definition of observability for software systems from its original definition in control theory, and how observability differs from monitoring and logging. She shares war stories from her time keeping systems online at Facebook and Parse, gives her predictions about how the landscape of observability and monitoring tools will evolve, and discusses how developer tools can make programming more accessible to everyone.
Charity Majors, CTO and founder of Honeycomb: https://twitter.com/mipsytipsy
Christine Yen, CEO and founder of Honeycomb: https://twitter.com/cyen
Linden Lab: https://en.wikipedia.org/wiki/Linden_Lab
Observability in control theory, as the mathematical dual of controllability: https://en.wikipedia.org/wiki/Control_theory#Controllability_and_observability
Testing in Production: https://www.honeycomb.io/blog/testing-in-production/
The Four Tendencies, by Gretchen Rubin: https://www.amazon.com/Four-Tendencies-Indispensable-Personality-Profiles/dp/1524760919
Other companies founded by ex-Facebookers, partly inspired by tools at Facebook:
DORA report and DORA metrics: https://cloud.google.com/devops/state-of-devops/, https://thenewstack.io/dora-2019-devops-efforts-improving-but-not-done/
10x engineer (trope): http://svdictionary.com/words/10x-engineer
Beyang Liu: All right, I'm here with Charity Majors, former engineer at Facebook and Parse, now co-founder and CTO of Honeycomb, and, I believe, the person who first coined the term observability as it relates to DevOps and modern application development. Charity, welcome to the show.
Charity: Thanks for having me. I think that I was the first person to give observability a specific technical meaning. People have been using observability to refer to generic telemetry. And of course there's the long and storied history of observability as a mathematical tool for controllability. But I think that we at Honeycomb were the first people to sit down and go, "Okay, this is about the unknown unknowns. And if you accept that definition, then what else proceeds from there?"
Charity: So we'll accept that part.
Beyang: Thank you for the clarification. And I think we have a lot to talk about today. I want to dive into observability and Honeycomb and your thoughts on a lot of different things. But before we get into that, we'd like to kick things off just by asking people to tell us a bit about their backstories. How did you get into programming and engineering? And what's been kind of your brief life story as an engineer, that's brought you to this point?
Charity: Oh, I have a weird and wandering road. I was homeschooled in the backwoods of Idaho and I went to college on a piano performance scholarship when I was 15. It was really just like a, "Get me out of here. Oh my God, I haven't seen another human being in six months. I'm going insane." But I got to college and discovered computers there. Because I noticed pretty quickly that all of the music majors were poor and I was very tired of being poor. So I swapped keyboards and I've just kind of been tinkering ever since. I love it. I really loved it. There are so many people with non-traditional backgrounds in engineering and it makes me a little sad that I feel like the doors are closing a little bit, that there are these certifications and these qualifications and stuff. When I got started, it was really just like, everyone was so desperate.
Charity: If you could write a... If you knew how to launch Emacs, you were hired. Professionalization is a good thing. But there weren't many growth industries in the entire world, I guess, through my lifetime, and computers have been that. It's been a real source of opportunity for weirdos.
Beyang: Yeah, definitely. I think at least in my experience, I've found a fairly weak correlation between official certifications and even degrees and skills.
Charity: Oh yeah. No. It's not about being able to do the work. It's about cover your ass, as a hiring manager, totally.
Beyang: Yeah. Yeah. Well, I certainly hope that-
Charity: I feel like we could rant about this for a long time.
Beyang: This is like a whole other podcast.
Charity: Oh boy.
Beyang: Yeah, definitely. I'm actually just a little bit curious about, you said your original, I guess academic pursuit was piano performance. Are there any favorite pieces or songs that you still like from back in the day?
Charity: I love Rachmaninoff. I really loved the Romantic composers. I did music and then I did Latin and Greek and then I did electrical engineering. And then I got a job as assistant admin for the math stat department. This is back when they were still giving kids root on campus, which was a terrible idea. What were they thinking?
Beyang: So much power.
Charity: Oh my God. Then I had root for the entire campus and I kind of slowly stopped doing schoolwork and started working more and more. And then I got offered a job in Silicon Valley for like $75,000. I had never heard of anyone making that much money and I was just done. "I will go to that." And I've been down here ever since. But I feel like I've been really lucky in my career. I've worked with some amazing people. It's a small, small Valley and there's some people I've been working with over and over, off and on, for 15-odd years now.
Charity: I've had some shitty jobs, but they were mostly pretty brief and not... They weren't emotionally traumatizing. They were just boring. But I feel like I had a job... I worked at Linden Lab for five years, straight up my first real job. And that set the bar for me so high. I feel like Christine and I, once in a while, we'll just look at each other and go, when we're feeling like we're failing at everything, we'll be like, "If we can just create a job for the people who work here that sets their bar high so that they don't settle for bullshit, then it'll all have been worth it because there's really no excuse for accepting shitty jobs." Our skills are in far too much demand.
Beyang: Yeah, definitely. I mean, the part, I think that gets everyone into programming is the creative part of it. You're creating these beautiful abstractions and algorithms and data structures. That's what I think we should always be striving toward.
Charity: Or the impact. I feel like dev ops is this noble attempt to tear down a wall that never should have existed. Right? It's like the wall that you throw your code over. This idea that you're done with your job once you have merged your changes to master, right? That's terrible. Not only is it bad for the code and the systems and reliability, it's bad for you because it decouples you from the outcome and the impact of what you've built. And I feel like all we're trying to do is stitch back this feedback loop so that the people who are experiencing the pain are the ones who are empowered to fix it and who have all the context fresh in their head and just getting that feedback loop going... Ops has a well-deserved reputation for masochism.
Charity: The point here is not to invite everyone to be masochists. The point is that this actually makes things better. That it shouldn't have to be painful to support these systems and I definitely believe that.
Beyang: Yeah. I think the way that you describe dev ops is interesting because when you talk about it, it's like this thing. In the beginning dev and ops were one, and then there was this schism, and now we're trying to kind of put the pieces back together and then rebuild that feedback loop, as you said. But before talking with you, I feel like for most of the people that I spoke with, it's kind of a new term, right? Dev ops is this new thing, we're merging these two separate and distinct things, dev and ops.
Charity: That could all... It's just getting back to the way it once was. Right? When we were all happily editing source code files live in production [crosstalk 00:08:28] root, the good old days. Yeah. I mean, you can understand why specialization emerged because of complexity and because it just became impossible. Right? And so the trick, I think the needle that we have to thread moving forward is to allow for specialization, but not to lose sight of that feedback loop, that really critical heartbeat of shipping code to your users and understanding what you've built.
Beyang: Yeah. I think that's a good segue into observability. So, first of all, what is observability and how does it help glue back together those two sides of software development?
Charity: Yeah, totally. Observability is basically just at a high level, it's being able to ask any question of your systems, understand any state that the system's gotten itself into without having to have prior knowledge of it, without having seen it break before and without shipping any new custom code to handle the question that you're trying to ask. Because that implies that you could have predicted what question you were going to need to ask, right? If you think back to the telemetry we've had to date, metrics and logs, right? With logs, you can always find what you know you have to go look for. And if you don't know to log it, and if you don't know to look for it, well, you're screwed. Right? And with metrics, you're just like, there are always little counters and stuff that are firing as your code is executing.
Charity: It might fire off a couple hundred metrics while one request is executing, but none of them are connected. They're not tied together because there's no connective tissue. That means you can't ask all these new questions, like, "Oh okay, this metric spiked." Well, what else do the things that spiked have in common, right? There are just all these things where, because you didn't collect the data in the right way, you can't ask those questions. You can't understand these complex states. So observability is... The term comes from control theory, which has its roots in mechanical engineering.
Charity: When we were building Honeycomb, it was non-trivial. We've had to build our storage engine, query planner and everything from the ground up and support these data structures. Far more difficult than building the technology was figuring out how to talk about it because every term in data is so overloaded. Every data tool's demo looks exactly the same. They're all like, "Oh." And six months in, I was still trying to... I knew we weren't a monitoring tool because monitoring is very reactive, right? You define these arbitrary thresholds and then you just monitor the thing, just check over and over. Is it up? Is it up? Right? And we really weren't that.
Charity: It wasn't till six months through that first year that I saw the term observability. I think it was from the Twitter team, observability. I looked up the definition and I just looked... I just had light bulbs going off in my brain, just like, "Oh my God, this is what we're trying to do." So the reason we were building this, backing up just a little bit: Christine, my co-founder, and I both come from Parse, the mobile backend as a service. At Parse, we had about 60,000 mobile apps in our backend when we got acquired by Facebook. And that was around the time that I was coming to the horrified conclusion that we had built a system that was basically undebuggable, by some of the best engineers I've ever known doing all the "right things." But a few times a day, it'd be like, "Disney says their app is down." I'd be like, "Well, behold my wall full of dashboards. Everything's fine. It's all green." Right? Because maybe Disney's doing like four requests per second and I'm doing 3000 requests per second.
Charity: It's never going to show up in my time series aggregates or whatever it is. So it could be an app that's [inaudible 00:12:45] down, whatever. So I'd have to go and figure out what was going on. Very brute force, manual labor process, because you've got your top 10 lists and you've got the questions that you defined in advance to monitor. And then if those weren't the problem, you're looking for a needle in a haystack. And you're looking for a needle in a haystack when, okay, so Disney thinks their app is down. Well, it could be something that they did, something that we did, some combination of the two, or because we're using these big pools of Unicorn workers, shared databases.
Charity: It could be that any one of those other 60,000 mobile apps is doing something that caused a starvation of resources on any one of those pools of resources. It's just literally impossible to figure out what the fuck is going on. I tried every tool out there. The first glimmer of hope that we had was we started feeding some datasets into this tool at Facebook called Scuba, which is an aggressively hostile tool. Like it's not fun to use. But it does one thing really well, which is it lets you break down by dimensions of high cardinality. So if you've got 60,000 mobile apps, it'll let you break down by that app ID. And then by whatever else you want. This was like, I didn't really get why at the time, but this is a core pillar of what it means to have observability: high cardinality. Because, if you're looking for a needle in a haystack, what is going to be the most identifying information?
Charity: It's going to be any unique ID, right? And with everything out there that's built on top of metrics, you can't have high cardinality dimensions in tags. You could have maybe a hundred and then it's just like your cutoff. You're going to explode the key space and you can no longer even tag them with that data.
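A minimal sketch of the "everything's green" problem described above, assuming a hypothetical list of per-request events with an `app_id` field (the field names and numbers are illustrative, not Parse or Honeycomb data). The overall error rate barely moves, while breaking down by the high-cardinality `app_id` shows one small app failing completely.

```python
from collections import defaultdict

# Hypothetical request events: one dict per request, tagged with app_id.
events = (
    [{"app_id": "big_app", "ok": True}] * 2990
    + [{"app_id": "big_app", "ok": False}] * 6
    + [{"app_id": "small_app", "ok": False}] * 4
)

def error_rate(rows):
    """Fraction of rows whose request failed."""
    return sum(not r["ok"] for r in rows) / len(rows)

def error_rate_by(rows, key):
    """Group rows by a field and compute the error rate of each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {k: error_rate(v) for k, v in groups.items()}

overall = error_rate(events)                # the dashboard-style aggregate
per_app = error_rate_by(events, "app_id")   # the per-app breakdown

print(f"overall error rate: {overall:.2%}")
for app, rate in per_app.items():
    print(f"{app}: {rate:.2%}")
```

The aggregate stays well under one percent, so a threshold alert on it would likely never fire, even though every request from `small_app` is failing.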
Beyang: Yeah. So, high cardinality, help me understand-
Charity: It's the unique items in a set. So you have a collection of 100,000,000 users, your social security number is the highest possible cardinality you can get. Last name and first name are very high cardinality. Gender is very low cardinality. So if you're searching for somebody and you're searching by gender, that's going to not be super useful to you.
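As a rough illustration of that definition, cardinality can be computed as the number of distinct values a field takes across a data set. The records and field names below are made up for the example.

```python
# Hypothetical user records; the values are invented for illustration.
users = [
    {"user_id": 1, "last_name": "Ng", "gender": "f"},
    {"user_id": 2, "last_name": "Okafor", "gender": "m"},
    {"user_id": 3, "last_name": "Ng", "gender": "m"},
    {"user_id": 4, "last_name": "Silva", "gender": "f"},
]

def cardinality(rows, field):
    """Number of distinct values observed for a field."""
    return len({r[field] for r in rows})

def avg_matches_per_value(rows, field):
    """Average number of rows you get back when filtering on one value."""
    return len(rows) / cardinality(rows, field)

for field in ("user_id", "last_name", "gender"):
    print(field, cardinality(users, field), avg_matches_per_value(users, field))
```

The higher the cardinality, the fewer rows each value matches on average, which is why unique IDs are the most useful fields to search by.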
Beyang: You're going to get a lot of results and you're not going to be able to focus down and drill down.
Charity: Exactly, exactly.
Charity: And so this tool at Facebook let us break down by these high cardinality dimensions. Suddenly we could... And if you think about all of the questions out there that you as a software engineer want to answer, they're often answered by chaining together many of these high cardinality... So it's high cardinality and high dimensionality. So it's like this bug is only tripped when it's on a user using this version of iOS, using this version of the firmware, using this version of the app, using this region, using this language pack, using, you know, like every single one of these is high cardinality. That's the only way to like zero in and track it down. But once you can do that, it becomes really easy, really dead simple.
Charity: This is what I realized was the... The way that we debug right now is, we form these castles in our minds and we guess what the answer is and then we go look for evidence of that answer. Well, instead of that, if we had a tool that just let us put one foot after the other and follow the trail of breadcrumbs, so that you don't have to know what the answer is, you don't have to know where you're going to end up, you can just start looking. So, for example, I see a spike. What's wrong? Well, I don't know what's wrong, but I can go and break down by endpoint. Which ones are slow?
Charity: Oh, it looks like all the write endpoints are slow. Is it all of them? No. It's just the ones that talk to this particular backend. Is it all of them? No. It's just the ones to that shard or those two shards. What do they have in common? Well, the primary... So it's just like you're just... It's almost like bringing science back into computer science. This is why I feel very strongly about things like testing in production. I feel like, all right, I don't know how much you want to expand this rant to, but-
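The breadcrumb-following loop described above (break down, pick the odd group, filter, break down again) can be sketched over raw events roughly like this. The event fields (`endpoint`, `backend`, `shard`, `ms`) and the helper names are assumptions made for the example, not a Honeycomb API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-request events with latency in milliseconds.
events = [
    {"endpoint": "/read", "backend": "db-a", "shard": "s1", "ms": 20},
    {"endpoint": "/read", "backend": "db-b", "shard": "s3", "ms": 25},
    {"endpoint": "/write", "backend": "db-b", "shard": "s3", "ms": 30},
    {"endpoint": "/write", "backend": "db-b", "shard": "s4", "ms": 35},
    {"endpoint": "/write", "backend": "db-a", "shard": "s1", "ms": 900},
    {"endpoint": "/write", "backend": "db-a", "shard": "s1", "ms": 950},
    {"endpoint": "/write", "backend": "db-a", "shard": "s2", "ms": 40},
    {"endpoint": "/write", "backend": "db-a", "shard": "s2", "ms": 45},
]

def mean_latency_by(rows, key):
    """Break down: average latency for each value of one field."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r["ms"])
    return {k: mean(v) for k, v in groups.items()}

def where(rows, **conds):
    """Filter: keep rows matching every field=value condition."""
    return [r for r in rows if all(r[k] == v for k, v in conds.items())]

# Step 1: which endpoints are slow?
by_endpoint = mean_latency_by(events, "endpoint")
# Step 2: among writes, is it every backend?
by_backend = mean_latency_by(where(events, endpoint="/write"), "backend")
# Step 3: among writes to db-a, is it every shard?
by_shard = mean_latency_by(where(events, endpoint="/write", backend="db-a"), "shard")

print(by_endpoint, by_backend, by_shard, sep="\n")
```

No step needs a guess about the final answer; each breakdown only decides which filter to add next.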
Beyang: Keep going.
Charity: Testing in production is inevitable. It is something that all of us do. We have to do it. It's called reality. I feel like TDD was great. Most successful software movement of my lifetime, but it got predictability and repeatability at the sacrifice of everything interesting. You're just like, everything interesting is now mocked. Tests, like, end at the edge of your laptop. And that means that everybody should know better than to think that their staging environment is going to resemble production. Right? Instead of writing code until your tests pass, if instead you write code while instrumenting with an eye towards how you're going to understand, is this working or not, in production? And then if you have everything automated so that when you merge to master, it gets out to production in a matter of minutes, then you have muscle memory. You just go and look, is it doing what you expected it to do? Does anything else look weird?
Charity: Closing that loop right there, you will catch 80 or 90% of all problems before your users ever even catch a whiff of it. It is weird that we aren't doing this. That's how we used to write code. Right? We would write code right there on production, hit save, reload the browser, see what happened, right? That virtuous feedback loop has gotten all broken up and tossed to the four winds. And we've lost that really tight, really grounded-in-reality form of testing.
Beyang: Yeah, totally. And a lot of what you're saying really resonates with me because, just to be clear to our audience Sourcegraph is a Honeycomb customer. And we use it for one instance of Sourcegraph, which is the most important instance, sourcegraph.com. But we can't use it for a lot of our on-prem customer instances because, they don't want to send their data over to the cloud. And the kind of difference in the experience of debugging an issue on sourcegraph.com versus one of those on-prem instances is quite different.
Beyang: When we don't have Honeycomb available, it's like, okay, there's a Prometheus alert that's firing. It indicates that an endpoint is taking longer than we expect. Okay. Then the question is like, okay, let me kind of pin that down to a specific issue. And that's a fairly big jump to have to make. Because then you have to like dig into the logs. You might like try to reproduce it in the UI, open up Jaeger to try to capture the trace as it happens. And it's just kind of this song and dance. Whereas I feel like with sourcegraph.com, we can answer a lot of those questions just within Honeycomb, because we're... Like you said, it's providing that high cardinality data set to you and you can just kind of go in and explore the data instead of having to jump to different tools to answer the question.
Charity: This is an experience that very few engineers have ever had. This is why debugging... This is why we think computers are hard is fundamentally because we have so little visibility and insight into how they work. And we don't have the tools to even ask pretty big. The experience of after you figured it out, it's the most obvious thing in the world, right? Those categories of problems should never have been hard. It's just that you had to almost figure it out from first principles by reasoning about it in your head. That's insane. We can't do that. We're not good at that. It's so much easier when you just bring it up into the open where you can just watch it, you can ask simple questions, you can see what's happening and then you don't have to guess you don't have to model the entire caching ecosystem in your brain just to make a reasonable guess about what's going to happen next.
Beyang: Yeah. Now, let me kind of take a devil's advocate position here a little bit. And so, earlier you were talking about the importance of high cardinality in observability tools. What would you say to the skeptics who say that high cardinality is great, but you're never going to have kind of infinite cardinality and this whole promise of granting... Giving someone access to a data set that represents everything that's happening in production. That's never going to be the case. There's always going to be something that you're going to miss because you haven't instrumented your app to track that particular event before. And so it's... You're always going to go back and forth between like, "Oh, this thing is only going to be..."-
Charity: It's not really about always being able to find exactly what's wrong. It's much more about being able to find out where exactly the problem lives, right? You might not be able to. So from the perspective of your Honeycomb dataset, you probably can't... If there's a firmware bug that's causing 10% of host to blot. Honeycomb is not going to tell you that. But it's going to tell you exactly which subset is erroring and what they have in common. Right? It's like the hardest problem in distributed systems is not debugging the code. It's figuring out where is the code that you need to debug? This is where it's so interesting, because with distributed systems recently, we've taken what used to all live inside the application monolith and we've just blown it up.
Charity: So now we're hopping the network all over. So you can no longer attach a debugger and just step through your code. Right? If you want to follow your code logic, you actually have to do all this operational stuff and hop from machine to machine. So, I feel like there is still... Hardware is not yet cheap enough that we can pump the output of a debugger run of all of our processes into something that's tractable. No, but what we can do is say, this is the conditions under which the problem happens. That's enough for you operationally to get to a known good state and be stable. Right? Resiliency is not about... It's not about making it so that things never fail. It's about making it so that lots of things can fail and your users don't notice. Right?
Beyang: How prescriptive are you about what sort of things your users and customers should track? Obviously I could in theory, send anything over in the metadata that a company is like an event and-
Charity: Well, pretty. I would say about a third of the magic and power of observability is in the gathering of data and storing it in the right way. With metrics, like I said, there's maybe 200 different verbs of data there, but they're all blown up and separate from each other. You can't work backwards. You can't go back from 200 metrics to an event, but you can go from an event to 200 metrics. So, it's very fundamental to observability that the source of truth is these arbitrarily wide structured data blobs, from which you can derive logs or metrics or traces or whatever. And it's important that they be arbitrarily wide because anything with schemas or indexes or anything is, again, locking you in. You're saying, "I am only ever going to want to know these facts about my system." Which is kind of anathema, right?
Charity: You want to be able to toss in more detail whenever it occurs to you, "Oh, this might be interesting." And you want to gather it up basically one blob per request, per hop. So, your request enters the system, maybe hits the API server. That's one blob. And maybe it bounces off to hit the payment service, that's another blob. When the request enters a service, we initialize an empty Honeycomb blob, and then we pre-populate it with everything that we know or can infer about the system, the language internals, the parameters that were passed to it, all the basics. And then while it's executing in your service, you can effectively do a printf, just print any details that you want into it.
Charity: What you want to capture is any unique IDs, anything where you're like, "Oh yeah, somebody's going to file a bug someday and I'm going to want to be able to find it by this." Right? Shopping cart ID, et cetera. Just stuff a million. And then when the request is ready to exit the service, it just ships that off as one very wide structured blob. And we find that a maturely instrumented service will have usually 300 to 400 dimensions. It'll be 300 to 400 verbs-wide, so to speak. That just seems to be like where they stabilize. That's way more than you can keep track of in your head. But that's fine, right? Because, well, for example, with BubbleUp, if you see a spike and you're like, "Ah, what's going on here?" Well, you just draw a little bubble around it. And then we precompute for all of the dimensions, both inside the circle and outside the circle, and then we diff them and then we sort them, so that the ones that are different come to the top.
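A small sketch of why the direction matters, using made-up events and metric names: collapsing events into independent counters is easy, but once that is done the counters can no longer say which requests had which combination of properties.

```python
from collections import Counter

# Hypothetical wide events, one per request (fields are illustrative).
events = [
    {"endpoint": "/pay", "status": 500, "user_id": "u1"},
    {"endpoint": "/pay", "status": 200, "user_id": "u2"},
    {"endpoint": "/cart", "status": 500, "user_id": "u3"},
    {"endpoint": "/cart", "status": 200, "user_id": "u1"},
]

def derive_metrics(rows):
    """Event -> metrics: collapse events into separate, unconnected counters."""
    metrics = Counter()
    for r in rows:
        metrics[f"requests.endpoint.{r['endpoint']}"] += 1
        metrics[f"requests.status.{r['status']}"] += 1
    return metrics

metrics = derive_metrics(events)

# The counters say there were two /pay requests and two 500s, but not how
# many /pay requests were 500s, or for whom. The events still can:
failing_pay_users = [
    r["user_id"] for r in events
    if r["endpoint"] == "/pay" and r["status"] == 500
]

print(dict(metrics))
print(failing_pay_users)
```

Here the counters are equally consistent with zero, one, or two failing `/pay` requests, which is the information that is lost when the event is split apart.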
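One way to picture "one blob per request, per hop" is a context manager that opens an event when a request enters a service, lets the handler add fields as it runs, and emits a single wide record on exit. This is a sketch under stated assumptions, not Honeycomb's SDK: `request_event`, the `sent` list standing in for a network sender, and all field names are hypothetical.

```python
import json
import time
from contextlib import contextmanager

sent = []  # hypothetical stand-in for shipping events to a backend

@contextmanager
def request_event(service, **known):
    """Open one wide event for a request in this service and emit it on exit."""
    # Pre-populate with whatever is already known when the request arrives.
    event = {"service": service, **known}
    start = time.monotonic()
    try:
        yield event  # the handler adds fields, printf-style, while it runs
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        sent.append(json.dumps(event))  # one structured blob per request, per hop

# Usage: add any identifying detail that might matter later.
with request_event("api", endpoint="/checkout", method="POST") as ev:
    ev["user_id"] = "u-123"
    ev["cart_id"] = "c-456"

print(sent[0])
```

Because the event is a plain dictionary with no fixed schema, adding a new field later is a one-line change in the handler.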
Charity: So if it's like, "Oh, this spike, these requests are different in these five ways." Well, you can just see that at a glance. You don't have to actually keep a dictionary in your head and refer to them. There should be very little friction to toss them. People can do this. And I recently found out actually that Amazon follows an internal logging spec that is almost exactly like this.
Beyang: Interesting and has for like a decade. It was really difficult for us to figure that out. I wish they would have just open sourced that so we could have leaped from that year.
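The "diff the inside of the bubble against the outside, then sort" idea can be approximated as below. This is only an illustration of the general technique, assuming non-empty lists of hypothetical events; the scoring rule (largest gap in value share) is a simple stand-in and not a description of how Honeycomb's BubbleUp is actually implemented.

```python
from collections import Counter

def value_shares(rows, field):
    """Share of rows taking each value of a field (rows must be non-empty)."""
    counts = Counter(r.get(field) for r in rows)
    return {value: n / len(rows) for value, n in counts.items()}

def rank_dimensions(selected, baseline):
    """Sort fields by how differently their values are distributed."""
    fields = {key for r in selected + baseline for key in r}
    scores = {}
    for field in fields:
        inside = value_shares(selected, field)
        outside = value_shares(baseline, field)
        # Score: the largest gap in share for any single value of this field.
        scores[field] = max(
            abs(inside.get(v, 0) - outside.get(v, 0))
            for v in set(inside) | set(outside)
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical events inside the selected spike versus everything else.
selected = [
    {"app_version": "2.1", "region": "us"},
    {"app_version": "2.1", "region": "eu"},
]
baseline = [
    {"app_version": "2.0", "region": "us"},
    {"app_version": "2.0", "region": "eu"},
    {"app_version": "2.1", "region": "us"},
    {"app_version": "2.0", "region": "eu"},
]

ranking = rank_dimensions(selected, baseline)
print(ranking)
```

With this data `app_version` rises to the top because the selected requests are all on one version, while `region` is distributed the same way inside and outside the selection.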
Charity: Yeah. But no, it's really important that you gather things up that way, because that allows you to ask all of these novel questions later on, that you may not have ever predicted that you might want to ask or associations that might not have been obvious. Right? Because it's all about making it visible and easy for the human user to figure out what's important to them, which means it's about aligning you with your users. Right? And that aggregating it around the request means that you'll be able to see exactly what kind of experience your user is having. That's another core pillar of observability honestly, is shifting away the emphasis from the systems and the infrastructure to the user's experience.
Beyang: Because in order to kind of gauge what the user experience is, you really have to pull in the kind of a variety of data, right? It's not just a single log line, it's not just a single-
Charity: Yeah. It's all about from the perspective of that request, was it... We don't actually care if the cache server was down or something was down. What we care about is the ability of each request to exit or not. Right? Versus, if you're doing traditional monitoring of your infrastructure, all you have is aggregates. You can't ever actually trace that back and figure out what anybody's actual performance was like.
Beyang: Yeah. That makes sense. Kind of taking a step back here for a moment, one of the things that I struggle with, and I imagine a lot of listeners also have kind of dealt with in the past: there's just like a lot of different tools in kind of the space of observability, monitoring, log aggregation, that sort of thing. So like APM tools, Datadog, New Relic, you got distributed tracers like Zipkin, Jaeger, application level monitoring tools like Sentry, and then there's the-
Charity: And it's in a really confusing state right now. And the reason is because they're all in the process of converging. I think over the next three, four, maybe five years, you're going to see the collapse of APM, monitoring, metrics, log aggregation, except the security use case I think might hang on as its own thing for a while. But you can see this already starting to happen in the acquisitions that they're doing. Splunk acquired SignalFx and Omnition. Right? It's partly an artifact of this three pillars myth. There's no such thing as pillars in observability because those are just data streams. They're just data-
Beyang: What is the three pillars?
Charity: This is something that all of the big players like to say, is that there are three pillars of observability: metrics, logs and traces. They just happen, conveniently, to have a metrics product to sell you, a logs product to sell you and a tracing product to sell you. And not only is this wasteful, it is worse than that because you should only have to pay to store that data once, not three times. If you're paying to store it three times in these prematurely optimized formats, then you have to have a human in the middle, who's just sitting there, copy pasting from one to the other, right. This is where you've got people... You'll sit and look at your dashboards, "Oh, there's a spike. I wonder what's going on."
Charity: So then you have to turn over to their logging and it's a completely different data set to dive in and figure out what's going on. And if they want to trace it, they have to copy paste an ID over to their tracing thing. That is broken in so many ways. It should just be a visualization by time to trace. And that's it. It should not be a separate product.
Beyang: Got it. And so, the way honeycomb looks at these different facets of observability, how do you integrate all that into one application? I mean, do you think it's all going to be one interface or do you think there's going to be plugins and-
Charity: Yeah. The important thing is that you need to be able to come in at a very high level, like your dashboard view, right? There's a dip, there's a spike, there's something. Right? And you need to be able to slice and dice and figure it all the way down to the [raw rows 00:31:16], so that you can figure out exactly what is going on. Observability absolutely depends on having access to [raw rows 00:31:24]. Because if you don't have access to the [raw rows 00:31:25] then you can only ask the questions that you happen to aggregate on when you were ingesting the data. Right? But if you have those [raw rows 00:31:34], if you have the ability to just slice and dice and cut it up in various ways, you should be able to flip back and forth between tracing, which is just an overview of a waterfall, just viewing events by time instead of viewing by count.
Charity: Yeah. They should just be two sides of the same coin. It should feel absolutely seamless to move back and forth.
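To make the "two sides of the same coin" point concrete, here is a sketch in which a count view and a trace waterfall are both computed from the same raw rows. The row fields (`trace_id`, `service`, `start_ms`, `duration_ms`) and both helper functions are assumptions for the example, not any product's data model.

```python
from collections import Counter

# Hypothetical raw rows: one event per request, per hop.
rows = [
    {"trace_id": "t1", "service": "api", "start_ms": 0, "duration_ms": 120},
    {"trace_id": "t1", "service": "payments", "start_ms": 10, "duration_ms": 100},
    {"trace_id": "t1", "service": "db", "start_ms": 20, "duration_ms": 80},
    {"trace_id": "t2", "service": "api", "start_ms": 5, "duration_ms": 15},
]

def count_view(rows, field):
    """Aggregate view: number of events for each value of a field."""
    return Counter(r[field] for r in rows)

def waterfall(rows, trace_id, width=24):
    """Trace view: the same events for one trace, laid out by time."""
    spans = sorted(
        (r for r in rows if r["trace_id"] == trace_id),
        key=lambda r: r["start_ms"],
    )
    end = max(r["start_ms"] + r["duration_ms"] for r in spans)
    lines = []
    for r in spans:
        offset = r["start_ms"] * width // end
        length = max(1, r["duration_ms"] * width // end)
        lines.append(f"{r['service']:<10}" + " " * offset + "#" * length)
    return lines

counts = count_view(rows, "service")
trace_lines = waterfall(rows, "t1")

print(dict(counts))
print("\n".join(trace_lines))
```

Neither view needs its own copy of the data; moving between them is just a different query over the same rows.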
Beyang: It's like data science in a way. It's just another data set that you're trying to explore and you have kind of your database and then various, I guess, like visualization.
Charity: Yeah. So you might want to... You might be like slicing and dicing and then, oh, you find an example of the error. So you want to trace it and visualize where exactly that time is going. Oh, and then you see the time is going there. So then you might want to zoom back out and see who else is impacted by this? Just going in and out of the views like that. So, you were saying, yes, the space is very congested. And in fact, when Christine and I were starting this company, four and a half years ago, we had so many people tell us very condescendingly that there was no room for any... It's a solved problem. There's nothing left to be done, which is kind of true. I think that metrics have reached the end of the road.
Charity: I do not think there will ever be another better, shinier metrics product built than Datadog and SignalFx. That horse has been driven into the ground. But I think that what you're seeing right now is companies in those three or four different markets all trying to get technically to where Honeycomb sits right now, with our arbitrarily wide structured data blobs, faster than we can get to where they sit on the business side.
Beyang: Yeah. That makes sense. And it's definitely far from a solved problem. I mean, the gap between dev and ops is still a difficult one to traverse, and revisiting that for a bit, I think one of the challenges is that for a lot of developers who don't come from an ops background, it's almost like ops can seem a bit intimidating-
Charity: Oh. Yeah.
Beyang: And for-
Charity: We've worked-
Beyang: Yeah. Sorry, go ahead.
Charity: We've worked very hard to make people think that we're scary too.
Beyang: Yeah. And so I was going to ask, for a tool like Honeycomb, when you go inside a customer, and let's say a customer buys Honeycomb and they want all their developers, or as many developers as possible, to know that it exists and to use it and understand it, because they want to bridge that gap. Do you have... I've talked to a lot of other developer tool creators who sometimes have trouble spreading the word. The people who brought you in are gung ho, they love your product, but then the rest of the team is like, "Ah, that seems like not my job or not my thing. That's yet another tool I have to learn." Can you talk a little bit about how you kind of grow awareness inside companies about this amazing tool that-
Charity: Yeah. This hasn't actually been something that we have really struggled with too much. And I think part of that is because from the very beginning, we have always seen ourselves not as building for individuals, but as building for teams. So you'll notice there's these Slack buttons where if you have a graph and you're like, "Ah, cool thing. I want to share this with my team," you just push the button and it goes to Slack. And then they see the preview of the graph and then they can also click it, which makes them jump into Honeycomb. And then they can see not only your graph, but your history.
Beyang: That's cool.
Charity: I feel like this is an area where the entire industry has been... We talk about people and culture so much, but we don't really bake that into our products. I feel like Christine and I used to talk a lot about how debugging was following a pathway. Sometimes you go down the wrong fork, and you need to be able to go back to when you last knew that you had the plot. Right? And similarly, each of us who's working on our own little corner of this giant distributed system, we know our own little plot intimately for a while, but we're responsible for the whole thing and we don't know jack shit about anyone else's corner. Right?
Beyang: So true.
Charity: And so it's kind of like, we need to... You need to wear grooves in the system as you use it, so that people who come along after you or you, who comes along after you, a few months from now, when you've forgotten everything that you ever knew about what you were doing, you need to be able to look at your history and what were you doing? What kinds of questions did you ask? How did you actually solve a problem? So one use case that people use a lot is, if I got paged about something, say I get paged about MySQL and it's 2:00 AM.
Charity: I don't know all about MySQL, but I know that the experts at our company are Ben and Emily, and I feel like the last time this happened, I think Ben was on call. And it was like 2:00 AM, like Wednesday or Thursday. Right? So I can search back. I can just go back and look: what did Ben ask that involved MySQL? What questions did he ask? What did he post to Slack? What did he think was meaningful? What did he rerun a bunch of times? Right?
Charity: It's like a better version of everyone's bash history file, right? You just want access to a little snippet of their brain, so you don't have to call them and wake them up. That collaboration aspect, and collaborating is not just for other people's benefit. Collaborating is for past you, collaborating with future you. It's the same mechanic there. I feel like, for all that we talk about collaboration, it astonishes me that even basic history stuff isn't baked into most tools, which yeah... I don't remember. What was the question?
Beyang: Yeah. All right. Well, I want to dig into that a little bit more because, so at Sourcegraph we used to have this thing called the ops log and we still do it. Basically the idea is, anytime you're on call and you have to resolve a production issue-
Charity: Yeah. But the stuff you remember to write down is never going to be the actual stuff that you needed.
Beyang: Exactly, exactly. So, it helps to a certain extent because you can... If you take the time to write stuff down and if that stuff happens to be the relevant stuff later, it's helpful. But I think one of the shortcomings we've seen is that, that's not always the case. Can you talk about how Honeycomb helps address that?
Charity: Yeah. I mean, you should not have to consciously decide this is going to be important because it's always the shit that you do when you're not thinking it's going to be important. Or you're just panicked. You're in a rush. You're not thinking about doing things with a record for the future. It really has to be something that just ambiently is captured. The way that you work with the system independent of you deciding, right? The way that you describe your behavior is never the same as your actual behavior. I feel like I also just... I want to reward curiosity and exploration and people who are having fun with their systems.
Charity: I feel like one of the most unfortunate characteristics of most systems is that the person who knows it the best is the person who's been there the longest. We have just decided that this is normal and this is just how it is, but it's not. It's a sign that your tooling sucks. Because it's a sign that you're not actually using your tools to understand your systems. You're mostly relying on your memory of past outages and your scar tissue. Right? And if you were actually relying on your tools, then it should be the case that the person who is the best debugger is the person who spends the most time debugging, or the person who's the most curious. Every team has a couple people who are like this, right? Who just follow their nose. The unfortunate thing is that so many of us get it beaten out of us because you pick up the rock and then, "Oh, that was a mistake." Right?
Charity: But if we could not punish people for following their curiosity, if we could reward it, if they could swiftly and simply just see answers to their questions and then understand their systems a little bit better than they did before, that's just a better world. I have worked now on two teams where that was the case, where it wasn't the people who'd been there the longest who were the best debuggers, it was the people who enjoyed debugging the most. That was at Parse at Facebook and here at Honeycomb. This is one of the many, many ways where I feel like the biggest hurdle that we face in computers is our low standards for ourselves. Our expectation that the world is just this crappy, and this is just as good as we get to expect, on-call.
Charity: I believe that every engineer who builds 24/7 systems should participate in on-call in some form or another. I believe that that commitment should be met by management committing to giving you the time to actually fix the things so that it isn't hell, right? If you're getting woken up more than two or three times a year, it's too much, and we should treat that as a heart attack, not like diabetes. The reason that these systems are so flaky and so fragile is because we've never understood them. We keep shipping new code every day that we do not understand onto these coughed-up hairball systems that we've never understood. And we're just closing our eyes and hitting deploy, crossing our fingers and hoping for the best. The outcome is not going to be that great. But that is a choice. Right? And yes, that is the best... That is the most visibility that you would expect from monitoring tools. But with observability tools, where you can pinpoint exactly which requests are failing and what's different about them, and swiftly correlate it to a change in the system, there's no excuse for this.
Charity: Your system should be comprehensible. People should be in the habit of spending time going and looking through the shit they just pushed to prod and understanding it and expecting it to be understandable.
Beyang: Yeah. It's kind of the idea that I have some SQL terminal, and maybe SQL is not the right analogy here, because it's structured, it assumes a schema, but some sort of query language over a database that contains every single log line, every single request latency with metadata, every single kind of distributed trace. That is kind of the way I think about debugging a production issue, hopping into that and exploring. Is that the kind of holy grail here?
Charity: I think I'm following. Yeah. I think so. Yes. So, the holy grail here is to push back that moment of figuring out that there is a problem to way earlier, right? Instead of getting a Jira task about something that's been broken for months and everybody's forgotten. It should be that you find-
Beyang: It kind of starts with alerting then. Being alerted if there's an issue.
Charity: So I think that with most alerting systems, you have to set the threshold somewhere that won't, like, kill you with alerts and will give you... And therefore you're going to need to go and look at it. You need to go and look at your systems through the lens of your instrumentation and see, "Is it doing what I expected it to? Does anything else look weird?" With your human eyes, if you're interacting with your data, you're going to pull out so many more subtle bugs and problems than would rise to the level of paging someone. Right? Which is why we really have to make it a production practice and an expectation that everyone who's writing code spends time every week with their eyes on production, on their code. Right? Because otherwise it just... You're going to accumulate all these little bugs that are not quite catastrophic enough to wake someone up, but they're still bad. But I feel like they're... So, with Honeycomb alerts, the way we've built them in there, we're thinking less of, like, high level, you're on call.
Charity: This is a sign that your customers are in pain. That's a job for SLOs. But if you're an engineer who's developing on an endpoint, you might want to put a trigger on there. Because you want to know if anything that seems out of the norm is happening while you're shipping. Right? So it's like over a two or three week period while you're making changes to a particular endpoint. Right? You might just put some triggers there.
Charity: Just to shoot you a Slack message during the daytime and let you know if... I don't know what, just make up some things that you think might be a sign of something weird or bad or odd. Right? It's like bringing your systems into like this constant conversation with you. Right? You're in conversation with your code and your users right now. Right? Because you're going to write some code, you're going to ship it. It's going to have some unintended consequences, but while you're working on it, you're going to be there. Your eyes are going to be on it. You're going to notice some of these things, right?
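A minimal sketch of that kind of temporary, daytime-only trigger on one endpoint might look like the following. This is not Honeycomb's trigger schema or API: the `Trigger` class, its fields, the event shape, and the working-hours window are all assumptions made for illustration. The check returns a message that could be posted to chat, or `None` when nothing looks odd.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Trigger:
    """A hypothetical per-endpoint trigger; not Honeycomb's actual trigger schema."""

    endpoint: str
    max_error_rate: float    # fire when the error rate goes above this
    day_start_hour: int = 9  # assumed working hours: notify only in this window
    day_end_hour: int = 18

    def check(self, events: list, now: datetime) -> Optional[str]:
        """Return a chat message if the endpoint looks abnormal, else None."""
        if not (self.day_start_hour <= now.hour < self.day_end_hour):
            return None  # a daytime heads-up, not a page
        hits = [e for e in events if e["endpoint"] == self.endpoint]
        if not hits:
            return None
        errors = sum(1 for e in hits if e["status"] >= 500)
        rate = errors / len(hits)
        if rate <= self.max_error_rate:
            return None
        return f"{self.endpoint}: error rate {rate:.0%} over last {len(hits)} requests"


# Hypothetical recent events for the endpoint being changed.
recent = [
    {"endpoint": "/export", "status": 500},
    {"endpoint": "/export", "status": 200},
    {"endpoint": "/export", "status": 500},
    {"endpoint": "/export", "status": 200},
    {"endpoint": "/login", "status": 200},
]

trigger = Trigger(endpoint="/export", max_error_rate=0.05)
daytime_message = trigger.check(recent, now=datetime(2020, 6, 1, 14, 0))
night_message = trigger.check(recent, now=datetime(2020, 6, 1, 3, 0))
```

The design choice worth noting is that the trigger is scoped to the engineer and the endpoint being worked on, and stays silent at night, which keeps it separate from the on-call signal that customers are in pain.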
Beyang: Yeah. That makes total sense. It's like, I guess part of it would be, you want to get, not necessarily... You don't want to wait until the patient is in the emergency room and you got that midnight phone call.
Charity: Exactly. You want to exercise and like go for walks and eat right and stuff. You don't want to wait to have a heart attack before you see the doctor.
Beyang: Yeah. You want Apple Watch indicators.
Charity: Exactly. Exactly. The Apple Watches of systems. Exactly. That's perfect.
Beyang: I like that. Kind of going back to the beginnings of Honeycomb, I'd love to hear about kind of the story of how you met your co-founder Christine and what was the point at which you both decided, "Hey, let's go build a company called Honeycomb."
Charity: Yeah. First, my brain was just kind of wandering off in the direction of what you were saying earlier about how we as engineers love to build things, we love to see the impact of what we've built. And I feel like the people who resist being on call for the system, people who don't like the sound of this, are people who have been burned, people who've been burned out. It is so deeply satisfying as an engineer to just watch what you've built work.
Charity: I feel like there's just something, like, intrinsically... Sometimes you have to push people over the hump to get them to try it. But it's so much better than driving in the dark. All right. So yeah, Christine. Christine is amazing. So Christine was at Parse with me. I was the infrastructure tech lead and Christine single-handedly built the Parse analytics product. And so she had built this product for our users, built on top of a Cassandra time series database, and she started encountering all these frustrations where our users were wanting to ask questions of their analytics and they couldn't. Because they'd been locked into the questions that they had decided to capture the data for upfront. And they couldn't ask new questions.
Charity: And so Christine would get frustrated and she would fall back to Scuba. So Christine left Facebook a while before I did. She went to the East Coast and stuff and we didn't really know each other all that well. But then when she was coming back, she was asking me, "Do you know of anything interesting going on?" I was like, "Well, I'm starting a company. It's totally going to fail, but it might be fun." And bless her heart, she was all in. We really... Sorry. I was so sure we were going to fail and I was fine with it, because the reason I was doing this was because I couldn't imagine being an engineer without it. The idea of having to live without the tooling that we had built, my ego couldn't take it.
Charity: I would have been such a less powerful engineer. And so I was like, "Okay, there's some people pursuing us with some funding, we'll take it. We'll build it, we'll fail and then we'll open source it and I won't have to live without it." That was really the grand plan from the beginning and then we just kept not failing by accident. This is the first year where I'm kind of like, "This might be a real thing." Which probably means we're now doomed. That's my ops plan talking now. But Christine is amazing. So, in the beginning it was three of us and our third co-founder didn't work out pretty quickly. And that's how I got pushed into being CEO, which I did not want to do.
Charity: I had nightmares about being unemployable for the entire three and a half years that I was CEO. Christine and I swapped places a little over a year ago. She is now CEO, and this is much better. It's much better. CEO is the worst job in the world.
Beyang: Talk to me about some of the pain points of being a dev tools startup CEO.
Charity: Oh. Some of the pain points for me were... So I've never been one of those kids who is like, "I'm going to start a company when I grow up." Because I really despise those people. I really don't like the whole Silicon Valley cult of the founder. I find it very off-putting.
Charity: When it comes to the CEO gig, I dislike the, I don't know, there's just so much grandiosity about it. And there's so much... I alienated more than a few venture capitalists, which isn't ideal when you're in my position.
Beyang: Oh no.
Charity: All right. Here's a pro tip. If you're into people or management or whatever, there's a book called The Four Tendencies by... Oh, what's her name? She did The Happiness Project. Gretchen Rubin.
Beyang: Gretchen Rubin.
Charity: Yes. And I'm pretty skeptical about almost all personality stuff, but this one's very straightforward. It's like, basically, what motivates you: living up to internal expectations that you have of yourself, or external expectations that people have of you? And there's basically four possibilities, right? Either yes, you're motivated by both. No, you're motivated by neither. Or you're motivated by living up to other people's expectations, or you really only care about you, yourself. I am the rebel type that rejects both. As soon as there is an expectation of me, I do not want to fulfill it. I'm kind of a little shit. That personality type, she even calls out in the book. She's laughing. She's like, "I don't see how any rebel could be a successful CEO, ha ha ha." I was just like, "Fuck you, lady." But there's some real truth. And Christine's the upholder type, where she really gets a lot of joy and meaning out of living up to expectations, both that she herself has and that others have of her, and I feel very grateful, blessed to have a co-founder who does.
Beyang: Yeah. I mean, what you're saying, I think I have a little bit of that as well.
Charity: It's a really illuminating book. I got so much out of it for my personal relationships, my work relationships, my understanding of... I highly recommend it.
Beyang: That's cool. I'll check it out.
Charity: Which type are you?
Beyang: I think most people just know off the bat.
Charity: What were my options again? Are you motivated by other people's expectations of you or... So the types are upholder where you uphold both, rebel where you reject both, obliger where you... External expectations. Or questioner-
Beyang: Definitely not an obliger.
Beyang: Just the word.
Charity: Most people are. That's great. That's why the world works because most people want to please other people. And that is fantastic.
Beyang: Yeah. I mean, I honestly think the rebel persona is probably what I identify most with. I was a very difficult child, I would say. I was one of those kids where if you told me I couldn't do something, I would go and try to do it. I think part of me never really grew out of that, either. So any sort of rule that gets imposed. Even like-
Charity: You automatically feel like, "Fuck you, no. This is the last thing I'm going to do."
Beyang: Yeah. Why do I need to listen to you? It's a freedom. That's like freedom.
Charity: You asked how I got into computers. Literally it was because I walked past the computer lab, I saw zero women in there and I was like, "That's where I belong."
Beyang: That's awesome.
Charity: Yes and no.
Beyang: Well, yeah, not the zero women being in the computer lab part, but-
Beyang: So one of the things that strikes me about the story of Honeycomb is you mentioned there's a tool inside Facebook called Scuba that at least partly inspired what Honeycomb is doing. And Sourcegraph is a little bit the same. There was this internal tool at Google that I had the chance to use called Code Search that partly inspired me to want to build something like Sourcegraph. I would love to get your take on... First of all, people working inside fantastic developer organizations like that, they're going to see tools that they find useful that are probably found nowhere else in the world. But at the same time, there's kind of a set of unique challenges that go along with that because, you know, Facebook and Google are Facebook and Google for a reason.
Beyang: There's really no other place like them. And if you're trying to build a tool that's inspired by stuff, that's inside those organizations, you have to adapt into the broader world. Right? So talk about that.
Charity: Yeah. It is interesting. I think Google has kind of made the... They're a pretty services-oriented organization from what I understand internally, and Facebook is not. They do not have services at all. They talk to each other and go to meetings to decide that they're going to do things. It's super weird. Yeah. I think it's interesting because I think that often, people, and you can think of so many... Like Quip came out of Facebook tools, Asana came out of Facebook tools. There's a lot of them. The thing that often trips people up is they come out thinking that the same problems are going to be hard out here as were hard in there. And they are not.
Charity: You get so used to just, like, turning on the spigot and whoosh, people show up because they have no choice. Because they're going to use your tool or they're interested in it. You get out here in the little startups and you turn on the spigot and nothing comes, right? And you really have to work much harder to understand users and court them and look for ways to surprise them with how helpful you can be. Scale is almost never, yeah. Scale is never the hard part.
Charity: Like I said about Scuba, it's an aggressively hostile-to-users tool. And they can get away with that. They can just be like, "Fuck you. We're a big ads company. We're not a developer tools company. And so our developer tools are going to suck." And they do, and people use them anyway. You know, what are you going to do? So from day one, we knew that we had to pay a lot more attention to the user interface. Yeah. I don't know. What I do love is I do feel that just in the last few years, there are so many of us that have kind of sprung up that you can plumb together a full, like, CI/CD pipeline that is Facebook quality or Google quality. And you get everything from feature flags and observability and progressive deployments and all this stuff that for forever, either you didn't have them or each shop had to home brew them, custom to their environment. Now you can actually just pay a couple hundred bucks a month to a handful of tools and you can get some really nice things.
Charity: So it's interesting. If you look at the DORA report, the yearly DevOps research report that gets put out where they show, they break teams up into low performers, high performers, et cetera. If you look at year over year for the last couple of years, the bottom 50% is losing ground and the top 50% and especially the elite 20% or so, they're getting better faster. There's this total split down the middle. And this is because frankly in tech, if you're standing still, you're moving backwards. Because complexity is always conspiring to overtake you. There's always more exceptions.
Charity: You have to be actively fighting against that in order to just keep up. But the tools really have, they really have begun to make a real difference. It's the kind of thing where if you get one, then you want another, and then you want another. You get feature flags and you want observability and you start, like, getting all these things. And with the free time that you buy for yourself, you buy more free time for yourself. It's like getting on a treadmill of escalating awesomeness once you start. But if you don't start, your life is getting worse and worse and it's really... And the thing is... People often think, "Oh, well this is only for great engineers." Or, "I'm not that... This is only for high performers." It has nothing to do with how good of an engineer you are, nothing to do with it.
Charity: These are socio-technical systems, right? The people, the code, the tools you use for deploying and managing that code, and observability is like an important step, just so that you can see what the fuck is going on. But it's about the effectiveness of the team, how high-performing the team is. Because I have seen engineers who leave high-performing teams and join low-performing teams. And they don't drag the team up to their level. They go down to the team's level, right? And similarly, I've seen engineers leave low-performing teams and join a high-performing team. And within three to six months they're holding their own, right? I feel like 80, 90, or more percent of your velocity and your ability to ship code with confidence has nothing to do with your personal skills. It has everything to do with your team.
Beyang: Yeah. I mean, that's a fantastic point and I've never really heard it described in kind of that fashion, but it really, I think clicks with me. I mean, there's that kind of an old trope about like the 10X engineer in Silicon Valley and-
Charity: That 10X engineer goes and joins a low-performing team and they're going to perform right down with them. It's not about the person. And this is why I feel so strongly about managers, about the pendulum, about engineers becoming managers and going back every few years. Because I feel like if you want to be a technical leader who has the skillset that it requires to tend to a team and help it raise its level of performance, you can't just focus on one corner. You can't just focus on the people. You can't just focus on the tech. You can't just focus on the tools. It takes the ability to reason about the full system, right? Which means that you can't bury your head in either side. It takes everything.
Beyang: What are some of the things that Honeycomb does internally to kind of foster this culture, this level of high performance?
Charity: Oh God. I almost feel embarrassed when I talk about it. Our metrics are an order of magnitude better than the most elite team captured in the DORA report.
Beyang: What metrics are those?
Charity: The four metrics are: how often you deploy. The time between when you merge and when the code goes live. How long outages... how long till recovery, I think it's duration of outages. And I don't remember what the fourth one is-
Beyang: Yeah. We'll drop a link in the show notes.
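For reference, the four DORA metrics are deployment frequency, lead time for changes, time to restore service, and change failure rate (the one not named above). The following is a rough sketch of how they could be computed from a list of deploy records, not DORA's or Honeycomb's methodology: the `Deploy` record, its field names, and the choice of medians are assumptions for illustration, and the function assumes at least one deploy in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """A hypothetical deploy record; the field names are illustrative only."""

    merged_at: datetime
    live_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy was recovered


def dora_metrics(deploys: list, window_days: int) -> dict:
    """Compute the four metrics over a non-empty list of deploys."""
    failures = [d for d in deploys if d.caused_failure]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.live_at - d.merged_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": (
            median(d.restored_at - d.live_at for d in restored) if restored else None
        ),
    }


# Made-up sample data: four deploys over a two-day window, one of which failed.
t0 = datetime(2020, 6, 1, 10, 0)
sample = [
    Deploy(t0, t0 + timedelta(minutes=10)),
    Deploy(t0 + timedelta(hours=1), t0 + timedelta(hours=1, minutes=20)),
    Deploy(
        t0 + timedelta(hours=2),
        t0 + timedelta(hours=2, minutes=30),
        caused_failure=True,
        restored_at=t0 + timedelta(hours=2, minutes=45),
    ),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, minutes=20)),
]

metrics = dora_metrics(sample, window_days=2)
```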
Charity: The reason ours are so good is because our system has always been well understood. We have the expectation of building the instrumentation in, looking at the instrumentation when it's live, being able to swiftly pinpoint and fix problems. So it never becomes this hairball. And even when things do go down, it's comprehensible, people can get in there and fix it. It's almost like cheating, it's so much easier to work on a system that's never been terrible than it is to dig yourself out of a pit. I don't want to underestimate the amount of work that that could be, but it's not that hard.
Charity: We very intentionally did not just go hire all the ex-Google, ex-Facebook people. We knew we were building a team, a product for everyone, and we wanted normal people. But the thing is that you become a better engineer so much faster when you're working on a team that ships way more often and gets that feedback loop quickly, doesn't bury the feedback in ops, but gets it right to you. It's almost... I was joking today that we're going to have to keep hiring new junior engineers because they become seniors, which is a good problem to have.
Beyang: Yeah. That's a great problem to have. I want to revisit a kind of thing that you touched on earlier, which is you're starting to notice more and more kind of gate-keeping in the technology industry. I think one of the things that I really care about is making technology, and software especially, accessible to a wider range of people. So, people don't have to have that experience of like, "Oh, there's no women in that computer science lab." Can you talk a little bit about how Honeycomb can kind of realize that? Do you think that using tools like Honeycomb can actually bridge the gap-
Charity: I really do. Because, think about how we learn from each other. You want to look over the shoulder of the senior engineer and see how they do what they do. How does their brain even work? What questions do they ask? Which is something that we baked into the product, right from day one. Thinking about if you're too bashful to approach someone, you should be able to go and look at how they interact with their systems, just capturing a snippet of their brain. I fundamentally believe that there are so many things... Systems that are broken, give people the wrong idea about their own abilities. It's not your fault. It's the shitty systems.
Charity: It is so hard right now to get that first job. My little sister just graduated with an engineering degree a couple of years ago, and I got to see up close and personal just how hard it is to get that first job. Once you've gotten that first job, the sky is the limit. You've got recruiters, everybody's pounding on the door, but nobody wants to take the leap. And just like, what does it take? Three or four months to get them to be productive? It's not that much. And you get loyalty, they're so grateful. It's just the energy, the fresh eyes, I don't know. I feel like in this industry, we really need to figure out how to mentor junior developers remotely, because I feel like the push towards distributed teams is... Nobody's figured out how to do this. Nobody's figured out how to help teach-
Beyang: It's a brave new world.
Charity: I'm nervous about it too, but we can't keep expecting someone else to pull the cart for us. We all share that load.
Beyang: I think that one of the bottlenecks to getting more people, especially in entry level positions into software jobs is also the time of senior engineers. So, if you have a limited number of senior engineers, mentorship takes time.
Charity: It really does.
Beyang: It's an active job. And oftentimes you can get into the position where if you hire too many junior folks too quickly, the senior engineers spend all the time mentoring. So I guess where I'm going with all this is, do you think that tooling can help with that? With better tools, do they not only help individual engineers be more productive, but also, can they facilitate more scalable mentorship?
Charity: That's an interesting question and I'm not sure. What I know that tools can do is do a much better job of rewarding individual effort with results and scaling that out. Somebody who's motivated to go poke around and see how Ben solved that MySQL problem. Capturing that in a way so that people have access to how his brain was working when he did that. I have confidence that tools can help with that. There's an interpersonal human element. People have different learning styles. Some people really rely on the interpersonal emotional connection. And for some people, I think it's actually kind of almost confusing, almost an obstacle. I actually find it much easier to sit down and learn things by myself. And I find it really confusing. I find it very... Well, it's not confusing. It's when people are-
Beyang: For me it's almost stressful. When I'm with another person I have to be on.
Charity: Yeah. I don't do my best thinking while someone else is watching or paying attention. For sure. I don't know. I feel like where tooling can help in general is just with making things asynchronous and anything that can be made asynchronous can be made scalable. But when it comes to things like giving someone real fine grain, personal feedback on their code, that's just... There's only so much you can do there.
Beyang: Yeah. I am kind of the personality type where I like to learn by myself too, and I feel like there are all these kind of, how to put it... There are kind of these, like, rabbit holes in software, not exactly rabbit holes, but things that are extremely useful, but you don't really acquire unless you... An example of this is bash scripting. When I-
Charity: There's no class.
Beyang: Yeah. There's no class on that. And then you're like, "Oh, I know how to write Quicksort. I know all the data structures." And you get out and you're like, "What the heck is this? I need to like write this in order to make some change on the server." And so, that sort of thing, I feel like there could be a lot better resources either in the form of tools or tutorials that help people kind of teach themselves through that. So that it's not such a black box. And they don't view that as an impediment.
Charity: I definitely feel like ops, operational skills in general have been a real blind spot as far as our... And it's not about the operations, but it is about the ownership. And we can't protect people from the consequences of their code. We've got to help them be exposed to it and understand what happens to their code after they've merged it, right? Everyone should know what happens between when they merge their code and when the users are using it. That's just something where like in a classroom setting, you need to have reality to go debug before you can learn those skills, I think. Because there are some things you can't just really make into a lesson plan because they rely on the inherent unpredictability of reality.
Beyang: So if someone's listening to this and they want to get started with Honeycomb, what should they do?
Charity: Go to honeycomb.io. We have a blog that I think has probably the best roundup of observability resources in the industry. And there's my personal blog at charity.wtf. And you can also follow us on Twitter, @honeycombio or @mipsytipsy.
Beyang: My guest today has been Charity Majors. Charity, thanks for being on the show.
Charity: Thanks so much for having me. This is really fun.
Beyang: The Sourcegraph Podcast is a production of Sourcegraph, the universal code search engine, which gives you fast and expressive search of the world of code you care about. Sourcegraph also provides code navigation abilities like jump-to-def and references in code review and integrates seamlessly with your code host, whether you're working in open source or on a big hairy enterprise code base. To learn more, visit sourcegraph.com. See you next time.