Why is a systems engineering mindset essential for a scaling startup? In this episode, Nelson Elhage, creator of the open source code search engine Livegrep, co-creator of the Ruby type checker Sorbet, and Member of Technical Staff at Anthropic, joins Beyang Liu, co-founder and CTO of Sourcegraph, to discuss how Rust is changing the security landscape, explain why Patrick McKenzie, better known as patio11, called his live code search tool “miraculous,” and dive deep into the weeds on the differences between trigram- and suffix-array-based search systems. Along the way, Elhage explains why developer productivity is nonlinear and why investing in developer experience should be axiomatic.
Made of Bugs: https://blog.nelhage.com/
Smashing The Stack: https://inst.eecs.berkeley.edu/~cs161/fa08/papers/stack_smashing.pdf
Computers can be understood: https://blog.nelhage.com/post/computers-can-be-understood/
Regular Expression Search with Suffix Arrays: https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/
Thread: Circuits: https://distill.pub/2020/circuits/
I’m here today with Nelson Elhage. Nelson is the creator of the open source code search engine Livegrep, used by organizations like Stripe and Mozilla. He is one of the creators of Sorbet, the Ruby type checker that’s in use at Stripe, where he worked on a lot of the developer tooling and developer experience. He’s just joined a company called Anthropic, which is trying to develop AI into a systematic science. Hopefully we’ll get into what that means a little bit later in the episode.
He’s also the author of a fantastic technical blog and newsletter, which I subscribe to and highly recommend to anyone who’s interested in computer system deep dives. It’s really great. I enjoy reading it. Without further ado, Nelson, thanks for being with us today.
Thanks for having me, and thanks for the kind words in the intro.
Before we get into your numerous programming accomplishments, I always like to start things off by asking folks how they originally got into computers. What was your origin story?
I got into computer programming in what I feel like was a very popular way at a particular point in time. I liked video games as a kid, and I wanted to write video games, so I learned how to program. Then, I quickly discovered that writing video games is hard, and it wasn’t actually something I was naturally all that good at, but I really enjoyed the programming part. I voraciously took off from there, learning as much about software engineering and programming and computers as I could, and I haven’t really stopped since.
That’s cool. What was the video game that originally got you interested? Do you remember?
I don’t remember if there was a specific one or what. I grew up on the Nintendo 64, and that was my formative platform as a child that I played a lot of games on and learned the joy of gaming. I love Nintendo games to this day. Then also, I think one of the other platforms that really got me was the TI calculators. We got the TI calculators in school at some point, the TI-83 Plus, I think. I had to do math with it, but those are programmable in their own dialect of BASIC, and so I started learning to write games there. I would program on them during class when I was bored or distracted.
I actually eventually learned a bit of assembly programming very early on because the other language that you can program the TI calculators in is by writing Z80 Assembly against TI’s OS. That’s the only way to get high performance, so all the really good games were written in that. I started learning that without having any idea what I was doing at that point, but that was a lot of fun.
That’s awesome. I actually got into programming the same way. I had a TI-83 Plus, and I was fortunate enough to get one with the manual. In the manual, they had a BASIC tutorial that I read through, but I never made it to the assembly, so you were operating at another level. I topped out at just the BASIC.
The assembly was a high-stakes game because there was no memory protection on those things, so any mistakes would reboot the calculator. I think it was even relatively possible to completely break it and require a factory reset of some sort–especially without having any idea what you’re doing. It was a very slow process. Often, you’d try to write something and it would fail. You wouldn’t have any idea why it failed, but you’d have to go take out the batteries and the backup battery, reset the RAM, and start over.
Do you remember any of the programs you wrote? Do you have any crowning achievements that you’re super proud of on that platform?
Honestly, I feel like the weird one that I was most proud of was one of the very first I did. I built a trivial whack-a-mole game, where little moles would pop up, and you had to whack them using the number pad as a three-by-three grid corresponding to a three-by-three grid on the screen.
That was in TI-BASIC. It was super basic. The graphics were like ASCII art. This was shortly after we got the calculators, and before we had discovered the wealth of games that were freely available online. For a lot of my classmates, this was the first calculator game we had ever encountered, and so for the first time, during math class, we could be sitting on our calculators playing games instead of paying attention to the teacher. Even though it was a crappy game, it was the first that anyone had ever seen, so it was a little bit mind blowing.
How did that affect your relationship with your math teacher, Nelson?
I had a little bit of a rocky relationship with math teachers when I was younger, especially because they were still doing a lot of very rote mathematics, and I just… I’ve never been good at plugging through the details of equations. I hadn’t quite learned yet that I liked math once you get to the more interesting conceptual stuff. I was not actually a star math student in those days, but fortunately by high school, I found math teachers that were much better at teaching the interesting parts and helping you grasp the conceptual bits, the patterns, and the beauty of the methods of abstract reasoning that math can bring to you.
But in those days–this was eighth grade or so, I think–it was a little rocky going.
Great teachers can make such a difference at that stage for sure. I think the first thing that put you on my radar was the kernel-hacking stuff that you were doing. What was the line or trajectory from TI-83 to hacking the kernel? I imagine there are a couple steps or stages in there.
College, really, was where I got into both kernel engineering and low-level systems engineering in general, and then into security work. I had a group of friends in college, many of whom would go on to found Ksplice, which was the first company I had a full-time job at, who were really great developers and were collectively really interested in systems programming, systems engineering, the lower levels, the bits closer to the metal. I took MIT's operating systems class with a number of them, and we all just had a blast.
It was one of the hardest classes we ever took, but it was also really fun, very hands on. Over the course of the class, you build a basic UNIX operating system that runs on x86. You can boot it on your own laptop if you want, more or less from scratch, and really understand all of the layers. I got into systems programming because everyone around me was doing it. I thought it was fascinating. You have a group of friends who are doing something, and you think it’s interesting. That’s just a great motivation to keep going.
Similarly, I was curious about security and exploit development. It’s very adjacent to systems programming because both of them involve peering underneath your usual abstraction layers, and vulnerability development is one of the fields that absolutely most strongly forces you to look under abstraction barriers, and to understand how things are actually working under the hood. I started by reading some online tutorials, reading the classic paper on smashing the stack, and started fiddling. Then, I started playing with exploits on my laptop, writing toy C programs with trivial vulnerabilities and learning how to exploit them, and reading other work.
Then, when I graduated, like I said, a bunch of those same friends that I was in the computer club with who were really good systems programmers started this company, Ksplice, where we were doing hot patching for the Linux kernel. So, applying security updates by modifying code in place without a reboot–very deep systems work.
It was my friend Jeff's master's thesis that was the foundational work. He did that work, to some degree, in conjunction with a bunch of us who were in the MIT computer club, thinking hard about systems work. Five of my friends founded that company, and then I joined a year later when I graduated; I was a year behind them. We got into it to build a general-purpose tool for applying updates to the OS kernel without reboots. It turns out that the main reason anyone cares about applying updates without reboots is security updates: they have a machine that, in many cases, has untrusted local users, because they're a shared hosting provider of some sort.
When there’s a security update, they want to be able to apply it without taking downtime immediately. Those are the ones that have the most attack surface, that have the most motivation to apply these updates. We accidentally discovered that we were really building a security tool more so than a general sysadmin or operations tool.
We spent a lot of time working with kernel security patches, because we were figuring out how to apply them in a zero downtime way. We were trying to understand the kinds of patches, trying to track things that might be coming down the pipeline to get ahead of them. Then, that really intensified my interest in low-level security and kernel security engineering. I started doing hobby vulnerability research, looking for bugs in the Linux kernel, just because I thought it was fun. It wasn’t really part of my job.
I found a couple of interesting things, like I found one interesting bug class or meta bug, a bug that made other bugs worse. I also found a bug in the KVM hypervisor. The KVM hypervisor is the Linux native hypervisor for running virtual machines. I found a bug there that would let you break out of a VM guest, and run code in the host.
That one seemed cool enough that I took the time to write a full exploit for it, and actually gave a talk at Black Hat and DEF CON. I reported it appropriately, and got it fixed months before I gave the talk. The exploit wasn’t super weaponized. It was really just intended as a demo, but it was a lot of fun and really cool to stitch something like that end to end, and just see it work. You’re running code in a VM. You run some commands, and then suddenly a window pops up on the host, and it’s escaped.
Not sure if you’ve kept up with the security world. I’m a security newb, by the way, but do you have any thoughts on the current state of Linux security in the container world?
I’ve dropped out of being full time in that world, and we might actually see this as a bit of a pattern through my career, I feel like. I go deep somewhere and then I feel like I understand it pretty well, and then it’s more interesting to me to go learn something new than to stay somewhere where I’m an expert.
I’ve followed it from a distance. I’m less close to it. I think in many ways, we’re at very similar places to where we were, but more so.
I do think there's been a real trend over the last decade or so, since I left security, of the defense side, the people who are trying to figure out how to build secure systems and coordinate security, really leveling up in maturity and sophistication and rigor. They're thinking about the whole system, where exploits come from, and where to invest effort, and moving beyond thinking about individual bugs to much more systematic analyses. I also think it's really exciting that there's been a lot of work in the last couple of years, now starting to accelerate, toward getting Rust to the point where it's usable for a broader and broader class of low-level systems software.
I think that Rust will not fix every security bug ever, but memory safety bugs, the use-after-frees and buffer overflows, the kinds of things that C and C++ are known for: every study that's been done says they're something like 75% of exploitable bugs in the wild, and working in something like Rust basically fixes most of those. There's work right now to get Rust support into the upstream kernel. I know several of the people working on it, although I haven't really been involved. That seems likely to land.
There’s increasing energy. Some of the WebAssembly runtimes are written in Rust. Amazon has a hypervisor written in Rust. Google has a hypervisor written in Rust that they use on their Chromebooks. I don’t think that’s going to completely fix security, but I think as we push that work through, it has the possibility to really change the shape of the security landscape, and how easy it is to find exploitable bugs.
That’s cool. I mean, catching these things as far left, I guess, in the software lifecycle as possible, building in the language features that prevent these bugs from being created in the first place.
I’m optimistic about it, but at the same time, it’s going to be a long road, and we have a lot of software written in C and C++, but I think it’s also exciting to see that the people working on these things are increasingly sophisticated around thinking about, “All right, we’re not going to rewrite the whole world, but where are the places where we can add the most value for the least work? How can we target there? How can we use those to get footholds, and just build out all of the infrastructure, and interoperability that are necessary in order to make this easy to adopt, and then encourage more and more of the next generation of software to be written on these tools?”
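The memory-safety point Elhage makes above can be made concrete with a minimal, illustrative Rust sketch (not from the interview): the use-after-free pattern that compiles silently in C is rejected at compile time by Rust's borrow checker, which is exactly the class of bug the studies he cites attribute most exploits to.

```rust
fn main() {
    let s = String::from("payment record");
    let r = &s; // shared borrow of s

    // In C or C++, freeing the buffer here and then reading it through the
    // old pointer is the classic exploitable use-after-free. The Rust
    // equivalent refuses to compile:
    //
    //     drop(s); // compile error: cannot move out of `s` because it is borrowed
    //
    // So by the time we read through `r`, the compiler has proven that the
    // underlying String is still alive.
    println!("{}", r);
}
```

The safe version runs and prints the string; uncommenting the `drop` turns a would-be runtime memory corruption into a build failure, which is the "shift left" the hosts discuss next.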
You mentioned your MO is that you go deep in one area, and then learn what you want out of it, and then you look elsewhere for a new area to learn and explore. About when did you start to top out in the security world, and what was the next thing that got your attention?
I was at Ksplice. Ksplice ended up being acquired by Oracle, who wanted the Ksplice technology for their Linux distribution, Oracle Unbreakable Linux. That Ksplice team actually still exists within Oracle. It contains a number of the people that we hired to operate the technology. That’s been really cool to see that we built a technology and a team that was enduring enough to last for a decade or so after all of the founders left. I think that in some ways that’s one of the parts of that whole time that I’m most proud of. We built really cool technology, but we also built something that was valuable enough, and that we taught the next generation, and encoded the knowledge, and all of those things worked out, that it lasted past us.
It’s not the case that everyone gets all of their updates without reboots, which was our original moonshot goal, but I think it’s still a real success that it’s still around. The team’s still doing good work.
But anyways, we were bought by Oracle. Essentially, all of the early team stayed at Oracle for a year, and then decided it was time to go pursue something somewhere else. That was my peak of involvement in the security world, because I was working a little bit less hard now that we were acquired, and spending more time looking at security stuff.
I found it was a lot of fun, but especially for offensive security, vulnerability research, and exploit development, it wasn’t a great place to do work. I found that, especially at the time, it was a very toxic community, very adjacent to a lot of shady folks: both criminal enterprises and the shittier organizations that find bugs and sell them to governments for large sums of money in exchange for keeping them quiet.
There were a lot of egos in that space, and the norms were very nebulous: “What’s valued? What’s honored? What’s good?”
I loved it. The work was really fun and really interesting, but I wasn’t sure where it was going, and the career paths there. I wasn’t really thrilled about it. I think some people have made really great careers doing that, but it wasn’t for me. I ended up at Stripe next, almost immediately after leaving Oracle and almost by accident.
Greg Brockman, who was Stripe’s CTO for a while, and one of their very early engineers, actually worked briefly at Ksplice as an intern when he was at MIT with us. I was actually his intern mentor for a summer at Ksplice.
I taught him a lot about the Python development that we were working on at that point. Then he left to go join Stripe, and then came back and tried to recruit me. I was very skeptical at first, but I met more of the team. I met John and Patrick Collison. I eventually got sold that it was a promising business. Also, I got sold that it was a good place to do interesting technical work. Then that, of necessity, resulted in me pivoting a lot toward web technologies: Ruby and a lot of infrastructure work. It was interesting because I wasn’t doing kernel development in any way. I wasn’t writing any C code.
I wasn’t doing that low-level stuff, but it was really valuable to come in and have a deep systems understanding to know the kernel. There were a lot of bugs, a lot of fiddly things that were relatively easy to debug when you understood the lower layers so deeply. I thought it was a really fun opportunity to take those skills, learn new skills, and then also bring value to the team by being a good systems engineer. Stripe had incredible product people, incredible frontend people, incredible Ruby developers, but there was a skill set that they didn’t have as much of.
Having some of that around was really helpful, especially on the systems side as we scaled up and debugged scaling problems, and so on.
Of course, these days, Stripe handles some large percentage of all e-commerce transactions. I imagine all of that knowledge came in really handy.
Yeah. We were growing so quickly, and you run into all kinds of problems when you scale up that fast. Everything you’re doing breaks in some way, and it’s useful to have people who can come at it from all levels, people who can do the low-level optimizations, start to do the systems re-architecture so that you can scale out better, do the product-level things of, “Can we tweak the products to stop doing the inefficient thing to make things faster?”
It was important, being able to comfortably operate at any part of the stack as a team. Then to some extent, as an individual, as I leveled up, I always made a point of trying to maintain familiarity with most of our stack. That was really powerful.
To place it in context of Stripe’s history, how big was Stripe in terms of…
I joined in late 2012, which was, I believe, about a year after Stripe’s public launch, and Stripe had around 30 employees at the time. About half of that was engineering. I think when I joined, it was about 15-ish engineers, and then 15 people in legal, sales, ops, whatever.
Got it. These days, it’s like, “How could you not join Stripe? It’s huge. It’s amazingly successful. It’s synonymous with e-commerce.” But in those days, the initial product, if I remember correctly, was just a nice developer-friendly API for accepting credit cards, right?
That’s right. The initial product was purely credit card transactions in the U.S., USD only, and almost entirely the developer API part of it–not really any front end or anything else.
It was a much more bare bones and focused product. When I joined, I think Stripe was starting to become trendy or seen as successful and desirable. We’d been launched for a year, and we’re clearly seeing traction in growth. It definitely wasn’t widely known yet. It definitely wasn’t a clear runaway success.
There were a lot of commentators who said, “Okay, Stripe’s doing pretty well, but this product is niche. Sure, it’s great for onboarding, but customers will churn off of Stripe as they scale. If you’re a multinational, or you need to operate across the world, Stripe can’t handle you. It’s really unclear that they’ll ever grow outside of their niche.”
But by the time that you’re successful enough that people start writing contrarian pieces about you, it’s usually evidence that you’re achieving some real success. We were in the very early days of that stage of some success, not yet clear if it would scale, some skeptics, but also, above all, just still pretty small.
The absolute dollar values were small. The number of people was small. The number of people who had even heard of us was small.
I imagine they were extremely thankful and lucky to have someone like you join the team at that point, because one of the things a company has to do is serve its early adopter market, which is like, “Let’s make this user friendly and really accessible to this segment of the market,” but then as you grow the business, scaling becomes a first-order concern. Did you immediately dive into that problem area, or did you start in another area and get gradually pulled into it?
No. I was working on infrastructure from the very beginning, but because of the rate of growth, infrastructure was essentially synonymous with scaling work. We were growing so quickly that we would outgrow our database clusters, outgrow our scale in some way constantly, and so almost every big project we had in flight was in some way about supporting the next generation of scale.
There are at least two axes of scale there. One is just transaction volume, number of customers, number of requests, but there’s also growing the team; that’s its own kind of scaling. We had to keep CI fast, keep developer tools stable, keep Git working, keep code quality up, and keep developers able to get started and just run code at all on their DevBox while everyone else was continually adding dependencies, adding new features, and so on. Size of the team is a very real scaling issue as well.
I think Stripe is known today, among other things, for having a very great developer experience internally. I think you probably played a big hand in that. What was the point at which the company started prioritizing developer experience, and what was the point that you got involved in those efforts?
I think it’s something we always valued a fair bit on a values level. I think Patrick Collison, one of the founders, was always really obsessed with the power of tools and the power of making it easy to do things. In some way, that was Stripe’s whole thing. You’ve been able to accept credit cards online forever, but if we can make it really easy, then that’s a game changer–even if it’s, in some sense, already possible. There was a somewhat similar attitude where we wanted to make sure that our developers internally can be really productive and effective.
But at the same time, it’s not something that we really funded in a concerted way until maybe 2014 or 2015. That was when we had a team that really would spend time on developer tooling and various things, but it wasn’t really a priority area or funded in terms of having a standing team until maybe 2014 or 2015.
It was sometime around there that I took a three-month sabbatical, because I was a little burned out on a bunch of the infrastructure work I was doing. When I came back, I decided that I wanted to go work with the developer tools team full time as my next project.
They weren’t called developer tools, but that’s what they were. Every team at Stripe has been renamed 30 times as the org grows and scales and managers move around. I joined the developer tools team in, I think, probably 2016. I might be off by a year.
Got it. What projects came out of that team?
I think one big thread was the developer environment that developers work with on their laptop day to day: “I’m writing code. How do I run that code?” That code runs behind an HTTP API endpoint, or maybe as an async job. “How do I run it? How do I test it?” As you grow, this becomes a surprisingly hard problem, because you have a huge number of dependencies. You depend on MySQL, and Redis, and RabbitMQ, and Kafka, and S3. You need to have appropriate versions of all of those configured, and credentials wired up.
You start building multiple services, and you need to be able to run things that depend on those. For the environment that developers used, we ended up building a pretty sophisticated setup where developers would code locally. The code would live on their laptops and get synced up to a DevBox in the cloud. All Stripe servers would be auto-started on demand. If you talked to the right port, there was a proxy that was like, “Ah, you’re talking to the HTTP API server. I’ll go spin it up and keep it running, so you only pay the startup cost on the first request.”
This way, you can have lots of services without paying the cost of starting all of them. You only pay the cost for the ones you’re developing on, and you, the developer, never have to think, “Oh, today I’m working on the admin interface. Let me start the admin interface.” The DevBox would watch the source code to auto-restart on changes, and it would do various tricks to make code loading faster, doing partial loads so that you only reload the part of the code that changed. All of the dependencies would run on the server.
Then, because these servers were centrally provisioned, and you synced your code to them, you could always throw away your DevBox, and get a fresh one with the latest configuration, which meant that it was very easy for infrastructure teams and tools teams to maintain this server and keep it always working. If you have a problem, you just throw yours away and restart.
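The lazy-start behavior described above is easy to sketch. This is a hypothetical, minimal Ruby version (not Stripe's actual code, and the `LazyServices` name and `"sleep 60"` stand-in command are invented for illustration): a supervisor that spawns a service's process only on first use and reuses it afterward, so the startup cost is paid once.

```ruby
# Hypothetical sketch of an on-demand service supervisor, in the spirit of
# the DevBox proxy described above (not Stripe's actual implementation).
class LazyServices
  def initialize(commands)
    @commands = commands # service name => shell command that starts it
    @pids     = {}       # service name => pid of the running process
  end

  # Start the named service if it isn't running yet; either way, return
  # its pid. The spawn cost is paid only on the first call.
  def ensure_running(name)
    @pids[name] ||= Process.spawn(@commands.fetch(name))
  end

  def stop_all
    @pids.each_value do |pid|
      Process.kill("TERM", pid)
      Process.wait(pid)
    end
    @pids.clear
  end
end

supervisor = LazyServices.new("api" => "sleep 60") # stand-in for a real server
first  = supervisor.ensure_running("api")          # spawns the process
second = supervisor.ensure_running("api")          # reuses it, no new spawn
supervisor.stop_all
```

A real version would also proxy the incoming connection to the service's port and health-check it before forwarding, but the core idea is just memoized process startup.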
There’s nothing magic there, but I think one of the things that I really learned firsthand there is that there’s a lot of value in picking a problem, and identifying all of the rough edges, and doing them all well, and working through the whole lifecycle, and really making something that works well as a finished unit.
That was one of the biggest distinct projects we did during that phase. It was really a game changer for a lot of people. It meant that on day one, you were guaranteed that you could be issued a laptop, check out the code, and everything would just run. You’d never lose time on day one trying to get your dev environment working. Managers who didn’t really write code day to day, but who would occasionally come back to it after not having touched it for three months, were especially grateful.
That used to always be a nightmare, because you’d have a three-month-out-of-date environment. The from-scratch setup instructions wouldn’t work, but the incremental instructions didn’t work either. We fixed that problem too: one flow, blow it away, get a fresh box in the cloud, sync the source code up, and the centralized infrastructure takes care of all the details.
I mean, that’s such an amazing undertaking and accomplishment, I think, not just because the underlying technology is really, in my opinion, hard to get right and make a good experience, but also because this predates the current wave of cloud IDE companies. If I recall correctly, at that point in time, nearly every cloud IDE or dev-environment-in-a-box startup had failed to really succeed. Did you get pushback internally, like, “Oh, we’re not really sure this is going to pan out,” or, “Oh, nothing’s ever going to be as good as a local dev environment, despite all its warts”?
No. I think one of the things was that Stripe had almost always, since I started, done a model of: people write code locally, and we have a script that syncs it to the cloud, and you run services in the cloud. That basic model had been around forever. I think the big change was that you would have a server that was statically assigned to you in the cloud, and was long lived, and you were somewhat responsible for it. We were moving to a world where those were truly ephemeral and replaceable, and then doing a lot of the polish work to make it all work. It wasn’t a radical redesign so much as it was taking roughly the shape of the thing we were doing.
There had always been some battles because that used to be the main way of doing things, but Stripe was never a place that enforced workflows on developers. Some people would run code on their laptop for a while. We had a Vagrant set up where people would run on a VMware VM that ran on their laptop. There was a hybrid set of things. There was some pushback around trying to invest in one workflow, and optimize it because people who still wanted to run code locally were going to be like, “Oh, are you going to break my workflow? Is this still going to work?”
We ended up with a compromise where we wouldn’t explicitly break their workflow, but they were responsible for it. We also wouldn’t go out of our way to maintain these unsupported workflows. We would try to make our workflow good enough that people wanted to switch. I think, by and large, we succeeded.
Do you think the fact that Stripe’s main language was Ruby, and Ruby is a dynamically typed language, played a role in making this easier? Just because there was no, like, “Oh, but I compile on my local machine, and that’s tied to the editor experience, because I get code intelligence,” or…
I don’t think it made a huge impact. By this point, we actually had our own build system that would have to run in order to run code.
Ruby is a very dynamic language, and you can do all kinds of metaprogramming. We started doing a bunch of code generation, where we would generate Ruby code ahead of time, in many cases for performance reasons. We also built an autoloader, so our Ruby code at Stripe had no require statements. A require statement in Ruby is how you load another file of code. At Stripe, there are no require statements; the entire codebase is statically analyzed to figure out which symbols are defined where, and then a custom autoloader ensures that if you reference a name, it gets loaded.
Then that also does static analysis so that in production, we preload every symbol you’re ever going to need.
A problem with autoloading is that it gives you very inconsistent performance. The first time you hit something, it’s very slow. In development, that’s actually great, right? In development, you only want to load the things that you need because that means that you can restart, and you hit one endpoint, and you don’t have to pay the cost of all of the rest of the code in the system. In production, you want to load everything at startup, and then have predictable performance.
Actually, a couple of years before we built Sorbet, the team that I would later join, but wasn’t on at the time, built this autoloader, along with the coarse-grained static analysis of Ruby code that powered it.
But that did mean that there was this static analysis stack that had to run as part of a workflow. They made it incremental and it worked such that an out-of-date version would usually work. So, it wasn’t that you were waiting on it on every build, but you had to have a server continually running that. We already had a fairly sophisticated build environment.
I think it’s possible. Maybe you’re thinking if we were in an IDE, and people did their builds through the IDE, it might have been a problem. I think it might have been a problem, but I feel like it would’ve been a solvable one. We would’ve figured out a way to do the builds remotely, or cross-compile, or write a custom plugin. I think we probably wouldn’t have had the IDE be the source of truth for the build. Maybe you would’ve done a dev build locally to get compile errors and syntax checking. In parallel, we’ll do a real build on your DevBox in the background in the cloud so that the code is ready when you want to run it.
I can imagine a lot of solutions we would’ve done. I think one of the key things here is that a lot of things are possible if you’re willing to staff a team of good people, and make it their goal to do it. At some point, that was what we decided. We were like, “All right, developer experience is important. It’s a huge multiplier on everyone’s productivity. It is worth having some people who just work on it.” Some quite good engineers, some of the best engineers I worked with, were on that team. They were investing in everyone else’s productivity in a holistic way.
That’s awesome. Quick, probably stupid, technical question. I’m not super familiar with the Ruby internals, but if you were changing things so that you were statically detecting which symbols were used where, and then preloading those in production, does that mean you’re running a fork of the Ruby runtime, or is there a mechanism in the language to…
No. Ruby is sufficiently dynamic that you can do that all from within the Ruby language.
Ruby actually has its own autoloader, which we didn’t use for somewhat technical reasons. Ruby has a generic autoload mechanism where you can basically register hooks that say “When a name is referenced, and you don’t know what that name is, call me, and I’ll provide it for you.” We used that infrastructure.
Maybe this is different, but Active Record hooks into this if you—
It’s very similar. It’s a slightly different technical mechanism than method missing, which is the one that you might be familiar with.
We were running on stock Ruby, but Ruby has enough dynamic features that we could layer this on top.
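To make the hook Nelson describes concrete, here’s a minimal sketch (not Stripe’s actual implementation): Ruby calls `const_missing` when code references a constant that isn’t defined yet, so a custom autoloader can intercept the reference, load the defining file, and return the constant. To keep the example self-contained, it “loads” from an in-memory table instead of real files, and the `Invoice` class is purely hypothetical.

```ruby
# Table mapping constant names to their definitions. In a real autoloader
# this would be a symbol -> file map produced by static analysis.
DEFINITIONS = {
  "Invoice" => "class Invoice; def total; 42; end; end",
}

# Ruby invokes Object.const_missing when a top-level constant reference
# fails to resolve. We intercept it, "load" the definition, and return
# the now-defined constant.
def Object.const_missing(name)
  source = DEFINITIONS[name.to_s]
  raise NameError, "uninitialized constant #{name}" unless source
  eval(source, TOPLEVEL_BINDING)  # stand-in for require-ing the file
  const_get(name)
end

# The first reference to Invoice triggers the hook and loads it lazily.
puts Invoice.new.total
```

Because the hook fires only on the first unresolved reference, later uses of `Invoice` resolve normally with no extra cost, which is what makes this lazy-loading pattern work.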
Got it. You mentioned Sorbet, which is one of the things that I would love to learn more about. Sorbet is the Ruby type checker. Tell us about that project, how that got off the ground, and what it does.
We’d actually known/suspected for a long time that we wanted to do this. One of the founding members of the developer tooling team at Stripe was this guy named Paul Tarjan, who was a fairly early Facebook engineer who worked on Hack and HHVM, Facebook’s PHP VM. He’d had experience working on that, but I think, more importantly, he had just seen the rollout at Facebook of the change of going from a giant monolithic codebase without types to a giant monolithic codebase with types.
He was very adamant that when you’re at that scale, it’s an amazing quality-of-life improvement to be able to add types to your codebase, to be able to make explicit your interface boundaries, and what objects are expected where, and check those.
He always wanted to do this project at Stripe. Me and a couple of other engineers were enthusiastic about it. I thought it made a lot of sense at scale. I’ve always been partial to static types myself. I feel like, at a small scale, especially, it’s a little bit of a religious war/preferences thing. Some people have very strong opinions either way. I think that when you get a very large codebase with a lot of people contributing to it, it’s an absolute no-brainer that types help. They give you richer tools for communicating abstraction boundaries and layers, and communicating between developers and teams.
I thought it would be a good idea, and I always thought it would be fun to work on. I’ve always enjoyed working on compilers and type systems. I’d never really worked on one in earnest, but I dabbled. I had a couple of small patches in LLVM to fix bugs that I’d run into at some point. I thought it was a good idea, and I’d always thought it would be a fun project. We sat on it as an idea for a couple of years, making noise of, “We think we want to eventually go this way. It doesn’t make sense yet. We don’t have the funding for it.”
Then, at some point, we hired this other engineer, Dmitry Petrashko, who is an incredible engineer who had just wrapped up his PhD working in type systems and compilers, and building a lot of the Scala 3 compiler. We hired him with the pitch of, like, “Come join us. Build a type checker for Ruby.” Then we put together the team for it. It ended up being Dmitry, Paul, and me. We designed the initial prototype and shopped the idea around internally. I think there was some pushback internally. Some people were like, “This seems crazy. This is way too ambitious. Is this even going to work? Can you even type Ruby? Is this a good use of resources?”
We had a manager who actually was also ex-Facebook, so he had also seen the success of the project at Facebook. He was enthusiastic, he got it, and he was able to buy enough organizational air cover for us to get off the ground. Then, once we were going, we focused on trying to get to a point where it was delivering obvious value fast enough that it became self-sustaining.
That’s awesome. One of the things that you didn’t mention was the editor experience, like hover tooltips, go to definition, find references. Was that a design goal of Sorbet at all?
Yeah. That was interesting, because Dmitry always had that in mind as a long-term design goal. I think he’d seen, in his previous work in compilers and Scala, the power of that and the power of how having a sophisticated compiler makes that possible. I personally didn’t really see the vision at that point, or I was like, “That seems nice in theory, but will we ever get it to work? Will people use it?” I think Paul was also less sold on the idea, but I think Dmitry ended up being absolutely right.
It wasn’t one of the first things we built. The first thing we built was a static, offline type checker. You run it once. You get your type errors.
But eventually, we augmented it to have an LSP server and IDE integration so that you could hover over things and get types, get compiler errors in your IDE, and get type-aware go-to-definition. That did end up being absolutely huge. People loved it. It was a game-changer for how people related to code. I think Dmitry was absolutely right when he saw ahead that that’s one of the places where this was going to be a game changer.
These days, the developer environment at Stripe is: you have a dev environment in a box that’s running on a server somewhere, and then you have Sorbet providing static type checking and also IntelliSense or code intelligence to… Is the editor a cloud-based editor, or is it a local editor?
Again, Stripe has never quite been willing to force things on developers and their workflows, but the flow that it supported works really well. At least when I left—it’s been almost two years now, and it’s possible that it’s changed—it was local VS Code. You would run VS Code on your laptop, and Sorbet would run on your DevBox in the cloud.
VS Code would talk to it over a network connection, and it would stream errors and make LSP RPC calls to the server to get type information, go-to-definition, and so on. Sorbet ran, again, in the dev-environment-in-a-box, where the central dev teams can make sure that everything works and can access logs if they have to. But it talks to the editor over a network pipe.
I think that the Ruby community, whatever that means, doesn’t love types. Ruby has really loved these very fluent, short, expressive things. However, the large organizations that run Ruby have, by and large, been a lot more receptive, because they’ve had similar problems. Sorbet has had a lot of contributions from the Shopify team. Shopify is also a giant Rails application; they run Sorbet, and they contribute back. I believe Airbnb runs Sorbet.
Many of the large Ruby installations that I know of run Sorbet. I don’t think GitHub does, but I’m not sure. Actually, the other day, I was running a Homebrew command on my Mac, and I saw that it was installing Sorbet.
As you might know, Homebrew is written in Ruby, and they seem to be going through and adding Ruby type annotations to Homebrew. That means that Sorbet is now running on basically every developer Mac in existence, which I thought was pretty cool to discover.
I think adoption has been somewhat unevenly distributed, but as best I can tell, most large Ruby projects have picked it up because they see the advantages at scale, both in terms of code size and team size. I don’t know that it’s widespread, but it’s got some real adoption, which has been really exciting.
That’s great. Another open source project that you’re known for is Livegrep, which is a code search engine. Can you tell me about that, and when did you start building that?
If listeners haven’t used Livegrep, there’s a demo at livegrep.com. It’s a code search engine with this very simple premise, which is it’s a regular expression search engine over large codebases. You type in a regex. It shows you all of the matching lines, and it does so more or less live. As you type character by character, the new results show up.
It was actually also during the time when I was at Oracle after Ksplice and before Stripe. This was in the window where Google Code Search, which you may or may not even remember. Google used to run this project called Google Code Search, which was search over basically all of the open source code in the world using regular expressions.
Google had recently sunset Google Code Search, and there weren’t really any replacements yet. Russ Cox, who was the author of Google Code Search, had not yet released a series of blog posts on how Google Code Search had worked. I and a bunch of my developer friends were missing Google Code Search, and we were talking about how the heck do you build a search engine that uses regular expressions? It’s not super obvious how you build an index that can make that efficient.
Livegrep originally started purely as a way to explore the technical question of, “How do you make fast regular expression search?” Because it existed in this window where Russ Cox hadn’t yet released the blog posts explaining how Google Code Search worked, I and some friends figured out our own indexing solution and data structure solution more or less from first principles.
Livegrep has a wacky index that I’m not really aware of anyone else using for regular expression search. It has some definite trade-offs compared to the trigram indexes that Google Code Search and that most popular regex search engines use these days.
It has some advantages and it has some very real disadvantages, but I thought it was certainly a lot of fun to develop. I built that purely as a tech demo of, like, “Can I make this fast?” Then I discovered that once I made it fast, like, “Oh, this is actually a powerful interface.” When I went to Stripe, I stood up an instance there mostly for fun, but then it ended up being really popular. It’s still running since I left. People swear by it.
Dropbox had an instance internally. I know a lot of ex-Stripe people, wherever they’ve moved on to, have stood up a Livegrep instance with them because they thought it was so powerful. That’s also been really cool to see.
That’s awesome. I’m trying to find this tweet from one of your former colleagues, Patrick McKenzie, about how useful this tool is:
When folks ask me a question about our codebase internally I try to:
a) answer the question
b) say "If I were trying to find the answer to that question with our tools, here's my entry point, here's the search query, and here's my mental heuristic for why I'd click on result #3"
— Patrick McKenzie (@patio11) January 18, 2019
He goes on to say, “Since it’s publicly available, let me mention that the most common tool I use for answering these questions is Livegrep and that I intend to boot up a Livegrep instance on the first day of every startup for the rest of my life. It borders on miraculous.” It doesn’t get much better than that as far as user testimonials go.
It’s been really rewarding. Livegrep doesn’t have a super wide user base, but it has a lot of users who really swear by it, which I feel really, really proud of.
Maybe speak a little bit more about the implementation, because I think that’s interesting, because a lot of code search engines… Sourcegraph is obviously a code search engine. There’s also Hound, which is implemented at Etsy, and OpenGrok as well from, I think, originally Sun Microsystems, now Oracle.
A lot of these code search engines took inspiration from the Russ Cox blog posts about building a trigram index as a backend. But it’s super interesting that you almost took the concept of code search, and tried to reverse engineer what you thought would be the optimal implementation, and came up with a different backend, which works extremely well, but is not a trigram index.
If not a trigram index, what is the backing index for Livegrep?
We’ll get a little technical here. I have a blog post that explains some of it—maybe I should actually do an updated version, because I think the implementation has been tweaked a little bit since.
Livegrep uses a data structure called a suffix array. You can think of a suffix array like this: you have some string of text, and you take every suffix of that string—the whole string, then the substring starting at index one, the substring starting at index two, and so on.
You take all of those suffixes, and you sort them. Now, you have a sorted list of suffixes. Now, there’s this property that any substring of a string is the prefix of some suffix. So, we’ll unpack that a little bit, right? Every substring starts at some position. There’s also a suffix that starts at that position, and runs to the end of the string, and so your substring is the start of some suffix. If I have a sorted list of all of the suffixes, I can do binary search over them to find any substring of the string I’m searching for.
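The prefix-of-a-suffix idea fits in a few lines of Ruby. This is just a toy illustration of the principle, not Livegrep’s actual (C++) implementation, and the naive sort is far slower than the linear-time construction algorithms real implementations use:

```ruby
# Build a suffix array: the string's indices, sorted by the suffix
# starting at each index. O(n^2 log n) here; real libraries do O(n).
def suffix_array(text)
  (0...text.length).sort_by { |i| text[i..] }
end

# Every substring is the prefix of some suffix, so binary-search for the
# first suffix >= pattern, then scan while suffixes still start with it.
def find_substring(text, sa, pattern)
  first = sa.bsearch_index { |i| text[i..] >= pattern }
  return [] if first.nil?
  matches = []
  sa[first..].each do |i|
    break unless text[i, pattern.length] == pattern
    matches << i
  end
  matches.sort
end

sa = suffix_array("banana")
find_substring("banana", sa, "ana")  # => [1, 3]
```

The binary search finds a contiguous range of the sorted suffixes, which is the property the regex-walking machinery described next relies on.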
Now, that gives me substring search. I want regular expression search, but most regular expressions have a bunch of literals or character classes, or things that decompose into something like that. You can collapse a regex a little bit lossily: if I’m looking for “foo” or “bar,” I’m looking for either F or B. I can do a search in that suffix array to find all of the ranges starting with F and all of the ranges starting with B, and then I can recurse and find everything starting with FO.
If I’m going case insensitive, and I want capital F or lowercase F, followed by capital O or lowercase O, I start by finding the capital F range and the lowercase F range. Those are all contiguous ranges, because we’re in a sorted array. Then, within each of those, I find lowercase O and uppercase O, so I’ve expanded out into four ranges. If I do this for a length-n string, I get exponentially many ranges, but in practice most of them disappear, because your code probably doesn’t have all of the possible permutations of case.
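That range-refinement walk can be sketched roughly as follows. This is illustrative only—it uses linear scans where the real data structure would binary-search contiguous index ranges—but it shows why empty ranges pruning the search keeps the exponential worst case from biting on real code:

```ruby
# At each depth, split every surviving candidate set by the allowed
# characters at that position; empty splits vanish immediately.
def refine(text, ranges, depth, chars)
  ranges.flat_map do |r|
    chars.map { |c| r.select { |i| text[i + depth] == c } }
         .reject(&:empty?)
  end
end

text = "Foo foo FOO fOo"
sa = (0...text.length).sort_by { |i| text[i..] }
ranges = refine(text, [sa], 0, ["f", "F"])   # suffixes starting f or F
ranges = refine(text, ranges, 1, ["o", "O"]) # ...then o or O next
ranges.flatten.sort  # => [0, 4, 8, 12], the four case variants of "fo"
```

Each surviving set corresponds to a contiguous run of the sorted suffix array, which is what makes every refinement step cheap in the real implementation.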
At every point, it’s a pretty efficient search. To represent a suffix array, rather than actually copying all of those suffixes, you just represent each one as an index: this is the suffix starting at this position.
Your suffix array is stored as a list of indices, the same length as your string, and there’s well-known literature on how to efficiently construct these. I just used an open source library to build them efficiently.
Then, the actual indexing and lookup code is some custom stuff that I wrote. Suffix arrays are a pretty big blowup in space, because for every character—actually, for every byte—you have to store an index. I use 32-bit indices. That means that you have a 4x blowup, or a 5x blowup if you include the original text.
Livegrep has some compression where we try to deduplicate lines, because source code often has a lot of duplicate lines in it—often just whitespace and curly braces. We only store a unique line once, and then we store a whole bunch of metadata that lets us reconstruct the files from that.
We pay some computational cost and some storage cost for the metadata, but we make up for it by shrinking the corpus substantially while still ensuring it’s efficiently indexable.
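The line-deduplication idea can be sketched roughly like this—a simplified illustration of the trade-off, not Livegrep’s actual on-disk format:

```ruby
# Store each distinct line once; represent each file as a list of line IDs.
# The searchable corpus is just the unique lines, and the ID lists are the
# metadata that lets us reconstruct files (and report real file positions).
def build_corpus(files)
  table = {}  # line text -> integer id, in insertion order
  encoded = files.transform_values do |content|
    content.lines.map { |line| table[line] ||= table.size }
  end
  [table.keys, encoded]
end

def reconstruct(unique_lines, ids)
  ids.map { |id| unique_lines[id] }.join
end

files = { "a.rb" => "x = 1\nend\n", "b.rb" => "y = 2\nend\n" }
unique, encoded = build_corpus(files)
unique.length                        # 3 unique lines, not 4
reconstruct(unique, encoded["b.rb"]) # round-trips to the original file
```

The shared `end` line is stored once, so the index over unique lines shrinks as duplication grows, at the cost of the per-file ID metadata and the work to map matches back to file offsets.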
There’s a whole bunch of other tricks to make the details work out such that once you’ve done this index query, you need to actually run the regex engine, and so how do you find out exactly which spans of code to run the regex over, and then how do you reconstitute results?
Got it. It is a beautiful elegant solution. Can you talk a little bit about the trade-offs between this approach, and the trigram-based approach?
One of the biggest ones is storage. A trigram index is pretty compact; this blows up your corpus by, naively, like 5x. You can get some of that back with compression. Let’s see—it’s been a while since I thought through the details of this. There are some queries that my suffix arrays can index that trigrams can’t really index. Basically, if you have a set of character classes, the number of possibilities blows up exponentially. If you’re searching for all hexadecimal strings, you have sixteen possible characters, times sixteen, times sixteen.
In a trigram index, that’s 16^3 separate trigrams that might appear at the start of a hex string. 16^3 is a big number—it’s basically too big—and so trying to index every hexadecimal string just doesn’t work. In my suffix array, zero through nine is one contiguous range, and A through F is another contiguous range, so you have two ranges. Within each of those, you have two sub-ranges. You can walk that down far enough to get a pretty small number of candidate positions, and then run the regex query on those.
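To make the arithmetic concrete—this is just a back-of-the-envelope check, not how either engine actually enumerates candidates:

```ruby
# The hex character class [0-9a-f] has 16 members, so a trigram index
# would need to track every 3-character combination just to cover the
# first three characters of a hex string...
hex = ("0".."9").to_a + ("a".."f").to_a
trigrams = hex.product(hex, hex).length  # 16**3 combinations

# ...while a suffix array starts from two contiguous sorted ranges
# ('0'..'9' and 'a'..'f') and splits each into at most two sub-ranges
# per additional character, pruning empty ones as it goes.
trigrams  # => 4096
```

4,096 posting lists to merge is impractical per query, which is why trigram engines generally give up and fall back to a full scan on queries like this.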
It’s actually capable of accelerating a broader class of searches than the trigram index can. I think it’s an open or fair question whether it’s a useful class of searches. Is searching for every hex string something anyone ever does? One of my demos is that I can search for every UUID in the Linux kernel, and that comes back pretty quickly.
You can’t really do that with a trigram index.
Yeah. Whereas the trigram approach will just look for string literals in your thing and try to match against that. But if everything is a wild card, then…
Right. If everything’s a wild card, you’re completely lost. But if you have something like a hex string, where you have a limited set of characters in relationship to each other, you can expand that into trigrams. For instance, with case-insensitive searches, if you have “foo,” that’s two times two times two—eight trigrams. We can index that. Trigram indexes have some ability to do that, but the suffix array is much better.
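That expansion is easy to sketch: each character of a case-insensitive literal doubles the candidates. Again, this is just an illustration of the counting argument, not a real trigram indexer:

```ruby
# Expand a case-insensitive literal into every concrete case variant:
# 2 cases per letter, so a 3-letter literal yields 2**3 = 8 trigrams.
def case_variants(word)
  word.chars
      .map { |c| [c.downcase, c.upcase].uniq }
      .inject([""]) { |acc, cases| acc.product(cases).map(&:join) }
end

case_variants("foo")  # => ["foo", "foO", "fOo", "fOO", "Foo", "FoO", "FOo", "FOO"]
```

Eight posting lists is fine for a trigram index; the point is that the count grows as the product of each position’s alternatives, so richer character classes (like hex) blow past what’s practical.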
One of the disadvantages of the suffix array is it has really bad locality characteristics, because you’re doing binary search in this gigabyte array, and so it’s very dependent on latency to the store.
Whereas it’s possible to store your trigrams pretty efficiently so that you do a couple of metadata searches, and then you just stream the posting lists from storage. The suffix arrays are both larger and have worse locality, which means they basically have to live in memory—and they’re large—whereas it’s pretty reasonable to put trigrams on SSD or even a hard disk.
I have a hare-brained scheme to build a trigram index that’s backed by S3. I don’t think you could make it super fast, but I think you could probably get latencies of sub five seconds. If you can do it in S3, you could potentially index terabytes of code for absolutely pennies with minimal infrastructure. I don’t know if it works out—I have a sketch in my head, but I haven’t tried to build it. You can think about that kind of thing with trigram indices, and it’s not even close to possible with Livegrep’s approach.
It sounds a lot more difficult to scale due to the memory requirements and the memory locality.
Typically, you store a list of documents under each trigram, so your trigram index essentially gives you back, “Here’s a list of files to search.” The Livegrep index gives you back a list of lines to search. The way Livegrep works is it actually gives you a list of source lines to run the regex over.
The index potentially means that you need to run way fewer regex searches after the index lookup. I haven’t actually done detailed head-to-head benchmarks, but I think that’s part of the reason that Livegrep is able to be so fast: the index is a fair bit more selective. You could, in principle, do that with a trigram index, but no one does. I think there are some reasons it’d be fiddly.
Another big downside of the suffix arrays is that it’s not really clear how to do incremental updates, because you have this massive sort. I have some hare-brained schemes that look like log-structured merge trees. But, again, they’re purely hare-brained schemes that exist in my head. I think they’d be a real pain to actually implement.
We’re running up against the end of our time here. I wanted a real quick pivot back to developer experience at Stripe. When we were chatting earlier, you had some interesting things to say about the impact of building developer tools, and the nonlinear effects as people start to use them in interesting ways and the effects that that has on their productivity. Can you share some of your observations there?
Building tools of any kind is really interesting, because you’re building something that humans are working closely with. Humans are creative and adaptable in ways that you don’t predict, so the impacts of your changes can be very hard to predict. The naive view is: if I make a tool twice as fast, we measure how much time people spend using that tool, and we save them half that time, right? That’s many people’s mental model.
What actually happens is that those people use that tool more. If it gets faster, they may spend the same amount of time using it, but they make twice as many queries. They may switch from running something and getting coffee to running it incrementally and interactively. In some ways, that’s disappointing, because it feels like you haven’t saved them time, but what it means is that making it faster gives them a new capability. They often relate to this tool in a new way.
Performance is the cleanest place to see this, but the same holds any time you make something easier or harder to do—because it’s faster or slower, or just because you’ve reduced the number of steps, made the steps more annoying, or added cognitive overhead. People react by changing how they use your tool. When you’re building developer tools, if the officially supported developer environment doesn’t work for people in some way, they build their own approach.
They start standing up their own VMs, and running their own things. I think it’s really interesting to note. It makes it really hard often to reason about the impact of this kind of work, because there are no easy metrics. One of the takeaways that I take from it is that making tools easy to use, fast to use, and pleasant to use is really powerful. It’s really powerful in ways that are hard to predict until you’ve done it, and so you should just take it as axiomatic that it’s worth a little bit more time than your organization otherwise would spend investing in tool quality, because people will change how they relate to those tools.
They’ll find new ways to use it. They’ll use them more often. It often leads to this productivity flywheel in somewhat non-obvious ways.
100%. I feel like everyone who’s worked in a developer productivity organization has struggled with how to convey the impact of their work. We struggle with it too as a developer tools company, because we’re essentially making the same pitch but to external organizations and teams. We go to them and say, “Look, code search is going to make your organization a lot more productive.” They’re like, “Well, how much more productive? Can you put that into a spreadsheet form for me?”
Let’s say they use OpenGrok, and they’re like, “Oh, our developers, they say they only use OpenGrok once or twice a week on average. Let’s say it saves them an hour each query. That’s two hours saved per person. That’s not a lot.” We’re like, “Whoa, wait a minute. Why aren’t they using it more?” Once things become more performant, once it’s a friendly UI, that totally changes the quantity of queries that you’re going to do. It’s hard to boil all that down into a straightforward economic argument.
We struggled with this a lot as the developer productivity team at Stripe was figuring out how to measure success and figuring out how to know if what we were doing was working. I think in large part, well, there’s a couple of parts here. Organizationally, I think we solved it by having an org that really understood and believed in the value of having good developer tools. We didn’t have to spend a lot of time justifying our existence up the reporting chain. That’s really valuable.
There’s also, in some cases, no one checking you. You have to actually convince yourself and your immediate manager that what you’re doing works, because you’re not necessarily going to get huge external pressure to show that it’s working. Then, internally, I think we solved it through a combination of things. Part of it was having senior engineers whose judgment and prioritization we trusted—senior engineers who had seen things work at other organizations.
I think we never would’ve tried Sorbet if we hadn’t had the… We might’ve guessed that it was a good idea, but we never would’ve tried it if we didn’t have ex-Facebook engineers who were like, “No, I’ve been here. I’ve been in that environment. The problems that you’re having, I recognize them. Then, we rolled out types, and they got better. We’ve seen it happen.” Being able to just trust those people, and not ask them to prove it or to make the spreadsheet is really powerful.
Then the last thing was just trying to make sure that we had good relationships and connections to our developers. That meant both spending time individually with developers who were users of our tools, and getting their complaints, and then also trying to find imperfect but systematic ways to aggregate their feedback and opinions and responses. Every six months, we ran a developer tools survey where we would send a survey to every engineer in the company. We gave them a couple of one through five satisfaction scores so that we had a couple of headline metrics we could track over time.
But mostly, we gave them a lot of optional free text, like, “What’s your favorite part of your developer workflow?” “What’s your least favorite part?” “What do you think about tool X?” Then, we would sit down and read every single response, and aggregate them into themes and pain points. It was a lot of work. It wasn’t super numeric, but it was enough that you would be able to quantify over time, like, “Oh, we rolled out types, and people started saying nice things about them.”
People were like, “Oh, this is better. I can now understand code when I’m reading it. I have more confidence in changing code.” Six months is a long-ass feedback cycle, but it’s better than none at all. I think it was an expensive, slow one, but it was reasonably high quality.
Would it be safe to say that your passion for developer productivity and tools and experience has nicely rolled into what you’re doing now at Anthropic, which in a way is trying to improve the developer experience of debugging these complicated, hard-to-understand machine learning models?
That’s an interesting one. I hadn’t really thought of it quite that way, although I think that’s some of it. I do think that, coincidentally, Anthropic has a lot of people who share some of my values. I think, interestingly, often in different contexts, less about developer tools, but more around research-y tools, or the power of good visualizations to change how people relate to tasks, or relate to understanding, or the power of having good tools that are pleasant to work with and fast.
I think one of my personal, deeply held philosophical things that has actually rolled over a lot into Anthropic is what comes of being a systems engineer. I have an obsessive desire to actually understand the computer systems that I’m working with. I’m not happy with cargo culting or copy pasting things, unless I have a decent mental model of, ideally, everything I’m doing all the way down to the hardware. Not in detail, but I want to know how the things fit together. I want to have the confidence that if there is a mysterious bug or behavior I don’t understand, at least in principle, I’m confident that I could go read the Ruby interpreter source code, read the kernel source code if I have to, and understand the weird behavior.
I have another blog post about this that is another one I’m very proud of, because I think it conveyed a real mindset that I think is commonly held by some senior engineers, but I hadn’t seen well expressed. I think that the Anthropic team has a somewhat similar relationship towards machine learning and AI and models. These models are very commonly treated as absolute magic black boxes, where you build a model and maybe it trains. Or maybe it doesn’t train. If it works, if it does train, it does well. It classifies images. It generates text. Whatever you’ve trained it to do.
Don’t ask how it does that—that’s just unknowable; just be glad that it works. I think that’s really the default way of relating to large ML models. The Anthropic team has a very different attitude. They almost take it as axiomatic that these things do behave in understandable, intelligible ways, and that if we don’t understand them, that’s a bug in our understanding, not a statement about the universe. So let’s figure out what the tools are—the software tools, the mathematical tools, the workflow tools, whatever else.
Let’s figure out what tools let us put these into an intelligible framework. They do that from a couple of angles. That really meshes with my desire to have systems that make sense and that I understand. I think they’ve had some real success in the past. It’s an open question whether this is a research agenda that will bear fruit indefinitely, or where it will lead, but I find it really appealing and really powerful, because it’s a very powerful organizing principle: dig until you find the thing that makes sense.
If it works, you’re guaranteed to end up with something that you have a better understanding of than if you’re just tweaking hyper parameters, and seeing if it trains or not. You’re just much more likely to end up in places where you can generalize.
Anthropic, as I understand it, is a research-based company that is all about making AI and machine learning systems safer, more understandable, and debuggable, right?
That’s right. We expect we will probably eventually release products, and the plan is to eventually productize this kind of research, and turn that into products that other people can use in revenue streams and all that goodness.
But we’re very much in research mode right now, and we’re focused on “How do we build models that are safe, that are understandable, that are responsive to the instructions or goals of the people operating them?” “How can we, in general, make the field of AI and ML more intelligible?” “How can we make it more likely that you can know whether a model is going to work or roughly how well it’s going to work before you run the massive training job?” “Can we find the science and empirical and mathematical laws that govern these things in various machines?”
My last question for you in the remaining 90 seconds that we have is: Can you explain what neural nets are doing? Just kidding. Just kidding. We’ll have to save that for… You should come back at some point.
Well, my coworker Chris Olah and some of his collaborators—I think if you ask that question about vision models in particular, the answer is yes. They can explain how vision models work.
Not in every detail—there are open questions—but to a way better degree than I realized. They’ve published some really awesome work with some really great explanations and visualizations. We could throw a link to that in the show notes as well. It really changed my perspective on how much it is possible to know about these models, and how much we do know in some cases.
We’ll definitely throw a link to that. Maybe you can put in a good word for this podcast with him, and at some point, he can come on the show and explain all that. I think that’d be a really fun conversation. Final parting note, if you could ask the people listening or watching this to do one thing after this episode is over, what would that call to action be?
Well, I feel like I’m supposed to tell you to come apply to work at Anthropic, because we’re hiring.
That’s totally fine.
I feel like my actual answer is: pick some tool that you use—probably a software tool, but maybe some meatspace tool—that you find either a little bit frustrating or a little bit baffling. There’s some sharp edge that you’ve worked around, or it’s a little bit slower than you want. Find time—take an hour—to try to fix it or try to understand it, right? Maybe you use Git a lot, and it’s a little frustrating and a little bit magic. Try to find a way to learn a little bit more about it, and make it a little bit less magic. Or maybe you run some process every day for your job.
It’s too slow. Figure out how to attach a profiler to it. I really believe in the power of getting to know your tools and of fixing the squeaky pain points. I want other people to have the joy of succeeding at that quest.
That’s great advice. Well, thanks so much Nelson for taking the time today. This was a fantastic conversation. Thank you so much for being with us.
Thank you. It’s been fun.
This transcript has been lightly edited for clarity and readability.