Episode 13: Andrew Gallant (BurntSushi), creator of ripgrep

Andrew Gallant, Beyang Liu

Andrew Gallant (a.k.a. BurntSushi) is the creator of ripgrep, a popular command-line search tool that powers the search box in VS Code. Andrew tells me how ripgrep began, explains why it's faster than GNU grep and other grep alternatives, and gets into the nitty-gritty of regex optimization.

We also discuss another matter near and dear to both of us: Linux window management. Andrew talks about what he likes about Go and Haskell and why Rust is his current go-to programming language, and finally he shares a humorous anecdote involving algorithms, technical recruiting, and everyone's favorite New England sports team.

Show Notes

Andrew Gallant: https://burntsushi.net, https://twitter.com/burntsushi5

ripgrep: https://github.com/BurntSushi/ripgrep

Bulletin board systems: vBulletin (https://en.wikipedia.org/wiki/VBulletin), YABB (http://www.yabbforum.com)

Window managers: xmonad, i3

Russ Cox posts on implementing regular expressions: https://swtch.com/~rsc/regexp/

EWMH: https://en.wikipedia.org/wiki/Extended_Window_Manager_Hints

ICCCM: https://en.wikipedia.org/wiki/Inter-Client_Communication_Conventions_Manual

Xinerama: https://en.wikipedia.org/wiki/Xinerama

RandR: https://www.x.org/wiki/Projects/XRandR/

Openbox: http://openbox.org/wiki/Main_Page

wingo: https://github.com/BurntSushi/wingo

History of grep: https://en.wikipedia.org/wiki/Grep

GNU grep: https://www.gnu.org/software/grep/

Ken Thompson, author of original grep: https://en.wikipedia.org/wiki/Ken_Thompson

Silver searcher (ag): https://github.com/ggreer/the_silver_searcher

ack: https://github.com/beyondgrep/ack3

Ridiculous Fish post on Treacherous Optimization: https://ridiculousfish.com/blog/posts/old-age-and-treachery.html

Mike Haertel "why GNU grep is fast": https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

GNU Parallel: https://www.gnu.org/software/parallel

xargs: https://en.wikipedia.org/wiki/Xargs

Boyer-Moore algorithm: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

memchr: https://man7.org/linux/man-pages/man3/memchr.3.html

Rust memchr crate: https://crates.io/crates/memchr

Rust regex crate: https://github.com/rust-lang/regex

Vector instructions: https://en.wikipedia.org/wiki/Vector_processor#Vector_instructions

ripgrep FAQ: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md

Memory-mapped I/O: https://en.wikipedia.org/wiki/Memory-mapped_I/O

ripgrep in Visual Studio Code: https://github.com/microsoft/vscode-ripgrep, https://code.visualstudio.com/updates/v1_11#_text-search-improvements

PCRE2: https://www.pcre.org/current/doc/html/index.html

re2c: https://re2c.org

Ragel (finite-state machine compiler and parser generator, the name slips Andrew's mind during the episode): https://en.wikipedia.org/wiki/Ragel

Backtracking regex implementation: https://users.cs.cf.ac.uk/Dave.Marshall/PERL/node80.html, https://docs.microsoft.com/en-us/dotnet/standard/base-types/backtracking-in-regular-expressions

Finite automaton regex implementation: https://deniskyashif.com/2019/02/17/implementing-a-regular-expression-engine/

Lazy DFA: https://github.com/rust-lang/regex/pull/164

Rust: https://www.rust-lang.org/

Norman Ramsey: https://engineering.tufts.edu/people/faculty/norman-ramsey

QuickCheck (Haskell): https://hackage.haskell.org/package/QuickCheck

Standard ML: https://en.wikipedia.org/wiki/Standard_ML

Viterbi algorithm: https://en.wikipedia.org/wiki/Viterbi_algorithm

Andrew's blogpost "My FOSS Story": https://blog.burntsushi.net/foss

Bill Belichick: https://en.wikipedia.org/wiki/Bill_Belichick

Transcript

If you notice any errors in this transcript, you can propose changes to the source.

Beyang Liu: All right, I'm here with Andrew Gallant, creator of the popular command line search tool ripgrep and prominent contributor in the Go and Rust open source communities. Hey, Andrew, how's it going?

Andrew Gallant: Doing great. How about you?

Beyang: Not too bad. Not too bad. Thanks for coming on the show today.

Andrew: No problem. Happy to be here.

Beyang: So there's a lot of cool content I'm excited to delve into. But before we get into ripgrep and all your open source contributions I always like to start off the conversation by asking people, what was your inception point into the world of programming? How did you get started? And what was your journey from there?

Andrew: Yeah, I'll try to give you the short story. But basically, back in, I don't know, let's say middle school, early high school, I was obsessed with the Nintendo GameCube release coming out. I even went to one of the Cube Clubs in Boston, when they had those back then, where you got to try out the GameCube before it was released. And anyway, I was really obsessed with that. So I started creating websites, and they were terrible GeoCities kind of websites with lens flares everywhere. So that kind of got me into the HTML, CSS part of it. And it kind of bridged me into the whole bulletin board system stuff that was going on back then. If people were around back then: phpBB, vBulletin, Invision, YaBB, I think, yet another bulletin board, something like that.

Beyang: Yeah.

Andrew: And a whole bunch of other things. Kind of got me into that. And figuring out how to set up my own bulletin board system, you kind of had to start learning a little bit of code and how to host websites and that sort of thing. I eventually just got started hacking on vBulletin, which is what pushed me to learn PHP, my first programming language. So I started adding like little tiny mods and changes to it and that sort of thing. And then I just eventually said, "Hey, this vBulletin doesn't work the way I want it to." So I built my own bulletin board system. And that kind of pattern repeated itself over and over again throughout my open source career where I would see something and just kind of say "I could do this better." Or maybe not better, but a different way, the way that I wanted to. So that was probably back in 2003, something like that. And that's kind of what started my love affair with code.

Beyang: That's awesome. And from there, what was kind of the path into delving into the world of like regular expressions and command line search tools? I assume it wasn't a direct path, it must have been circuitous.

Andrew: No. Yeah, I don't know if there's any... I think the overall path, a very brief version of it, is I did the bulletin board stuff for a few years. Then I got into X11 stuff because again, the same thing kind of repeats itself, where there were some window managers out there, i3, awesome, Openbox, xmonad. I tried kind of all of those and they just didn't quite work the way I wanted at the time. This was probably back around 2010 or so. And I wrote my own little tiling manager that would sit on top of an existing window manager, called PyTyle. And that kind of got my feet wet with X.

Andrew: I created a fork of Openbox that supported multiple monitors. That didn't quite work the way I wanted it to. So then I just started... That's also the same time I discovered Go a little bit before the 1.0 release. I'm like, "Hey, I could build a window manager in this language. It'd be much nicer than using C." I had tried Python and that failed. And yeah, so I kind of did that. And then eventually, I found Rust, several years later. And I don't really know what caused me to get into regexes other than the fact that I was just kind of looking at what libraries were needed. And people were like, "Hey, we need a regex engine written in Rust." And it should be like re2.

Andrew: And I'm like, "Hey, okay." So I started reading the Russ Cox articles on regex engines. And honestly, that's kind of what started my love affair with like lower level text primitives and searching and whatnot, was basically at that point in time. Before that, before Rust, and before, I have read Russ Cox's articles, I don't think I had any particular love affair with regexes, although I did do a freshman project where I implemented a very basic regex engine in Assembly, which was fun. That was several years before that, but I think it was mostly just a matter of serendipitous I was getting into Rust at that point, looking for things to do. And the regex engine idea just really kind of lept out at me as something that seemed really interesting.

Beyang: So, yeah, nice. I guess one thing just leads into another.

Andrew: Yeah. And I guess just to finish off the rest of your question there, I realized that didn't quite get to this command line search tool aspect of it. But ripgrep came a few years after the regex engine, and its original reason for being was just to benchmark the regex engine itself. In particular, I think the reason why I initially did it was because I wanted to see whether the regex engine and its overhead could be competitive with grep. So if you wanted to go implement a grep (I didn't necessarily want to, but if you wanted to), would the regex engine be fast enough for that task? And so I kind of built up some of the internals, and I'm like, "Hey, this is pretty much as fast as grep in most cases."

Andrew: So I just kind of took that and the silver searcher idea and ran with it. So ripgrep is nothing revolutionary. It's the same pattern of "Hey, I can do a little bit better." Or my version of it, that kind of thing.

Beyang: Interesting, interesting. I want to dive into ripgrep. But real quick, since you brought up the topic of window management, and also specifically i3, so we had the creator of i3, Michael Stapelberg on the show a couple episodes back. And I just want to hear really quick what were kind of your complaints with existing window managers, and what were the features that you just had to have that made you implement your own?

Andrew: Yeah, we could probably do a whole podcast on that alone. But basically, what it came down to was I wanted to run three monitors, simultaneously, and in a way that worked well. And pretty much all, I don't want to say all, because maybe there's one I missed. But as far as I could tell at the time, all window managers basically implemented what's called the EWMH spec and the ICCCM spec. And basically those are conventions for how window managers, other clients (the window manager is itself a client, but it's kind of a special client), and the X server communicate with each other.

Andrew: And it's like all these things about... It's how you're able to have a pager on a desktop that is implemented completely separately from the actual window manager itself. Because there's all these different conventions that allow you to communicate, okay, what's the current workspace, and what's the active window and all that sort of stuff. So those specs kind of, I won't go into a lot of detail. But basically, they hardcode the assumption that for every single root window, there's exactly one workspace active for that root window at any point in time. And if you put aside the old school X screen model of multiple monitors, where you can move your mouse between them but not your windows, the modern day version of that is Xinerama, which was replaced by the Resize and Rotate, or RandR, extension.

Andrew: And those basically implement multiple monitors by stretching the root window across all the monitors. So that assumption of one workspace per root window holds in all of these window managers, because the EWMH spec basically requires it. So if you have that, then all of a sudden, if you switch your workspace, it switches the workspace across all the monitors. Not a big deal if you have one monitor, maybe you can kind of get by with two. But when you have three, it's just like, "Oh my god, I want to have one workspace on this monitor, one workspace on this monitor," and so on. And I want to be able to switch them independently. And I want to be able to move one workspace to another monitor, and I just don't want to have any restrictions.

Andrew: So my fork of Openbox added that functionality. And then eventually it just kind of became difficult to do some of the other things that I wanted. And that kind of bled into "Hey, it would be a lot of fun," or so I thought, "to build my own window manager." And that was kind of the driving reason. Shortly thereafter, I believe Apple added that feature to macOS, not that they stole it from me, because nobody knows about wingo, which is the name of the window manager. And I don't know if i3 ended up adding it. I know xmonad had it way before me. So yeah, that was kind of what led to that window manager being built.

Beyang: Interesting. Around what year was this?

Andrew: I think wingo, the first like, release that I could actually use was somewhere around 2012. Maybe 2011. Somewhere around there.

Beyang: Got it. Yeah, I wonder what percentage of programmers out there are just kind of obsessed with getting their window manager to behave the way they want it to? Because I've definitely gone down rabbit holes like that myself.

Andrew: Yeah. I think there are a lot of them. And I think, by and large, most people probably get by with existing window managers. But I am certainly not the first and I won't be the last to have written the window manager because something didn't quite work the way I wanted it to or I wanted to scratch an itch. But yeah, I've been using that window manager since it was usable in 2012, or whenever it was. I think probably my only other user is my wife, and she runs it on her laptop. And so far, so good.

Beyang: That's great. Cool. So ripgrep, we got a little bit into kind of the beginnings just now. But before we dive further, can you explain what it is? And why does it matter to someone who may not be as familiar with it?

Andrew: Yeah. So ripgrep, at its simplest description, is a tool that will search a directory tree of files, mostly plain text files, for a regex or a literal word, and will report the results to you. The simplest model, if you think about it, just to explain the algorithm, is: for every file that you want to search, you iterate over the lines of that file, and if the pattern you provide matches that line, then you print the line. Right, simple. And all these extra features are built on top of that kind of simple algorithm. And ripgrep is not the first to do that. GNU grep has tons of features if you go and look at it.
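To make that simple model concrete, here is a bare-bones sketch in Rust of the line-by-line search Andrew describes. It uses plain substring matching rather than a regex, and the pattern and path are just placeholders for illustration:

    use std::fs::File;
    use std::io::{BufRead, BufReader};

    // The simple model: iterate over the lines of a file and print every line
    // the pattern matches (plain substring matching here, for brevity).
    fn naive_grep(pattern: &str, path: &str) -> std::io::Result<()> {
        let reader = BufReader::new(File::open(path)?);
        for line in reader.lines() {
            let line = line?;
            if line.contains(pattern) {
                println!("{}", line);
            }
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        // "src/main.rs" is just a placeholder path for illustration.
        naive_grep("fn main", "src/main.rs")
    }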

Andrew: The main difference between ripgrep and GNU grep is probably that its defaults are optimized for code search. And what that means is that it will do recursive search by default. So if you don't specify a file path, all you do is type rg pattern, it will just search the current directory. Whereas with grep you have to pass a -R flag. And the other big thing here is that it will respect your gitignore files, your .ignore files, your .rgignore files, it will skip binary files and it will skip hidden files. So all three of those things are what's called smart filtering, for some definition of smart.

Andrew: And basically what that does is it does a search, and it searches the files you probably want to search in most cases, and it doesn't search the files that you probably don't want to search in most cases. And I think that's kind of why it matters. The defaults, from a user experience level, are why it matters. And I would say secondary to that is performance. And in a large number of cases, ripgrep and GNU grep will perform similarly. But if you look at it from the user experience perspective, ripgrep will often perform much faster than GNU grep because of the fact that it will do parallelism by default. And that's probably the primary reason. If you just run ripgrep over a repository, it's going to skip some files. So if it skips some several-gigabyte files, that will be why it's faster, for one.

Andrew: And obviously, the second thing being that it uses parallelism by default. And I think that matters. So the default mode of operation for ripgrep is that it gives better results for what you want, it has less noise, and it tends to be faster. Not because of any innate algorithm difference, although once we get into more details, there are some algorithmic differences. But basically, just using parallelism is the main thing there.

Beyang: Yeah, as a ripgrep user, myself, it's my main command line search tool. I have to say that everything you just said is 100% true, it feels and is so much faster. I remember back when I used grep a lot, every kind of instance of wanting to use grep was kind of an exercise in trying to remember which command line flags were the right ones to pass. What directories did I need to exclude? And because of that I probably didn't grep as much as I wanted to, just because there was always this like, "Ah I got to deal with that again?"

Andrew: Yep, absolutely. That was my experience as well. I had, I don't know how many different wrapper scripts around grep that would only search Go files and that sort of thing. Or whatever, basically it would approximate some of the defaults of ripgrep and Silver Searcher and of course ack before that.

Beyang: Yeah. So you mentioned those other grep alternatives. Silver Searcher, ack. So these predate ripgrep?

Andrew: Yes.

Beyang: Did you use any of those before creating ripgrep?

Andrew: I did not actually.

Beyang: Interesting.

Andrew: I had known about Silver Searcher and was vaguely aware of ack at the time. And for whatever reason, I was just kind of like... It kind of repeated a lot of the things that you hear from people who don't use ripgrep today, which was "Grep's good enough." And for the most part, that was true for a lot of things. And I think it was just, I never really gave the Silver Searcher a try. I think I had used it before. Certainly, when I was building ripgrep, I started using and trying Silver Searcher more to understand what the competition was like, and what the existing tools were like. But before that point, I don't think I had ever used it in anger. I just used my grep wrapper scripts, just because it worked well enough. And that's kind of like the story of tools. It works well enough. You don't really have a reason to use something new.

Beyang: Yeah, it's like, is the expected gain I get from using this additional tool worth the added complexity of really learning that new tool? And then, God forbid, I switch machines someday and have to reinstall all these new tools.

Andrew: Yep. It's another abstraction. It's another layer. I don't know if I'd say abstraction, but it's definitely another layer of stuff. And it's not in core utils. It's not ubiquitous. A lot of people know about it, it's in a lot of repositories, but it's not going to be on every machine, like grep is.

Beyang: Yeah, makes sense. So you mentioned when you created ripgrep, you didn't really go into it wanting to make a grep competitor, an alternative to grep. You started with the regular expression engine, and then you wrote this thing as a way to kind of benchmark that engine and it turned out to be performant, competitive with all the other solutions?

Andrew: Yeah.

Beyang: I guess at what point did you say like, "Okay, I'm going to actually invest in making this into a really neat command line tool that I myself am going to use day to day."

Andrew: That's a good question. I don't know if there was a specific cutover point where I said that to myself. I mean, I guess there had to have been. But I think really, it was just a matter of, "Hey, this is competitive in a lot of cases." Not just in some of the cases that I had expected that it might be, or maybe didn't even believe that it would be. Because at the time I kind of had the same impression about GNU grep's speed as everyone else. And I say that because there are two well known articles that have gone around the internet for a long time. One of them was the Ridiculous Fish post about the treacherous optimization, and how he was looking at how to optimize the Boyer-Moore algorithm and that sort of thing.

Andrew: And then there was the famous post from, I believe, I don't know how to pronounce his name, Mike Haertel, the original author of GNU grep. And he posted to the BSD list around 2010 about why GNU grep was fast. And he listed all these different reasons, and it was like, "Oh my God, how can you ever beat GNU grep?" So I think once I actually had a tool, like a very basic tool that was useful for benchmarking and was somehow competitive with GNU grep, I was like, "Hey, I could turn this into a real tool."

Andrew: I think the thing I was benchmarking at the time was whether if you have a match on every single line, I was benchmarking that procedure. And that procedure, a lot of the optimizations don't matter too much, because you're having a match on every single line. So there's a ton of overhead. You might as well iterate over every line and try to run the regex and that sort of thing. And even a naive approach will perform similarly there. So the main thing I was testing there was just the overhead of the regex engine, is that competitive with GNU grep?

Andrew: And that turned out to be okay. I think that caused me to do some optimizations in the regex engine. But then when I tried other cases where the matches are less frequent, and that's where the performance tricks really start to show the main difference between a naive grep and one that's been optimized crazily. And I think once I tested that case, and when that became competitive that's when I said, "Wait a second, I could go build a tool that does better than Silver Searcher."

Beyang: Got it. What portion of the kind of end-user-perceived performance gains comes from just being smarter about what to search over, versus the performance of the underlying regex engine and algorithm itself? In your estimation?

Andrew: Yeah, that's a great question. I think the first thing to tackle there is the smart filtering. Because oftentimes, that can actually make the search go slower. Even though it skips files and causes the tool to search fewer things, the actual process of resolving all the globs and the gitignore files can cause the overall search time to be slower, because it takes time to build those gitignore files into glob matchers. And it takes time to match those.

Andrew: So if you have a deep directory, and every directory along the way has a different gitignore file, in order to implement it correctly, you have to match a file path against every single gitignore file up to the root of the repository. And then there's like global gitignores and all sorts of other things. And then there's the .rgignore files, and those take time to process. But if you have some huge binary files, maybe they're build artifacts, executables, images, whatever. I hesitate to say GNU grep will search all those, because it has some optimizations for not doing as much work in a binary file. But for the most part, ripgrep will be a bit smarter about skipping those. It will also skip your .git folder, which can get quite large in big repositories, whereas GNU grep will not.

Andrew: You'll often see GNU grep wrapper scripts will have like a --exclude-dir=.git. And that on its own could speed up the search quite a bit. So you kind of have that level of thing. Obviously, if there are huge files that are being skipped, that's one aspect of the performance improvement. I think the next thing that comes from that, particularly when you're doing recursive search, is mainly just parallelism, because GNU grep does no parallelism.

Andrew: The only way to do parallelism with GNU grep is to use some other tool such as GNU Parallel or xargs. And those invocations are not too bad, but they're extra stuff that you have to do. I know that there have been efforts to parallelize GNU grep, or I believe there have been, and it's been difficult because GNU grep was probably written in a day where parallelism wasn't really a big... Wasn't really a predominant paradigm. And so there are just global mutable variables everywhere. Or at least last time I looked there were. So that's kind of the second thing that will make ripgrep faster in a large number of cases. And I think the third thing is probably the regex engine itself, and some of the literal optimizations. I'll describe one of them, because I think it's the most interesting one. And I'll do my best to make it brief.

Andrew: But basically, GNU grep implemented Boyer-Moore. If you've done substring search algorithms, or implemented substring search algorithms, the main thing about Boyer-Moore that you're supposed to remember is that it will skip characters. Sometimes it will look at a character, and it will consult a mismatch table. And if that character doesn't appear in your search string somewhere, the algorithm will know to skip over X number of characters and move on to the next piece. And Mike Haertel's post to the BSD mailing list especially kind of popularized this thing: "Hey, we don't actually look at every byte."

Andrew: But it turns out that on modern hardware... this may not have been true back when, probably wasn't true back when, Mike Haertel had written that mailing list post. But nowadays on modern hardware, that's actually not the thing that makes Boyer-Moore fast. What makes Boyer-Moore fast is the skip loop. And the skip loop takes the last byte of the pattern and feeds that to a routine called memchr. And memchr is provided by libc. And on most platforms, that operation, which is take a single byte and look for it in some haystack of bytes, is super optimized. So glibc, GNU libc, will have assembly implementations for SSE2 and for AVX. The ARM platform will have assembly implementations for it. musl has its own optimized version in C, so it doesn't use assembly, but it uses C. And so on and so forth.

Andrew: So if you have a really rare byte that you feed to that memchr routine, it's going to stay in that memchr routine for the vast majority of the file that you're searching. Now, if you happen to have a very frequent byte that just so happens to be the last byte in your needle, then that memchr routine is going to run, stop, run, stop, run, stop. And you're just going to be constantly ping-ponging in and out of the memchr routine, which is... It's just going to be slow, because you want to spend as much time as you can inside that SIMD vectorized loop.

Andrew: And so basically, I don't know of anyone who has done this optimization before, but I had the insight that you just have a background frequency distribution, a heuristic, you just estimate it. It's not something that changes. You take some common text, some source code, and you say, "Hey, this byte is really rare. And this byte is not so rare." And instead of always picking the last byte, you look through all the bytes in the needle, and you pick the one that is probably the rarest. And you feed that to memchr instead. And that's your skip loop.

Andrew: And that's where the vast majority of the performance improvement from ripgrep comes, I think, in the common cases, let's just say in the common cases, when you're searching just a single file, compared to GNU grep. GNU grep could implement that optimization. I don't know if anyone's tried, but they could. So I think those are the three things for common cases: skipping really huge files sometimes, parallelism, and heuristic tweaks to existing substring algorithms based off of a background frequency distribution.
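A toy sketch of that rare-byte heuristic, built on the Rust memchr crate: the frequency table below is invented for illustration and far cruder than what ripgrep actually uses, but the idea is the same, pick the needle byte most likely to be rare, scan for it with memchr, and only then verify the full needle:

    use memchr::memchr;

    // Invented background frequency ranks: lower means "rarer" (ripgrep derives
    // its table from sampling real text and source code).
    fn rank(b: u8) -> u32 {
        match b {
            b'e' | b't' | b'a' | b'o' | b'i' | b'n' | b' ' => 3, // very common
            b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' => 2,        // common
            _ => 1,                                              // comparatively rare
        }
    }

    fn find_candidates(needle: &[u8], haystack: &[u8]) -> Vec<usize> {
        // Pick the probably-rarest byte of the needle and remember its offset.
        let (offset, &rare) = needle
            .iter()
            .enumerate()
            .min_by_key(|&(_, &b)| rank(b))
            .expect("needle must be non-empty");

        let mut candidates = Vec::new();
        let mut pos = 0;
        // Stay inside memchr's vectorized loop as long as possible: scan for the
        // rare byte, then verify the whole needle around each occurrence.
        while let Some(i) = memchr(rare, &haystack[pos..]) {
            let found = pos + i;
            if let Some(start) = found.checked_sub(offset) {
                if haystack[start..].starts_with(needle) {
                    candidates.push(start);
                }
            }
            pos = found + 1;
        }
        candidates
    }

    fn main() {
        let hits = find_candidates(b"zebra", b"the zebra ate the zebra grass");
        println!("{:?}", hits); // [4, 18]
    }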

Beyang: Got it. That optimization that you described, took advantage of, I think, very domain specific knowledge about the memchr routine. Were you familiar with that going into this? Or was this something you uncovered in kind of like looking through the critical regions of the code as you're doing performance optimization of ripgrep?

Andrew: That's a good question. I don't know actually, when I realized. I feel like knowledge of memchr's just kind of always been there in my brain. But I know Russ Cox's articles mentioned memchr, there's a brief call out to it. But otherwise, I mean, I've implemented memchr. So if you use the memchr crate in Rust, in the Rust ecosystem, then you'll use my crate. And there's a Rust implementation of it using AVX. So when you use ripgrep on Intel machines, you're using my implementation of memchr, not glibc's. So the fact that I'd also implemented it kind of also gave me knowledge of it.

Beyang: Got it.

Andrew: But yeah, I think it's kind of just one of those performance tricks that kind of has its own general rule, which is, you have a fast path, and then you have the slow path. And a lot of times the performance optimization, assuming the fast path already exists, is really just about causing more inputs to use the fast path in more cases. And that's kind of the case here.

Beyang: Got it. So you've kind of drawn a contrast between ripgrep and grep. And explained the kind of performance differential between those two. But what about the more recent grep alternatives? From my experience ripgrep is faster than those, but all those were written more recently than grep. I think grep was made back in what 1970, something or other. So presumably they had access to similar knowledge and awareness of modern systems as you did, what do you think contributed to ripgrep's ability to outperform those more recent alternatives?

Andrew: Oh boy, that is a complex question. So, I just want to make a first clarification: grep was invented, I believe, back in the early '70s. And it was invented by Ken Thompson. And as far as I know, I don't know if there's any direct descendant of that specific program that he wrote that's still in use.

Beyang: It's different?

Andrew: Yeah, maybe a Plan 9 system, or one of the BSDs might use a direct descendant of Ken Thompson's original grep. But I actually don't know. But GNU grep, I believe, was written in the early '90s or late '80s. And that was by Mike Haertel. And I don't know if that was before vector instructions. I'm pretty sure vector instructions existed by then, but I don't think they existed on Intel platforms, as far as I know.

Andrew: But, yeah, so just getting to the meat of your question, which is grep alternatives. And why aren't they as fast? Well, so I'll just talk about two, ack and Silver Searcher. I think they're the most popular ones people know about. There are several others that I know about. But I'll just talk about those two. So ack is written in Perl, and it uses Perl's regex engine. As far as I know, ack has no C code in it other than Perl's implementation of the regex engine, which I'm sure is in C. I know it's in C, I've read it before. So there's like that aspect of it.

Andrew: And as far as I know, ack's author has not prioritized performance. His main thing is the UX, the user experience of it, the skipping of the files, and the features that are in ack. So as far as it being slower, I don't think I've ever done any in depth performance research there. But waving my hands, I'd say, "Well, it's written in a higher level language. And even if the regex engine is in C, you still have all this other glue code that's in the higher level language, and it has to communicate back and forth between C." And I would say that's probably where the performance difference comes from. As far as like Perl's regex engine, I've never done any kind of performance comparison between it directly and my own. Just because it's hard to do, because, as far as I know, there are no C bindings to Perl's regex engine. So there's that.

Andrew: So Silver Searcher is... I have a big FAQ entry on it in my ripgrep repo. So for people that are more interested in the details, they can go to that. But I would say that the first thing is that memory maps on Linux, if you're using them on a bunch of tiny files in a repository, are actually worse than not using memory maps. Silver Searcher's README claims memory maps are the reason why it's fast. And as far as I can tell, memory maps only make a difference when you're searching huge files. And there's a small performance improvement. So that's the first thing.

Andrew: The second thing is that its parallelism is only at the search level, whereas ripgrep's parallelism is both at the search level and the directory traversal level. And that's important for a really critical reason. Sure, there are the performance gains that you get from doing directory traversal using parallelism on its own. They're small, but they're there. But the big one is the gitignore processing. Because the gitignore processing is built into the directory traversal, because if you are ignoring a directory you don't want to descend down into that directory. So if you parallelize directory traversal, then you parallelize gitignore matching. And Silver Searcher does not do that. But ripgrep does, so that's another reason. I don't know how much of a reason it is, but it's one.
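For a rough sense of what gitignore-aware, parallel directory traversal looks like, here is a minimal sketch using the ignore crate that ripgrep builds on. The API usage is from memory and may differ in detail from the current crate, and it is not ripgrep's actual traversal code:

    use ignore::{WalkBuilder, WalkState};

    fn main() {
        // Walk the current directory in parallel, respecting .gitignore / .ignore
        // files and skipping hidden files by default, similar to ripgrep.
        WalkBuilder::new("./").build_parallel().run(|| {
            Box::new(|result| {
                if let Ok(entry) = result {
                    if entry.file_type().map_or(false, |ft| ft.is_file()) {
                        // This is roughly where a file would be handed to a searcher.
                        println!("{}", entry.path().display());
                    }
                }
                WalkState::Continue
            })
        });
    }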

Andrew: And I'd say the next biggest reason, honestly, is some combination of the regex engine and literal optimizations. So Silver Searcher does like the most basic literal optimization that there is, which is, if the whole entire pattern is just a literal, then I think it defaults to its own Boyer-Moore implementation. Ripgrep's literal optimizations are much more sophisticated, on the level of GNU grep's. And basically, what it does is if you have a literal in the beginning of a regex, a literal at the end of a regex, or even some combination of literals in the middle of a regex, ripgrep will detect those, and it will use a substring search implementation on those. And it will search for those, and if it finds a line that matches, then it will run the regex engine on that line. And that's kind of the same reason why GNU grep is fast, because it does the same kind of optimization. So I'd say that combination of things is what makes it faster than the grep alternatives.
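A minimal sketch of that prefilter-then-verify idea, assuming the memchr and regex crates as dependencies. The literal "error" is pulled out of the pattern by hand here, whereas ripgrep extracts such literals from the regex automatically:

    use memchr::memmem;
    use regex::bytes::Regex;

    fn main() {
        // A regex with a required literal. The literal prefilter below is cheap
        // and vectorized; the full regex engine only runs on candidate lines.
        let re = Regex::new(r"error: \d+").unwrap();
        let finder = memmem::Finder::new("error");

        let haystack: &[u8] = b"ok: 1\nerror: 42\nwarning: 7\nerror: oops\n";
        for line in haystack.split(|&b| b == b'\n') {
            if finder.find(line).is_some() && re.is_match(line) {
                println!("{}", String::from_utf8_lossy(line)); // prints "error: 42"
            }
        }
    }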

Beyang: Got it. You have such a good command of what the bottlenecks and like critical regions are in both your own code and the code of tools that do similar things. Do you have, like a process or standard toolkit that you apply to identify and build up this understanding?

Andrew: I think the short answer is just science. It's the scientific method. You form a guess about something based on your experience. And you go and test it, and you try to prove yourself wrong. And if you're right, then you should be able to find evidence that supports your guess. And eventually you do that enough you develop experience, you develop kind of like an internal model of what things are doing and how they behave and where the bottlenecks are. And that kind of is what gave me my knowledge, I guess, I would say. Reading other people's experiences, their blog posts, the information that they've shared, their code, comments in the code that says, "Hey, this is a performance bottleneck."

Andrew: Maybe it's not correct. But it's something you put in your brain as, "Okay, I can test that later. Or if there's something that I run into with my code, maybe that's something I can look at." Because when something is slow and you don't know why, which happens, still happens to me, you have to use your tools, your profiling tools, perf, Valgrind, looking at the generated assembly, using your own experience to guess at where things might be slow or fast, even println profiling. Printing out timestamps at certain points, all those sorts of tools are available to you. And doing those things over and over again.

Andrew: And when you don't know where the performance bottleneck is, you kind of have to just narrow it down with those tools. And when you do narrow it down, you've added a data point to your experience. "Hey, I had this problem. And the bottleneck ended up being here." And if you do that enough, you get some experience, you get some intuition. But when it comes to performance you always want to check your assumptions. Because even when you have tons of experience doing it, you can still be wrong. And yeah, I think that's probably a good enough answer for if you have other questions.

Beyang: Yeah, no, I think my reaction to that is like with regards to science, the actual scientific method part is kind of the easy part. It's the hypothesis-

Andrew: Yeah that's true.

Beyang: ... proposal and formation part that is really the secret sauce.

Andrew: Yeah, that is a really good point. You're absolutely right. And it took a long time to bootstrap that. So how do you even... Like it's a whole bunch of unknown unknowns. Like things you don't know that you don't even know. So how do you deal with that sort of situation? By asking for help, or learning the tools. If you're not sure where to start, learning the tools as best as you can, and just trying to narrow down where the problem is. Where's the code spending most of its time? And if you're having trouble reading some assembly, because oftentimes that's what it comes down to, you're using your profiling tool and you're looking at the generated code, and you're not sure how to read it, hop on a discussion forum or a chat channel somewhere, and I'm sure there's someone who'd be happy to help you decipher some assembly. I mean, there are people who love doing that. So, yeah.

Beyang: Cool. Ripgrep is now just insanely popular. I think everyone on the engineering team at Sourcegraph that I know uses it. And kind of from the... How old is the project now? Is it a couple years?

Andrew: Four years, actually in September.

Beyang: Okay. Was there a point at which it really started to take off? Or was it kind of the steady growth as it went along? Do you recall or...

Andrew: I think the only... Other thing like GitHub stars, which people have opinions about whether they're meaningful in any-

Beyang: It's a proxy. Yeah.

Andrew: It's a proxy of something. Yeah. So there's that. And as far as I can tell, that was just steady growth over time. The only other thing that I've really monitored... There's two other things I monitored. One of them was, I will occasionally look at the homebrew statistics. So they have kind of like that... I forget what... I know there was a controversy about it a while back about opt out, opt in. But they collect analytics on how many times the package is installed. So ripgrep is on that list. I think it passed Silver Searcher a while back, if I remember correctly, so that was one milestone.

Andrew: Other than that, the only other thing that I really am aware of is ripgrep's adoption into Visual Studio Code. That happened... Actually, I can't quite remember exactly when that happened. But I want to say it's at least 2018 or earlier. But yeah, I distinctly remember that definitely created an uptick in awareness about ripgrep. Because it was in its release notes, and lots of people use VS Code, that sort of thing. I know people see it in their processes tab on Windows, I've seen that screenshot. "What is this rg.exe process using up all my CPU?" So I've kind of seen that.

Andrew: Yeah, otherwise, I would personally say steady. There was kind of like a big bang when it was initially released. And other than that, as far as I can tell, steady. But I don't have any kind of like monitoring tools where I can actually tell how popular it is.

Beyang: Got it. Now that you have so many people using it, to what extent do you allow external feedback to drive your future plans and roadmap, versus how much do you still lean on your own intuition and your own kind of personal needs?

Andrew: Yeah, I've actually... I can think of, off the top of my head, two big things that I've reversed course on in the past, which were multi-line search, and support for lookaround and backreferences. Which inevitably was added by adding a new regex engine, because I didn't want to add that to the Rust regex crate. So I added support for PCRE2. Both of those things were things that I initially set out as "No, I don't want to do them," because they make the implementation too complex. And if the implementation is too complex, it becomes harder to maintain. And I'm trying to build a... I want the project to be sustainable, where I'm happy to maintain it, because it's not too much work.

Andrew: So that kind of decision process is primarily what drives my inclination for adding new features at this point: is it going to turn it into a project I don't want to maintain, like, for instance, wingo? That window manager is basically in maintenance mode. I keep the lights on. I don't do any new feature development. And I don't really want ripgrep to get to that place, I want it to continue to evolve and to continue to improve on performance. History says that's not going to be the case forever, there's always going to be some new young tool that comes out and does something better or smarter. It exploits some new hardware feature better, it exploits some new interaction, maybe there's a new source control program, some source control thing that we move on to from Git.

Andrew: And the way it does, I don't know, I'm just spitballing here, but the way it ignores files, maybe there's some new way of traversing files more quickly that allows a search tool to be faster. So in terms of future direction, and how other people influence it: if there are a lot of people requesting a feature, then I try to let that guide things, as long as we can find a sustainable way, for my definition of sustainable basically, to implement it. And if we can, then I'm willing to accept new features.

Beyang: Cool. Sounds very reasonable. You mentioned that you added support for PCRE2, a new regular expression engine, to ripgrep. I was wondering if you could talk about kind of the varieties and different approaches that you've seen to regular expression matching in the wild. Because from my point of view, regular expressions are something I learned about in my four-year computer science degree. It was presented in a way that suggested that these things have been around for a while, and there's kind of a standard way to create a matcher. And then in the actual real world, I've come into contact with a variety of different regular expression engines, and have been surprised at the kind of diversity of implementation styles and performance trade-offs and things like that. So what are the different varieties of regular expression processing that you've seen, and why isn't it as simple as, "Hey, take this thing, translate it to a DFA, and run the DFA, and then you're done?"

Andrew: Yeah, that's another question that we could spend a whole podcast on. But I think if we limit ourselves to general purpose regex engines, where... It's like a regex library that you might use. What's the regex engine you use in PHP, what's the regex engine you use when you're writing Perl, and so on and so forth. If we limit ourselves to that kind of scope and we put aside the specialty regex engines, where you might have something like re2c, or... Oh god, I'm blanking on the name, the popular one from... Wow, okay, blanking on the name, but it's the one where you're able to write and generate DFAs ahead of time. And you can execute certain actions for them. But if you put aside those things, and you basically limit yourself to general purpose regex engines, you kind of have two kinds of constraints put on you that I can think of off the top of my head. Which are that compilation time has to be reasonable, and match time has to be fast, ideally fast.

Andrew: And if you're at that level, there are really two different ways to implement a regex engine. One of them is with backtracking. And that's kind of what PCRE does, what Perl does, what Ruby's regex engine does, what JavaScript's regex engine does. And then there's the finite automaton approach, which is usually some combination of an NFA simulation, a non-deterministic finite automaton simulation, and a deterministic finite automaton simulation. I've never implemented... Well, I've done backtracking before. But I've never implemented like a PCRE backtracking regex engine. So I can't talk too much about them. But my general understanding of them is that they tend to break a regex down into one of many different types of opcodes. And the regex engine itself essentially executes like a VM.

Andrew: And the main feature of backtracking engines is that they have tons of features in them. So like backreferences, and lookarounds, and PCRE has a whole bunch of stuff like recursion and callouts, and all sorts of crazy things that... Really fun. And the main drawback of all those things is that it's hard to implement those things in a way that's fast from a time complexity point of view. So it's very easy to create a regex that will take exponential time on the input. And basically, that causes a class of security vulnerabilities called regex denial of service attacks, ReDoS. And this has happened in some high profile cases, I think it happened to Cloudflare. And Stack Overflow is another popular one that I'm aware of in the last few years.
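To make the backtracking idea concrete, here is a toy recursive backtracking matcher in Rust that supports only literal bytes, '.', and '*'. Real engines like PCRE compile to opcodes and support vastly more, but the try-and-retry behavior that can lead to exponential blowups is the same in spirit:

    // Anchored toy matcher: the pattern must consume the entire text.
    fn match_here(pat: &[u8], text: &[u8]) -> bool {
        match pat {
            [] => text.is_empty(),
            // c followed by '*': try zero or more occurrences of c, backtracking
            // to a shorter repetition whenever the rest of the pattern fails.
            [c, b'*', rest @ ..] => {
                let mut t = text;
                loop {
                    if match_here(rest, t) {
                        return true;
                    }
                    match t.split_first() {
                        Some((&first, tail)) if *c == b'.' || first == *c => t = tail,
                        _ => return false,
                    }
                }
            }
            // A single byte (or '.') must match the next byte of the text.
            [c, rest @ ..] => match text.split_first() {
                Some((&first, tail)) if *c == b'.' || first == *c => match_here(rest, tail),
                _ => false,
            },
        }
    }

    fn main() {
        println!("{}", match_here(b"ab*c", b"abbbc")); // true
        println!("{}", match_here(b"a.*d", b"axyzd")); // true
        println!("{}", match_here(b"ab*c", b"adc"));   // false
    }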

Andrew: But that's kind of that side of regex engines. The other side of regex engines is the one that I know a lot better, which is the finite automaton approach to it. And I'll try to be really brief here. But basically, if you implement the standard NFA simulation that you might see in a textbook, you'll get a great time complexity bound, which is the length of the regex times the length of the input. And that's the optimal case. And the problem with that approach is that it's dog slow in practice. And this actually happens to be what Go's regex engine implements. And it's why there's an open issue on the Go issue tracker that says "Hey, make the regexes go faster."

Andrew: And the typical way to speed that up, again, being very hand wavy here. The typical way to speed that up is something called the lazy DFA or hybrid NFA-DFA. And, again, I'll be super brief here. But the primary difference between an NFA and a DFA is that an NFA can be in multiple states at once. Whereas a DFA can only be in one state at once. And the main implementation difference between those is that when you see a new byte with an NFA implementation, you have to go through every single state that you're in, and process, "Okay, this state goes to this state, and this state goes to this state." For every state that you're in for that single byte. And it takes a lot of work, especially with alternations or repetitions or whatever.

Andrew: But with a DFA, you have a byte, you have a single state, and you move from that byte and that state to the next state. And that can be implemented very efficiently with a finite number of instructions, sorry, a constant number of instructions, and using a table in memory. And basically a DFA in that implementation is going to be about an order of magnitude faster than an NFA. The problem with a DFA, of course, is that building a DFA can produce a potentially exponential number of states in the size of the regex.

Andrew: And especially when you have Unicode support enabled, those cases are not rare, they're very common. And it creates a situation where compilation time is just too expensive in a general purpose regex engine, where you need to be able to compile a regex, maybe in web request response time, and be able to run that regex. So the hybrid NFA-DFA, also called the lazy DFA, is, I believe, something that Ken Thompson created or came up with. And it's something that is used in various places, GNU grep does it. Russ Cox kind of re-popularized it by describing it in his article series on regex engines. And that's how re2 is fast.

Andrew: And the fundamental reason why it's fast is because it combines the NFA with the DFA. It does not compile the full DFA ahead of time, it only compiles what it needs for the next byte of input. And then the crucial bit is that it caches the state that it created. So it sees the next byte, and if that state had already been cached, then it will act like a normal DFA. And it just repeats that, and eventually you get to a point where you've cached all the states that you're probably going to see, which is maybe a very small portion of the full DFA. And that tends to work great in practice. Its failure mode is that if you have to regenerate the state on every byte, and there are definitely some regexes where that occurs, then you end up spending a lot of time recreating the DFA and it ends up being a similar speed as the NFA. But thankfully, those cases are somewhat rare. So it ends up being a decent trade-off in practice.
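Here is a deliberately simplified sketch of that caching idea, using a tiny hand-built NFA rather than one compiled from a regex. It shows only the two essentials Andrew mentions: on-the-fly subset construction, and a cache of DFA transitions so that the fast path behaves like a precompiled DFA:

    use std::collections::HashMap;

    type StateSet = u64; // bitset of NFA states (toy: assumes fewer than 64 states)

    // One step of the "lazy DFA": look up the DFA transition for (set, byte) in
    // the cache, and build it with subset construction only on a cache miss.
    fn step(
        nfa: &[Vec<(u8, usize)>],
        cache: &mut HashMap<(StateSet, u8), StateSet>,
        set: StateSet,
        byte: u8,
    ) -> StateSet {
        if let Some(&next) = cache.get(&(set, byte)) {
            return next; // fast path: behaves like a precompiled DFA
        }
        // Slow path: simulate the NFA for this one byte, then cache the result.
        let mut next: StateSet = 0;
        for (state, transitions) in nfa.iter().enumerate() {
            if set & (1 << state) != 0 {
                for &(b, to) in transitions {
                    if b == byte {
                        next |= 1 << to;
                    }
                }
            }
        }
        cache.insert((set, byte), next);
        next
    }

    fn main() {
        // Toy NFA for the pattern "ab|ac": from state 0, byte 'a' goes to 1 or 3.
        let nfa: Vec<Vec<(u8, usize)>> = vec![
            vec![(b'a', 1), (b'a', 3)], // state 0 (start)
            vec![(b'b', 2)],            // state 1
            vec![],                     // state 2 (accept)
            vec![(b'c', 4)],            // state 3
            vec![],                     // state 4 (accept)
        ];
        let accepting: StateSet = (1 << 2) | (1 << 4);
        let mut cache: HashMap<(StateSet, u8), StateSet> = HashMap::new();

        let mut current: StateSet = 1 << 0; // start in NFA state 0
        for &byte in b"ac" {
            current = step(&nfa, &mut cache, current, byte);
        }
        println!("matched: {}", (current & accepting) != 0); // matched: true
    }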

Beyang: So it's kind of a just in time compilation approach where you're compiling the regular expression... Or really not the regular expression, it's like the NFA into the DFA, like expanding the state space of the NFA into DFA states.

Andrew: Yeah, you're actually doing subset construction, which is the standard way that you translate an NFA into a DFA, you're actually doing that subset construction on the fly. You're just doing it one byte at a time. And I would definitely not use the word just in time or JIT to describe it, because that is actually a whole different implementation strategy in regex engines. Google's V8 JavaScript engine has a JIT to it. PCRE2 has a JIT. And both of those are impressively fast engineering. It's just amazing engineering efforts there.

Beyang: Got it. Got it. And those JITs are actually like compiling the actual regular expression into-

Andrew: Machine code.

Beyang: ... some sort of machine code, okay.

Andrew: Yep. Yep. At runtime they're producing assembly code that matches the regexes. And both of those engines are backtracking.

Beyang: Got it. Fascinating. I hope we didn't lose too many people in that discussion. That was super interesting.

Andrew: Yeah. It was not quite as brief as I wanted, looking at the clock. But it's hard to succinctly answer that question.

Beyang: Yeah, totally. I think you did a fantastic job. And now I have some additional blog posts that I need to go read.

Andrew: Yeah. Yeah, sorry. Russ Cox's articles on regexes. Just Google "Russ Cox regex" and they'll be the first thing that's there. If you haven't read those, or for anyone that's listening that hasn't read those, they are just pure excellence. They're just treats to read. And I really just cannot recommend them more.

Beyang: Cool. We'll link to those in the show notes. So I wanted to get to some of your other kind of contributions in the Rust open source community. You mentioned that Rust, I think it's safe to say that Rust is probably your go to language at this point and you've contributed a lot of crates to the open source community. The regex crate, being one of them. Can you talk about why you got into Rust, what you like about it? And how you see that community evolving over time?

Andrew: Yeah, yeah, I would say it's fair to say that, at least in the coding that I do in my free time, my open source work, Rust is definitely the language that I use the most these days. And for work, I use Go. But as for Rust, what attracted me to it, I think at the time, I was big into Haskell, I was big into Go. And I had been coming off of a few years spent in grad school doing... Actually my concentration was in computational biology. But I had spent a lot of time studying programming languages, and being a teaching assistant for the programming languages course there under Norman Ramsey. And I just got a love of programming languages and type systems from my time there.

Andrew: And when Rust kind of came to the scene, I had kind of been watching it for a while. But for whatever reason, for some reason, somewhere around like late 2013, early 2014, it hit a point where I felt like, "Hey, I could learn this. There are docs there. I could start writing code in this." So I wrote a QuickCheck implementation, which I had known from Haskell days. And the reason why I wanted to write it was because I had seen all these similarities, and these are intentional similarities. But I'd seen all the similarities between the trait system in Rust and Haskell's type classes. And if you've ever used QuickCheck in Haskell, it makes heavy use of type classes and return value polymorphism, and all this cool kind of stuff.

Andrew: And I wanted to say, "Well, are Rust traits, are they expressive enough to do this kind of QuickCheck thing?" And I was like, "Okay." So I did it, and it worked. And I was like, "Hey, this is cool." And so I transferred... I had all of this transfer of knowledge to say, "Hey, I knew all this stuff from Haskell and it worked over here in Rust." And the thing with Rust is that it took my like of systemsy stuff, which I had gotten from C and Go, and my like, of pl programming language stuff that I got from Haskell, nice, tight system features. And it just kind of crashed those together. And I really, really loved and still love the result.

Beyang: C and Go versus Haskell. Very different, I think, language paradigms. And you're saying that Rust is a synthesis of those two? Can you talk about... Because everyone has like a stereotype of like your classic C developer versus your classic Haskell developer. They're very different. So can you talk about why you seem to like both, and what you think the two communities can learn from one another?

Andrew: Yeah. Oh, my goodness. So I would say that, like, the whole C, Go meets Haskell thing was kind of my impression at the time. I would say a better or more accurate comparison would be, let's say, C++ meets Haskell meets Standard ML, maybe, is what Rust is, I don't know. But in any case, the things that I liked about C and Go, or things that I do like about C and Go, are that I really loved working at a lower level of abstraction. I liked performance, I like making things go fast. And I found it difficult to do that with Haskell. One of our research projects actually was implementing the Viterbi algorithm. Goodness, it's been a long time. But it's a way of dealing with hidden Markov models, and basically how to trace a path through a hidden Markov model to come up with a probability. And the Viterbi algorithm deals with...

Andrew: In order to speed it up, you can do a memoization technique. And I remember one of our research experience reports was: can you write a Viterbi algorithm implementation in Haskell that is as fast as C's? And we actually couldn't. We had an implementation of it in C, Standard ML, and Haskell. And we tried to apply strictness annotations and all sorts of stuff like that in Haskell. And my experience there is really not unique as far as I can tell. If you scour the internet for blog posts about optimizing Haskell code, you kind of see that same problem over and over again.

Beyang: Yeah.

Andrew: And that was kind of my path there, as far as how those two things met and why I liked it. In terms of what the communities could learn from each other, I don't know where to begin with that one. And I don't say that in the sense that I feel like they have a lot to learn from each other. Like, I don't mean it that way. I think I just mean it as, it's such a big question. I don't know, really, is the answer for me on that one.

Beyang: That's fair. That's fair. You also wrote this blog post, a lot of our conversation thus far has been like, deeply, deeply technical. But I also wanted to get to some of your thoughts on kind of community building and evolution, seeing as you've been a part of the Rust community for a while now. And as that language grows, and gains broader adoption, there's more people entering into the community. And you wrote a blog post about how the community should or would or could evolve, as it grows. Care to share your thoughts on that?

Andrew: Yeah, so, I guess just as a quick background, I've been a moderator for the Rust community since the moderation team's inception, which was around the 1.0 release or a little before, somewhere around there. And basically, our charge is to uphold the code of conduct, which is basically a way of keeping a healthy community. And I don't want to just try to distill the code of conduct down to a single thing, but a lot of it has to do with civility. And making sure that even if people don't like each other, or they don't like how other people... what they do in their free time, to some reasonable extent everyone has to treat each other with respect.

Andrew: And it's not necessarily just about treating others with respect, but it's actually about moving forward. And being able to drive discussions forward with a mutual desire to improve the language, improve the ecosystem, improve the libraries, or whatever it is that's trying to be done. Which is a long way of saying be constructive. And doing that is hard, especially as the community grows. As far as my blog post goes, I called it "My FOSS Story." I think my goal with that post was probably less about my specific moderation experience with Rust, and more about trying to improve the empathy that we have for each other. Or maybe help others do perspective taking. Because a lot of us play multiple roles. Sometimes I'm the end user of software, and sometimes I'm the maintainer of software. And those aren't the only two roles in open source. There's documentation writing, event organizing, community organizing, all sorts of stuff, UI design. And being able to make sure you're taking the perspective of each person that you're talking to is really my goal with that blog post. And I hope I succeeded with that, at least in some part.

Beyang: The hard part of software, I think, often comes at the intersection of deeply technical discussions and the human and social side of things.

Andrew: Yep.

Beyang: I think both are important to get right. Especially since there's a lot of smart people out there that you want to source ideas from...

Andrew: Yeah, and especially when you have a vibrant and diverse community like Rust where... And in particular an online community. You don't have the same sorts of pressures or social expectations that you would in, let's say, an office environment, or even a remote office environment with a bunch of coworkers. Like, that's a professional environment, and there's a whole bunch of different pressures that influence your behavior there. But in the online world, where literally anyone can comment, there just aren't those sorts of procedures of decorum and good faith assumptions and working together and that sort of thing.

Andrew: Developing rapport with people that you work with day in and day out. I mean, obviously, that happens to an extent because there are contributors that come back and do things over and over again, and people who are in the teams interact with each other. But by and large, there's always new people coming into the community and it's very difficult to moderate those discussions. Because you don't have those base assumptions that you normally do in a professional environment.

Beyang: Yeah. Makes total sense. All right, this is going to be a complete non-sequitur. But I wanted to get to this because it's such an interesting story, I guess. So in addition to open source software and programming, another one of your interests is football or American football.

Andrew: Yeah.

Beyang: And when we're talking earlier, you told me this anecdote involving like a software engineer that you knew and Bill Belichick. And I was wondering if you could relate that story to our audience, because I thought it was just such a funny tale.

Andrew: Yeah. I guess there's not too much to it. But I actually have this story secondhand, or thirdhand, through a friend at Tufts University. And basically, what the story is, is that Bill Belichick... This was some years ago, several years ago, he was looking for someone to help him... I don't know what he was looking for help with. But something to do with tech, something to do with computers or programming, whether it was analyzing injury statistics, or I don't know, who knows.

Beyang: Data processing, or...

Andrew: Some sort of data processing, looking at like tendencies of other teams. Just to get the tiniest, slightest bit of edge over another team. Whether it was data science or mundane programming tasks, I have no idea. But from what I heard, what he wanted was someone who understood tech and understood football. So he basically looked at the intersection of those two fields in the New England area. And from the story that I heard, he found like a few people, three people. And one of the people he found was someone who played football at Tufts that also was majoring in computer science, and was someone that I happened to TA. And that's pretty much the story. I don't know what ended up happening, or if he's still there, or if he did anything cool. But whatever he did, I'm sure I'm never going to find out, because they probably keep that stuff very, very secretive.

Beyang: Belichick recruiting both the best athletes and the best programming minds out of the nation's leading universities. I feel like it just illustrates like, for anyone who claims that software isn't eating the world, even coaches in professional sports are hiring computer scientists to give them whatever edge they can grab over their competition.

Andrew: Absolutely, yeah. I myself have had a couple of NBA teams reach out to me to recruit me. And so... It's there.

Beyang: I wonder if us programmers will ever reach kind of the celebrity status or notoriety as some of the athletes do. Maybe in the future, maybe it's a pipe dream, but maybe in the future ESPN commentators will talk about like, "Oh the LA Lakers. They have such a good algorithmic basketball team."

Andrew: Yeah, I suspect not. But I think maybe there exists a reality where that happens. And maybe that reality is when a bigger proportion of the population writes code. I don't know if that will exist. I'm not even saying that I want it to exist, because I know all those things are controversial. But yeah, maybe that reality does exist in our future. And if there are a lot more people writing code, then all of a sudden that becomes something that a lot more people care about.

Beyang: Yeah.

Andrew: I don't know. But other than that, I doubt it.

Beyang: All right. On that note, my guest today has been Andrew Gallant. Andrew, thanks so much for coming on the show.

Andrew: It's been my pleasure. Thanks so much for having me.
