Building the foundation of code search, with Han-Wen Nienhuys, creator of open-source code search engine Zoekt

Han-Wen Nienhuys, Beyang Liu

How do Google developers create and popularize internal tools? In this episode, Han-Wen Nienhuys, creator of the open-source code search engine Zoekt, joins Beyang Liu, co-founder and CTO of Sourcegraph, to discuss the agonizing experience with Perforce that drove Han-Wen to build his first dev tool, explain the value of coding on trains and planes, and share the story of how building code search nearly inspired a street named after him in Sweden. Along the way, Han-Wen offers an inside look at the history behind some of Google’s most famous dev tools, such as Blaze, Code Search, and Piper.


Show Notes

Lilypond: http://lilypond.org/

Why Google Stores Billions of Lines of Code in a Single Repository: https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

FUSE bindings for Go: https://github.com/hanwen/go-fuse

Bazel: https://bazel.build/

Gerrit: https://gerrit.googlesource.com/gerrit/

Tar: https://www.gnu.org/software/tar/

Zoekt: https://github.com/google/zoekt

Google Code Search: https://github.com/google/codesearch

GerritForge: https://www.gerritforge.com/

Transcript

Beyang Liu:

All right. Welcome back everyone to another edition of the Sourcegraph podcast. Today I'm here with Han-Wen Nienhuys, the creator of the open source code search engine, Zoekt, which is based on Google's original internal Code Search engine. He has worked on developer productivity and dev tools at Google for, I think, nearly 15 years now, where he has worked on or adjacent to many of Google's famed internal dev tools: tools like Piper, its large scale version control system, Bazel or Blaze, its large scale distributed build system, and the open source code review tool, Gerrit, which is based on Mondrian. I think that was Google's first internal code review tool.

Beyang Liu:

So, Han-Wen?

Han-Wen Nienhuys:

Gerrit is actually the third internal code review tool.

Beyang Liu:

The third, okay.

Han-Wen Nienhuys:

Yeah.

Beyang Liu:

Got it. So Google was way ahead of the curve on code review. But yeah, Han-Wen, thanks so much for being with us today.

Han-Wen Nienhuys:

Yeah, no worries. Excited to be here.

Beyang Liu:

Awesome. So, before we get into all the cool developer tools that you've built over the course of your career, I always like to kick things off by asking people how they got into programming and computers in the first place. So, what was your beginner story?

Han-Wen Nienhuys:

Yeah, it actually starts with my parents. So my parents met each other when they were doing a PhD in mathematics, so they're both mathematicians. And so when I was very small we had our first computer, the Apple II, and my mom started... So, my dad was teaching at university and my mom first taught at a local high school but then she also started teaching at university.

As part of that, she had to do courses on computer science and so when I was small, we had a computer around and we had these books lying around. When I was in high school, my mom was like, "Oh, you should learn Unix because it's really cool. You can learn how to write a compiler and that's really cool."

Beyang Liu:

Awesome.

Han-Wen Nienhuys:

My dad had bought this book on the C programming language, which he never got around to studying. I opened it and I started working from it and I thought it was really, really interesting. I actually decided to study mathematics because my dad would correct exams and he would always complain about the computer science students who weren't all that good at mathematics.

Beyang Liu:

Not so rigorous?

Han-Wen Nienhuys:

Yeah. I had this idea that computer science is for the not-so-talented students, so I had better do mathematics. And so I studied mathematics, but during my studies I discovered, first of all, that computers are fun, but also that to be a mathematician, you really need to have a lot of intellectual discipline.

If you write a computer program with a bug then the computer program will crash, and in mathematics, it's all on paper, so if you forget to carry a minus sign then the paper won’t crash and the paper just continues. So, you have to be really rigorous in checking whatever you're doing.

It turns out that I have a lazy streak and so it seemed better for me to not make this into a career. And so I decided to do a PhD in computer science, so I moved cities and at the same time I was... So I play music, and I had a friend, and together we built this software for typesetting music. It's like LaTeX but for music.

Beyang Liu:

Cool. Yeah.

Han-Wen Nienhuys:

The software still exists today–it's called LilyPond.

Beyang Liu:

Awesome.

Han-Wen Nienhuys:

It's probably, of all the things I did in my life, it's the thing that I'm most proud of because it's been around for 20 years and it's widely recognized to be the standard of fine typesetting.

Beyang Liu:

That's awesome. I used a typesetting software a while back on Linux–I forget what it was called, maybe it was LilyPond, but I remember it being super fun and easy to... I'm not a serious musician at all but even just messing around.

Han-Wen Nienhuys:

Well, you would probably know because if it had the graphical user interface it was not LilyPond because LilyPond is a command line compiler.

Beyang Liu:

Interesting.

Han-Wen Nienhuys:

You type code and then you run this program and then you get a PDF file.

Beyang Liu:

That's awesome. Now I'm even more intrigued. Yeah, because I feel like the graphical... a lot of the time GUIs can get in the way, especially when you're trying to write stuff.

Han-Wen Nienhuys:

Yeah. So me and my friend were quite... We both had long hair, we were trying to not wear shoes and we were very into free software. And so we were also against graphical user interfaces. I became more mellow over the years. But yeah, so our philosophy was that it's better to have a text-based system.

Beyang Liu:

Awesome. Awesome. So you worked on LilyPond, this musical typesetting piece of software and then how did you go from working on that in open source and in the free software world to joining Google?

Han-Wen Nienhuys:

Yeah, so I had my first job and I wasn't very happy with it so I quit. I tried making money selling my knowledge of music typesetting to users and I discovered that selling things to musicians is not a very good business. I think I made like 6000 euros in one and a half years.

And so I was one year into this and I got a call from a recruiter from Google. He said "Do you want to interview for us?" and so I did the phone screen. They rejected me on my second phone screen but I was talking to these guys on the telephone and they gave me the sense that Google was really the place to be to do exciting new things.

So, this was 2006, and Google was still the underdog. Everything had bright colors and colored balls, and everything was original and rebellious. And so I drank the Kool Aid and I was really excited to join. I really wanted to join Google but they rejected me.

And so there was this user from Brazil that said, "We have a free software conference in Brazil, in the south of Brazil, and I can get you in if you want. I can talk to the organizers and we can have you come over and give a talk." I said, "Sure. Fine."

Beyang Liu:

That's awesome.

Han-Wen Nienhuys:

So, typically... So this is typical for Brazilian organizations. I heard nothing for a long time, and so the date that I was supposed to fly was coming in. It was supposed to be on a Sunday and the Friday before that Sunday I hadn't heard anything.

I was getting desperate and so the Monday after the Sunday that I was supposed to fly, I was upset and disappointed and so I went drinking. I was drinking and I came back and I decided to check my email and I'm very glad I did because there was this email saying, "Oh, your ticket is ready. You should pick it up at the counter at the airport at five in the morning." This was like 11 in the evening and so I had to grab my bag and call my parents to tell them that I was leaving.

And so I went to visit Brazil, I went to visit this conference and at the conference the Google recruiters for the Brazil office were there and they had this deal: if you give us your resume we will give you a T-shirt.

At the time that sounded like an excellent deal so I gave him my resume and I got the T-shirt. The recruiter, who's still there in Brazil, by the way, looked at my resume and thought, “These guys in the previous screens dropped the ball. We should interview this guy again.”

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

And so I did a second round of interviews, they went very well. I got hired into the Brazil office. Getting there was also a difficult process because you had to get a work permit, which was very bureaucratic. It probably doesn't have anything to do with developer tools so I'll skip that bit. But yeah, so that's how I got into Brazil.

Beyang Liu:

That's awesome. I guess there's an important lesson there which is, as someone who's gotten rejected a lot, especially on the first try at various companies, I think there's a lot of value in persistence and trying again.

Han-Wen Nienhuys:

Yeah. At the same time, today I am a manager at Google and so I see these candidates from the other side and it's often really hard because you hire someone to do in-depth work on a codebase that's 10 years old and we do projects that take, I don't know, weeks, months, quarters, but we interview people and it takes 45 minutes. And so the question you ask someone has to fit in 45 minutes and it's very hard to get a good signal on whether someone is a good fit for your team. I haven't solved this problem even with my experience of being rejected at the first interview.

Beyang Liu:

I feel like it's one of those perennially difficult problems in the industry–figuring out a good interview process that's also time-efficient for both parties involved. Cool. So you joined Google, and what did you work on initially there?

Han-Wen Nienhuys:

So, the Brazil office was a very small office and at the time there was this product called Orkut.

Beyang Liu:

Social network, right? Yeah.

Han-Wen Nienhuys:

You may remember it. It was the social network that was built by this Google engineer called Orkut Büyükkökten.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

And he decided to name the project after himself. It's not my first choice for a project name, but he started and struck gold. This thing became very popular, and for some reason or another, Brazilians really like social media products, and so they started flooding the product. It became a quintessentially Brazilian and Indian product for social media.

Beyang Liu:

Interesting.

Han-Wen Nienhuys:

And so part of the engineering for the product was also transferred to Brazil and there was a desire to make money, and so there was a group that was trying to figure out a way to serve ads on Orkut.

Beyang Liu:

Got it.

Han-Wen Nienhuys:

And so I worked on that, and we also did ads on MySpace. Google had a deal with MySpace to put ads on their pages, and so that was the same work we did together with this group.

Beyang Liu:

That's neat.

Han-Wen Nienhuys:

And so maybe this is a good segue into developer tools, because this ad-serving system–basically every webpage that anyone viewed on the internet would have an ad and most of those ads would come from Google. And so basically the system that we were working on, and when I say “we,” I don't mean just the Brazilian office. There was like a larger group with, probably maybe a thousand people. They were all working on this gigantic ad serving infrastructure, and so it had to be really efficient. It was really complicated.

And so it was all written in C++ and when products grow very quickly, you run to keep up with the traffic and with the new features. It was a thing that had very rapid growth, and there was this serving system that had to talk to many different backends. That was the thing that if you compile it, it would take... I don't remember the precise numbers but I remember that at some point there was this unit test and compiling the unit test took one minute.

And then if you ran it, because it's a unit test it's dynamically linked so you think “oh, dynamic linking is great” but it means that you do part of the linking at runtime and so if you want to run the test then the dynamic linker tries to put these multi-gigabyte blobs together to create a binary and so it took a long time.

Han-Wen Nienhuys:

And so working on this system was actually quite frustrating and had a lot of pauses, a lot of places where you would press a button and then it would do something and you'd basically have to do nothing for five minutes. I found it very frustrating. And something else is that the Brazil office had very poor network connectivity and we were using Perforce. If you've never used Perforce, if you want to edit the file in Perforce... the idea of Perforce is...

Beyang Liu:

Perforce is a version control system, right, that predates Git?

Han-Wen Nienhuys:

Right. Yeah, yeah.

Yeah, and they used to make a promotion saying Perforce is really fast because it does all the operations server side. And that doesn't seem like a very scalable approach–particularly, it means that if you want to edit a file you can't just open the file and start typing, you have to tell the central server that I'm going to edit this file now and so you have to update your metadata.

And so you'd have to do p4 edit–that's the command. And then we were in Brazil and the server was often overloaded and that could take one second, it could take 10 seconds and I found it extremely agonizing to use. And I had just, for LilyPond, converted the LilyPond source code from CVS to Git.

So I knew Git, I knew the data model, I understood how to work with it. I thought to myself, “Well, I can do something better here.” I am going to build a wrapper around our Perforce installation, and then I can code against this local import into Git of the source code, and then once I'm done, I can press a button and it will make a Perforce change out of this local edit that I made. Then, it can do this in the background so I can work on more interesting things.

I was a big fan of Python at the time, and it was another reason to do this because if you work in Python, you have this direct feedback loop. You change something in the source code, press a button and there's no compilation–you immediately get to see the result. And so for me it was also a way to escape from this morass of having to work with this gigantic C++ codebase.

Beyang Liu:

Got it.

Han-Wen Nienhuys:

And so I built it for myself at first, and then I showed it to a couple of colleagues and they said, "That's nice." And before I knew it there were like hundreds of users, and they would have all kinds of ideas on how to make it better and so I would improve it.

So, Perforce has this central server, which is not a great scalability idea, but if you do all the logic in the server you actually have very firm control of the data and so if you move this data to the client, then the server can be very lightweight or not exist at all. Data storage is in the client, and for version control it's really important that data is not corrupted.

I discovered many ways in which to corrupt data and all the ways in which this goes wrong and so it was also an interesting lesson in how to approach designing robust software.

Beyang Liu:

Interesting. So this was essentially like a client-side Git proxy to a Perforce server. As an end user, you'd only use Git operations–it looked just like a Git repository.

Han-Wen Nienhuys:

Correct. Yes.

Beyang Liu:

That's cool.

Han-Wen Nienhuys:

Yeah, and so it became more popular. I think at the height of its popularity I had 5000 users across the company and so this was... I remember measuring the market share of my tool as the score of how well I was doing. My objective was for it to be 100% but I think I never got past 25%. But at some point, 25% of engineering at Google used this Git wrapper around Perforce.

Beyang Liu:

That's awesome. I would think that a lot of people would be inclined to use it because Perforce is this... I think this was back in what 20...

Han-Wen Nienhuys:

2009, 2010, 2011.

Beyang Liu:

Okay. It was like, Git was not as widely adopted as it is now but I feel like even then it had a certain cachet about it. People would have preferred to use Git over Perforce.

Han-Wen Nienhuys:

I remember being in my Noogler introduction and it was the time that Linus Torvalds came to give a talk at Google about his new version control system, and characteristically for Linus, he was very adamant about the qualities of his version control system versus other version control systems, and he upset quite a lot... there were people that really liked Perforce, or maybe were the Perforce admins, and they were quite upset having their system be described as crap.

Beyang Liu:

What ended up happening with this Git proxy? I mean, 25% adoption is huge.

Han-Wen Nienhuys:

I basically didn't want to use Perforce and so I made a tool that kept me away from Perforce. But if you build that tool you actually have to programmatically talk to Perforce and understand precisely what operations should you issue, what do they mean, how can you make the operations faster. While I was trying to not have to use Perforce I actually... because I had written this tool I had learned in quite a lot of detail how Perforce actually worked and so that made me the perfect hire for the Piper team.

Beyang Liu:

Got it. Describe Piper for those who might not be familiar with it.

Han-Wen Nienhuys:

Yeah. So if you want to know the details, there's a paper that Google published about it but in short, it's a version control system that, to the user, looks quite like Perforce, but if you try to look under the hood it's actually completely reimplemented from scratch, and it's reimplemented from scratch using Google infrastructure.

Han-Wen Nienhuys:

So it runs using a multi-data-center deployment. It has distributed storage using Paxos that makes sure that if one data center is involved in a nuclear war or is on fire then work just continues because there's five or nine or I don't know how many there are, but there's a lot of data centers to make sure that if one data center goes out then we can still continue to work.
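The "five or nine" remark lines up with how majority-based replication is usually sized. As a rough illustration only (the transcript gives no details of Piper's actual configuration, and the function names here are made up for the example), this Go sketch computes how many replicas a majority quorum can lose and still make progress:

```go
package main

import "fmt"

// Quorum returns the smallest majority of n replicas.
// (Hypothetical helper; not taken from any Google system.)
func Quorum(n int) int { return n/2 + 1 }

// Tolerated returns how many replicas can be lost while a majority survives.
func Tolerated(n int) int { return n - Quorum(n) }

func main() {
	for _, n := range []int{5, 9} {
		fmt.Printf("%d replicas: quorum %d, tolerates %d failures\n",
			n, Quorum(n), Tolerated(n))
	}
}
```

With five replicas a majority is three, so two data centers can be unavailable at once; with nine, four can.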

Beyang Liu:

Got it. So, essentially, it exposes a Perforce interface to the user. So, from the user's point of view, from Google's point of view, people didn't really have to switch from their Perforce habits but underneath the hood, a lot of it is basically completely different in order to scale to Google's massive monorepo.

Han-Wen Nienhuys:

Right. Part of it is people's workflow habits but I think very important is also automation. If you're migrating people, and it's a lesson that I keep learning over and over again, then migration will be hard. You can teach people to do new things but if there's automation talking to these systems then you can only make a smooth switch if you have the old system and the new system responds exactly the same to an API–otherwise, it's almost impossible to make a migration.

Beyang Liu:

Makes sense. I think that's an important lesson that every, I won't say every, but probably a lot of engineers discover over time–migrations are super difficult. So the Piper team, essentially hired, it sounds like, their closest competitor, and that was the end of it.

Han-Wen Nienhuys:

You could see it that way. So the Piper team was in Munich and I relocated to Munich. I'm still in Munich–if I look outside I see the beautiful blue German sky.

Beyang Liu:

Awesome.

Han-Wen Nienhuys:

And so I worked on the Piper team for a while. So the Git wrapper was a 20% project–a classic 20% project. What is quite special is for you to create a 20% project and then leave it with another team, but I managed to do that. At some point I decided to work on another developer tool, so I worked on open sourcing our build system. And when I switched teams, I left this Git wrapper with the Piper team.

Beyang Liu:

Got it.

Han-Wen Nienhuys:

And they deleted it. So they built a replacement based on Mercurial–similar to Facebook, I think.

Beyang Liu:

Yeah, yeah.

Han-Wen Nienhuys:

And so that is now the de facto way to do distributed workflows at Google.

Beyang Liu:

So most Google engineers today see a Mercurial-like interface?

Han-Wen Nienhuys:

I'm not keeping track of how many people there are but if you want to do... So one of the things that is hard with Perforce is that, let's say you want to implement a feature but the feature needs a bug fix.

Beyang Liu:

Mm-hmm.

Han-Wen Nienhuys:

And so what you would do in Git is that first you make a commit that fixes the bug and then you do a commit on top of that that does the feature, and then maybe there's review and then, depending on what workflow you do, you can actually evolve both the bug fix and the feature independently.

With Perforce that's really difficult to do. You can kind of do it if you make sure that the fix, the bug fix, and the feature don't touch the same files but if it touches the same files then you're basically screwed.

Han-Wen Nienhuys:

And so that is the workflow that my Git wrapper made possible and that's also the workflow that this next generation of that, which is based on Mercurial, made possible.

Beyang Liu:

Got it. I'm just curious because I noticed in open source you've done some work on FUSEs, Filesystem in USErspace. Is that how you initially got into FUSEs, this work on Piper or did that come later?

Han-Wen Nienhuys:

No, it actually came earlier. Part of the reason why this Git wrapper was working was that Google had a file system, a virtual file system that gave you all the files in the repository available locally, and of course they're not really available locally: only once you actually go into the directory and try to read the file would the file system go and fetch the file.

It was a read-only file system but what is even more nice than having a read-only file system is having a read/write-able view of the entire tree of the, I don't know, one hundred million files, and being able to version control that with Git. And so I built this cute little hack–we could use my wrapper to explain this–I basically made a union file system of this read-only view and a read/write view where you could run the Git version control system.
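The union idea described here can be sketched in a few lines of Go. This is an in-memory toy, not the FUSE implementation Han-Wen built: the `Layer` and `Union` types and the file names are assumptions for illustration, and deletions (which a real union file system has to handle) are left out.

```go
package main

import "fmt"

// Layer is a hypothetical in-memory stand-in for a file tree: path -> contents.
type Layer map[string]string

// Union overlays a writable layer on top of a read-only one.
type Union struct {
	ReadOnly Layer // stands in for the lazily fetched view of the whole repository
	Writable Layer // local edits, the part a version control tool would track
}

// Read prefers the writable layer and falls back to the read-only one.
func (u *Union) Read(path string) (string, bool) {
	if data, ok := u.Writable[path]; ok {
		return data, true
	}
	data, ok := u.ReadOnly[path]
	return data, ok
}

// Write never touches the read-only layer; edits land in the writable one.
func (u *Union) Write(path, data string) {
	u.Writable[path] = data
}

func main() {
	u := &Union{
		ReadOnly: Layer{"ads/server.cc": "v1", "ads/BUILD": "rules"},
		Writable: Layer{},
	}
	u.Write("ads/server.cc", "v2 (local edit)")
	for _, p := range []string{"ads/server.cc", "ads/BUILD"} {
		data, _ := u.Read(p)
		fmt.Printf("%s: %s\n", p, data)
	}
}
```

Only the files that were actually edited end up in the writable layer, which is what makes it practical to put a very large tree under local version control.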

And so that was what I wanted to build and this was just around the time that Go came out, and so I thought that was a good opportunity to learn the Go programming language, and a FUSE file system sounded cool and interesting. I didn't really know what they were and so I thought maybe I should try to write this FUSE filesystem thing in Go.

And so I ended up learning how file systems work and how to write Go all at the same time. I used my FUSE library as a vehicle, which is probably not a good idea, because this FUSE library is still around.

Yeah, I’ll just say this, many of the early mistakes that I made 10 years ago are still there in the source code and they're hard to repair because, as I alluded to earlier, backwards compatibility is a great, great, fantastic thing to have but if you want to fix fundamental problems then often it means you have to throw out backward compatibility.

Yeah. So I learned Go, I learned how file systems work. I made a pretty neat FUSE library for Go that you can check out if you are interested in that.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

And it was all part of my greater plot–I said 25% of people are using my tool at Google–to push up this number from 25 to 30, 35. I'm sad to say that it was much less, much less successful than my Git wrapper.

Beyang Liu:

Yeah. That's awesome. I love the entrepreneurial spirit that you had with dev tools inside Google. Tell us about joining the Bazel team, or was it the Blaze team?

Han-Wen Nienhuys:

The Bazel team didn't exist as such because... But there was this idea that we could maybe open source the build system. I was involved with pitching that idea and then I said, "Well, fine. If I'm saying we should do this, I should put my money where my mouth is and I should join the team," and...

Beyang Liu:

And this was, just for those who may not know what Bazel is or Blaze is. Blaze was Google's large distributed build system, internally, right, or?

Han-Wen Nienhuys:

Almost.

Beyang Liu:

Okay.

Han-Wen Nienhuys:

It's a build system, but it's actually not distributed. And also in Google, it's written in Java and it's actually a single process and there's the actual compilation. So invoking the compiler is something that we do on a remote cluster so that part is distributed, but Blaze is the part that figures out which files are out of date and what commands to run to compile new versions of that. So it builds the whole dependency graph of the project in memory, and then it checks which files were last modified and then issues those commands.
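To make the "figure out what is out of date" step concrete, here is a minimal Go sketch of a dependency graph that, given a set of changed source files, returns the targets to rebuild with dependencies first. It is not how Blaze or Bazel is implemented: the `Target` type, the `Stale` function, and the file names are hypothetical, the graph is assumed to be acyclic, and how a file is detected as changed is left out.

```go
package main

import "fmt"

// Target is a hypothetical build target with source files and dependencies.
type Target struct {
	Name string
	Srcs []string
	Deps []string
}

// Stale returns the targets that must be rebuilt, dependencies first,
// given the set of changed source files. The graph is assumed acyclic.
func Stale(targets []Target, changed map[string]bool) []string {
	byName := make(map[string]Target, len(targets))
	for _, t := range targets {
		byName[t.Name] = t
	}
	visited := map[string]bool{}
	dirty := map[string]bool{}
	var order []string

	var visit func(name string) bool
	visit = func(name string) bool {
		if visited[name] {
			return dirty[name]
		}
		visited[name] = true
		d := false
		for _, s := range byName[name].Srcs {
			if changed[s] {
				d = true
			}
		}
		// Visit every dependency (no short-circuit) so that all stale
		// dependencies are appended before the target itself.
		for _, dep := range byName[name].Deps {
			if visit(dep) {
				d = true
			}
		}
		dirty[name] = d
		if d {
			order = append(order, name)
		}
		return d
	}
	for _, t := range targets {
		visit(t.Name)
	}
	return order
}

func main() {
	targets := []Target{
		{Name: "lib", Srcs: []string{"lib.cc"}},
		{Name: "util", Srcs: []string{"util.cc"}},
		{Name: "server", Srcs: []string{"server.cc"}, Deps: []string{"lib", "util"}},
		{Name: "test", Srcs: []string{"server_test.cc"}, Deps: []string{"server"}},
	}
	fmt.Println(Stale(targets, map[string]bool{"lib.cc": true})) // [lib server test]
}
```

Here `util` is skipped because neither its sources nor its dependencies changed, which is the saving Beyang asks about next.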

Beyang Liu:

And that's crucial for performance reasons, right, so you don't end up rebuilding the entire source tree, most of which is unchanged?

Han-Wen Nienhuys:

Yes. Well, it's both... The challenge is that you want it to be correct, and it's easy to be correct if you just delete everything and build from scratch again. But people also want it to be fast and so you have to make this–often you have to make this trade off between fast and correct. And so the slogan of Blaze or Bazel today is that it's both correct and fast.

Beyang Liu:

Awesome.

Han-Wen Nienhuys:

That's also a valuable lesson because up to then I was using the build system as a consumer, and as a consumer inside of Google it seems very polished. Everything is using the same infrastructure and all of it magically works, but then I actually joined the team and I discovered... So I had to learn Java and it's really frustrating if you've had, I don't know, 10, 15, 20 years of programming experience and you're in this new language, and you have a brain the size of a planet but you don't know how to open a file–that's really frustrating.

The other part is that as a consumer it looked very slick or very polished, but like all software it was built... you always try to be scrappy, you don't want to do more work than necessary and inevitably you end up with, especially if the project is older you end up with source code that is under-documented or there's tests that are commented out or there's basically technical debt.

And so I joined the team and I learned that the thing that looked shiny from the outside, wasn't so shiny on the inside and it was my job to clean it up and it was a lot of very, very hard work.

Beyang Liu:

And I imagine it was, I mean, there's the tech debt part of it that you want to get the house in a bit of good order before open sourcing. I also suspect that a lot of it, like how much it was tied to the unique way that Google does development, was because you're taking this Google internal tool and essentially trying to bring it into open source. Was that a big problem you had to solve too, how to generalize certain things?

Han-Wen Nienhuys:

Yeah. So I want to add the disclaimer that I was part of the people that proposed this but I wasn't actually the technical expert and so when I went in joining the team I was the newbie and people that actually were thinking about these problems, I wasn't one of them but I can try to answer your question anyway.

So there is a part of the infrastructure, like I talked about this remote build infrastructure, and that was Google-specific, and wasn't going to be part of the first version and so that had to be cut out. That was a lot of very hard work by one of my colleagues, but in a sense it's straightforward because you know where you want to go.

Something that was much less straightforward is that inside Google we check in all the dependencies, so we check in the compiler, we check in the JDK. If you run tar, like the archiving utility, we actually also check that in. And so we do that to make sure that everything is reproducible.

But when you download Bazel as someone external to Google, you don't first want to download a fresh copy of the compiler and a fresh copy of the JDK, because you’d start to use a program and you'd be downloading gigabytes of data. And so, something had to be built to import external dependencies and that was an example of a problem that we didn't know in advance what, exactly, it should look like. And I think it went through a couple of iterations of refinement after I left the team as well.

Beyang Liu:

Got it, got it. So after the Bazel team, did you go on to join the Gerrit team, or?

Han-Wen Nienhuys:

Yeah.

Beyang Liu:

And when did you build Zoekt? Was that while you were on the Gerrit team, or before?

Han-Wen Nienhuys:

Yeah, so untangling the build system from its Google internals was a lot of work, and at some point I burned out on it and I decided, okay, I should do something different. So because I had been working on this Git tool, this Git wrapper, I knew the guy inside Google that was the Git guy.

So his name is Shawn Pearce. He probably wrote like 1/3 of the upstream Git project and then made a copy and made a version in Java of the same–JGit, basically. It's the Java version of Git. Then he was hired into Google to provide the Android team with version control and code review.

And so he had built this open source project called Gerrit, for doing code review on top of his library in Java that is doing Git and ported that to run on the Google production stack, which is complicated because in Google everything is distributed. None of your computers have a file system because the files are all remote.

And so he pulled that off and he had been doing that for, I don't know, six or seven years, and all the time that I was doing that he had been doing that one thing. And that really inspired me because he really seemed to very clearly know where he was going and so I decided to work for him.

And so his vision was to provide tooling, open source tooling, basically for everyone but starting for the people that work on Android or that consume Android. And one of the things he was hearing from people that use Gerrit was that it didn't have code search. And so one of the things he asked me to do is, well, why don't you see if you can build some code search thing. So I work in Munich and the code search...

Beyang Liu:

I guess, real quick, just to set the stage for folks who are not as familiar with Google's internal dev environment. Code Search at Google is a longtime tool, right, it covers the vast majority of Google's code base but Android is a separate code base, right, from the rest of Google?

Han-Wen Nienhuys:

We talked about Piper before, we call that Google 3 because there used to be 2. There was a Google 2 and then there was this big reorganization that wasn't backward compatible and it became Google 3. And that is a system that is so scalable that basically all the code that runs on the production platform, on the servers, is in Piper.

But there's a lot of other projects that need to interface with the external world. Android has lots of partners and so Android partners have to download the Android source code and of course they can't use Piper for that because Piper's an internal product.

The Chrome web browser is something that people outside of Google also want to develop on and so they need to download the source code and they need to do code review and so they need external tools for that and so they use Git.

And to come back to Code Search, so there's Code Search for Google 3, which is a standardized platform, and so you can do things like not just index the source code but you can, because it's the standard platform, you can also run a compiler and then the compiler tells you where the symbols are and you can do cross-referencing.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

Anyway, Shawn wanted to have Code Search for people that were using Gerrit and the Code Search team at the time was in Munich and he said, "Well why don't you go chat with these folks and see what you can come up with" and so I did.

One of the people that were involved explained to me how to build code search, what algorithms you do, and then he also said to me that that's a lot of work to reimplement from scratch. Maybe we should just open source our internal Code Search, which is written in C++, and it will be much faster than if you write it in Go, as it turns out.

If you say to me, "Han-Wen, this is way too difficult. You're not going to be able to do that." That is waving a red flag to a bull and that's why I thought, I'm going to prove you wrong. I'm going to prove that this is actually really easy.

So I implemented the first version of this search thing, and if you go back in the history of Zoekt, you can still find it. And it's actually fairly simple, so the basic idea is very simple and I got it to work in, I don't know, it took me one or two weeks.

But then you see it working and it's super fast, and it's an interesting problem, and addictive to work on and so I immediately saw all these ways in which I could do it–make it more fancy, faster.

And so I've been working on it. It started really small and then I thought, well, I should do regular expressions and then I should do indexing Git repositories and then maybe I should do multiple branches, and then...

And so every time, and I was doing it on the side and it was a fun project, and so I was doing my normal job but in spare moments where my mind could wander off, I was always thinking about what the next step was for making my code search thing even more amazing.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

So I got it to a point that it was really working quite well. I got my friends from the Bazel team to host me an instance. There's cs.bazel.build and if you go there, there's an instance running Zoekt, and I use that fairly regularly to search code.

But at some point I thought, okay, I proved my point now. This colleague of mine had said, "Well, it's a lot of work", and he was actually right–it is a lot of work. But it's also a lot of interesting work.

Especially if you work at a company that makes most of its money by doing ads on search, it's actually quite interesting to know how search works and so I felt that was useful background information.

Beyang Liu:

Yeah, awesome. So, aside from Bazel and your personal use, did anyone else at Google or externally pick it up, do you know who else is using it today?

Han-Wen Nienhuys:

So I occasionally get bug reports from people.

The Gerrit community has a couple of people and there's this company, a Swedish company, making security cameras that were also really anxious to get Code Search, and they told me they would name a street on their campus in honor of me if I would make it work for them. And I think they deployed it in the end, but I never checked if they really made Nienhuys gata or however you say “street” in Swedish.

Beyang Liu:

Awesome. I should mention that Sourcegraph uses Zoekt as a backend–it's probably the primary search backend for indexed search. So, every Sourcegraph instance has a Zoekt instance inside of it and we call into the API. It is, as you say, super fast and incredibly performant to use. So, nice work. It's quite impressive that you built all that in a matter of weeks. I'm sure maintaining and improving it over time took much longer.

Han-Wen Nienhuys:

The first part is that you just search for strings very quickly and that, I don't know, that takes a day or so. But then it gets interesting, because there are things you never stop to think about: you have to not just build the data structures but also put them on disk, and then maybe you add features and the data on disk goes out of date, so you have to do versioning. All these problems that you don't stop to think about become actual problems that you need to solve, but it's been interesting, yeah.
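To make the versioning problem concrete, here is a minimal Go sketch of one way to handle it: prefix the on-disk data with a magic number and a format version, and have the reader refuse files written by a different version so the caller knows to reindex. This is an assumption-laden illustration, not Zoekt's real file format. The magic value, the version number, and the names `Encode` and `Decode` are all hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	magic          uint32 = 0x5a4f454b // hypothetical marker: "ZOEK" in ASCII
	currentVersion uint32 = 2          // hypothetical; bumped when the layout changes
	headerSize            = 12         // magic + version + payload length
)

var (
	ErrNotIndex = errors.New("not an index file")
	ErrStale    = errors.New("index format version mismatch; reindex needed")
)

// Encode prepends a header (magic, version, payload length) to payload.
func Encode(version uint32, payload []byte) []byte {
	buf := make([]byte, headerSize, headerSize+len(payload))
	binary.BigEndian.PutUint32(buf[0:], magic)
	binary.BigEndian.PutUint32(buf[4:], version)
	binary.BigEndian.PutUint32(buf[8:], uint32(len(payload)))
	return append(buf, payload...)
}

// Decode checks the header and returns the payload, or an error if the data
// is not an index file, is truncated, or was written in another format version.
func Decode(data []byte) ([]byte, error) {
	if len(data) < headerSize || binary.BigEndian.Uint32(data[0:]) != magic {
		return nil, ErrNotIndex
	}
	if v := binary.BigEndian.Uint32(data[4:]); v != currentVersion {
		return nil, fmt.Errorf("%w: file has version %d, reader expects %d",
			ErrStale, v, currentVersion)
	}
	n := int(binary.BigEndian.Uint32(data[8:]))
	if len(data)-headerSize < n {
		return nil, errors.New("index file is truncated")
	}
	return data[headerSize : headerSize+n], nil
}

func main() {
	fresh := Encode(currentVersion, []byte("postings..."))
	payload, err := Decode(fresh)
	fmt.Println(string(payload), err)

	old := Encode(1, []byte("older layout"))
	_, err = Decode(old)
	fmt.Println(errors.Is(err, ErrStale))
}
```

Rejecting a stale file outright is the simplest policy. A reader could instead keep decoders for older versions, at the cost of maintaining them.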

Beyang Liu:

Yeah. Very cool.

Han-Wen Nienhuys:

And my compliments to the folks at Sourcegraph. I heard first of Sourcegraph using this at one of the Git conferences where someone got a binary from Sourcegraph, and ran a disassembler on it and was surprised to see all these symbols and he told me about this. I was a little bit annoyed at first but then you open sourced all of it and now I'm really happy and really proud. It feels good to be part of a very successful business even if I'm not really working for Sourcegraph.

Beyang Liu:

Yeah, well, I mean, we are super appreciative for all the work you've done in this space. Zoekt is an amazing tool, it's so performant.

I think a big inspiration for Sourcegraph actually was Google's internal Code Search, at least for me, because I was a lowly intern at Google once upon a time.

One of the things that stuck with me from that experience was just being able to search over all the code that was in that codebase, potentially relevant to me from a single portal.

We always just wanted to bring that experience to every single developer, whatever codebase they're working on, wherever they are in the world and so thank you for helping us get there.

Han-Wen Nienhuys:

Yeah, my pleasure.

Beyang Liu:

Yeah, I guess, Google these days now has a public version of Code Search again. I mean, Google has always had seemingly multiple instances or versions of Code Search.

There's the internal version which has gone through various iterations. There was the public Google Code Search, which was around, I want to say circa 2010 through maybe 2014, and then it was sunsetted and now Code Search is available again.

Are all these code search implementations using the same backend or are they all just different tools and projects, inspired by the same rough concept but distinct and different?

Han-Wen Nienhuys:

That's a good question. So the backend has gone through multiple iterations and the frontend has gone through multiple iterations. Corporate priorities change and so that has its impact on how a product gets turned up, gets turned down. So I'm not the right person to speak to these kinds of organizational decisions, but the infrastructure has gone through several iterations.

I think the people that work on it, not sure if they’re from the original team, there are still many people there but it's been... so the original version was made by Russ Cox, I think, as an intern project. It's very rare for an intern to do a project that impressive but if you're Jeff Dean's intern then I guess that's part of the expectation.

And I think that the Code Search for open source, the original one which was launched maybe in 2009 or something became the internal Code Search. And it turned out that at some point the internal Code Search was more popular than the external one.

Nowadays everyone puts their source code, almost everyone, puts their source code on GitHub and then there's this tiny sliver of people that put it on GitLab, and then there's these corporate weirdos like us that host our own infrastructures.

But just having all the source code on GitHub gives you a central place to get all the source code to index and you get signals about, people put stars on it so you know what is popular and if you know what is popular you can surface more relevant search results. And I think the first version of code search was grabbing random tarballs off the internet, so you've got a lot of duplicate results. And so I think that was a limitation that hampered its functionality and that made it not be very popular.

Beyang Liu:

Got it. Do you know if... because Google now has code search available for its own open source, as well as through GCP, is that based on the same internal code search or is that like...?

Han-Wen Nienhuys:

Yeah, it's the same infrastructure.

Beyang Liu:

Okay. Got it. I guess another thing that you work on, actually before we get to Gerrit I want to go back to the origins of Zoekt, because when we were talking earlier you told this really cool story that I think you were about to tell but then I interrupted you about how you built it on a train, or like a train from...

Han-Wen Nienhuys:

So my wife was doing a PhD in a different city, and when I met my wife, or then-girlfriend, you meet each other and you go out on the weekend and then during the week you work and you meet other people. And then we went to Germany and we got married and then she had to do this PhD in a different city. And when we started that I thought well, I've done this before. We see each other on the weekends and then during the week we do our own thing, it's going to be fine.

But somehow over the years something has changed, and so during the week I really, really missed her, and so I thought I should visit her in this other city. I live in Munich and she was in Erlangen, and this was a little bit far to do a daily commute. But if you go in the evening, take the train, and then next morning you go back, take the train back to Munich, then it's fine.

Every week on Tuesday evening, I would go there, have dinner together, and then the next day I would take the train back to Munich. And so I had this weekly commute, which took one and a half hours. Well, you can't work on Google internal stuff on the train because you don't know who's behind you, but you can work on open source.

During the week I would have all these ideas on how to improve my code search thing and then the idea of how to do it was brewing in my head, but then come Tuesday I had the idea ready in my head and I only had to type it out. And then it was always–you tried to finish it before the train reached your destination.

But if you look at the history of commits you can see that probably a lot of them were written either on Tuesday evenings or on Wednesday mornings.

Beyang Liu:

That's incredible. In some ways, coding on a train seems like a tougher environment because you don't have your full desk setup and you're on this moving vehicle. But there is something about it, I think you were touching upon this a bit, that might actually be conducive to making quick forward progress.

If you're thinking about this stuff in the back of your head for the whole week, and you're just so eager to start coding on it and then there's this time pressure element to it as well. Do you think that contributed to the progress you're able to make early on in the project or am I just completely wrong here?

Han-Wen Nienhuys:

I think the value of coding on a train–and planes work really well, as well–is that there's just nothing else, there's no internet... Well, so there's internet but it's very spotty.

Usually there's all these other people that you're sitting close to each other but then you pretend you're not looking at each other, right, sitting next to each other and you're not making conversation.

And so there's this isolation that's going on, and there's no way to distract yourself. There's no other thing to do. I think that is the value of coding on trains and planes.

Beyang Liu:

Yeah. Hearing you describe that makes me think of the times that I've coded on planes and trains as well and there is something different about that experience that might be worthwhile trying to replicate. But I don't know how you would do it–maybe it's just something inherent to being in that environment with all these other people surrounding you that you're purposely not talking to and it's moving somewhere, like the world's going by, I don't know.

Han-Wen Nienhuys:

Yeah, well I had very good luck with planes as well. So the train ride was only one and a half hours and you'd have to switch stations halfway, so I'd be on the train platform and then you have to open and close your laptop all the time so the plane is probably a better place.

Beyang Liu:

Awesome. Well, we are coming up on the end of the hour here but before I let you go, I wanted to hear a bit about your work on Gerrit today because that's what you're currently working on, right?

Han-Wen Nienhuys:

Yeah. So working is a big word because I'm what they call “overhead.” I'm the manager of the team. The people that do the actual coding report to me and I try to be nice to them and help them wherever possible.

Yeah, what should I say about Gerrit? Credit to my team for making Gerrit an awesome product.

Beyang Liu:

I guess, one thing that might be of interest to the audience because these days I think the vast majority of people are familiar with the GitHub pull request model of code review, and the Gerrit model predates GitHub and pull requests and the model is quite different. Most of the folks that I’ve talked to who have used Gerrit and also GitHub PRs much prefer the Gerrit model of doing code review.

Han-Wen Nienhuys:

It's interesting that you should say that because I was looking at our internal survey of our code review tools. And the thing that... So there's a freeform text field and we get a lot of feedback saying, "Oh, why isn't it like GitHub?"

Typically, people that are happy with the tool they're just going to write nothing and when they gripe then they're going to complain and take the effort to write something down.

Beyang Liu:

Got it.

Han-Wen Nienhuys:

Yeah, I agree, I think the model is better. So a part of me thinks that the folks, the Gerrit folks and I, my colleagues and I should not disparage them. Part of me thinks that Gerrit is really a better review experience and so that... how shall I phrase this without being disparaging?

Beyang Liu:

Maybe talk about the key differences between the two models.

Han-Wen Nienhuys:

Okay. So the idea is that when you write code and you review it, it means that you're not actually producing the final version of the code. If you're always producing the final version of the code then there's no need to review it, right?

The review is always there to point out things that should be different. And so, when you write code there's always going to be a second iteration–the same idea but with a slightly different expression.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

And so the idea of Gerrit is that you perfect this expression of what you want to do and then it's submitted, and then the result is that the history will have the perfect version of the source code. People don't need to know afterwards that you made a typo in version 22 of this change and so you shouldn't expose that.

And so Gerrit is based on the idea that you make your commits, and if they're not perfect you just tweak the history so that it becomes perfect, and once it's perfect, then it's merged into the tree. And I think the misconception comes from a lot of people that use Git tend to say, "Well, you should never change your commits because then people can't merge or you can't work together with other people."

And so the moment you publish your source code for other people to consume, to pull in, or to merge in, that's true if you then change history, it will create a lot of confusion. So, the model that Gerrit has is the central authority or the central server which has the canonical place, and you want to polish the source code that you do submit until it's perfect and then it goes into the central repository.

And one of the things that this enables, because each commit, I talked earlier about this case where you have a bug fix and then on top of the bug fix you do a feature. And so what you can do with Gerrit is you can build these stacks of changes, and have them out for review at the same time, and that is very hard in the pull requests model because... yeah, why is it hard?

Beyang Liu:

I've seen people try to hack it by having multiple PRs that merge into one another, or... Sorry, I'm not describing this very well. It's just not automatic, right.

Han-Wen Nienhuys:

Yeah. And so the Gerrit workflow forces you into this mode where you polish your code until it's perfect or until the reviewer is satisfied with it, and it lets you build these huge stacks of changes and still manage the code review process well.

Beyang Liu:

Yeah.

Han-Wen Nienhuys:

But the downside is in order for the tool to understand that version one of this, let's say it's a bug fix, version one of the bug fix is related to version two of the bug fix, it needs to connect those two commits. And so we do this by adding a footer to the commit message called Change-Id and then a big hex number.

And it only works if you add this random footer to the commit, and so that confuses people. And also changing history, rebasing, that's often something that people don't do, so that takes a little bit of extra effort.

Yeah, it's a skill that you need to learn–how to edit history. If you don't have that skill then it's more confusing.
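The footer mechanism described above can be sketched in a few lines of Go. Gerrit's commit-msg hook generates footers of the form `Change-Id: I` followed by 40 hex digits, and two commits carrying the same value are treated as versions of the same change. The parsing below is a simplified assumption (it only inspects the last paragraph of the message), and the function name `ChangeID` is hypothetical; it is not Gerrit's own implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// changeIDLine matches a footer line such as
// "Change-Id: I0123456789abcdef0123456789abcdef01234567".
var changeIDLine = regexp.MustCompile(`^Change-Id: (I[0-9a-f]{40})$`)

// ChangeID returns the Change-Id found in the last paragraph of a commit
// message, or "" if there is none. Simplified: real footer parsing has more rules.
func ChangeID(message string) string {
	paragraphs := strings.Split(strings.TrimSpace(message), "\n\n")
	footer := paragraphs[len(paragraphs)-1]
	for _, line := range strings.Split(footer, "\n") {
		if m := changeIDLine.FindStringSubmatch(strings.TrimSpace(line)); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	v1 := "Fix off-by-one in parser\n\n" +
		"Change-Id: I0123456789abcdef0123456789abcdef01234567\n"
	// After review the commit is amended: new content, new hash, same footer.
	v2 := "Fix off-by-one in parser\n\nAlso add a regression test.\n\n" +
		"Change-Id: I0123456789abcdef0123456789abcdef01234567\n"

	same := ChangeID(v1) != "" && ChangeID(v1) == ChangeID(v2)
	fmt.Println("same change:", same)
}
```

Because the footer survives amending and rebasing while the commit hash does not, it gives the review tool a stable key for grouping successive versions of one change.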

Beyang Liu:

Do you have any words of advice for folks who might be listening who are interested in trying it out? Are there any good getting started guides or ways into the Gerrit world?

Han-Wen Nienhuys:

Yeah, it's a good question. So we develop Gerrit mainly for our internal customers and they... if we were selling this as a service we would make much more effort to have very polished Get Started documentation, but we don't and so I would have to look for that.

There is a partner in the open source Gerrit ecosystem, called GerritForge, and they run a version of Gerrit that can easily import repositories from GitHub and it can actually also... you do the code reviews on their Gerrit instance and then when you submit it gets pushed to GitHub.

This is actually how I do reviews for my FUSE library.

Beyang Liu:

Cool.

Han-Wen Nienhuys:

That's called GerritHub, I think.

Beyang Liu:

GerritHub, yeah. I've heard of that. I was just talking to, do you know Paul Jolly at all? He's a Go...

Han-Wen Nienhuys:

No.

Beyang Liu:

Anyways, he's the creator of this thing called Cue, which is a configuration language and he was tweeting about how they just moved over to...

Han-Wen Nienhuys:

Right. Yeah, now I know. They used to be at Google, and there's this Dutch guy called Marcel van Lohuizen, I think, that was also involved in that.

Beyang Liu:

Cool, awesome.

Han-Wen Nienhuys:

But yeah, they moved to GerritHub.

Beyang Liu:

Cool, well I think we're basically out of time so just wanted to say thanks again for joining us today, Han-Wen. This has been a very enlightening and insightful conversation. Thanks for sharing all your stories and experiences with us.

Han-Wen Nienhuys:

Yeah, and thanks for having me.

This transcript has been lightly edited for clarity and readability.
