Episode 11: Michael Stapelberg, creator of i3, Debian Code Search, and distri

Michael Stapelberg shares with us a multitude of experiences and contributions across the Go and Linux open-source communities. Highlights include creating the popular window manager i3, building Debian Code Search, and researching fast package management for Linux with distri. Thorsten Ball, author of Writing a Compiler in Go and Writing an Interpreter in Go, joins. The three of us talk about the importance of developer experience to open-source communities, how code search changes how you work, and how to decide when to build something new.

Show Notes

Michael Stapelberg: https://twitter.com/zekjur, https://michael.stapelberg.ch

i3 window manager: https://i3wm.org

wmii, inspiration for i3: https://wiki.archlinux.org/index.php/Wmii

Other window managers: dwm (https://dwm.suckless.org), xmonad (https://xmonad.org), awesomewm (https://awesomewm.org)

x11vis: https://x11vis.org

Wayland vs X11: https://www.secjuice.com/wayland-vs-xorg

Meson build system (vs Autotools): https://mesonbuild.com/Comparisons.html

Debian: https://www.debian.org

Debian Code Search: https://codesearch.debian.net

Google Code Search: https://en.wikipedia.org/wiki/Google_Code_Search

Russ Cox blog post on Google Code Search: https://swtch.com/~rsc/regexp/regexp4.html

Planet Debian: https://planet.debian.org/

NVMe SSDs: https://en.wikipedia.org/wiki/M.2

Distri: https://distr1.org, https://github.com/distr1/distri

Russ Cox blog post on why SAT solving is hard: https://research.swtch.com/version-sat

pdiffs and why they should be disabled by default: https://debian-administration.org/article/439/Avoiding_slow_package_updates_with_package_diffs, https://people.debian.org/~stapelberg/2013/11/27/pdiffs.html

Project Atomic: https://www.projectatomic.io

Silverblue: https://silverblue.fedoraproject.org

Distri mailing list: https://www.freelists.org/list/distri

Linux From Scratch: http://www.linuxfromscratch.org

Transcript


Beyang Liu: All right, I'm here with my colleague Thorsten Ball, and we are joined by Michael Stapelberg, creator of the i3 window manager, Debian Code Search, and many, many more open source tools in the Go and Linux communities. Michael, welcome to the show.

Michael Stapelberg: Thank you, thank you for having me here.

Beyang: So, we have a lot to cover in the next hour just because you're very prolific in terms of your work, but before we get into all that, I always like to start things off on kind of a personal note and ask people, what was your earliest memory as a programmer? If you think back to the very beginnings.

Michael: Yeah, I'm not sure if I would call it the earliest memory as a programmer, because I wouldn't consider myself a programmer back then, but the earliest memory that I have that relates to programming is that I was using one of these teaching computers, which were kind of popular in the '90s, and they were running BASIC, because that was supposed to be approachable and understandable. So the earliest memory I have of something that is programming is I read this program in a literal booklet, and was typing it into the computer, and what it was doing is it was converting Celsius temperatures to Fahrenheit temperatures.

Michael: But at the time, I didn't understand what Fahrenheit could possibly be or why temperature would have different units to be expressed in. So it was entirely unclear to me what this program could possibly be doing, but still I was interested enough to type it in and run it with a couple of example values. So that is my earliest memory.

Beyang: That's awesome, how old were you when you were doing that?

Michael: I think about 9 years old or so.

Thorsten Ball: That's usually when you need to convert between Celsius and Fahrenheit, right?

Michael: Yeah, exactly. Some people only need to do it as a teenager, but for some it starts a little early.

Beyang: And from there did you just continue programming after that first experience or was there, some people like-

Michael: [crosstalk 00:03:14] No, there was distinctly a gap, which is why I was mentioning that I wouldn't call this the start of my programming career, because as I mentioned, I didn't know what was going on neither on the actual, I would say in quotes, "business logic" level, logic, probably, nor on the actual language level or anything else, so I typed it in, I was confused, I stopped caring for a couple of years and then eventually I got back into it because I was helping out the local youth outreach community, they needed some computers and I could help them out and then it kind of spiraled from there.

Beyang: That's awesome. So, the thing I want to start with, kind of selfishly, is the i3 window manager, which you created and I use. It's a window manager for Linux, it's a tiling window manager, and I guess to kick things off for those people who might not know what a tiling window manager is, could you explain how a tiling window manager is different from the regular window managers that ship with the standard Ubuntu distribution or MacOS or most consumer oriented operating systems?

Michael: Yeah, for sure. I think, first of all, we need to establish what is a window manager at all before we can get into the nuance of what is a tiling window manager, because many people will not be familiar with the concept, like Windows and macOS users, on these platforms you don't have the option to easily exchange your window manager. But on Linux, things are split up between the X server, which does the actual visual rendering of what you have on screen, and the window manager, which determines where your windows are actually placed and how big they are, and crucially what sort of interface you have to interact with them.

Michael: So, typical window manager actions that everyone of us is familiar with is maximizing a window, or closing a window using the X button, or pressing a keyboard shortcut so that you either close a window or you switch between windows like alt tab, or things like that. That's the window manager on Linux. Now, the tiling window manager is a specific variant of it, and the tiling refers to your windows being arranged like tiles on your screen. So, all of the available space that you have in terms of available pixels are divided up, so if you just open a single window using i3, it will be full screen. It will span the full screen.

Michael: And then as soon as you create another one, the screen will split into two halves, but all of the space that you have is always in use, so it's very efficient, and, in fact, tiling window managers in general are sort of code for minimal window managers that target a very specific demographic. It's really, on the i3 website we say i3 is targeting advanced users and programmers. We're not trying to convince the everyday user, a casual user of computers, that i3 would be better for them, because it isn't. But for people who do a lot of window manipulation and who just want something that gets out of the way and is minimal, i3 is a great window manager choice.
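
To make the tiling idea above concrete, here is a minimal Python sketch of how a screen could be divided so that every pixel stays in use: one window gets the full screen, two windows get one half each. It is only an illustration, not how i3 is implemented; the names `Rect` and `tile_horizontally` are made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangle on screen, in pixels (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def tile_horizontally(screen: Rect, window_count: int) -> list[Rect]:
    """Split the screen into side-by-side tiles that together cover all of it."""
    if window_count <= 0:
        return []
    base, extra = divmod(screen.width, window_count)
    tiles = []
    x = screen.x
    for i in range(window_count):
        # Give leftover pixel columns to the first tiles so none are unused.
        width = base + (1 if i < extra else 0)
        tiles.append(Rect(x, screen.y, width, screen.height))
        x += width
    return tiles


if __name__ == "__main__":
    screen = Rect(0, 0, 1920, 1080)
    for count in (1, 2, 3):
        print(count, tile_horizontally(screen, count))
```

With a single window the only tile is the whole screen; opening a second window yields two tiles of 960 pixels each on a 1920-pixel-wide screen.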

Beyang: Yeah, it makes sense, if you're a programmer who's got a bunch of terminal windows open, editor windows, a bunch of web browser windows and various other things open all the time, and you're tired of moving your mouse and finding the corner to drag something and you just wanted something more automated and in a lot of cases more keyboard driven.

Michael: Yeah, especially, different people have different workflows, but what I kind of observe sometimes, especially when people are not so familiar yet with the ecosystem, is that they just have multiple windows open in their day to day development workflow, so when they make a change in their editor, maybe they switch to a terminal window, they run a build command there, and then they switch to a browser window and then they actually reload the page in there and then for every change, for every iteration that they do, they repeat this, they switch between these three windows. If you do three windows in a row, already you can see that a keyboard shortcut might be a good idea.

Beyang: Yeah. I'm at the point now where when I do a screen share with someone else and they're using a non-tiling window manager, it's slightly painful for me to watch, because it's like every second spent moving the mouse and dragging, I was like, oh.

Michael: Yeah, you really do get used to it.

Beyang: Yeah. So, you mentioned Linux has this ecosystem of tiling window managers, there are several others I think that predate i3, as well. Dwm, awesome window manager, xmonad, can you talk about your motivations for creating i3 and what kind of sets it apart from the other window managers out there for Linux?

Michael: Right, I think as it often is when you are frustrated with a program and you decide to set out and write another program, a replacement program or an improvement on this, maybe you don't actually survey the entire space first. So, in my case, if you ask me when I started out the i3 project, how does it compare to awesome or xmonad? I would be like, I've never tried these, I don't know how this compares. But in retrospect, it's fairly clear, and there is actually a legitimate niche for it which is also proven by its popularity. So the specific motivation that we had was we were actually using a different window manager at the time, which was wmii.

Michael: And if you have used or seen wmii, you will instantly recognize that i3 is visually similar to it. For example, the stacking layout, or stack layout, is very distinctive visually, you can recognize it anywhere, and in wmii there is a very similar idea, though it behaves a little bit different in nuances. And that is really the immediate push over the edge, so to say, that was the impetus of i3 is that we were unhappy with a nuance of how wmii had changed between versions and we decided, well, maybe it's time to finally change the code a little bit, send some patches, change all of these things that have been bugging us for a while, and now this is really, enough has changed that it really makes sense.

Michael: So we tried digging into the code, and we didn't have the best experience. So this was partly because in the X11 space, there are many different concepts, it's a very old space, documentation might be from the 1990s, which sounds funny if you grew up during that time, but now that's like 20, 30 years ago. So, yeah, this is all pretty dated stuff. For example, they have their own string encodings and things that are, nowadays, unthinkable, right?

Thorsten: I want to tack on and ask, a follow up question to something you said, you said when you get frustrated with a tool and you decide to build a new one, you don't survey the whole landscape. I am not bad at getting frustrated with tools, but I am bad at finding the right spot to say, now is the right time to build something. I'm often, you get a new, let's say, a little script or shell alias and then you realize, oh, I should've done this years ago.

Thorsten: How do you approach this? When do you decide, okay, this looks like fun but here's a legitimate problem, this is what I need to solve?

Michael: Yeah, this is a great question, and there is a famous XKCD comic where the comic compares how much time you spent on writing the automation versus how much time you save using the automation. And usually there is a very clear point and then many people go exactly the wrong way. Yeah, this is easy to happen or to have happen, I think, personally, there is a mixture of different factors, right? You can say that, maybe you've already invested some time surveying the landscape a little bit, which would probably be a good choice to begin with if you have any sort of frustration at all. Just get aware of the different possibilities.

Michael: Maybe you've asked people, maybe you've exhausted the easy ways, and then the other factor that plays a huge role in what I want to do is my motivation to actually work in that space. For example, during the development of i3, I have actually created my own X11 visualization tool, which was a terrible choice in terms of time being spent is exactly what a friend of mine has told me when I started that project, right? I was talking to him, I was like, so I'm considering doing this visualization project, and he was like, "that sounds like a terrible idea, I think you're going to regret investing the time", which is exactly the question you're asking.

Michael: But it turned out that was actually a great idea, because for visualization, I think it inherently has a benefit which is that you see things in a different way that you haven't seen them before, and that is just such a huge game changer so often. So, I had, in the specific example of the visualization tool, what I did was I set myself a time limit and I said, if on this Saturday and maybe on the Sunday as well I can get as far as milestone X-Y-Z, I'm going to keep at least spending a little bit of time on that tool, because I think then something useful can come out of it.

Michael: And by that time I was convinced that it was worthwhile to invest a couple more days. Because the typical effect that I observe with visualization tools is exactly what had happened there, I had a whole new way of looking at these exchanges of data between the X server and the client, you can think of it kind of like Wireshark, but with a lot more X11 specific knowledge of how events are bundled together and what references what and how you want to navigate between them.

Michael: So it's really like a domain specific visualization tool, and these pay off so quickly. Now, circling back to i3, however, that was a different story, for the first month of development, I couldn't even use it myself properly because it was just so unstable and unfinished. And then in the month after that, I did use it myself but it was still so unstable that it would crash from time to time. So, clearly, I have invested so, so, so many years of my life into i3, did it pay off overall? I think I had fun doing it, and it was a good experience to do it. And, for sure, looking back, at the time, I was a little bit naïve and a little bit too arrogant, and then you make these choices but it doesn't mean that they're necessarily wrong.

Michael: What I think is, as long as you have fun doing it, maybe it's not the worst thing. Even if the cost benefit ratio isn't exactly in your favor, just enjoy the process.
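
The cost-benefit trade-off discussed above (time spent writing the automation versus time saved by using it) comes down to simple arithmetic. The following Python sketch is only an illustration; the function name `break_even_uses` and the example numbers are made up, and it deliberately ignores the fun and learning factors mentioned in the conversation.

```python
def break_even_uses(build_minutes: float, manual_minutes: float,
                    automated_minutes: float) -> float:
    """Return how many uses it takes before the automation has paid for itself.

    build_minutes: one-time cost of writing the automation.
    manual_minutes: time one run of the task takes by hand.
    automated_minutes: time one run takes with the automation.
    """
    saved_per_use = manual_minutes - automated_minutes
    if saved_per_use <= 0:
        # The automation saves nothing per run, so it never pays off on time alone.
        return float("inf")
    return build_minutes / saved_per_use


if __name__ == "__main__":
    # Example numbers (assumed): two hours to build, task drops from 2 minutes to 30 seconds.
    print(break_even_uses(build_minutes=120, manual_minutes=2, automated_minutes=0.5))
```

In this made-up example the tool pays for itself after 80 uses, which is quickly reached for something done several times a day and rarely reached for a yearly task.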

Thorsten: Yeah, I was going to say you need to dare a little bit and try and see how much can I get done, set yourself a limit, a time box and then say, how much would it take to actually automate this or to improve this? And then just dare, basically.

Michael: Yeah, absolutely.

Thorsten: Cool.

Beyang: For me, as well, that fun aspect of it which also ties into how much am I going to learn from this and will the knowledge I gain from doing this pay off in the future? I also factor that into the ROI calculation. Oftentimes, if the naïve ROI calculation doesn't justify it, I really want to do it anyway, I need something to push it over the edge.

Thorsten: Yeah.

Michael: Yeah, for sure, and there's various ways to fudge it like that, but circling back to the i3 origin story, this was exactly the problem that we had when we wanted to work on wmii, it just wasn't fun. The code was very dense, it was not easy for us to pick up as outsiders, it was not well commented, we didn't understand many of the abstractions and concepts, et cetera, and we tried improving it for a while and we did get some response, like some of our patches that added documentation were accepted. But after a couple of weeks, it just showed that this is not the most fun way of tackling this problem, and what if we could just constructively do something, and brainstorm, and go from there and it was fun so we kept doing it and then eventually we had a whole window manager.

Beyang: And when you say "we", how many people were on the project at that point?

Michael: Yeah, I think it was always two of us, though the other person is kind of a secret co-founder, if you will, in that he has mostly just inspired me and given me ideas and questions and was just participating like that, in the early phases of the project. And then, at later stages, there were other people who were instrumental in the development, there was somebody who contributed i3 bar, which is the bar at the bottom or top of your screen, that used to be a separate project, we eventually merged it into i3, and there were other core developers who have stepped up in the years after it really took off, so to say.

Michael: We started in 2009, and then over the next one, two, three years, it was sort of word of mouth and it was spreading, but then, eventually, it really took off, and that's also when we got, I want to say a core team of three to four people, obviously, it changes over time, some people get busier, some people leave the project for whatever reason. But, yeah, we've had some very good people contribute significantly to the project.

Beyang: Now, you said it was kind of directly inspired by wmii, but you weren't super aware of dwm, awesome window manager, xmonad at the time, but later on, the distinctions between i3 and those other window managers became clear. What would you say are the big first order differences these days?

Michael: Yeah, for sure, so dwm, awesome and xmonad all are sort of automated tiling, whereas i3 can be categorized as manual tiling, in the sense that in dwm and awesome, which is a descendant of dwm, you have these layouts and then you open more windows and then they're automatically arranged in, maybe things such as a big window on the left and then everything else is smaller windows on the right, or some layouts go as far as building a Fibonacci spiral out of windows. I don't think you need to go that far.

Michael: But in i3, nothing like that happens, so you are responsible for moving the windows around, for establishing your own layout, but at the same time, the layout is also more dynamic, it's not as rigid. So that's the first difference in the look and feel, which is very important, obviously, if you're talking about a program whose only interface is that particular look and feel.

Michael: But then, also, the other interface to a program obviously is its configuration file, and the secondary resource for that is the documentation that tells you what you can do in that configuration file. And, also, the community, and one factor that we had heard so often from people who came from awesome or especially xmonad and other window managers where, it strikes me, all the three that you listed are actually programmatically configurable, and dwm to the extent that you actually need to modify the C source code and recompile it, whereas awesome is in Lua and xmonad obviously in Haskell, but people are so tired of all of the need to program and express your configuration in syntax that they're not familiar with.

Michael: It's great for the Haskell community to have a window manager that you can configure in Haskell, I'm sure that makes people very happy, but for people outside of that community, it just feels so awkward. So one of the big differentiators of i3 is that the configuration is plain text, and it is understandable. And, in fact, the only reason why I claim that it is understandable is because we have spent conscious effort between the version three, which was the initial release, and the version four release, which was the only major release break that we ever did, to revamp the whole configuration file.

Michael: And the way we approached it is I was studying, at the time, and at the university I was asking the other students, I was handing them a printout of this and I was like, explain to me what this does. So I was handing them the config file example, and they were like, "ah, so I would guess that this could...", and then they would explain what the feature might do, and I was like, okay, so this one is clear enough. This one is obviously not clear, we need a better name for this, we need a better comment here, we need to add a pointer to more documentation there, et cetera.

Michael: And then you end up with a config file that is really approachable to many people. And just that initial hurdle of you have a program, but you can't quite figure out how to customize it, to make it do what you really want, that is a big hurdle. And if you can keep that down by using an approachable config file, and understandable documentation, then that brings so many people on board.

Beyang: That is a really good way of testing out the ease of use of your configuration language. I wish more products took that approach, and maybe we should do that for the Sourcegraph configuration format.

Thorsten: I was going to say, was there ever explicit user testing done for other programming languages? Like here's a piece of code, tell me what you think this will do, first glance.

Michael: Yeah, it sounds like such an obvious idea that there must have been people who have done it, but I'm also not aware of any big projects that can say that they have made decisions based on that sort of testing.

Beyang: So, looking forward for i3, the project seems like it's mostly in maintenance mode right now, it's pretty mature, are there any future features that you're looking forward to?

Michael: Yeah, I would say maintenance mode is a fair characterization. For most people, the changes between releases are not going to be significant. Obviously there is still active development going on, but there's not going to be any super weird changes or changes that change the program drastically, so whenever people come and make a feature request or request a change of any sort, one aspect that we consider is how much will this change the mental model or how much mental overhead will there be to educate our users about this new thing? Or do we deem it so important that we want to add it at all? Because everything we add makes it harder to understand what the program is about and what it does, it makes it less focused.

Michael: So there's not going to be big new features, but there's still plenty of opportunity to address some sharp edges. Myself, the way I see my role is that the project, it does what I need it to do, it does what many other people need it to do. The thing that it now most crucially needs to be is stable over the years until the situation stabilizes between X11 and Wayland. And if ever Wayland completely replaces X11, then we can totally retire i3, but before that, there's always going to be somebody who has this weird environment.

Michael: For me, myself, the way I work right now in the working from home situation is that I use Emacs over X11 forwarding, over SSH, so I just cannot do this using Wayland. And so I know that for the computing environment that I have, it's going to be years before there's going to be any sort of switch, so I think there's not going to be big changes in X11 either, so I don't think it makes sense to have big development there, have big changes there, you can see that that's not where the attention is, so that's not where we need to spend a lot of energy, either.

Michael: Myself, what I do is I oversee the project as a whole, obviously, I'm the person who needs to step in when there's conflicts, but also I'm the person who does things that other people don't want to do, necessarily. So, for example, this development cycle in i3 we are switching from Autotools to Meson, which is a change that I'm really excited about. It totally fits with the theme of this show as well, because it's about developer tools, and the build system changes is work that it's both very opinionated, so people don't necessarily feel like they should be going in there and making changes, like what if I don't like them or whatever, maybe I will block them.

Michael: In the initial years of i3, we had explicitly said that having plain makefiles is a virtue, we have since changed that approach. But now, I am the person who just does tasks like that, I have my subject areas in which I'm an expert, I can help out in these if everything else sort of fails. But in the day to day, I'm busy enough with my other projects.

Beyang: Actually, real quick, so it's Autotools to Maven? Is that the switch?

Michael: No, to Meson.

Beyang: Meson.

Michael: Yeah, it's the GNOME community's mostly preferred build tool, that's where it got big, but it's kind of Pythonic in how you configure it, but it's much more high level than all of the others. So if you're used to either Autotools or CMake, or, what else have you, SCons, I haven't used SCons much, but Meson is really much more high level, it much more feels like Blaze or Bazel or [inaudible 00:24:12], or whatever other high level build tools you have like that.

Michael: So it knows about dependencies, but more crucially, it also understands what is actually C code and what are libraries and how could they fit together, and you no longer need to construct command lines manually that call a compiler, it's much higher level than that and that actually allows it to deliver features much more quickly. That is one of the big distinctions that I see between languages such as C, where the development environment had stagnated for decades, and then languages such as Go where they come with a feature enabled by default that really makes it stand apart, like profiling by default, debugging symbols by default, cross-compilation by default, all of these sorts of things, they're hard in C, and Meson helps us go a step further than we could previously with Autotools.

Beyang: Makes sense. I feel like I could spend an entire hour or more talking about i3 and all this sort of stuff, but I want to get also to your involvement in the Debian community and the Linux community more broadly.

Michael: Yeah, for sure.

Beyang: So, I guess you were a major contributor to Debian for a long period of time, can you talk about how you got involved in Debian initially?

Michael: Yeah, absolutely. So, I was using Debian because a friend, who was very familiar with it, he knew the answer to every question for a long time and he introduced me to it, and then, eventually, I sort of started the process of becoming a Debian developer, which is like a longer, formal process. I think they've shortened it significantly since then, but back in 2012, it was a long process. And then I sort of became a Debian developer before the friend who was sort of my mentor, nowadays he's also a developer, so all is well.

Michael: But, yeah, it was popular in my bubble and I liked what I was seeing back at the time, as it goes, I'm not necessarily surveying everything when you get introduced into something, so, I don't know, if I had known about, say, Fedora, would I have preferred it? I can't say, right? It's a different timeline, so to say.

Beyang: Yeah. Makes sense. And as part of your contributions to that community, you've contributed a lot to the developer tools and I think developer experience has been a high priority of yours, and one of the tools that exemplifies this attitude is you were the creator of Debian Code Search. Can you talk about what Debian Code Search is and what it lets you do as a Debian developer?

Michael: Right, so one of the things that I noticed at the time was that it was very hard to get an overview that is scoped to the entirety of Debian. So it was easy enough if you were working on a specific package, let's say I was working on the i3 packages, like the i3 window manager itself, the screen locker, the status bar, all of these sort of packages. If you look at them in isolation, it's easy enough, or if you want to search for, for example, let's say you want to ensure that all of the packages that are maintained by the X11 team are up to a certain standard version of packaging.

Michael: Maybe you want to use a feature in the i3 package that depends on something else, and then how do you do this? How do you identify who is actually the X11 packaging group in Debian, which packages do they own, where can I get all of the sources, how can I search all of the sources? So Debian Code Search is sort of an all in one shop for this sort of question, if you ever wanted to search through more than just one package that you already happen to have on your hard disk, you could just go to Debian Code Search, and you put in your search term, possibly using a regular expression syntax, then you can search all of the source code of all of the Debian packages.

Michael: So, Debian being a modern enough Linux distribution, that means a lot of open source software, so if you want to find an example of, let's say, an implementation of an algorithm or if you have recently discovered that there is a security issue in your code, and you're wondering, "well, if I made that mistake, who else has made it, and are there any high profile cases that I should know about?" And it's very easy to answer these questions once you have a search engine that is always up to date and that covers all of your packages.

Michael: But even just finding packages is a thing that in Debian was not trivial, because Debian values distributed working so much, so packages didn't necessarily need to be in a source code control repository at all, so most developers were using Git but Debian values that people have choice, so some of them were using Git, some of them were using Subversion, some were not using any version control at all, and then some of them were hosting it on a somewhat centralized Debian hosting site, others were hosting it on GitHub, yet others were hosting it on their personal computer that happened to be down when you wanted to update your repository.

Michael: All of this could happen, and just getting source code access in order to even just view the source code of packages was hard, and it was hard like that in every single step of the way, in anything that I wanted to do in Debian.

Beyang: Makes sense, and when you created it, had you used a previous code search tool prior to that or what was the inspiration for you?

Michael: Yeah, so there were other code search tools, the one that I was familiar with at the time was Google Code Search, because it was very well known, obviously. But they did shut it down, I think they announced the shutdown in 2010, but I'm not sure on that, and then it was still kind of working for a little while, and I was living dangerously, still using it, but then eventually it went away. And then, luckily, though just after or in between when it was officially deprecated but not quite deleted, or just after, there was this blog post that I became aware of written by Russ Cox who was the original author of Google Code Search, and he actually talked about the project in terms of what makes it special in terms of how is it built, like the trigrams, the trigram index that makes regular expression search possible.

Michael: And he was explaining this, and he was sort of adding a teaching implementation or an example implementation along to go with his blog post, and I was like, well, that's very interesting, right? We no longer have access to Google Code Search but we could build something kind of similar if we took this and added a web front end and made it get source code from somewhere, and then I thought how could I achieve this, because it's pretty hard to crawl all of open source. Suddenly you have so many problems, you have licensing issues, how do you [inaudible 00:31:02] to check licenses. You need to maintain a crawler, you need to not overload other sites, so many operational burdens.

Michael: So I figured, well, maybe actually coupling it to a Linux distribution would be good because then you get the [inaudible 00:31:15] benefit of you cannot only search the upstream source code, you can also search the distribution specific bits around it. So any sort of Debian packaging, metadata, or instructions, scripts, anything like that, you can also search. So, suddenly, the tool becomes much more valuable, it becomes valuable to one demographic which is interested in just code search across open source, and then it is more interesting even, potentially, to Debian developers.
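
The trigram index mentioned above can be sketched in a few lines of Python. The idea is to record which files contain each three-character sequence, use that index to narrow the search to a small set of candidate files, and only then run the real regular expression on those candidates. This is a simplified illustration, not the code behind Google Code Search or Debian Code Search: the function names are made up, and where a real implementation derives the required trigrams from the regular expression itself, this sketch takes a literal that every match must contain as an extra argument.

```python
from __future__ import annotations

import re
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every three-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the names of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, content in docs.items():
        for tri in trigrams(content):
            index[tri].add(name)
    return index


def search(docs: dict[str, str], index: dict[str, set[str]],
           pattern: str, required_literal: str) -> list[str]:
    """Find documents matching pattern, using the index to skip most files.

    required_literal is a string that any match must contain (an assumption
    of this sketch; see the note above).
    """
    candidates = set(docs)
    for tri in trigrams(required_literal):
        # A matching document must contain every trigram of the literal.
        candidates &= index.get(tri, set())
    regex = re.compile(pattern)
    # The index only narrows the search; the regex gives the final answer.
    return sorted(name for name in candidates if regex.search(docs[name]))


if __name__ == "__main__":
    docs = {
        "a.c": "int main(void) { return 0; }",
        "b.py": "def main():\n    pass\n",
        "c.txt": "nothing here",
    }
    index = build_index(docs)
    print(search(docs, index, r"main\(\w*\)", "main("))
```

If the literal is shorter than three characters, it has no trigrams, so every file remains a candidate and the search falls back to scanning everything.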

Thorsten: So, you could answer the question that you mentioned at the beginning, who are the maintainers of these packages? Because it's off the index?

Michael: Exactly, yeah, you can just search for, if you know that the maintainer is encoded in a file called debian/control, you can just enter path:debian/control, then a blank, and then Maintainer:.*X11. And then you find every package where the field contains X11 in there.

Thorsten: Cool, and you said here's this problem, Google Code Search is going away, here's a possible solution. How was this accepted in the Debian community? Because I can see multiple versions here where, "oh, a centralized server where all of our code is indexed and kept in one place? That's not good", or, "this is fantastic". What was the initial reaction?

Michael: Right. Yeah, so, I was writing blog posts about this and publishing them on the Planet Debian aggregator, so people knew about it but it never did get, I never asked for, is this a project that we should do? That was never even the discussion for me, the situation is, well, I want to do it, I'm going to do it, are you interested in having this service? If so, how can we support each other? How can you give me the resources, for example, domain, that is actually self service so that's the reason why it was so easy to get that domain, codesearch.debian.net, debian.net is where any developer can just add their records, and then I was just running it on my own server for the initial start.

Michael: So I was just happy to do all of that work myself. In the meantime, I have gathered a couple of fans, so to say, in the Debian community that have recognized the value of having Debian code search and that are using it regularly, and I'm grateful for their feedback and support and testimonials, et cetera, and I think nowadays the service is much more anchored in the community than it was back then.

Beyang: Code search is interesting, so, obviously Sourcegraph is a code search engine as well, and what we found from talking to people is there's kind of, broadly, two sets of people when you talk about code search. There are those who had used it before, they love it, they can't live without it, and then there's people who have never used it before and they're like, "why would I need this? What can this do that I can't do in my editor or on the command line via grep?" So have you encountered that difference in attitude, and why is it that code search is this thing that is kind of hard to perceive the value of, but once you have it, it becomes such a core part of your workflow?

Michael: Yeah, it's a very interesting question, and I think there are some similarities to if you think back to when you were first introduced to the internet itself, you were like, "okay, so this is the internet, but where do I navigate to? How do I use this?" And then, at some point, search engines came along, and then it became much easier. You just told people, look, there's a search engine field. And then, still, it takes a little bit of coming around, but then eventually, it clicks, and people are like, "oh, if I have this problem I can just put a question into the field and maybe helpful answers might be among the results".

Michael: So maybe it's similar to that, and to me, personally, the value of having code search is easily explained by thinking about it in terms of removing hurdles. Because without, for example, Debian Code Search, I would need to update my local checkout of Debian packages, or I will just not bother, because it's a task that takes potentially hours, and if I only have a couple of minutes of spare time a day in a busy full time job life, then there's a hurdle. I can no longer do this. So as soon as Debian Code Search is there, suddenly I can.

Michael: So I think it's that sort of game changer, it enables a different work flow and for people who cannot imagine that workflow or who haven't been introduced to the workflow, it seems strange. But then, once they see it, maybe it clicks.

Beyang: Yeah.

Thorsten: Would that even be possible to keep a local checkout of all the Debian packages?

Michael: Yeah, certainly, I mean, if you think about it, that's how we do it on the server side, right?

Thorsten: Yeah, but how big is it?

Michael: Sure, yeah. It's 140 gigabytes in size. So it is totally doable on... It wasn't as doable back then, when I started the project, it was a huge contentious point because I wanted to transition the project from my own server to the Debian servers, and at the time, I was sort of used a little bit to the Google way of doing things, where if you needed a terabyte of disk space for your work project at Google, you wouldn't need to ask anybody, you could just self service grab it. Sure, why would we spend time arguing about this?

Michael: So I was approaching the Debian sysadmin team and I was asking them, well, for a search index, it needs to be fast, so we need to use flash storage. So how could I get flash storage? And they were like laughing at me because it was such an outlandish request, how could I dare to ask for flash storage?

Beyang: What year was this?

Michael: This was 2012.

Beyang: Oh.

Michael: And 2012 was around the time that I bought my own SSD's for my laptop, and they came in sizes such as 128 gigs, so if you bought the cheapest of the cheapest, you probably couldn't have your own code search archive, but these days, with disk sizes that are like 512 GB plus, even in the cheaper laptops, if you care about Debian stuff, you can totally have your own checkout, yeah.

Thorsten: Yeah, I'm surprised it's not that big, right?

Michael: Yeah, it always, it's different in how you scope it. If you only track, for example, Debian unstable, which is really only what you need for development, then it's much smaller than if you also track the other suites that are still actively maintained.

Thorsten: Yeah.

Beyang: Yeah, I think people really underestimate the impact that those kind of friction points have on what they actually choose to do day to day, and it's almost like when you're developing, you're kind of wandering in this wilderness and the step that you take now is going to impact where you end up five, six hours from now even though you don't think too much about it. It's almost like when we were talking about Sourcegraph, especially in the early days, to people who had never used code search before, they were like, "well, if I wanted to search over... My code base is small, it fits on disk, it's not as big as 120 gigs, it's probably in the multiple gigs if even that. So if I wanted to search it at a particular revision, I would just stash my working state and check out that separate branch and just use grep or something like that."

Beyang: But then you ask them, okay, so how many times do you actually do that on a day to day basis? And it's like, "well, almost never, because it's annoying to unload your working state, you kind of have to context switch", and you're making this local calculation, is that local piece of knowledge worth the couple of minutes plus context switch? Probably not, but you don't realize that it's going to lead you down this path of five or six hours from now you might be, because you found the answer quickly, you didn't waste several hours of your time writing code that wasn't necessary.

Michael: Absolutely, yeah, and even when you just said, you just do a Git stash, you already lost so many people. Many people who, when they hear Git, they're just immediately turned off, or even when they are accustomed to Git, when they hear Git stash, they're like, "oh my god".

Thorsten: Yeah, and it's also this you don't know what you don't know, right? If you have code search, suddenly you can have a link, a URL, that you can share with colleagues or other people that lists all of the results, like here's a URL that tracks all of the TODOs that we have in, I don't know, these five sub-projects, whatever. If you don't have code search, yes, you can do this, you can curate a list of all the TODOs and write them down in Markdown somewhere, but once you have it, it's so easy to just get the URL and share that. And suddenly, you don't want to go back.

Michael: Absolutely, and I think you raised a great point here which is that it's not enough to just have the one time index, or to do the one time list manually, there is also immense value in having code search update quickly. And I've seen this so much when I was working in my current team in sort of an internal clean up effort and we were transitioning to use a new API instead of the older API. And when you have code search that lags behind a couple of days, it's like you have an entirely different workflow, because now you need to maintain your own spreadsheets of what is where and who tackles what.

Michael: Whereas if you have that link that you can just share and everybody just opens it up in the morning and goes, "oh, yeah, this is the current state", then it's much clearer and you don't need to maintain anything. And if you have that and you take it to the extreme, where if you make a change, it's immediately indexed and this is what Debian Code Search tries to do but there's a couple of hurdles why it isn't quite as good, but I've seen other code searches where if you submit a change, within seconds, it's actually live.

Michael: And that just, it gives you so much more motivation, because you can be like, oh, so I see this problem here, you just make your change, and then you can be like, and now it's actually gone. I can no longer find it. It can be done, it's done and entirely done. It's just such a nice feeling.

Thorsten: Yeah, and there's even another layer on top where people recognize that code changes often and they want to basically be notified when something changes, and Sourcegraph customers, for example, they want to get notified when code changes and get an email, or with Sourcegraph Campaigns, which is what I'm working on, they want to react to code changes and say, "hey, whenever a new thing pops up here, please run this code, or please do this".

Michael: Yeah, there's definitely interesting use cases for this sort of stuff, right? You could say whenever there's a new user of this API symbol, we can say, this is a new user, maybe we should see who this is, send them an outreach email. Or maybe you could track how many deprecated usages you still have over time and have a graph of it, or a monthly summary or monthly progress emails. There's so much secondary stuff that falls out of having this data programmatically available and always up to date.

Thorsten: Yeah, that's a good way of phrasing it, it enables a whole different thing, a whole different class of tools, basically.

Michael: Yeah.

Beyang: What language is Debian Code Search written in?

Michael: It is written in Go, actually, it was my first bigger Go project where it was actually multiple services running on multiple VM's, intermittently, actually. We started out on a single machine, we went to a cloud deployment on Rackspace, who were thankfully hosting us for many years, and now we're back on a big Hetzner box that I just pay for myself which actually it turns out is much faster than the Rackspace cloud we had access to at the time.

Beyang: Wow, that's crazy.

Michael: Yeah, it really makes a big difference to have two fast SSD's in your machine, these very modern NVMe M.2 SSD's, yeah, they're just a very different ball park in terms of performance. If you don't have that, yeah, it's a big step up. Kind of like the original introduction of SSD's, right? When everybody's mind was collectively blown by hard disks and now there's SSD's, and, wow, suddenly random access is fast? It's kind of like that in that you can do tens of thousands of IOPS easily and gigabytes of writes and reads per second over these NVMe M.2 SSD's, so, yeah, I'm very glad that we have them.

Beyang: And are you running that at home or are they in some server rack warehouse somewhere, or?

Michael: Yeah, the Debian Code Search one is one of the ones that I run on my rented server at Hetzner in Germany, they're known for their cheap cloud offerings but also cheap dedicated servers, but when a company starts out having cheap services, that doesn't mean that they stay cheap forever because over the years, as they get more customers, maybe they can actually improve their services and I feel like Hetzner has actually become quite decent and I'm quite happy with the performance.

Michael: But a couple of other projects of mine I do actually host out of my own home here, which is very nice.

Beyang: Yeah, I was actually looking into this, not seriously, but casually, a while back because you can buy really cheap blade servers, second hand blade servers that are still functional, the warranty is expired, so no actual company will buy them but...

Michael: [crosstalk 00:45:09] Yeah, but then the question always is, do you really want that old blade center? Because they're loud, they're power expensive, I don't know if... And it's hard to get replacement hardware for them, so, personally, I can see the appeal in this, I can see the home lab perspective and I appreciate that folks are doing that sort of stuff, but for the day to day or the only way that I can personally integrate it into my day to day in a reasonable way is to just use standard, off the shelf hardware so that if anything breaks, I'll just bike over to the computer shop and I'll just pick a new part and I just plug it in.

Michael: And I only do it for fun, so only the not so super important services are running from my kitchen. Yeah, but it's pretty cool to be able to do that and that the internet allows for it, so if you're ever trying out the Distri Linux distribution of mine, then your packages will be installing right out of my kitchen.

Beyang: Yeah.

Thorsten: That's cool.

Beyang: Actually, that's a great segue into Distri, which I also wanted to talk about. So, Distri is this Linux distribution, I believe you wrote it for exploratory research purposes, initially, but with the express intent of trying to fix package management, is that right?

Michael: Yeah, I don't know that I would have the intent to actually fix package management, but I definitely want to understand why it is the way it is and how it could be done better, and I think the research that I've been doing in there so far has been very fruitful, at least for my own personal understanding but also I think in conveying the problem and the problem space to other people.

Michael: So, yeah, you're spot on, Distri is, indeed, the way I phrase it is it is a Linux distribution to research fast package management. And it was born out of frustration, as is apparently a common theme in today's show. Yeah, I was frustrated with Debian but also all of the other distributions. So, in Debian, especially if you're running it on something like a Raspberry Pi, where the computer is a little bit slower, everything is a little bit slower, and then you just say, I just wanted to install this real quick on my Raspberry Pi and then five minutes later you're still waiting for apt, and you're like, what? Why?

Michael: Is there a reason for this? So, naturally, I was very curious but it was also very hard to get deep into a system where you don't have anybody you can ask, so understanding how Debian is really built is really hard, even over the years, so I find it sometimes easier to start out greenfield and just develop something from scratch, where you're like this should probably work like that and you just plug it together, and then you also know what's there and you have control over the whole thing.

Michael: So that's the advantage in Distri. In Debian it takes multi-year projects to change anything, because it's a very entrenched distribution. So, so many computers are relying on Debian changing slowly, so it changes slowly. But, in Distri, nobody is relying on it and intentionally so, so I have the freedom to change everything between any release or to not make any release at all and just use it privately for myself and see what I learn and see what I can build, see what I can tell other people.

Beyang: And what have you found so far? Why is package management such a seemingly hard problem? In theory it's simple, you have this dependency graph, one thing depends on another, you just walk that graph, build the transitive closure and then just fetch everything. Why is that so hard?

Michael: Well, that's exactly what Distri does, it's not so hard after all.

Beyang: Why is it so hard for others, then?

Michael: No, so, see, it gets harder the more you look at the details. So, for sure, you have the dependencies, you need to fetch the transitive closure of something, but at which level do you actually do dependency resolution? For example, you could say, and Distri does it like that, you build a package, and everything that went into this package as a dependency is going to be persistent. And that's going to be the dependencies that you will get when you install this package on your computer.
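
The "walk the graph, build the transitive closure" step from the question can be sketched in a few lines of Go. This is only an illustration of the idea, not Distri's actual code: the `closure` function and the package names with pinned versions are hypothetical, and the dependency map stands in for whatever each built package records about what went into it.

```go
package main

import (
	"fmt"
	"sort"
)

// closure returns root plus everything reachable from it through deps,
// sorted by name. deps maps a package to its direct dependencies.
func closure(deps map[string][]string, root string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	for len(queue) > 0 {
		pkg := queue[0]
		queue = queue[1:]
		for _, d := range deps[pkg] {
			if !seen[d] {
				seen[d] = true
				queue = append(queue, d)
			}
		}
	}
	out := make([]string, 0, len(seen))
	for p := range seen {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical packages; each name carries the exact version that was
	// present at build time, so nothing is left to resolve at install time.
	deps := map[string][]string{
		"curl-7.69":     {"openssl-1.1.1", "zlib-1.2.11"},
		"openssl-1.1.1": {"glibc-2.31"},
		"zlib-1.2.11":   {"glibc-2.31"},
		"glibc-2.31":    nil,
	}
	for _, pkg := range closure(deps, "curl-7.69") {
		fmt.Println("fetch", pkg)
	}
}
```

Because every edge points at one exact version, the walk never has to choose between alternatives, which is what keeps it a plain graph traversal.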

Michael: Other distributions don't necessarily do that, they have the sort of flexibility in there, they just say, "oh, yeah, this package, it was built with OpenSSL 1.1", but it only depends on OpenSSL, if you have OpenSSL 1.5, or newer, the program or the package does not need to be changed, it will just be installable and then it works because it's dynamically linked against whichever version you happen to have in your system. And then the package manager really needs to work, because now it needs to ensure that the system itself is consistent, that you cannot have conflicting versions of dependencies on your computer, because there is only one /usr/lib/libopenssl.so, or I think it's called libssl.so and libcrypto.so to be precise, but you know what I mean, right?

Michael: And this is where, suddenly, in the design space, we haven't steered so far off course of the main question, everything is still roughly the same but now we say, okay, but what if we have conflicts? And then what does that mean? Well, now it means you need to specify which packages conflict with each other, and that means SAT solving. So, Boolean satisfiability solving, so now we have this process which takes very long, and on Russ Cox's blog there are great articles about why SAT solving is complicated and slow and how you would do it properly, but it's still not great.

Michael: So Go Modules, notably, doesn't do any SAT solving either, so in Distri, one of the key insights that I have learned is that, in fact, on the distribution level, it's not necessary either. For sure, you get a couple of other different properties out of that, for example, if you want to push out a security fix, let's say in libssl, you need to actually rebuild the transitive closure of dependencies so that they all pick up the new version because they're all pinned, so to say, to this one set of dependencies otherwise.

Michael: So you have a little bit more effort in there, you have a little bit more bandwidth costs, but you have an easier mental model overall if you follow this model, and it is much harder for anything to accidentally break.
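
The cost side of that trade-off, which packages need a rebuild after a fix, is again a graph walk, just along reversed edges. The following Go sketch assumes a plain map of direct dependencies; the `rebuildSet` function and the package names are hypothetical and not taken from Distri.

```go
package main

import (
	"fmt"
	"sort"
)

// rebuildSet returns every package that depends on fixed, directly or
// transitively, sorted by name. deps maps a package to its direct dependencies.
func rebuildSet(deps map[string][]string, fixed string) []string {
	// Invert the edges: for each dependency, record who depends on it.
	rdeps := make(map[string][]string)
	for pkg, ds := range deps {
		for _, d := range ds {
			rdeps[d] = append(rdeps[d], pkg)
		}
	}
	seen := make(map[string]bool)
	stack := []string{fixed}
	for len(stack) > 0 {
		p := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, r := range rdeps[p] {
			if !seen[r] {
				seen[r] = true
				stack = append(stack, r)
			}
		}
	}
	out := make([]string, 0, len(seen))
	for p := range seen {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical dependency graph.
	deps := map[string][]string{
		"git":     {"curl", "zlib"},
		"curl":    {"openssl", "zlib"},
		"openssl": {"glibc"},
		"zlib":    {"glibc"},
		"bash":    {"glibc"},
	}
	fmt.Println(rebuildSet(deps, "openssl"))
}
```

In this invented graph a fix in openssl means rebuilding curl and git, while bash and zlib are untouched; the lower a library sits in the graph, the larger the rebuild and the download that follows.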

Thorsten: You mentioned something now with bandwidth, which is a constraint, and I'm wondering, do you think the feasibility of different trade offs that package managers and Linux distributions make has changed over the years, and the context from which I'm asking this is when you distribute binaries or programs, you can dynamically link in other libraries or you can statically compile the binary and put everything in, and 20, 30 years ago, statically compiling a binary was not feasible because the binary would be big, you need to transfer it, you couldn't save it on a floppy disk or a CD, whatever, and now, suddenly it's like we don't care whether the binary is 15 megabytes or 500 kilobytes, it doesn't matter but it makes the whole distribution process much more easy to have a single file that we need to copy around.

Thorsten: And I think the hardware changes in the past decades have kind of changed the game and I'm wondering, did that also happen in Linux package management?

Michael: Yeah, absolutely, I think you're absolutely right, this is also why we see Docker being so popular because disk space is cheap, let's pin everything, everything keeps working, that's nice. Yeah, in Debian you can see this as well, but I find it very curious, so one example I would like to present here is a mechanism called pdiffs, which I think stands for package list diffs, which means that when you update your package lists using apt update, instead of downloading the whole package list file, which can be very large, it just downloads the diff files.

Michael: Now, this is a great mechanism, right? It saves bandwidth, it's going to be good for users who are on low bandwidth connections, but at the same time, if you're on a high bandwidth connection, this is a terrible change for you because suddenly you need to fetch so many files and then you need to apply them all, and it's much faster to just fetch a big file and save it than to have a small file and then many other small files and then compute, resolve all the diffs, apply all the diffs, et cetera.
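
A back-of-the-envelope model shows why the same mechanism helps on one connection and hurts on another. The Go sketch below is not a measurement of apt: the `Link` type, both functions, and every number in `main` (file sizes, number of diffs, latency, time to apply a diff) are made-up assumptions chosen only to illustrate the shape of the trade-off.

```go
package main

import "fmt"

// Link describes a network connection; both fields are illustrative inputs.
type Link struct {
	BytesPerSec float64 // sustained throughput
	RTTSec      float64 // latency paid once per request
}

// fullDownload estimates the seconds needed to fetch the whole package list
// in a single request.
func fullDownload(l Link, listBytes float64) float64 {
	return l.RTTSec + listBytes/l.BytesPerSec
}

// viaDiffs estimates the seconds needed to fetch n diffs one after another
// and apply each of them. applySec is an assumed per-diff processing cost.
func viaDiffs(l Link, n int, diffBytes, applySec float64) float64 {
	perDiff := l.RTTSec + diffBytes/l.BytesPerSec + applySec
	return float64(n) * perDiff
}

func main() {
	// All numbers below are invented for illustration.
	const (
		listBytes = 40e6 // full package list
		numDiffs  = 30   // diffs accumulated since the last update
		diffBytes = 30e3 // size of one diff
		applySec  = 0.2  // time to apply one diff
	)
	slow := Link{BytesPerSec: 50e3, RTTSec: 0.2}
	fast := Link{BytesPerSec: 100e6, RTTSec: 0.02}

	for _, c := range []struct {
		name string
		link Link
	}{{"slow link", slow}, {"fast link", fast}} {
		full := fullDownload(c.link, listBytes)
		diffs := viaDiffs(c.link, numDiffs, diffBytes, applySec)
		fmt.Printf("%s: full %.1fs, diffs %.1fs\n", c.name, full, diffs)
	}
}
```

Under these assumptions the diffs win clearly on the slow link, where transfer time dominates, and lose on the fast link, where the per-file latency and the cost of applying each diff dominate.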

Michael: So, I had in, I think, 2017 or so, maybe a little earlier, I was complaining about why is pdiff enabled by default? I think it doesn't make sense given the hardware that most people are running on, but I think Debian in general is a little bit slow to even pick up changes like that. But, yeah, in general you can definitely see this in the distribution landscape. I think, also, as other distributions are getting more into containers, for example, Fedora has their, what's it called, project something? I think Project Atomic, I believe, and Silverblue is the distribution on top of that concept. Or something like that, it's all container based and cloud-y.

Michael: That is another sign of the changing landscape, both in terms of hardware and software surroundings.

Thorsten: Yeah, interesting, cool.

Beyang: In our previous conversations, you mentioned that even though Distri is kind of a research project at the moment, you think it can still be used in some situations, like potentially in your development environment for reproducible dev environments or possibly in CI where you want to install things very quickly, can you talk about what that use case looks like?

Michael: Yeah, for sure. I think it's suitable for this, I don't know if I would necessarily be recommending people actually use it like that, but what I find instructive is to, if you're feeling adventurous, you could try and change your CI environment like your .travis.yml file from just doing apt commands where you install your dependencies to using Distri, there's a little bit more to it but if you're curious, you can just send an email onto the Distri discuss mailing list and then I can outline it in more detail how I did my experiments.

Michael: But what I've seen is that the installation time, and hence the overall CI run time, drastically falls down if you're using Distri because package installation is just so much lower overhead, and so much more parallel and leveraging high bandwidth connections, which you have on all of these modern CI cloud environments. So, yeah, this would totally work, if the software that you're interested in is available in Distri, I think it's mostly useful to get a feeling of what sort of performance we're leaving on the table in the established distributions.

Michael: If only they could change a couple of things, like a couple of low hanging fruit, but maybe also a couple of bigger architectural changes, we could be in a much different position in terms of speed of all of the Linux distributions.

Beyang: What goes into creating a new Linux distribution? How long did it take you to get to a minimal viable product point for Distri?

Michael: Yeah, so a lot of this you can learn on the Linux From Scratch site, Linux From Scratch is really a great resource. It essentially walks you through how to build a Linux system if you didn't have a package manager at all. If you only downloaded upstream sources and then you make install them manually, and it just walks you through this process step by step. So if you wanted to build a new Linux distribution, you could just go through that guide and then see if you could kind of automate it, because that's what the package manager does.

Michael: You can start very, very small, you can just boot the kernel, that could be a first milestone, how do you even place the kernel in the hard disk or in your development environment such that you can start it and then the next milestone might be you boot the kernel and you actually start a small user land, maybe BusyBox, which is an integrated, embedded solution for this, just one binary from your [inaudible 00:57:08]. You don't even need to have a kernel driver for your hard disk, file system and root file system, anything like that, you can just tell the kernel, look, here's a binary, just run that and then you have a shell.

Michael: And at that point, that's already a very satisfying milestone. And then you just kind of build from there, there are a bunch of typical GNU tools that you need to get packaged and a couple of other parts in the typical open source stack. For many of them, you can apply shortcuts like maybe you don't need to build them with all of the fancy integrations that they have and all of the extensive documentation that requires more third party tools in your base build closure and stuff like that.

Michael: But, yeah, that's basically the process.

Beyang: It almost sounds like even if you don't have a specific end use case in mind, it almost seems like a really good educational experience to go and try and do that, because it'll show you how all the Linux internals fit together.

Michael: Absolutely, and I think that is really the approach that Linux From Scratch, the project, wants, it wants to be that guide that teaches you how Linux works by having you assemble your own distribution from the pieces. The same argument is often cited for distributions that are more involved or more minimalistic such as Arch Linux, or Gentoo Linux was a more popular one back in the day. And I think there is some truth to it, if you have the time and motivation and energy and circumstances to spend some time building your own Linux distribution, absolutely go for it.

Beyang: You know, Michael, I did not get to even half of what I wanted to cover today, so if you're up for it, maybe at some point you can come back on the podcast to talk about gokrazy, the pure Go userland for Raspberry Pi that you've written, the Twitch streaming that you're doing, your desk setup and much, much more in the Go and Linux communities, as well as I just wanted to pick your thoughts on developer experience in general, because that seems like a theme that recurs over and over again in the work that you've done.

Michael: Yeah, for sure, no, I'd be happy to come back, yeah.

Beyang: Okay, awesome, yeah, we'll definitely do a part two. But, for now, I guess for those listening, the things that we talked about today, Distri, if people want to try out Distri and learn more about it, what would you recommend that they do?

Michael: Yeah, they could just check out the website at distr1.org or just put in Distri Linux in their favorite search engine, you should find it easily, or just, you can reach out to me on Twitter if you want, just put my full name into the Twitter search field, you should find me. Or find my website. I'm sure you'll find it.

Beyang: Well, Michael, thanks so much for being on the show today.

Michael: Yeah, no problem at all, glad you had me here.

Beyang: The Sourcegraph Podcast is a production of Sourcegraph, the universal code search engine which gives you fast and expressive search over the world of code you care about. Sourcegraph also provides code navigation abilities, like jump to def and references in code review, and integrates seamlessly with your code host, whether you're working in open source or on a big, hairy enterprise code base. To learn more, visit sourcegraph.com. See you next time.
