Episode 4: Evan Culver, developer tools at Segment
Evan Culver, Beyang Liu
Evan Culver builds developer tools at Segment, a customer data platform that lets product managers and software teams understand their users through data.
Evan's career has spanned many years up and down the software stack, from frontend UI development to infrastructure and ops. For the past five years, his focus has been developer tooling and infrastructure, having worked on these during his tenure at Uber during its hypergrowth years and now on the dev tools team at Segment, where his charter is to "empower the engineers of Segment with the tools to automate, optimize, and streamline their workflows." In this episode, he explains to Beyang what exactly that means, discussing Segment's use of technologies from the AWS ecosystem, the popular open-source secret management tool they created, ChatOps, and various Docker- and Kubernetes-based tools that are useful for managing the deployment of many microservices.
Evan Culver: https://github.com/eculver
Amazon ECS: https://aws.amazon.com/ecs/
Amazon EKS: https://aws.amazon.com/eks/
Docker Compose: https://docs.docker.com/compose/
AWS SSM: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html
AWS KMS: https://aws.amazon.com/kms/
AWS Lambda: https://aws.amazon.com/lambda/
Anastassia Bobokalonova, software engineer who built Segment's ChatOps tool: https://twitter.com/anastassiaflow
Rootless Docker: https://docs.docker.com/engine/security/rootless/
Kraken, peer-to-peer Docker registry: https://github.com/uber/kraken
Kubernetes Custom Resource Definitions: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
This transcript was generated using auto-transcription software.
Beyang: Evan Culver's career spans many years up and down the stack, from frontend to backend to infrastructure to ops. He was an early engineer at Uber, where he worked on realtime systems and developer tools to scale the engineering organization during the hypergrowth years from 2014 to 2017. Today, he's an engineer at Segment, where his team owns most of the developer tools and infrastructure, including continuous integration, version control, cloud provisioning, secrets management, and deployment. Evan's team has a charter to empower the engineers of Segment to automate, optimize, and streamline their workflows. Evan, welcome to the podcast.
Evan: All right. Thanks for having me. It's good to be here.
Beyang: So the company tagline is the customer data platform that makes good data accessible for all teams. Essentially what you do is you connect all these different tools in order to make the data, uh, consistent and accessible to marketers and product managers and folks like that who want a better understanding of their users and customers. Is that about right?
Evan: Yeah, I'd say you did a pretty good job at describing it. You know, when people ask me what Segment does, I try to say we take data where it's produced and get it to where it's most valuable, and that could mean a lot of things. I guess that's one of our products, you could say. There are other things that go along with Segment as a platform, things like privacy. We do things for GDPR and we do things for managing Personas, but in the end, it does end up being, you know, getting your data to the highest-value location, whether it's any of those destinations you mentioned. But that's it.
Beyang: Awesome. That is a super valuable thing to do. So that's kind of the view of Segment from the outside in, you know, as a user or customer. That's kind of the black box interface that you expose, the product-level description, if you will. Um, I'd love to get your overview of Segment from the inside out, like underneath the hood. From an engineer's perspective, what does Segment look like?
Beyang: That makes a ton of sense. Uh, for the services that you run that aren't managed services, can you walk through a couple of examples of some big services?
Evan: Um, yeah. Well, the one that is, I think, core to it all is a service called Centrifuge, which basically is our core delivery mechanism. And, you know, I'm going to speak about it a little bit even though I don't work on it. I mean, it's really a great piece of software. It's gone through quite a few iterations, but it's really the brain of how messages make it through our pipeline, get redelivered, and basically are ensured to make it to the destination they're supposed to make it to. So whether it's, you know, this one message that needs to get delivered three times, once to Facebook's ad platform, another to Mixpanel, another to HubSpot, it will make sure not only that it eventually gets there, but that it will be retried if it fails initially, and that the retry mechanism works in a way that doesn't violate the terms of the other API provider. You know, in a friendly way, I should say. And in general, yeah, it takes care of ensuring that it's delivered. And along the way, you know, it does this in a highly available, distributed fashion. So this is a thing that we developed at Segment to do our core business. And so I'd say that service is, um, yeah, it's pretty important. It's not just one of those control plane services, and not that they aren't all pretty important, you know, they serve their own purpose. But, for example, if Centrifuge is down, like really hard down, we're still accepting data, but we're just definitely not delivering data. So, that said, it's actually one of our early Kubernetes adopters. We're just getting in on this bandwagon of the Kubernetes things and all the cloud native world. And, you know, it's generally our SREs that wrote and manage this core service. And so, being at the forefront of it all in terms of infrastructure and, I guess, the cool things and shiny toys, yeah, we're fully Kubernetes integrated there. The other services are usually deployed on ECS. But yeah, I guess that's really it.
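To make the delivery behavior Evan describes a little more concrete, here is a minimal Python sketch of fanning one message out to several destinations and retrying failed deliveries with exponential backoff and jitter (a common way to stay "friendly" to a third-party API). This is not how Centrifuge is actually implemented; the names `DeliveryError`, `deliver_with_retry`, and `fan_out`, and all parameter defaults, are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Any, Callable, Dict


class DeliveryError(Exception):
    """Raised by a (hypothetical) destination sender when an attempt fails."""


def deliver_with_retry(
    send: Callable[[Any], None],
    message: Any,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one message; back off exponentially between attempts."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except DeliveryError:
            if attempt == max_attempts - 1:
                return False  # out of attempts, give up
            # Exponential backoff plus random jitter so that retries do not
            # hammer the destination API in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return False


def fan_out(
    message: Any,
    destinations: Dict[str, Callable[[Any], None]],
    **retry_options: Any,
) -> Dict[str, bool]:
    """Deliver the same message to every destination independently."""
    return {
        name: deliver_with_retry(send, message, **retry_options)
        for name, send in destinations.items()
    }
```

In this sketch each destination is retried independently, so one failing destination does not block delivery to the others. A production system would also persist messages between attempts, which is left out here.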
Beyang: Would it be kind of safe to say that, um, all or maybe the vast majority of data that flows through Segment at some point flows through Centrifuge?
Evan: That is accurate. Yeah.
Beyang: To kind of use the hourglass analogy, would it be the narrow waist of the hourglass, where you have all these integrations with data sources on one side and then all the integrations with data consumer services on the other?
Evan: Yeah, I think that's probably pretty fair. I hope I'm representing this accurately. I know Achille, um, one of the engineers that worked on this the most, I hope I'm representing it correctly, but he's really the brains behind all of it, and it's quite a fantastic situation that Centrifuge has put Segment in. And so yeah, shout out to Achille for delivering such a high quality piece of software.
Beyang: Cool. Um, so, you know, on the topic of microservices, this is a broad trend you hear a lot about these days. It's definitely a buzzword. How do you ensure that the versions of all the different services stay in sync, and that when pushing an update to one, you don't make it so that it breaks being able to talk to the others?
Evan: Yeah, I mean, how do we roll out code and ensure compatibility across all these different services, more or less? You know, I think this is a challenge, but in general our strategy is to have feature flags around as many things as possible and to do as much upfront integration testing in our pre-production environments as possible. So I think in general, engineers at Segment have a pretty similar development workflow in terms of using Docker Compose and then building things locally, just sort of their own little version of the world. But, you know, assuming you write your own tests and you build your things and you've tested it locally, then it's introduced into our deployment pipeline, and that's kind of arm waving because a lot of people do it differently. But in general, people stage their changes in a pre-production environment that might be specific to their team, or really just our staging environment in general. And what that means is we have a bunch of QA mechanisms that are kind of ongoing, where we're either shadowing production traffic or we have a constant workload that's being shoved into that environment, for lack of a better word. And so you basically get your feedback pretty early and often as to whether or not things are actually just working. A lot of times we will have an incident and it's just for our staging environment, because we might've pushed some code that breaks one part of the pipeline or whatever, and basically makes things undeliverable, let's say. But yeah, those things generally get caught in stage because we have pretty much a replica, and a replica that's processing very production-like data along the way. And then, depending on the service, I'd say there are different feature flagging components for when it does go out to production. For example, depending on the customer and the sources and the destination, we have different ways of targeting these workloads as they flow through. So, for example, we have a service that will consume the messages that come off of Kafka before they get to Centrifuge. There are different versions of this service that will run, and based on a flag that we set, we'll roll it out to different tiers of users, or customers, or different sources. There's a bunch of different combinations and flags you can set to make sure that this is the version that's actually going to be delivering and handling your code. So I'd say in general, the easiest way to describe it is trying to deploy things dark and then light up new code paths incrementally as we go. And I think there's a varying level of that in so many ways. I mean, there are some services where, for example, you know that if this thing returns a 500, Centrifuge will then retry to deliver it anyway. Or there are other reliability guarantees that we have baked into the platform so that, if your service has failed, it will be retried. So you basically get another chance, you know, if you deploy something bad here.
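The "deploy dark, then light up incrementally" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Flag` structure, the tier names, and the hash-based bucketing are hypothetical and are not taken from Segment's feature flagging tool.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Flag:
    """Hypothetical flag: always on for some tiers, partially on for the rest."""
    name: str
    enabled_tiers: Set[str] = field(default_factory=set)
    rollout_percent: int = 0  # 0..100, applied to customers outside enabled_tiers


def bucket(flag_name: str, customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def use_new_path(flag: Flag, customer_id: str, tier: str) -> bool:
    """Decide whether this customer's traffic takes the new code path."""
    if tier in flag.enabled_tiers:
        return True
    return bucket(flag.name, customer_id) < flag.rollout_percent


def handle(message: dict, flag: Flag) -> str:
    """Route a message to the old or new code path based on the flag."""
    if use_new_path(flag, message["customer_id"], message["tier"]):
        return "v2"  # new code path, deployed dark until the flag lights it up
    return "v1"      # existing code path
```

Because the bucket is derived from a hash of the flag name and customer ID, a given customer stays on the same side of the flag as `rollout_percent` is raised, which keeps an incremental rollout predictable.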
Beyang: So you kind of design each service to be relatively fault tolerant, such that, you know, if it doesn't get exactly what it wants, or if something goes wrong like that, it is resilient to that. Is that fair?
Evan: Yeah. I think in general, a lot of our services end up being stateless. So, you know, if it didn't do what it was supposed to do, that data or that request could just be retried. And so in general, that's kind of how we treat a lot of our data plane services. Control plane services are just highly available and, uh, yeah, they're generally read heavy, I suppose. There are some caveats, I guess. You know, I didn't really ever mention that we do have hundreds of services. So for me to sort of blanket some of these things, it's with a ton of asterisks. I hope I'm generalizing in a way that's accurate for at least most of our services. But there are also, you know, a bunch of very unique things that might be stateful and not stateless, and I didn't really even cover those. But those are usually the ones that are pretty unique.
Beyang: That is a lot of services.
Evan: Yeah, I mean, at Uber, you know, this was a bigger ordeal. And I think some of my former colleagues have spoken about that quite a bit, but you're talking thousands, which was a whole nother situation. I don't think I'd be the right person to talk about the way that those things are defined.
Beyang: How do you even keep track of everything that's, you know, running in production at a given point?
Evan: Yeah. Okay. So like, how do you keep track of all the versions, or, you know, if something's broken, how do we even know which one it is? Um,
Beyang: Is there like a dashboard of some sort that gives a bird's eye view, or is it just that you have to rely on the people who know the ins and outs of their part of the deployment?
Evan: You know, I'd say that our deployment tool is kind of the best representative of that. I mean, depending on the component. Some things get deployed with our internal tooling, you know, to do things against ECS, and then there are other things that just get deployed with Terraform, for example. So I wouldn't say there's one comprehensive dashboard of where all of these versions exist. Actually, well, I say that, and now I remember there was a service that I think somebody wrote that was just called expects. And it was just every service and the version that it had deployed. I don't think that's around anymore. It served its purpose, but what it ends up being is just too much information for a human to really make sense out of. And so I think what it ends up being is, you know, if you're in a big data pipeline and you have your service in the middle and maybe five services around you, that's really the glance that you need, right? Like, my service is on version zero, that one's on version one, that one's on version two, and that's kind of the scope of the world you care about. So I think in that world it's a little bit easier to manage. You really can build your own little dashboard of sorts. And I think most people, to be honest, just use a command line tool, or they use, you know, a Terraform state sort of representation to see which version things are on. And yeah, that's really it. I think it can seem really complicated, but it ends up not being very complicated in practice. But, um, I also don't work on every single project at Segment, so I'm sure there's somebody going, like, "You don't even know. I have so many services I need to keep track of. It's impossible." But, um, you know, there's our feature flagging tool, and we may talk a little bit about how those things are managed. Those do give you some visibility into which versions are deployed where, and it's kind of another thing that people use to get a glimpse of all the things that are deployed.
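The "look around" view Evan describes, where you only care about your own service and the handful of services it talks to, can be sketched as a small Python helper. The call graph and version map below are assumed inputs (for example, collected from a deploy tool or infrastructure state); the function names and data shapes are hypothetical.

```python
from typing import Dict, List, Set


def neighbors(service: str, calls: Dict[str, List[str]]) -> Set[str]:
    """Services that `service` calls, plus services that call `service`."""
    downstream = set(calls.get(service, []))
    upstream = {src for src, targets in calls.items() if service in targets}
    return (downstream | upstream) - {service}


def local_view(
    service: str,
    calls: Dict[str, List[str]],
    versions: Dict[str, str],
) -> Dict[str, str]:
    """Deployed versions for one service and its immediate neighbors only."""
    names = sorted(neighbors(service, calls) | {service})
    return {name: versions.get(name, "unknown") for name in names}
```

With hundreds of services, a view like this stays small: the output only grows with the number of direct neighbors, not with the size of the whole system.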
Beyang: So would it be safe to say that one of the principles here is that you actually don't want the global view of everything, because that's too much information to consume? You just do this look-around sort of thing, where you're deploying one particular service, or you own that service, and you want to understand what the current state and version is of the immediately surrounding services, because those are the ones that you're talking to and that are talking to you.
Evan: Yeah. And I think that's probably true. And that's even if you care about the version. I guess caring about the version might imply that you care about people breaking backwards compatibility, for example. Or, you know, maybe there's something to be said about needing to even answer that question in the first place if you build solid APIs. But yeah, I think you summarized it pretty well.
Beyang: On the source, like the version control management side, is Segment a monorepo, or do you have many different repos for the multitude of microservices you have?
Evan: Yeah. Okay. So I'd say, well, both. I mean, we have all of it, and we have all of them: tons of repos. I'd say most of these microservices are in their own repositories. However, we do have some parts of the stack that are effectively monorepos, you know, they produce multiple binaries. Yeah, there's a couple I can think of.
Beyang: Cool. So in this microservices architecture, one thing that has to happen is obviously the different services have to talk to one another. And in doing that, you know, different services have varying access and permission levels, things like that. And secret management becomes a thing. Can you talk a little bit about how Segment solves this problem of secret management?
Evan: Yeah. Okay. So we have this project called Chamber. I think that really is at the root of all of it. It's open source. I'd say it's pretty simple. I mean, you can plug in different backends, but in general, we use it with an SSM backend in AWS. Um, and yeah, so.
Evan: Uh, systems. So what is it, like, systems configuration? I don't even think SSM stands for the right thing. SSM Parameter Store. Um, what does it actually stand for? Systems Manager Parameter Store. So, like, SSM, I dunno, I don't really know where the second S comes from, but yeah, Systems Manager Parameter Store. And that's really just, you know, you can store encrypted values in a backend. It's like a key-value store for encryption.
Beyang: And that's the thing that Amazon AWS provides specifically for this, uh, kind of use case.
Evan: Right. I'm looking at a page that says it provides secure, hierarchical storage for configuration data management and secrets management. So, yeah, it's a data store that has some things built into it that make it more friendly for, you know, encrypting things along the way. But you can also do things like use Chamber to store secrets in S3 that are encrypted manually with a KMS key. Again, KMS is the Key Management Service.
Beyang: Got it. So it can kind of hook into multiple backends.
Evan: Right. Yeah. So I'd say Chamber is at the heart of how we manage secrets. What this means from a developer's point of view is there's a command line tool, chamber, and you'll do, like, a chamber write with a key and a value. So you might create that key manually, or we might get one from some third party provider, and then you pipe it into chamber to write it. And then when you're actually running your program, you use the other side of the tool, which is, you know, read, to get it out. But typically services do this with another subcommand, chamber exec, which basically takes everything in a key space, translates it to environment variables, and then just wraps your service with those environment variables. So it makes it pretty easy to ensure that your service is being injected with the right secrets at runtime.
Beyang: So the user interface would be, it's a wrapper around some binary that you're trying to execute. So it accepts environment variables and then forwards those to the binary executable that it's invoking, passing along the secrets that are relevant.
Beyang: that, is that about
Evan: Um, yeah. Well, I mean, I think Chamber itself only requires some AWS credentials. So those are the environment variables that it consumes. But then when it's invoked, for example when you run chamber exec, it's going to go to the backend that's configured, usually SSM, and then try to read, for that key space, all of the secrets that are there at their latest version. And then it takes those values, populates the environment, and invokes the binary that it's wrapping. So yeah, I think that's basically the functionality that it provides.
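The exec-wrapper pattern Evan describes (read every secret in a key space, turn each into an environment variable, then run the wrapped program) can be sketched in Python as follows. This is not Chamber's code: the in-memory `FAKE_STORE` stands in for a real backend such as SSM Parameter Store, the service and key names are made up, and the key-to-variable naming rule (uppercase, dashes to underscores) is an assumption made for this sketch.

```python
import os
import subprocess
from typing import Dict, List, Mapping

# Hypothetical stand-in for a secrets backend, keyed by service "key space".
FAKE_STORE: Dict[str, Dict[str, str]] = {
    "my-service": {"db-password": "s3cr3t", "api_key": "abc123"},
}


def read_keyspace(service: str) -> Dict[str, str]:
    """Return all secrets for one service (a real backend call would go here)."""
    return dict(FAKE_STORE.get(service, {}))


def to_env_name(key: str) -> str:
    """Assumed naming rule: uppercase the key and replace dashes."""
    return key.upper().replace("-", "_")


def build_env(service: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy the current environment and layer the service's secrets on top."""
    env = dict(base_env)
    for key, value in read_keyspace(service).items():
        env[to_env_name(key)] = value
    return env


def exec_with_secrets(service: str, command: List[str]) -> int:
    """Run `command` with the secrets injected; return its exit code."""
    completed = subprocess.run(command, env=build_env(service, os.environ))
    return completed.returncode
```

The wrapped program never talks to the secrets backend itself; it just reads ordinary environment variables. This sketch runs the command as a child process for simplicity, whereas a real wrapper might replace its own process instead.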
Beyang: So, so, Oh, go ahead.
Evan: I guess I was just going to say that there's no real other magic there. There are other things that I think some teams use to auto-rotate their secrets and things like that, but really Chamber's kind of the core piece that just interacts with, you know, the SSM backend or the S3/KMS backend or whatever. I think there's another story to be told about how you actually then manage the versions of the secrets, which isn't necessarily what Chamber does. But in terms of just getting at the secrets, I think it does a pretty simple, good job. You know what it is, with security, you just want simple things that do one small thing, right?
Beyang: Yeah. It's kind of a Unix philosophy.
Evan: there you go. I think our security will be pretty proud.
Beyang: Nice. So, you know, a really dumb question. Let's say I have an application that's deployed with Kubernetes. I have some secrets that I want to pass to the various binaries that comprise the microservices architecture. Why wouldn't I just pass those secrets directly, you know, through Kubernetes' built-in secret support, and ultimately as environment variables directly to the deployments? Why should I opt for something like Chamber?
Evan: You know, that's a good question. Just to say upfront, I think there's an open debate, especially internally at Segment, as to whether we will use the Kubernetes secrets backend because, um, I think there was some contention around just how they're stored, the keys that are used to encrypt them, and how those keys are managed. I haven't had that conversation, but I think in general the idea is that we would want to use the secret store that's built into Kubernetes. Like, why go against the grain or try to insert some new functionality or new dependency when there's maybe a perfectly good solution there? That said, if and when maybe it falls short, or maybe you want to have, let's say, compatibility between your existing deployment environment and Kubernetes, then you might want to make sure that you don't have to do anything magical to make secrets work along the way, or that it doesn't break things. And so what that means, I think, in practice is you'd see a lot of Dockerfiles that are written with an entry point that's just chamber exec. In order to run that Docker image, you would just need some basic AWS credentials. If you start there, then you have a Docker image that is deployable on both Kubernetes and some other environment, right? As long as you've provided the right AWS credentials that can then read that secrets namespace. So I think that might be a story for when you would not want to use Chamber, or when you would want to use Chamber across the board. But, um, honestly, for me, just personally, I would rather just use what's built in, or what goes least against the grain.
Beyang: Got it. I think that's in general a good rule of thumb to have in mind.
Evan: Yeah, I mean, I think that applies. I can think of a ton of different cases too, where, you know, it's like, just use sidecars, or, you know, just expose port, blah. And it's like, no, I really want to use this over TLS, you know, on a different port. And it's just like, maybe you don't. Maybe don't. Or it's like, oh no, I want my health endpoint to be healthz, not health, or health A, B, or whatever. It's like, I don't know, maybe just use the standard one. But I think of metrics, like the Kubernetes and Prometheus metrics stories. Yeah.
Beyang: Yeah, definitely.
Evan: Expose the metrics endpoint and then you're done.
Beyang: I feel like you see this sort of thing as the team grows. You have a proliferation of different practices and standard operating procedures that at some point becomes untenable. Before you have an officially blessed secret management system or dependency management system, people roll their own because they need to. And then at some point you have to sit down and have that tough conversation of, okay, which one are we going to standardize on?
Beyang: At some point it's going to come full circle and we'll all use Makefiles again, right? Yeah.
Beyang: What do you think is the role of a developer infrastructure or developer tools team then? Because on the one hand, you want to meet folks where they are and, you know, see which tools they're adopting and adjust to that. But on the other hand, you don't want everyone to adopt whatever tool they want, because that causes organizational issues.
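As an illustration of "just use the standard one", here is a minimal Python sketch of a service that exposes a conventional health path and a plain-text metrics path. The exact paths (`/healthz`, `/metrics`), the metric name, and the port are assumptions picked for the example, not anything Segment-specific, and the metrics body only approximates the Prometheus text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Single in-process counter, kept deliberately simple for the sketch.
REQUEST_COUNT = {"total": 0}


def route(path: str) -> Tuple[int, str]:
    """Return (status, body) for a request path."""
    REQUEST_COUNT["total"] += 1
    if path == "/healthz":
        return 200, "ok\n"
    if path == "/metrics":
        body = (
            "# TYPE http_requests_total counter\n"
            f"http_requests_total {REQUEST_COUNT['total']}\n"
        )
        return 200, body
    return 404, "not found\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = route(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling `serve()` starts the server. The point of sticking to conventional paths is that off-the-shelf health checks and metrics scrapers can be pointed at the service without per-service configuration.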
Evan: Right. Yeah. Um, you know, I think that's a really important question. I think it really comes down to how you want to define your job as a tooling engineer. I think it's really important, just up front, for there to be somebody investing in the productivity of all your engineers. It's just a really important thing for organizations to do, even if it just means there's somebody who looks at what every team is doing and then says, hey, there's a tool that can generally handle the 90% case for deploying code, and it's going to save you some time. That's the starter, and that's pretty broad. So to answer the question in maybe a simple way, I think there are a few things, like standardization. If there's a standardization story that's important to your organization, meaning it is important for all of us to be using this scratch image to build all our Docker containers, okay, from a security point of view, that's great. Or if it's like, we all need to be on at least this patch version of this library. Okay, then for stuff like that, well, we should be building tools and processes and procedures so that it is not a monumental effort for all of our developers to go down that path. Because, I mean, at the end of the day, that's not productive. You know what I mean? It's safe, it's what the business wants, but it doesn't help our users. It doesn't deliver value to our users. So all that stuff is a distraction. So what I think is, you know, how do you build systems and put tools in place that allow those things to have less friction? How do you just remove as many of those monumental zero-sum migrations from the table upfront? And I think that's really what tooling is about. How do we reduce the friction for just doing your day job, and ultimately give developers the ability to do more developing and not, like, hitting enter and waiting? That's really what it comes down to. And I think it's different for every organization. You can just look at what language you are building most of your software with, and that would probably dictate a lot of what tooling works on. Or it's like, hey, how often are you doing database migrations? Maybe you shouldn't be doing them as often. Let's just ask ourselves how often database migrations are being done, and okay, they're probably pretty important. Well, then there's probably some tooling effort that needs to be spent on making sure those things go smoothly, and on just managing infrastructure. Or maybe you just look back and you say, hey, these are the things that have cost the business the most in terms of outages, or issues for our customers, or maybe churn from our customers. Those really affect the bottom line the most, and so that's what we want to invest our tooling efforts on. I think it's different for every organization, but for Segment, I think really we're relatively small still in terms of just number of engineers. But I do think that we
Beyang: How many engineers do you have?
Evan: We're over a hundred. I think we're around a hundred-ish, last I checked. So yeah, less than 200, far less than 200. I don't think we're too much more than a hundred, maybe. Anyway, I think we're at kind of an inflection point where we know we're going to grow, we're going to, you know, continue to grow steadily. So then it becomes, when we hire new engineers, how do we get them to be as productive as possible as soon as possible, so they can deliver value to our customers as quickly and as easily as possible? And they aren't sitting there spending their time going, "Hey, wait, how do I log into this machine and look at the file that's supposed to be there and isn't?" And it's like, well, all right, maybe the process we have for deploying these things is not right, so maybe we should revisit that. So that's maybe a long-winded way of answering your question, but it's very nuanced, I think, from every organization's point of view.
Beyang: Yeah, totally. Can you talk more about that onboarding, uh, scenario? So when you hire new engineers, uh, onto your team, uh, how, how do you onboard them effectively and give them the context they need to become productive as soon as possible?
Evan: So you mean tooling specific, or maybe just, uh, okay.
Beyang: or in general. Either way.
Evan: yeah, I think, I think, well in general, um, for anybody, a new engineer, I would hope that, you know, when you sit him down. You can give him like a, like let's just talk two different things. Like we talked about, like they're working, you know, their laptops a different thing, or they're, you know, like they're working, let's say that that's all just like dialed in to their, to their settings and they, you know, they have access to all the systems they need to and all their SSH keys are in a good place. Fine. Okay. And so now they're in like a normal square one. Okay. So, but getting it, uh, I guess adjusted to the code bases and the things you need, the tooling then is, is like. How do we reduce this to the simplest number of things as far as, especially like when we're talking about the number of microservices or they're just, it's just the number of things to keep in your head. I mean, can we just narrow that down to the smallest amount as possible for me to be productive? Um, this is like every team is going to have a different point of view, but let's just hope it's, you know, like I understand all the services that my thing interacts with and I understand. All the different, um, you know, dependencies that we have. And, um, now I need to understand the different components of the code and then understand like how all these pieces of code interact with themselves and really like that might be one code base, but not all. It might also be five code bases or, um, you know, like our team owns service X, service wide service and you know, like, then it's sort of the same process for. You know, for each of those, and hopefully they're in the same kind of domain at least. So it's not crazy. And they're like, maybe written in the same language at least. Maybe. Um, so you're not, you know, they use the same patterns and you know, I think that that kind of kind of. Sets it all up and it starts with the code. 
You sit somebody down and you say, okay, this is what our team owns. This is the source code. This is the software that we own. So familiarize yourself with it, and maybe try to fix a few bugs here and there. But then you go and familiarize yourself with the tooling that this service or these services depend on, and that tooling is probably: How do you run your tests? How do you debug your tests? How do you build your code? How does your code get deployed? I would hope that it would end there, but there are also some other things too. I feel like the teams at Segment are pretty autonomous. They end up managing a lot of their own infrastructure, with the asterisk that they lean on the managed services of AWS, primarily. But that still means I might need to know how to install a new database schema, or how to enable encryption at rest in my database, things like that. Or, hey, maybe just provision a new database. That's a thing. So, yeah, to summarize, it probably starts with the code base, then goes down the path of testing and the tools you use to test, and then deploy, and then maybe even operate on the infrastructure.
Beyang: Got it. So it starts with the source code, because that's what you're going to be working on day to day as an engineer, and then you kind of walk through the entire software development life cycle: building, testing, ultimately shipping.
Evan: Yeah. I would say that's our hundred-person sort of point of view. Now, if you're in a thousand-person organization, with even more dimensions and degrees to which you have to have cooperation among the platforms and the tooling, then it gets a little bit more dicey. You just become more specialized. You stop caring about how things actually get tested and you just hope it gets tested. It's kind of what I was getting towards, where you're like, okay, I don't have to write my Dockerfile anymore, right? And it's like, no, you don't. Because if you did, you'd do it in a way that wasn't satisfactory. You don't need to, it's pointless. We do these things for you. Actually, you know what, we don't even use Docker anymore. You didn't even know. I don't know, it's something crazy like that, right? It becomes more and more specialized as the size of the org grows. But you also just need more standardization too, right? If you're going to provide an internal build platform for thousands of engineers, it can't be snowflake-y. It can't be like, your unicorn gets run here, and you get your own special instance of Jenkins. It's like, ah, no, I don't care. Just run make test, please, and then show me the logs. Just do that when I push code, please. So yeah.
Beyang: How big was the engineering team when you first joined?
Evan: I believe it was in the seventies. 70-ish. Yeah.
Beyang: Okay. Got it. And between then and now, have there been any big changes in tools, best practices, codebase structure, that sort of thing? Or has it been pretty stable?
Evan: I wouldn't say the codebase structure and development tools necessarily, but the ways in which we deploy our code and test our code have changed quite a bit. We're really hoping to do more Kubernetes-based deployments soon, and I think there are a lot of things that kind of have to change. But we're not trying to just upend the world. There are a lot of things we're hoping to get out of the Kubernetes integration. From an operational point of view, it becomes a lot simpler. And we get better isolation stories. For our customers, we have some efforts going on to create isolation at the infrastructure level. I mean, you're talking about how data flows through Segment. A lot of companies don't want their data flowing in any of the same channels that another company's data is flowing into. So I think there are some really cool stories. Let's say you have your own Kubernetes cluster that runs a miniature version of Segment, just in its own little world. That would be cool. I think it's a little bit easier to tell that story with Kubernetes, and it's just easier to manage than it is with arbitrary Terraform things. But, um, yeah.
Beyang: How do you bridge that gap between production and development? With the big microservices landscape, obviously you're probably not going to run every single service in development. And you mentioned earlier that you have the staging environment. Is that what people rely on to test things out before they push things closer to prod, or is there additional support in the development environment where you try to replicate some of that?
Evan: Oh yeah. It just depends on the service. How do you bridge that gap, or, I guess, how do you reduce surprises when you go to production? That's kind of what we're trying to get at, right? So it depends on your services. Let's say you have a direct dependency on some data store, and not only that, but you have some user interaction that, in order for you to test it, all of a sudden needs some supplementary data. So you need this big, complicated data fixture. There are some teams that have these very, I'd say, nuanced and complicated dependencies. And I'm just going to leave it at that, because in my mind, as the tooling engineer, I want to simplify and reduce the variability of all of it. But let's just say that those exist, and they are very real, and they have these very specific use cases. In my world, I would just prefer you to isolate it as much as possible and reduce all of the variables before you even go to staging. But let's just say that you needed to test it in staging with very specific data. I would say that it should sort of be like a cache. You should be able to drop it and then recreate it at any time. That's a good property to have as a staging environment. And you'd have a way to simulate things over and over again. I think that's just another good property to have.
And so if you have those two things, then cool, you can sort of have this kind of reliable test bed. But ultimately, if it's not production, it's not production, so you're never going to be able to completely bridge that gap. So I kind of think that you just really have to embrace the whole incremental deployment strategy. That's really it. You reduce that gap by not having a gap. I mean, that's how I would have done it in the past at other jobs. It gets really complicated if, instead of defining your boundary in your code very well, you say, oh, I always depend on a real database connection, because that's always going to be there, right? If you can't have very strict dependency boundaries, then things just get really difficult to test, and you can have a lot of other problems. So, I think your other question was about how things have changed since I joined, just in terms of tooling and deployment. I would have to say that there are two big things, really. There is the Kubernetes thing, which in my mind is just a blessing. I have some qualms with some of the things in, like, EKS versus GCP's offering. But that said, it's making a lot of things way simpler. And then there are the Lambdas, the serverless movement. Before I joined Segment, there were folks working on our internal deployment tool, which does some things that are very AWS-specific, and even CloudFormation-specific. For better or worse, it does a good job just deploying application code. But we made no considerations whatsoever for deploying Lambdas.
And so now it's like, how do we tweak the paradigms there for it to work with what I would hope are arbitrary artifacts and delivery mechanisms? It's actually not that simple. But those are, I think, the biggest things that I see right now that have changed. Other things have been relatively stable. There's always just the tool du jour for the front-end things that I don't pay attention to,
Evan: you know, not trying to downgrade them, but I just don't keep up with them. So I'm sure that those things have changed quite a bit. But for the most part, I think, yeah, those are the two biggest shifts I've seen.
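Evan's point about defining the dependency boundary in code, rather than always depending on a real database connection, can be illustrated with a minimal Python sketch. The names `UserStore`, `InMemoryUserStore`, and `can_export`, and the "business" plan rule, are hypothetical and chosen only for illustration; nothing here is taken from Segment's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class UserStore(Protocol):
    """The boundary: the only thing the service logic knows about storage."""

    def get_plan(self, user_id: str) -> Optional[str]: ...


@dataclass
class InMemoryUserStore:
    """Test double that satisfies UserStore without a real database."""

    plans: Dict[str, str] = field(default_factory=dict)

    def get_plan(self, user_id: str) -> Optional[str]:
        return self.plans.get(user_id)


def can_export(store: UserStore, user_id: str) -> bool:
    """Hypothetical business rule; it never opens a database connection itself."""
    return store.get_plan(user_id) == "business"
```

Because `can_export` only sees the `UserStore` interface, it can be exercised locally with the in-memory store, and a database-backed implementation can be swapped in for staging or production without changing the logic under test.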
Beyang: Makes sense. So, on the topic of Lambdas and Kubernetes, when you listen to people talk about Lambdas and also Kubernetes, one question that comes up from folks like myself, who are maybe a little bit naive about the big picture, is: when would I use Lambda versus Kubernetes? And in your mind, is there a class of services that Segment has that fits better into one model or the other?
Evan: I think the emergence of Lambda is really a result of things like scheduled tasks or these offline processing things, rather than core services. In my experience, personally, what I've been using Lambda for is more like, for example, I was just deploying a Lambda recently to basically do auto scaling, because there are some things within AWS, lifecycle hooks and things like that, that just don't offer the resolution you need to respond to things you might want to auto-scale against. So there's a Lambda that wakes up every 10 seconds, checks some metrics, and then reacts, basically adjusting some auto scaling groups. So I think scheduled tasks, or things that are super stateless. They're effectively stateless, they can just wake up, they don't need to build up any sort of preconditions, they can just do one thing, and they have a very discrete set of inputs. That's really what I think Lambda's good for. There are also a lot of things that really become simpler from an infrastructure point of view. Let's say logs get ingested, so a new file gets dropped into an S3 bucket. Well, cool, now I can get triggered and just process that right away, and I can do that any number of times based on how many things get dropped in the bucket. So it's nice to decouple a lot of these infrastructure components without having to deploy a giant queue in the middle. So that's another thing I've seen.
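The S3-triggered case Evan describes, a stateless function with a discrete set of inputs, might look roughly like the following Python sketch. The `Records` / `s3` / `bucket` / `object` layout follows the S3 event notification format; the function names and the returned dictionary are assumptions made for illustration, and the actual log processing is left out.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event: dict, context: object) -> dict:
    """Lambda-style entry point: no state is carried between invocations."""
    processed = []
    for bucket, key in objects_from_s3_event(event):
        # Real log processing would happen here (omitted in this sketch).
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Everything the handler needs arrives in the event, so any number of copies can run in parallel as files land in the bucket, which is the decoupling Evan mentions.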
Beyang: So is the rule of thumb that you would apply: if it's a server-like thing that always needs to be on and is fairly core, use a binary deployed within Kubernetes; if it's something that's more event-driven or like a one-off task, that's more of a Lambda candidate?
Evan: Yeah. I mean, I think it's pretty clear right away whether or not your process is like a daemon, a daemonized process that needs to keep some state and, like, keep a connection. Your state might just be a database connection. It might just be some UDP type of thing, where it has a socket of some variety. Or it might even be things in memory, like a big data structure used to make decisions, or it's building one up over time to, like, dedupe something. Stuff like that would probably be way better for a daemonized service in Kubernetes versus a Lambda.
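As a rough illustration of the "data structure built up over time to dedupe something" that Evan mentions, here is a small Python sketch of an in-memory deduper. The class name `Deduper` and the fixed-capacity eviction policy are assumptions for the example. The point is only that this state lives in the process's memory, so it suits a long-running service better than a function that starts fresh on each invocation.

```python
from collections import OrderedDict


class Deduper:
    """Remembers the most recent message IDs in memory, up to a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._seen = OrderedDict()  # message_id -> None, oldest first

    def is_new(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for a remembered duplicate."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # mark as recently seen
            return False
        self._seen[message_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```

If the process restarts, the remembered IDs are lost, which is exactly the kind of state a daemonized service holds onto and a short-lived function does not.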
Beyang: Makes sense.
Beyang: Uh, let's chat about ChatOps for a bit.
Beyang: ChatOps is kind of this idea of moving the interface for interacting with ops and infrastructure out of the traditional GUIs and CLIs that you use to introspect the state of the infrastructure, and moving more of that into these kinds of chatbots that you can interact with. Is that kind of your view of it as well, or am I leaving stuff out here?
Evan: I think that's pretty close. I just think the reality is that we're spending just as much time in our chat client as we are within our terminals or coding environments. So rather than go to a webpage to click a button and do a thing, you might as well reduce the number of places you have to do those tasks within, and keep it all there. I think that's kind of the premise. Although, I think there are some workflows that are super heavy-handed, like 12-step processes. Do you want to onboard a new employee? Create and provision their home directories and, I don't know, I'm just imagining all these things. Let's just run a script for that and check back later or something. There are things that I think maybe go a little overboard. But I think there are really cool opportunities to lower the cognitive overhead, or just reduce the number of context switches. And there's also a cool side effect if you're doing compliance type of things, where you just need a paper trail of sorts. As it turns out, going through some change control process or whatever, you can do it, but it's going to be a pain, and you can probably get by with just messaging one little thing to a chatbot. And it's like, cool, I will record that this happened, and I will automatically create a commit for you, and then that will be pushed, and all the deployment things will take place after that, or whatever.
But it kind of starts with just the lowest barrier to entry in terms of communicating that into some system.
Beyang: Does Segment have ChatOps-like things internally that your team has developed and people find useful?
Evan: Yeah, we do. I'd say Segment is just very young in the whole ChatOps world. There are some things that we have chosen to experiment with, and they have been great and turned out pretty well. And I think there are others, not to be named, some experiments, kind of the things I was alluding to, that don't end up being that useful for ChatOps or whatever. People don't end up actually getting a lot of value out of them. But the thing we do use it for is our Sev incident management process. Early on, or at least when I joined, we were using a third-party tool to manage our incidents. This might be an outage, or, let's say, some service is degraded, and you just need some light coordination around an ad hoc, maybe small, medium, or large emergency. We call them Sevs. It usually involves creating a new channel in chat so that people can talk about it in a focused way. Somebody gets nominated as the incident commander so that they can focus on comms and coordination. And then there's usually an engineer or two or three who are actually investigating what's going on.
Beyang: That's really cool, actually. So it's, okay.
Evan: And what's cool is that when you decide that something's happening, or maybe you're being alerted in the first place, and there's a sort of lightweight discussion like, hey everybody, this doesn't look good, should we just call a Sev? Yeah. You just type /sev, it prompts you for three things, you hit enter, and then you have a channel created for you. You have people being pinged along the way. It's sort of asking you for updates in a casual way. What's also cool is, once you're in a Sev, you can just say mitigate, or you can say escalate. Basically, it starts at a level three, and then if it gets more and more severe or takes more time, you can escalate it, and then it becomes more and more, I guess, prominent to the powers that be. And what's cool along the way is that there are these other things that get created too. There's an incident review for all Sevs of a certain level that get created. Then there's some documentation around a postmortem that gets created, and there's just a lot of tracking and accounting that gets created, which is great for not only reviewing and understanding how often you're having these types of Sevs, but also categorizing them for the future so you can learn and really prevent these things from happening ever again. So it's the whole SRE mentality of always iterating and always learning, and the blameless thing. So, yeah, that's one of them. We decided to build our own because we just wanted something really simple, and all the third-party things came up and there was always something that just didn't quite fly with somebody.
They're always like, well, I don't like how it nags me. Or it always creates this document in this way and we have no way to customize the template. Or it assumes that we're going to do this and that, where it assumes we're going to have a postmortem for everything, and maybe we don't want a postmortem for anything except Sev 0s. Stuff like that. So it ends up being pretty simple, I think. Shout out to Anastassia, who worked on this the most. Just coming up with a really simple solution to this, and it being what I would call a super great candidate for this ChatOps thing. I think it also came from the fact that the SREs review the process pretty regularly, every six months or so: what do you like, what do you not like? And one of the things they called out was, I just really don't like the Sev process stuff, it really needs this, this, and this. I think all the SREs came together and were like, I wish it did this, I wish it did that. So we had a bunch of information about what the ideal tool would look like. We just needed it to exist. It was very well defined and well scoped, so it just seemed like they should be able to knock it out of the park. But I guess I should say there's a bit of an asterisk, because I'm sure that other organizations like Stripe, who I know rely more heavily on ChatOps, define this thing a little bit differently. I know, for example, they at least used to deploy code with slash commands and things like that. And I forget the level at which GitHub, the organization, integrates ChatOps into its workflows, but the whole Hubot thing has been around for a while. Yeah.
So I think there are kind of varying levels. We're sort of talking about business workflows here. Well, I guess a lot of them are engineering-oriented, but the other tool I was going to mention is our feature flagging tool, which is generally managed by engineers. But I can imagine, if you're a data scientist, you might want to manipulate an experiment, which effectively is some feature flags. So you might have people who are not technical, or not necessarily engineers, working with these tools. That eventually might be defined as ChatOps, I guess. So maybe it's just a little nuance that comes with an asterisk for the very pedantic people out there, but I don't know.
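The Sev workflow Evan walks through (open an incident from chat, escalate or mitigate it, and only require a postmortem at certain levels) can be sketched as a small state object in Python. This is not Segment's tool. The three prompts are not specified in the conversation, so the `title` and `commander` fields are placeholders, and the sketch assumes lower numbers are more severe, with Sev 0 as the most severe level.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sev:
    title: str      # placeholder field; the real prompts are not specified
    commander: str  # placeholder for the nominated incident commander
    level: int = 3  # per the conversation, a Sev starts at level three
    status: str = "open"
    log: List[str] = field(default_factory=list)  # paper trail of actions

    def escalate(self) -> None:
        """Move one step toward Sev 0 (assumed to be the most severe)."""
        if self.level == 0:
            return  # already at the most severe level
        self.level -= 1
        self.log.append(f"escalated to sev{self.level}")

    def mitigate(self) -> None:
        self.status = "mitigated"
        self.log.append("mitigated")

    def needs_postmortem(self, max_level: int = 0) -> bool:
        """Configurable threshold: by default only Sev 0s need a postmortem."""
        return self.level <= max_level
```

Making the postmortem threshold a parameter mirrors the complaint about third-party tools that assume a postmortem for everything; a chat command handler would simply call these methods and post the `log` entries back to the incident channel.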
Beyang: Yeah. So we're kind of coming up on the end of the time here, but as a final parting question, I wanted to ask you: what are some of the new tools and technologies that you're excited about that you think the rest of us should check out?
Evan: Hm. That is a fun question. Well, let's see. I've been pretty pumped about IPFS for a while. I just have faith in the whole distributed internet model. I think that's going to be cool. It's not really a tool, but it's a technology that I think should have more attention. In terms of tools, I really like the Knative sort of effort. I think it was originally under Knative, and now it's Tekton. And just the things that are going on in the rootless Docker building world. There are a bunch of projects; some of my former peers at Uber have done a lot of work around that. Makisu, and, I think, what was it, Kraken? I think that's another one. These are tools to manage Docker containers, peer-to-peer Docker registries, things like that. But yeah, I just really think that the cloud native approach to doing infrastructure, describing the state that you want the world to be in and letting these highly available, distributed systems just take care of the rest, is really a fascinating thing. When you kind of go nuts with it, you could even say, okay, instead of pushing to, I don't know, a Git server, I'm going to push to my Kubernetes cluster. What does that even mean? Like, it's going to speak Git over SSH, it's going to be sweet. And you're like, oh my God. So then the code is going to be in Kubernetes, and the code can get built in Kubernetes, and, oh my God, because it's already there, can we deploy it in Kubernetes then? Like, it never... just everything goes to Kubernetes. Let's just live in Kubernetes. Let's build our houses in Kubernetes.
Let's, I don't know, you can go pretty far with it. But I think it is pretty exciting to think about from a developer's point of view, where you offer up some custom resource types. Maybe it's a database thing that you as an infrastructure team have curated and manage, with different versions of it, and you just give them a resource type and some ways in which they can configure it in a declarative way. Then you just let them throw that over the fence to your Kubernetes cluster, and they get a database. That's kind of cool. Instead of having to file a ticket or know anything about how distributed systems work, you really just say, hey, I want this thing. Their ticket, their request, is the YAML that they are shoving into your Kubernetes cluster. That's pretty neat. It gives a nice separation of concerns for developers versus infrastructure folks. So, super excited about that.
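The declarative idea Evan describes, where a developer states what they want and the system works out how to get there, comes down to comparing desired state with actual state. The Python sketch below is a toy illustration of that comparison, not the Kubernetes API: the `reconcile` function, the resource names, and the spec fields are all hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# A developer's "ticket" is just the declared state they want to exist.
desired = {"orders-db": {"engine": "postgres", "version": "12"}}
actual = {}
print(reconcile(desired, actual))
```

In a real cluster, a controller owned by the infrastructure team would run a loop like this continuously and carry out the resulting actions, which is where the separation of concerns between developers and infrastructure folks comes from.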
Beyang: Awesome. My guest today has been Evan Culver. Evan, thank you for being on the podcast.
Evan: Well, thanks for having me. It's been fun.