Presenter: Deval Shah
Liveblogger: Farhan Attamimi
How Reddit built its ad-serving system using Go, and the lessons learned from the process.
The Reddit engineering team recently introduced Go into their stack to write a new ad-serving system to replace a third party system. Deval Shah talks us through the architecture of the new service, the Reddit team's experience using Go for the first time, and all the lessons they've learned from using Go to build this ad server.
Reddit is the frontpage of the internet. It's a social network with tens of thousands of interest communities, where people go to discuss the things that matter to them.
Reddit by the numbers:
Any system that Reddit builds must scale to handle this level of traffic.
The ads server handles the entire ads flow. Everything from the selecting advertisements to show to any post-processing after the ad is shown to the user, is handled by the ads server.
There are several requirements for the Reddit ad server:
Before, whenever a user went to reddit.com, the reddit monolith backend would send a request to a third-party ad server. The third-party server would respond with one or more ads that it selected, and that gets returned to the user.
After a while, they realized that continuing to use the third-party ad server wouldn't work for them moving forward because it was:
decided to build an ad server, built a team of 3 people. started with infra, then wrote the services, and then rolled it out to prod. the system it is now in production.
Some notable tools used in the ad-server infrastructure:
This is the architecture of the new ad server:
A brief overview of how it works:
getAds, which handles getting and returning the ads to be shown to the user. The ad selector then calls the enrichment service.
Kafka provides data to two Apache Spark jobs:
In this architecture, the Go services are:
The ad selector:
Some other Go tools and services at Reddit that won't be covere in-depth:
This is the first experience Reddit has had with Go. Deval says the experience has been great so far. The effort started with two to three engineers using Go, and it has now grown to around a dozen engineers working on Go.
The main advantages they've seen with Go are:
Finally, Ad serving latency decreased drastically: response times dropped from 90ms to under 10ms.
This is a set of 5 problems faced, how Reddit dealt with them, and the learnings from these challenges.
Reddit had prior experience doing this for Python, but not in Go.
The initial prototype worked, by way of lots of StackOverflow reading and Googling, but it was clearly not going to scale with developers.
Some issues they saw were:
They realized the Go community had solved these problems, so they looked at existing frameworks that had solved these problems. Some options they encountered:
They decided that Go-Kit made the most sense. The main reasons Reddit picked Go-Kit were that it:
Go-Kit @ Reddit. This is a diagram of the enrichment service using Go-Kit:
There are some things to note about this architecture. The center service has 2 implementations: an in-memory implementation (this was good and used for the prototype), and a RocksDB implementation which was used in the production implementation. The in-memory implementation still exists for local development.
There are several middleware layers: tracing, logging, and metrics. Finally, the Thrift transport is at the top level. This structure makes it easy to make changes. For example, if they wanted to change the transport layer from Thrift to gRPC, they'd only need to change the top layer.
Using Go-Kit was beneficial because it gave the team a good example on how to structure Go code. They didn't have experience in this before, so using Go-Kit was helpful for understanding the typical structure for Go services.
Lesson 1: Use a framework/toolkit. Not necessarily for everything you use Go for, but for production services that require metrics, logging, and so on, use libraries that have solved the problem rather than trying to do it yourself.
The ultimate goal was to roll out the new ad server with minimal impact to Reddit users, paying advertisers, other internal teams reliant on the ads team. The third party ad server was a black box, and Reddit needed a way to iterate rapidly, learn, and get better.
It was like changing airplanes mid-flight. They slowly added the new infrastructure around their third-party service, and when it was ready, they would rip it out.:
How did Go help with this? Go allowed them to make the move to the new ad server safely and easily, aided by these Go characteristics:
Lesson 2: Go makes rapid iteration easy & safe.
Once the new ad server was deployed, they did see some slowness, network glitches, bad deploys etc.
pprof is great if you know exactly which service is having issues. Distributed tracing, on the other hand, gives you visibility across services. They didn't have support for distributed tracing on the ads side, but they did have support for it elsewhere on Reddit's stack.
Why is tracing useful?
Tracing is usually easy. You have a client and server. On the client side, you extract trace identifiers, and inject them into the request youre sending to the server. On the server side, when you get a request and identifiers, you put them into a context object and pass them around. This is very straightforward using HTTP and gRPC, and there's no reason not to do this.
But, reddit was dealing with Thrift, so they ran into some problems.
They took a look at Thrift alternatives, Facebook Thrift and Apache Thrift. The two key features they were looking for were support for headers and context objects:
They tried using FB thrift but there were some issues, mainly that the lack of a context object required messy workarounds, leading to messy code and complications. In Apache thrift, the context object was supported, but it doesnt have support for headers. So, the solution: add headers to Apache Thrift. This has been done for other languages, but not for Go. So, they added THeader to Apache Thrift. This means context objects are now supported, and headers can store trace identifiers.
If you want to see these changes, you can check out https://github.com/devalshah88/thrift. Deval hopes to get the changes through the contributing process and merge it upstream.
Here's a look the tracing code. The client wrapper just extracts out trace information from the context object, and adds it to the headers:
The server wrapper takes information from the headers and injects it into the context object so it can be passed around:
This code is from https://github.com/devalshah88/thrift-tracing.
Having done all this work, distributed tracing proved to be very useful in debuggging latency issues. The takeaway, however, is lesson 3: Distributed tracing with Thrift and Go is hard.
At Reddit, they want systems to handle slowness gracefully. They never want users to suffer, so if there is slowness, Reddit would rather not show ads than degrade the user experience.
The two goals they had are:
Use the context object to enforce timeouts within a service: This is the code from the enrichment service of adding a deadline to context object, passing it through, and exiting early if deadline expires.
This result of this is good, but not enough: The first graph shows how long it took to get responses from the enrichment service. This particular time frame had some slowness, but it did not let users wait longer than 25ms.
The second graph shows that on the server side, the enrichment service was processing the request for up to 70ms, so the server was wasting resources doing work after the client had already timed out and didn't need a response anymore.
What typically would be done is to propagate deadlines with HTTP. This code adds a timeout, which is passed to the server through the context object:
Thrift makes this hard. There is no context object used here. If the client times out, the goroutine doesnt know that and doesnt exit:
This is not great, but there are ways of fixing this:
One option is to add a deadline to the request payload. The client needs to include the deadline in the request. The server would inject deadline into the context object, and use it. This wasn't great because this change had to be made in all endpoints.
Instead, they passed the deadline as a thrift header. This is similar to how they pass trace identifiers. After this change, on the enrichment server side, they saw latencies similar to client side:
Lesson 4: Use deadlines within and across services.
Rapid iteration and complex business logic can lead to performance issues. The ad service team needed processes and tools to ensure they could move fast without violating the latency SLA. To do this, they made use of load testing and benchmarking.
Load testing using bender:
This is what using you'd get in response from Bender:
Load testing is really useful for testing changes under heavy load, and lets developers optimize new features for high load before pushing to production.
They also make use of benchmarking for all critical systems. This benchmarking code: Gets you this output:
Benchmarking helps by:
Lesson 5: Benchmarking and load testing is easy. Do it!