GopherCon 2018 - Building and Scaling Reddit's Ad Server

conference, golang, gophercon2018, notes

These are some notes from my experiences at GopherCon 2018. I don’t expect them to be laid out in any particularly useful way; I’m mostly taking them so I can remember the bits I found most useful later.


Note: Reddit’s mascot is called “Snoo”

Ad Serving Goals

  • Scale

  • Speed

    • SLA: < 30ms p99
  • Auction in real time for every request

  • Pacing: don’t blow budgets inappropriately

Before Rewrite

  • Call out to 3rd party for each request

    • Slow
    • Not customizable
    • No observability

Current System

  • Tech: Prometheus, RocksDB, Kafka, Thrift, and other stuff

  • Call to AdSelector for every request

    • call to enrichment service for more data (about user, request, etc)
    • select ads based on enriched data
    • log to Kafka for reporting, billing, monitoring, etc.
  • Browser fires a tracker event

    • Service writes impression to Kafka
  • Background service (Spark) reads from Kafka and writes back to the enrichment service

  • Background service (Spark) reads from Kafka and writes pacing data back to AdSelector

AdSelector
  • Business Rules
  • Auction
  • Horizontal Scaling
EventTracker
  • SLA <1ms p99
  • Reliable
Enrichment Service
  • SLA <4ms p99
  • gorocksdb (Go bindings for RocksDB)
  • prefix scan to find data to return
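
The prefix scan follows the usual RocksDB iterator pattern: seek to the first key at or after the prefix, then read forward while keys still match. Since gorocksdb needs the RocksDB C library, here is the same idea sketched over an in-memory sorted key list (the key layout is invented):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixScan illustrates the enrichment lookup pattern: seek to the
// first key >= prefix, then scan forward while keys still match the
// prefix. In the real service this is a RocksDB iterator (gorocksdb);
// a sorted in-memory slice stands in for the keyspace here.
func prefixScan(sortedKeys []string, prefix string) []string {
	i := sort.SearchStrings(sortedKeys, prefix)
	var out []string
	for ; i < len(sortedKeys) && strings.HasPrefix(sortedKeys[i], prefix); i++ {
		out = append(out, sortedKeys[i])
	}
	return out
}

func main() {
	keys := []string{"user:1:geo", "user:1:interests", "user:2:geo"}
	fmt.Println(prefixScan(keys, "user:1:"))
}
```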
Other Tools
  • Reporting
  • Admin

Benefits of Go

  • Increased velocity

    • Strong conventions
    • Fast compilation
  • Performance is great

  • Easy to focus on business logic


Lessons Learned

Getting Production-Ready

  • Logging, metrics, etc all over the place
  • Changing transport layers was hard (thrift)
  • Considered go-micro, gizmo, go-kit
  • Picked go-kit for flexibility (see Peter Bourgon’s talk at GopherCon 2015)
  • Use a framework/toolkit

Deployment

  • Need rapid iteration to roll out safely (keep the old service running along the way)

  • Started with a blind proxy to the 3rd party, so A/B tests were possible

    • Start by still returning the old result sets
    • Log results and do offline analysis for comparison
    • Then switch over when ready
  • Go makes rapid iteration easy and safe

Debugging Latency

  • Distributed tracing

    • Client: extract identifiers and send along w/ a request

    • Server: extract identifier from request and inject into context

    • Issue with tracing: Thrift didn’t have headers

Handling Slowness / Timeouts

  • Add timeouts to context (and check them) on the client side

  • The server also needs to avoid doing unnecessary work

    • Pass deadlines around
    • For Thrift, include the deadline in request headers; the server then parses it and injects a timeout
  • Use deadlines within & across services

Ensuring New Features Don’t Hurt

  • Need to make sure SLAs don’t get violated by new features/refactors

  • Bender for load testing (from Pinterest)

    • supports HTTP and Thrift, but not yet gRPC
  • Write benchmarks alongside tests

  • Use load testing and benchmarks
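
Writing benchmarks alongside tests looks like this in Go; `eligibleAds` is a made-up stand-in for a hot path worth guarding against regressions. In real code the benchmark lives in a `_test.go` file and runs with `go test -bench=.`; here `testing.Benchmark` drives it directly so the sketch is runnable as a program:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// eligibleAds is an illustrative hot path: filter ads by keyword.
func eligibleAds(ads []string, keyword string) []string {
	var out []string
	for _, ad := range ads {
		if strings.Contains(ad, keyword) {
			out = append(out, ad)
		}
	}
	return out
}

// BenchmarkEligibleAds would normally sit next to the unit tests in a
// _test.go file; run it before and after a change to catch latency
// regressions alongside correctness.
func BenchmarkEligibleAds(b *testing.B) {
	ads := []string{"gopher games", "cat toys", "go conference", "dog food"}
	for i := 0; i < b.N; i++ {
		eligibleAds(ads, "go")
	}
}

func main() {
	// testing.Benchmark runs a benchmark outside of `go test`.
	fmt.Println(testing.Benchmark(BenchmarkEligibleAds))
}
```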