ch5: scaling to the whole internet
date: May 14 2026
1. Three threads
Through the previous chapters we’ve casually said “one IP serves the world” and “Cloudflare’s 1.1.1.1 handles 50 million queries per second.” Time to fill in the engineering. Three intertwined threads run through this chapter, and you can read them as separate questions:
- How does one IP serve the whole planet? Anycast (§2 and §3).
- How does one authoritative server handle millions of queries per second? Per-server engineering (§5).
- How do CDNs use DNS to send each user to the nearest edge, and what does that cost? The steering layer and its privacy trade-off (§6).
Between threads 1 and 2 lives one more question: if the root zone is held up by an unfunded volunteer model that’s seeing 25% annual query growth (Chapter 3), where does the architecture go next? §4 covers the propagation mechanics and the scaling alternatives being deployed in response.
2. Anycast: one IP, many machines
Anycast is a routing trick, not a DNS feature. It works at the BGP layer and any service that fits its constraints can use it.
BGP primer
BGP (Border Gateway Protocol) is how the internet’s routers tell each other which network destinations they can reach. The internet is partitioned into ~75,000 autonomous systems (ASes): each AS is one administrative network, identified by a unique number. Cloudflare is AS13335. Google is AS15169. Your home ISP is some AS, and so is every transit provider in between. Each AS announces, via BGP, the IP prefixes it is willing to deliver packets for. Its neighbors propagate those announcements onward. After enough propagation, every router on the internet has a map of “which AS to send packets toward for which destination prefix.”
BGP doesn’t optimize for speed. When a router has multiple announcements for the same prefix, it first prefers the one with the highest local preference, a knob operators tune for their own commercial reasons, and then, among those, the one with the shortest AS path, the smallest number of AS-to-AS hops to get there. Importantly, “shortest path” in AS-hops is often not the lowest latency in milliseconds, because each AS hop can span continents.
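To make the ordering concrete, here is a toy model of those first two steps of best-path selection. The route attributes and AS numbers are made up, and real routers continue through several more tie-breakers (origin, MED, eBGP vs. iBGP, router ID); this is a sketch of the idea, not a BGP implementation.

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    as_path: list[int]   # e.g. [64500, 64510, 13335]; fewer hops means "shorter"
    local_pref: int      # operator-assigned; higher wins, and it is compared first

def best_route(candidates: list[Route]) -> Route:
    # Highest local preference first; shortest AS path breaks the tie.
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))

frankfurt = Route("198.51.100.0/24", as_path=[64500, 64501, 64502, 13335], local_pref=100)
sao_paulo = Route("198.51.100.0/24", as_path=[64500, 64777, 13335], local_pref=100)

winner = best_route([frankfurt, sao_paulo])
# Equal local preference, so the shorter AS path wins -- here the São Paulo
# announcement, even though (in this made-up scenario) Frankfurt is physically closer.
print(winner.as_path)
```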
How anycast uses BGP
The premise of anycast: multiple physical machines, geographically scattered, all announce the same IP prefix to BGP. Every router between you and that prefix picks whichever announcement it prefers. Your packet ends up at whichever machine “won” at your local router.
This works for DNS because queries are stateless. A UDP query is one packet out, one packet back; there’s no session to maintain. Any node serving the same zone can answer identically, with no shared state to coordinate. If a mid-flight routing change moves the next packet to a different node, the client just retries. It looks like normal packet loss. Compare with anycasting TCP: mid-connection routing changes break the session, because state lives at one node. There are tricks (anycast-aware load balancers, session migration over TLS keys) but they’re harder. DNS got anycast cheaply because the protocol cooperates.
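A minimal sketch of that client-side behavior, assuming the dnspython package (used here and in later sketches purely for illustration): one packet out, one packet back, and a timeout is handled by retrying, whether the cause was ordinary loss or a routing change that moved the flow to a different anycast node.

```python
import dns.exception
import dns.message
import dns.query

def resolve_with_retries(qname: str, server: str = "1.1.1.1", attempts: int = 3):
    query = dns.message.make_query(qname, "A")
    for _ in range(attempts):
        try:
            # One UDP packet out, one back; no connection state at the server.
            return dns.query.udp(query, server, timeout=2.0)
        except dns.exception.Timeout:
            continue  # retry -- the next packet may land on a different anycast node
    raise RuntimeError(f"no answer for {qname} after {attempts} attempts")

print(resolve_with_retries("example.com").answer)
```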
What “closest” actually means
This is the part that surprises people. BGP doesn’t pick “physically closest” or “lowest latency.” It picks “highest local preference, then shortest AS path.” The intuition that anycast sends your packet to the geographically nearest node is wrong in the general case.
A Berlin router might think the São Paulo announcement is “closer” than the Frankfurt one, because of some peering relationship between intermediate networks. The Frankfurt announcement might cross more AS boundaries than the São Paulo one, even though Frankfurt is physically next door. Operators tune their announcements (deaggregating into smaller prefixes, withdrawing from regions where they don’t want traffic, prepending the AS path in some directions) to nudge BGP toward their preferred topology. It’s an ongoing game, not a fixed configuration.
[diagram placeholder] World map of anycast nodes (e.g. SFO, IAD, LHR, FRA, SIN, NRT, SYD, GRU, JNB) all advertising the same prefix. Users in NYC, Berlin, Jakarta, Auckland routed to their nearest node, with one “peering quirk” arrow showing Madrid hitting São Paulo instead of London.
[demo placeholder] “Where does your 1.1.1.1 actually live?” — fire a CHAOS TXT id.server query, reveal which datacenter answered, compare against geolocation.
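One way to build that demo, again with dnspython. The CHAOS-class TXT name id.server is a convention, not a guarantee: Cloudflare answers it with a site code, other operators use hostname.bind or don’t answer at all.

```python
import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

# Ask the anycast address itself which node is answering.
query = dns.message.make_query("id.server.", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
response = dns.query.udp(query, "1.1.1.1", timeout=2.0)
for rrset in response.answer:
    print(rrset)   # e.g. a TXT record naming the datacenter that handled the query
```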
3. When an anycast node fails
Failure is what makes anycast really useful, not just clever.
A node fails: kernel panic, network drop, scheduled maintenance, doesn’t matter. It stops announcing the prefix to its upstream routers. BGP converges. Within 10 to 60 seconds, routes that depended on that announcement get withdrawn, and traffic flows to the next-preferred announcement instead. From the client’s perspective, a few seconds of timeouts, then everything works again. From the operator’s perspective, a node that stops announcing automatically takes itself out of rotation, with no central control plane, no health-check service, no coordinated failover. The routing table is the failure-detection mechanism, which has been running and recovering from failures for 30+ years.
“Seconds” hides variance. BGP convergence for a single prefix withdrawal is typically 10-60 seconds; sometimes longer with bad peering or if route flap dampening kicks in. For DNS, this is fine: the client retries every few seconds anyway and one or two retries land at a working node. For stricter SLAs, operators add side-channels: external health checks that announce or withdraw based on real probes, RPKI route-origin validation so that a mistaken or hijacked announcement can’t quietly absorb the traffic, sometimes BFD (Bidirectional Forwarding Detection) for sub-second BGP failover at peering points.
4. Propagating changes, and scaling beyond query-response
Anycast handles routing. It doesn’t handle data freshness. If your authoritative zone changes, all of the anycast instances eventually need the same update, and they pull it from a small set of distribution masters, not from each other. The data path is one-to-many; the routing path is many-to-one.
The mechanism on the data path is the standard DNS replication protocol. The master notifies its secondaries with NOTIFY (RFC 1996); the secondaries pull the zone with AXFR (full transfer) or IXFR (incremental). For root and major-TLD operators, the propagation runs through a small set of distribution servers operated by the root zone maintainer (Verisign, for the root) or by the registry operator (for TLDs), and from there fans out to every anycast instance via the operator’s internal infrastructure. Managed-DNS providers (Cloudflare, Route 53, NS1) run their own internal control planes for this, but the public-facing protocol surface is still NOTIFY + AXFR/IXFR.
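For illustration, here is what the secondary’s side of that protocol surface looks like with dnspython. The primary’s address and the zone name are placeholders, and real distribution masters only allow AXFR from configured secondaries (usually with TSIG), so this will be refused almost everywhere.

```python
import dns.query
import dns.zone

primary = "192.0.2.53"   # placeholder address of a distribution master
# Pull the full zone (AXFR) and build an in-memory copy of it.
zone = dns.zone.from_xfr(dns.query.xfr(primary, "example.com"))
print(f"transferred {len(zone.nodes)} names, "
      f"SOA serial {zone.get_rdataset('@', 'SOA')[0].serial}")
```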
The traditional model is reaching a wall
From Chapter 3: root queries grew from ~90 billion/day in early 2023 to ~130 billion/day in early 2025, roughly 20% compound annual growth. The model is “every recursive resolver pulls fresh data from the root on cache miss,” and the root operators are unfunded volunteers. The math doesn’t work indefinitely.
Several alternatives are deployed or proposed. They share a theme: move from “query the root on cache miss” to “everyone has a recent local copy.” That trades a hard scaling ceiling for a freshness and verification problem, and DNSSEC + ZONEMD make the verification side tractable.
- Aggressive NSEC caching (RFC 8198). Once a resolver has a signed NSEC record proving a TLD doesn’t exist, it can authoritatively answer NXDOMAIN for any name in that gap without re-asking the root. Since ~50% of root queries are NXDOMAIN, this is a huge reduction. Shipped in BIND 9.12+, Unbound 1.7.0+, Knot Resolver 2.0.0+. (A toy sketch of the gap check follows this list.)
- Root zone mirroring (RFC 8806). Resolver loads the entire root zone locally and serves it as if authoritative. Zero root queries on cache miss. Trade-off: validating freshness and proving the local copy is complete.
- ZONEMD (RFC 8976). Message digest record published inside the zone, covering the whole zone’s contents. A mirror can cryptographically verify its local copy is the real, complete, unmodified zone. Pairs naturally with mirroring. Shipped in Unbound 1.13.2+, PowerDNS Recursor 4.7.0+, BIND 9.17.13+.
- CDN-distributed root (Huston proposal, APNIC 2025). Treat the root zone as static content distributed via existing CDN infrastructure rather than queried per cache miss. Publish a new zone file every 24 hours with 48-hour validity. Sidesteps the query-load scaling problem entirely, at the cost of slower update propagation. This is acceptable since the root changes only when a TLD’s NS or DS records change.
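The NSEC gap check mentioned above is simple enough to sketch. This toy version uses plain string comparison on ASCII TLD labels; a real resolver uses DNSSEC canonical ordering and only trusts NSEC records whose signatures validate.

```python
def covered_by_gap(qname: str, nsec_owner: str, nsec_next: str) -> bool:
    """True if qname sorts strictly between the NSEC owner and its 'next' name."""
    q = qname.lower()
    return nsec_owner < q < nsec_next

# A hypothetical cached NSEC from the root zone, "dev." -> "dhl.": it asserts
# that no TLD sorts between those two names, so anything in the gap is NXDOMAIN.
for name in ["dfgdfgdfg.", "dg.", "docs."]:
    print(name, "NXDOMAIN from cache:", covered_by_gap(name, "dev.", "dhl."))
# The first two fall in the gap and never reach the root; "docs." sorts after
# "dhl.", so this particular NSEC proves nothing about it.
```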
The TLD operators are watching closely. If the root’s economics get formalized into “the root is a static signed artifact you fetch and verify,” the same model could move down a level to TLDs, and most of the DNS world’s query traffic would no longer need the central infrastructure to scale linearly with it.
5. Authoritative-server performance
Anycast spreads load across machines geographically. But each individual machine still has to handle a lot of QPS. A typical high-traffic Cloudflare or AWS authoritative-server location sees several hundred thousand QPS sustained, and the biggest see millions per box. How?
Four engineering choices, layered on top of each other:
In-memory zones
The whole zone is loaded into RAM at startup. No disk I/O during query handling. The zone file might be a megabyte or a gigabyte; either way it lives in memory, and updates apply diffs to the in-memory copy and asynchronously to disk. Memory access is several orders of magnitude faster than a disk seek; for a system whose hot path is parse → lookup → format → emit, the lookup has to be RAM-resident or the rest of the path doesn’t matter.
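A tiny version of that hot path, using dnspython as a stand-in for a real server’s zone structures (the file name and its contents are examples): pay the parse cost once at startup, then every query is an in-memory lookup.

```python
import dns.rdatatype
import dns.zone

# Startup cost: parse the zone file into in-memory structures once.
zone = dns.zone.from_file("example.com.zone", origin="example.com")

# Per-query hot path: no disk I/O, just a lookup against the in-memory zone.
rdataset = zone.find_rdataset("www", dns.rdatatype.A)
for rdata in rdataset:
    print(rdata.address)
```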
Lock-free reads (RCU)
Modern authoritative servers (NSD, Knot, PowerDNS) use Read-Copy-Update or similar techniques so multiple cores can read the zone concurrently without contention. Writes (zone reloads) happen rarely and don’t block readers. Without lock-free reads, every query would serialize on a shared mutex and the system would lose most of its concurrency. With it, each CPU core processes queries independently, scaling linearly up to memory bandwidth limits.
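The pattern is easier to see in code than to describe. This is not RCU itself, just the same publish-a-new-snapshot idea sketched in Python: readers use whatever snapshot the shared reference points at, and a reload builds a complete replacement and swaps the reference in one step, so readers never block and never see a half-updated zone.

```python
import threading

class ZoneStore:
    def __init__(self, records: dict):
        self._zone = records               # reference to an immutable snapshot

    def lookup(self, name: str):           # hot path: read the current snapshot
        return self._zone.get(name)

    def reload(self, new_records: dict):
        # Build the replacement fully, then publish it with a single reference
        # assignment (atomic under CPython). In-flight readers finish on the old copy.
        self._zone = new_records

store = ZoneStore({"www.example.com.": "192.0.2.10"})
threading.Thread(target=lambda: print(store.lookup("www.example.com."))).start()
store.reload({"www.example.com.": "192.0.2.20"})   # readers never waited on a lock
```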
SO_REUSEPORT and per-core sharding
The Linux SO_REUSEPORT socket option lets multiple processes bind to the same UDP port; the kernel hashes incoming packets across the bound sockets. Each worker process then has its own dedicated queue, no contention on a shared socket. Combined with one worker per CPU core, pinned to that core and allocating from its local NUMA node, the work scales close to linearly with cores.
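A bare-bones sketch of the mechanics (Linux-only, port number arbitrary): each forked worker binds its own socket to the same UDP port with SO_REUSEPORT set, and the kernel spreads incoming packets across them. A real server would also pin each worker to a core, e.g. with os.sched_setaffinity.

```python
import os
import socket

def worker(port: int = 5300) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # the key line
    sock.bind(("0.0.0.0", port))            # several workers bind the same port
    while True:
        packet, client = sock.recvfrom(4096)
        # ... parse the DNS query and answer from this worker's own socket ...

for _ in range(os.cpu_count() or 1):
    if os.fork() == 0:                      # child: become one worker
        worker()
        os._exit(0)
```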
Kernel bypass (DPDK, AF_XDP)
The Linux network stack is general-purpose; it does a lot of work (firewall hooks, conntrack, routing tables) that an authoritative DNS server doesn’t need. DPDK and AF_XDP let the server read packets directly from the NIC’s ring buffer in user space, skipping most of the kernel networking stack. The cost is operational complexity (custom driver model, harder debugging). The benefit is pushing the per-box ceiling from ~1-3M QPS (tuned NSD or Knot on stock Linux) to 5-10M QPS (the same software with kernel bypass).
TODO: validate these numbers against published benchmarks before publication. NSD’s and Knot’s engineering blogs have real measurements; the numbers above are order-of-magnitude estimates pending citation.
6. CDNs use DNS as the steering layer
Now stack everything together. Every CDN’s job is to send each user to the nearest edge, where “nearest” means low-latency, healthy, and not overloaded. DNS is the natural steering point: you’re already asking “where is this name?” and so the CDN’s authoritative server answers with a different IP depending on who’s asking. Anycast handles the steering at the routing layer; DNS handles it at the name layer; the two compose, and most modern CDN architectures use both.
Four common DNS steering techniques, in rough order of how sophisticated they are:
- Round-robin A records. Return multiple IPs in the response. The resolver (or client) picks one, usually the first, sometimes one chosen randomly. Crude load balancing. Doesn’t account for geography or health. Still widely used as a fallback.
- Health-checked DNS. The authoritative server monitors backends. If one goes down, remove its A record from responses going forward. Convergence speed bounded by TTL.
- GeoDNS. Look at the resolver’s source IP. Map it to a geographic location via a GeoIP database. Return the IP of the nearest backend. First widely deployed steering technique, dating to the late 1990s. (A toy sketch follows this list.)
- Latency-based (RUM) steering. Use real latency measurements between resolvers and backends, often gathered via in-page JavaScript that pings candidate IPs. Return the IP with the lowest measured latency for users behind that resolver. State of the art at the big CDNs.
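The GeoDNS toy sketch promised above. The region table and the GeoIP lookup are placeholders; a real deployment uses a maintained GeoIP database plus health and capacity data before picking an answer.

```python
import ipaddress

EDGE_BY_REGION = {            # placeholder edge addresses
    "EU": "203.0.113.10",
    "NA": "203.0.113.20",
    "APAC": "203.0.113.30",
}

def region_for(addr: str) -> str:
    # Stand-in for a real GeoIP database lookup (e.g. a MaxMind-style database).
    if ipaddress.ip_address(addr) in ipaddress.ip_network("192.0.2.0/24"):
        return "EU"
    return "NA"

def steer(resolver_addr: str) -> str:
    # The authoritative server answers with a different A record per querier.
    return EDGE_BY_REGION[region_for(resolver_addr)]

print(steer("192.0.2.55"))    # -> the EU edge, for this toy table
```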
The TTL trade-off (from Chapter 1) shows up sharply here. Short TTLs let you fail over quickly; they also multiply auth-server load because caches can’t hold values long. Most CDN-fronted records publish TTLs in the 30 to 300 second range. Each major CDN runs its own authoritative infrastructure precisely because their TTL economics don’t work on third-party DNS.
TODO: citations needed for the steering claims. Cloudflare’s, Fastly’s, AWS Route 53’s, and Akamai’s engineering blogs and whitepapers have published measurements.
[demo placeholder] Live CDN-steering panel: type a CDN-fronted domain, query it from several public resolvers in different regions via their DoH endpoints, show the returned A records geolocated.
EDNS Client Subnet (ECS)
GeoDNS by resolver IP broke as users moved to anycast public resolvers like 1.1.1.1 and 8.8.8.8. A Berlin user resolving via a US-routed copy of 8.8.8.8 would get steered to a US CDN edge, and have bad latency for the entire session.
EDNS Client Subnet (RFC 7871, 2016) fixes the steering accuracy by leaking some user information upstream: the resolver attaches part of the user’s IP (typically /24 for IPv4, /48 for IPv6) to the query, inside an EDNS OPT record. The authoritative server now sees both the resolver’s IP and a subnet that approximates the user’s location.
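On the wire it is just one more EDNS option. A sketch with dnspython, using documentation addresses and a placeholder authoritative server: the option carries a truncated prefix, not the user’s full address.

```python
import dns.edns
import dns.message
import dns.query

# The resolver attaches the user's /24, not their exact IP.
ecs = dns.edns.ECSOption("192.0.2.0", srclen=24)
query = dns.message.make_query("www.example.com", "A",
                               use_edns=0, options=[ecs])
# Placeholder authoritative server address; it now sees both the resolver's
# source IP and the approximate location of the user behind it.
response = dns.query.udp(query, "198.51.100.53", timeout=2.0)
print(response.answer)
```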
The privacy cost is real. ECS leaks the user’s subnet to every authoritative server in the resolution chain: Cloudflare’s auth, Akamai’s, Fastly’s, anyone running a CDN-friendly authoritative server, including ad-tech operators. A /24 prefix narrows a user’s location to roughly a neighborhood. For most users this isn’t catastrophic, but it’s meaningful. The user signed up for “1.1.1.1 doesn’t log my queries.” They didn’t necessarily sign up for “every CDN auth server learns my /24.”
The policy choices are visible. Cloudflare’s 1.1.1.1 doesn’t send ECS by default, preferring privacy. Google’s 8.8.8.8 sends ECS to CDNs that request it, preferring accuracy. Quad9 strips it. Each defensible, none neutral.
[demo placeholder] ECS A/B demo: fire two DoH queries for the same CDN-fronted name, one with ECS, one without; show edge IPs side by side; slider to spoof your /24 to other locations and watch steering follow.
7. The apex-CNAME problem
A small but operationally annoying problem specific to CDN integration. You bought example.com. You want it to point at your CDN’s edge, which the CDN vendor gives you as a hostname like d12345.cloudfront.net. The natural mapping is a CNAME:
example.com. CNAME d12345.cloudfront.net.
You can’t do this. RFC 1034 says a CNAME cannot coexist with other records at the same name. But the apex (the name of the zone itself, with no subdomain prefix) needs SOA and NS records, mandatory for every zone. So the apex can’t be a CNAME. This is a real RFC-level constraint, not a policy choice. CNAME’s behavior was defined as “this name is that other name,” and the protocol has to know how to handle SOA and NS queries at the apex without redirecting them.
Three workarounds, none of them official:
- ALIAS / ANAME / CNAME flattening. The authoritative server resolves the CNAME on your behalf at query time and returns A records directly. From the resolver’s perspective, it gets A records. From your perspective, you “set a CNAME at apex.” This is what Cloudflare’s CNAME flattening, Route 53’s ALIAS, DNSimple’s ALIAS, and DNS Made Easy’s ANAME all do. Not standardized; behavior varies between providers. (A rough sketch of the mechanism follows this list.)
- Multiple A records, manually synchronized. Ask your CDN for their current edge IPs and copy them into your zone as A records. Brittle. Breaks every time the CDN changes its IP layout.
- HTTPS / SVCB resource records (RFC 9460, 2023). Newer record type that substitutes for CNAME-at-apex semantics, with extra fields for protocol parameters (ALPN, port, IP hints). Adoption is slow but growing. Chrome and other major browsers support it as of 2024.
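The flattening sketch promised above: at query time the authoritative server chases the configured target itself and hands back plain A records, so no CNAME ever appears at the apex. The target name comes from the example earlier in this section; real implementations also cache the result and respect the target’s TTL.

```python
import dns.resolver

APEX_ALIAS_TARGET = "d12345.cloudfront.net"   # the target you "set at the apex"

def flattened_apex_answer() -> list[str]:
    # Resolve the alias target now, on the authoritative side...
    answer = dns.resolver.resolve(APEX_ALIAS_TARGET, "A")
    # ...and these addresses are what gets served as example.com's A records.
    return [rdata.address for rdata in answer]

print(flattened_apex_answer())
```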
A clean spec from 1987 produces a real operational pain in 2026, and vendors quietly route around it. Most people who set up an apex CNAME on Cloudflare or Route 53 don’t know the spec forbids it.
Next: ch6: security and trust
Further reading
- Huston, The Root of the DNS (APNIC, 2025). The CDN-distributed-root proposal and the unfunded-query-system argument.
- RFC 4786. The operational use of anycast.
- RFC 8198. Aggressive NSEC caching.
- RFC 8806. Root zone local copies.
- RFC 8976. ZONEMD.
- RFC 7871. EDNS Client Subnet.
- RFC 9460. SVCB and HTTPS records.
- NSD, Knot DNS, and PowerDNS documentation. Each project’s “performance tuning” page is a good operational guide.
- Verisign’s “Inside the Root Server System” series. Reports on root anycast operation.