Denys Inhul

ch1: why DNS exists

date: May 14 2026

1. The problem

Machines speak IP. A router moves a packet from 10.0.0.51 toward 198.41.0.4 by reading addresses and consulting routing tables. The protocol works; it just doesn’t fit the way humans communicate. People want labels.

You want to type cloudflare in your browser, send requests to stripe in your code, and trust that those names still mean something tomorrow even if the IP behind them changes. You want stable identifiers, names you can pass around in URLs, certificates, config files, business cards, separated from the volatile addresses underneath. So we need a layer of indirection: a translation from what you mean to where it currently lives.

2. HOSTS.TXT

The first answer wasn’t DNS. It was a single file called HOSTS.TXT, maintained at SRI-NIC starting around 1973. Every machine on ARPANET would FTP a fresh copy of this file periodically. To register a new host, you emailed Elizabeth J. Feinler’s group at SRI, and they edited the file by hand.

# excerpt from HOSTS.TXT, c. 1983
HOST : 10.0.0.51 : SRI-NIC : DEC-2060 : TOPS20 : TCP/TELNET,TCP/FTP,TCP/SMTP :
HOST : 10.2.0.11 : MIT-MULTICS : HONEYWELL-DPS-8 : MULTICS : TCP/FTP,TCP/SMTP :
HOST : 10.3.0.6  : UCB-VAX : DEC-VAX-11/780 : UNIX : TCP/FTP,TCP/SMTP,TCP/TELNET :

This worked fine for a hundred hosts. By the early 1980s it had stopped working on every axis at once. The pain was documented in real time across three RFCs: RFC 799 (Cohen, 1981) flagged that 400 hosts was already too many to manage. RFC 819 (Su & Postel, 1982) pushed it further: “the ever-growing size, update frequency, and the dissemination of the central database makes this approach unmanageable.” RFC 882 (Mockapetris, 1983), the original DNS specification, opened by saying the same thing: the system was at the limit of manageability.

The bottleneck wasn’t disk space. It was human throughput. One person, one file, every change, every host.

[diagram placeholder] HOSTS.TXT centralized star (one file at SRI, every host pulls everything) vs DNS delegated tree (root, TLDs, second-level zones).

The replacement had to solve four failures of the original at once. Centralized writes meant one human edited every change. Fix: distribute authority, let owners change their own records. Centralized reads meant every host pulled the whole file. Fix: cache aggressively, pull only what you need. Flat naming meant two organizations couldn’t both have a host called research. Fix: hierarchical names, so research.mit.edu and research.stanford.edu never collide. And single-point control meant one party held the whole namespace. Fix: delegated operation, so no single party can be coerced into breaking the system.

Mockapetris’s proposal in RFC 882/883 (1983), refined into RFC 1034 / RFC 1035 by 1987, made those four fixes the design. The fundamentals haven’t changed since.

3. Ownership hierarchy

If no single party can own the whole map, different parties have to own different pieces. But the harder question in 1983 wasn’t “how do you route a query.” It was “who gets to edit what?” The pain that killed HOSTS.TXT was administrative, not algorithmic. The replacement had to distribute write authority, not just spread out reads.

The RFCs leading to DNS considered three options. Keep patching HOSTS.TXT (raises the update-frequency wall higher, but doesn’t move it). Regional master files, one per network segment (wrong shape: the thing that wanted to own a name was an organization, not a network). Or hierarchical delegation by name, where the name itself encodes the chain of authority. That last option won, and it won because RFC 819 explicitly wanted a hierarchy “administrative dependent, rather than a strictly topology dependent.” It cuts the namespace along organizational lines, and predictable lookups fall out for free.

Read www.cs.mit.edu right-to-left and you have a tour of the delegation chain. .edu is delegated by the root to Educause. mit.edu is delegated by Educause to MIT. cs.mit.edu is delegated by MIT to its CS department. www is just a name the CS department picked. Each step in the chain is a zone: a slice of the namespace under one party’s control. The party can publish their zone from one nameserver, from ten, or (most commonly today) from a managed DNS provider. The point is that the parent zone only needs to know “for anything under mit.edu, ask these nameservers.” It doesn’t need to know (and shouldn’t need to know) anything about what MIT is doing inside its zone.
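
That right-to-left reading can be written down directly. A small sketch (pure bookkeeping, no network traffic) that turns a name into its delegation chain, root first:

```python
def delegation_chain(name: str) -> list[str]:
    # Strip a trailing dot (fully-qualified form), then build the chain of
    # zones by reading labels right to left, starting from the root.
    labels = name.rstrip(".").split(".")
    chain = ["."]  # the root zone
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]))
    return chain

print(delegation_chain("www.cs.mit.edu"))
# ['.', 'edu', 'mit.edu', 'cs.mit.edu', 'www.cs.mit.edu']
```

Each element after the first is a point where authority can (but need not) be delegated to a different party; each actual delegation is a zone cut.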

[diagram placeholder] Delegation chain root → .edu → mit.edu → cs.mit.edu → www.cs.mit.edu, with each box labeled as a zone (administrative unit, not a single server).

The cost paid: every cold lookup has to walk the chain. Three or four round-trips before any answer. We’ll see in §4 why that’s fine in practice, and in §7 what the actual numbers look like.

4. Caching, TTLs, and eventual consistency

Hierarchy is one design move. Caching is a separate one, made for a different reason. It’s tempting to say “the hierarchy would melt the root without caching, so caching is forced.” That’s true at today’s scale, but it isn’t what motivated the choice in 1983. The real driver, stated plainly in RFC 882, is that reads vastly outnumber writes.

A DNS record changes maybe once a year. It gets read billions of times in between. Any system with that ratio wants to cache aggressively, hierarchical or not. Hierarchy then amplifies the payoff (a cache hit short-circuits a four-hop walk, not just a one-hop fetch) but caching would have been wanted regardless.

The TTL: who decides how stale is okay?

Caching forces a sharp question. How long do we cache? Too long and you serve stale answers. Too short and you lose the scaling win. DNS hands this trade-off to the owner of each record, who attaches a Time-To-Live in seconds:

$ dig +noall +answer www.cloudflare.com
www.cloudflare.com.   300   IN  A   104.16.132.229
www.cloudflare.com.   300   IN  A   104.16.133.229
                     ^^^ TTL in seconds (5 min)

Three things to notice. TTL is a field on the resource record, so www can be 60 seconds while mail is 86400, giving per-record granularity from day one. The publisher decides, the cache obeys; resolvers count down from the value the owner ships and don’t get to invent their own (they can cap it lower as a defense, but never raise it). And the original mental model was days: RFC 882 suggested TTLs “on the order of days for the typical host.” The 60-second TTLs you see today on CDN frontends are a much later cultural shift, enabled by faster networks and managed-DNS providers, not a protocol change.
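
A minimal sketch of the cache contract those rules imply: the record carries its publisher's TTL, the cache counts down from it, and a defensive cap (MAX_TTL here is an illustrative value, not a standard) can only lower the TTL, never raise it:

```python
import time

MAX_TTL = 86400  # illustrative defensive cap: obey the publisher, up to a day

class Cache:
    def __init__(self):
        self._store = {}  # name -> (value, absolute expiry time)

    def put(self, name, value, ttl, now=None):
        now = time.time() if now is None else now
        # The publisher's TTL is honored, capped but never extended.
        self._store[name] = (value, now + min(ttl, MAX_TTL))

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry is None or entry[1] <= now:
            return None  # miss or expired: caller must re-query upstream
        return entry[0]

cache = Cache()
cache.put("www.cloudflare.com", "104.16.132.229", ttl=300, now=0)
print(cache.get("www.cloudflare.com", now=299))  # 104.16.132.229
print(cache.get("www.cloudflare.com", now=301))  # None (TTL expired)
```

Real resolvers add negative caching, prefetching, and serve-stale behavior on top, but the core contract is exactly this: serve until expiry, then go ask again.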

Eventual consistency, by design

Hundreds of millions of resolvers now hold cached copies of the same record. Each one has its own TTL countdown, and each was acquired at a different moment. At any instant, different resolvers around the world hold different values for the same name. Some are fresh. Some are stale. Some have never seen the name at all.

This is eventual consistency, and it was a deliberate choice, not an accident. RFC 882 stated it in plain English seventeen years before “CAP” existed: “Access to information is more critical than instantaneous updates or guarantees of consistency. Hence the update process allows updates to percolate out through the users of the domain system rather than guaranteeing that all copies are simultaneously updated.” Translated into modern vocabulary: DNS picks A over C, in the same paragraph where it explicitly chooses to keep serving old data when the path to fresh data is down. Mockapetris didn’t have the CAP framework, but he had the insight and made the choice consciously. DNS is now retroactively cited as the textbook AP system.

[widget placeholder] TTL cascade demo: change a record at the authoritative server, watch each cache layer refresh independently as its TTL hits zero.

The operational consequences are concrete. You can’t atomically change a record; TTL 300 means up to 300 seconds of stale answers worldwide. “Works for me, doesn’t work for me” right after a DNS change is two users on different points in their TTL cycle, both right about different moments. A misconfiguration with a 24-hour TTL haunts you for 24 hours; there’s no recall. Cloudflare changes propagate in ~5 minutes not because the protocol got faster, but because their default TTL is 300 seconds.
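
That split view is easy to model. A sketch with invented timestamps: two resolvers cached the old value at different moments before a change at t=1000, and each serves its copy until its own countdown ends:

```python
TTL = 300  # seconds, as published by the record owner

# Each resolver cached the OLD value at a different moment; the
# authoritative record then changed at t=1000 (all times invented).
cached_at = {"resolver-berlin": 710, "resolver-tokyo": 995}

def sees_new_value(cache_time: int, t: int) -> bool:
    # A resolver only re-queries upstream once its cached copy expires,
    # so it first sees the new value at cache_time + TTL.
    return t >= cache_time + TTL

for t in (1005, 1200, 1300):
    view = {name: sees_new_value(at, t) for name, at in cached_at.items()}
    print(t, view)
# t=1005: both stale. t=1200: Berlin fresh, Tokyo stale -- the "works for
# me, doesn't work for me" window. t=1300: both fresh. Eventual consistency.
```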

5. The administrative layers behind every domain

The hierarchy diagram in §3 shows authority. In practice, that authority breaks into four distinct administrative roles. They’re cleanly separable, and modern providers bundle them in ways that confuse people who haven’t named them out loud.

The registrant is the owner, the party with the right to publish records for the name. Cloudflare Inc. is the registrant for cloudflare.com. You are the registrant for the domain you bought last week.

The registrar is the company that sold you the name and that talks to the registry on your behalf. Namecheap, GoDaddy, Cloudflare Registrar, hundreds of others.

The registry is the operator of a TLD. It runs the authoritative nameservers for, say, .com, and maintains the database of all registered names underneath. Verisign for .com; Public Interest Registry for .org; one per TLD.

The nameserver operator is whoever actually serves your zone, the machines that respond when a resolver asks “what’s the A record for www.yourdomain.com?” Could be your registrar, could be a separate managed DNS provider (Route 53, NS1, Cloudflare DNS), could be a server you run yourself.

Concretely, registering myname.blog looks like this:

  1. The .blog TLD has to exist. The root zone delegates .blog to its registry operator (Knock Knock WHOIS THERE LLC, owned by Automattic) via NS records. .blog was approved by ICANN in 2015 as part of the 2012 new-gTLD program.
  2. You register the name through a registrar. The registrar talks to the .blog registry via the EPP protocol (RFC 5730) and creates a myname.blog entry in the .blog zone.
  3. You pick a nameserver operator. Cloudflare DNS, say. Cloudflare hands you two hostnames (e.g. ns1.cloudflare.com) that will serve your zone.
  4. You tell your registrar to set the NS records for myname.blog to those Cloudflare nameservers. The registrar pushes the change into .blog’s zone. Within minutes the .blog nameservers respond to queries for myname.blog with a referral: “ask ns1.cloudflare.com.”
  5. You add records in Cloudflare’s dashboard: www.myname.blog A 1.2.3.4, mail.myname.blog A 5.6.7.8. Cloudflare’s authoritative servers now answer for those names.

The registry/registrar split is a 1999 ICANN policy artifact, not a protocol concept. Until then, Network Solutions both ran the .com registry and was the sole registrar. ICANN’s Shared Registration System broke them apart to introduce competition at the registrar layer. The DNS protocol itself doesn’t know registrars exist; it only sees zones and delegations. Everything in this section is administrative scaffolding built on top of the protocol. We’ll come back to who actually runs which piece, who’s paid for what, and what changes when you cross national borders, in Chapter 3.

6. Why recursive resolvers exist

In the original 1983 design, a “resolver” was a library that ran on each machine. It walked the tree itself (query the root, get a referral to .com, query .com, get a referral to example.com, query example.com, get the answer) and cached locally for the rest of the session.

That cache lives in your one machine. Your neighbor’s cache lives in their machine. Every machine on a campus does its own walking, builds its own cache, throws it away when you reboot. Wasted work.

The fix was to move the resolver and its cache out of the user’s machine and share it across a whole population. A dedicated server inside the ISP’s network does the walking and the caching, and every subscriber’s stub resolver just asks that server. The first user to look up cnn.com pays the full four-hop cost. Everyone else gets a cache hit. The economics are obvious: with thousands of users behind one resolver, one cold miss is amortized across thousands of warm hits, instead of every machine starting cold.

ISPs started running recursive resolvers for their subscribers in the early 1990s. That was the dominant model for the next fifteen years. The public-resolver era (Google’s 8.8.8.8 in 2009, Cloudflare’s 1.1.1.1 in 2018) moved the same cache infrastructure off the ISP and into a handful of mega-operators. The principle is the same: share the cache across many users.

7. Modern scale

The numbers are large enough to be worth doing the napkin math on, because the math forces the rest of the architecture into existence.

Public resolvers, and why they’re free

Cloudflare’s 1.1.1.1, Google’s 8.8.8.8, Quad9’s 9.9.9.9, NextDNS, and several smaller operators are free to use. The reason they’re free isn’t generosity. It’s that DNS query data is valuable (telemetry on global naming patterns, threat intelligence, popular-site lists), the brand value of being “the privacy resolver” or “the fast resolver” is real, and DNS is a useful control-plane lever for whoever runs your resolver: they can rewrite responses, block names, redirect queries. Free resolvers exist because the alternative (letting ISPs and adversaries see all your queries) is worse for the operators’ broader interests too.

Napkin math

Consider what loading a single YouTube video page actually triggers. A modern page touches dozens of distinct hostnames: youtube.com, googlevideo.com, ytimg.com, doubleclick.net, googleapis.com, gstatic.com, ad networks, analytics endpoints, embedded social widgets. Many of those are CNAMEs that point to other names, each requiring its own fresh walk.

Round numbers, no caching: ~50 unique hostnames per page load × 4 hops each = ~200 round-trips per page. YouTube serves ~5 billion video views per day. Even a conservative 1B unique-hostname lookups/day from YouTube alone is ~12,000 lookups per second sustained, just for that one product. Multiply across every web property, every mobile app, every IoT device on Earth and you get hundreds of millions of queries per second globally. Without caching, every one of those would walk the full tree. The 13 root letters would melt.
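
The arithmetic, spelled out. Every input is the rough estimate from the text, not a measurement:

```python
# Per-page cost with no caching anywhere.
hostnames_per_page = 50     # rough estimate for a heavy modern page
hops_per_cold_lookup = 4    # root -> TLD -> second-level -> answer
round_trips_per_cold_page = hostnames_per_page * hops_per_cold_lookup
print(round_trips_per_cold_page)  # 200

# Sustained rate from one product alone.
lookups_per_day = 1_000_000_000   # conservative YouTube-only estimate
seconds_per_day = 86_400
sustained_qps = lookups_per_day / seconds_per_day
print(round(sustained_qps))  # 11574, call it ~12k lookups/sec
```

Multiply that per-product figure across the whole internet and the need for aggressive caching at every layer stops being a design preference and becomes a survival requirement.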

What actually happens, with caching: Cloudflare’s 1.1.1.1 alone handles roughly 50 million queries per second as of 2024 (Cloudflare Radar), after browser, OS, and home-router caches have absorbed most lookups. Globally across all resolvers, the total is plausibly several hundred million QPS, approaching 1B at peak. The 13 root letters collectively see on the order of 1.5 million queries per second (~130 billion/day at current 2025 rates, per RSSAC and APNIC stats). That’s a compression of several hundredfold between user-edge traffic and root traffic, and that compression is the entire reason the system works: caching at every layer absorbs the vast majority of queries before they reach the next layer up.

CNAME chains

A subtlety worth flagging: many user-visible hostnames are CNAMEs that point to other hostnames. www.example.com might be a CNAME to example.com.cdn-provider.net, which is itself a CNAME to pop-frankfurt-12.cdn-provider.net, which finally has an A record. Each CNAME triggers a fresh full walk from root for the target name. CDNs, load balancers, and managed services use this constantly. A single user-visible name can quietly become two or three separate four-hop walks. Caching absorbs almost all of it after the first lookup, but the cold case is much more expensive than the diagrams in §3 suggest.
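
A toy sketch of CNAME chasing, with invented records matching the example above. Each name in the chain stands in for a full cold walk from the root, so resolving the one user-visible name costs three walks, not one:

```python
# Invented records for illustration; rtype is either CNAME (alias) or A (address).
RECORDS = {
    "www.example.com":                   ("CNAME", "example.com.cdn-provider.net"),
    "example.com.cdn-provider.net":      ("CNAME", "pop-frankfurt-12.cdn-provider.net"),
    "pop-frankfurt-12.cdn-provider.net": ("A", "203.0.113.7"),
}

def resolve(name: str, max_chain: int = 8) -> tuple[str, int]:
    walks = 0
    while True:
        walks += 1  # each name here stands for a full cold walk from the root
        rtype, value = RECORDS[name]
        if rtype == "A":
            return value, walks
        if walks >= max_chain:
            raise RuntimeError("CNAME chain too long")  # loop / misconfig guard
        name = value  # follow the alias and resolve its target from scratch

addr, walks = resolve("www.example.com")
print(addr, walks)  # 203.0.113.7 3
```

Real resolvers cache every intermediate name in the chain, which is why only the cold case pays this multiplier.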

[demo placeholder] Real-domain CNAME-chain demo: pick a CDN-fronted name, bypass cache, trace the actual chain, show which servers answer at each hop.

Root server load

How do 13 names handle 130 billion queries per day? Anycast. Each “letter” (a.root-servers.net through m.root-servers.net) is one IP address advertised from many physical locations via BGP. When you “ask the root,” your packet ends up at whichever physical node is topologically nearest. As of May 2026 there are 2,020 instances across ~1,742 sites (root-servers.org). Chapter 5 covers the anycast story properly.

8. Authoritative vs recursive

So far we’ve talked loosely about “DNS servers.” There are actually two distinct kinds, doing two completely different jobs.

Authoritative servers hold the truth for a zone. They answer questions about names they own, and return referrals for everything else. ns1.cloudflare.com is authoritative for the millions of zones Cloudflare hosts; ask it about www.cloudflare.com and you get a direct answer with the AA (authoritative answer) bit set; ask it about anything outside its zones and it’ll either refuse or return a referral. Authoritative servers don’t cache. They don’t recurse. They only know their own zones.

Recursive resolvers hold a cache and walk the hierarchy on behalf of clients. 1.1.1.1 is a recursive resolver. It doesn’t own any zone. Ask it about www.cloudflare.com and it’ll either return a cached answer or do the full walk on your behalf (root, TLD, second-level) and return the final result.

[diagram placeholder] Side-by-side comparison of authoritative server (one query in, one answer out from local data; no recursion, no cache, only its own zone) vs recursive resolver (one query in, hidden hierarchy walk, returns final answer; recurses, caches, owns nothing).

The recursive resolver tier is what stands between the chaos of stub clients and the small handful of letters at the root. It’s the layer that absorbs the read amplification (§7), and as we’ll see in §10, it also exists as a security boundary.

9. Iterative vs recursive modes

RFC 882 (1983) defined two query modes. Iterative: the server says “I don’t know the full answer, but ask ns1.example.com; they handle that subtree.” Returns a referral. The client has to keep going. Recursive: the server says “hold on, I’ll do the whole walk for you.” Goes off, talks to root, TLD, authoritative on your behalf, returns only the final answer.

The original spec made iterative mandatory and recursive optional. In practice today, every real query uses both modes on different segments:

your laptop's stub resolver
    │  recursive query (RD=1) ──▶  ISP / public resolver / 1.1.1.1
    │                                       │
    │                                       │  iterative queries ──▶  root
    │                                       │  iterative queries ──▶  .com
    │                                       │  iterative queries ──▶  example.com
    │  ◀── single final answer ──           │

Stub to recursive resolver: recursive (you want one answer, you wait). Recursive resolver to authoritative servers: iterative, because authoritative servers refuse recursion for arbitrary clients. The recursive resolver is the bridge between the two modes. The next section is why that bridge has to exist.
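
The RD flag is literally one bit in the query header. A minimal sketch that hand-packs a DNS query in the RFC 1035 wire format (covered properly in Chapter 2), with and without recursion desired; the transaction ID 0x1234 is arbitrary:

```python
import struct

def dns_query(name: str, recursion_desired: bool) -> bytes:
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (six 16-bit fields).
    # RD is bit 8 of the flags word (RFC 1035 section 4.1.1).
    flags = 0x0100 if recursion_desired else 0x0000
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

stub_query = dns_query("example.com", recursion_desired=True)   # stub -> resolver
iter_query = dns_query("example.com", recursion_desired=False)  # resolver -> authoritative
print(stub_query[2:4].hex(), iter_query[2:4].hex())  # 0100 0000
```

Same question, same name, same everything; one bit decides whether the server is being asked to walk the tree or merely to point at the next hop.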

10. Why authoritative servers refuse recursion

The reason isn’t performance. It’s that DNS runs on UDP, and DNS responses are usually much bigger than DNS queries. An attacker can spoof the source IP of a query to be the victim’s IP, send a small query (~60 bytes) to a server that performs recursion, and the server obediently sends a large response (~3000 bytes) to the victim. ~50× amplification with no effort on the attacker’s part. Across a botnet of thousands, this saturates gigabit links at the victim. This was the dominant DDoS technique through the 2010s. The 2013 Spamhaus attack hit ~300 Gbps using exactly this pattern.

So if a root or TLD server performed recursion for any client that asked, it would become the world’s largest open resolver: a free amplifier weaponizable by anyone. Authoritative servers refuse recursion and only return small referrals for names outside their authority. Recursive resolvers, in turn, accept queries only from their own clients (an ISP’s resolver answers only its subscribers; a corporate resolver answers only internal IPs). Public resolvers like 1.1.1.1 are the exception: they accept queries from anywhere, but they invest heavily in rate limiting and DDoS mitigation to make abuse uneconomic, and they spread the load across global anycast infrastructure that no single attacker can saturate.

That’s the load-bearing reason the recursive resolver exists as a distinct tier: it’s an abuse boundary, not just a performance optimization. Security gets its own chapter (Chapter 6). For now this is enough to explain the shape of the system.

11. A short history of the recursive layer

  • 1980s-early 1990s. Most hosts ran iterative resolution via a local resolver library (BIND’s resolver, mostly). “Recursive resolver as a service” was rare.
  • Mid-1990s onwards. As consumer internet exploded, ISPs began running recursive resolvers for their subscribers. This was the dominant model for ~15 years.
  • 2006. OpenDNS launches as the first widely-used public recursive resolver, where anyone anywhere could point at it.
  • 2009. Google Public DNS (8.8.8.8). Free, fast, easy to remember. The start of resolver consolidation onto a few mega-operators.
  • 2017-2018. Quad9 (9.9.9.9) and Cloudflare 1.1.1.1 launch with privacy-focused pitches.

The trajectory has been steady centralization. In 1995 you used your ISP’s resolver. By 2025, a substantial fraction of the world uses one of Google, Cloudflare, or Quad9, directly or transparently via browsers and OSes that route through them. That centralization solves real problems (latency, reliability, blocking known-bad names) and creates a different one (concentrated visibility into everyone’s queries) that Chapter 6 picks up.

12. Putting it together

The chapter has built up two pictures of DNS. They’re worth seeing side by side.

The conceptual picture

What the protocol gives you: a hierarchy of zones, delegation between them, caching at every layer.

[widget placeholder] Two-user cache demo: cold walks the tree, warm hits cache.

The practical reality

What actually happens when a user types a name into their browser. Many more layers than the conceptual picture, where each one is a potential cache, a potential rewrite, and a potential failure point.

[widget placeholder] End-to-end practical walk through every cache and forwarder. Random-subdomain trick to bypass cache and see who actually talks to whom in what order.

Concretely, a Berlin user typing news.bbc.co.uk into Chrome on a corporate laptop with VPN active, with nothing cached anywhere:

Chrome's internal DNS cache              → miss
  ↓ (calls getaddrinfo)
/etc/hosts                               → not listed

systemd-resolved / mDNSResponder         → cache miss
  ↓ (sends UDP query to configured upstream)
VPN-provided resolver (10.x.x.x)         → forwards to next hop

Corporate internal resolver              → forwards to upstream

Cloud egress resolver                    → talks to public resolver

Cloudflare 1.1.1.1 (anycast → Frankfurt) → cache miss, walks:
                                          root (anycast → Frankfurt instance)
                                          → returns NS for .uk

                                          .uk (Nominet, anycast)
                                          → returns NS for .co.uk

                                          .co.uk
                                          → returns NS for bbc.co.uk

                                          bbc.co.uk authoritative
                                          → returns CNAME to news.bbc.co.uk.cdn-provider.net
                                          ↓ (full new walk for the CNAME target)
                                          root → .net → cdn-provider.net → final A record
  ← single answer back through every layer

That’s plausibly 8-12 distinct caches and ~10 actual network round-trips for one cold lookup. After that first lookup, almost everything is cached and subsequent queries cost almost nothing.
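
The layered walk can be sketched as a chain of caches: the query descends until some layer has the answer, and the result is written back into every layer it passed on the way down. TTLs and the CNAME re-walk are omitted, and the address is invented:

```python
layers = ["browser", "OS", "VPN resolver", "corporate resolver", "1.1.1.1"]
caches = {layer: {} for layer in layers}

def lookup(name, authoritative_walk):
    # Descend until some layer has a cached copy; else do the full cold walk.
    for i, layer in enumerate(layers):
        if name in caches[layer]:
            answer = caches[layer][name]
            break
    else:
        i, answer = len(layers), authoritative_walk(name)
    # Populate every cache layer above the point where the answer was found.
    for layer in layers[:i]:
        caches[layer][name] = answer
    return answer, i  # i = depth reached (len(layers) means full cold walk)

walks = []
walk = lambda n: walks.append(n) or "151.101.0.81"  # invented address
addr, depth = lookup("news.bbc.co.uk", walk)
print(depth, len(walks))  # 5 1  -> cold: missed all five layers, one real walk
addr, depth = lookup("news.bbc.co.uk", walk)
print(depth, len(walks))  # 0 1  -> warm: browser cache hit, no new walk
```

One cold walk, then every layer between the user and the root answers locally. That is the whole system in miniature.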

Every design choice from §2 through §10 shows up in this walk. Hierarchy partitions the namespace. Delegation is each step. Caching at every layer absorbs almost all the load. TTLs control how long each layer holds a value. The recursive resolver is the abuse boundary that lets stubs stay small and authoritative servers stay safe.

The next five chapters fill in the parts this chapter glossed over. Chapter 2 looks at the bytes on the wire. Chapter 3 covers who actually runs what at the top of the tree, and who’s paid for it (or, mostly, isn’t). Chapter 4 is the politics: what happens when cooperation breaks down. Chapter 5 covers anycast and the scaling tricks that keep the load tractable. Chapter 6 covers security and the privacy story the protocol has been slowly retrofitting since 1987.

Next: ch2: protocol on the wire


Further reading

  • Cohen, RFC 799 (1981), Internet Name Domains. First articulation of the HOSTS.TXT scale problem.
  • Su & Postel, RFC 819 (1982), The Domain Naming Convention for Internet User Applications. The case for “administrative dependent, rather than a strictly topology dependent” hierarchy.
  • Mockapetris, RFC 882 & RFC 883 (1983). The original DNS specification.
  • Mockapetris, RFC 1034 & RFC 1035 (1987). DNS concepts and protocol. Still the canonical specification.
  • Mockapetris & Dunlap, Development of the Domain Name System, SIGCOMM 1988. Retrospective on why HOSTS.TXT failed and what DNS replaced it with.
  • Brewer, Towards Robust Distributed Systems, PODC 2000 (keynote). The original CAP conjecture.
  • Gilbert & Lynch (2002), Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. The formal proof.
  • Liu & Albitz, DNS and BIND (5th ed.). The operator’s handbook.