
Back up the Internet.

The CEO of a hot new startup is convinced that if you get a backup of the Internet, you'll be able to run fancy analytics and do AI to figure out ... something. (They didn't provide many specifics, since they're worried about someone stealing their intellectual property.)

Write a web crawler that backs up the Internet by downloading a page, downloading all the pages that page links to, downloading all the pages those pages link to, and so on.

The CEO wants the entire backup process to take at most 1 week to run.

Challenge accepted!

To start off, let's ask some clarifying questions to figure out what's in scope.

Are we designing something that'll process the web content? Or just download and store it? For now, focus on the downloading portion. (We'll need more details from our CEO before we get to the processing part.)

What sorts of content are we grabbing? For now, just HTML. Later on, we might expand to other stuff—pictures, videos, JavaScript files, etc.

How long will we run our crawler? Your CEO said the crawler should back up the Internet in a week.

Roughly how many pages will we need to crawl? The whole Internet! There are always new pages getting added. At the start of 2018, there were 4.5 billion pages, so that's a good ballpark estimate.

Are we building a front-end to access the downloaded data? No. For now, just focus on the part that downloads and stores HTML files.

What's the format of our database? We'll want something like a large key-value store, where we can look up page data by URL.

If a page gets updated before our backup finishes, do we need to fetch a more recent copy? No, any version within the last week is fine.

How do we find links in HTML? Keep it simple. For now, just look for the href attributes on <a> tags.

In HTML, links to other pages look like this:

<a href="[Destination URL]">[Text to show]</a>

As an example, this is a snippet of HTML that links to Interview Cake's homepage:

<a href="https://www.interviewcake.com">Interview Cake's Homepage</a>

And here's how that HTML gets displayed: Interview Cake's Homepage.
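
To get a feel for what "just look for href attributes on <a> tags" means in code, here's a rough sketch of link extraction using Python's built-in html.parser module. (A real crawler would also need to resolve relative URLs against the page's address; that's left out here.)

  from html.parser import HTMLParser

  class LinkExtractor(HTMLParser):
      """Collects the href attribute of every <a> tag it sees."""

      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == 'a':
              for name, value in attrs:
                  if name == 'href' and value:
                      self.links.append(value)

  # Example usage:
  extractor = LinkExtractor()
  extractor.feed('<a href="https://www.interviewcake.com">Interview Cake\'s Homepage</a>')
  print(extractor.links)  # ['https://www.interviewcake.com']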

Okay, that gives us something to work with.

Let's figure out some numbers for our crawler.

How fast do we need to crawl pages? And, how much space will we need to store them?

How fast do we need to crawl pages?

Our goal is to back up the whole Internet in a week.

As of early 2018, there were about 4.5 billion pages. If we need to get them all, then we're looking at:

  • 650 million pages per day
  • 30 million pages per hour
  • 500 thousand pages per minute
  • 8,500 pages per second

How much space will we need for storage?

Internet pages are pretty big these days (with lots of ads, images, and scripts). We're not downloading the whole page though—just the HTML. On average, that's about 75KB per page. Let's round up to 100KB to be on the safe side.

With 4.5 billion pages, we're looking at 450 terabytes of data.
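
To double-check those figures, here's the back-of-the-envelope math in Python. The exact results come out a bit below the rounded numbers above—rounding up gives us some headroom.

  PAGES = 4.5e9                      # pages on the Internet, early 2018 estimate
  SECONDS_PER_WEEK = 7 * 24 * 60 * 60

  print(PAGES / 7)                   # ~643 million pages per day
  print(PAGES / (7 * 24))            # ~27 million pages per hour
  print(PAGES / (7 * 24 * 60))       # ~446 thousand pages per minute
  print(PAGES / SECONDS_PER_WEEK)    # ~7,440 pages per second

  PAGE_SIZE_BYTES = 100 * 1000       # 100KB per page, rounded up from ~75KB
  print(PAGES * PAGE_SIZE_BYTES / 1e12)  # 450 terabytes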

Okay, so here's what we've got. We're building a web crawler to back up 450 terabytes of data. Our crawler should finish in about a week, which means processing roughly 8,500 pages per second.

Now that we've got some numbers, let's start thinking about our architecture. What would it take to build a really simple web crawler?

As a starting point, we'll need a crawler machine that goes around the Internet downloading pages and storing them.

[diagram]

How does our crawler know which pages to visit?

We can treat the Internet like a graph. Each page is a node, and links to other pages are directed edges from one node to another.

There are two common ways to explore graphs: breadth-first and depth-first traversal.

Which type of traversal should we use here?

It depends on how we want our crawler to behave.

  • With a breadth-first traversal, we'll "go wide," sprawling out to lots of different sites quickly.
  • With a depth-first traversal, we'll "go deep," usually sticking around the same corner of the Internet for a while before moving on to a different site.

Both of these could work. But,

  • Most web content tends to be just a few links deep on a site, so a breadth-first traversal seems like a good way to get the highest quality content early on in our backup.
  • A depth-first traversal could easily get "lost in cyberspace" following links deep into a site aimlessly and taking a long time to get to popular content that users actually see.

We'll go with breadth-first here to make sure we grab the highest quality content quickly.

Interviewer: "Can you write out the pseudocode for the breadth-first traversal?"

Our code looks a lot like breadth-first search:

- pages_to_visit queue starts with a handful of pages with lots of links (wikipedia, reddit, stackexchange, etc.)
- while there are still URLs in pages_to_visit:
    - get the next URL to download
    - download it
    - insert the URL and downloaded data into our database
    - parse the downloaded HTML. For each link:
        - if we've already downloaded it, skip it
        - otherwise, add it to pages_to_visit

The specific pages our traversal starts with aren't super important. Any page that has lots of outbound links to places across the Internet should do the trick.
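
Here's what that pseudocode might look like as a minimal single-machine sketch in Python. The download_page, extract_links, and store helpers are hypothetical stand-ins for the real network, parsing, and database code:

  from collections import deque

  def crawl(seed_urls, download_page, extract_links, store):
      """Breadth-first crawl starting from a handful of link-rich seed pages."""
      pages_to_visit = deque(seed_urls)
      visited = set(seed_urls)

      while pages_to_visit:
          url = pages_to_visit.popleft()      # get the next URL to download
          html = download_page(url)           # download it
          store(url, html)                    # insert the URL and data into our database
          for link in extract_links(html):    # parse the HTML for outbound links
              if link not in visited:         # skip pages we've already seen
                  visited.add(link)
                  pages_to_visit.append(link)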

Going back to our earlier diagram, we'll need to add two things:

  • A queue storing the pages to visit in the breadth-first traversal
  • A set to keep track of the pages we've already visited

So here's what things look like now:

[diagram]

Do we really need a separate structure to store the pages we've already visited? Couldn't we just check for the page in our database?

We could do that. But we probably want to keep them separate. We'll be accessing the set of pages we've already visited a lot (to check whether a page has already been visited), but we don't need to touch the downloaded data yet. (We'll need more details from our CEO on what sort of analysis they want to do.)

Breaking these storage structures apart means that we can keep the set of URLs we've already visited in fast storage and put the full set of downloaded pages on slow, cheap storage.

Okay, this is starting to feel like a minimum viable product. Let's start scaling it up so that our traversal finishes in a week.

In order to finish our traversal in a week, we need to process roughly 8,500 pages per second.

Can we do that with one machine? Or will we need multiple machines?

Our intuition should be that we'll need multiple machines. Processing 8,500 pages per second just seems like a lot of work for one machine to do on its own.

How could we show this more rigorously?

Let's try breaking down the steps involved in processing a page:

  • Connecting to the server hosting the page,
  • Downloading the page,
  • Parsing the downloaded content to find any outgoing links,
  • Writing the downloaded content to our data store, and
  • Updating our list of pages we still need to visit.

Of all these steps, the network portions (connecting and downloading the page) will almost certainly be the bottleneck. On a fast connection to a server nearby it might take 50 milliseconds for data to go back and forth, and servers across the ocean will take even longer to reach.

So, when we're downloading pages, we're going to spend a lot of time blocked: waiting to receive data from a server with nothing to do in the meantime.

We can minimize the time spent idly waiting by downloading multiple pages in parallel. If we have a few different requests out at once, chances are we'll always have at least one with new data ready to process.

All that said, it's hard to estimate how many pages a single machine will be able to download per second—it'll depend on the bandwidth of our network connection, the responsiveness of the server we're downloading from, and the speed of our own data stores.

So, let's focus on the page processing instead.

At a minimum, parsing the downloaded page will require reading each byte of the page in from RAM.

Accessing RAM takes, on average, about 100 nanoseconds. Over 100KB of data, that's 100,000 bytes × 100 nanoseconds per byte—at least 10 milliseconds to process each page.

How many milliseconds are there in a second?

1,000.

So, as a high estimate, a single machine will only be able to process roughly 100 pages per second.
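
Spelling that estimate out as a quick calculation, under our assumption that every byte read comes straight from RAM:

  RAM_ACCESS_NS = 100               # rough latency per byte read from RAM, ignoring caches
  PAGE_SIZE_BYTES = 100 * 1000      # 100KB per page

  ms_per_page = PAGE_SIZE_BYTES * RAM_ACCESS_NS / 1_000_000   # 10 milliseconds per page
  pages_per_second = 1_000 / ms_per_page                      # ~100 pages per second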

What about caching?

That's true. Once we read one byte from memory, chances are the next few bytes will be read out of a faster cache.

Still, it's hard to know how many cache hits we'll have in practice, since it'll depend on things like the cache size, layout, and eviction policy.

We definitely don't want to be relying on cache hits to finish on time. So we're going to play it safe and assume all of our accesses go to RAM.

What if a machine has multiple cores? Won't that increase the number of pages it can process?

Probably not. Most systems have a single connection (the memory bus) between the processor cores and RAM, so extra cores would still be competing for the same memory bandwidth.

Okay, so we definitely need multiple machines.

How should we go about splitting the work of crawling the Internet across them?

What if we just gave each machine a different starting page, and had them run independently?

Like this:

[diagram]

That could work. But, since each machine has a separate database, it's possible that the same page could be crawled multiple times. That's inefficient, since we're duplicating work.

So if we're going to have multiple machines crawling the Internet in parallel, we need some sort of coordination between them.

What about if we had one centralized tracker that was shared among all of the machines?

Like this:

[diagram]

That definitely fixes our duplicate-work problem. Since every crawler machine will be pulling from the same queue of pages to visit, and they'll have the same database tracking what's already been grabbed, we won't visit the same page twice.

Can you find any issues with this design?

We've taken our data—all the pages we've downloaded and where we're going next—and centralized it.

This centralized queue and database could become a bottleneck, since we'll have several machines constantly reading and writing to them. Remember, we'll be processing 8,500 pages per second, which means at least that many reads and writes to the database.

The centralization also introduces other potential reliability issues. We've added a single point of failure, which, if it goes down, would force our crawler to grind to a halt.

As a rule of thumb, designs with large centralized components usually have at least two problems:

  • They don't scale well to massive sizes.
  • They introduce a single point of failure.

Keep these in mind when building your system. Either you'll need to address these problems (e.g.: by showing that the centralized component scales to your loads and adding redundancy), or you'll want to come up with a different design.

We'll probably want to get rid of both centralized pieces of this design: the database and the queue.

Let's start by looking at the database. How can we get rid of this centralized storage in our system?

We could split up our database across multiple machines. (This is usually called data partitioning or sharding.) Instead of having one big machine that stores all of our data, we'll have a few smaller machines that each store part of it.

We could keep our database machines separate from our crawler machines. But it'll probably be faster to just have the crawlers store the data locally, instead of adding in another group of servers and more network traffic.

So, we'll break up the downloaded page data and have each crawler machine store a portion of it.

How should we divide the data up between the crawlers?

There are a few different ways we could break up our data:

  • We could divide it up alphabetically. If we have four crawlers, we could give the first one all the URLs that start with the first quarter of the alphabet, the second crawler the URLs with the next quarter, and so on.

    [diagram]
  • We could hash each URL and assign it to a machine by taking a modulus. With our example of four crawlers, we'd compute Crawler = hash(URL) % 4.

    [diagram]
  • We could use consistent hashing—giving each crawler a value (or several values) across the range of possible hash values and assigning each URL to the machine whose value comes next after the URL's hash in that range.

    [diagram]

Splitting up our URLs alphabetically would be simple, but it's not guaranteed to distribute the load evenly across all of our crawlers. For instance, one crawler might end up with Facebook, Google, and Instagram.

Hashing the URLs and then taking the modulus could work, but it'll make it really hard to add in any more crawlers if we need them. Any time we change the number of crawlers, we'll have to re-hash all of the URLs and assign them to new machines.

Consistent hashing avoids both of these problems. Our hashed URLs should fall pretty evenly across the range of possible values, meaning each crawler will have roughly the same load. And, if we need to add in more crawlers to speed things up, we'll only have to re-assign URLs that appear near the new server on the range of hash values, not all of them.

Okay, so we'll use consistent hashing to map URLs to machines responsible for storing their data.
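
As a rough illustration, here's one minimal way consistent hashing could map URLs to crawler machines in Python. (Real systems usually give each machine many points on the ring—often called virtual nodes—to smooth out the load; that's omitted here.)

  import bisect
  import hashlib

  def stable_hash(value):
      """Deterministic hash of a string (Python's built-in hash() varies between runs)."""
      return int(hashlib.md5(value.encode()).hexdigest(), 16)

  class ConsistentHashRing:
      def __init__(self, machines):
          # Give each machine a point on the ring based on its name.
          self.ring = sorted((stable_hash(m), m) for m in machines)

      def machine_for(self, url):
          # Walk clockwise from the URL's hash to the next machine on the ring.
          url_hash = stable_hash(url)
          index = bisect.bisect(self.ring, (url_hash,)) % len(self.ring)
          return self.ring[index][1]

  ring = ConsistentHashRing(['crawler-1', 'crawler-2', 'crawler-3', 'crawler-4'])
  print(ring.machine_for('https://www.interviewcake.com'))

Adding a fifth crawler just means adding one more point to the ring—only the URLs whose hashes now map to that new point need to move.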

Here's what our system looks like now. We've taken our database and visited URLs set away from the central tracker and split them up between the crawlers using consistent hashing:

[diagram]

And, here's what the workflow is for each crawler:

  • Dequeue the next URL to visit from the central queue.
  • Hash the URL to determine which crawler would be storing its content.
  • Check with that crawler to see if we've already grabbed that page. If yes, we're done with this one.
  • Download the page. Send it to the crawler responsible for storing it.
  • Parse the page, extracting all outbound links. Add these links to the queue to be visited later.

This seems like it's on the right track, but there are still a few issues:

  1. Our centralized queue is still a potential bottleneck, since all the crawlers will be constantly requesting and sending URLs.
  2. We're sending lots of network traffic between crawlers—querying crawlers to see if a page has already been downloaded and sending retrieved data to be added to the database.

Can we fix both of these at the same time by splitting up our centralized queue?

Maybe. Instead of having one big queue shared between all of the crawlers, we could give each crawler their own queue. Then, when enqueuing new URLs to visit, we'd hash the URL and add it to the queue for the node responsible for storing it.

Here's what our system looks like now:

[diagram]

And, here's how each crawler's workflow changes:

  • Dequeue the next URL to visit from our queue.
  • Check our local visited set to see if we've already grabbed that page. If yes, we're done with this one.
  • Download the page. Store it in our database.
  • Parse the page, extracting all outbound links. Add these links to the queue of the crawler responsible for processing them.

Nice! This removes the potential bottleneck of a single queue and eliminates the network traffic generated by sending a downloaded page from one crawler to another.

One concern though: our crawler could try to download lots of pages from the same site at once, overwhelming it with requests. That's not polite. How can we avoid that?

There are a few different ways we could do it:

  • We could add some sort of distributed lock manager service. Before requesting a page, a crawler would grab the lock for the site we're requesting from. If two crawlers both try to send requests at the same time, only one will get the lock, and the other one will have to wait for the first one to finish before sending its own request.
  • We could assign sites to crawlers instead of pages. That way, all of the pages for a site would go to the same crawler, preventing multiple simultaneous requests from different crawlers.

Both of these could work. But, throughout this system, we've generally ended up choosing decentralized designs over centralized ones. So, we'll continue that overarching design choice here, and go with the second option.

This doesn't really change our system's architecture. Each crawler still has its own queue, and we use consistent hashing to break up the data across crawlers.

The only change is how we're dividing the data up across machines. Instead of doing consistent hashing on the entire URL, we're just hashing the site.

One second though. Doesn't this have the same problem as our idea of splitting up URLs by their first letter? Is it possible that some crawlers will have way more work than others?

Oof. Yeah, it does.

If we just hash the site, not the full URL, then one crawler will end up responsible for crawling the entire Wikipedia site. That's not good.

So, maybe hashing by site isn't a great idea after all. Let's try running with the other option: a distributed lock manager.

If you're familiar with distributed lock managers, it can't hurt to mention any specific implementations you know about. Some common ones are Google's Chubby, Apache ZooKeeper, and Redis.

Using the lock manager makes our crawler more polite by preventing us from flooding a site with tons of page requests all at once. It lets us set a hard cap on the number of page requests we'll send to a site at a time and coordinates across all of our crawler machines.

How many requests should we allow to a site at a time?

The specific number will probably vary depending on the popularity of the site. For a heavily-visited site like Wikipedia, we might allow hundreds of concurrent page requests. For smaller sites, we could have a smaller limit, like 10 or 20.
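
In a real deployment we'd lean on a service like ZooKeeper, Chubby, or Redis for this coordination. But here's a single-process sketch of the idea in Python, using one semaphore per site to cap concurrent requests. (The per-site limits and the download_page helper are made up for illustration.)

  import threading
  from urllib.parse import urlparse

  DEFAULT_LIMIT = 10
  PER_SITE_LIMITS = {'en.wikipedia.org': 100}   # bigger sites can handle more concurrent requests

  site_semaphores = {}
  site_semaphores_guard = threading.Lock()

  def semaphore_for(url):
      """Return the semaphore that caps concurrent requests to this URL's site."""
      site = urlparse(url).netloc
      with site_semaphores_guard:
          if site not in site_semaphores:
              limit = PER_SITE_LIMITS.get(site, DEFAULT_LIMIT)
              site_semaphores[site] = threading.Semaphore(limit)
          return site_semaphores[site]

  def polite_download(url, download_page):
      # Block until we're under the per-site cap, then fetch the page.
      with semaphore_for(url):
          return download_page(url)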

Here's what our workflow on each crawler machine looks like now:

  • Dequeue the next URL to visit from the crawler's queue.
  • Check our local visited set to see if we've already grabbed that page. If yes, we're done with this one.
  • Grab a lock for the site hosting that URL.
  • Download the page. Store it in our database.
  • Release the lock for the site hosting the URL.
  • Parse the page, extracting all outbound links. Add these links to the queue of the crawler responsible for processing them.

And here's our system's overall architecture:

[diagram]

Looks good! This architecture should get us most of the way. But, before we wrap up, let's run through a few refinements that we could make to speed up our crawlers.

Let's look at the visited URLs set. Any way we can make accessing that faster?

Each crawler machine has to keep track of the URLs it's already downloaded. All these URLs will take up a good amount of space—probably too much to fit in RAM.

How much space will the URLs take up?

Most browsers don't support URLs longer than 2,000 characters, and most URLs are much shorter than that. If we budget 1,000 characters per URL, that's 1,000 bytes per URL (assuming ASCII, one byte per character). Across 4.5 billion pages, that's about 4.5 terabytes of URLs—small relative to the 450 terabytes of page data, but still too big to fit in RAM.

There are a few ways we could speed up our check of whether a URL has been crawled yet:

  • Even if we can't fit all of the URLs in RAM, we could cache some of the most recently visited or most popular ones. Chances are we'll see lots of links to a small number of popular pages (like home pages), and pages on the same site often link back to each other.
  • We could use a Bloom filter—a compact, fixed-size bit array that easily fits in memory—to do a quick check of whether a URL might have already been crawled. If it definitely hasn't been crawled yet, we can go ahead and download it. If it may have already been crawled, we fall back on a slower lookup in our cache or the full visited set. (There's a bare-bones sketch right after this list.)
  • We could use data structures optimized for string storage to efficiently compress our URLs. Tries or ternary search trees are two data structures that could work.
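
Here's what that bare-bones Bloom filter might look like in Python, just to show the mechanics. A real deployment would size the bit array and the number of hash functions based on how many URLs we expect to see and what false-positive rate we can tolerate:

  import hashlib

  class BloomFilter:
      def __init__(self, num_bits=1_000_000, num_hashes=5):
          self.num_bits = num_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(num_bits // 8 + 1)

      def _positions(self, url):
          # Derive several bit positions from the URL by hashing it with different prefixes.
          for i in range(self.num_hashes):
              digest = hashlib.md5(f'{i}:{url}'.encode()).hexdigest()
              yield int(digest, 16) % self.num_bits

      def add(self, url):
          for pos in self._positions(url):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def might_contain(self, url):
          # False means "definitely not crawled yet"; True means "maybe crawled"—check the real set.
          return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

  visited = BloomFilter()
  visited.add('https://www.interviewcake.com')
  print(visited.might_contain('https://www.interviewcake.com'))    # True
  print(visited.might_contain('https://example.com/not-crawled'))  # almost certainly False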

Okay cool.

What about our network requests? Any way we can speed them up?

Before we can request a page from a server, we have to connect to it. And, setting up a connection has a few steps:

  1. First, we need to do a DNS query to convert the site's name into an IP address.
  2. Then, we need to set up a TCP connection with the server. Establishing a TCP connection takes a three-message handshake between our crawler and the server.
  3. With this setup done, we can start requesting pages from the server.
  4. Once we're done requesting pages, we have to cleanly close the connection with the server. That takes at least three messages.

This connection setup and teardown can add up to a lot of overhead in our crawler. If we need to request millions of pages from Wikipedia, it would be silly to start this connection process from scratch for each page.

How could we prevent this?

We could have each machine cache open connections. Once a page has been downloaded, keep the connection to the server open for a bit, so that we can quickly download other pages that are on the same site. Of course, we'd need to add some additional logic to each crawler machine to close out connections after some period of inactivity, to keep the total number of open connections reasonable.

We can also speed up our DNS queries by adding a shared DNS cache. That way, we only have to do the expensive name-to-IP-address translation once, store the result locally, and grab it from the cache the next time we need to connect to the site.

Our DNS cache could be shared between all of our crawler machines, preventing duplicate work between them.
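
Here's a rough single-machine sketch of both ideas using Python's standard library. The two caches are shown independently—wiring a custom DNS resolver into an HTTP client takes more plumbing than we need here—and a real crawler would also want a cap on open connections, eviction of idle ones, and handling for connections the server has already closed:

  import socket
  import http.client
  from urllib.parse import urlparse

  dns_cache = {}          # hostname -> IP address
  open_connections = {}   # hostname -> persistent HTTP connection

  def resolve(host):
      """Look up a host's IP address, reusing the cached result when we have one."""
      if host not in dns_cache:
          dns_cache[host] = socket.gethostbyname(host)
      return dns_cache[host]

  def fetch(url):
      """Download a page, reusing an open connection to the host if we have one."""
      parsed = urlparse(url)
      host = parsed.netloc
      if host not in open_connections:
          open_connections[host] = http.client.HTTPSConnection(host, timeout=10)
      connection = open_connections[host]
      connection.request('GET', parsed.path or '/')
      return connection.getresponse().read()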

Interviewer: "Great. We're almost out of time. Can you sketch out the final system you've designed?"

Our crawler system is made up of multiple crawler machines that work in parallel. URLs are assigned to machines using consistent hashing. Each machine runs independently, except for coordination through a distributed lock manager (to cap the number of requests sent to a single site at once) and a shared DNS cache.

[diagram]

Each crawler machine has multiple worker threads that process URLs from the machine's queue in parallel. Threads verify that a URL hasn't already been fetched (first checking the Bloom filter and falling back on querying the full stored set if necessary). Once a page has been fetched, its URL and contents are added to the machine's database. Each machine caches open connections to sites, to reduce the overhead of establishing new connections.

[diagram]

Having fun? Here are some other factors we'd want to consider before launching our crawler onto the Internet.

  • Some sites have a robots.txt file that specifies what portions of the site a crawler shouldn't visit. How could we make sure our crawler respects a site's robots.txt file?
  • Some malicious "honeypot" sites will continually create new pages with unique links to trap crawlers. How could we avoid getting stuck?
  • How could we handle pages that aren't plain HTML, but are instead rendered with JavaScript after being loaded?
