You only have 3 free questions left (including this one).

But it doesn't have to end here! Sign up for the 7-day coding interview crash course and you'll get a free Interview Cake problem every week.

I'm making a search engine called MillionGazillion™.

I wrote a crawler that visits web pages, stores a few keywords in a database, and follows links to other web pages. I noticed that my crawler was wasting a lot of time visiting the same pages over and over, so I made a hash set, visited, where I'm storing URLs I've already visited. Now the crawler only visits a URL if it hasn't already been visited.

Thing is, the crawler is running on my old desktop computer in my parents' basement (where I totally don't live anymore), and it keeps running out of memory because visited is getting so huge.

How can I trim down the amount of space taken up by visited?

Gotchas

Your strategy shouldn't take a hit on runtime.

Replacing common substrings like ".com" and "www" with characters that aren't allowed in URLs definitely wins us something, but we can do even better. How can we even further exploit overlaps or shared prefixes between URLs?

Breakdown

Start your free trial!

Log in or sign up with one click to get immediate access to 3 free mock interview questions

Where do I enter my password?

Actually, we don't support password-based login. Never have. Just the OAuth methods above. Why?

  1. It's easy and quick. No "reset password" flow. No password to forget.
  2. It lets us avoid storing passwords that hackers could access and use to try to log into our users' email or bank accounts.
  3. It makes it harder for one person to share a paid Interview Cake account with multiple people.

Solution

Start your free trial!

Log in or sign up with one click to get immediate access to 3 free mock interview questions

Where do I enter my password?

Actually, we don't support password-based login. Never have. Just the OAuth methods above. Why?

  1. It's easy and quick. No "reset password" flow. No password to forget.
  2. It lets us avoid storing passwords that hackers could access and use to try to log into our users' email or bank accounts.
  3. It makes it harder for one person to share a paid Interview Cake account with multiple people.

Complexity

How much space does this save? This is about to get MATHEMATICAL.

How many characters were we storing in our flat hash map approach? Suppose visited includes all possible URLs of length 5 or fewer characters. Let's ignore non-alphabetical characters to simplify, sticking to the standard 26 English letters in lowercase. There are 26526^5 different possible 5-character URLs (26 options for the first character, times 26 options for the 2nd character, etc), and of course 26426^4 different possible 4-character URLs, etc. If we store each 5-character URL as a normal string in memory, we are storing 55 characters per string, for a total of 52655 * 26^5 characters for all possible 5-character strings (and 42644 * 26^4 total characters for all 4-character strings, etc). So for all 1, 2, 3, 4, or 5 character URLs, our total number of characters stored is:

5265+4264+3263+2262+12615 * 26^5 + 4 * 26^4 + 3 * 26^3 + 2 * 26^2 + 1 * 26 ^ 1

So for all possible URLs of length nn or fewer, our total storage space is:

n26n+(n1)26(n1)+...+1261n26^n + (n-1)26^{(n-1)} + . . . + 1 * 26 ^ 1

This is O(n26n)O(n26^n).

How many characters are stored in our trie? The first layer has 26 nodes (and thus 26 characters), one for each possible starting character. On the second layer, each of those 26 nodes has 26 children, for a total of 26226^2 nodes. The fifth layer has 26526^5 nodes. To store all 1, 2, 3, 4, or 5 character URLs our trie will have 5 layers. So the total number of nodes is:

265+264+263+262+26126^5 + 26^4 + 26^3 + 26^2 + 26^1

So for all URLs of length nn or fewer, we have:

26n+26(n1)+...+26126^n + 26^{(n-1)} + ... + 26^1

This is O(26n)O(26^n). We've shaved off a factor of nn.

Bonus trivia: although the HTTP spec allows for unlimited URL length, in practice many web browsers won't support URLs over 2,000 characters.

What We Learned

Start your free trial!

Log in or sign up with one click to get immediate access to 3 free mock interview questions

Where do I enter my password?

Actually, we don't support password-based login. Never have. Just the OAuth methods above. Why?

  1. It's easy and quick. No "reset password" flow. No password to forget.
  2. It lets us avoid storing passwords that hackers could access and use to try to log into our users' email or bank accounts.
  3. It makes it harder for one person to share a paid Interview Cake account with multiple people.

Do you have an answer?

6
7
8
# Implement a trie and use it to efficiently store strings
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Reset editor

Powered by qualified.io

. . .