But it doesn't have to end here! Sign up for the 7-day coding interview crash course and you'll get a free Interview Cake problem every week.
You're in!
You want to build a word cloud, an infographic where the size of a word corresponds to how often it appears in the body of text.
To do this, you'll need data. Write code that takes a long string and builds its word cloud data in a hash, where the keys are words and the values are the number of times the words occurred.
Think about capitalized words. For example, look at these sentences:
What do we want to do with "After", "Dana", and "add"? In this example, your final hash should include one "Add" or "add" with a value of 2. Make reasonable (not necessarily perfect) decisions about cases like "After" and "Dana".
Assume the input will only contain words and standard punctuation.
You could make a reasonable argument to use regex in your solution. We won't, mainly because performance is difficult to measure and can get pretty bad.
Are you sure your code handles hyphenated words and standard punctuation?
Are you sure your code reasonably handles the same word with different capitalization?
Try these sentences:
We can do this in runtime and space.
The final hash we return should be the only data structure whose length is tied to n.
We should only iterate through our input string once.
We'll have to go through the entire input string, and we're returning a hash with every unique word. In the worst case every word is different, so our runtime and space cost will both be at least .
This challenge has several parts. Let's break them down.
How would you start the first part?
We could use the built-in split method to separate our words, but if we just split on spaces we'd have to iterate over all the words before or after splitting to clean up the punctuation. And consider em dashes or ellipses, which aren't surrounded by spaces but nonetheless separate words. Instead, we'll make our own split_words method, which will let us iterate over the input string only once.
How can we check if a character in our input string is a letter?
Two good options are to build a helper method or to use regular expressions. Either will work for this problem. We'll build our own helper method is_letter.
Now how can we split each word? Let's assume, for now, that our helper method will return an array of words.
We'll iterate over all the characters in the input string. How can we identify when we've reached the end of a word?
Here's a simple example. It doesn't work perfectly yet—you'll need to add code to handle the end of the input string, hyphenated words, punctuation, and edge cases.
If you thought of building up the word character by character (using <<), that definitely works too. :)
Now we've solved the first part of the challenge, splitting the words. The next part is populating our hash with unique words. What do we do with each word?
If the word is in the hash, we'll increment its count. Otherwise, we'll add it to the hash with a count of 1.
Alright, last part! How should we handle words that are uppercase and lowercase?
Consider these sentences:
Take some time to think of possible approaches. What are some other sentences you might run into. What are all your options?
When are words that should be lowercase not?
Why not?
What are the ideal cases we'd want in our hash?
Here are a few options:
What are the pros and cons for each one?
Pros and cons include:
Any of these could be considered reasonable. Importantly, none of them are perfect. They all have tradeoffs, and it is very difficult to write a highly accurate algorithm. Consider "cliff" and "bill" in these sentences:
You can choose whichever of the options you'd like, or another option you thought of. For this breakdown, we're going to choose option (1).
Now, how do we update our add_word_to_hash method to avoid duplicate words?
Think about the different possibilities:
Moving forward, we can either:
We'll choose the second approach since it will save us a walk through our hash. How should we start?
If the word we're adding is already in the hash in its current case, let's increment its count. What if it's not in the hash?
There are three possibilities:
Let's start with the first possibility. What do we want to do?
Because we only include a word as uppercase if it is always uppercase, we simply increment the lowercase version's count.
What about the second possibility?
This is a little more complicated. We need to remove the uppercase version from our hash if we encounter a lowercase version. But we still need the uppercase version's count!
Finally, what if the word is not in the hash at all?
Easy—we add it and give it a count of 1.
Now we have all our pieces! We can split words, add them to a hash, and track the number of times each word occurs without having duplicate words of the same case. Can we improve our solution?
Let's look at our runtime and space cost. We iterate through every character in the input string once and then every word in our array once. That's a runtime of , which is the best we can achieve for this challenge (we have to look at the entire input string). The space we're using includes an array for each word and a hash for every unique word. Our worst case is that every word is different, so our space cost is also , which is also the best we can achieve for this challenge (we have to return a hash of words).
But we can still make some optimizations!
How can we make our space cost even smaller?
We're storing all our split words in a separate array. That at least doubles the memory we use! How can we eliminate the need for that array?
Right now, we store each word in our array as we split them. Instead, let's just immediately populate each word in our hash!
In our solution, we make four decisions:
To split the words in the input string and populate a hash of the unique words to the number of times they occurred, we:
If the input word is uppercase and there's a lowercase version in the hash, we increment the lowercase version's count. If the input word is lowercase and there's an uppercase version in the hash, we "demote" the uppercase version by adding the lowercase version and giving it the uppercase version's count.
Runtime and memory cost are both . This is the best we can do because we have to look at every character in the input string and we have to return a hash of every unique word. We optimized to only make one pass over our input and have only one data structure.
We haven't explicitly talked about how to handle more complicated character sets. How would you make your solution work with more unicode characters? What changes need to be made to handle silly sentences like these:
I'm singing ♬ on a ☔ day.
☹ + ☕ = ☺.
To handle capitalized words, there were lots of heuristics and approaches we could have used, each with their own strengths and weaknesses. Open-ended questions like this can really separate good engineers from great engineers.
Good engineers will come up with a solution, but great engineers will come up with several solutions, weigh them carefully, and choose the best solution for the given context. So as you're running practice questions, challenge yourself to keep thinking even after you have a first solution. See how many solutions you can come up with. This will grow your ability to quickly see multiple ways to solve a problem, so you can figure out the best solution. And use the hints and gotchas on each Interview Cake question—they're designed to help you cultivate this skill.
Reset editor
Powered by qualified.io