You left your computer unlocked and your friend decided to troll you by copying a lot of your files to random spots all over your file system.
Even worse, she saved the duplicate files with random, embarrassing names ("this_is_like_a_digital_wedgie.txt" was clever, I'll give her that).
Write a function that returns a linked list of all the duplicate files. We'll check them by hand before actually deleting them, since programmatically deleting files is really scary. To help us confirm that two files are actually duplicates, return a linked list of FilePaths structures with variables for the original and duplicate paths:
For example:
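A minimal sketch of what that return value might look like, assuming a simple FilePaths record with one field per path (the field names and sample paths are made up for illustration, and a plain Python list stands in for the linked list):

```python
from collections import namedtuple

# Hypothetical shape for the FilePaths structure from the prompt:
# the path your friend created, and the original it was copied from.
FilePaths = namedtuple('FilePaths', ['duplicate_path', 'original_path'])

# Illustrative output for two discovered duplicates:
duplicates = [
    FilePaths(duplicate_path='/tmp/this_is_like_a_digital_wedgie.txt',
              original_path='/home/you/documents/vacation_notes.txt'),
    FilePaths(duplicate_path='/etc/definitely_not_a_copy.mp3',
              original_path='/home/you/music/favorite_song.mp3'),
]
```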
You can assume each file was only duplicated once.
Are you correctly handling child folders as well as sibling folders? Be careful that you're traversing your file tree correctly...
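If you want a concrete starting point, here's a minimal iterative traversal sketch in Python (assuming we're allowed to use the os module); the explicit stack is what keeps both child folders and sibling folders covered:

```python
import os

def all_file_paths(root_path):
    """Yield every file path under root_path, visiting child and sibling folders."""
    directories_to_visit = [root_path]  # stack of directories we still need to walk

    while directories_to_visit:
        current_directory = directories_to_visit.pop()
        for entry_name in os.listdir(current_directory):
            full_path = os.path.join(current_directory, entry_name)
            if os.path.isdir(full_path):
                # Child folder: push it so its contents get visited later.
                directories_to_visit.append(full_path)
            else:
                yield full_path
```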
When you find two files that are the same, don't just choose a random one to mark as the "duplicate." Try to figure out which one your friend made!
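One reasonable heuristic, assuming the copies were made after the originals and so carry later timestamps (sketched with os.path.getmtime; the function name is ours):

```python
import os

def pick_duplicate(path_a, path_b):
    """Guess which of two identical files is the copy: the one touched more recently."""
    if os.path.getmtime(path_a) > os.path.getmtime(path_b):
        return path_a  # newer file, so probably the friend's copy
    return path_b
```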
Does your solution work correctly if it's an empty file system (meaning the root directory is empty)?
Our solution takes O(n) time and O(n) space, where n is the number of files. Is your solution order of the total size on disk of all the files? If so, you can do better!
To get our time and space costs down, we took a small hit on accuracy—we might get a small number of false positives. We're okay with that since we'll double-check before actually deleting files.
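Here's roughly the kind of fingerprint we mean, assuming we hash just a fixed-size sample from each file instead of its full contents (the sample size and hash algorithm are arbitrary choices for the sketch):

```python
import hashlib
import os

SAMPLE_SIZE_BYTES = 4000  # arbitrary; large enough to make accidental collisions rare

def sample_hash(file_path):
    """Fingerprint a file by hashing its first and last few kilobytes."""
    file_size = os.path.getsize(file_path)
    hasher = hashlib.sha512()

    with open(file_path, 'rb') as file_to_hash:
        if file_size < SAMPLE_SIZE_BYTES * 2:
            # Small file: just hash the whole thing.
            hasher.update(file_to_hash.read())
        else:
            # Big file: hash a sample from the start and a sample from the end.
            hasher.update(file_to_hash.read(SAMPLE_SIZE_BYTES))
            file_to_hash.seek(file_size - SAMPLE_SIZE_BYTES)
            hasher.update(file_to_hash.read(SAMPLE_SIZE_BYTES))

    return hasher.hexdigest()
```

Two files with different contents can still end up with the same sample hash, which is exactly the false-positive risk mentioned above.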
Each "fingerprint" takes time and space, so our total time and space costs are where n is the number of files on the file system.
If we add the last-minute check to see if two files with the same fingerprints are actually the same files (which we probably should), then in the worst case all the files are the same and we have to read their full contents to confirm this, giving us a runtime that's order of the total size of our files on disk.
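As a sketch of that last-minute check (leaning on Python's standard library), a byte-by-byte comparison we'd run only for files whose fingerprints collide:

```python
import filecmp

def confirm_duplicate(path_a, path_b):
    """Byte-by-byte comparison; only run when two fingerprints match."""
    # shallow=False makes filecmp compare file contents, not just os.stat() metadata.
    return filecmp.cmp(path_a, path_b, shallow=False)
```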
If we wanted to get this code ready for a production system, we might want to make it a bit more modular. Try separating the file traversal code from the duplicate detection code.
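One way that split might look, as a rough sketch that reuses the hypothetical all_file_paths and sample_hash helpers from the sketches above (for brevity it treats the first file seen as the original; the timestamp heuristic could slot in here instead):

```python
def find_duplicate_files(root_path):
    """Wire traversal and detection together while keeping each piece swappable."""
    fingerprint_to_first_path = {}
    duplicates = []

    for file_path in all_file_paths(root_path):       # traversal lives in its own function
        fingerprint = sample_hash(file_path)           # detection only ever sees file paths
        if fingerprint in fingerprint_to_first_path:
            duplicates.append((file_path, fingerprint_to_first_path[fingerprint]))
        else:
            fingerprint_to_first_path[fingerprint] = file_path

    return duplicates
```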
What about concurrency? Can we go faster by splitting this procedure into multiple threads? Also, what if a background process edits a file while our script is running? Will this cause problems?
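Fingerprinting is mostly disk I/O, so threads can overlap the reads. One possible shape for that, sketched with concurrent.futures and the hypothetical sample_hash from above (it doesn't address files changing underneath us; that question needs its own answer):

```python
from concurrent.futures import ThreadPoolExecutor

def fingerprint_all(file_paths, max_workers=8):
    """Fingerprint many files concurrently; returns (path, fingerprint) pairs."""
    paths = list(file_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() preserves input order, so the zip below lines up correctly.
        fingerprints = list(executor.map(sample_hash, paths))
    return list(zip(paths, fingerprints))
```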
What about link files (files that point to other files or folders)? One gotcha here is that a link file can point back up the file tree. How do we keep our file traversal from going in circles?
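A minimal guard against those cycles, assuming a POSIX-style file system: resolve each directory to its real path before visiting it, and skip anything we've already seen (the helper name is ours):

```python
import os

def unvisited_subdirectories(current_directory, visited_real_paths):
    """Yield subdirectories we haven't walked yet, resolving links so loops get caught."""
    for entry_name in os.listdir(current_directory):
        full_path = os.path.join(current_directory, entry_name)
        if not os.path.isdir(full_path):
            continue
        real_path = os.path.realpath(full_path)  # resolve symlinks to their target
        if real_path in visited_real_paths:
            continue  # already walked this directory, possibly via a link
        visited_real_paths.add(real_path)
        yield full_path
```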