Questions about string matching are pretty often in interviews and this is one of them. I've seen two variations online: one to split the string in meaningful words and another one to determine all possible meaningful words that can be produced out of sub-strings of the input.

First one seems easy - you just go once through the string and output what you've matched, thus O(n) where n is the number of the input string's characters. Second one would involve trying all possible sub-words, which would be O(nk) where k is the length of the longest word.

Now another problem is how would you represent a dictionary. A hash map won't work here because there will be too many collisions and the hash map would loose its efficiency. The first solution here would be a trie and that's what I used in both mentioned problems. An example of an implementation of a trie can be found here.

Now can it be done faster? Yes, we could use the Aho-Corasick string matching algorithm which uses a finite state machine that resembles to a trie. That will improve our time in the second problem from O(nk) to O(n), but don't forget that we have the pre-processing time. This may not be asked to be implemented in an interview, but it's good to mention.

I am using this variant for PHP5. It's pretty fast and simple.

ReplyDelete