Pages

Sunday, July 28, 2013

Finding The Median In Large Sets Of Numbers Split Across 1000 Servers

How would you find the median across a thousand servers with a billion of numbers each?

This is a question that involves lots of discussion because it may be quite vague and may require taking some assumptions. Obviously we can't do any kind of in-memory sorting because we don't have enough memory. We may possibly fit one set at a time, which would be under 1 GB of memory.

The first obvious solution is an external merge sort and then a look up of the n/2 element (or the average of n/2 and n/2 + 1 on even n's). This solution would be n log n, but you may look for an O (n) solution, which someone may find impossible, since you need a sorted set to determine the median that involves element comparison which leads to the n log n complexity. But, who says we need sort by comparison.. ?

Another solution comes from our observation that we need to find the n/2 element in a sorted array, which can be generalized to the order statistics algorithm of finding the kth smallest number in an unsorted array. This algorithm is also called quick select or the selection algorithm. I talked about an implementation of this algorithm in this article. The problem is that this is an in-memory algorithm and we would not have enough memory to maintain all the sets. So we will need a distributed version of this algorithm which will basically need a routine to swap numbers across servers. We have to assume that this is an optimized and efficient routine, because there's just too many factors that can affect the efficiency of this routine (network latency, connection problems). This is a O(n) solution and the distributed version of kth smallest may look like this:


Another solution comes from the second observation that we shouldn't use a comparison sort. We could use a histogram of the counts of the numbers across the servers, like in the counting sort. But here we would have to assume a few more things. First we have to know something about the range of the numbers. If we have ranges of order of billions we could store an array of a few billions cells or at least one billion since the problem statement allows us to process a billion numbers in memory, which is the second assumption. Again this is a O(n) solution because we compute the histogram by going once through all the numbers and then find the kth number (the median) by going through the elements of the histogram. In code it may look similar to this:



We could think of more solutions if we could assume even more facts. Let's say we know that the numbers across all servers are uniformly distributed. Having this large amount of numbers we could say the median is also the average of all numbers. So if we also know the range of distribution, say 0.. 1 billion, then computing the median is a O(1) operation and all we need to do is to compute 0 + (1 billion - 0) / 2 which is 500 million. If we don't know the range we can compute the median of medians in O(n) by using order statistics on each server.

If the distribution is not uniform but we know some information about it we could still calculate the median with some probability. Of course we can find other different solutions if we take various assumptions, but the above solutions can probably satisfy any interviewer or whoever asked this question.

Friday, July 19, 2013

Printing Outside Nodes (Boundary) Of A Binary Tree

Contrary to many solutions I found online for this problem - I'm going to say a breadth first search (while printing the first and the last element) is not appropriate here and it won't help in cases where the tree has leafs at levels closer to the root, and those algorithms would miss that. The solution here is a depth first pre-order algorithm, but let me explain the problem statement first. So having the tree below (it is a binary search tree), print the boundary nodes of the tree anticlockwise. Printing the boundary nodes for our tree should result in the following sequence: [60, 41, 16, 25, 42, 55, 62, 64, 70, 65, 74].

Simple Binary Search Tree
Fig. 1 - Simple Binary Search Tree

A pre-order traversal is more appropriate here because it traverses branches in the way we need; for example, it will traverse [60, 41, 16, 25] first which is what we need and we print them. The next sequences are of no interest to us, so we could add a flag that determines whether nodes should be printed out or not (ex: print only leaf nodes). So out of [53, 46, 42, 55, 74, 65, 63, 62, 64, 70] we would print only leaf nodes [42, 55, 62, 64, 70]. We're left with right branch only, where we have to print all the non-leaf nodes. For this we could have a separate stack where, while traversing we add the non-leaf nodes. In code this may look like this:




Wednesday, July 17, 2013

Matching An Input String With A Given Dictionary

Questions about string matching are pretty often in interviews and this is one of them. I've seen two variations online: one to split the string in meaningful words and another one to determine all possible meaningful words that can be produced out of sub-strings of the input.

First one seems easy - you just go once through the string and output what you've matched, thus O(n) where n is the number of the input string's characters. Second one would involve trying all possible sub-words, which would be O(nk) where k is the length of the longest word.

Now another problem is how would you represent a dictionary. A hash map won't work here because there will be too many collisions and the hash map would loose its efficiency. The first solution here would be a trie and that's what I used in both mentioned problems. An example of an implementation of a trie can be found here




Now can it be done faster? Yes, we could use the Aho-Corasick string matching algorithm which uses a finite state machine that resembles to a trie. That will improve our time in the second problem from O(nk) to O(n), but don't forget that we have the pre-processing time. This may not be asked to be implemented in an interview, but it's good to mention.

Tuesday, July 16, 2013

Re-creating A Binary Search Tree Given Its Pre-order Traversal

As a reminder a pre-order traversal consists of visiting the root first then the left and right child. Thus, the first element in a pre-order traversal will always be the root of the tree. It may be easier to find a quick solution by looking at this post where I showed how to validate a tree. Our interest lays in the recursive solution. We could traverse the array and create the tree in the same manner it is traversed in the above mentioned solution, by maintaining a minimum and maximum value.





Note:
Using the external index here is because passing it as an integer argument will not work, since Java passes arguments by value. To avoid using the external index, we could have a mutable integer class or an array of ints, where the first cell is the index value:


Implementing a mutable integer class is fairly easy but it is mostly re-inventing the wheel - Apache Commons has an implementation of this class that could be used here.

Wednesday, July 10, 2013

Fast Fibonacci

I was looking for a faster Fibonacci algorithm and stumbled across this one. The doubling method assumes we know F(n) and F(n+1) to produce F(2n) and F(2n+1) out of the following formulas:

F(2n) = F(n) * (2*F(n+1) - F(n))
F(2n+1) = F(n+1)2 + F(n)2


The formulas can be extracted from the matrix exponentiation algorithm mentioned in the above article and reduced of redundant calculations.




Note:
Integer.numberOfLeadingZeros is a Java shortcut to get to the highest set bit. It may be implemented in some other way in different languages. It is needed here to avoid a few iterations; in our case considering we have 32 iterations and we can get an n up to 92, we can save up to 24 iterations.

Sunday, July 7, 2013

Verifying A Sudoku Solution

This one may be easy for Sudoku fans. I used to play it so it wasn't difficult to come up with the rules that would define the algorithm. For those who haven't played, Sudoku is a number game where you have a board (usually 9 by 9) divided in 9 sub-grids (blocks 3 by 3). Every cell of the board can contain a number from 1 to 9. The goal is to fill the board with numbers in such a way that numbers are unique column-wise, row-wise and block-wise. The board comes with some numbers filled in. For more information check the Wikipedia page. The above Sudoku game was of order 3, but we may want our problem to be a little less restrictive and accept higher orders too. Thus if it would accept a Sudoku board of order n, the board would have ncolumns and n2 rows; the blocks would be n by n, and we would have to fill in numbers from 1 to n2.

Our algorithm will have at least O(n2) time complexity since we have to check each element on the board. It sounds like we could one table traversal and check each element's validity. But how would you check one element when it depends on other elements that may be on the same row/column/block? We could use a Set array of size nfor columns, another array of same size for rows and a bi-dimensional array of size n for blocks. Thus, while traversing the board we note each element in the associated set or if it is already there then we stop the algorithm since we have an invalid solution.

In code looks like this:




It seems like we could improve the code a little by using some bit patterns instead of sets. We could set a bit for each existing number. For example if we already have 3 and 1 our bit pattern should look like this: 101. We would need some simple bit manipulation logic to check and set a bit. I used longs so that I can allow Sudoku boards of orders up to 8:




Helper initialize method and test data for convenience:



Saturday, July 6, 2013

Array Rotation By A Particular Amount

John Bentley's "Programming Pearls" again, with the rotated array problem.  So we have a one-dimensional array of N elements that we have to rotate left by I positions. We have to do this in time proportional to N and in O(1) space. So for example if we have the array [1 2 3 4 5 6 7 8 9] and we want to rotate it by 3 positions we should result in this array: [4 5 6 7 8 9 1 2 3]. Note that the array is not necessarily sorted.

There's plenty of solutions that may come in our minds at first but they probably require additional temporary space. For example we could copy the I elements into a temporary vector, then move to the left all the rest in the original array by I positions and finally copy the elements from the temporary array to the end of the original array. As you see this is too expensive space wise.

To solve this problem we will require a different view of the problem and some time until you hit that aha! moment. We basically need to transform AB to BA, where A and B are sub-arrays of the original array. Imagine we have a routine that can reverse the elements in a specific portion of the original array (A, B). So we reverse elements in A first resulting in ARB and then reverse the elements in B resulting in ARBR. And finally we need to reverse ARBR to produce (ARBR)R which is in fact the array that we were looking for BA.

This may be easier to understand with an example. So we have our array arr = [1, 2, 3, 4, 5, 6, 7, 8, 9] that we need to rotate left by 3. We need to first rotate the sub-array [1, 2, 3] resulting in [3, 2, 1, 4, 5, 6, 7, 8, 9] and then we rotate [4, 5, 6, 7, 8, 9] resulting in [3, 2, 1, 9, 8, 7, 6, 5, 4]. And finally we reverse the whole array resulting in [4, 5, 6, 7, 8, 9, 1, 2, 3] which is the array we were looking for.

In code it looks even easier to understand, considering that we already have the routine that does the reversal of the elements in an array:


The routine for reversing is just a for loop that is of O(n/2) complexity, where n is the number of elements it has to reverse. The routine may look like this:



Note:
- the swap routine is trivial and may look like this:

Thursday, July 4, 2013

The Subset Problem

There are many variations of this problem, I will stay on the general problem of finding all subsets of a set. For example if our set is [1, 2, 3] - we would have 8 (2 to the power of 3) subsets: {[], [1], [2], [3], [1, 2], [1, 3], [1, 2, 3], [2, 3]}. So basically our algorithm can't be faster than O(2^n) since we need to go through all possible combinations.

There's a few ways of doing this. I'll mention two ways here - the recursive way, that we've been taught in high schools; and using a bit string.

Using a bit string involves some bit manipulation but the final code can be found easy to understand. The idea  is that all the numbers from 0 to 2^n are represented by unique bit strings of n bit width that can be translated into a subset. So for example in the above mentioned array we would have 8 numbers from 0 to 7 inclusive that would have a bit representation that is translated using the bit index as the index of the array.

Nr Bits Combination
0 000 {}
1 001 {1}
2 010 {2}
3 011 {1, 2}
4 100 {3}
5 101 {1, 3}
6 110 {2, 3}
7 111 {1, 2, 3}


It may look easier in the code:


Note:
- The Math.pow call computes 2 to the power of n
- The findLowestBitSet is a method I mentioned earlier. In Java it could look like this:



Or it could be optimized to look like this:


For comparison find below the code for the recursive algorithm: