As I haven't found any explanation on the web about this useful algorithm, I've decided to write this post. Consider it a follow-up to the previous post about the n-smallest algorithm.
Disclaimer
Actually, the problem we wanted to solve in that post was not really the generic "find the median", but to find the K-th element in an unsorted list of length 2K. We are going to shorten it to "find the median", but be aware that it is a bit different.
Finding the median
Finding the median of an array is probably not a common operation, and usually a fast algorithm for it is not necessary either. A non-optimized version is enough for most people:
def find_median_with_sort(seq):
    l = sorted(seq)
    assert len(l) > 0
    return l[len(l) // 2]
The complexity of this algorithm is O(N log N), which is the complexity of the sorted() function in Python (and of the standard sort in most languages).
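For instance (note that for an even-length list this picks the upper of the two middle items):

>>> find_median_with_sort([7, 1, 5, 3])   # sorted: [1, 3, 5, 7]
5
>>> find_median_with_sort([2, 9, 4])      # sorted: [2, 4, 9]
4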
Small intro to Quicksort
One of the most famous sorting algorithms is Quicksort. It is a very efficient way to sort and it is particularly popular: the C standard library sorting function is even named qsort() after it.
Its steps are:
- Choose an item from the array, called the pivot.
- Split the items into two partitions: the ones smaller than the pivot and the rest.
- Recursively Quicksort both partitions.
The recursion stops when the number of items to sort is either zero or one. A good way to choose the pivot is simply to pick it at random.
The second operation is usually called partition. Well, let's say I usually call it partition. I will also name the two partitions: left is the one with the items smaller than the pivot, right is the other one.
Quicksort's worst-case complexity is quadratic; fortunately the worst case is very uncommon, and the average-case complexity is O(N log N), like most other sorting algorithms. It is also quite simple to implement in place, which is why it is such a popular algorithm. A minimal sketch of the whole thing follows.
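To make the steps above concrete, here is a minimal sketch of mine, not in place and certainly not what you would use in production (the equal bucket is only there so that duplicate items are not lost):

import random

def quicksort(s):
    # the recursion stops at zero or one item
    if len(s) <= 1:
        return list(s)
    # choose a random pivot
    p = random.choice(s)
    # partition: items smaller than the pivot go left, larger go right
    left = [x for x in s if x < p]
    equal = [x for x in s if x == p]
    right = [x for x in s if x > p]
    # recursively Quicksort both partitions
    return quicksort(left) + equal + quicksort(right)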
The variant
We've already said that a variant of this algorithm can find the median; obviously its average complexity is lower than the average complexity of the full Quicksort. Basically, the recursion is done on only one of the two partitions: the one containing the median.
Let's look more closely at the partition operation. Say we want to find the K-th item of the array (counting from zero). After partitioning the array around a random pivot, let L be the number of items in the left partition: if K < L we recurse on the left partition, if K == L the pivot itself is the answer, otherwise we look for the (K - L - 1)-th item in the right partition. A sketch of this selection is shown below.
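A minimal recursive sketch of that selection (the name quickselect and the zero-based k are my own choices; duplicates are ignored to keep it short):

import random

def quickselect(k, s):
    """Return the k-th smallest item of s (k counts from zero).

    Assumes s contains no duplicates, just to keep the sketch short.
    """
    p = random.choice(s)
    left = [x for x in s if x < p]
    right = [x for x in s if x > p]
    if k < len(left):
        # the k-th item is somewhere in the left partition
        return quickselect(k, left)
    if k == len(left):
        # the pivot itself is the k-th item
        return p
    # skip the left partition and the pivot
    return quickselect(k - len(left) - 1, right)

With this, the "median" of the disclaimer is just quickselect(len(seq) // 2, seq), the same index used by find_median_with_sort above.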
For the n-smallest algorithm we needed the first n items of an array; here is a simple Python implementation of a function to extract them:
import random

def quickpart(n, s):
    """Does a quicksort-like partition and returns the n smallest items.

    n -- number of items to select
    s -- a list of items (cannot be an iterator)

    >>> sorted(quickpart(2, [2, 4, 1, 6]))
    [1, 2]

    NOTE: This implementation doesn't work if `s` contains duplicates.
    """
    assert n > 0
    if n == 1:
        return [min(s)]
    # find a random pivot
    p = random.choice(s)
    # partition the items into two sets
    left = [x for x in s if x < p]
    right = [x for x in s if x > p]
    if len(left) > n:
        # the n smallest items are all in the left partition
        return quickpart(n, left)
    # the whole left partition belongs to the result
    n -= len(left)
    if n >= 1:
        # the pivot is the next smallest item
        left.append(p)
    if n > 1:
        # the remaining n - 1 items come from the right partition
        return left + quickpart(n - 1, right)
    return left
This implementation is simplified in many ways:
- It does not consider duplicates (homework ;-) )
- It is not optimized: the items list is scanned twice (is there a faster way in Python to partition a list? see the sketch after this list)
- Recursion in Python has quite a high overhead (and the recursion depth is limited)
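As a hint for the second point, here is one way (a sketch of mine; whether it is actually faster depends on the interpreter overhead of the explicit loop) to build both partitions in a single pass:

def partition_single_pass(p, s):
    """Split s around the pivot p in one scan instead of two."""
    left, right = [], []
    for x in s:
        if x < p:
            left.append(x)
        elif x > p:
            right.append(x)
    return left, right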