Random sampling and selecting the smallest items

I want to highlight a simple trick that I used in the sampling script of the language detector I wrote about in another post.

The trick I'm talking about is the one in the following line:

awk '{print rand() "\t" $0}' | sort -nk1 | cut -f 2-

(The rest of the script)

This shuffles the lines of the input file. Some versions of sort have an option to do that directly (GNU sort has -R, and GNU coreutils also provides shuf), but unfortunately the version bundled with OS X doesn't, so you need a trick like this one. Also note that on most awks rand() returns the same sequence on every run unless you seed it with srand().

The trick works as follows:

  1. First we add a column with a random number to each input line.
  2. Then we sort on this new column.
  3. Finally we remove the column.

It might sound weird, but I've seen people write a custom script or program just to do that, so I felt that sharing this might save someone some time.

Indeed, this "trick" is actually well known in the Perl community under the name Schwartzian Transform (decorate-sort-undecorate).
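
To make the idea concrete, here is a minimal Python sketch of the same decorate-sort-undecorate pattern applied to shuffling (the name shuffle_lines is mine, just for illustration):

    import random

    def shuffle_lines(lines):
        # Decorate: pair each line with a random key
        # (the awk '{print rand() "\t" $0}' step).
        decorated = [(random.random(), line) for line in lines]
        # Sort on the random key (the sort -nk1 step).
        decorated.sort(key=lambda pair: pair[0])
        # Undecorate: drop the key (the cut -f 2- step).
        return [line for _, line in decorated]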

Random sampling

Obviously, once you have a quick way to shuffle, you can also use it for uniform sampling. This was actually the original purpose of the script I wrote last time:

awk '{print rand() "\t" $0}' | sort -nk1 | cut -f 2- | head -n $lines

You just need to set $lines to the number of lines you want to keep. If you want to split the dataset into a training set and a test set, you can use split.
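
The same sampling in Python, reusing shuffle_lines from the sketch above (the file name and the sizes are made up for the example):

    # Shuffle once, then slicing gives a uniform sample
    # (the equivalent of head -n $lines).
    with open("dataset.tsv") as f:
        shuffled = shuffle_lines(f)
    sample = shuffled[:1000]

    # Or split into a training set and a test set.
    train, test = shuffled[:8000], shuffled[8000:]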

Keeping only the first lines of the sorted output means we really need only the smallest items. Python has a function in the heapq module called nsmallest that returns just the smallest set of size K (where K is a parameter). The algorithm used is a variant of heapsort:

  1. A heap is built with the first K items of the stream; the heap keeps the largest item at the head.
  2. For each following item of the stream we check if it is smaller than the head.
  3. If it is smaller, we replace the head with the new item and sift it down into the heap.

The complexity of the algorithm is O(N log K), and it is a common way to explain some of the advantages of heapsort over other kinds of sort: you don't have to sort all the items to get the first K, and if your data source is a stream you need to keep only K items in memory (the size of the heap). A minimal sketch of this approach follows.
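
This is not the actual heapq.nsmallest implementation, just the same idea in a few lines; heapq only provides a min-heap, so the values are negated to simulate a max-heap (which assumes numeric items):

    import heapq

    def nsmallest_heap(stream, k):
        it = iter(stream)
        # Max-heap with the first K items: values are negated, so
        # heap[0] corresponds to the largest of the K items kept.
        heap = [-x for _, x in zip(range(k), it)]
        heapq.heapify(heap)                  # O(K)
        for x in it:
            if -x > heap[0]:                 # x is smaller than the head
                heapq.heapreplace(heap, -x)  # pop head, push x: O(log K)
        return sorted(-x for x in heap)

    print(nsmallest_heap([5, 1, 9, 3, 7, 2], 3))  # [1, 2, 3]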

The unknown algorithm

For the purpose of random sampling there is a better algorithm that is not very commonly known: reservoir sampling. It works in streaming, has linear complexity, and is also very easy to implement.

I don't want to talk about that here, though. I want to focus on a better algorithm to extract the K smallest items from a stream. The algorithm is very simple, it has linear complexity, and it uses just twice the memory required by the heap-based one. The first time someone told me about it, two years ago, I was surprised I hadn't heard of it before. When I talk about it now I still find people who are surprised they hadn't heard of it before. That's why I've named it the unknown algorithm.

Clean up your monitor and read carefully, this is the algorithm:

  1. Read the first K items of the stream into a list L.
  2. Repeat until the end of the stream:
    1. Read K more items and append them to L.
    2. Find the median of L, keep the K smallest items, and discard the others.
  3. Sort L if you need the smallest set to be sorted.

To find the median of a list there is a linear-time algorithm (median of medians) that is not so easy to implement; in practice a variant of quicksort, known as quickselect, is mostly used for this task because it is easier to implement and usually performs very well. (I'll write about it in another post.) A sketch of the whole algorithm is below.
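
Here is a minimal Python sketch under those assumptions: select_k is a quickselect that rearranges the list so that its first K elements are the K smallest (expected linear time), and nsmallest_stream is the algorithm itself; both names are mine, for illustration only.

    import random

    def select_k(items, k):
        # Quickselect: partition items in place so that items[:k]
        # holds the k smallest elements, in no particular order.
        lo, hi = 0, len(items) - 1
        while lo < hi:
            pivot = items[random.randint(lo, hi)]
            i, j = lo, hi
            while i <= j:
                while items[i] < pivot:
                    i += 1
                while items[j] > pivot:
                    j -= 1
                if i <= j:
                    items[i], items[j] = items[j], items[i]
                    i += 1
                    j -= 1
            if k - 1 <= j:
                hi = j       # the k-th smallest is in the left part
            elif k - 1 >= i:
                lo = i       # the k-th smallest is in the right part
            else:
                break        # items[:k] already holds the k smallest

    def nsmallest_stream(stream, k):
        buf = []
        for item in stream:
            buf.append(item)
            if len(buf) >= 2 * k:
                select_k(buf, k)  # expected O(K) on a list of 2K items
                del buf[k:]       # discard the K largest
        return sorted(buf)[:k]    # optional final sort, O(K log K)

    print(nsmallest_stream([9, 4, 7, 1, 8, 2, 6], 3))  # [1, 2, 4]

Note that buf never holds more than 2K items, which is where the "double of the memory" comes from.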

Let's see the complexity:

  • We still need to read all the items of the stream: N.
  • Taking K elements at a time, we find the median of a list of 2K items, which costs O(K) with a linear-time selection: (N/K) * O(K) = O(N).
  • Optionally we have to sort L at the end: K log K.

Here we are! The algorithm's complexity is linear. Don't you feel like you should have heard of it before? :-)
