Comparison of four Python implementations of the n choose k problem: randomly choose n elements from a list of k elements. The distribution must be uniform: equiprobability of the subsets is a necessary condition.
This work takes its roots from this blog post, which explains both the interest and the history of the problem and its solutions.
The time complexity is linear, and the memory complexity constant (if you do not store the chosen items).
The linear implementation provides not only a performance boost over the stdlib when the number of items is greater than 20 in the tests, but also greater flexibility: you don't have to provide the full list of items in order to walk them, only their count.
You can therefore choose n elements from a generator of k elements, without having to store the full generator, or even the n chosen elements if you process them immediately.
The current implementation in linear_choosen.py implements these uses and is standalone; you can therefore copy-paste it wherever you need it.
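For instance, assuming the module exposes a function `linear_choosen(items, n, k)` (a hypothetical signature; check the actual one in linear_choosen.py), sampling from a generator could look like this:

```python
# Hypothetical usage: the real signature in linear_choosen.py may differ.
from linear_choosen import linear_choosen

# Draw 5 elements from a one-million-element generator,
# without ever materializing the generator in memory.
evens = (2 * i for i in range(1_000_000))
for item in linear_choosen(evens, 5, 1_000_000):
    print(item)  # process each chosen item immediately
```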
- plot.py: plotting with matplotlib
- test.py: launch benchmarks
- linear_choosen.py: methods implementation
- results/: image outputs
The following four methods are implemented. The third one is the main interest of this repository. Further formal analysis of the method remains to be done.
The random.sample method does exactly this job. It automatically chooses between two internal implementations, depending on n. Link to the doc.
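For reference, a minimal use of the stdlib method:

```python
import random

# Take a uniform 3-element sample from a 10-element population.
population = list(range(10))
print(random.sample(population, 3))  # e.g. [7, 0, 4]
```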
The very obvious method: shuffle the list, then take the first n elements. Very costly.
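A minimal sketch of this approach (the helper name is ours):

```python
import random

def shuffle_sample(items, n):
    """Dumb method: shuffle a copy of the full list, keep the first n.

    Requires materializing all k items, then paying for a full shuffle.
    """
    items = list(items)
    random.shuffle(items)
    return items[:n]
```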
This algorithm is detailed in Vitter's paper An Efficient Algorithm for Sequential Random Sampling, published in 1987, with the mathematical proof. You can also find it in Knuth's Seminumerical Algorithms. (source: Jon Bentley's Programming Pearls)
This is an implementation proposal, which could probably be improved. It's less efficient than the stdlib when n is small, but notably quicker when n approaches k.
The first element has an n/k probability of being in the output subset.
If this element is chosen, then finding the next chosen ones reduces to choosing n-1 out of the remaining k-1.
This treatment is applied recursively over the k items.
This method achieves O(1) memory complexity and O(k) time complexity (worst case), because elements are walked only once, and deciding whether an element is chosen is a single comparison of a random number against a probability. The subset is therefore constructed during the walk over the elements. Not all k elements need to be walked: the walk can stop as soon as the remaining problem is 0 choose k-i.
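A minimal iterative sketch of this walk (function name and details are ours; the actual code in linear_choosen.py may differ):

```python
import random

def linear_sample(items, n, k):
    """Yield n uniformly chosen elements from an iterable of k elements.

    Walks the iterable once; each element is kept with probability
    (remaining to choose) / (remaining to walk), which reduces
    n choose k to either n-1 choose k-1 or n choose k-1 at each step.
    """
    for item in items:
        if n == 0:                      # 0 choose k-i: stop early
            return
        if random.random() < n / k:
            yield item
            n -= 1                      # one fewer element to choose
        k -= 1                          # one fewer element to walk

# Works directly on a generator: the k elements are never stored.
print(list(linear_sample(iter(range(1000)), 5, 1000)))
```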
Purely recursive implementation of the linear algorithm. Slower, and it doesn't work on big datasets without raising Python's recursion limit.
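A sketch of what such a recursive variant could look like (names are ours, not necessarily the repository's code):

```python
import sys
import random

def recursive_sample(items, n, k, i=0):
    """Purely recursive variant: one call per walked element,
    hence the need to raise Python's recursion limit for large k."""
    if n == 0 or i == len(items):
        return []
    if random.random() < n / (k - i):   # k - i elements remain to walk
        return [items[i]] + recursive_sample(items, n - 1, k, i + 1)
    return recursive_sample(items, n, k, i + 1)

sys.setrecursionlimit(10_000)           # required for big inputs
print(recursive_sample(list(range(100)), 5, 100))
```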
The stdlib implementation is better when n is small, but the linear implementation is quicker in the other cases.
The following graphics don't show any obvious distribution bias between the first and last elements.
These results are comparable to stdlib:
And to the dumb method:
Further analysis of the data could highlight a bias induced by the linear method. If so, it would probably be an implementation bias (bad source code) or a randomness bias (bad random generator). The method itself is proven.