Implementations of randomized sorting on large parallel machines
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this paper we present B-Flashsort, a new batched-routing variant of Flashsort designed to sort N > P values using P processors connected in a d-dimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of B-Flashsort over Flashsort is that it replaces pipelined splitter-directed routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of B-Flashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a d-dimensional toroidal mesh B-Flashsort improves on Samplesort when (N/P) < P/(c1log P + c2dP1/d + c3), for machine-dependent parameters c1, c2, and c3. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP-1 of Samplesort and two B-Flashsort variants.