Understanding Merge and Sort Algorithms on Modern Processors

10 min readFeb 6, 2020

This article explains the research paper, “Patience is a Virtue: Revisiting Merge and Sort on Modern Processors” published by Microsoft Research.

I intend to explain the research paper in a simple manner. The reason I chose this algorithm is due to its application scope — infrastructure monitoring and historical analysis of logs(event-based data). I have observed how tedious it is to sort event-logging datasets using algorithms like Timsort, especially when we have real-time monitoring of systems.

Your job is to get all the different sorted datasets into one sorted dataset for your central/global monitoring system.

The paper addresses the problem of sorting nearly-sorted data such as logs. Log data is huge in quantity and it is sorted at some level. In today’s context of distributed systems and cloud systems, there are many unsorted sorted log files. While there has been some work done to make logs queriable, such as Stripe’s canonical logging mechanism, we still have to merge sorted-logs into one globally sorted log system. Microsoft addresses this problem by aiming to solve merge and sort in memory using a technique P3 Sort which stands for Ping-Pong Patience Sort.

In the sorting literature, much work is done on randomly ordered data but not enough on almost sorted data. While addressing log files, most of our queries that we run on logs are temporal (data that represents a state in time) in nature and so, there should be a globally sorted (w.r.t time) log system. The industry mostly uses Timsort to perform this task but Microsoft’s P3Sort turned out to be faster and overall better than Timsort.

P3 sort turned out to be 20% faster than Quicksort on random data and 20% to 400% than Timsort for almost-sorted data. These time calculations are done in real and not just in theory.

Let me elaborate P3 sort: Ping-Pong Patience Sort. Ping-Pong is a way to merge sorted data and Patience Sort is an algorithm to sort almost-sorted data.

What is Patience Sort?

Patience game is often known as Solitaire Cards. You have a series of inputs, you create piles of sorted numbers and then, later on, merge all these piles into one resultant sorted array. I will use the example given in the paper:

[3,5,4,2,17,6,8,9,10]

Patience sort scans this array from left to right. At the first step, there are no piles so it creates one and puts 3:

[3] : run 1

Then, the algorithm sees 5, it checks whether any of the current piles formed can hold 5 i.e. if 5 can be placed in existing arrays and keep them sorted. The answer is yes, if you place 5 in run1 array, run1 will remain sorted.

[3,5]: run 1

Similarly, the following piles are formed:

run 1: 3, 5, 7, 8, 9, 10

run 2: 4, 6

run 3: 2

run 4: 1

The runtime complexity of this algorithm is O(n.log n). If the data is already sorted, only 1 run will be needed and so the complexity becomes O(n). As the data becomes more disordered(less sorted or not sorted at all), the number of runs increases and the time complexity becomes O(n.logn) which eventually becomes Merge Sort.

Microsoft Research improved Patience Sort using the following customizations:

Reduce the number of times memory is allocated during the sorting and merging phases.
Sequentialize memory access patterns.
Improving memory prefetching (technical advancement available in modern processors).

The authors decided to pre-allocate the buffer memory to the algorithm. The size of the buffer is decided in advance and in case the buffer fills up, larger buffers are allocated. In this way, the algorithm doesn’t have to allocate memory for a new run( recall that if a number cannot be placed in any of the current pile, a new pile is created. This new pile creation is memory allocation but due to buffer already granted, memory need not be allocated repeatedly).

In Patience Sort, you have to know the last element in each of the runs(piles) to decide where to append the current element or instead, make a new pile for it. Fetching the last element now needs O(1) time as Linked Lists are used where all the tails of the runs are maintained in the memory. You perform a binary search on all these tails to determine where you can place the current element or create a new run for it altogether. This enables fast append. In case the data is not nearly sorted or in fact, completely reversed, you can maintain a head node for all the runs as well, along with the tail nodes. This will help you append to both sides.

Following images will make this easier:

You can see how easy it is now to append the data at both the ends in case we get reverse-ordered logs.

Moving onto the merging phase, the authors discuss two ways:

Heaps
Tree-Q merge

They have optimized heap implementation by using iterations instead of recursion. Recursions are for humans to think easily and cut code but they are not memory efficient. Each stack of a recursion call demands memory for itself. And so, roughly and overly simplified: n recursive calls demand n times the memory needed.

Secondly, Tree-Q merge is a more efficient way of merging that heaps. It is a k-way merge that uses heap implementation but the size of k is such that all the data while merging fits in the L2 cache. L2 Cache feeds the L1 cache which feeds the processor. The optimal value for k is 1000 (as stated in the paper).

Regarding the memory allocation, since the size is k for each merge(merge b/w two sorted datasets), the unused memory often found in the leaf nodes is reused when needed, so, memory allocation is avoided. This boosts the performance.

Even then, the job is not completed as the performance of Patience Sort has been improved significantly but it still doesn’t match what Microsoft needed.

So, they introduce Ping-Pong Merge in the paper.

Ping-Pong Merge

To understand what Ping-Pong Merge is, we have to first set a base premise that binary merges are better than heaps when there is parallelism involved.

To read more about this, read the following (or anyone of them) research papers:

J. Chhugani et al. Efficient Implementation of Sorting on MultiCore SIMD CPU Architecture. In VLDB, 2008.
P. Larson. External Sorting: Run Formation Revisited. IEEE Trans. Knowl. Data Eng. 15(4): 961–972 (2003).
C. Balkesen et al. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. In VLDB, 2014.
M.-C. Albutiu et al. Massively parallel sort-merge joins in main memory multi-core database systems. In VLDB, 2012.
C. Kim et al. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. In VLDB, 2009.

The paper cites these research papers to establish the aforementioned base premise: binary merges better than heap merge.

Ping-Pong Merge uses binary merging technique as opposed to the famous heap-based merging techniques. Binary merging is as following:

If k = 1, return [single list]
If k = 2, [sort the two using comparison and return the sorted list of size 2]
Else call the binary-merge function recursively for [0:k/2] and [k/2: k] where k = [0, length(input)]

This is the divide-and-conquer approach used in Merge Sort.

The paper discusses 2 variants of Ping-Pong Merge:

Balanced Ping Pong Merge
Unbalanced Ping Pong Merge

Balanced Ping Pong Merge

Input: single array combined of multiple sorted runs. For eg [3,5,7,8,9,10, 4, 6, 2, 1]. Notice the bold numbers. Each of them marks a separate pile (shown at the beginning of this blog) formed after completing the first phase of Patience Sort.
Merge the adjacent runs. The above input becomes: [3, 4,5,6,7,8,9,10, 1, 2]
Merge the above array again: [1,2,3,4,5,6,7,8,9,10]
This algorithm goes back and forth (hence, ping-pong) until the complete array is sorted.

Advantages of this algorithm:

No memory allocation(all happens in one array).
O(n*log(r)) is the time complexity, same as the heap approach, optimal. Note that the r denotes the number of runs.
Cache friendly: all you need is 3 cache lines: i) input ii) output iii) data that is read and is being merged.
Modern processors take advantage of this algorithm’s sequential nature and pre-fetch the memory needed.

Unbalanced Ping Pong Merge

One fact: merging two sorted runs at a time, it is efficient to first merge the small runs before large ones. This can be easily understood by any sorting book or algorithm. Read about Merge Sort here and observe the scans it takes to merge the two arrays.

The paper improved the unbalanced Ping-Pong Merge in the following way:

In the final array, place the runs in a sorted manner. For eg: [3,5,7,8,9,10, 4, 6, 2, 1]. → [2, 1, 4, 6, 3, 5, 7, 8, 9, 10]. As you can see, 2 and 1 are of length 1 each, 4 of 2 and then the last one with the highest length.
Merge 2 pairs at a time: from smallest to largest. Then update the merge position back to the first two runs. This ensures that you almost every time merge the smallest possible arrays.

Ping Pong Patience Sort and some improvements

The authors call the raw form of Ping-Pong Merge and Patience Sort as naive P3 Sort. They call it naive because they suggest improvements later on.

Cache-Sensitive P3 Sort

This customization aims to reduce the time spent in the first phase: Patience Sort. As the dataset’s size increases, the time spent in the first phase increases. CS(Cache Sensitive) P3 Sort plays on the fact caching is an issue with naive-P3 sort’s large percentage of time in the first phase. So, if you limit the number of runs to which you can append the new data(numbers from the input), you will have data under the limit of your cache and a more balanced tree for your merging. Let me elaborate on the first part: data under the limit of the cache. If you recall from the beginning of this article, I explained how binary search is made faster by maintaining tail pointers. Now, data is so big that performing binary searches across all the tail pointers deters the performance. So, you keep a limited number of tail pointers in your cache. This way, caching is not an issue for you.

Final P3 Sort

The authors further improve the P3 Sort as they observed that CSP3Sort is still inefficient for almost sorted data — which was the main objective. So, they go one step beyond, again, around appending new data to the tail pointers. Final P3 sort will keep appending the new data if it is consecutive to the last appended data(integers, timestamp, etc) to the same tail, thus, saving the binary search calls.

Consider the input array: [5,6,11,12,7,8,9,10]. Read the steps from 1 to 9. Naive P3 Sort and Final P3 Sort are the same for the first 8 steps. It is from step 9 that you will see the difference.

These are the two tails maintained throughout. Binary search is done on these 2 tails in the memory. After 7 is pushed, and the incoming data is all consecutive to the last appended data (8, 9, 10 is consecutive to 7,8,9), 3 binary search calls are saved.

Replacement Selection Sort

It is a very common method to sort data that doesn’t fit in memory. It leverages external sort(merge sort on disk) by determining the bounded disorder on the input data. This algorithm maintains a heap of size k and when k elements are processed, the smallest element is removed from the heap and written on the disk.

Time taken by Replacement Sort is directly proportional to the size of the heap. So for the same disordered data, P3 Sort has lower CPU cost whereas higher tolerance(more disorderedness) demands higher CPU cost with Replacement Sort.

Flat Replacement Selection Sort

Instead of maintaining a heap, this algorithm would create a buffer(list), insert the data in it and sort the list. It then flushes the first portion of the dataset and gets more data, re-sorts it. This continues until the entire dataset is processed. This overcomes two of the deficiencies in the traditional replacement selection sort: i) O(n*log n) time complexity even when the disorder is small, ii) it uses heaps which are expensive to maintain.

Flat Replacement Selection Sort uses to list (array) instead of heaps and the time complexity is improved from O(n* log n) to linear as it does not sort the already sorted data in the list(it just copies it to the correct final location).

Note: The paper also discusses some other variants to replacement-sorting and evaluation proofs of the aforementioned algorithms which we are not covering in this blog post.

Conclusion

The paper significantly improved the patience sort without relying on a particular architecture and leveraging the current hardware capabilities. Being a data engineer, I am intrigued by the new possibilities of sorting data in event-logging systems and databases using this algorithm. Maybe in the next part, I will cover some application that visualizes the system-monitoring logs using P3 Sort. In case you would like me to cover some topics, please mention in the comments. Also, please provide some feedback.