Understanding Rules of Thumb for Computer Storage
A blog post that explains the research paper: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb
While working as a data engineer, I often run into the question of data storage: should data be kept on fast storage, moved to cold storage, or put somewhere in between? Usually, I look at how frequently the data is read and make sure the tools I use match that access pattern. However, I don't want to rely only on anecdotal observations, so I wanted to go back to some fundamentals. That led me to a research paper by Jim Gray and Goetz Graefe, published at Microsoft Research in 1997. I am writing this blog post as a primer for reading that paper, to help you understand some basic terminology and to give you some context.
Click here for the research paper.
The Five-Minute Rule Ten Years Later
While deciding the appropriate page size and main-memory size for accessing data, we should consider both the economic and the performance aspects. So we look for a break-even duration, often referred to as the reference interval. It tells us how long a page should be kept in main memory: if a page is re-referenced within this interval, it is cheaper to keep it cached in RAM; if it is referenced less often than that, it is cheaper to re-read it from disk when needed.
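Since I will refer to it below, here is the break-even equation from the paper:

Reference Interval (seconds) = [(Pages Per MB of RAM) / (Accesses Per Second Per Disk)] × [(Price of Disk Drive) / (Price of MB of RAM)]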
In 1997, when the paper was published, the suggested reference interval for randomly accessed pages worked out to roughly 5 minutes. This means that at any given point in time t, pages referenced between t − 5 minutes and t are worth keeping in main memory.
Now, this 5-minute rule does not hold for every machine when it comes to randomly accessed pages. Mainframes, for example, work out closer to a three-minute rule, because mainframe disk and RAM prices differ from commodity prices, which changes the ratios in the equation. In the equation above, (Pages Per MB of RAM) / (Accesses Per Second Per Disk) is called the technology ratio, and (Price of Disk Drive) / (Price of MB of RAM) is called the economic ratio. The reference interval is the product of the two, so when the economic ratio grows (disks become expensive relative to RAM) or the technology ratio grows, the reference interval increases, i.e. you get to keep data in the main memory for a longer period of time. The paper's central observation is that between 1987 and 1997 the technology ratio shrank by roughly the same factor that the economic ratio grew, which is why the break-even interval stayed at about five minutes.
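To make this concrete, here is a small Python sketch of the calculation. The inputs are roughly the 1997 figures from the paper's table (8 KB pages, so 128 pages per MB of RAM, about 64 random accesses per second per disk, around $2,000 per disk drive, and about $15 per MB of RAM); I am quoting them from memory, so check them against the paper before relying on them.

```python
def break_even_interval_seconds(pages_per_mb_of_ram: float,
                                accesses_per_second_per_disk: float,
                                price_per_disk_drive: float,
                                price_per_mb_of_ram: float) -> float:
    """Break-even reference interval = technology ratio * economic ratio."""
    technology_ratio = pages_per_mb_of_ram / accesses_per_second_per_disk
    economic_ratio = price_per_disk_drive / price_per_mb_of_ram
    return technology_ratio * economic_ratio

# Roughly the 1997 values: 8 KB pages (128 pages/MB), ~64 random accesses
# per second per disk, ~$2,000 per disk drive, ~$15 per MB of RAM.
interval = break_even_interval_seconds(128, 64, 2000, 15)
print(f"{interval:.0f} seconds (~{interval / 60:.1f} minutes)")  # ~267 seconds, hence "five minutes"
```

If RAM keeps getting cheaper relative to disks, the economic ratio grows and it pays to cache pages for longer; if disks get much faster, the technology ratio shrinks and the interval gets shorter.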
The paper also revisits a rule called the '10-byte rule'. Let me give you a bit of context here. CPU power and RAM are both resources you pay for, and you can sometimes trade one for the other, for example by caching a computed result in memory instead of recomputing it every time. The original 1987 rule observed that one instruction per second of CPU cost about as much as ten bytes of main memory, so it is economical to spend up to ten bytes of RAM to save one instruction per second.
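As a toy illustration of that trade-off, here is a small Python sketch. The helper function and the example numbers are mine, not the paper's; the only thing taken from the rule is the ten-bytes-per-(instruction-per-second) ratio.

```python
# Toy illustration of the 10-byte rule: trading RAM for CPU.
BYTES_PER_INSTRUCTION_PER_SECOND = 10  # the ratio from the original 1987 rule

def should_cache(result_size_bytes: int, instructions_saved_per_second: float) -> bool:
    """Cache a computed result if the RAM it occupies costs no more than
    the CPU work it saves, at 10 bytes per instruction-per-second."""
    ram_budget_bytes = instructions_saved_per_second * BYTES_PER_INSTRUCTION_PER_SECOND
    return result_size_bytes <= ram_budget_bytes

print(should_cache(400, 50))    # True:  400 bytes <= 50 * 10
print(should_cache(4096, 100))  # False: 4096 bytes > 100 * 10
```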
If a sequential operation reads data and never references it, then there is no need to cache the data in RAM.
This line comes from section 1.2 of the paper, "Sequential Data Access", and it describes operations that read data but never reference it again. What does 'never referencing it' mean? Many operations read a page and then touch it again shortly afterwards, and that repeated access is exactly what makes caching worthwhile. But if an operation scans through the data once, for example a single sequential pass over a file, and will never need those pages again, then there is no point in caching them in RAM: it was a one-time read.
The paper then compares one-pass and two-pass sorting algorithms, and it again uses the break-even equation to decide which one to use: one pass, which keeps everything in memory, or two passes, which use less memory at the cost of extra I/O. You might have experienced this trade-off yourself when loading a small file entirely into memory versus reading a huge file in chunks. The authors eventually suggest a one-minute rule for sequentially accessed data. Notice the lines in Figure 3 in the paper: "Fundamentally, tape technology is VERY expensive to access. This encourages very large tape page sizes and very cold data on tape." Cold storage is a term for keeping large amounts of inactive data on cheap, slow media. Deciding which data will be inactive is another pain in the neck that engineers still solve case by case.
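To make the small-file-versus-huge-file point concrete, here is a minimal Python sketch of the two access patterns (the file names are placeholders I made up): reading a file in one go and keeping it all in memory, versus streaming it in fixed-size chunks where no chunk is ever referenced again and so nothing is worth caching.

```python
# One-pass read vs. sequential streaming; "small.csv" and "huge.csv" are placeholder names.

def read_all_at_once(path: str) -> bytes:
    """Fine when the file comfortably fits in RAM: read once, keep it all."""
    with open(path, "rb") as f:
        return f.read()

def stream_in_chunks(path: str, chunk_size: int = 8 * 1024 * 1024):
    """Sequential one-time read: hand out one 8 MB chunk at a time and let it go.
    No chunk is ever referenced again, so there is nothing worth caching."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

small_data = read_all_at_once("small.csv")                        # small file: keep it in memory
total_bytes = sum(len(c) for c in stream_in_chunks("huge.csv"))   # huge file: stream it
```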
Section 2 of the paper addresses how large index pages should be.
First of all, let's understand: why do we need indexes?
An index increases the speed of a search: instead of scanning every data page to find a record, you read a small number of index pages that point you to exactly where the record lives.
Why do we use a B-tree for index pages?
A B-tree is widely used in databases and file systems. It keeps all the keys sorted for sequential traversal, stays balanced, can handle an arbitrary number of insertions and deletions, and uses its hierarchy to minimize the number of disk reads. Click here to read about B-trees in detail.
Now, the index pages themselves have to be read from disk. This reading comes at a cost, and your target is that the total cost of reading the index pages and then fetching the data is significantly lower than the cost of reading the data as if it weren't indexed.
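Here is a back-of-the-envelope comparison in Python. The sizing numbers (100 million rows, 8 KB pages, how many entries fit on a page) are illustrative assumptions of mine, not figures from the paper; the point is simply that a B-tree lookup touches a handful of pages while an unindexed scan touches millions.

```python
import math

# Back-of-the-envelope: B-tree lookup vs. full scan, counted in page reads.
TOTAL_RECORDS = 100_000_000     # rows in the table (assumed)
RECORDS_PER_DATA_PAGE = 40      # ~200-byte rows in an 8 KB data page (assumed)
INDEX_ENTRIES_PER_PAGE = 500    # ~16-byte key+pointer entries in an 8 KB index page (assumed)

# A full scan reads every data page.
scan_page_reads = math.ceil(TOTAL_RECORDS / RECORDS_PER_DATA_PAGE)

# A B-tree lookup reads roughly one index page per level, plus the data page.
btree_levels = math.ceil(math.log(TOTAL_RECORDS, INDEX_ENTRIES_PER_PAGE))
lookup_page_reads = btree_levels + 1

print(f"full scan    : ~{scan_page_reads:,} page reads")  # ~2,500,000
print(f"B-tree lookup: ~{lookup_page_reads} page reads")  # 4 (3 index levels + 1 data page)
```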
The paper also takes the depreciation cost of storage hardware into account. I wonder how storage services like S3 and GCS decide their pricing models with depreciation costs in mind.
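As a tiny worked example of what depreciation means here (the three-year period and the $2,000 price are my illustrative assumptions, not necessarily the paper's figures): a $2,000 disk drive amortized over 3 years works out to roughly $2,000 / 36 ≈ $56 per month of useful life, and it is that ongoing cost, rather than the sticker price alone, that cost comparisons ultimately care about.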
Summary
I found the research paper quite good. It uses case studies to identify problems and suggests solutions. However, it would have been great if the paper included derivations of the formulae, or at least references to them. I hope that after reading the paper, you can decide your main-memory, cache, and cold-storage sizes using some concrete formulae.
I am also looking to read some similar papers published more recently and to draw analogies between 1997 and 2020. Please give me feedback on how I can improve my writing and what else I could include to help you understand the paper better and faster.
[1] Jim Gray and Goetz Graefe, "The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb", Microsoft Research, 1997.