Predicting disk cache hits for 100% random access.
Tuesday, 23 April 2013
Let's say I have a cache of 512G. My block size is 4K, so that's 134,217,728 entries. How many blocks will I need to read to fill the cache? Well, if I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.
So, if I read that same 512G randomly, how many blocks will end up in the cache? Not 512G, because some of those random blocks will be 're-hits' of blocks that were already cached.
It turns out, by simulation, that the ratio is 0.6321 of the entire cache (about 323.5G). Repeated simulations show that the ratio is remarkably constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?
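The same number falls out of a closed-form calculation (a sketch I'm adding here, assuming each read picks a block uniformly at random): the chance a given block is never touched in N reads is (1 − 1/N)^N, so the expected fill ratio is one minus that.

```python
import math

# 512G cache / 4K blocks = 134,217,728 entries
N = 134217728

# Probability that one particular block is never selected in N uniform random reads
p_never = (1.0 - 1.0 / N) ** N

fill_ratio = 1.0 - p_never
print("Expected fill ratio:", fill_ratio)       # ~0.6321
print("1 - 1/e            :", 1.0 - 1.0 / math.e)
```

For large N, (1 − 1/N)^N converges to 1/e, which is why the simulated ratio barely moves between runs.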
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3676748
Miss (Insert) ratio 0.6323252
Result of 4 trials:

print (0.6322271+0.6320339+0.6322528+0.6320873)/4
0.632150275
Is there anything interesting about that value?
This tells us that the value 0.6321 can be more-or-less represented as 1 − 1/e.
Furthermore, we can see this at http://www.wolframalpha.com/input/?i=1-1%2Fe
I can't figure out which of the above series representations actually explains the cache hit behavior, but it makes sense to have something to do with factorials, since the more data we read in, the higher the chance that the next read will actually be a hit in the cache rather than inserting a new value.
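As an aside on the factorial connection (my own check, not from the Wolfram Alpha page): 1 − 1/e expands as an alternating series of reciprocal factorials, and the partial sums converge quickly to 0.6321.

```python
import math

# Partial sums of 1 - 1/e = sum over k>=1 of (-1)**(k+1) / k!
s = 0.0
for k in range(1, 12):
    s += (-1) ** (k + 1) / math.factorial(k)
    print(k, s)
```

By the fifth or sixth term the sum is already within a fraction of a percent of the simulated ratio.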
If anyone can explain the underlying math to this effect, I would be very interested. Looks like it’s related to http://en.wikipedia.org/wiki/Derangement
Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.
Here's code to simulate the cache in Python. This causes Python to malloc about 400M of memory.
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3677729
Miss (Insert) ratio 0.6322271
import random
import math
import numpy

# 10 Million entries.
cachesize=10000000
hit=0.0
miss=0.0
cache=[]

# Start with an empty (all-zero) cache.
for i in range(0,cachesize):
    cache.append(0)

# Read random blocks; count re-hits vs. inserts.
for i in range(0,cachesize):
    b=random.randint(0,cachesize-1)
    if cache[b] == 1:
        hit+=1
    else:
        cache[b]=1
        miss+=1

print "Re-Hit ratio",hit/cachesize
print "Miss (Insert) ratio",miss/cachesize
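Since numpy is already imported, here's a vectorized equivalent (my sketch, not part of the original script): generate all the random block numbers at once, and count the distinct ones, since each distinct block corresponds to exactly one cache insert (miss).

```python
import numpy as np

cachesize = 10000000

# Each read picks a uniformly random block number.
blocks = np.random.randint(0, cachesize, size=cachesize)

# Distinct block numbers are the inserts; repeats are re-hits.
misses = np.unique(blocks).size
hits = cachesize - misses

print("Re-Hit ratio", hits / float(cachesize))
print("Miss (Insert) ratio", misses / float(cachesize))
```

This avoids the Python-level loop and the 400M list, and lands on the same ~0.632 miss ratio.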