Predicting disk cache hits for 100% random access.

Let's say I have a cache of 512G. My block size is 4K, so that's 134,217,728 entries. How many blocks will I need to read to fill the cache? If I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.
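For reference, the entry count is just the cache size divided by the block size; a quick back-of-the-envelope in Python (the variable names are mine, purely for illustration):

    # Number of 4K blocks in a 512G cache.
    cache_bytes = 512 * 2**30   # 512 GiB
    block_bytes = 4 * 2**10     # 4 KiB
    print(cache_bytes // block_bytes)   # 134217728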

So, if I read that same 512G randomly, how many blocks will end up in the cache? Not 512G worth, because some of those random blocks will be 're-hits' of blocks that were already cached.

It turns out, by simulation, that the filled fraction is 0.6321 of the entire cache (about 323.6G). Repeated simulations show that the ratio is remarkably constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?

  • Example output:
    Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py 
    Re-Hit ratio 0.3676748
    Miss (Insert) ratio 0.6323252
    

    Result of 4 trials…

print((0.6322271+0.6320339+0.6322528+0.6320873)/4)
0.632150275
    

    Is there anything interesting about that value?
    http://www.wolframalpha.com/input/?i=0.6321
tells us that the value 0.6321 can, more or less, be represented as

    1-(1/e)
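A quick check in Python agrees to four decimal places:

    import math
    print(1 - 1/math.e)   # 0.6321205588285577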
    

Furthermore, http://www.wolframalpha.com/input/?i=1-1%2Fe gives a series representation:

1 - 1/e = 1/1! - 1/2! + 1/3! - 1/4! + 1/5! - …
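The partial sums of that series converge quickly to the simulated ratio; summing the first nine terms, for instance:

    import math
    # Alternating sum of reciprocal factorials, first nine terms.
    print(sum((-1.0)**(k+1) / math.factorial(k) for k in range(1, 10)))   # ~0.6321206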

I can't figure out which of the above series representations actually explains the cache-hit behavior, but it makes sense that factorials are involved, since the more data we read in, the higher the chance that the next read will be a hit in the cache rather than an insert of a new value.

If anyone can explain the underlying math behind this effect, I would be very interested. It looks like it's related to http://en.wikipedia.org/wiki/Derangement
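One sketch of an argument (my guess at the math, so treat it as a sketch rather than a proof): each of the N random reads lands on any given cache slot with probability 1/N, so the chance that a particular slot is still empty after all N reads is

    (1 - 1/N)^N, which tends to 1/e as N grows.

The expected filled fraction is therefore 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 0.6321, matching the simulation.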

    Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.

Here's the code to simulate the cache in Python. Running it causes Python to allocate about 400M of memory.

    Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py 
    Re-Hit ratio 0.3677729
    Miss (Insert) ratio 0.6322271
    
import random

# Cache of 10 million entries; cache[b] == 1 means block b is resident.
cachesize = 10000000
cache = [0] * cachesize

hit = 0.0
miss = 0.0

# Issue one random block read per cache slot.
for i in range(cachesize):
    b = random.randint(0, cachesize - 1)
    if cache[b] == 1:
        # The block was already cached: a re-hit.
        hit += 1
    else:
        # The block was not cached: insert it.
        cache[b] = 1
        miss += 1

print("Re-Hit ratio", hit / cachesize)
print("Miss (Insert) ratio", miss / cachesize)
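As an aside, the same simulation vectorizes nicely with numpy (a sketch of mine; np.unique counts the distinct blocks chosen, which is exactly the number of inserts):

    import numpy as np

    cachesize = 10000000
    # Draw one random block number per read, all at once.
    blocks = np.random.randint(0, cachesize, size=cachesize)
    # Each distinct block chosen corresponds to exactly one insert (miss).
    misses = np.unique(blocks).size
    print("Re-Hit ratio", 1.0 - float(misses) / cachesize)
    print("Miss (Insert) ratio", float(misses) / cachesize)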
    
  • 2 Responses to “Predicting disk cache hits for 100% random access.”

    1. Neil Gunther writes:

Congratulations! I think you just simulated radioactive decay for an “atom” that has a half-life of 0.693147 steps. :)

      I’ll try to organize my explanation better after I finish my GDAT class and get back to you.
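(For reference, assuming Neil's decay analogy holds: if t is the number of reads as a fraction of the cache size, the filled fraction is 1 - e^(-t), and solving 1 - e^(-t) = 1/2 gives t = ln 2 ≈ 0.693147, the "half-life" above.)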

    2. Neil Gunther writes:

      All is finally revealed about Exponential Cache Behavior. (I think?) http://perfdynamics.blogspot.com/2013/05/exponential-cache-behavior.html
