Fun with memory bandwidth / Now that’s what I call optimization.

Recently the performance team was contacted about a benchmark that showed very low memory bandwidth on a Nutanix cluster. The customer was using the sysbench tool, which has a memory bandwidth sub-test.

The defaults in that test are to read 1K of memory in a loop until some total is reached, at which point the bandwidth measurement is returned. At the beginning and end of the 1K read, the benchmark does mutex lock/unlock as well as a gettimeofday() to measure the elapsed time of each 1K read.

I measured the execution times using the Linux “perf” tool and found that nearly all the time was consumed by gettimeofday(). With the defaults, a single node was only able to read around 400MB/s. The problem seemed obvious: the overhead of gettimeofday() was dwarfing the memory read, since the benchmark was only reading 1K for every two gettimeofday() calls (one on each side of the read).

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=1G run

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: read
Memory scope type: global
Threads started!

Operations performed: 1048576 (421735.42 ops/sec)

1024.00 MB transferred (411.85 MB/sec)

The gettimeofday() calls are embedded in the LOG_EVENT_START macros. The actual read is performed in the line “tmp = *buf”.

  LOG_EVENT_START(msg, thread_id);
      case SB_MEM_OP_READ:
        for (; buf < end; buf++)
          tmp = *buf;
        break;
      default:
        log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n",
        return 1;
  LOG_EVENT_STOP(msg, thread_id);

The simple solution of course is to increase the size of the 1K buffer to something larger so that the gettimeofday() calls are amortized over a larger memory read component. So that’s what I did.

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=10G --memory-block-size=100K run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Operations performed: 104858 (422219.41 ops/sec)
10240.04 MB transferred (41232.36 MB/sec)

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=100G --memory-block-size=1M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Operations performed: 102400 (421914.06 ops/sec)
102400.00 MB transferred (421914.06 MB/sec)

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=100G --memory-block-size=10M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Operations performed: 10240 (416435.23 ops/sec)
102400.00 MB transferred (4164352.35 MB/sec)

OK, so that seems to have worked. The transfer rate is much better with larger values of --memory-block-size. By amortizing the slower calls over 10MB, I am able to get 4164352.35 MB/s, which is roughly 4TB/s.

What. 4TB/s? That does not sound right.
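
A quick sanity check on the four runs above makes the problem visible. The sketch below (Python; the function names are mine, the numbers are copied from the sysbench output) recomputes the reported MB/sec as block size times operations per second. The operation rate stays near 420,000 ops/sec whatever the block size, which is what you would expect if the time per operation were all fixed overhead and none of it were spent reading memory.

```python
# Block size (KB) and ops/sec, copied from the four sysbench runs above.
RUNS = [(1, 421735.42), (100, 422219.41), (1024, 421914.06), (10240, 416435.23)]

def reported_mb_per_sec(block_kb, ops_per_sec):
    """Bandwidth as sysbench reports it: block size x operations per second."""
    return block_kb * ops_per_sec / 1024.0

def usec_per_op(ops_per_sec):
    """Average wall-clock time spent on one operation, in microseconds."""
    return 1e6 / ops_per_sec

for block_kb, ops in RUNS:
    print("%6dK block: %12.2f MB/sec, %.2f usec per operation"
          % (block_kb, reported_mb_per_sec(block_kb, ops), usec_per_op(ops)))
```

Every run works out to roughly 2.4 microseconds per operation, whether the operation is supposed to read 1K or 10M.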

I decided to compile the code from scratch. While doing so, I noticed that gcc was called with -O2, which allows the compiler to optimize the code. Look again at the part of the benchmark that is supposed to read from memory.

      case SB_MEM_OP_READ:
        for (; buf < end; buf++)
          tmp = *buf;

The memory is de-referenced from *buf and stored in tmp. But tmp itself is never used again before the larger function returns. The compiler can probably legitimately optimize out the read.

We can check that by running gdb against the compiled code. First, the unoptimized build:


(gdb) disassemble /m memory_execute_request

304	      case SB_MEM_OP_READ:
305	        for (; buf < end; buf++)
   0x0000000000409b9d <+773>:	addq   $0x4,-0x30(%rbp)
   0x0000000000409ba2 <+778>:	mov    -0x30(%rbp),%rax
=> 0x0000000000409ba6 <+782>:	cmp    -0x18(%rbp),%rax
   0x0000000000409baa <+786>:	jb     0x409b94

306	          tmp = *buf;
   0x0000000000409b94 <+764>:	mov    -0x30(%rbp),%rax    <---- linux "perf" reports ~25% of time here.
   0x0000000000409b98 <+768>:	mov    (%rax),%eax
   0x0000000000409b9a <+770>:	mov    %eax,-0xc(%rbp)     <---- linux "perf" reports ~50% of time here.

307	        break;
   0x0000000000409bac <+788>:	jmp    0x409bd1 

308	      default:
309	        log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n",
   0x0000000000409bb4 <+796>:	mov    %eax,%edx
   0x0000000000409bb6 <+798>:	mov    $0x415140,%esi
   0x0000000000409bbb <+803>:	mov    $0x0,%edi
   0x0000000000409bc0 <+808>:	mov    $0x0,%eax
   0x0000000000409bc5 <+813>:	callq  0x404b94

And here is the optimized (-O2) version:

303	        break;
304	      case SB_MEM_OP_READ:
305	        for (; buf < end; buf++)
306	          tmp = *buf;            <---- Notice no ASM here!
307	        break;
308	      default:
309	        log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n",
   0x00000000004081c1 <+257>:	xor    %eax,%eax
   0x00000000004081c3 <+259>:	mov    $0x413720,%esi
   0x00000000004081c8 <+264>:	xor    %edi,%edi
   0x00000000004081ca <+266>:	callq  0x404130

After compiling the benchmark without compiler optimizations, the throughput numbers were much more believable, and using linux perf again, we see the bulk of the time in the read operation as we’d expect.

% Hot       Code
 24.23 │2fc:   mov    -0x30(%rbp),%rax
       │       mov    (%rax),%eax
 52.80 │       mov    %eax,-0xc(%rbp)
 12.30 │       addq   $0x4,-0x30(%rbp)
  0.10 │30a:   mov    -0x30(%rbp),%rax
 10.54 │       cmp    -0x18(%rbp),%rax
       │     ↑ jb     2fc

Without the optimization, the benchmark is now bound by CPU clock speed, so even in this case it does not do a good job of measuring actual memory bandwidth.

Instead, I recommend the STREAM benchmark.

With the STREAM benchmark, the observed throughput was around 60GB/s. Even STREAM needs to be compiled correctly to get the highest memory throughput for the platform.

I get around 30GB/s with 4 threads, and 60GB/s with 2 sockets x 8 cores == 16 vCPUs.

Compile the STREAM benchmark with multi-threading (OpenMP) enabled using the following options, which allow it to issue memory requests from all sockets/cores/MMUs:

gcc -O3 -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20  stream.c -o stream.100M_20_MP -fopenmp -D_OPENMP  -lpthread -mcmodel=large

root@unix-gary-jump:~/stream# ./stream.100M_20_MP 
STREAM version $Revision: 5.10 $
This system uses 8 bytes per array element.
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
Number of Threads requested = 16
Number of Threads counted = 16
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 19848 microseconds.
   (= 19848 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           53917.4     0.029915     0.029675     0.031679
Scale:          53633.5     0.030230     0.029832     0.032336
Add:            60234.9     0.040285     0.039844     0.042503
Triad:          60455.2     0.039948     0.039699     0.041805
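
The Best Rate column can be reproduced from the Min time column. As I understand the STREAM source, Copy and Scale count two arrays' worth of traffic per pass (one read, one write), Add and Triad count three, and MB means 10^6 bytes. A minimal sketch in Python with the numbers from the run above (the helper name is mine):

```python
ARRAY_ELEMENTS = 100000000   # -DSTREAM_ARRAY_SIZE
BYTES_PER_ELEMENT = 8        # "This system uses 8 bytes per array element."

# kernel name -> (arrays touched per pass, min time in seconds from the output above)
KERNELS = {
    "Copy":  (2, 0.029675),
    "Scale": (2, 0.029832),
    "Add":   (3, 0.039844),
    "Triad": (3, 0.039699),
}

def best_rate_mb_per_sec(arrays_touched, min_time_sec):
    """Bytes moved in one pass divided by the fastest pass, in units of 10^6 bytes/s."""
    bytes_moved = arrays_touched * BYTES_PER_ELEMENT * ARRAY_ELEMENTS
    return bytes_moved / min_time_sec / 1e6

for name in ("Copy", "Scale", "Add", "Triad"):
    arrays, min_time = KERNELS[name]
    print("%-6s %10.1f MB/s" % (name, best_rate_mb_per_sec(arrays, min_time)))
```

The results agree with the reported rates to within the rounding of the printed times, which is a useful check that the benchmark is counting real memory traffic.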

Speccy Luv. 8-bit programming.

Marketing Your Tech Talent

Marketing Your Tech Talent from deirdrestraughan

What’s an IOP, and why do I care?

Recently I was talking to a network guy about storage IOPS, when he asked me: what is an IOP exactly? Immediately I remembered having the exact same question when I first started seriously working with storage gear back in 2006. I knew that an IOP was an I/O Operation, and IOPS is typically meant to be I/O Operations per second. Surely nobody really cares how many I/Os are done per second? We only care how long it takes to move the required amount of data (typically MB/s or GB/s).

It turns out that there are a lot of cases where the amount of data being read is actually quite small, and the bottleneck to having the application run faster is how long it takes to do an individual I/O. But you might say, that still makes no sense. If I do an I/O of 512 bytes in 1 second, then surely an I/O of 8192 bytes takes 16 seconds, and so, again, what I really care about is MB/s.

That’s what I thought too. However, for small I/O sizes (e.g. 512 bytes to 8K) the “setup time” far outweighs the “transfer time” to perform the I/O. This is especially true for spinning disk drives.
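
To put rough numbers on that, here is a small sketch that models one I/O as a fixed setup time plus a transfer time. The 4 ms setup time and the 100 MB/s transfer rate are assumptions picked for illustration, not measurements of any particular drive.

```python
SETUP_SEC = 0.004                 # assumed seek + rotational delay for a spinning disk
TRANSFER_BYTES_PER_SEC = 100e6    # assumed sequential transfer rate

def io_time_sec(size_bytes):
    """Time for one random I/O: fixed setup cost plus time to move the data."""
    return SETUP_SEC + size_bytes / TRANSFER_BYTES_PER_SEC

for size in (512, 8192):
    t = io_time_sec(size)
    print("%5d bytes: %.3f ms per I/O, about %d IOPS, %.2f MB/s"
          % (size, t * 1000, 1 / t, size / t / 1e6))
```

With these assumptions, the 8192-byte I/O takes about 2% longer than the 512-byte one, not 16 times longer, so the number of I/Os per second is the figure that matters.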

This brings us to the next unstated assumption when storage people talk about IOPS. They are typically talking about small “random” I/Os.

Why are small random I/Os interesting? Surely nobody writes applications to just randomly read data from disk? Of course not, but what tends to happen is that for many large applications, the access requests appear to be random from the point of view of the disk drive. Take for instance a web ticket booking application. This can be either airplane tickets or concert tickets. In each case the vendor has in total many more customers than can be kept in the main memory of the application servers. So, whenever I log in to book a ticket, the application has to fetch my information from disk. Since the application has no idea that I am about to book a ticket, the request to the disk drive results in a random I/O (there was no way to predict the incoming request). The amount of data requested is probably quite small, a few KB. Mix up my request with everyone else’s requests, and what this starts to look like is a very large amount of random I/O. My request and the next guy’s request are not related, so it’s all random.

Now, since disk drive speeds are many orders of magnitude slower than CPU and main memory, often the first-order bottleneck of large-scale servers is the time taken to read the data from disk so that it can be processed. Hence the random IOPS (or just IOPS) of the storage system become a very important figure when architecting the overall solution.

Before SSD/Flash technology came along, the only mainstream way to solve the problem (besides buying enough DRAM to store everything in memory) was to lash together hundreds of slow spinning disks to provide enough aggregate IOPS to service the application. Doing so requires quite a lot of software to lay out the data efficiently such that the aggregate IOPS can be realized by the application accessing it. This is hard, and is how EMC and lately NetApp made a good business of managing hundreds of spindles to provide throughput and capacity.

Now that SSD/Flash is mainstream, we don’t really need to use HDDs to provide IOPS; we use Flash for the IOPS and HDDs for their large capacity. The software challenge is now how best to do that.

VMware performance for guru’s.

I am definitely not a guru yet. But one day….

VMware Performance for Gurus – A Tutorial from Richard McDougall

Predicting disk cache hits for 100% random access.

Let’s say I have a cache of 512G. My block size is 4K, so that’s 134,217,728 entries. How many blocks will I need to read to fill the cache? Well, if I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.

So, if I read that same 512G randomly, how many blocks will end up in cache? Not 512G, because some of those random blocks will be ‘re-hits’ of blocks that were already cached.

By simulation, we find that the reads fill a ratio of 0.6321 of the entire cache (about 323.6G). Repeated simulations show that the ratio is pretty constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?

  • Example output.
    Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/ 
    Re-Hit ratio 0.3676748
    Miss (Insert) ratio 0.6323252

    Result of 4 trials…

    print (0.6322271+0.6320339+0.6322528+0.6320873)/4

    Is there anything interesting about that value?
    Tells us that the value 0.6321 can be more-or-less represented as.


    Furthermore we see

    Series representation.

    I can’t figure out which of the above series representations actually explains the cache hit behavior, but it makes sense for it to have something to do with factorials, since the more data we read in, the higher the chance that the next read will actually be a hit in the cache rather than inserting a new value.

    If anyone can explain the underlying math to this effect, I would be very interested. Looks like it’s related to

    Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.
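
For what it's worth, the value looks like 1 - 1/e, and a short argument gets there. With N slots in the cache and N uniformly random reads, a given block is missed by one read with probability 1 - 1/N, and by all N reads with probability (1 - 1/N)^N, which tends to 1/e as N grows. So the expected fraction of the cache that gets filled is 1 - (1 - 1/N)^N, or about 1 - 1/e = 0.6321. The sketch below checks this numerically, assuming reads are independent and uniform over exactly N blocks (the function names are mine, and it is written for Python 3):

```python
import math
import random

def expected_fill_fraction(n):
    """Expected fraction of n cache slots filled by n uniform random reads."""
    # 1 - (1 - 1/n)**n, computed via log1p to stay accurate for large n.
    return 1.0 - math.exp(n * math.log1p(-1.0 / n))

def simulated_fill_fraction(n, seed=0):
    """Read n random blocks out of n and count how many distinct blocks were seen."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n):
        seen.add(rng.randrange(n))
    return len(seen) / float(n)

print("1 - 1/e           ", 1.0 - math.exp(-1.0))
print("formula, 512G/4K  ", expected_fill_fraction(134217728))
print("simulation, N=10^5", simulated_fill_fraction(100000))
```

The factorials come in through the series 1/e = 1 - 1/1! + 1/2! - 1/3! + ..., so 1 - 1/e = 1/1! - 1/2! + 1/3! - ...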

    Here’s code to simulate the cache in Python. This causes python to malloc about 400M of memory.

    Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/ 
    Re-Hit ratio 0.3677729
    Miss (Insert) ratio 0.6322271
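    import random
    cachesize = 10000000  # 10 Million entries.
    cache = []
    hit = 0.0
    miss = 0.0
    # Mark every block as not cached (0).
    for i in range(0,cachesize):
        cache.append(0)
    # Read cachesize random blocks; a block already marked 1 is a re-hit.
    for i in range(0,cachesize):
        b = random.randint(0,cachesize-1)
        if cache[b] == 1:
            hit += 1
        else:
            cache[b] = 1
            miss += 1
    print "Re-Hit ratio",hit/cachesize
    print "Miss (Insert) ratio",miss/cachesize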
  • Show status of aggregate creation / disk zero status.

    For the most part, you only really care about disk zeroing if you’re waiting for an aggregate to be created on disks that were previously used on an old aggregate that was only just destroyed.

    To see how long disk-zeroing is taking, use the command “aggr status -r”

    filer-6280*> aggr status -r
    Aggregate aggr1_large (creating, raid_dp, initializing) (block checksums)
      Plex /aggr1_large/plex0 (offline, empty, active)
      Targeted to traditional volume or aggregate but not yet assigned to a raid group
          RAID Disk Device          HA  SHELF BAY CHAN Pool Type  RPM     
          --------- ------          ------------- ---- ---- ---- ----- --------------    --------------
          pending   3a.00.8         3a    0   8   SA:A   0   SAS 10000 (zeroing, 71% done)
          pending   3a.00.14        3a    0   14  SA:A   0   SAS 10000 (zeroing, 69% done)
          pending   3a.00.16        3a    0   16  SA:A   0   SAS 10000 (zeroing, 70% done)
          pending   3a.00.18        3a    0   18  SA:A   0   SAS 10000 (zeroing, 69% done)
          pending   3a.00.20        3a    0   20  SA:A   0   SAS 10000 (zeroing, 69% done)
          pending   3a.00.22        3a    0   22  SA:A   0   SAS 10000 (zeroing, 71% done)

    Netapp API hacking with python

    As a non-programmer, I’ve always been reluctant to use anything with acronyms like API and SDK, relying instead on issuing a full command line using rsh or ssh. That works for a while, until you want to start doing things like checking for errors, and until you get fed up with 90% of the script being dedicated to parsing the output. NetApp filers have a reasonable API that can be used to get both sysadmin data (number and fullness of volumes) and performance analysis numbers (number of ops and the response times). Best of all, the SDK provides libraries for both Perl and Python. I have switched almost entirely to Python for anything that needs more than a few lines of automation. In Python, all you have to do is import 2 libraries and you can start using the API.

    The API can be downloaded from the netapp support site under the ‘Download Software’ link.

    The API is implemented at the lowest level by sending RPC/XML calls over HTTP to the filer. Inside NetApp, any new functionality must provide API (aka ZAPI) access, so learning the API should be a good investment. It helps to know that the implementation is XML, since retrieving the data returned from the filer follows the tortuous access pattern familiar to anyone who has used XML in the past.

    I have used the API in my lab to monitor disk usage during long term testing, and in my previous life as a consultant implemented a scheduler to manage database snapshots. Once you’re used to accessing the API, it’s much easier than sending CLI commands over ssh/rsh.

    Here are the steps to get the API

    Download the API (ontapi SDK)

  • Head to

    You’re looking for
    NetApp Manageability SDK

  • Select “All platforms” then hit “GO”.
  • Click the button “View & Download”
  • Fill in some sort of form…. fill in all the fields, otherwise you’ll have to start over. For some reason, the form only talks about Perl, Java, C and .Net. After that a download link will appear, there’s a license to click through – and eventually you’ll be able to download the SDK. When I downloaded, the tarball was 89 MB.
  • Click on yet another hyperlink: “Thank you for completing the registration form. To continue with your download click here”
  • Scroll to the bottom of the page, and click the “CONTINUE” link (yes, painful isn’t it?)
  • Now you’ll have to read the EULA… then hit “Accept” (if you can live with the EULA)
  • Maddeningly, there is yet another link to click, that implies you need to login elsewhere… but actually clicking the link “Log in to the NetApp Support Site and click NetApp Manageability SDK.” will actually start the download.

    !!Now the download will actually start!!

  • Once the tarball/zip file has downloaded on your client machine, go and find it, extract it to somewhere sensible as you normally would.

    The file is called “””" on my mac.

    Inside the ‘lib’ directory, is python/NetApp. These are the python modules that we’ll be using.
  • Now, regarding the documentation. At the top level of the directory structure, of the unzipped file – there is a file SDK_help.htm. If I open this file with the Chrome browser – then I get a mostly blank page. If I open the file in Safari browser, then I get a decent help screen.
  • To see some examples
  • Home > NetApp Manageability SDK > Sample Codes > Data ONTAP sample codes
  • Now, let’s run the ZAPI ‘hello world’, return the name of the filer. Obviously – this is just as easy to do with the CLI – but once we start to get into iterating over tens or hundreds of volumes, or other structured data – the power of using the API will become obvious.

    For now though, let’s start with something simple…

    lovebox:[~] $ export PYTHONPATH=$PYTHONPATH:~/Downloads/netapp-manageability-sdk-5.0R1/lib/python/NetApp/
    lovebox:[~] $ ipython
    In [1]: from NaElement import *
    In [2]: from NaServer import *
    In [3]: server=NaServer("",1,6)
    In [4]: server.set_admin_user('root',"root")
    In [5]: cmd = NaElement('system-get-info')
    In [6]: out=server.invoke_elem(cmd)
    In [7]: system_info=out.child_get("system-info")
    In [8]: system_info.child_get_string("system-name")
    Out[8]: u'gjlfiler'
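    from NaElement import *
    from NaServer import *
    server = NaServer("",1,6)
    server.set_admin_user('root',"root")
    cmd = NaElement('system-get-info')
    print server.invoke_elem(cmd).child_get("system-info").child_get_string("system-name")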

    Something more tricky

    Regrettably, the SDK documentation no longer contains the ONTAP portion of the API, in other words which calls I can make to the filer and what it will respond with. To access that documentation:

    Click the link for Data ONTAP API Documentation, and download the zipfile (currently that filename is

    Again, I was unable to view the html doc with Chrome for some reason, but Firefox works OK.

    So, let’s try something else: how about a short script to get the size of each volume in the system?

  • Open the documentation folder, and fire open “SDK_help.htm”.
  • Go to the index tab, and find the “Volume” section – click on “volume-list-info”, because that just sounds like it might be what we want.

    We see from the documentation that the function returns a list of volumes, in a structure named “volumes”, because that’s the name given in the “Output Name” field in the API documentation.

    Input (Name, Type, Description):
      verbose (boolean): If set to “true”, more detailed volume information is returned. If not supplied or set to “false”, this extra information is not returned.
      volume (string): The name of the volume for which we want status information. If not supplied, then we want status for all volumes on the filer. Note that if status information for more than 20 volumes is desired, the volume-list-info-iter-* zapis will be more efficient and should be used instead.

    Output (Name, Type, Description):
      volumes (volume-info[]): List of volumes and their status information.
  • We can tell that the return type is going to be a list because of the [ ] at the end of the type description e.g. volume-info[]. Click on the volume-info[] link to see what is in the list.
  • One of the elements is “size-total” which is what we want.

    So, now we know that we want to grab the volume-info[] list (which we can guess is a list of each volume in the system, and each item in the list has information about the volume).

    So, as before we do some setup to reach the filer, and this time we’re going to issue the command volume-list-info.

    In [1]: import sys
    In [2]: sys.path.append("/opt/netapp/netapp-manageability-sdk-5.0R1/lib/python/")
    In [3]: from NaElement import *
    In [4]: from NaServer import *
    In [6]: filer=NaServer("",1,6)
    In [7]: filer.set_admin_user('root','root')
    In [8]: cmd = NaElement("volume-list-info")
    In [10]: ret = filer.invoke_elem(cmd)
    In [11]: ret
  • The object ‘ret’, is a container, which we know contains the output “volumes” – but we need to access it via the magical accessors. How do we know the magic words to use? Well, we know from the API document that the output name as returned by the API is “volumes” and that it is of type list (because of the []). So, we ask the container (i.e. the whole XML returned from the filer) to provide us the “volumes” object.

    Above we see that ‘ret’ is just an NaElement instance; we want the specific ‘volumes’ object. There might be more than one object returned to us (although in this case there is not), so we have to unpack the volumes object, just like in the simple example we did first.

    In [12]: volumes = ret.child_get("volumes")
    In [13]: volumes

    So, now we have an object that should be a list of volumes. But unfortunately we can’t just access that as a real python list – we have to use the specific accessors.

  • We know that volumes contains a list of per-volume information.
  • To access an individual item of the list we need to issue children_get()
  • By looking at the definition of ‘volume-info’ we can see that there is a field called “name” which is a string.
  • To extract a particular field from the per-volume structure we issue child_get_string()
         vol --- size
               --- name
               --- etc,
    In [14]: for vol in volumes.children_get():
        ...:     print vol.child_get_string("name")

    Since we also want the size – we can find that field in the volume-info structure, figure out the type and use the correct accessor. In this case we can make a guess that size-total is what we want, and that its type is integer and so we end up with this :-

    for vol in volumes.children_get():
        print vol.child_get_string("name")
        print vol.child_get_int("size-total")
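
It can help to picture the XML behind those accessors. The sketch below does not use the NetApp SDK at all: it parses a made-up response with Python's standard xml.etree.ElementTree to show the volumes / volume-info / name nesting that child_get(), children_get() and child_get_string() walk. The XML text, volume names and sizes are invented for illustration, and the real response from a filer will carry many more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, trimmed to the two fields used above.
SAMPLE_RESPONSE = """
<results status="passed">
  <volumes>
    <volume-info><name>vol0</name><size-total>1000000</size-total></volume-info>
    <volume-info><name>vol1</name><size-total>2500000</size-total></volume-info>
  </volumes>
</results>
"""

def volume_sizes(xml_text):
    """Return a list of (name, size-total) pairs from a volume-list-info style reply."""
    root = ET.fromstring(xml_text)
    volumes = root.find("volumes")              # like ret.child_get("volumes")
    sizes = []
    for vol in list(volumes):                   # like volumes.children_get()
        name = vol.findtext("name")             # like vol.child_get_string("name")
        size = int(vol.findtext("size-total"))  # like vol.child_get_int("size-total")
        sizes.append((name, size))
    return sizes

for name, size in volume_sizes(SAMPLE_RESPONSE):
    print(name, size)
```

Seen this way, each SDK accessor is one step down the tree: pick a named child, iterate its children, then read a typed leaf.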

    The documentation that tells you about child_get() etc. is in the SDK documentation, in the following section

    Home > NetApp Manageability SDK > Programming Guide > SDK Core APIs > Python Core APIs > Input Output Management APIs
  • Thoughts on “Latency Numbers Every Programmer Should Know”

    Last week a number of twitter users were posting about a very cool looking interactive latency predicterizor. In the small world of computer performance nerds, it was a veritable tsunami of attention. My twitter skills are so feeble that I cannot figure out how to determine the total number of tweets – but as of now (Wed Jan 2nd 5:32 EST) the most recent tweet was 19 minutes ago. Fascinating.

    The interactive paper is here

    At face value, it seemed that some of the numbers were really quite low. For instance I was pretty surprised to see the projected random read latency from an SSD at 17 microseconds (uSec). What’s really nice about this graphic is that the sources for the numbers presented are contained in the JavaScript source. In fact that’s really really nice. The numbers for flash/SSD are taken from a Berkeley paper published in 2012(?), which is itself quite an interesting read.

    Flash paper :

    From the interactive diagram – which states main memory read latency of 1uSec, you’d assume that reading from an SSD is 17x slower than main memory. Personally I think that’s WAY off for a couple of reasons. Firstly, and least interestingly – the 17uSec is for a direct access to the NAND cell itself. In practice SSD’s are packaged with a Flash Translation Layer (FTL) which translates something like a linear address range (LBA) into mappings to the flash memories. The FAST paper (above) pegs the FTL latency at 30 uSec – which seems high relative to the 17 uSec response from the NAND memory (the comments in the code, and the paper itself – seem to peg the NAND response time at 20 uSec – but the interactive tool shows me 17 uSec for “2012″).

    The more interesting consideration – particularly given the title “Latency Numbers every _programmer_ should know” is that, as a programmer – I can to a large extent expect to achieve a 1 uSec response from main memory when I attempt to read some value. However, there’s no way that I will ever get the 17 uSec (or even a 17+39 uSec) response from SSD. The reason is that, as a user I cannot access that SSD directly. For most programmers – the SSD will be accessed via a filesystem, then a device driver.

    A programmer’s access to the filesystem is almost always on the other side of a system call, which means a trap or interrupt call, saving stack pointers, setting up addresses for buffers, etc. Typically the data will be read from SSD into the memory of the host computer and then returned to the user/programmer. Even with DMA and other zero-copy techniques there will be many reads/writes to main memory to set up the system call.

    When we think about spinning-disk accesses of ~4 milliseconds (ms) or 4,000 uSecs, we can more-or-less gloss over the cost required to set up the system call and move the data through the kernel and back to the user, because the overhead of moving this mechanical instrument dwarfs the other costs. But with SSDs that setup cost starts to impact how quickly a user-land application can really access data.
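
A rough way to see this is to add a fixed software cost to each device latency and ask what share of the total it represents. In the sketch below the 4,000 uSec disk and 17 uSec NAND figures come from the text above; the 10 uSec software overhead for the system call, filesystem and driver is purely an assumed number for illustration.

```python
SOFTWARE_OVERHEAD_USEC = 10.0   # assumed cost of system call + filesystem + driver

DEVICE_LATENCY_USEC = {
    "spinning disk": 4000.0,    # ~4 ms, from the text
    "NAND read": 17.0,          # figure shown by the interactive tool for 2012
}

def overhead_share(device_usec, overhead_usec=SOFTWARE_OVERHEAD_USEC):
    """Fraction of the total access time spent in software rather than the device."""
    return overhead_usec / (device_usec + overhead_usec)

for name in ("spinning disk", "NAND read"):
    share = overhead_share(DEVICE_LATENCY_USEC[name])
    print("%-14s software overhead is %4.1f%% of the access time" % (name, share * 100))
```

Under that assumption the software path is a rounding error for the disk (well under 1%) but more than a third of the total for the NAND read.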

    Thinking Clearly about Performance – ACM Queue article.

    Great article from Cary Millsap, covers performance analysis in general – not specific to Oracle.