It turns out that for Windows 2012 you must download the binaries for “Windows XP and Later”, NOT “Windows Vista and later”. I installed the Vista binaries and for the most part it installed and appeared to work – however no packets actually got passed over the VPN. I removed the Vista package, installed the “XP and later” package, and the VPN started working as expected.
Saturday, 1 November 2014
Monday, 16 December 2013
Recently the performance team was contacted about a benchmark that showed very low memory bandwidth on a Nutanix cluster. The customer was using the sysbench tool, which has a memory bandwidth sub-test.
The defaults in that test are to read 1K of memory in a loop until some total is reached, at which point the bandwidth measurement is returned. At the beginning and end of the 1K read, the benchmark does mutex lock/unlock as well as a gettimeofday() to measure the elapsed time of each 1K read.
I measured the execution times using the Linux “perf” tool, and found that nearly all the time was consumed by gettimeofday(). With the defaults, a single node was able to read only around 400MB/s. The problem seemed obvious: the overhead of gettimeofday() was dwarfing the memory read, since only 1K is read for every two gettimeofday() calls (one on each side of the read).
root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=1G run
Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 1024M
Memory operations type: read
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (421735.42 ops/sec)
1024.00 MB transferred (411.85 MB/sec)
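A quick back-of-envelope check (my own sketch; the ~10GB/s raw read bandwidth is an assumption, not a measurement) shows how little of each iteration is spent actually reading memory, and how much goes to the gettimeofday() and mutex overhead:

# Back-of-envelope check: how much of each 1K iteration is actually spent
# reading memory? The raw read bandwidth below is an assumption, not a
# measured number; the ops/sec figure is taken from the sysbench output above.
ops_per_sec = 421735.42                        # reported by sysbench above
block_bytes = 1024                             # default 1K block size
assumed_mem_bw = 10e9                          # assumed raw read bandwidth, bytes/sec

iteration_us = 1e6 / ops_per_sec               # ~2.4 us per loop iteration
read_us = block_bytes / assumed_mem_bw * 1e6   # ~0.1 us to actually read 1K

print("per-iteration time : %.2f us" % iteration_us)
print("time reading 1K    : %.2f us" % read_us)
print("overhead fraction  : %.0f%%" % (100 * (1 - read_us / iteration_us)))

In other words, roughly 95% of every iteration is timing and locking overhead rather than the memory read itself.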
The gettimeofday() calls are embedded in the LOG_EVENT_START macros. The actual read is performed in the line “tmp = *buf”.
LOG_EVENT_START(msg, thread_id);
...
    case SB_MEM_OP_READ:
      for (; buf < end; buf++)
        tmp = *buf;
      break;
    default:
      log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n", mem_req->type);
      return 1;
  }
}
...
LOG_EVENT_STOP(msg, thread_id);
The simple solution of course is to increase the size of the 1K buffer to something larger so that the gettimeofday() calls are amortized over a larger memory read component. So that’s what I did.
root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=10G --memory-block-size=100K run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
...
Operations performed: 104858 (422219.41 ops/sec)
10240.04 MB transferred (41232.36 MB/sec)

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=100G --memory-block-size=1M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
...
Operations performed: 102400 (421914.06 ops/sec)
102400.00 MB transferred (421914.06 MB/sec)

root@icrossing:~/sysbench-0.4.12/sysbench# sysbench --test=memory --memory-scope=global --memory-oper=read --memory-total-size=100G --memory-block-size=10M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
...
Operations performed: 10240 (416435.23 ops/sec)
102400.00 MB transferred (4164352.35 MB/sec)
OK, so that seems to have worked. The transfer rate is much better with larger values of --memory-block-size. By amortizing the slower calls over a 10MB read, I am able to get 4164352.35 MB/s, which is roughly 4TB/s.
What. 4TB/s? That does not sound right.
I decided to compile the code from scratch. While doing so, I noticed that gcc was called with -O2, which allows the compiler to optimize the code. Look again at the part of the benchmark that is supposed to read from memory.
case SB_MEM_OP_READ:
  for (; buf < end; buf++)
    tmp = *buf;
  break;
The memory is de-referenced from *buf and stored in tmp, but tmp itself is never used again before the larger function returns. Since the value is never consumed (and tmp is not declared volatile), the compiler can legitimately optimize the read out altogether.
We can check that by running gdb against the compiled code.
(gdb) disassemble /m memory_execute_request
304       case SB_MEM_OP_READ:
305         for (; buf < end; buf++)
   0x0000000000409b9d <+773>:  addq   $0x4,-0x30(%rbp)
   0x0000000000409ba2 <+778>:  mov    -0x30(%rbp),%rax
=> 0x0000000000409ba6 <+782>:  cmp    -0x18(%rbp),%rax
   0x0000000000409baa <+786>:  jb     0x409b94

306           tmp = *buf;
   0x0000000000409b94 <+764>:  mov    -0x30(%rbp),%rax     <---- Linux "perf" reports ~25% of time here.
   0x0000000000409b98 <+768>:  mov    (%rax),%eax
   0x0000000000409b9a <+770>:  mov    %eax,-0xc(%rbp)      <---- Linux "perf" reports ~50% of time here.

307         break;
   0x0000000000409bac <+788>:  jmp    0x409bd1

308       default:
309         log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n",
   0x0000000000409bb4 <+796>:  mov    %eax,%edx
   0x0000000000409bb6 <+798>:  mov    $0x415140,%esi
   0x0000000000409bbb <+803>:  mov    $0x0,%edi
   0x0000000000409bc0 <+808>:  mov    $0x0,%eax
   0x0000000000409bc5 <+813>:  callq  0x404b94
and here is the optimized version.
303         break;
304       case SB_MEM_OP_READ:
305         for (; buf < end; buf++)
306           tmp = *buf;                     <---- Notice no ASM here!
307         break;
308       default:
309         log_text(LOG_FATAL, "Unknown memory request type:%d. Aborting...\n",
   0x00000000004081c1 <+257>:  xor    %eax,%eax
   0x00000000004081c3 <+259>:  mov    $0x413720,%esi
   0x00000000004081c8 <+264>:  xor    %edi,%edi
   0x00000000004081ca <+266>:  callq  0x404130
After compiling the benchmark without compiler optimizations, the throughput numbers were much more believable, and using Linux perf again we see the bulk of the time in the read operation, as we’d expect.
 % Hot Code
 24.23 │2fc:   mov    -0x30(%rbp),%rax
       │       mov    (%rax),%eax
 52.80 │       mov    %eax,-0xc(%rbp)
 12.30 │       addq   $0x4,-0x30(%rbp)
  0.10 │30a:   mov    -0x30(%rbp),%rax
 10.54 │       cmp    -0x18(%rbp),%rax
       │     ↑ jb     2fc
Without the optimization, the benchmark is now bounded by clock speed, and so even in this case it does not do a good job of measuring actual memory bandwidth.
Instead, I recommend the STREAM benchmark: http://www.cs.virginia.edu/stream/
With the STREAM benchmark, the observed throughput was around 60GB/s. Even STREAM needs to be compiled correctly to get the highest memory throughput for the platform.
I get around 30GB/s with 4 threads, and 60GB/s using both sockets (2 sockets of 8 cores == 16 vCPUs).
Compile the STREAM benchmark with multi-threaded capability using the following options, which allows us to pull memory requests from all sockets/cores/MMUs:
gcc -O3 -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20 stream.c -o stream.100M_20_MP -fopenmp -D_OPENMP -lpthread -mcmodel=large

root@unix-gary-jump:~/stream# ./stream.100M_20_MP
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 16
Number of Threads counted = 16
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 19848 microseconds.
   (= 19848 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           53917.4     0.029915     0.029675     0.031679
Scale:          53633.5     0.030230     0.029832     0.032336
Add:            60234.9     0.040285     0.039844     0.042503
Triad:          60455.2     0.039948     0.039699     0.041805
-------------------------------------------------------------
Monday, 28 October 2013
Tuesday, 21 May 2013
Recently I was talking to a network guy about storage IOPS, when he asked me: what is an IOP, exactly? Immediately I remembered having the exact same question when I first started seriously working with storage gear back in 2006. I knew that an IOP was an I/O Operation, and IOPS typically means I/O Operations per second. But surely nobody really cares how many I/Os are done per second? We only care how long it takes to move the required amount of data (typically MB/s or GB/s).
It turns out that there are a lot of cases where the amount of data being read is actually quite small, and the bottleneck to having the application run faster is how long it takes to do an individual IO. But you might say, that still makes no sense. If I do an I/O of 512b in 1 second, then surely an I/O of 8192 bytes takes 16 seconds – and so, again what I really care about is MB/s.
That’s what I thought too. However, for small I/O amounts (e.g. 512b – 8K) the “setup time” far outweighs the “transfer time” to perform the I/O. This is especially true for spinning disk drives.
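To put some rough numbers on that, here is a small sketch using illustrative assumptions of my own (about 4ms of combined seek and rotational latency for a 10K RPM drive, and roughly 100MB/s of sequential transfer rate), not measurements of any particular device:

# Rough illustration of setup time vs transfer time for random I/O on a
# spinning disk. The latency and transfer-rate figures are assumptions for
# a typical 10K RPM drive, not measurements.
setup_s = 0.004              # ~4 ms seek + rotational latency per random I/O
xfer_rate = 100e6            # ~100 MB/s sequential transfer rate, bytes/sec

for size in (512, 8192, 1024 * 1024):
    xfer_s = size / xfer_rate        # time to move the data itself
    total_s = setup_s + xfer_s       # setup dominates for small sizes
    iops = 1.0 / total_s
    mbps = size * iops / 1e6
    print("%8d bytes: %6.2f ms per IO, %6.1f IOPS, %7.2f MB/s" %
          (size, total_s * 1000, iops, mbps))

A 512 byte random read and an 8K random read both cost roughly the same 4ms, so for small random I/O the useful figure of merit is the number of I/Os per second rather than MB/s. It also shows why a single spindle tops out at a few hundred random IOPS, which is where the spindle-count discussion further down comes from.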
This brings us to the next unstated assumption when storage people talk about IOPS. They are typically talking about small “random” I/Os.
Why are small random I/Os interesting? Surely nobody writes applications to just randomly read data from disk? Of course not, but what tends to happen is that for many large applications, the access requests appear to be random from the point of view of the disk drive. Take for instance a web ticket booking application: it could be selling airplane tickets or concert tickets. In either case the vendor has many more customers in total than can be kept in the main memory of the application servers. So, whenever I log in to book a ticket, the application has to fetch my information from disk. Since the application has no idea that I am about to book a ticket, the request to the disk drive results in a random IO (there was no way to predict the incoming request). The amount of data requested is probably quite small, a few KB. Mix up my request with everyone else’s requests, and what this starts to look like is a very large amount of random IO. My request and the next guy’s request are not related, so it’s all random.
Now, since disk drive speeds are many orders of magnitude slower than CPU and main memory, the first-order bottleneck of large-scale servers is often the time taken to read data from disk so that it can be processed. Hence the random IOPS (or just IOPS) of the storage system becomes a very important figure when architecting the overall solution.
Before SSD/Flash technology came along, the only mainstream way to solve the problem (besides buying enough DRAM to store everything in memory) was to lash together hundreds of slow spinning disks to provide enough aggregate IOPS to service the application. Doing so requires quite a lot of software to lay out the data efficiently such that the aggregate IOPS can be realized by the application accessing it. This is hard, and it is how EMC and, more recently, NetApp made a good business of managing hundreds of spindles to provide throughput and capacity.
Now that SSD/Flash is mainstream, we don’t really need to use HDDs to provide IOPS: we use Flash for the IOPS and HDDs for their large capacity. The software challenge is now how best to do that.
Friday, 3 May 2013
Tuesday, 23 April 2013
Let’s say I have a cache of 512G. My block size is 4K, so that’s 134,217,728 entries. How many blocks will I need to read to fill the cache? Well, if I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.
So, if I read that same 512G randomly, how many blocks will end up in the cache? Not 512G worth, because some of those random blocks will be ‘re-hits’ of blocks that were already cached.
It turns out, by simulation, that the ratio is 0.6321 of the entire cache (about 323.5G). Repeated simulations show that the ratio is pretty constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3676748
Miss (Insert) ratio 0.6323252
Result of 4 trials…
print (0.6322271+0.6320339+0.6322528+0.6320873)/4
0.632150275
Is there anything interesting about that value?
It tells us that the value 0.6321 can be more-or-less represented as 1 - 1/e.
Furthermore, Wolfram Alpha confirms the value: http://www.wolframalpha.com/input/?i=1-1%2Fe
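For reference, the series expansion behind that value (written out here from the standard Taylor series for 1/e, not taken from the original post) is:

1 - \frac{1}{e} \;=\; \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k!} \;=\; \frac{1}{1!} - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \cdots \;\approx\; 0.6321206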
I can’t figure out which of the above series representations actually explains the cache hit behavior, but it makes sense that it has something to do with factorials, since the more data we read in, the higher the chance that the next read will actually be a hit in the cache rather than inserting a new value.
If anyone can explain the underlying math to this effect, I would be very interested. Looks like it’s related to http://en.wikipedia.org/wiki/Derangement
Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.
Here’s code to simulate the cache in Python. This causes python to malloc about 400M of memory.
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3677729
Miss (Insert) ratio 0.6322271
import random
import math
import numpy

# 10 Million entries.
cachesize = 10000000
hit = 0.0
cache = []
miss = 0.0

# Start with an empty cache.
for i in range(0, cachesize):
    cache.append(0)

# Read "cachesize" random blocks; count re-hits vs new insertions.
for i in range(0, cachesize):
    b = random.randint(0, cachesize - 1)
    if cache[b] == 1:
        hit += 1
    else:
        cache[b] = 1
        miss += 1

print "Re-Hit ratio", hit / cachesize
print "Miss (Insert) ratio", miss / cachesize
Monday, 14 January 2013
For the most part, you only really care about disk zeroing if you’re waiting for an aggregate to be created on disks that were previously used on an old aggregate that was only just destroyed.
To see how long disk-zeroing is taking, use the command “aggr status -r”
filer-6280*> aggr status -r
Aggregate aggr1_large (creating, raid_dp, initializing) (block checksums)
  Plex /aggr1_large/plex0 (offline, empty, active)

Targeted to traditional volume or aggregate but not yet assigned to a raid group
  RAID Disk  Device    HA  SHELF BAY CHAN Pool Type  RPM
  ---------  ------    --- ----- --- ---- ---- ----  -----
  pending    3a.00.8   3a    0    8  SA:A  0   SAS   10000 (zeroing, 71% done)
  pending    3a.00.14  3a    0   14  SA:A  0   SAS   10000 (zeroing, 69% done)
  pending    3a.00.16  3a    0   16  SA:A  0   SAS   10000 (zeroing, 70% done)
  pending    3a.00.18  3a    0   18  SA:A  0   SAS   10000 (zeroing, 69% done)
  pending    3a.00.20  3a    0   20  SA:A  0   SAS   10000 (zeroing, 69% done)
  pending    3a.00.22  3a    0   22  SA:A  0   SAS   10000 (zeroing, 71% done)
Wednesday, 9 January 2013
As a non-programmer, I’ve always been reticent to use anything with acronyms like API and SDK, relying instead on issuing a full command line using rsh or ssh. That works for a while, until you want to start doing things like checking for errors, and until you get fed up with 90% of the script being dedicated to parsing the output. NetApp filers have a reasonable API that can be used to get both sysadmin data (number and fullness of volumes) and performance analysis numbers (number of ops and the response times). Best of all, the SDK provides libraries for both Perl and Python. I have switched almost entirely to Python for anything that needs more than a few lines of automation. In Python, all you have to do is import two modules and you can start using the API.
The API can be downloaded from the netapp support site now.netapp.com under the ‘Download Software’ link.
The API is implemented at the lowest level by sending RPC/XML calls over HTTP to the filer. Inside NetApp, any new functionality must provide API (aka ZAPI) access, so learning the API should be a good investment. It helps to know that the implementation is XML, since retrieving the data returned from the filer follows the tortuous access pattern familiar to anyone who has used XML in the past.
I have used the API in my lab to monitor disk usage during long term testing, and in my previous life as a consultant implemented a scheduler to manage database snapshots. Once you’re used to accessing the API, it’s much easier than sending CLI commands over ssh/rsh.
Here are the steps to get the API
Download the API (ontapi SDK)
You’re looking for the “NetApp Manageability SDK”.
!!Now the download will actually start!!
The file is called “netapp-manageability-sdk-5.0R1.zip” on my mac.
Inside the ‘lib’ directory is python/NetApp. These are the Python modules that we’ll be using.
DfmErrno.py NaElement.py NaErrno.py NaServer.py
For now though, let’s start with something simple…
lovebox:[~] $ export PYTHONPATH=$PYTHONPATH:~/Downloads/netapp-manageability-sdk-5.0R1/lib/python/NetApp/
lovebox:[~] $ ipython

In : from NaElement import *
In : from NaServer import *
In : server=NaServer("gjlfiler.mylab.netapp.com",1,6)
In : server.set_admin_user('root',"root")
In : cmd = NaElement('system-get-info')
In : out=server.invoke_elem(cmd)
In : system_info=out.child_get("system-info")
In : system_info.child_get_string("system-name")
Out: u'gjlfiler'
from NaElement import *
from NaServer import *

server = NaServer("gjlfiler.mylab.netapp.com", 1, 6)
server.set_admin_user('root', "root")
cmd = NaElement('system-get-info')
out = server.invoke_elem(cmd)
system_info = out.child_get("system-info")
print system_info.child_get_string("system-name")
Something more tricky
Regrettably, the SDK documentation no longer contains the ONTAP portion of the API, in other words what calls I can make to the filer and what it will respond with. To access that documentation:
Click the link for Data ONTAP API Documentation, and download the zipfile (currently that filename is netapp-manageability-sdk-ontap-api-documentation.zip)
Again, I was unable to view the html doc with Chrome for some reason, but Firefox works OK.
So, let’s try something else: how about a short script to get the size of each volume in the system.
We see from the documentation that the function returns a list of volumes, in a structure named “volumes”, because that’s the name given in the “Output Name” field in the API documentation.
Input parameters (descriptions from the API documentation):
- “If set to "true", more detailed volume information is returned. If not supplied or set to "false", this extra information is not returned.”
- “The name of the volume for which we want status information. If not supplied, then we want status for all volumes on the filer. Note that if status information for more than 20 volumes is desired, the volume-list-info-iter-* zapis will be more efficient and should be used instead.”

Output:
- volumes: “List of volumes and their status information.”
So, now we know that we want to grab the volume-info list (which we can guess is a list of each volume in the system, and each item in the list has information about the volume).
So, as before we do some setup to reach the filer, and this time we’re going to issue the command volume-list-info.
In : import sys
In : sys.path.append("/opt/netapp/netapp-manageability-sdk-5.0R1/lib/python/")
In : from NaElement import *
In : from NaServer import *
In : filer=NaServer("gjlfiler.mylab.netapp.com",1,6)
In : filer.set_admin_user('root','root')
In : cmd = NaElement("volume-list-info")
In : ret = filer.invoke_elem(cmd)
In : ret
Out:
Above we see that ‘ret’ is just an NaElement instance, but we want the specific ‘volumes’ object. There might be more than one object returned to us (although in this case there is not), so we have to unpack the volumes object, just like in the simple example we did first.
In : volumes = ret.child_get("volumes")
In : volumes
Out:
So, now we have an object that should be a list of volumes. But unfortunately we can’t just access that as a real python list – we have to use the specific accessors.
volume-info
   vol
   vol
   vol
      --- size
      --- name
      --- etc.
   vol
   vol
In : for vol in volumes.children_get():
   ...:     print vol.child_get_string("name")
   ...:
db2_fv
db3_fv
db4_fv
db5_fv
db6_fv
log1_fv
log2_fv
log3_fv
log4_fv
log5_fv
log6_fv
db1_fv
vol0
Since we also want the size, we can find that field in the volume-info structure, figure out its type and use the correct accessor. In this case we can make a guess that size-total is what we want, and that its type is integer, so we end up with this:
for vol in volumes.children_get():
    print vol.child_get_string("name")
    print vol.child_get_int("size-total")

db2_fv
3006477107200
db3_fv
3006477107200
db4_fv
3006477107200
db5_fv
3006477107200
db6_fv
3006477107200
log1_fv
21474836480
log2_fv
21474836480
log3_fv
21474836480
log4_fv
21474836480
log5_fv
21474836480
log6_fv
21474836480
db1_fv
3006477107200
vol0
476002729984
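Putting the pieces together, a complete standalone version might look like the following sketch (the filer name, credentials and SDK path are the placeholders used in the examples above, so adjust them for your environment):

import sys
# Path to the NetApp Manageability SDK Python modules; adjust as required.
sys.path.append("/opt/netapp/netapp-manageability-sdk-5.0R1/lib/python/NetApp")

from NaElement import *
from NaServer import *

# Placeholder filer name and credentials, as used in the examples above.
filer = NaServer("gjlfiler.mylab.netapp.com", 1, 6)
filer.set_admin_user('root', 'root')

# Ask for every volume, then walk the returned "volumes" structure.
ret = filer.invoke_elem(NaElement("volume-list-info"))
volumes = ret.child_get("volumes")

for vol in volumes.children_get():
    name = vol.child_get_string("name")
    size_gb = vol.child_get_int("size-total") / (1024.0 ** 3)
    print "%-10s %12.1f GB" % (name, size_gb)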
The documentation that tells you about child_get() etc. is in the SDK documentation, in the following section
Home > NetApp Manageability SDK > Programming Guide > SDK Core APIs > Python Core APIs > Input Output Management APIs
Wednesday, 2 January 2013
Last week a number of twitter users were posting about a very cool looking interactive latency predicterizor. In the small world of computer performance nerds, it was a veritable tsunami of attention. My twitter skills are so feeble that I cannot figure out how to determine the total number of tweets – but as of now (Wed Jan 2nd 5:32 EST) the most recent tweet was 19 minutes ago. Fascinating.
The interactive paper is here http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Flash paper : http://cseweb.ucsd.edu/users/swanson/papers/FAST2012BleakFlash.pdf
From the interactive diagram, which states a main memory read latency of 1uSec, you’d assume that reading from an SSD is 17x slower than main memory. Personally I think that’s WAY off, for a couple of reasons. Firstly, and least interestingly, the 17uSec is for a direct access to the NAND cell itself. In practice SSDs are packaged with a Flash Translation Layer (FTL) which translates something like a linear address range (LBA) into mappings to the flash memories. The FAST paper (above) pegs the FTL latency at 30 uSec, which seems high relative to the 17 uSec response from the NAND memory (the comments in the code, and the paper itself, seem to peg the NAND response time at 20 uSec, but the interactive tool shows me 17 uSec for “2012”).
The more interesting consideration – particularly given the title “Latency Numbers every _programmer_ should know” is that, as a programmer – I can to a large extent expect to achieve a 1 uSec response from main memory when I attempt to read some value. However, there’s no way that I will ever get the 17 uSec (or even a 17+39 uSec) response from SSD. The reason is that, as a user I cannot access that SSD directly. For most programmers – the SSD will be accessed via a filesystem, then a device driver.
Programmers’ access to the filesystem is almost always on the other side of a system call, which means a trap or interrupt, saving stack pointers and setting up addresses for buffers, etc. Typically the data will be read from the SSD into the memory of the host computer and then returned to the user/programmer. Even with DMA and other zero-copy techniques, there will be many reads/writes to main memory just to set up the system call.
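To get a feel for just the software part of that path, here is a rough sketch of my own (not from the original post) that times a read() served entirely from the page cache, so no device is touched at all, against a plain user-space copy of the same 4KB. The file name and iteration count are arbitrary placeholders, and the absolute numbers will vary widely with kernel and hardware:

# Rough sketch: per-call cost of a read() system call (data already in the
# page cache, so no device access) versus a user-space copy of the same 4KB.
# "testfile.dat" is a placeholder for any existing file of at least 4KB.
import os
import time

N = 100000
fd = os.open("testfile.dat", os.O_RDONLY)

t0 = time.time()
for i in range(N):
    os.lseek(fd, 0, os.SEEK_SET)    # same offset every time, stays in page cache
    os.read(fd, 4096)               # each read is a full kernel round trip
syscall_us = (time.time() - t0) / N * 1e6
os.close(fd)

src = b"\0" * 4096
dst = bytearray(4096)
t0 = time.time()
for i in range(N):
    dst[:] = src                    # plain memory copy, no kernel involved
memcpy_us = (time.time() - t0) / N * 1e6

print("per-call read() syscall : %.2f us" % syscall_us)
print("per-call memory copy    : %.2f us" % memcpy_us)

Even with the data already cached in DRAM, the syscall path tends to come out in the neighborhood of a microsecond or more per call, which is already comparable to the 1 uSec main memory figure before any SSD latency is added.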
When we think about spinning-disk accesses of ~4 milliseconds (ms), or 4,000 uSec, we can more-or-less gloss over the cost of making the system call and moving the data through the kernel and back to the user, because the overhead of moving this mechanical instrument dwarfs the other costs. But with SSDs, that setup cost starts to limit how quickly a user-land application can really access data.