Tuesday, 21 May 2013
Recently I was talking to a network guy about storage IOPS when he asked me: what is an IOP, exactly? I immediately remembered having the exact same question when I first started seriously working with storage gear back in 2006. I knew that an IOP was an I/O Operation, and that IOPS typically means I/O Operations per second. But surely nobody really cares how many I/Os are done per second? We only care how long it takes to move the required amount of data (typically MB/s or GB/s).
It turns out that there are a lot of cases where the amount of data being read is actually quite small, and the bottleneck to making the application run faster is how long it takes to do an individual I/O. But, you might say, that still makes no sense. If an I/O of 512 bytes takes 1 second, then surely an I/O of 8192 bytes takes 16 seconds – and so, again, what I really care about is MB/s.
That’s what I thought too. However, for small I/O sizes (e.g. 512 bytes – 8K) the “setup time” far outweighs the “transfer time” of the I/O. This is especially true for spinning disk drives.
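To see why setup time dominates, here is a back-of-envelope sketch; the seek, rotation and transfer-rate figures are assumed ballpark numbers for a spinning drive, not measurements:

```python
# Illustrative numbers only (assumed, not measured): a 7.2K RPM drive
# spends roughly 8 ms seeking plus ~4 ms average rotational latency
# before it transfers anything, while a ~100 MB/s media rate moves a
# 4 KB block in about 40 microseconds.
seek_ms = 8.0
rotation_ms = 4.0
transfer_rate_mb_s = 100.0
block_kb = 4.0

setup_ms = seek_ms + rotation_ms
transfer_ms = (block_kb / 1024.0) / transfer_rate_mb_s * 1000.0

print("setup %.1f ms, transfer %.4f ms" % (setup_ms, transfer_ms))
# Setup dominates: the drive spends over 99% of the I/O just getting
# to the data, so small random IOPS are bounded by seek + rotation.
```

Under these assumptions the mechanical positioning costs roughly 300x more than moving the 4K of data, which is why small random I/O is measured in operations per second rather than MB/s.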
This brings us to the next unstated assumption when storage people talk about IOPS. They are typically talking about small “random” I/Os.
Why are small random I/Os interesting? Surely nobody writes applications that just randomly read data from disk? Of course not, but for many large applications the access requests appear random from the point of view of the disk drive. Take, for instance, a web ticket-booking application – airplane tickets or concert tickets, either way. In each case the vendor has far more customers in total than can be kept in the main memory of the application servers. So whenever I log in to book a ticket, the application has to fetch my information from disk. Since the application has no idea that I am about to book a ticket, the request to the disk drive results in a random I/O (there was no way to predict the incoming request). The amount of data requested is probably quite small, a few KB. Mix my request up with everyone else’s, and what this starts to look like is a very large amount of random I/O. My request and the next guy’s request are not related, so it’s all random.
Now, since disk-drive speeds are many orders of magnitude slower than CPU and main memory, the first-order bottleneck of large-scale servers is often the time taken to read data from disk so that it can be processed. Hence the random IOPS (or just IOPS) of the storage system becomes a very important figure when architecting the overall solution.
Before SSD/Flash technology came along, the only mainstream way to solve the problem (besides buying enough DRAM to store everything in memory) was to lash together hundreds of slow spinning disks to provide enough aggregate IOPS to service the application. Doing so requires quite a lot of software to lay out the data efficiently such that the aggregate IOPS can be realized by the application accessing it. This is hard, and it is how EMC, and lately NetApp, made a good business of managing hundreds of spindles to provide throughput and capacity.
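As a rough illustration of why it takes hundreds of spindles (the per-drive IOPS figure and the target are assumed ballpark numbers for the sake of the arithmetic, not vendor specs):

```python
# Ballpark sizing sketch (all numbers assumed): a 15K RPM drive does
# roughly 200 small random IOPS, so hitting a 50,000 IOPS target from
# spinning disk alone needs on the order of 250 spindles - regardless
# of how much capacity you actually wanted.
target_iops = 50000
iops_per_disk = 200   # rough figure for one 15K RPM drive

spindles = target_iops // iops_per_disk
print(spindles)  # 250
```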
Now that SSD/Flash is mainstream, we don’t really need HDDs to provide IOPS: we use Flash for the IOPS and HDDs for their large capacity. The software challenge now is how best to do that.
Tuesday, 23 April 2013
Let’s say I have a cache of 512G. My block size is 4K, so that’s 134,217,728 entries. How many blocks will I need to read to fill the cache? Well, if I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.
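The entry count above is just the cache size divided by the block size:

```python
cache_bytes = 512 * 2**30   # 512G cache
block_bytes = 4 * 2**10     # 4K block size
print(cache_bytes // block_bytes)  # 134217728
```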
So, if I read that same 512G randomly, how many blocks will end up in cache? Not 512G, because some of those random blocks will be ‘re-hits’ of blocks that were already cached.
It turns out, by simulation, that the filled fraction is 0.6321 of the entire cache (about 323.6G). Repeated simulations show that the ratio is pretty constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3676748
Miss (Insert) ratio 0.6323252
Result of 4 trials…
print (0.6322271+0.6320339+0.6322528+0.6320873)/4
0.632150275
Is there anything interesting about that value?
This tells us that the value 0.6321 can be more-or-less represented as 1 - 1/e ≈ 0.63212.
Furthermore, Wolfram Alpha confirms this: http://www.wolframalpha.com/input/?i=1-1%2Fe
I can’t figure out which of the above series representations actually explains the cache-hit behavior, but it makes sense for it to have something to do with factorials, since the more data we read in, the higher the chance that the next read will be a hit in the cache rather than inserting a new value.
If anyone can explain the underlying math behind this effect, I would be very interested. It looks like it’s related to http://en.wikipedia.org/wiki/Derangement
Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.
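For what it’s worth, here is one way to arrive at the number without the full series machinery (my reading, not a rigorous proof): each random read lands on one of N slots uniformly, so a given slot survives all N reads untouched with probability (1 - 1/N)^N, which tends to 1/e as N grows. The filled (miss/insert) fraction is therefore about 1 - 1/e.

```python
# Each of N uniform random draws misses a particular slot with
# probability (1 - 1/N); after N draws the slot is still empty with
# probability (1 - 1/N)**N, which tends to 1/e for large N.
import math

N = 10000000
filled = 1.0 - (1.0 - 1.0 / N) ** N
print(filled)              # ~0.6321, matching the simulation
print(1.0 - 1.0 / math.e)  # the limit, 1 - 1/e
```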
Here’s code to simulate the cache in Python. It causes Python to allocate about 400M of memory.
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3677729
Miss (Insert) ratio 0.6322271
import random

# 10 million entries.
cachesize = 10000000
hit = 0.0
miss = 0.0

# Empty cache: one slot per block.
cache = []
for i in range(0, cachesize):
    cache.append(0)

# Read 'cachesize' random blocks; count re-hits vs inserts.
for i in range(0, cachesize):
    b = random.randint(0, cachesize - 1)
    if cache[b] == 1:
        hit += 1
    else:
        cache[b] = 1
        miss += 1

print "Re-Hit ratio", hit / cachesize
print "Miss (Insert) ratio", miss / cachesize
Monday, 14 January 2013
For the most part, you only really care about disk zeroing if you’re waiting for an aggregate to be created on disks that were previously used on an old aggregate that was only just destroyed.
To see how long disk-zeroing is taking, use the command “aggr status -r”
filer-6280*> aggr status -r
Aggregate aggr1_large (creating, raid_dp, initializing) (block checksums)
  Plex /aggr1_large/plex0 (offline, empty, active)
    Targeted to traditional volume or aggregate but not yet assigned to a raid group

      RAID Disk Device   HA  SHELF BAY CHAN Pool Type RPM
      --------- ------   --  ----- --- ---- ---- ---- -----
      pending   3a.00.8  3a  0     8   SA:A 0    SAS  10000 (zeroing, 71% done)
      pending   3a.00.14 3a  0     14  SA:A 0    SAS  10000 (zeroing, 69% done)
      pending   3a.00.16 3a  0     16  SA:A 0    SAS  10000 (zeroing, 70% done)
      pending   3a.00.18 3a  0     18  SA:A 0    SAS  10000 (zeroing, 69% done)
      pending   3a.00.20 3a  0     20  SA:A 0    SAS  10000 (zeroing, 69% done)
      pending   3a.00.22 3a  0     22  SA:A 0    SAS  10000 (zeroing, 71% done)
Wednesday, 9 January 2013
As a non-programmer, I’ve always been reluctant to use anything with acronyms like API and SDK, relying instead on issuing a full command line over rsh or ssh. That works for a while – until you want to start doing things like checking for errors, and until you get fed up with 90% of the script being dedicated to parsing the output. NetApp filers have a reasonable API that can be used to get both sysadmin data (number and fullness of volumes) and performance-analysis numbers (number of ops and the response times). Best of all, the SDK provides libraries for both Perl and Python. I have switched almost entirely to Python for anything that needs more than a few lines of automation. In Python, all you have to do is import two libraries and you can start using the API.
The API can be downloaded from the netapp support site now.netapp.com under the ‘Download Software’ link.
The API is implemented at the lowest level by sending RPC calls encoded as XML over HTTP to the filer. Inside NetApp, any new functionality must provide API (aka ZAPI) access, so learning the API should be a good investment. It helps to know that the implementation is XML, since retrieving the data returned from the filer follows the tortuous access pattern familiar to anyone who has used XML in the past.
I have used the API in my lab to monitor disk usage during long term testing, and in my previous life as a consultant implemented a scheduler to manage database snapshots. Once you’re used to accessing the API, it’s much easier than sending CLI commands over ssh/rsh.
Here are the steps to get the API:
Download the API (ontapi SDK)
You’re looking for “NetApp Manageability SDK”.
Now the download will actually start.
The file is called “netapp-manageability-sdk-5.0R1.zip” on my Mac.
Inside the ‘lib’ directory is python/NetApp. These are the Python modules that we’ll be using.
DfmErrno.py NaElement.py NaErrno.py NaServer.py
For now though, let’s start with something simple…
lovebox:[~] $ export PYTHONPATH=$PYTHONPATH:~/Downloads/netapp-manageability-sdk-5.0R1/lib/python/NetApp/
lovebox:[~] $ ipython

In : from NaElement import *
In : from NaServer import *
In : server=NaServer("gjlfiler.mylab.netapp.com",1,6)
In : server.set_admin_user('root',"root")
In : cmd = NaElement('system-get-info')
In : out=server.invoke_elem(cmd)
In : system_info=out.child_get("system-info")
In : system_info.child_get_string("system-name")
Out: u'gjlfiler'
from NaElement import *
from NaServer import *

server = NaServer("gjlfiler.mylab.netapp.com", 1, 6)
server.set_admin_user('root', "root")
cmd = NaElement('system-get-info')
out = server.invoke_elem(cmd)
system_info = out.child_get("system-info")
print system_info.child_get_string("system-name")
Something more tricky
Regrettably, the SDK documentation no longer contains the ONTAP portion of the API – in other words, which calls I can make to the filer and what it will respond with. To access that documentation:
Click the link for Data ONTAP API Documentation, and download the zipfile (currently that filename is netapp-manageability-sdk-ontap-api-documentation.zip)
Again, I was unable to view the html doc with Chrome for some reason, but Firefox works OK.
So, let’s try something else – how about a short script to get the size of each volume in the system?
We see from the documentation that the function returns a list of volumes, in a structure named “volumes”, because that’s the name given in the “Output Name” field in the API documentation.
“If set to ‘true’, more detailed volume information is returned. If not supplied or set to ‘false’, this extra information is not returned.”

“The name of the volume for which we want status information. If not supplied, then we want status for all volumes on the filer. Note that if status information for more than 20 volumes is desired, the volume-list-info-iter-* zapis will be more efficient and should be used instead.”

“List of volumes and their status information.”
So, now we know that we want to grab the volume-info list (which we can guess is a list of each volume in the system, and each item in the list has information about the volume).
So, as before, we do some setup to reach the filer, and this time we issue the command volume-list-info.
In : import sys
In : sys.path.append("/opt/netapp/netapp-manageability-sdk-5.0R1/lib/python/")
In : from NaElement import *
In : from NaServer import *
In : filer=NaServer("gjlfiler.mylab.netapp.com",1,6)
In : filer.set_admin_user('root','root')
In : cmd = NaElement("volume-list-info")
In : ret = filer.invoke_elem(cmd)
In : ret
Out: <NaElement.NaElement instance>
Above we see that ‘ret’ is just an NaElement instance; we want the specific ‘volumes’ object. There might be more than one object returned to us (although in this case there is not), so we have to unpack the volumes object – just like in the simple example we did first.
In : volumes = ret.child_get("volumes")
In : volumes
Out: <NaElement.NaElement instance>
So now we have an object that should be a list of volumes. But unfortunately we can’t just access it as a real Python list – we have to use the specific accessors.
volumes
    volume-info
        name
        size-total
        etc.
    volume-info
        ...
In : for vol in volumes.children_get():
...:     print vol.child_get_string("name")
...:
db2_fv
db3_fv
db4_fv
db5_fv
db6_fv
log1_fv
log2_fv
log3_fv
log4_fv
log5_fv
log6_fv
db1_fv
vol0
Since we also want the size, we find that field in the volume-info structure, figure out its type, and use the correct accessor. In this case we can guess that size-total is what we want and that its type is integer, and so we end up with this:
for vol in volumes.children_get():
    print vol.child_get_string("name")
    print vol.child_get_int("size-total")

db2_fv
3006477107200
db3_fv
3006477107200
db4_fv
3006477107200
db5_fv
3006477107200
db6_fv
3006477107200
log1_fv
21474836480
log2_fv
21474836480
log3_fv
21474836480
log4_fv
21474836480
log5_fv
21474836480
log6_fv
21474836480
db1_fv
3006477107200
vol0
476002729984
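Those raw byte counts are hard to read. A small helper (my own sketch, not part of the NetApp SDK) can turn the size-total values into something human-friendly:

```python
# Hypothetical helper (not part of the SDK): convert a raw byte count
# into a human-readable string.
def human_size(nbytes):
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(nbytes)
    for unit in units:
        # Stop once the value fits in this unit, or we run out of units.
        if size < 1024.0 or unit == units[-1]:
            return "%.2f %s" % (size, unit)
        size /= 1024.0

print(human_size(3006477107200))  # one of the db volumes above: 2.73 TB
```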
The documentation that tells you about child_get() etc. is in the SDK documentation, in the following section
Home > NetApp Manageability SDK > Programming Guide > SDK Core APIs > Python Core APIs > Input Output Management APIs
Wednesday, 2 January 2013
Last week a number of Twitter users were posting about a very cool-looking interactive latency predicterizor. In the small world of computer performance nerds, it was a veritable tsunami of attention. My Twitter skills are so feeble that I cannot figure out how to determine the total number of tweets – but as of now (Wed Jan 2nd 5:32 EST) the most recent tweet was 19 minutes ago. Fascinating.
The interactive paper is here http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Flash paper : http://cseweb.ucsd.edu/users/swanson/papers/FAST2012BleakFlash.pdf
From the interactive diagram – which states a main-memory read latency of 1 uSec – you’d assume that reading from an SSD is 17x slower than main memory. Personally I think that’s WAY off, for a couple of reasons. Firstly, and least interestingly, the 17 uSec is for a direct access to the NAND cell itself. In practice SSDs are packaged with a Flash Translation Layer (FTL), which translates something like a linear address range (LBA) into mappings to the flash memories. The FAST paper (above) pegs the FTL latency at 30 uSec, which seems high relative to the 17 uSec response from the NAND memory (the comments in the code, and the paper itself, seem to peg the NAND response time at 20 uSec, but the interactive tool shows me 17 uSec for “2012”).
The more interesting consideration – particularly given the title “Latency Numbers every _programmer_ should know” – is that, as a programmer, I can to a large extent expect to achieve a 1 uSec response from main memory when I attempt to read some value. However, there’s no way I will ever get a 17 uSec (or even a 17+39 uSec) response from an SSD. The reason is that, as a user, I cannot access that SSD directly. For most programmers the SSD will be accessed via a filesystem, then a device driver.
Programmers’ access to the filesystem is almost always on the other side of a system call, which means a trap or interrupt, saving stack pointers, setting up addresses for buffers, etc. Typically the data will be read from SSD into the memory of the host computer and then returned to the user/programmer. Even with DMA and other zero-copy techniques, there will be many reads/writes to main memory to set up the system call.
When we think about spinning-disk accesses of ~4 milliseconds (ms), or 4,000 uSecs, we can more-or-less gloss over the cost of making the system call and moving the data through the kernel and back to the user, because the overhead of moving this mechanical instrument dwarfs the other costs. But with SSDs, that setup cost starts to impact how quickly a user-land application can really access data.
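To put rough numbers on that (the per-call kernel overhead figure is an assumption for illustration, not a measurement):

```python
# Assumed figures: ~10 us of syscall/kernel overhead per read, a 4 ms
# HDD access, and a ~56 us SSD access (the 17 us NAND + 39 us FTL
# numbers discussed above).
overhead_us = 10.0
hdd_us = 4000.0
ssd_us = 56.0

print("HDD: overhead is %.2f%% of the I/O" % (100 * overhead_us / (hdd_us + overhead_us)))
print("SSD: overhead is %.2f%% of the I/O" % (100 * overhead_us / (ssd_us + overhead_us)))
# The same fixed software cost is noise against a disk seek, but a
# material fraction of an SSD read.
```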
Wednesday, 19 December 2012
Great article from Cary Millsap, covers performance analysis in general – not specific to Oracle.
Wednesday, 19 December 2012
Anyone who deals with systems performance in an enterprise environment will inevitably need to deal with an Oracle database performance problem. These databases often make up part of the largest, most complex, and most business-critical infrastructures. In the past I have always looked at DB performance issues from a systemic perspective – looking at the database in the large. Below are a couple of articles that I found while looking through some old Oracle magazines. They take a more session-centric view of performance – diagnosing specific wait events that might be causing a particular user session to appear slow.
Examples include finding the Oracle session that belongs to a particular Unix PID, and a bunch of related tricks using the v$ tables.