I am definitely not a guru yet. But one day….
- Author: gary
- Published: May 3rd, 2013
- Category: Uncategorized
- Comments: None
VMware performance for gurus.
- Author: gary
- Published: Apr 23rd, 2013
- Category: Uncategorized
- Comments: 2
Predicting disk cache hits for 100% random access.
Let’s say I have a cache of 512G. My block size is 4K, so that’s 134,217,728 entries. How many blocks will I need to read to fill the cache? Well, if I read data sequentially, then obviously I just need to read 512G worth of files. But what if I read random blocks? Most caches will try to cache randomly read blocks, since sequential reads get the least benefit from disk caching.
So, if I read that same 512G randomly, how many blocks will end up in cache? Not 512G, because some of those random reads will be ‘re-hits’ of blocks that were already cached.
By simulation, it turns out that about 0.6321 of the entire cache (about 323.6G) ends up filled. Repeated simulations show that the ratio is pretty constant. So, is there something magical about the ratio 0.6321 (rather than 0.666, which was my guess)?
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3676748
Miss (Insert) ratio 0.6323252
Result of 4 trials…
print (0.6322271+0.6320339+0.6322528+0.6320873)/4
0.632150275
Is there anything interesting about that value?
http://www.wolframalpha.com/input/?i=0.6321
This tells us that the value 0.6321 can be more-or-less represented as:
1-(1/e)
Furthermore we see http://www.wolframalpha.com/input/?i=1-1%2Fe
I can’t figure out which of the above series representations actually explains the cache hit behavior, but it makes sense for it to have something to do with factorials, since the more data we read in, the higher the chance that the next read will be a hit in the cache rather than an insert of a new value.
If anyone can explain the underlying math to this effect, I would be very interested. Looks like it’s related to http://en.wikipedia.org/wiki/Derangement
Thanks go to Matti Vanninen for pointing out that 0.6321 was somehow magical.
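One possible explanation, offered as a sketch rather than anything taken from the simulation itself: if each of N reads picks one of N cache slots uniformly at random, a given slot is missed by every read with probability (1 - 1/N)^N, which tends to 1/e as N grows. The expected filled fraction is then 1 - (1 - 1/N)^N, or roughly 1 - 1/e. The function name below is made up for illustration, and the 512G/4K figures are the ones from the example above.

```python
import math


def expected_fill_fraction(slots, reads):
    """Expected fraction of cache slots filled after `reads` uniform random reads."""
    # A given slot stays empty only if every single read lands somewhere else.
    return 1.0 - (1.0 - 1.0 / slots) ** reads


slots = (512 * 2**30) // (4 * 2**10)  # 512G cache made of 4K blocks
fraction = expected_fill_fraction(slots, slots)

print("entries: %d" % slots)
print("expected fill fraction: %.4f" % fraction)
print("1 - 1/e: %.4f" % (1 - 1 / math.e))
print("cached data: %.1fG" % (fraction * 512))
```

The same function can be called with a different `reads` value to estimate how full the cache gets after reading more (or less) than one cache's worth of random blocks.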
Here’s code to simulate the cache in Python. This causes python to malloc about 400M of memory.
Garys-Nutanix-MBP13:Versions garylittle$ python ~/Dropbox/scratch/cachehit.py
Re-Hit ratio 0.3677729
Miss (Insert) ratio 0.6322271
import random

# 10 million entries.
cachesize = 10000000
hit = 0.0
miss = 0.0

# 0 = slot empty, 1 = block cached.
cache = [0] * cachesize

# Read as many random blocks as there are cache slots.
for i in range(cachesize):
    b = random.randint(0, cachesize - 1)
    if cache[b] == 1:
        # Block is already cached: count a re-hit.
        hit += 1
    else:
        # First time this block is read: insert it.
        cache[b] = 1
        miss += 1

print("Re-Hit ratio %s" % (hit / cachesize))
print("Miss (Insert) ratio %s" % (miss / cachesize))
Show status of aggregate creation / disk zero status.
For the most part, you only really care about disk zeroing if you’re waiting for an aggregate to be created on disks that were previously used on an old aggregate that was only just destroyed.
To see how long disk-zeroing is taking, use the command “aggr status -r”
filer-6280*> aggr status -r
Aggregate aggr1_large (creating, raid_dp, initializing) (block checksums)
Plex /aggr1_large/plex0 (offline, empty, active)
Targeted to traditional volume or aggregate but not yet assigned to a raid group
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
pending 3a.00.8 3a 0 8 SA:A 0 SAS 10000 (zeroing, 71% done)
pending 3a.00.14 3a 0 14 SA:A 0 SAS 10000 (zeroing, 69% done)
pending 3a.00.16 3a 0 16 SA:A 0 SAS 10000 (zeroing, 70% done)
pending 3a.00.18 3a 0 18 SA:A 0 SAS 10000 (zeroing, 69% done)
pending 3a.00.20 3a 0 20 SA:A 0 SAS 10000 (zeroing, 69% done)
pending 3a.00.22 3a 0 22 SA:A 0 SAS 10000 (zeroing, 71% done)
Netapp API hacking with python
As a non-programmer, I’ve always been reluctant to use anything with acronyms like API and SDK, relying instead on issuing a full command line using rsh or ssh. That works for a while, until you want to start doing things like checking for errors, and until you get fed up with 90% of the script being dedicated to parsing the output. NetApp filers have a reasonable API that can be used to get both sysadmin data (number and fullness of volumes) and performance analysis numbers (number of ops and the response times). Best of all, the SDK provides libraries for both Perl and Python. I have switched almost entirely to Python for anything that needs more than a few lines of automation. In Python, all you have to do is import two libraries and you can start using the API.
The API can be downloaded from the NetApp support site, now.netapp.com, under the ‘Download Software’ link.
The API is implemented at the lowest level by sending RPC/XML calls over HTTP to the filer. Inside NetApp, any new functionality must provide API (aka ZAPI) access, so learning the API should be a good investment. It helps to know that the implementation is XML, since retrieving the data returned from the filer follows the tortuous access pattern familiar to anyone who has used XML in the past.
I have used the API in my lab to monitor disk usage during long term testing, and in my previous life as a consultant implemented a scheduler to manage database snapshots. Once you’re used to accessing the API, it’s much easier than sending CLI commands over ssh/rsh.
Here are the steps to get the API
Download the API (ontapi SDK)
You’re looking for
NetApp Manageability SDK
!!Now the download will actually start!!
The file is called “netapp-manageability-sdk-5.0R1.zip” on my Mac.
Inside the ‘lib’ directory is python/NetApp. These are the Python modules that we’ll be using.
DfmErrno.py NaElement.py NaErrno.py NaServer.py
For now though, let’s start with something simple…
lovebox:[~] $ export PYTHONPATH=$PYTHONPATH:~/Downloads/netapp-manageability-sdk-5.0R1/lib/python/NetApp/
lovebox:[~] $ ipython
In [1]: from NaElement import *
In [2]: from NaServer import *
In [3]: server=NaServer("gjlfiler.mylab.netapp.com",1,6)
In [4]: server.set_admin_user('root',"root")
In [5]: cmd = NaElement('system-get-info')
In [6]: out=server.invoke_elem(cmd)
In [7]: system_info=out.child_get("system-info")
In [8]: system_info.child_get_string("system-name")
Out[8]: u'gjlfiler'
from NaElement import *
from NaServer import *
server=NaServer("gjlfiler.mylab.netapp.com",1,6)
server.set_admin_user('root',"root")
cmd = NaElement('system-get-info')
out=server.invoke_elem(cmd)
system_info=out.child_get("system-info")
system_info.child_get_string("system-name")
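Since error checking was one of the reasons for moving to the API, here is a rough sketch of the same system-get-info call with a failure check added. The function name fetch_system_name is hypothetical. The results_status() and results_reason() accessors are, as far as I recall, what the SDK's own Python samples use, but treat that as an assumption and check NaElement.py in your copy of the SDK. The server object and the element constructor are passed in as arguments so the sketch does not depend on the SDK being on the path.

```python
def fetch_system_name(server, make_element):
    """Return the filer's system-name, raising if the API call failed.

    server       -- a connected NaServer-like object (provides invoke_elem)
    make_element -- a callable such as NaElement that builds a request
    """
    out = server.invoke_elem(make_element("system-get-info"))
    # Assumption: results_status() returns "failed" on error and
    # results_reason() carries the message; verify against your SDK.
    if out.results_status() == "failed":
        raise RuntimeError("system-get-info failed: %s" % out.results_reason())
    return out.child_get("system-info").child_get_string("system-name")


# With the SDK on PYTHONPATH, usage would look roughly like this:
#   from NaElement import NaElement
#   from NaServer import NaServer
#   server = NaServer("gjlfiler.mylab.netapp.com", 1, 6)
#   server.set_admin_user("root", "root")
#   print(fetch_system_name(server, NaElement))
```

Raising an exception keeps the calling script short: a wrong password or unreachable filer stops the script with a readable message instead of failing later on a missing child element.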
Something more tricky
Regrettably, the SDK documentation no longer contains the ONTAP portion of the API, in other words, which calls I can make to the filer and what it will respond with. To access that documentation, go to:
https://communities.netapp.com/community/interfaces_and_tools/developer/apidoc
Click the link for Data ONTAP API Documentation, and download the zipfile (currently that filename is netapp-manageability-sdk-ontap-api-documentation.zip)
Again, I was unable to view the html doc with Chrome for some reason, but Firefox works OK.
So, let’s try something else: how about a short script to get the size of each volume in the system?
We see from the documentation that the volume-list-info call returns a list of volumes, in a structure named “volumes”, because that’s the name given in the “Output Name” field in the API documentation.
| Input Name | Range | Type | Description |
| verbose | | boolean, optional | If set to “true”, more detailed volume information is returned. If not supplied or set to “false”, this extra information is not returned. |
| volume | | string, optional | The name of the volume for which we want status information. If not supplied, then we want status for all volumes on the filer. Note that if status information for more than 20 volumes is desired, the volume-list-info-iter-* zapis will be more efficient and should be used instead. |
| Output Name | Range | Type | Description |
| volumes | | volume-info[] | List of volumes and their status information. |
So, now we know that we want to grab the volume-info[] list (which we can guess is a list of each volume in the system, and each item in the list has information about the volume).
So, as before we do some setup to reach the filer, and this time we’re going to issue the command volume-list-info.
In [1]: import sys
In [2]: sys.path.append("/opt/netapp/netapp-manageability-sdk-5.0R1/lib/python/")
In [3]: from NaElement import *
In [4]: from NaServer import *
In [6]: filer=NaServer("gjlfiler.mylab.netapp.com",1,6)
In [7]: filer.set_admin_user('root','root')
In [8]: cmd = NaElement("volume-list-info")
In [10]: ret = filer.invoke_elem(cmd)
In [11]: ret
Out[11]:
Above we see that ‘ret’ is just an NaElement instance; we want the specific ‘volumes’ object. There might be more than one object returned to us (although in this case there is not), so we have to unpack the volumes object, just like in the simple example we did first.
In [12]: volumes = ret.child_get("volumes")
In [13]: volumes
Out[13]:
So, now we have an object that should be a list of volumes. But unfortunately we can’t just access that as a real python list – we have to use the specific accessors.
volume-info
    vol
    vol
    vol --- size
        --- name
        --- etc.
    vol
    vol
In [14]: for vol in volumes.children_get():
    ...:     print vol.child_get_string("name")
    ...:
db2_fv
db3_fv
db4_fv
db5_fv
db6_fv
log1_fv
log2_fv
log3_fv
log4_fv
log5_fv
log6_fv
db1_fv
vol0
Since we also want the size – we can find that field in the volume-info structure, figure out the type and use the correct accessor. In this case we can make a guess that size-total is what we want, and that its type is integer and so we end up with this :-
for vol in volumes.children_get():
    print vol.child_get_string("name")
    print vol.child_get_int("size-total")
db2_fv
3006477107200
db3_fv
3006477107200
db4_fv
3006477107200
db5_fv
3006477107200
db6_fv
3006477107200
log1_fv
21474836480
log2_fv
21474836480
log3_fv
21474836480
log4_fv
21474836480
log5_fv
21474836480
log6_fv
21474836480
db1_fv
3006477107200
vol0
476002729984
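The size-total values above look like byte counts (an assumption on my part, based on 21474836480 being exactly 20 x 2^30), which makes them hard to read at a glance. A small helper along these lines could be applied to each value before printing; the name human_size is made up for this sketch and is not part of the SDK.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024.0 or unit == units[-1]:
            return "%.1f %s" % (value, unit)
        value /= 1024.0


# Sample values copied from the volume listing above.
for name, size in [("db2_fv", 3006477107200),
                   ("log1_fv", 21474836480),
                   ("vol0", 476002729984)]:
    print("%-8s %s" % (name, human_size(size)))
```

In the loop over volumes.children_get(), this would wrap the result of vol.child_get_int("size-total").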
The documentation that tells you about child_get() etc. is in the SDK documentation, in the following section
Home > NetApp Manageability SDK > Programming Guide > SDK Core APIs > Python Core APIs > Input Output Management APIs
- Author: gary
- Published: Jan 2nd, 2013
- Category: Performance, Uncategorized
- Comments: None
Thoughts on “Latency Numbers Every Programmer Should Know”
Last week a number of twitter users were posting about a very cool looking interactive latency predicterizor. In the small world of computer performance nerds, it was a veritable tsunami of attention. My twitter skills are so feeble that I cannot figure out how to determine the total number of tweets – but as of now (Wed Jan 2nd 5:32 EST) the most recent tweet was 19 minutes ago. Fascinating.
The interactive paper is here http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
At face value, it seemed that some of the numbers were really quite low. For instance, I was pretty surprised to see the projected random read latency from an SSD at 17 micro-seconds (uSec). What’s really nice about this graphic is that the sources for the numbers presented are contained in the javascript source. In fact that’s really really nice. The numbers for flash/SSD are taken from a UCSD paper published in 2012, which is itself quite an interesting read.
Flash paper : http://cseweb.ucsd.edu/users/swanson/papers/FAST2012BleakFlash.pdf
From the interactive diagram, which states a main memory read latency of 1 uSec, you’d assume that reading from an SSD is 17x slower than main memory. Personally I think that’s WAY off, for a couple of reasons. Firstly, and least interestingly, the 17 uSec is for a direct access to the NAND cell itself. In practice SSDs are packaged with a Flash Translation Layer (FTL), which translates something like a linear address range (LBA) into mappings to the flash memories. The FAST paper (above) pegs the FTL latency at 30 uSec, which seems high relative to the 17 uSec response from the NAND memory. (The comments in the code, and the paper itself, seem to peg the NAND response time at 20 uSec, but the interactive tool shows me 17 uSec for “2012”.)
The more interesting consideration, particularly given the title “Latency Numbers every _programmer_ should know”, is that as a programmer I can to a large extent expect to achieve a 1 uSec response from main memory when I attempt to read some value. However, there’s no way that I will ever get the 17 uSec (or even a 17+30 uSec) response from SSD. The reason is that, as a user, I cannot access that SSD directly. For most programmers, the SSD will be accessed via a filesystem, then a device driver.
A programmer’s access to the filesystem is almost always on the other side of a system call, which means a trap or interrupt, saving stack pointers, setting up addresses for buffers, etc. Typically the data will be read from SSD into the memory of the host computer and then returned to the user/programmer. Even with DMA and other zero-copy techniques, there will be many reads/writes to main memory to set up the system call.
When we think about spinning-disk accesses of ~4 milliseconds (ms), or 4,000 uSecs, we can more-or-less gloss over the cost of setting up the system call and moving the data through the kernel and back to the user, because the overhead of moving this mechanical instrument dwarfs the other costs. But with SSDs that setup cost starts to impact how quickly a user-land application can really access data.
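To put rough numbers on that argument, here is a small sketch. The device latencies come from the text above (about 4,000 uSec for a spinning disk, 17 uSec NAND plus 30 uSec FTL for an SSD). The 20 uSec software overhead is purely a placeholder I picked for illustration, not a measurement, and the function name is made up.

```python
def overhead_share(device_usec, software_usec):
    """Fraction of the total access time spent in software, not the device."""
    return software_usec / float(device_usec + software_usec)


SOFTWARE_USEC = 20.0    # hypothetical syscall + filesystem + driver cost
HDD_USEC = 4000.0       # ~4 ms spinning-disk access (from the text)
SSD_USEC = 17.0 + 30.0  # NAND read + FTL latency (from the text)

for label, device in [("HDD", HDD_USEC), ("SSD", SSD_USEC)]:
    share = overhead_share(device, SOFTWARE_USEC)
    print("%s: software overhead is %.1f%% of the total" % (label, 100 * share))
```

Whatever the real overhead turns out to be, the same fixed cost that disappears in the noise for a spinning disk becomes a sizeable slice of every SSD access.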
- Author: gary
- Published: Dec 19th, 2012
- Category: External Articles and papers
- Comments: None
Thinking Clearly about Performance – ACM Queue article.
Great article from Cary Millsap, covers performance analysis in general – not specific to Oracle.
- Author: gary
- Published: Dec 19th, 2012
- Category: External Articles and papers
- Comments: None
Oracle performance diagnostics (Session centric)
Anyone who deals with systems performance in an enterprise environment will inevitably need to deal with an Oracle database performance problem. These databases often make up part of the largest, most complex and business-critical infrastructures. In the past I have always looked at DB performance issues from a systemic perspective, looking at the database in the large. Below are a couple of articles that I found while looking through some old Oracle magazines. They take a more session-centric view of performance, diagnosing specific wait events that might be causing a particular user session to appear slow.
Examples include finding the oracle session which belongs to a particular unix PID, and a bunch of related stuff using the v$ tables.
- Author: gary
- Published: Nov 14th, 2012
- Category: Uncategorized
- Comments: None
Effects of Windows power management on storage performance.
Something very odd happened on my Jetstress box. I disconnected my iSCSI LUNs in order to do an aggregate snap-restore, and when I restarted the test, the oprate achieved from Jetstress (with the exact same parameters) was around 1/2 of what it had achieved just a few minutes earlier. It looks like power savings kicked in while the LUNs were offline.
Here’s the short take-away. With the Windows 2008 power setting set to “Balanced” I achieve only 1/2 the throughput (oprate) that I do when the power setting in Windows is set to “High performance”.
with “Balanced” power setting
CPU NFS CIFS HTTP Total Net-in(kB/s) Net-out(kB/s) Disk-read(kB/s) Disk-write(kB/s) Tape-read(kB/s) Tape-write(kB/s) Cache-age Cache-hit CP-time CP-ty Disk-util OTHER FCP iSCSI FCP-in(kB/s) FCP-out(kB/s) iSCSI-in(kB/s) iSCSI-out(kB/s)
13% 0 0 0 1870 24859 70727 57600 0 0 0 0s 97% 0% - 34% 0 0 1870 0 0 24475 70414
8% 0 0 0 1141 11407 58021 48496 0 0 0 0s 98% 0% - 26% 4 0 1137 0 0 11148 57799
29% 0 0 0 1054 8284 55235 50180 16878 0 0 >60 99% 22% Ts 28% 0 0 1054 0 0 8050 55034
16% 0 0 0 981 11293 50431 64100 168960 0 0 >60 99% 100% :f 43% 0 0 981 0 0 11064 50233
11% 0 0 0 935 9829 49900 44216 106192 0 0 >60 98% 94% : 33% 0 0 935 0 0 9608 49709
10% 0 0 0 1059 13088 51064 36503 0 0 0 >60 98% 0% - 23% 0 0 1059 0 0 12946 50885
8% 0 0 0 1052 9332 53281 57060 0 0 0 >60 98% 0% - 27% 4 0 1048 0 0 9001 53051
9% 0 0 0 1287 12829 58028 43288 0 0 0 1s 98% 0% - 25% 0 0 1287 0 0 12555 57799
11% 0 0 0 1529 13086 66611 59088 0 0 0 1s 97% 0% - 32% 0 0 1529 0 0 12771 66351
10% 0 0 0 1387 12685 66571 56900 0 0 0 1s 98% 0% - 33% 0 0 1387 0 0 12382 66314
8% 0 0 0 1037 12370 51644 38640 0 0 0 1s 98% 0% - 25% 0 0 1037 0 0 12131 51442
9% 0 0 0 1118 10402 53779 52108 0 0 0 1s 98% 0% - 27% 4 0 1114 0 0 10158 53576
9% 0 0 0 1172 12642 57266 47236 0 0 0 2s 98% 0% - 26% 0 0 1172 0 0 12378 57569
9% 0 0 0 1305 12714 58096 48824 0 0 0 2s 97% 0% - 30% 0 0 1305 0 0 12441 57344
11% 0 0 0 1044 11125 51985 46134 8 0 0 2s 98% 2% Tn 25% 0 0 1044 0 0 10889 51782
with “High performance” power setting
perfdisk-6280-1*> sysstat -x 1
CPU NFS CIFS HTTP Total Net-in(kB/s) Net-out(kB/s) Disk-read(kB/s) Disk-write(kB/s) Tape-read(kB/s) Tape-write(kB/s) Cache-age Cache-hit CP-time CP-ty Disk-util OTHER FCP iSCSI FCP-in(kB/s) FCP-out(kB/s) iSCSI-in(kB/s) iSCSI-out(kB/s)
14% 0 0 0 2198 19282 85558 70372 0 0 0 2s 97% 0% - 39% 0 0 2198 0 0 18777 85225
15% 0 0 0 2312 24365 80659 67488 0 0 0 3s 97% 0% - 45% 0 0 2312 0 0 23848 80314
16% 0 0 0 2469 22900 88474 66532 0 0 0 3s 97% 0% - 41% 0 0 2469 0 0 22350 88105
16% 0 0 0 2560 24089 96448 78876 0 0 0 2s 97% 0% - 44% 0 0 2560 0 0 23511 96068
15% 0 0 0 2298 23169 80427 62648 0 0 0 2s 97% 0% - 36% 5 0 2293 0 0 22662 80085
15% 0 0 0 2387 25311 85323 70716 0 0 0 2s 97% 0% - 39% 0 0 2387 0 0 24780 84963
18% 0 0 0 2810 25211 93821 73608 0 0 0 2s 97% 0% - 41% 0 0 2810 0 0 24608 93422
16% 0 0 0 2847 31037 94805 72640 0 0 0 2s 97% 0% - 41% 0 0 2847 0 0 30418 94392
60% 0 0 0 2161 31831 78890 86160 107040 0 0 1s 99% 72% Tf 49% 0 0 2161 0 0 31311 78545
19% 0 0 0 2284 29344 84730 68596 146432 0 0 1s 97% 100% :f 49% 8 0 2276 0 0 28798 84369
18% 0 0 0 2347 26055 85583 68028 157888 0 0 2s 97% 100% :f 56% 0 0 2347 0 0 25518 85230
17% 0 0 0 2107 18594 79451 66024 117756 0 0 2s 97% 99% : 48% 0 0 2107 0 0 18114 79135
14% 0 0 0 2066 19512 81547 69632 0 0 0 2s 97% 0% - 39% 2 0 2064 0 0 19028 81228
20% 0 0 0 3301 26889 90424 70632 0 0 0 2s 97% 0% - 42% 695 0 2606 0 0 26315 90038
17% 0 0 0 2560 27279 85441 70044 0 0 0 3s 97% 0% - 41% 9 0 2551 0 0 26722 85062
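As a quick sanity check on the "1/2 the throughput" claim, the sketch below averages the Total ops column (the fifth field of each row) from the two sysstat samples above. The numbers are copied from the output; the variable and function names are my own.

```python
# Total ops/s per one-second sample, copied from the two sysstat runs.
balanced = [1870, 1141, 1054, 981, 935, 1059, 1052, 1287,
            1529, 1387, 1037, 1118, 1172, 1305, 1044]
high_perf = [2198, 2312, 2469, 2560, 2298, 2387, 2810, 2847,
             2161, 2284, 2347, 2107, 2066, 3301, 2560]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / float(len(values))


ratio = mean(balanced) / mean(high_perf)
print("balanced mean ops/s        : %.0f" % mean(balanced))
print("high performance mean ops/s: %.0f" % mean(high_perf))
print("balanced / high performance: %.2f" % ratio)
```

Fifteen one-second samples per run is a small window, so this only confirms the rough size of the drop.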
The longer story
After some head scratching and googling, I found that the power scheme on the Windows box was set to “balanced” – which seemed a bit odd for a server OS. So, I switched it to “High Performance” and almost instantaneously the throughput to the filer doubled (back to what it was previously).
It seems that the most likely culprit is that the “balanced” power option manages power to the PCI bus as well as CPU. My guess is that when I disconnected the iSCSI luns – the PCI card (Intel 10GbE) went idle and the power-saving mode kicked in. For some reason, it never went back to full-power even though I was once again using the 10GbE card.
One of the things that makes the power-saving issue a little tricky is that although I’m not moving a LOT of data over the network – I AM dependent on achieving low latency to meet my oprate (IOW I don’t have a lot of concurrency). I wonder how much work has been done to measure the effect of power saving mode on latency at low loads.
It was lucky for me that I stumbled across the power savings mode – the only evidence I had that the problem lay at the Windows side was a large unexplained delta in the latency seen by the Windows host – and the latency attributable to the filer.
- Author: gary
- Published: Nov 9th, 2012
- Category: how-to, Uncategorized
- Comments: None
Create a sysstat like command for any filer statistic.
Here’s a neat trick to create a sysstat like output for any statistic available in the counter manager. In this example I chose some counters that were relevant to my Jetstress testing. Here’s the output
iSCSI-ops(/s) iSCSI-lat(ms) iSCSI-Rd_lat(ms) iSCSI-Wr_lat(ms) Disk-read_lat(ms) CPU-all(%) ram(%) EC(%) hdd(%) ssd(%) hyaA(%) hyaB(%) hyaC(%)
937 3.12 4.28 0.17 8.87 31 74 0 25 0 0 0 0
886 2.84 4.30 0.29 8.68 32 74 0 25 0 0 0 0
1846 2.97 5.04 0.19 9.33 48 68 0 31 0 0 0 0
866 3.07 4.58 0.14 8.75 31 73 0 26 0 0 0 0
619 3.34 4.20 0.15 10.00 25 78 0 21 0 0 0 0
558 3.43 4.19 0.12 10.05 29 80 0 20 0 0 0 0
905 2.70 4.41 0.19 9.95 30 77 0 22 0 0 0 0
828 2.70 4.25 0.16 9.78 28 77 0 22 0 0 0 0
1336 2.77 5.19 0.26 10.90 99 74 0 25 0 0 0 0
1078 3.03 4.53 0.18 9.57 45 74 0 25 0 0 0 0
654 2.09 3.54 0.18 11.00 29 83 0 16 0 0 0 0
749 3.13 4.27 0.16 9.23 26 75 0 24 0 0 0 0
To the left I have some iSCSI stats, since I connected the filer to my Windows box via iSCSI. I show the number of iSCSI ops/second, followed by the average latency. In the next two columns I break out the iSCSI read and write latencies separately. Next I display the average time taken to return a read from disk.
Next I show the total CPU usage. This column will show anything from 0 to 100 times the number of CPUs, so if you have a filer with 4 CPUs the range is 0-400.
The next section attempts to show where the blocks are being read from.
ram = buffer cache
EC = flash cache
hdd = from a physical spinning disk
ssd = from a physical SSD disk
Then there are several Flash Pool (hybrid aggregate) counters that I do not fully understand. And since I am not using flash pools I didn’t spend time trying to find out what the counters mean.
To get this sort of report on your own filer, all you have to do is create an XML file of the correct format and upload the file to your filer’s /etc/stats/preset directory. The easiest way to do that is to mount the filer’s /vol/vol0 on some friendly unix machine. Below is the XML file I used to generate the output shown above.
Then on the filer, type “stats show -p
- Author: gary
- Published: Oct 30th, 2012
- Category: Uncategorized
- Comments: None
filebench “reuse” parameter does not work on NFS mountpoint
Opened bug 3581710 on Sourceforge to cover this behavior. For some reason, filebench seems not to honor the reuse parameter correctly if the filebench dataset is created in the mountpoint root (e.g. /a). The test itself seems to work OK, but the datafile(s) is always removed. However, if a subdir is created in the root, e.g. /a/dir1, then the reuse parameter works as expected.