dotplan

troubleshooting & performance analysis

Counting in unix with ‘uniq-c’. Or how many of something does this thing have?

Tags: ,

Recently we were looking for common components that might indicate why some of our tests failed. One of the sysadmins at our Sunnyvale lab came up with a quick script which did the job. He used ‘uniq -c’ to great effect. In this example I look in my iTunes directory structyre for songs which ‘love’ in the title, and then count how many are on each album…. an example is probably better

lovebox:Compilations gjl$ find . -print | grep -i love | awk ' BEGIN { FS="/" } ; { print $2 }' | uniq -c | sort -n

 8 Ministry Of Sound Disco Classi
   6 Music To Watch Girls By Vol 3
   5 The Best Of Sade
   4 Back To Love 03.03
   3 Funk Soul Anthems
   3 CD Singles
   3 Acoustic Hits
   2 Twelve Ibiza Classics - Adventures In Wonderland
   2 The Very Best Of
   2 Take Me to God
   2 Nu Cool (Disc 2)
   2 Music for Cocktails (Part 4)
   2 Ministry Of Sound Chillout Classics (CD2)
   2 MOS_ Funky House 2005 (Disc 1)
   2 Jason Moore - Triptonite (Disc 1)
   2 JDJ Danny Rampling
   2 Indie Anthems
   2 Ibiza - The History Of Chillout (Disc 3)
   2 Greatest 80's Album (Disk2)
   2 Greatest 80's Album (Disc 1)
   2 Fabriclive 11_ Bent
   2 Electro Jazz Divas
   2 Electro 7
   2 Chilled Jazz
   2 Best Pub Jukebox
   2 All Together Now
   2 A Time To Remember 1970
   2 12__80s [Disc 2]
   2 12__80s [Disc 1]
   1 ian_pooley_presents_a_subterranean_soundtrack
   1 Your Hundred Best Tunes, Volume 8
   1 Your Hundred Best Tunes, Volume 3
   1 Welcome To The Monkey House
   1 The Triptych (Disc 2)
   1 The Remix Album
   1 The Rebirth Of Cool 6
   1 Teenage Kicks
   1 Sty Wars - A Collection Of Pork Medallions
   1 Solid Steel presents.._
   1 Serve Chilled (Disc 2)
   1 Serve Chilled (Disc 1)
   1 Putumayo Presents_ Latin Lounge
   1 Profound Sounds Vol. 1
   1 Nu Cool (Disc 1)
   1 Music for Cocktails (Part 6)
   1 Music for Cocktails (Part 5)
   1 Music for Cocktails (Part 3)
   1 Music for Cocktails (Part 2)

Finding data corruption with ‘od’

Tags: ,

Here’s a weird one that I’d never seen before. We were finding that at least one of our linux binaries was becoming corrupt (verified using rpm -V) however we didn’t know how the binary was becoming corrupt – because these files should never be written to. On the off-chance that the corruption might tell us something about the corruption I decided to dump the binaries out using ‘od -x’ the ‘octal dump’ command, and allowed me to use ‘diff’ to see where in the file the corruption was and to see what sort of data was in the place where the executable code of the ‘cc1′ binary ought to be.

After using ‘diff’ to find where the corruption was and seeing that the data was not just a bunch of 00′s or FF’s, I switched to ‘od -c’ to dump out the binary character by character. After peering for a while I noticed that there were some filenames embedded in the corrupted data. Interestingly the filenames were not printed out using ‘strings’ because each character was zero terminated (I guess they were stored as multibyte chars rather than single byte chars – and so did not appear in ‘strings’.

The point of interest is here :-
16175560 \f 003 u \0 s \0 e \0 r \0 d \0 i \0 f \0
16175600 f \0 . \0 L \0 O \0 G \0 \0 \0 \0 \0 \0 \0
16175620 \0 \0 \0 \0 \0 \0 \0 \0 020 \0 \0 \0 002 \0 \0 \0

If you look closely, there is what looks like a filename ‘userdiff.LOG’ It turns out that the file ‘userdiff.log’ is a Windows file!

So, how does a windows filename end up embedded inside our linux binary? Here’s what I think happened. This client is a multi-purpose lab client sometimes booted into Windows, sometimes into Linux. The client is SAN booted from a NetApp LUN.

  • The system is booted into Windows, and runs some tests.
  • The Windows system is shutdown.
  • Before Windows is fully shutdown, the boot LUN is swapped over in preparation for the next test, which happens to be a Linux LUN.
  • As Windows continues to shutdown, it writes out the last few dirty pages from memory before finally shutting down. However these pages don’t go onto the Windows LUN – as they should, but onto the Linux LUN that has been swapped in.
  • The Linux LUN is now corrupted.
  • But surely when the Linux LUN is swapped in, the writes immediately stop because the Windows LUN is ripped away? No! the Windows SCSI driver (like all SCSI drivers) just puts the blocks on the SCSI bus with a destination address. There will probably be a SCSI error, such as SCSI_TIMEOUT whilst the new LUN is swapped in, however the SCSI driver WILL RETRY when the new LUN is swapped in, and then successfully writes to the new LUN after one or more retries.

    Verifying binaries in linux.

    Tags: , , , ,

    Recently we had an issue on some linux clients that would mysteriously fail to run SFS benchmarks. On closer inspection we found that the binary ‘cpp’ would exit with SIGSEGV.
    # cpp
    cc1: internal compiler error: Segmentation fault
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.

    For some reason that I didn’t understand, I was not able to get a core file (tried setting ulimit -c but while I could get cores for other binaries, I could not get one for cpp). Anyhow, since cpp was working on other clients, it looked like something had gone wrong with either the binary or library. So I took a look at the libraries…
    # ldd /usr/bin/cpp
    libc.so.6 => /lib64/tls/libc.so.6 (0x000000398f800000)
    /lib64/ld-linux-x86-64.so.2 (0x000000398f600000)

    Nothing peculiar there, and if libc was broken, pretty much nothing would work. So the next step was to verify the cpp binary itself. I’d initially planned to run cksum or md5 on the cpp binary on the broken machine, and on a known good machine. Then I remembered that modern package management systems often keep a checksum or hash of the binaries in the package database.


    So first we need to determine which package cpp belongs to;
    # rpm -qf /usr/bin/cpp
    cpp-3.4.5-2

    And then verify the package against the checksum/hash in the package DB using ‘rpm -V package- name, where ‘V’ (uppercase) is for ‘Verify’.
    rpm -V cpp-3.4.5-2
    ..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1

    The output is a bit cryptic, but what it is saying is that the file /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1 failed the md5 check. A more verbose output makes things a little clearer, note the lowercase ‘v’ for verbose.
    # rpm -Vv cpp-3.4.5-2
    ........ /lib/cpp
    ........ /usr/bin/cpp
    ........ /usr/libexec/gcc
    ........ /usr/libexec/gcc/x86_64-redhat-linux
    ........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.3
    ..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1
    ........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.5
    ........ d /usr/share/info/cpp.info.gz
    ........ d /usr/share/info/cppinternals.info.gz
    ........ d /usr/share/man/man1/cpp.1.gz

    So we have a corrupt bianry, but that binary is neither the executed binary or a linked library. Can this really be a cause of the failure? A strace of ‘cpp’ should tell us.

    [root@kc95b7 ~]# strace -f cpp 2>&1 | grep cc1
    stat("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", {st_mode=S_IFREG|0755, st_size=4149432, ...}) = 0
    access("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", X_OK) = 0
    [pid 31323] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
    [pid 31323] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48

    Using the -f switch to strace I see that we call execve() on ‘cc1′ so cpp does call cc1. And using a full strace (without grep) shows that is is indeed cc1 that causes a SEGV


    Here we see the ‘cpp’ binary calling cc1. The cpp process has PID 31377, cpp is PID 31378.

    [pid 31377] wait4(31378, Process 31377 suspended

    [pid 31378] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
    [pid 31378] uname({sys="Linux", node="kc95b7", ...}) = 0
    [pid 31378] brk(0) = 0xaf4000

    Later on we see that PID 31378 (cc1) generates a segfault, and receives from the kernel a SIGSEGV, BUT the cc1 process catches the SEGV signal and deals with it directly with rt_sigaction,and prints its own error message. It does not dump core – which is why we do not get a core file

    [pid 31378] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a983e1000
    [pid 31378] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    [pid 31378] rt_sigaction(SIGSEGV, {SIG_DFL}, {0x572c06, [SEGV], SA_RESTORER|SA_RESTART, 0x398f82e380}, 8) = 0
    [pid 31378] open("/usr/share/locale/locale.alias", O_RDONLY) = 3
    ...
    [pid 31378] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48
    [pid 31378] write(2, "\n", 1
    ) = 1
    [pid 31378] write(2, "Please submit a full bug report,"..., 138Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.
    ) = 138
    [pid 31378] exit_group(27) = ?
    Process 31377 resumed
    Process 31378 detached
    <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 27}], 0, NULL) = 31378
    --- SIGCHLD (Child exited) @ 0 (0) ---
    exit_group(1) = ?

    So now we just have to figure out how the binary got corrupted….

    © 2009 dotplan. All Rights Reserved.

    This blog is powered by Wordpress and Magatheme by Bryan Helmig.