dotplan

troubleshooting & performance analysis

  • Author: gary
  • Published: Apr 13th, 2012
  • Category: unix
  • Comments: None

Unable to login to graphical screen for some user in ubuntu/linux (X11)

I was not able to log in to my Ubuntu desktop using my unix/NIS login, even though the same login worked fine via su. Here’s how I tracked the problem down to an unsupported ‘set’ command in my .profile.

1) I got this message in /var/log/auth.log

 gdm-session-worker[7336]: pam_ck_connector(gdm:session): nox11 mode

2) But that is actually a red herring, and not the cause of my problems. A more interesting error was logged in my
~/.xsession-errors file

$ cat .xsession-errors
/etc/gdm/Xsession: Beginning session setup...
set: 4: Illegal option -h

So, it is trying to run a ‘set’ command as part of login. I remembered that I had put some ‘set’ commands in my .profile, specifically

set -ha

3) Remove the “set -ha” line from .profile

4) Login via gdm, lightdm etc.

So, it seems that during login to the X session my .profile was being executed by a shell that did not support ‘set -ha’. The resulting error meant that the X login sequence could not complete, and so I was not able to log in even though I could happily ‘su’ to the user in question.
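
One way to keep such a line in .profile without breaking the graphical login is to probe for the option first. This is only a minimal sketch. It assumes the session script sources .profile under a POSIX sh such as dash (the default /bin/sh on Ubuntu), and the helper name shell_accepts is made up for illustration.

```shell
# Hypothetical helper: succeeds if this shell accepts the given 'set' option.
# The probe runs in a subshell, so a rejected option cannot abort the login shell.
shell_accepts() {
    ( set "$1" ) 2>/dev/null
}

# Only enable command hashing where the shell supports it.
if shell_accepts -h; then
    set -h
fi

# -a (allexport) is part of POSIX sh, so it is set unconditionally.
set -a
```

A quick way to check a .profile before logging out is to source it under the same shell the display manager uses, for example sh -c '. ~/.profile', and watch for errors.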

Invalidate Linux page cache

Issue sync first, to flush dirty pages back to the backing store

sync

Issue this command to invalidate the cache

echo 3 > /proc/sys/vm/drop_caches

This kernel.org page has an interesting list of the various settable values for the Linux page cache: http://www.kernel.org/doc/Documentation/sysctl/vm.txt. It describes drop_caches as follows.

drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.

To free pagecache:
	echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
	echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
	echo 3 > /proc/sys/vm/drop_caches
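
To see how much page cache was actually released, the Cached figure in /proc/meminfo can be compared before and after. The sketch below is an illustration only. The function names are hypothetical, and drop_caches_report has to be run as root, so it is defined here but not called.

```shell
# Hypothetical helper: print the "Cached:" value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
cached_kb() {
    awk '/^Cached:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: flush dirty pages, drop clean caches, report the change.
# Needs root because it writes to /proc/sys/vm/drop_caches.
drop_caches_report() {
    before=$(cached_kb)
    sync
    echo 3 > /proc/sys/vm/drop_caches
    after=$(cached_kb)
    echo "Cached: ${before} kB -> ${after} kB"
}
```

Running drop_caches_report as root prints one line with the before and after values.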

Cannot see NetApp LUNs from Linux?

After some connectivity swap-a-roos in the lab, I could no longer see my LUNs from the Linux host attached to my filer.

In this case I am using a QLogic HBA, and I am not using any of the NetApp host-side tools, just the sanlun tool.

Using the SANsurfer Menu (/opt/QLogic_Corporation/SANsurferCLI) I can tell that this Linux host can see the filer’s LUNs over FC. But there are no SCSI /dev/sdX devices for them, and so Linux cannot use them…

Here’s how I checked to see that there was FC connectivity – which also confirms that the FC protocol is working.

	SANsurfer FC/CNA HBA CLI

	v1.7.2 Build 7

    Main Menu

    1:	General Information  <---- Option 1
    2:	HBA Information
    3:	HBA Parameters
    4:	Target/LUN List
    5:	iiDMA Settings
...

    General Information Menu

    1:	Host Information
    2:	Host Topology
    3:	Report     <---- Option 3..
    4:	Refresh
    5:	Return to Previous Menu

	Note: 0 to return to Main Menu
	Enter Selection: 1

   Report Menu

    HBA Model QLE2462
      1: Port   1: WWPN: 21-00-00-E0-8B-9B-C5-36 Online
      2: Port   2: WWPN: 21-01-00-E0-8B-BB-C5-36 Online
      3: All HBAs  <---- Option 3
      4: Return to Previous Menu

	Note: 0 to return to Main Menu
	Enter Selection: 3

I could see that there was connectivity from the Linux host to the filer

---------------------------------------
LUN 1
---------------------------------------
Product Vendor                    : NETAPP
Product ID                        : LUN
Product Revision                  : 811a
LUN                               : 1
Size                              : 17.93 GB
Type                              : SBC-2 Direct access block device
			           (e.g., magnetic disk)
WWULN                             : 4E-45-54-41-50-50-20-20-20-4C-55-4E-20-32-46-68
			           72-53-3F-2D-68-4F-79-6C-33-00-00-00-00-00-00-00
OS LUN Name                       :

From the filer side, I could see that the host's FC adapters had connected to the filer,
and were in the right igroup

filer1*> igroup show
    filer1 (FCP) (ostype: linux):
        21:00:00:e0:8b:9b:c5:36 (logged in on: 0a)
        21:01:00:e0:8b:bb:c5:36 (logged in on: 0b)

The only thing that was missing was that there were no 'sd' devices created in Linux for these devices.

The "sanlun" utility was not helpful and just told me that there were no LUNs mapped.

The solution was to issue this very odd looking command

linuxhost:[/sys/class/scsi_host] $ echo "- - -" > host0/scan

This caused the sd devices to be created, representing the NetApp LUNs which I knew could already be seen over FC. Since I have both ports on the same HBA attached to the filer, host0/scan created my /dev/sdc* devices, and host1/scan created my /dev/sdd* devices.

The shell 'hung' for the duration of the command, and I would expect that Linux was off in kernel land for some time, so I would NOT recommend issuing the command on a production server.

I'm still puzzled as to why the Linux host did not see the LUNs even after a reboot, though.
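
With more than one HBA port, the same write has to be repeated for each hostN directory. The three dashes are wildcards for channel, target and LUN. A small sketch of a loop that does this follows. The function name is hypothetical, and the optional directory argument exists only so the loop can be tried against a scratch directory first.

```shell
# Hypothetical helper: write the wildcard "- - -" (all channels, targets, LUNs)
# to the scan file of every SCSI host under the given directory.
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for host in "$base"/host*; do
        # Skip anything that does not expose a scan file.
        [ -e "$host/scan" ] || continue
        echo "rescanning ${host##*/}"
        echo "- - -" > "$host/scan"
    done
}
```

Run as root with no argument, rescan_scsi_hosts walks /sys/class/scsi_host. The same caution about production servers applies.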

  • Author: gary
  • Published: Jul 26th, 2011
  • Category: sysadmin
  • Comments: 1

Fix for “Cannot open master raw device ‘/dev/rawctl’ (No such device or address)”

For some reason the device /dev/rawctl has a habit of disappearing. This can be fixed by creating the device file /dev/rawctl with the major/minor number 162,0

mknod /dev/rawctl c 162 0
 

This will allow /dev/raw devices to be created. However, /dev/raw/raw0 cannot be used, since major/minor 162,0 is already taken by /dev/rawctl.
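
Since the device tends to vanish again, the mknod can be wrapped in a check that is safe to run repeatedly, for instance from a boot script. This is a sketch only, and the helper name ensure_char_dev is made up for illustration.

```shell
# Hypothetical helper: create a character device node only if it is missing.
# Arguments: path, major number, minor number. Creating a node requires root.
ensure_char_dev() {
    if [ -c "$1" ]; then
        echo "$1 already exists"
    else
        mknod "$1" c "$2" "$3"
    fi
}
```

For the case above the call would be ensure_char_dev /dev/rawctl 162 0.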

Hardware design for high density storage pod

These guys at ‘Backblaze’ claim to be able to provide 67TB for $7,867, based on 45 SATA drives, RAID6 (Linux) and JFS. The hardware design, which is presented in the blog as a how-to, uses vertically stacked drives and looks very much like the Sun ‘Thumper’ (X4500), which I first saw when I was still at Sun around 2006. No performance numbers are given, and whilst the capacity is much cheaper than ‘real’ storage, I doubt that the write performance is much to write home about.

  • 45 SATA Drives
  • JFS
  • RAID6 (Linux mdadm)
  • Served via HTTPS (not NFS or iSCSI)
    • Author: gary
    • Published: Sep 2nd, 2009
    • Category: unix
    • Comments: None

    List binary objects by size.

    Sometimes you will want to know what a binary file contains (functions, arrays, objects, etc.). The ‘nm’ command will do that for you, and the following one-liner will sort the output by size. I use this to determine why some source compiled on Solaris is larger than when compiled on Linux.

    e.g.

     nm <binary> | sort -n -t \| +2
    

    Here is a small binary I happen to have.

    $ nm write_random | sort -n -t \| +2 
    
    write_random:
    
    [Index]   Value      Size    Type  Bind  Other Shndx   Name
    ..
    
    ..
    [72]    | 134520544|      60|OBJT |GLOB |0    |23     |tstr
    [85]    |         0|      60|FUNC |GLOB |0    |UNDEF  |lseek@@GLIBC_2.0
    [58]    | 134515432|      66|FUNC |GLOB |0    |12     |__libc_csu_fini
    [66]    |         0|      67|FUNC |GLOB |0    |UNDEF  |getopt@@GLIBC_2.0
    [74]    | 134515348|      82|FUNC |GLOB |0    |12     |__libc_csu_init
    [51]    |         0|     113|FUNC |GLOB |0    |UNDEF  |close@@GLIBC_2.0
    [76]    |         0|     113|FUNC |GLOB |0    |UNDEF  |fsync@@GLIBC_2.0
    [50]    |         0|     124|FUNC |GLOB |0    |UNDEF  |write@@GLIBC_2.0
    [88]    |         0|     124|FUNC |GLOB |0    |UNDEF  |open@@GLIBC_2.0
    [55]    | 134514968|     163|FUNC |GLOB |0    |12     |getSizeInBlocks
    [68]    | 134515131|     214|FUNC |GLOB |0    |12     |do_writes
    [89]    |         0|     217|FUNC |GLOB |0    |UNDEF  |exit@@GLIBC_2.0
    [79]    |         0|     221|FUNC |GLOB |0    |UNDEF  |__libc_start_main@@GLIBC_2.0
    [67]    |         0|     344|FUNC |GLOB |0    |UNDEF  |__fxstat@@GLIBC_2.0
    [53]    |         0|     523|FUNC |GLOB |0    |UNDEF  |perror@@GLIBC_2.0
    [62]    |         0|     539|FUNC |GLOB |0    |UNDEF  |malloc@@GLIBC_2.0
    [77]    | 134514424|     544|FUNC |GLOB |0    |12     |main
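
The +2 field syntax is the older form, and newer versions of sort may not accept it. Assuming the same '|'-delimited layout as the listing above, a sketch of the equivalent using -k looks like this. The function name is hypothetical.

```shell
# Hypothetical helper: sort '|'-delimited nm output numerically on the
# third field, which is the Size column in the listing above.
sort_nm_by_size() {
    sort -t '|' -k 3,3n
}
```

It is used as a filter, for example nm write_random | sort_nm_by_size, and puts the largest objects at the bottom.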
    

    Here’s the source code for the same binary. Notice that in the nm output the size of the object tstr is 60 bytes, and in the code it is declared as char tstr[60].

    $ cat write_random.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>
    #include <sys/stat.h>
    
    #define ARRAYSIZE 16*1024
    #define NUMTOWRITE 10
    
    int getSizeInBlocks(int,int, char *);
    void do_writes(int,int, int);
    int     loopsToDo=1;
    int     limitLoops=0;   /*Default to doing loops forever*/
    int     msPause=0;
    int     iostart,iostop,iotimeMS,ioThresh;
    struct tm *tm_ptr;
    time_t  tm;
    char    tstr[60];
    int     IOoverThresh=1,TotalIOs=1,dofsync=0,forcesize=0;
    double  badpct;
    
    int main(argc, argv)
    int     argc;
    char    *argv[];
    {
    
            int     i,n,count=0,SizeInBlocks;
            int     appBlockSize=8192,key;  /*Default IO size is 8k*/
            int     mode=O_RDWR;
            int     dsync=0;
            char    *array,c,filename[255];
            int fd;
    
            if (argc == 1) {
                    fprintf(stderr,"Usage: [-d  -f | -b  |-s | -l  | -o st_size; */
      FileSize=lseek(fd,0,SEEK_END);
      if (forcesize)
        FileSize=forcesize*1024*1024;
      appBlocks = (int) FileSize/appBlockSize;
      printf("file %s is %d bytes %d application blocks blocksize = %d\n",filename,FileSize,appBlocks,appBlockSize);
      return appBlocks;
    }
    
    void do_writes(int fd,int appBlockSize, int appBlocks)
    {
     char *array;
     array=malloc(appBlockSize);
    
     while  (loopsToDo >0 ){
      /*generate a random number 0-appBlocks*/
      int seek_val=rand()%appBlocks;
      lseek(fd,appBlockSize * seek_val, SEEK_SET);
      write(fd,array, appBlockSize);
      if (dofsync)
        fsync(fd);
      TotalIOs++;
    
      if (limitLoops)
            loopsToDo--;
      if (msPause)
            usleep(msPause*1000);
      }
      return;
    }
    

    Verifying binaries in linux.

    Recently we had an issue where some Linux clients would mysteriously fail to run SFS benchmarks. On closer inspection we found that the ‘cpp’ binary would die with a segmentation fault.
    # cpp
    cc1: internal compiler error: Segmentation fault
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.

    For some reason that I didn’t understand, I was not able to get a core file. I tried setting ulimit -c, but while I could get cores for other binaries, I could not get one for cpp. Anyhow, since cpp was working on other clients, it looked like something had gone wrong with either the binary or a library. So I took a look at the libraries…
    # ldd /usr/bin/cpp
    libc.so.6 => /lib64/tls/libc.so.6 (0x000000398f800000)
    /lib64/ld-linux-x86-64.so.2 (0x000000398f600000)

    Nothing peculiar there, and if libc was broken, pretty much nothing would work. So the next step was to verify the cpp binary itself. I’d initially planned to run cksum or md5 on the cpp binary on the broken machine, and on a known good machine. Then I remembered that modern package management systems often keep a checksum or hash of the binaries in the package database.


    So first we need to determine which package cpp belongs to:
    # rpm -qf /usr/bin/cpp
    cpp-3.4.5-2

    And then verify the package against the checksum/hash in the package DB using ‘rpm -V package-name’, where ‘V’ (uppercase) is for ‘Verify’.
    rpm -V cpp-3.4.5-2
    ..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1

    The output is a bit cryptic, but what it is saying is that the file /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1 failed the md5 check. A more verbose output makes things a little clearer. Note the lowercase ‘v’ for verbose.
    # rpm -Vv cpp-3.4.5-2
    ........ /lib/cpp
    ........ /usr/bin/cpp
    ........ /usr/libexec/gcc
    ........ /usr/libexec/gcc/x86_64-redhat-linux
    ........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.3
    ..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1
    ........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.5
    ........ d /usr/share/info/cpp.info.gz
    ........ d /usr/share/info/cppinternals.info.gz
    ........ d /usr/share/man/man1/cpp.1.gz
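
In the flag column the third character is the digest test, so a ‘5’ in that position marks a file whose contents no longer match the package database. A short filter can pull out just those paths. This is a sketch with a hypothetical function name, and it assumes the paths contain no spaces.

```shell
# Hypothetical helper: print the paths whose third verify flag is '5',
# meaning the file digest differs from the one in the package database.
failed_digest() {
    awk '$1 ~ /^..5/ { print $NF }'
}
```

Combined with the earlier lookup this becomes rpm -V "$(rpm -qf /usr/bin/cpp)" | failed_digest.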

    So we have a corrupt binary, but that binary is neither the executed binary nor a linked library. Can this really be the cause of the failure? An strace of ‘cpp’ should tell us.

    [root@kc95b7 ~]# strace -f cpp 2>&1 | grep cc1
    stat("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", {st_mode=S_IFREG|0755, st_size=4149432, ...}) = 0
    access("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", X_OK) = 0
    [pid 31323] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
    [pid 31323] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48

    Using the -f switch to strace, I see that we call execve() on ‘cc1’, so cpp does call cc1. A full strace (without grep) shows that it is indeed cc1 that causes the SEGV.
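
Rather than grepping for a name you already suspect, the trace can be reduced to the list of programs that were actually exec'd. The filter below is a sketch with a hypothetical function name. It relies only on the execve("path", ...) layout visible in the strace lines above.

```shell
# Hypothetical helper: print the program path of every execve() call
# found in strace output read from standard input.
exec_paths() {
    sed -n 's/.*execve("\([^"]*\)".*/\1/p'
}
```

A possible invocation is strace -f -e trace=execve cpp </dev/null 2>&1 | exec_paths, which limits the trace to execve calls before filtering.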


    Here we see the ‘cpp’ binary calling cc1. The cpp process has PID 31377, cc1 is PID 31378.

    [pid 31377] wait4(31378, Process 31377 suspended

    [pid 31378] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
    [pid 31378] uname({sys="Linux", node="kc95b7", ...}) = 0
    [pid 31378] brk(0) = 0xaf4000

    Later on we see that PID 31378 (cc1) takes a segfault and receives a SIGSEGV from the kernel, BUT cc1 has its own handler installed for that signal. The rt_sigaction call below resets SIGSEGV to SIG_DFL and shows the previous handler (0x572c06). cc1 then prints its own error message and exits with status 27. It does not dump core, which is why we do not get a core file.

    [pid 31378] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a983e1000
    [pid 31378] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    [pid 31378] rt_sigaction(SIGSEGV, {SIG_DFL}, {0x572c06, [SEGV], SA_RESTORER|SA_RESTART, 0x398f82e380}, 8) = 0
    [pid 31378] open("/usr/share/locale/locale.alias", O_RDONLY) = 3
    ...
    [pid 31378] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48
    [pid 31378] write(2, "\n", 1
    ) = 1
    [pid 31378] write(2, "Please submit a full bug report,"..., 138Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.
    ) = 138
    [pid 31378] exit_group(27) = ?
    Process 31377 resumed
    Process 31378 detached
    <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 27}], 0, NULL) = 31378
    --- SIGCHLD (Child exited) @ 0 (0) ---
    exit_group(1) = ?
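
The same effect can be reproduced with a couple of throwaway shell processes. This is an illustration rather than anything cc1-specific. The function names are hypothetical, and the exit status 27 is borrowed from the trace above purely for the demonstration.

```shell
# Child installs its own SIGSEGV handler, then signals itself. The handler
# runs and the process exits normally, so no core file is written.
with_handler() {
    sh -c 'trap "echo handled SIGSEGV >&2; exit 27" SEGV; kill -s SEGV $$'
}

# Child leaves the default action in place, so the signal terminates it.
# The core size limit is set to 0 only to keep the demonstration tidy.
without_handler() {
    sh -c 'ulimit -c 0; kill -s SEGV $$; exit 0'
}
```

Calling with_handler and then checking $? should show the handler's own exit status. Calling without_handler should instead show a status above 128, which is how the shell reports death by signal.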

    So now we just have to figure out how the binary got corrupted….

    © 2009 dotplan. All Rights Reserved.
