dotplan

troubleshooting & performance analysis

Case of the failing executable: Killed at startup.

Tags: , , , , , , , , , ,

We had a very strange problem at work whereby we could execute 8 copies of our load generation tool, but not more. This is a tool we use all the time, and regularly run more than 20 copies per host. The box I was trying to run the tool (sio) was a freshly installed copy of Solaris.

After successfully running ‘sio’ 8 times, attempting to execute another at the command line failed immediatley with a single line response.

Killed

Trying to truss the process gave no clues

$ truss -vall ./sio
truss: cannot execute program: ./sio

so I ran truss against a shell execution of the same process to see what happened and there was a little more data:

  • The child was killed with signal 9.
  • The last successfull syscall in the child was brk()
  • The child did not successfully call exec()
6017:   fork1()                                         = 6018
6018:   fork1()         (returning as child ...)        = 6017
6017:   lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
6018:   getpid()                                        = 6018 [6017]
6018:   lwp_self()                                      = 1
6018:   lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
6018:   getpid()                                        = 6018 [6017]
6018:   brk(0x08077808)                                 = 0
6017:   waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED|WNOWAIT) (sleeping...)
6017:   waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED|WNOWAIT) = 0
6017:         siginfo: SIGCLD CLD_KILLED pid=6018 status=0x0009
6017:   ioctl(0, TIOCGPGRP, 0x08047AD0)                 = 0
6017:           0x08047AD0: 6016
6017:   ioctl(0, TCGETS, 0x08075EC8)                    = 0
6017:           iflag=0026442 oflag=0014005 cflag=03200275 lflag=0105033
6017:               cc:  003 034 177 025 004 377 377 000
6017:                    021 023 032 031 022 017 027 026 000 000 000
6017:   waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED) = 0
6017:         siginfo: SIGCLD CLD_KILLED pid=6018 status=0x0009

The binary itself is very small (around 100Kbytes), and quite simple, so I did not think that memory would be the problem. Also, I would have expected to get ENOMEM or some other message for a memory error. However it turned out that this small binary had a very large bss segment, which can be seen by running the ‘size’ command against the compiled binary. This memory space is allocated at startup, rather than by a malloc() call which is why the OS fails to complete the exec() call.

$ size sio
82136 + 2852 + 138212100 = 138297088

The BSS segment is the third column, and the output is in bytes, so this binary has a BSS segment of around 134Mbytes. The Solaris installation had a relatively small swap size of around 512Mb and vmstat showed 1069944 Kbytes of swap available. After running 7 copies of sio there was only 122264 Kbytes of swap available (from vmstat) and so I could not execute another copy of sio.

I was able to reproduce the problem using a small program with a large array allocation

#include 
int main()
{
 static int array[25*1024*1024];
printf("Hit return to exit");
scanf("Exit");
}

After compiling (gcc big-bss.c) the file size is very small as expected.

-rwxr-xr-x   1 little   engr        6396 Aug 25 09:07 a.out

but the bss is quite large (25*1024*1024) * 4 (bytes for an int).

$ size a.out
2388 + 336 + 104857660 = 104860384

So since we have not touched anything in the array we don’t expect to have any pages in memory, and that seems to be verified by pmap.

pmap -x 6322
6322:   ./a.out
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08046000       8       8       8       - rwx--    [ stack ]
08050000       4       4       -       - r-x--  a.out
08060000       4       4       4       - rwx--  a.out
08061000  102400       -       -       - rwx--    [ heap ]
FEE80000    1068     756       -       - r-x--  libc.so.1
FEF90000      24      12      12       - rwx--    [ anon ]
FEF9B000      32      32      32       - rw---  libc.so.1
FEFA3000       8       8       8       - rw---  libc.so.1
FEFC4000     156     156       -       - r-x--  ld.so.1
FEFF0000       4       4       4       - rwx--    [ anon ]
FEFFB000       8       8       8       - rwx--  ld.so.1
FEFFD000       4       4       4       - rwx--  ld.so.1
-------- ------- ------- ------- -------
total Kb  103720     996      80       -

© 2009 dotplan. All Rights Reserved.

This blog is powered by Wordpress and Magatheme by Bryan Helmig.