We had a very strange problem at work whereby we could execute 8 copies of our load generation tool, but not more. This is a tool we use all the time, and regularly run more than 20 copies per host. The box I was trying to run the tool (sio) was a freshly installed copy of Solaris.
After successfully running ‘sio’ 8 times, attempting to execute another at the command line failed immediatley with a single line response.
Killed
Trying to truss the process gave no clues
$ truss -vall ./sio truss: cannot execute program: ./sio
so I ran truss against a shell execution of the same process to see what happened and there was a little more data:
- The child was killed with signal 9.
- The last successfull syscall in the child was brk()
- The child did not successfully call exec()
6017: fork1() = 6018 6018: fork1() (returning as child ...) = 6017 6017: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 6018: getpid() = 6018 [6017] 6018: lwp_self() = 1 6018: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 6018: getpid() = 6018 [6017] 6018: brk(0x08077808) = 0 6017: waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED|WNOWAIT) (sleeping...) 6017: waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED|WNOWAIT) = 0 6017: siginfo: SIGCLD CLD_KILLED pid=6018 status=0x0009 6017: ioctl(0, TIOCGPGRP, 0x08047AD0) = 0 6017: 0x08047AD0: 6016 6017: ioctl(0, TCGETS, 0x08075EC8) = 0 6017: iflag=0026442 oflag=0014005 cflag=03200275 lflag=0105033 6017: cc: 003 034 177 025 004 377 377 000 6017: 021 023 032 031 022 017 027 026 000 000 000 6017: waitid(P_PID, 6018, 0x08047A60, WEXITED|WTRAPPED) = 0 6017: siginfo: SIGCLD CLD_KILLED pid=6018 status=0x0009
The binary itself is very small (around 100Kbytes), and quite simple, so I did not think that memory would be the problem. Also, I would have expected to get ENOMEM or some other message for a memory error. However it turned out that this small binary had a very large bss segment, which can be seen by running the ‘size’ command against the compiled binary. This memory space is allocated at startup, rather than by a malloc() call which is why the OS fails to complete the exec() call.
$ size sio 82136 + 2852 + 138212100 = 138297088
The BSS segment is the third column, and the output is in bytes, so this binary has a BSS segment of around 134Mbytes. The Solaris installation had a relatively small swap size of around 512Mb and vmstat showed 1069944 Kbytes of swap available. After running 7 copies of sio there was only 122264 Kbytes of swap available (from vmstat) and so I could not execute another copy of sio.
I was able to reproduce the problem using a small program with a large array allocation
#includeint main() { static int array[25*1024*1024]; printf("Hit return to exit"); scanf("Exit"); }
After compiling (gcc big-bss.c) the file size is very small as expected.
-rwxr-xr-x 1 little engr 6396 Aug 25 09:07 a.out
but the bss is quite large (25*1024*1024) * 4 (bytes for an int).
$ size a.out 2388 + 336 + 104857660 = 104860384
So since we have not touched anything in the array we don’t expect to have any pages in memory, and that seems to be verified by pmap.
pmap -x 6322 6322: ./a.out Address Kbytes RSS Anon Locked Mode Mapped File 08046000 8 8 8 - rwx-- [ stack ] 08050000 4 4 - - r-x-- a.out 08060000 4 4 4 - rwx-- a.out 08061000 102400 - - - rwx-- [ heap ] FEE80000 1068 756 - - r-x-- libc.so.1 FEF90000 24 12 12 - rwx-- [ anon ] FEF9B000 32 32 32 - rw--- libc.so.1 FEFA3000 8 8 8 - rw--- libc.so.1 FEFC4000 156 156 - - r-x-- ld.so.1 FEFF0000 4 4 4 - rwx-- [ anon ] FEFFB000 8 8 8 - rwx-- ld.so.1 FEFFD000 4 4 4 - rwx-- ld.so.1 -------- ------- ------- ------- ------- total Kb 103720 996 80 -