Recently some of our Linux clients mysteriously failed to run SFS benchmarks. On closer inspection we found that the ‘cpp’ binary was failing with a segmentation fault.
# cpp
cc1: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See
For some reason that I didn’t understand, I was not able to get a core file: after setting ulimit -c I could get cores for other binaries, but not for cpp. Anyhow, since cpp was working on other clients, it looked like something had gone wrong with either the binary or a library. So I took a look at the libraries…
# ldd /usr/bin/cpp
libc.so.6 => /lib64/tls/libc.so.6 (0x000000398f800000)
/lib64/ld-linux-x86-64.so.2 (0x000000398f600000)
Nothing peculiar there, and if libc was broken, pretty much nothing would work. So the next step was to verify the cpp binary itself. I’d initially planned to run cksum or md5 on the cpp binary on the broken machine, and on a known good machine. Then I remembered that modern package management systems often keep a checksum or hash of the binaries in the package database.
So first we need to determine which package cpp belongs to:
# rpm -qf /usr/bin/cpp
cpp-3.4.5-2
And then verify the package against the checksum/hash in the package DB using ‘rpm -V package-name’, where ‘V’ (uppercase) is for ‘Verify’.
# rpm -V cpp-3.4.5-2
..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1
The output is a bit cryptic, but what it is saying is that the file /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1 failed the MD5 check (the ‘5’ in the third position of the flags). More verbose output makes things a little clearer; note the lowercase ‘v’ for verbose.
# rpm -Vv cpp-3.4.5-2
........ /lib/cpp
........ /usr/bin/cpp
........ /usr/libexec/gcc
........ /usr/libexec/gcc/x86_64-redhat-linux
........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.3
..5..... /usr/libexec/gcc/x86_64-redhat-linux/3.4.3/cc1
........ /usr/libexec/gcc/x86_64-redhat-linux/3.4.5
........ d /usr/share/info/cpp.info.gz
........ d /usr/share/info/cppinternals.info.gz
........ d /usr/share/man/man1/cpp.1.gz
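The flag column can be decoded mechanically. The following is only a rough sketch: explain_rpm_verify is a made-up helper name (not part of rpm), it assumes the usual S/M/5/D/L/U/G/T flag letters described in the rpm man page, and it assumes file paths contain no spaces. It reads verify output on stdin and prints only the files that failed at least one test.

```shell
# explain_rpm_verify: hypothetical helper, not part of rpm.
# Reads `rpm -V` or `rpm -Vv` output on stdin and prints "path: reasons"
# for every file that failed at least one verification test.
explain_rpm_verify() {
    while read -r flags rest; do
        # Skip blank lines and lines where every test passed (dots only).
        case "$flags" in
            *[!.]*) ;;
            *) continue ;;
        esac
        # The path is the last field; an optional marker such as "d"
        # (documentation) or "c" (config) may precede it.
        # Assumption: the path itself contains no spaces.
        path=${rest##* }
        reasons=""
        if [ "$flags" = "missing" ]; then
            reasons=" missing"
        else
            for pair in S:size M:mode 5:digest D:device L:symlink U:owner G:group T:mtime; do
                key=${pair%%:*}
                label=${pair#*:}
                case "$flags" in
                    *"$key"*) reasons="$reasons $label" ;;
                esac
            done
        fi
        printf '%s:%s\n' "$path" "$reasons"
    done
}
```

Piping the verbose output through it, for example ‘rpm -Vv cpp-3.4.5-2 | explain_rpm_verify’, should reduce the listing above to the single cc1 line with the reason ‘digest’.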
So we have a corrupt binary, but that binary is neither the executed binary nor a linked library. Can this really be the cause of the failure? An strace of ‘cpp’ should tell us.
[root@kc95b7 ~]# strace -f cpp 2>&1 | grep cc1
stat("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", {st_mode=S_IFREG|0755, st_size=4149432, ...}) = 0
access("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", X_OK) = 0
[pid 31323] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
[pid 31323] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48
Using the -f switch to strace, I see that we call execve() on ‘cc1’, so cpp does call cc1. A full strace (without grep) shows that it is indeed cc1 that gets the SEGV.
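Grepping for cc1 only works once you already suspect cc1. As a more general sketch, the filter below lists every program that was successfully started via execve(). The name list_execs is made up for illustration, and the pattern assumes the strace line format shown above.

```shell
# list_execs: hypothetical helper.
# Reads `strace -f` output on stdin and prints the path of each program
# whose execve() call succeeded (the line ends in ") = 0").
list_execs() {
    sed -n '/) = 0$/ s/.*execve("\([^"]*\)".*/\1/p'
}
```

It could be used as ‘strace -f cpp 2>&1 | list_execs’. Failed execve() attempts, such as those made during a PATH search, are deliberately left out.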
Here we see the ‘cpp’ binary calling cc1. The cpp process has PID 31377; cc1 is PID 31378.
[pid 31377] wait4(31378, Process 31377 suspended
[pid 31378] execve("/usr/libexec/gcc/x86_64-redhat-linux/3.4.5/cc1", ["/usr/libexec/gcc/x86_64-redhat-l"..., "-E", "-quiet", "-", "-mtune=k8"], [/* 27 vars */]) = 0
[pid 31378] uname({sys="Linux", node="kc95b7", ...}) = 0
[pid 31378] brk(0) = 0xaf4000
Later on we see that PID 31378 (cc1) generates a segfault and receives a SIGSEGV from the kernel, BUT cc1 has its own handler installed for that signal. The handler resets the signal to its default action with rt_sigaction, prints its own error message, and then exits normally via exit_group(27). The process is never actually killed by the signal, so it does not dump core, which is why we do not get a core file.
[pid 31378] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a983e1000
[pid 31378] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 31378] rt_sigaction(SIGSEGV, {SIG_DFL}, {0x572c06, [SEGV], SA_RESTORER|SA_RESTART, 0x398f82e380}, 8) = 0
[pid 31378] open("/usr/share/locale/locale.alias", O_RDONLY) = 3
...
[pid 31378] write(2, "cc1: internal compiler error: Se"..., 48cc1: internal compiler error: Segmentation fault) = 48
[pid 31378] write(2, "\n", 1
) = 1
[pid 31378] write(2, "Please submit a full bug report,"..., 138Please submit a full bug report,
with preprocessed source if appropriate.
See
) = 138
[pid 31378] exit_group(27) = ?
Process 31377 resumed
Process 31378 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 27}], 0, NULL) = 31378
--- SIGCHLD (Child exited) @ 0 (0) ---
exit_group(1) = ?
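The same distinction is visible from the shell without strace. A process killed by a signal normally shows up with an exit status of 128 plus the signal number (139 for SIGSEGV, which is signal 11 on Linux), whereas the trace above shows ordinary exits with codes 27 and 1. The helper below is a small sketch of that convention; describe_status is a made-up name, and a program that deliberately exits with a code above 128 would be misreported.

```shell
# describe_status: hypothetical helper.
# Interprets a shell exit status ($?) using the 128 + N convention.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Convention: 128 + N means the child was terminated by signal N.
        echo "killed by signal $((status - 128))"
    else
        # The process called exit itself, so the kernel wrote no core file.
        echo "exited normally with code $status"
    fi
}
```

Running the failing command and then ‘describe_status $?’ would be expected to report a normal exit here, which would have been an early hint that no core file was coming.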
So now we just have to figure out how the binary got corrupted….