Here’s a weird one that I’d never seen before. We were finding that at least one of our Linux binaries was becoming corrupt (verified using ‘rpm -V’), but we didn’t know how, because these files should never be written to. On the off-chance that the corrupt data itself might tell us something about the cause, I decided to dump the binaries out using ‘od -x’ (‘od’ is the ‘octal dump’ command; ‘-x’ prints hexadecimal words). Having the dumps as text allowed me to use ‘diff’ to see where in the file the corruption was, and what sort of data was sitting where the executable code of the ‘cc1’ binary ought to be.
After using ‘diff’ to find where the corruption was, and seeing that the data was not just a bunch of 00’s or FF’s, I switched to ‘od -c’ to dump out the binary character by character. After peering at it for a while I noticed that there were some filenames embedded in the corrupted data. Interestingly, ‘strings’ did not print the filenames, because each character was followed by a zero byte (I guess they were stored as two-byte characters rather than single-byte characters, so ‘strings’, which by default looks for runs of single-byte characters, did not pick them up).
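For anyone who would rather not eyeball two ‘od’ dumps, the same comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not what I actually ran; the function name ‘diff_ranges’ is made up for this example, and it assumes you have a known-good copy of the binary to compare against.

```python
def diff_ranges(good: bytes, bad: bytes):
    """Return (start, end) byte ranges, end exclusive, where the buffers differ."""
    ranges = []
    start = None
    common = min(len(good), len(bad))
    for i in range(common):
        if good[i] != bad[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(good) != len(bad):
        # Any length mismatch is reported as one trailing range.
        ranges.append((common, max(len(good), len(bad))))
    return ranges
```

Reading both files with open(path, "rb").read() and printing each range with oct(start) gives offsets that can be matched against the left-hand column of the ‘od’ output.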
The point of interest is here :-
16175560 \f 003 u \0 s \0 e \0 r \0 d \0 i \0 f \0
16175600 f \0 . \0 L \0 O \0 G \0 \0 \0 \0 \0 \0 \0
16175620 \0 \0 \0 \0 \0 \0 \0 \0 020 \0 \0 \0 002 \0 \0 \0
If you look closely, there is what looks like a filename: ‘userdiff.LOG’. It turns out that ‘userdiff.log’ is a Windows file!
So, how does a Windows filename end up embedded inside our Linux binary? Here’s what I think happened. This client is a multi-purpose lab client, sometimes booted into Windows, sometimes into Linux. The client is SAN booted from a NetApp LUN.
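Spotting that by eye took a while. A rough sketch of how the search could be automated is below: it looks for runs of printable ASCII characters that are each followed by a zero byte, which is how this filename appears in the dump. The function name ‘utf16le_strings’ and the minimum length of 4 are my own choices for the example, and treating the data as UTF-16LE is an assumption based on the byte pattern. (GNU ‘strings’ also has an ‘-e l’ option for 16-bit little-endian characters, which should do a similar job.)

```python
import re


def utf16le_strings(data: bytes, min_len: int = 4):
    """Find runs of printable ASCII characters stored as 2-byte little-endian."""
    # One printable ASCII byte followed by a NUL byte, repeated min_len or more times.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [(m.start(), m.group().decode("utf-16-le"))
            for m in pattern.finditer(data)]


# Bytes modelled on the dump above: two non-printable bytes (\f, 003),
# then the filename with a zero byte after every character, then zero padding.
sample = b"\x0c\x03" + "userdiff.LOG".encode("utf-16-le") + b"\x00" * 8
print(utf16le_strings(sample))
```

Each result is a pair of the byte offset within the data and the decoded string.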
The system is booted into Windows, and runs some tests.
The Windows system is shut down.
Before Windows has fully shut down, the boot LUN is swapped over in preparation for the next test, which happens to be a Linux LUN.
As Windows continues to shut down, it writes out the last few dirty pages from memory before finally stopping. However, these pages don’t go onto the Windows LUN, as they should, but onto the Linux LUN that has been swapped in.
The Linux LUN is now corrupted.
But surely when the Linux LUN is swapped in, the writes immediately stop because the Windows LUN is ripped away? No! The Windows SCSI driver (like all SCSI drivers) just puts the blocks on the SCSI bus with a destination address. There will probably be a SCSI error, such as SCSI_TIMEOUT, whilst the new LUN is being swapped in. However, the SCSI driver WILL RETRY, and once the new LUN is in place the write succeeds against the new LUN after one or more retries.
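To make the retry argument concrete, here is a toy model of it in Python. It is emphatically not real SCSI driver code: ‘Lun’, ‘Target’ and ‘write_with_retry’ are hypothetical names invented for the sketch, and a Python TimeoutError stands in for whatever error the real driver sees. The only point it makes is that the write names a block address, not a particular LUN, so a retry lands on whatever is mapped at that moment.

```python
class Lun:
    """Toy LUN: a named array of blocks."""
    def __init__(self, name, nblocks=8):
        self.name = name
        self.blocks = [b""] * nblocks


class Target:
    """Toy storage target that exposes whichever LUN is currently mapped."""
    def __init__(self, lun=None):
        self.mapped = lun

    def write(self, lba, data):
        if self.mapped is None:
            # Stand-in for the error seen while the LUN is being swapped.
            raise TimeoutError("no LUN mapped")
        self.mapped.blocks[lba] = data


def write_with_retry(target, lba, data, retries=3, between_attempts=None):
    """Retry a write; the request carries a block address, not a LUN identity."""
    for attempt in range(retries + 1):
        try:
            target.write(lba, data)
            return attempt  # number of retries that were needed
        except TimeoutError:
            if between_attempts is not None:
                between_attempts(attempt)
    raise TimeoutError("write failed after %d retries" % retries)


def simulate_swap():
    windows, linux = Lun("windows"), Lun("linux")
    target = Target(windows)
    target.mapped = None  # the Windows LUN is unmapped mid-shutdown

    def finish_swap(attempt):
        target.mapped = linux  # the Linux LUN is mapped before the retry

    retries = write_with_retry(target, 5, b"userdiff.LOG",
                               between_attempts=finish_swap)
    return windows, linux, retries
```

Calling simulate_swap() leaves the Windows LUN untouched and puts the dirty page in block 5 of the Linux LUN after a single retry, which is the sequence of events I think hit our ‘cc1’ binary.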