The script uses the Digest::MD5 module to look for duplicate files. It takes a list of filenames as input and returns a list of all files that share the same md5 hash/signature. Files that share the same md5 signature are very likely to be duplicates.
    lovebox:~ gjl$ cp /etc/hosts file1
    lovebox:~ gjl$ cp /etc/hosts file2
    lovebox:~ gjl$ dupes file1 file2
    cbe7e7eb6480e869bccfa284dc8bd732 : file1 file2
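The grouping step described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl of the original script, whose source is not shown here; the function names `md5_of_file` and `find_dupes` and the chunk size are assumptions made for this example, not part of the original.

```python
import hashlib
import sys
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_dupes(paths):
    """Map each MD5 digest shared by two or more files to those files."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of_file(path)].append(path)
    # Keep only digests seen more than once: these are the likely duplicates.
    return {d: names for d, names in by_digest.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical invocation: python dupes.py file1 file2
    for digest, names in find_dupes(sys.argv[1:]).items():
        print(digest, ":", " ".join(names))
```

Reading in chunks keeps memory use flat even for large files, and files with a unique digest are simply left out of the result, which matches the output format shown in the session above.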
This usage is almost identical to using the standard Unix utility cksum, except that the md5 checksum is less likely to give the same ‘checksum’ for a file which is actually different: cksum computes a 32-bit CRC, while md5 produces a 128-bit digest.
    lovebox:~ gjl$ cksum file1 file2
    85078130 236 file1
    85078130 236 file2
For a 100 MB file created using ‘dd if=/dev/random’, the cost in time of the two checksum methods is about the same, even though one would expect the md5 checksum to be more computationally expensive. In fact, the time(1) utility seems to show that the md5 script actually uses less CPU time than the Unix cksum tool.
In both runs the wall time is slightly greater than the sum of user+sys, so there must be some wait event, even though the files are cached by the filesystem.
    lovebox-2:~ gjl$ time cksum file1 file2
    3512382378 104857600 file1
    3512382378 104857600 file2

    real    0m1.833s
    user    0m1.660s
    sys     0m0.144s

    lovebox-2:~ gjl$ time dupes file1 file2
    ed3b0c7dcfc7192186899a3c63854eb0 : file1 file2

    real    0m1.181s
    user    0m0.916s
    sys     0m0.250s
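To look at the gap between wall time and CPU time from inside a process, something like the following could be used. It is only a rough Python sketch, not the original Perl script: `timed_md5` is a hypothetical helper, and its figures are not directly comparable with time(1), which also counts process start-up and reports user and sys separately.

```python
import hashlib
import time


def timed_md5(path, chunk_size=1 << 20):
    """Hash a file with MD5; return (hex digest, wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return digest.hexdigest(), wall, cpu
```

If the wall figure comes out noticeably larger than the CPU figure, the process spent part of its time waiting rather than computing, which is the same comparison made with real versus user+sys above.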