dotplan

troubleshooting & performance analysis

Find duplicate files with Perl md5


The script uses the Digest::MD5 module to look for duplicate files. It takes a list of filenames as input and prints every group of files that share the same MD5 hash/signature. Files that share the same MD5 signature are very likely to be duplicates.

lovebox:~ gjl$ cp /etc/hosts file1
lovebox:~ gjl$ cp /etc/hosts file2
lovebox:~ gjl$ dupes file1 file2
cbe7e7eb6480e869bccfa284dc8bd732 : file1 file2
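The script itself is not listed here, so the following is only a minimal sketch of the approach described above, not the original dupes source. The sub names group_by_md5 and print_dupes are made up for illustration, and the "digest : files" output format is simply copied from the example run.

```perl
#!/usr/bin/perl
# Sketch of a "dupes"-style script: group files by MD5 digest.
# Sub names are hypothetical; only Digest::MD5 comes from the post.
use strict;
use warnings;
use Digest::MD5;

# Return a hash ref mapping each MD5 hex digest to the files that produced it.
sub group_by_md5 {
    my @files = @_;
    my %by_digest;
    for my $file (@files) {
        my $fh;
        # Skip unreadable files rather than aborting the whole run.
        unless (open($fh, '<', $file)) {
            warn "cannot open $file: $!\n";
            next;
        }
        binmode($fh);    # hash the raw bytes
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);
        push @{ $by_digest{$digest} }, $file;
    }
    return \%by_digest;
}

# Print only the digests shared by two or more files.
sub print_dupes {
    my ($by_digest) = @_;
    for my $digest (sort keys %$by_digest) {
        my @names = @{ $by_digest->{$digest} };
        print "$digest : @names\n" if @names > 1;
    }
}

# Run against the command-line arguments when executed as a script.
print_dupes(group_by_md5(@ARGV)) unless caller;
```

Saved as dupes and made executable, it would be called the same way as in the session above: dupes file1 file2. Because a matching digest only makes a duplicate very likely, a careful version could follow up with a byte-for-byte comparison before deleting anything.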

This usage is almost identical to that of the standard Unix utility cksum, except that a 128-bit MD5 checksum is much less likely than cksum's 32-bit CRC to give the same ‘checksum’ for a file which is actually different.

lovebox:~ gjl$ cksum file1 file2
85078130 236 file1
85078130 236 file2

For a 100 MB file created using ‘dd if=/dev/random’, the cost in time of the two checksum methods is about the same, even though one would expect the MD5 checksum to be more computationally expensive. In fact, the time(1) utility seems to show that the MD5 script uses less CPU time than the Unix cksum tool.

In both runs the wall time is slightly greater than the sum of user + sys, so there must be some wait event, even though the files are cached by the filesystem.

lovebox-2:~ gjl$ time cksum file1 file2
3512382378 104857600 file1
3512382378 104857600 file2

real	0m1.833s
user	0m1.660s
sys	0m0.144s

lovebox-2:~ gjl$ time dupes file1 file2
ed3b0c7dcfc7192186899a3c63854eb0 : file1 file2

real	0m1.181s
user	0m0.916s
sys	0m0.250s
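To see where the gap between wall time and user + sys comes from, the same measurement can be taken from inside Perl, around the hashing step only. This is a rough sketch rather than part of the original script: the sub name timed_md5 and the "other" column are made up for illustration, and it relies only on the built-in times function and Time::HiRes.

```perl
#!/usr/bin/perl
# Sketch: compare wall-clock time with CPU time while MD5-hashing a file.
# timed_md5 is a hypothetical helper, not taken from the dupes script.
use strict;
use warnings;
use Digest::MD5;
use Time::HiRes qw(time);    # sub-second wall-clock time

# Hash one file; return (digest, wall seconds, user CPU seconds, sys CPU seconds).
sub timed_md5 {
    my ($file) = @_;
    my $wall0 = time();
    my ($user0, $sys0) = times();    # CPU seconds used by this process so far

    open(my $fh, '<', $file) or die "cannot open $file: $!\n";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);

    my ($user1, $sys1) = times();
    my $wall1 = time();
    return ($digest, $wall1 - $wall0, $user1 - $user0, $sys1 - $sys0);
}

# Report timings for each file named on the command line.
unless (caller) {
    for my $file (@ARGV) {
        my ($digest, $wall, $user, $sys) = timed_md5($file);
        printf "%s %s\n  wall %.3fs  user %.3fs  sys %.3fs  other %.3fs\n",
            $digest, $file, $wall, $user, $sys, $wall - $user - $sys;
    }
}
```

The "other" figure is just wall minus user minus sys, i.e. time the process spent neither on its own CPU work nor in the kernel on its behalf. The CPU figures from times are usually only accurate to a clock tick, so small values (including slightly negative ones) should be read as noise.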

© 2009 dotplan. All Rights Reserved.
