Recommendation for an easy-to-use and reliable duplicate file remover

Started Oct 28, 2012 | Discussions thread
Alpha Doug Veteran Member • Posts: 9,219
Re: Recommendation for an easy-to-use and reliable duplicate file remover

Tom_N wrote:

Susiewong wrote:

I'd be grateful if you could explain what is meant by "checksum".

A checksum is a number computed by munging together the contents of a file, or the contents of part of a network packet. You can think of it as a sort of "Social Security Number" or "Driver's License Number" for a file. E.g.,

new-host:~ tnewton$ cat >test
The mouse easily evaded the sleeping cat.
new-host:~ tnewton$ cksum test
1555693831 42 test

The typical use of checksums is in file transfer operations (where you want to make sure that the file did not get corrupted in the process of being copied) or network security (where you want to make sure that someone did not tamper with part of a packet).

For instance, if I make an exact copy of the original file, the checksum will be the same:

new-host:~ tnewton$ cp test test_copy
new-host:~ tnewton$ cksum test_copy
1555693831 42 test_copy

But if I alter the file, to simulate bad things happening during a file copy, the checksum changes. Assuming that the original checksum was downloaded with the corrupted file, there would be an excellent chance of determining that the copy was bad BEFORE someone did something silly such as "deleting the original copy that is no longer needed".

new-host:~ tnewton$ cat >> test_copy
new-host:~ tnewton$ cat test_copy
The mouse easily evaded the sleeping cat.
new-host:~ tnewton$ cksum test_copy
1164124977 56 test_copy
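
The same before-and-after check can be scripted. Here is a minimal sketch in Python, assuming the standard zlib module; the function names crc32_of_file and copy_is_intact are made up for illustration. Note that zlib's CRC-32 is a different algorithm from the one cksum uses, so the numbers it produces will not match the cksum output above. Only the idea (same contents, same checksum) carries over.

```python
import zlib


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file's contents, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        # Feed each chunk into the running CRC so large files
        # never have to fit in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def copy_is_intact(original_path, copy_path):
    """True if both files have the same CRC-32 (hypothetical helper)."""
    return crc32_of_file(original_path) == crc32_of_file(copy_path)
```

Calling copy_is_intact("test", "test_copy") would be expected to return True right after the cp step and False once test_copy has been altered. A match is strong evidence rather than proof, since two different files can occasionally share a checksum.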

In your situation – finding duplicate files on a hard drive – you would want an algorithm that is better than simply doing a full-length file comparison of every file on the disk to every other file. That would involve a massive amount of I/O and take forever to complete.

One way a duplicate finder might work is to sort files by data length (and only compare files for which lengths are identical). Another might be to compute checksums for each file individually (still disk-intensive, but not nearly as bad as comparing every file to every other file). Whenever two checksums matched, that would indicate a duplicate, or at least a reason to go and conduct the more expensive ("compare the contents of this file to the contents of that one") duplicate testing.
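
To make that concrete, here is a rough sketch in Python that chains the ideas together: group files by length, checksum only the files that share a length, then do the full comparison only where checksums match. It is not how Gemini or any particular product works. The names file_digest and find_duplicates are hypothetical, and using SHA-256 as the checksum is my own assumption.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (sorted lists of paths) of files with identical contents."""
    # Pass 1: group by length. This is cheap because no file is read.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique length cannot have a duplicate
        # Pass 2: checksum only the files that share a length.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        # Pass 3: confirm matching checksums with a full comparison.
        for remaining in by_digest.values():
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(sorted(same))
                remaining = [p for p in rest if p not in same]
    return groups
```

The sketch only reports groups and deletes nothing, so which copy to keep remains a decision for a person.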

Hey Tom, thanks for explaining! Better than I could have done. I just wanted to convey that Gemini does a more complete job of defining a true duplicate than other duplicate removers I've tried. That said, I still have some duplicates in my iTunes library that apparently have no "data" attached to them; they only show up in the iTunes display, not in the file packages. I'm kind of stumped about where they came from. I'm thinking I might have to completely redo my library. And I'm also a little confused about how to re-download all the "purchased" music I have bought over the years.

Only my opinion. It's worth what you paid for it. Your mileage may vary! ;-}
