Comparing files across drives, across systems?

Kaitlyn · Dec 13, 2024

I am just finishing up moving my Windows setup over to Mac, and now have my ~4TB of files moved over. I ended up using a few different methods to move the content over, and a few of the methods failed during the process... so that was fun.

As of now I've used a combination of exposing the drives to the network share on both ends, and using FreeFileSync + Carbon Copy Cloner (on Win + Mac respectively) to compare them. And now I am at the point where their file sizes and timestamps are matching, nothing reported left to sync.

HOWEVER

Due to the huge volume of files transferred, the different methods, and some interrupted transfers... I'd really sleep a LOT better at night truly knowing the files are actually the same, copied over entirely... and not just some placeholder/representation of the file size?

FreeFileSync on Windows looks like it has an option to compare FILE CONTENTS, but it's SUPER slow. I think it's basically grabbing each file off the mac across the network and hashing it... which is effectively doing the whole copy over again. Sidenote: It seems there's basically no difference from comparing "File contents" vs a hash of the contents? And given these are my files, that I've just copied across, I'd have to imagine even an md5 hash, with POSSIBLE collisions, is more than adequate?

Is there any way I can optimize this a bit more? Sort of like a Host<>Server setup where the Windows program can just ask the Mac for the hash of the file or something, rather than generating it remotely?

To be clear: I do not wish to SYNC anything anymore, as I have an expectation that is complete... and if not, I'd probably want ti look into it manually. But some output or visual comparison saying "these files are the same", "these files are NOT the same"

XiaoDeer · Dec 13, 2024

Kaitlyn said:
Is there any way I can optimize this a bit more? Sort of like a Host<>Server setup where the Windows program can just ask the Mac for the hash of the file or something, rather than generating it remotely?

That's certainly possible but somewhat complex to set up.

A simpler approach would involve running a program on each machine what would generate a simple file containing the filename and its corresponding MD5 for every file in the directory of interest.

Then compare the two files for any differences.

You would need to allow for the slightly different pathname formats/conventions and probably different ordering of the files.

Personally, I'd do that with Perl and the Digest::MD5 module. This way it would be easy to massage the pathnames into a consistent format and output them and their MD5s sorted the same way.

Sean Nelson · Dec 13, 2024

If you want to be assured that the file contents are the same then you have no choice but to do something that reads all the files again. What you can do to minimize time is to eliminate the need to do this across the network.

Assuming that the folder trees and all the names are the same on both sides, the way I usually do this sort of thing is to use a checksum utility to create a file containing the hashes of all the files in the entire directory tree. You do this locally for both machines, and then you compare the two hash files to see if there are any differences. Just make sure that the utilities on both sides are using the same hash algorithm and creating hash files that have the same format.

If you do this in the future, the smart thing to do is to create the hash file before the copy and then verify it on the other side when the copy is done.

Eric Carlson · Dec 13, 2024

XiaoDeer said:
Kaitlyn said:

Is there any way I can optimize this a bit more? Sort of like a Host<>Server setup where the Windows program can just ask the Mac for the hash of the file or something, rather than generating it remotely?

Click to expand...

That's certainly possible but somewhat complex to set up.

A simpler approach would involve running a program on each machine what would generate a simple file containing the filename and its corresponding MD5 for every file in the directory of interest.

Then compare the two files for any differences.

You would need to allow for the slightly different pathname formats/conventions and probably different ordering of the files.

Personally, I'd do that with Perl and the Digest::MD5 module. This way it would be easy to massage the pathnames into a consistent format and output them and their MD5s sorted the same way.

Great idea!

Kaitlyn · Dec 13, 2024

Sean Nelson said:
If you want to be assured that the file contents are the same then you have no choice but to do something that reads all the files again. What you can do to minimize time is to eliminate the need to do this across the network.

Assuming that the folder trees and all the names are the same on both sides, the way I usually do this sort of thing is to use a checksum utility to create a file containing the hashes of all the files in the entire directory tree. You do this locally for both machines, and then you compare the two hash files to see if there are any differences. Just make sure that the utilities on both sides are using the same hash algorithm and creating hash files that have the same format.

If you do this in the future, the smart thing to do is to create the hash file before the copy and then verify it on the other side when the copy is done.

What commands / tools do you use to generate the hashed file tree?

Kaitlyn · Dec 13, 2024

XiaoDeer said:
Kaitlyn said:

Is there any way I can optimize this a bit more? Sort of like a Host<>Server setup where the Windows program can just ask the Mac for the hash of the file or something, rather than generating it remotely?

Click to expand...

That's certainly possible but somewhat complex to set up.

A simpler approach would involve running a program on each machine what would generate a simple file containing the filename and its corresponding MD5 for every file in the directory of interest.

Then compare the two files for any differences.

You would need to allow for the slightly different pathname formats/conventions and probably different ordering of the files.

Personally, I'd do that with Perl and the Digest::MD5 module. This way it would be easy to massage the pathnames into a consistent format and output them and their MD5s sorted the same way.

Pretty sure I’ve now finally got it going with rsync.
I’m doing it just on a test folder and it took a bit longer than expected… I thought generating the hashes would be faster :

but checking the network and disk activity it’s definitely happening locally on each computer!

my biggest fear now, running it on the full real folder, is an interruption on the check. Windows seems to randomly drop the connection when I tried things in the past… and I don’t think rsync will resume the checksums if I have to restart it?

I do think I prefer the idea of generating the md5 locally and into text files and then comparing that.

yeah, would have to massage the data at least a bit for paths… but yeah. But I spent so much effort into getting the rsync to work! Ahha

I’ve never worked with Perl and it’s been what feels like forever since I’ve done any basic scripting or programming, which was mostly PHP…

Sean Nelson · Dec 13, 2024

Kaitlyn said:
Sean Nelson said:

...the way I usually do this sort of thing is to use a checksum utility to create a file containing the hashes of all the files in the entire directory tree...

Click to expand...

What commands / tools do you use to generate the hashed file tree?

I wrote my own bespoke utility to do this, but there are plenty of others out there on the Internet.

New Wrycuda · Dec 14, 2024

Kaitlyn said:
I am just finishing up moving my Windows setup over to Mac, and now have my ~4TB of files moved over. I ended up using a few different methods to move the content over, and a few of the methods failed during the process... so that was fun.

As of now I've used a combination of exposing the drives to the network share on both ends, and using FreeFileSync + Carbon Copy Cloner (on Win + Mac respectively) to compare them. And now I am at the point where their file sizes and timestamps are matching, nothing reported left to sync.

Could you regard this “move” as a restoration from backup? I’d recommend that rather than “exercising” active disks for verification.

Surely you have the Windows files backed up, and it should be a straightforward process to move that across to the Mac, provided that the file format matches (exFAT). I do note that Macs can be a bit cagey about the location of data files.

I’ve carried out data “moves” from Windows to Windows many times, either using the backup disk, or by simply physically swapping the data disk over (I habitually have a data disk separate from the O/S and Apps disk). OK, never as much as 4TB, but a fair amount.

Something to consider.

Search

Comparing files across drives, across systems?

Kaitlyn

Leading Member

XiaoDeer

Leading Member

Sean Nelson

Forum Pro

Eric Carlson

Veteran Member

Kaitlyn

Leading Member

Kaitlyn

Leading Member

Sean Nelson

Forum Pro

New Wrycuda

Senior Member

Keyboard shortcuts