I think I'm looking for a fast, non-cryptographic hash implemented in Rust that can do this:

let _n = io::copy(&mut file, &mut hasher).expect("Error hashing a file");


Or a way to use the apparently-more-standard hasher.write(&[u8]) syntax to efficiently hash a whole file...
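One way to get both at once, sketched under some assumptions: wrap any type implementing std::hash::Hasher in a thin adapter that implements io::Write, so io::copy can stream a file into it in chunks. The names HashWriter and hash_reader are made up for illustration, and DefaultHasher from the standard library is only a stand-in for whichever fast hasher ends up being used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Write};

/// Hypothetical adapter: lets any `Hasher` be the target of `io::copy`.
struct HashWriter<H: Hasher>(H);

impl<H: Hasher> Write for HashWriter<H> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Forward the chunk to the hasher; hashing cannot fail,
        // so the whole buffer is always reported as consumed.
        Hasher::write(&mut self.0, buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(()) // nothing is buffered here
    }
}

/// Streams everything from `reader` into the hasher and returns the 64-bit result.
/// `DefaultHasher` is a placeholder (assumption); swap in any other `Hasher`.
fn hash_reader<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut writer = HashWriter(DefaultHasher::new());
    io::copy(&mut reader, &mut writer)?;
    Ok(writer.0.finish())
}
```

With this in place, something like hash_reader(File::open(path)?) should hash a whole file without loading it into memory. Note that the algorithm behind DefaultHasher is not guaranteed to stay the same across Rust releases, so its output is probably not something to store on disk.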

@schlink By the way, I dunno if this is helpful, but a while back I developed an application where I wanted to know if I had duplicate files anywhere on a disk.

Rather than hash the entire disk just to build an index, I used some deterministic algorithm to choose a small pseudorandom subset of each file and hash that to build my index. There were collisions among non-duplicate files, but they were few and far between (and often not even among files with the same size), and it was a simple matter to do a full hash on the small subset of files with collisions.

@schlink The reason for choosing a pseudorandom subset rather than, say, a fixed amount at the beginning is that a lot of files have a bunch of “header” or “footer” matter that will be identical. With a pseudorandom subset, you are more likely to hit bytes that differ between similar files.
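A rough sketch of what such a sampled fingerprint could look like. This is not the original application's algorithm, which isn't described in detail. The chunk count, chunk size, the splitmix64-style offset generator, seeding from the file length, and the name sampled_hash are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};

const SAMPLES: u64 = 16; // assumed number of chunks to sample
const CHUNK: u64 = 4096; // assumed bytes per chunk

/// splitmix64 step, used only to pick offsets deterministically.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hashes the length plus a pseudorandom subset of chunks from `src`.
/// Inputs of equal length are sampled at the same offsets, so true
/// duplicates always produce the same value.
fn sampled_hash<R: Read + Seek>(src: &mut R) -> io::Result<u64> {
    let len = src.seek(SeekFrom::End(0))?;
    let mut hasher = DefaultHasher::new(); // stand-in for any fast Hasher
    hasher.write_u64(len);

    if len <= SAMPLES * CHUNK {
        // Small input: sampling saves nothing, so hash all of it.
        src.seek(SeekFrom::Start(0))?;
        let mut all = Vec::new();
        src.read_to_end(&mut all)?;
        hasher.write(&all);
        return Ok(hasher.finish());
    }

    let mut state = len; // seed depends only on the length
    let mut buf = vec![0u8; CHUNK as usize];
    for _ in 0..SAMPLES {
        // Offset is chosen so a full chunk always fits before the end.
        let offset = splitmix64(&mut state) % (len - CHUNK + 1);
        src.seek(SeekFrom::Start(offset))?;
        src.read_exact(&mut buf)?;
        hasher.write(&buf);
    }
    Ok(hasher.finish())
}
```

Files whose sampled values match would then get a full hash to confirm they really are duplicates, as described above.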
