On 03/02/2022 12:00 PM @lbutlr <kremels@kreme.com> wrote:
> I'm mulling over writing some code to find emails in a maildir that are duplicates, ish. That is to say that sometimes the same message doesn't quite show up as an exact match. Like some ad company may send you three identical messages, except they aren't actually EXACTLY identical: the Message-IDs are different, and maybe the To address quoted part is different, so normal duplicate finders fail to find them.
>
> Before I start, is this a solved problem?
Besides the fact that you've pretty much described how modern AV/AS systems work? :)
Joking aside, isn't this essentially what Bayesian classification does? It compares the text of messages (via tokens) and then uses Bayesian probabilities to emphasize certain terms/relationships. It does require training, though, and it never compares any two messages directly...
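For contrast, here is a rough sketch of what comparing two messages directly over the same kind of tokens could look like. This is not Bayesian classification, just plain Jaccard overlap of word sets; the names tokens() and jaccard() and the tokenizing regex are my own picks for illustration.

```python
import re


def tokens(text):
    """Split text into a set of lowercase word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: shared tokens / all tokens, in 0.0..1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # assumption: two empty texts count as identical
    return len(ta & tb) / len(ta | tb)
```

A score near 1.0 would suggest a near-duplicate, but the cutoff is something you'd have to tune on your own mail, and comparing every pair this way gets slow on a big maildir.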
Maybe some form of perceptual hashing (or similar idea) would work? E.g. http://phash.org/
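To make the "similar idea" concrete for text: one candidate is SimHash, a locality-sensitive hash where similar inputs tend to give fingerprints that differ in only a few bits. The sketch below is only that, a sketch. It assumes plain-text bodies, ignores all headers (so Message-ID and To differences drop out), and the function names, the 64-bit size, and the 3-word shingles are arbitrary choices of mine, not anything taken from pHash.

```python
import hashlib
import re
from email import message_from_string


def body_tokens(raw_message):
    """Return lowercase word tokens from the text/plain parts, ignoring headers."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                chunks.append(payload.decode(charset, errors="replace"))
            except LookupError:  # unknown charset name in the message
                chunks.append(payload.decode("utf-8", errors="replace"))
    return re.findall(r"[a-z0-9]+", " ".join(chunks).lower())


def simhash(tokens, bits=64, shingle=3):
    """SimHash over word shingles: each shingle votes on every fingerprint bit."""
    votes = [0] * bits
    count = max(1, len(tokens) - shingle + 1)
    for i in range(count):
        text = " ".join(tokens[i:i + shingle])
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def fingerprint(raw_message):
    """Fingerprint of a whole raw message, based on its body text only."""
    return simhash(body_tokens(raw_message))
```

Two messages with the same body get a Hamming distance of 0 no matter what the headers say; bodies that differ only slightly should usually land within a few bits of each other, but what counts as "a few" is a threshold you'd have to find by experiment.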
michael