[Dovecot] Mail deduplication
Radek Novotný
radikn at seznam.cz
Tue Apr 30 12:23:59 EEST 2013
On 30.4.2013 03:28, Tim Groeneveld wrote:
> Hi Guys,
>
> I am wondering about mail deduplication. I am looking into the possibility
> of separating out all of the message bodies with multiple parts inside mail
> that is received from `dovecot` and hashing them all.
>
> The idea is that by hashing all of the parts inside the email, I will be
> able to ensure that each part of the email will only be saved once.
>
> This means that attachments & common parts of the body will only be
> saved once inside the storage.
>
> How achievable would this be with the current state of dovecot? Would it
> even be worth doing?
>
> Thanks,
> Tim
Hi Tim,
thank you for your question. I am glad I can help. I had the same
problem in the past and there was no ready-made solution, so I wrote a
script that computes an MD5 hash of each message's received date and
body. The script then compares the hashes and deletes the duplicate
messages. It uses doveadm for message manipulation and openssl for
computing the MD5 hashes. Deduplication runs per mailbox and can cover
all of a user’s mailboxes. The syntax is dedup <user> <mailbox>, for
example:
dedup name at domain.cz INBOX
If you want to deduplicate all mailboxes, enter -A instead of the
mailbox name:
dedup name at domain.cz -A
The script is attached. I made it for my own use, so it isn’t foolproof.
My advice: work with care and make a backup first ;-)
Good luck
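#!/bin/bash
# Remove duplicate messages from a mailbox.
# Usage: dedup <user> <mailbox>   (-A instead of <mailbox> means all mailboxes)
# Needs bash (arrays, C-style loops), doveadm and openssl.

dedup_mailbox ()
{
    # $1 = user, $2 = mailbox
    local uids=( $(doveadm -f flow fetch -u "$1" "uid" mailbox "$2" all | cut -f 2 -d =) )
    if [ ${#uids[@]} -eq 0 ]; then
        echo " No messages"
        return
    elif [ ${#uids[@]} -eq 1 ]; then
        echo " Only one message"
        return
    fi

    # Build "md5,uid" entries; the hash covers the received date and the body.
    local -a md5s_u
    local i j x A B hash
    for (( i=0; i<${#uids[@]}; i++ )); do
        # awk keeps only the digest, because some openssl versions print
        # "(stdin)= <digest>" instead of the bare digest.
        hash=$(doveadm -f flow fetch -u "$1" "date.received body" mailbox "$2" uid "${uids[$i]}" | openssl md5 | awk '{print $NF}')
        md5s_u[$i]="$hash,${uids[$i]}"
        echo -en " Compute hashes: $((i + 1))/${#uids[@]}(${md5s_u[$i]})\r"
    done
    # Clear the progress line.
    echo -en "                                                                      \r"

    # Sort so that entries with equal hashes end up next to each other.
    local md5s=( $(printf '%s\n' "${md5s_u[@]}" | sort) )
    x=0
    i=0
    while [ $i -lt $((${#md5s[@]} - 1)) ]; do
        A=${md5s[$i]%%,*}
        for (( j=i+1; j<${#md5s[@]}; j++ )); do
            B=${md5s[$j]%%,*}
            if [ "$A" = "$B" ]; then
                # Same hash as entry i: expunge this copy and keep entry i.
                doveadm expunge -u "$1" mailbox "$2" uid "${md5s[$j]##*,}"
                x=$((x + 1))
            else
                break
            fi
        done
        echo -en " Expunged $x message(s) from $j/${#md5s[@]}\r"
        i=$j
    done
    echo ""
}

if [ $# -ne 2 ]; then
    echo "Usage: $0 <user> <mailbox>|-A" >&2
    exit 1
fi
if [ "$2" = "-A" ]; then
    # One mailbox name per line; mapfile (bash 4+) keeps names with spaces intact.
    mapfile -t boxes < <(doveadm mailbox list -u "$1")
else
    boxes[0]=$2
fi
for (( k=0; k<${#boxes[@]}; k++ )); do
    echo "${boxes[$k]}:"
    dedup_mailbox "$1" "${boxes[$k]}"
done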
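
Since the script expunges messages straight away, it may help to look at what would be removed before running it for real. The following is only a minimal dry-run sketch, not part of the attached script: it assumes the "hash,uid" pairs have already been collected, one per line, and it prints the doveadm expunge commands instead of executing them. The function name plan_expunge, the sample hashes and the address user@example.com are made up for illustration.

```shell
# Dry-run sketch: read "hash,uid" lines on stdin and print the doveadm
# command for every message whose hash was already seen in this mailbox.
# plan_expunge and its arguments are hypothetical names for illustration.
plan_expunge ()
{
    local user="$1" mailbox="$2"
    # Sort by hash, then numerically by UID, so that the lowest UID of
    # each group of equal hashes comes first and is the one that is kept.
    sort -t , -k1,1 -k2,2n |
    awk -F , -v user="$user" -v box="$mailbox" '
        NR > 1 && $1 == prev {
            printf "doveadm expunge -u %s mailbox \"%s\" uid %s\n", user, box, $2
            next
        }
        { prev = $1 }'
}

# Example with made-up hashes: UIDs 7 and 9 have the same hash as UID 3.
printf '%s\n' 'aaa,7' 'bbb,4' 'aaa,3' 'ccc,5' 'aaa,9' |
    plan_expunge user@example.com INBOX
```

With the sample input this prints two expunge commands, for UIDs 7 and 9, and keeps UID 3. Unlike the script above, the sketch sorts numerically by UID within each hash, so the copy that survives is always the one with the lowest UID; that is a design choice of the sketch, not something the original script guarantees.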
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dedup
Type: text/x-shellscript
Size: 1538 bytes
Desc: not available
URL: <http://dovecot.org/pipermail/dovecot/attachments/20130430/35f88a5e/attachment.bin>