Dear Dovecot people,
I set up Dovecot replication a couple of years ago. I watch the general server health and have Nagios check doveadm replicator status regularly. I can see accounts replicating, and disk space usage shows that things work in principle. However, I still wonder whether the replication actually works properly. The accounts I check work, but 1-5% of the messages could easily be missing without me noticing. These servers have, I think, about 50K accounts (a lot of them dormant) and between 1 and 2 TB of mail.
I'm trying to get more confidence that replication is actually working properly and that I'm not missing anything that will burn me if I ever have to fall back. Has anyone ever done some verification beyond simply watching the doveadm replicator status, to see whether anything is missing?
E.g., I could imagine a process that generates a list of accounts and then generates hashes on both sides of the replication for each mailbox folder. If the hashes match, the folder is removed from the list, and once all of an account's folders are removed, the account is removed too. Iterating over the account list should eventually reduce it to 0, or to a few extremely high-traffic accounts that can be checked manually or ignored.
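A minimal sketch of the per-folder comparison described above, assuming each side has already dumped one line per message into a file, with the folder identifier as the first space-separated field. The function names folder_hashes and differing_folders are hypothetical and not part of Dovecot.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Dovecot).
# Input format assumed: one line per message, first field identifies the folder.

# Read message lines on stdin, print "<folder-id> <md5>" for each folder.
folder_hashes() {
  input=$(sort)
  for guid in $(printf '%s\n' "$input" | cut -d' ' -f1 | sort -u); do
    hash=$(printf '%s\n' "$input" | awk -v g="$guid" '$1 == g' | md5sum | cut -d' ' -f1)
    printf '%s %s\n' "$guid" "$hash"
  done | sort
}

# Args: message dump from the primary, message dump from the fallback.
# Prints the folder ids whose hashes differ or that exist on one side only.
differing_folders() {
  tmpdir=$(mktemp -d)
  folder_hashes < "$1" > "$tmpdir/primary"
  folder_hashes < "$2" > "$tmpdir/fallback"
  # comm -3 keeps only lines unique to one side; the second column is tab-indented
  comm -3 "$tmpdir/primary" "$tmpdir/fallback" | tr -d '\t' | cut -d' ' -f1 | sort -u
  rm -rf "$tmpdir"
}
```

The two dump files could, for instance, hold the output of a doveadm fetch for one account on each server. An empty result would mean every folder of that account matches.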
But I'm trying to avoid reinventing the wheel. Has anyone done anything like this, or can suggest a different approach?
With regards, Arnold Hendriks
For anyone who needs this in the future: the approach I came up with was to count messages and compare sizes for each account. It won't protect me against data corruption, but it gives me reasonable confidence that the sync is working.
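When two accounts differ, a literal count and size total can show how far apart they are. This is only a sketch: it assumes the input consists of key=value lines containing a size.virtual=<bytes> field, which is what I would expect from doveadm's flow formatter, and summarize_account is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper (not part of Dovecot).
# Reads one line per message on stdin, each assumed to contain a
# "size.virtual=<bytes>" field, and prints "<message count> <total bytes>".
summarize_account() {
  awk '{
    for (i = 1; i <= NF; i++)
      if (index($i, "size.virtual=") == 1) {
        count++
        total += substr($i, 14)   # skip the 13-character "size.virtual=" prefix
      }
  }
  END { printf "%d %.0f\n", count, total }'
}
```

Running it on both servers for the same account and comparing the two output lines shows whether messages are missing or merely differ in size.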
Two server scripts:
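list-users.sh:

#!/bin/bash
# Print every user known to Dovecot, one per line.
doveadm user "*"

get-mailbox-hashes.sh:

#!/bin/bash
# Read account names on stdin; print "<account> <md5 of its sorted message list>".
while read account; do
  echo "$account $(doveadm -f flow fetch -u "$account" "mailbox-guid uid size.virtual" ALL|sort|md5sum)"
done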
And a script that drives this, looping until the list of differing accounts is back to 0:

#!/bin/bash
PRIMARY=$1
FALLBACK=$2

echo "$(date +%H:%M:%S) Requesting initial userlist"
ssh "$PRIMARY" /opt/container/list-users.sh | sort | tee /tmp/initialuserlist > /tmp/userlist

while true; do
  # echo strips any whitespace padding from the wc output
  NUMENTRIES=$(echo $(cat /tmp/userlist|wc -l))
  echo "$(date +%H:%M:%S) Userlist is now $NUMENTRIES entries long (see /tmp/userlist)"

  if [ "$NUMENTRIES" == "0" ]; then
    echo "$(date +%H:%M:%S) DONE! All are (finally?) in sync"
    exit 0
  fi

  # Hash the remaining accounts on both servers in parallel
  echo "$(date +%H:%M:%S) Hashing users..."
  cat /tmp/userlist | ssh "$PRIMARY" /opt/container/get-mailbox-hashes.sh | sort > /tmp/accounthashes-primary &
  cat /tmp/userlist | ssh "$FALLBACK" /opt/container/get-mailbox-hashes.sh | sort > /tmp/accounthashes-fallback &

  wait %1 %2
  # Keep only the accounts whose hash lines differ between the two servers
  echo "$(date +%H:%M:%S) Comparing users..."
  comm -3 /tmp/accounthashes-primary /tmp/accounthashes-fallback|sed -e 's/^\t*//'|cut -d' ' -f1|sort -u> /tmp/userlist
done