Hi Steffen,
See my responses below,
On 04/06/14 18:52, Steffen Kaiser wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Wed, 4 Jun 2014, Bruce wrote:
Please see my responses below, I dislike top posting anyway :-)
Yeah, its habit as I deal with people who top post and reply with 'Your email had nothing in it' if I don't say there is more below.
(/mnt/spool/keepers/h/XXXXX/hamish) userid = 7053 (7053) - uidlist = 26624 Jun 3 11:38:51 brio dovecot: Dovecot for hamish@XXXXX finished, uidlist now = 26624 who is user 26624? Is the uid valid at all? If it is invalid, are there other files owned by this uid? Maybe only one of your NFS server has
Jun 3 11:38:51 brio dovecot: Dovecot postlogin.sh running as hamish@XXXXX this uid in its /etc/passwd? Is user "hamish" shared to another user somehow, either via symlinks, ACLs, ...? UID 26624 is a valid user ('info' of domain14552) but under a compleltly different domain name (hamish is under domain25367). However that user has also not logged in around the time the ownership was changed. There is no relevance to the two users, except that they exist on the system and for some reason this issue happened to the hamish user. That means: no sharing, right ?
Correct.
The NFS server does not know about the UIDs, it just provides the numeric IDs which is translated on the dovecot/exim servers by NSS and Dovecotr using the replicated MySQL database. Additionally both users have existed for some time and the databases are in sync. Customers also do not have any access to the file system so there will be no symlinks in place.
Its also not a single server that we are seeing the issue on, it maybe one Dovecot server accessing one NFS server. Then the next time its a different Dovecot server accessing a different NFS server. Because it's NFS, any server accessing this NFS export may alter the ownership, because I still do not believe it's a Dovecot IMAP/POP issue ;-)
But we would see this with any application and file that is on NFS, in this case it is only ever one particular file, and only happened after changing to Dovecot (Previous to that we have had this NFS set up in place for 7 years).
Nothing besides Courier being replaced by Dovecot has changed in the server setup (although I could be wrong there, but we are going through one component at a time and until this issue is resolved we are not moving onto the next), and the only file which is being modified is a file which only Dovecot maintains. Could there be Courier-related left-over scripts running?
Its highly unlikely, and why would a courier script want to modify the ownership of a Dovecot file.
There are hourly backups which do an rsync to another server in case of hardware failure, there are scripts which move mailboxes between NFS servers but they show up in logs. Exim has no need to touch a dovecot controlled file, and when it writes mail into the maildir its writing as the correct user. How does exim deliver the messages to the user storage? Via Dovecot-Deliver or directly. I interprete your sentence, that the messages are dropped directly. 00
Exim does it directly, once again no need for it to touch a file which Dovecot creates and maintains.
possibly something we have done in the config (which I posted in my first email). Is there a reliable way to run a script directly when a dovecot session starts and finishes so we could output the ownership before and after which may also help eliminate the session itself. Your post-login script does exactly that. That leaves the message delivery, if you do so via Dovecot deliver. LMTP seems to be off.
I thought that's what should happen but looking at timestamps of the logins after the change, the permission denied always comes after the postlogin script completes, which means something happens after then.
Jun 3 12:56:31 gettas dovecot: pop3(celia@xxx): Mailbox init failed top=0 (0 b), retr=0 (0 b), messages=0 (0 b), del=0 Jun 3 12:56:32 gettas dovecot: pop3-login: Login: user=<celia@xxx>, method=PLAIN, rip=x.x.x.x, lip=x.x.x.x, mpid=19655, TLS, session=<r2qX++P6gAB6PWyl> Jun 3 12:56:32 gettas dovecot: Dovecot postlogin.sh running as celia@xxx/celia) userid = 41410 (41410) - uidlist = 7903 Jun 3 12:56:32 gettas dovecot: Dovecot for celia@xxx finished, uidlist now = 7903 <---- POSTLOGIN.SH ENDS HERE Jun 3 12:56:32 gettas dovecot: pop3(celia@xxx/celia/Maildir/dovecot-uidlist) failed: Permission denied Jun 3 12:56:32 gettas dovecot: pop3(celia@xxx/celia/Maildir/dovecot-uidlist) failed: Permission denied Jun 3 12:56:32 gettas dovecot: pop3(celia@xxx): Error: Couldn't init INBOX: Internal error occurred. Refer to server log for more information. [2014-06-03 12:56:32] Jun 3 12:56:32 gettas dovecot: pop3(celia@xxx): Mailbox init failed top=0 (0 b), retr=0 (0 b), messages=0 (0 b), del=0
I could put this down to same microsecond and syslog recording in a different order, but its consistently in this order.
I'm not sure if inotify works via NFS as Peter suggested, but that would my next idea as well. Let inotify wait for changes on that file, then fire up a script to capture "ps gaux" and maybe more.
If you have SELinux running, maybe you can trigger a warning there? I don't know much about SELinux.
We don't have SELinux running, and unfortunately we are talking over 400,000 mailboxes (200,000 domains) which this could happen randomly too and adding inotify requires a file handle to be opened per file, which would make the servers unstable.
I have found a piece of code in the Dovecot source which I'm wondering about, so I am going to compile up a custom version and see if it continues to happen. Of course with it happening randomly being able to confirm its still happening is going to take time. So if anyone has any other ideas where to look I would be appreciative.
Cheers, Bruce