David Lee wrote:
On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email.
From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
So it would be both helpful, and good for resilience against such real OS/nscd bugs (and similar), if there were a configuration option in dovecot to allow a site admin to tell deliver to use a temporary, 4xx, failure instead (if the circumstances were appropriate for the site).
Could this be considered please, Timo?
I wrote the nscd that's used on Solaris back in 1995. If the Fedora release's nscd is just bungling the lookup, no work-around is possible and you need to disable at least the passwd cache in nscd, if that's possible. On the other hand, are you sure this isn't an intermittent NIS server issue?
The question of what a program should do when the name service isn't actually responding, on the other hand, is tricky, whether that program is nscd, postfix, or dovecot. The right answer depends on the consequences of failure and on what info you can get back from the name service.
Obviously, if getpwnam_r() could be convinced to return EAGAIN when one of the name services was not responding, this would be a GOOD THING, since it would map directly to a TEMPFAIL. However, there are other system services that fail miserably when the user's account info isn't available, so for those, hanging until the NIS server recovers is a better choice.
[The hard thing about distributed systems is always failure semantics.]
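To make the getpwnam_r() point concrete, here is a minimal sketch of the mapping I have in mind. It assumes the POSIX contract (return 0 with a NULL result for "no such user", non-zero for a lookup error) and the exit codes from <sysexits.h>; the function names are made up for illustration and are not anything in dovecot. Note that it only helps if the name service actually reports an error. If nscd hands back a clean "not found" for a valid user, this code will still say EX_NOUSER.

```c
#include <errno.h>
#include <pwd.h>
#include <stddef.h>
#include <sysexits.h>

/* Hypothetical helper: turn the outcome of a getpwnam_r() call into a
 * delivery exit code. rc is getpwnam_r()'s return value; found is
 * non-zero when the result pointer was set to a passwd entry. */
int lookup_exit_code(int rc, int found)
{
    if (rc == 0)
        return found ? EX_OK : EX_NOUSER;  /* definitive answer */
    /* Any lookup error (EAGAIN, EIO, even ERANGE from a short buffer)
     * is treated as temporary so the MTA re-queues the message. */
    return EX_TEMPFAIL;
}

/* Hypothetical wrapper: look up a user and classify the result. */
int user_lookup_status(const char *name)
{
    struct passwd pw;
    struct passwd *result = NULL;
    char buf[16384];  /* assumed large enough for one passwd entry */

    int rc = getpwnam_r(name, &pw, buf, sizeof(buf), &result);
    return lookup_exit_code(rc, result != NULL);
}
```

A delivery agent would exit with user_lookup_status(recipient) when the lookup does not come back EX_OK, so only a definitive "no such user" becomes a permanent failure.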
Absent tunable nscd failure semantics, I suggest that the following may be useful alternatives for intermittent NIS server problems:
construct a redundant NIS architecture with additional slave NIS servers that fail over... this is what we use internally at Sun w/ varying degrees of success.
ypcat the passwd map periodically and map it into a local passwd file. Some script smarts are required to avoid hideous problems if you get a truncated passwd map... this is quite robust if done correctly. I'm one of the odd folks who has their mail delivered to their desktop; I keep a copy of my passwd entry on the local machine so I don't lose mail if the NIS server craps out again.
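For the second alternative, the "smarts" mostly come down to refusing a dump that looks truncated. A rough sketch of such a check follows, assuming the ypcat output has already been saved to a file. The function names, the 4096-byte line limit, and the minimum-entry threshold are my own illustrative choices, not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if line looks like a complete passwd entry: newline-terminated,
 * non-empty user name, and exactly 6 colons (7 fields). */
int passwd_line_ok(const char *line)
{
    size_t len = strlen(line);
    int colons = 0;

    if (len == 0 || line[len - 1] != '\n' || line[0] == ':')
        return 0;
    for (size_t i = 0; i < len; i++)
        if (line[i] == ':')
            colons++;
    return colons == 6;
}

/* Return 1 if the dump at path can be read, every line is a complete
 * entry, and there are at least min_entries of them. Over-long lines are
 * rejected because fgets() returns them without a trailing newline. */
int passwd_dump_ok(const char *path, long min_entries)
{
    char line[4096];
    long entries = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (!passwd_line_ok(line)) {
            fclose(fp);
            return 0;
        }
        entries++;
    }
    int read_error = ferror(fp);
    fclose(fp);
    return !read_error && entries >= min_entries;
}
```

Only when passwd_dump_ok() returns 1 should the new dump replace the previous local copy, preferably with rename() so readers never see a half-written file. Setting min_entries to some fraction of the last known-good count is one way to catch a map that was cut short on a line boundary.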
- Bart