[Dovecot] resilience suggestion
On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email.
From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
So it would be both helpful, and good for resilience against such real OS/nscd bugs (and similar), if there were a configuration option in dovecot to allow a site admin to tell deliver to use a temporary, 4xx, failure instead (if the circumstances were appropriate for the site).
Could this be considered please, Timo?
--
: David Lee I.T. Service : : Senior Systems Programmer Computer Centre : : UNIX Team Leader Durham University : : South Road : : http://www.dur.ac.uk/t.d.lee/ Durham DH1 3LE : : Phone: +44 191 334 2752 U.K. :
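[Editorial sketch] The behaviour requested above can be illustrated in a few lines. This is not Dovecot code (deliver is written in C); it is a minimal Python sketch of the proposed semantics. The option name `soft_fail` is hypothetical, and the numeric values are the standard ones from BSD `<sysexits.h>`.

```python
# Exit statuses as defined in BSD <sysexits.h>.
EX_OK = 0
EX_NOUSER = 67
EX_TEMPFAIL = 75


def final_status(ret: int, soft_fail: bool = False) -> int:
    """Return the status a delivery agent would hand back to the MTA.

    `soft_fail` stands in for the proposed site option (hypothetical name).
    It is off by default, so the default behaviour is unchanged.
    """
    if ret == EX_OK or not soft_fail:
        return ret       # success, or option not enabled: pass through
    return EX_TEMPFAIL   # any failure becomes "try again later"
```

With the option off, `final_status(EX_NOUSER)` stays a permanent failure; with it on, the MTA sees a temporary failure and keeps the message queued for a later attempt.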
On Fri, 2007-02-09 at 10:38 +0000, David Lee wrote:
On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email.
From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
So it would be both helpful, and good for resilience against such real OS/nscd bugs (and similar), if there were a configuration option in dovecot to allow a site admin to tell deliver to use a temporary, 4xx, failure instead (if the circumstances were appropriate for the site).
Having been hit by numerous nscd problems myself, across many applications, I'll just throw this in:
- nscd is to be avoided whenever possible
- if nscd is broken, complain to the vendor, or better:
- fix bugs in the right place
A few excerpts from a discussion about nscd on the postfix ML some time ago about exactly the same problem (postfix not finding recipients due to nscd delivering bad information):
"nscd is a crappy piece of software that is unstable and frequently corrupts information."
"Most of the work is identifying the right problem. Much effort goes to waste solving the wrong one."
Just my 2¢ ...
-- Udo Rader
bestsolution.at EDV Systemhaus GmbH http://www.bestsolution.at
David Lee wrote:
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email.
From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
We don't use "deliver" (just use Exim) but we build a static passwd-file userdb from NIS overnight and use PAM for authentication (via pam_ldap to Active Directory, but it works with pam_unix too). We did this for a performance boost as Dovecot then caches the userdb, rather than having to wait for a NIS lookup each time, but I'd expect it to iron out problems with deliver/nscd as well. While the passwords could change any time, userdb information generally doesn't change that often, and it only takes a few seconds to rebuild manually if a new user has to be added quickly.
Chris
--
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK, Fax: +44 (0)118 975 3094
Quoting David Lee:
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
I cannot speak for Timo, but if this means mail for non-existing users will stay in the MTA's queue until it times out, it's definitely wrong.
Apart from that, nscd is one of the first things I remove from freshly installed systems, because it's plain crap and caused only problems in the past. Even though I can hardly believe they do not get even such things right.
On Fri, 9 Feb 2007, Jakob Hirsch wrote:
Quoting David Lee:
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
I cannot speak for Timo, but if this means mail for non-existing users will stay in the MTA's queue until it times out, it's definitely wrong.
It may, indeed, be wrong for many sites. But that's why I had also said: "option ... if the circumstances were appropriate for the site"
However, at many _other_ sites, the outlying email infrastructure may ensure that only email for existing users gets anywhere near the dovecot machine in the first place. That is, by the time the email gets to "deliver", one is 99.99% certain that the user is a valid, existing one. (Furthermore, routine monitoring of the MTA queue can draw the sys.admin's attention to any email (should usually be close to zero at such sites) that has been sitting there for too long... a problem of some sort which requires attention anyway... and by my proposed "EX_TEMPFAIL" at least this possibly valid email has not been rejected.)
Apart from that, nscd is one of the first things I remove from freshly installed systems, because it's plain crap and caused only problems in the past. Even though I can hardly believe they do not get even such things right.
Agreed!! But we have a totally different PAM authentication problem (way outside email protocols) in the FC series which seems to require us to have "nscd" positively running. (Grumble). We're addressing that, too.
Meanwhile, for some sites, in some circumstances, "return EX_TEMPFAIL" may be more appropriate than a "5xx", to have available as an option.
Hope that helps.
--
: David Lee I.T. Service : : Senior Systems Programmer Computer Centre : : UNIX Team Leader Durham University : : South Road : : http://www.dur.ac.uk/t.d.lee/ Durham DH1 3LE : : Phone: +44 191 334 2752 U.K. :
On Fri, 2007-02-09 at 13:02 +0000, David Lee wrote:
On Fri, 9 Feb 2007, Jakob Hirsch wrote:
Quoting David Lee:
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
I cannot speak for Timo, but if this means mail for non-existing users will stay in the MTA's queue until it times out, it's definitely wrong.
It may, indeed, be wrong for many sites. But that's why I had also said: "option ... if the circumstances were appropriate for the site"
But why then modify dovecot instead of fixing nscd? nscd is as open source as dovecot, so I don't see any advantage here.
-- Udo Rader
bestsolution.at EDV Systemhaus GmbH http://www.bestsolution.at
On Fri, 2007-02-09 at 13:02 +0000, David Lee wrote:
However, at many _other_ sites, the outlying email infrastructure may ensure that only email for existing users gets anywhere near the dovecot machine in the first place. That is, by the time the email gets to "deliver", one is 99.99% certain that the user is a valid, existing one. (Furthermore, routine monitoring of the MTA queue can draw the sys.admin's attention to any email (should usually be close to zero at such sites) that has been sitting there for too long... a problem of some sort which requires attention anyway... and by my proposed "EX_TEMPFAIL" at least this possibly valid email has not been rejected.)
I think the short answer for this is that your MTA configuration is broken, since it should accept mail only for existing users in the first place. If this is an nscd bug it should be fixed there, not by introducing some hack into deliver.
Meanwhile, for some sites, in some circumstances, "return EX_TEMPFAIL" may be more appropriate than a "5xx", to have available as an option.
I think it would be a nice addition to prevent misconfiguration or temporary outages of other components (e.g. database backends) from causing legitimate mail loss while testing mail setups, but I don't think it should be provided as a means to solve the problem described.
ciao
Luca
On Fri, 9 Feb 2007, David Lee wrote:
Besides any discussion, I do support an option "--tempfail" for deliver to degrade any error into a tempfail. Maybe, selectively by individual error. (In fact, for an older version of Dovecot I posted such a patch already.)
BTW: This should also be possible with a shell script mangling the exit code, shouldn't it?
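[Editorial sketch] The wrapper idea can be sketched without touching deliver at all. The example below uses Python rather than shell, and the function name `run_degraded` and its parameters are made up for illustration. It runs an arbitrary command, feeds it the message on stdin, and reports selected exit statuses as EX_TEMPFAIL. Treating death by signal as temporary is an assumption of this sketch, not something stated in the thread.

```python
import subprocess
import sys

EX_NOUSER = 67     # <sysexits.h>: addressee unknown
EX_TEMPFAIL = 75   # <sysexits.h>: temporary failure, retry later


def run_degraded(argv, message=None, degrade=frozenset({EX_NOUSER})):
    """Run `argv` with `message` (bytes) on stdin and return an exit status.

    Statuses listed in `degrade` are reported as EX_TEMPFAIL; every other
    status is passed through unchanged.
    """
    proc = subprocess.run(argv, input=message)
    if proc.returncode < 0:
        # The child was killed by a signal; this sketch assumes that is
        # worth a retry rather than a bounce.
        return EX_TEMPFAIL
    if proc.returncode in degrade:
        return EX_TEMPFAIL
    return proc.returncode
```

A real wrapper would end with something like `sys.exit(run_degraded(command, sys.stdin.buffer.read()))`, where `command` is the site's own deliver invocation. The `degrade` set gives the "selectively by individual error" behaviour mentioned above.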
Bye,
Steffen Kaiser
On Fri, 2007-02-09 at 14:44 +0100, Steffen Kaiser wrote:
Besides any discussion, I do support an option "--tempfail" for deliver to degrade any error into a tempfail. Maybe, selectively by individual error. (In fact, for an older version of Dovecot I posted such a patch already.)
I second that; postfix already allows you to return 4xx instead of 5xx with the soft_bounce configuration option. But to stress this once more, this feature is not enabled by default and the manpage for postconf(5) says:
soft_bounce (default: no)
Safety net to keep mail queued that would otherwise be returned to the sender. This parameter disables locally-generated bounces, and prevents the Postfix SMTP server from rejecting mail permanently, by changing 5xx reply codes into 4xx. However, soft_bounce is no cure for address rewriting mistakes or mail routing mistakes.
So this should be used just to avoid mistakes while testing big changes in your setup without causing disasters.
ciao
Luca
David Lee wrote:
On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email.
From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
So it would be both helpful, and good for resilience against such real OS/nscd bugs (and similar), if there were a configuration option in dovecot to allow a site admin to tell deliver to use a temporary, 4xx, failure instead (if the circumstances were appropriate for the site).
Could this be considered please, Timo?
I wrote the nscd that's used on Solaris back in 1995. If the Fedora release's nscd is just bungling the lookup, no work-around is possible and you need to disable at least the passwd cache in the nscd if that's possible. On the other hand, are you sure this isn't an intermittent NIS server issue?
The problem about what a program should do if the name service isn't actually responding on the other hand, is tricky, whether that program is the nscd or postfix or dovecot. The right answer depends on the consequences of failure and what info you can get back from the name service.
Obviously, if getpwnam_r() could be convinced to return EAGAIN if one of the name services was not responding, this would be a GOOD THING, since this would map directly to a TEMPFAIL. However, there are other system services that fail miserably when the user's account info isn't available, so for those, hanging until the NIS server recovers is a better choice.
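[Editorial sketch] The mapping described here can be written down as a small classifier. It assumes the POSIX convention for getpwnam_r(): a return value of 0 with no entry stored means "no such user", and a non-zero return value means the lookup itself failed. The function name and the Python form are illustrative only. On platforms where the libc reports a failed lookup as "not found", this distinction is lost.

```python
# Exit statuses as defined in BSD <sysexits.h>.
EX_OK = 0
EX_NOUSER = 67
EX_TEMPFAIL = 75


def status_for_lookup(rc: int, found: bool) -> int:
    """Classify a getpwnam_r()-style result.

    rc    -- return value of the call; 0 means the lookup itself worked
    found -- whether an entry was stored through the result pointer
    """
    if rc != 0:
        # The name service could not answer (EAGAIN, EIO, ...), so we do
        # not know whether the user exists: ask the MTA to retry later.
        return EX_TEMPFAIL
    return EX_OK if found else EX_NOUSER
```

Only the `rc == 0`, not-found case is a definite "user unknown"; every error return is treated as temporary.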
[The hard thing about distributed systems is always failure semantics.]
Absent tunable nscd failure semantics, I suggest that the following may be useful alternatives for intermittent NIS server problems:
- construct a redundant NIS architecture with additional slave NIS servers that fail over... this is what we use internally at Sun w/ varying degrees of success.
- ypcat the passwd map periodically and map it into a local passwd file. Some script smarts are required to avoid hideous problems if you get a truncated passwd map... this is quite robust if done correctly. I'm one of the odd folks who has their mail delivered to their desktop; I keep a copy of my passwd entry in the local machine so I don't lose mail if the NIS server craps out again.
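[Editorial sketch] The "script smarts" for the second alternative might look roughly like this. It is a minimal sketch, not a tested production script: the function names, the 0.9 shrink threshold and the 0644 file mode are all assumptions. The idea is to refuse a new map that is empty or much smaller than the current one, and to swap the file in atomically so readers never see a partial file.

```python
import os
import tempfile


def parse_passwd(text):
    """Return {login: line} for well-formed 7-field passwd lines."""
    entries = {}
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) == 7 and fields[0]:
            entries[fields[0]] = line
    return entries


def install_map(new_text, dest, min_ratio=0.9):
    """Replace `dest` with `new_text` unless the new map looks truncated.

    Returns True if the file was replaced, False if the map was rejected.
    """
    new = parse_passwd(new_text)
    try:
        with open(dest) as fh:
            old = parse_passwd(fh.read())
    except FileNotFoundError:
        old = {}
    # Reject an empty map, or one that lost more than 10% of its entries.
    if not new or len(new) < min_ratio * len(old):
        return False
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(new.values()) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    os.chmod(tmp, 0o644)   # assumed mode for a world-readable passwd file
    os.replace(tmp, dest)  # atomic rename within the same directory
    return True
```

The input text would typically come from the output of `ypcat passwd`, captured by a periodic job. The destination path is site-specific.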
- Bart
David Lee wrote:
On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.
Careful there!
We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email. From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.
There are several problems with this approach, chiefly the plain blindness of many libc maintainers to this problem, regardless of whether the system has nsswitch or not. I filed NIS lookup bugs against GNU libc (not implementing TRYAGAIN=forever in nsswitch) and FreeBSD (timeout after a few minutes) literally years ago, without any tangible outcome. The GNU libc maintainer rejected the bug report as a whole; with FreeBSD it has fallen on deaf ears.
The other important concern for portable software such as dovecot is portability. On some systems, temporary failure of getpwnam() is indistinguishable from permanent failure, thus the only solution to this approach is Postfix's: implement a NIS lookup client to access the password database to circumvent the many libc bugs lurking there.
Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.
You don't need nscd for unstable NIS, as laid out above :-(
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
And lingers around in the queue for a week if an account has been terminated? Doesn't look like a 'solution' to me.
Best regards Matthias Andree
On Wed, 2007-02-14 at 17:05 +0100, Matthias Andree wrote:
The other important concern for portable software such as dovecot is portability. On some systems, temporary failure of getpwnam() is indistinguishable from permanent failure, thus the only solution to this approach is Postfix's: implement a NIS lookup client to access the password database to circumvent the many libc bugs lurking there.
This is interesting. I hadn't thought about this before. Is it really possible to write that portably? Guess I'll have to look at Postfix's implementation some day.
On Wed, 14 Feb 2007, Matthias Andree wrote:
[...] The other important concern for portable software such as dovecot is portability. On some systems, temporary failure of getpwnam() is indistinguishable from permanent failure, thus the only solution to this approach is Postfix's: implement a NIS lookup client to access the password database to circumvent the many libc bugs lurking there. [...]
That sort of supports my suggestion for an option, default off and which a site-admin would have to be explicit in turning on, to mark 'deliver' failures as EX_TEMPFAIL.
We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.
And lingers around in the queue for a week if an account has been terminated? Doesn't look like a 'solution' to me.
(I hope that I never used the word 'solution' to describe it! But it is a workaround for various real bugs in various real systems, which have had no real resolution despite some real-long timescales.)
This EX_TEMPFAIL would be an option, default off. If a site-admin explicitly, consciously, opted into this changed behaviour, that possible queue build-up would indeed be a factor to consider for such a (non-default) choice. For some sites, this might well be preferable to discarding valid email. Suppose, for instance, that rejecting valid email reflected adversely on the company's reputation and resulted in lost customers. Such a site would, one hopes, be monitoring its queues anyway for abnormally long residencies.
(An aside, interacting with another thread: Don't let this relatively minor suggestion delay the formalising of official version 1.0!)
--
: David Lee I.T. Service : : Senior Systems Programmer Computer Centre : : UNIX Team Leader Durham University : : South Road : : http://www.dur.ac.uk/t.d.lee/ Durham DH1 3LE : : Phone: +44 191 334 2752 U.K. :
Timo Sirainen wrote:
On Wed, 2007-02-14 at 17:12 +0000, David Lee wrote:
This EX_TEMPFAIL would be an option, default off.
The biggest problem I see with this is that there are already too many settings. I'd rather want to remove them, not add new ones which are used by one or two users..
Stuff a patch into some optional-patches directory, declare it unsupported and let those who care about it maintain it.
Best, Matthias Andree
David Lee wrote:
(An aside, interacting with another thread: Don't let this relatively minor suggestion delay the formalising of official version 1.0!)
Quite the contrary, let's defer 1.0 until it's really stable. There is enough commercial junkware called "stable" around, so why hurry a "formalis[ation]" rather than observe the incoming bug rate?
This is an open source project, and any hurries will come back and haunt the maintainers later or perhaps sooner. I can only encourage Timo to delay Dovecot 1.0 until the day it's ready.
While this is no scientific or transferable data point, we did that with bogofilter (deferred 1.0 until it was done) and needed few patch releases afterwards.
Best regards, Matthias Andree
participants (9)
- Bart Smaalders
- Chris Wakelin
- David Lee
- Jakob Hirsch
- Luca Corti
- Matthias Andree
- Steffen Kaiser
- Timo Sirainen
- Udo Rader