[Dovecot] pre-1.0.rc27: Index and mbox fixes
I've been doing some stress testing in the last few days and fixing all the errors I saw. I'm hoping that I've finally really fixed all the index/mbox problems. So, again I think the only thing left for v1.0 is the documentation. Please test this nightly snapshot before I release rc27 tomorrow:
http://dovecot.org/nightly/dovecot-latest.tar.gz
If you've seen any of the following error messages in log files, these should all be fixed now:
- index: Append with UID 56297, but next_uid = 56298
- mbox: UID inserted in the middle of mailbox /home/tss/mail/inbox (45769 > 45768, seq=28, idx_msgs=30)
- mbox: Expunged message reappeared in mailbox /home/tss/mail/inbox (UID 86957 < 86958, seq=28, idx_msgs=27)
These should be fixed for most cases, but they can still rarely happen:
- index: Extension reset without intro prefix
There's a really tiny possibility of it still happening with mmap_disable=no, but I think it's practically fixed now. A proper fix would require more changes than I want to do for v1.0 (either support for marking transaction start/end positions or getting rid of the stupid intro prefix completely).
- Cached message offset 38340 is invalid for mbox file /home/tss/mail/inbox
This should happen only if you're using in-memory indexes or mbox_min_index_size isn't 0. This happens if the mbox file's timestamp and file size are the exact same after some expunges+appends, so the only way to properly fix it would be to wait a second and update the mtime before unlocking the mbox file after expunges.
This sleep is actually already done when Dovecot notices that a) mtime and size really are the same after modification and b) mails have been moved around inside the file (i.e. flag changes were done, but there wasn't enough empty space in the mail's own headers, so mails had to be moved to get the needed space). Should be pretty rare.
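In code the idea is roughly this (a minimal sketch only, not Dovecot's actual code; the function name and its arguments are made up for illustration):

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

/* Hypothetical helper, called just before unlocking the mbox after a sync.
   If the file's mtime and size are identical to what they were before the
   modification, other readers can't tell that the file changed, so sleep
   past the current second and touch the mtime. */
static void mbox_fixup_mtime(const char *path, time_t orig_mtime,
                             off_t orig_size, int mails_were_moved)
{
        struct stat st;

        if (stat(path, &st) < 0)
                return;
        if (st.st_mtime != orig_mtime || st.st_size != orig_size)
                return; /* the change is already visible via mtime/size */
        if (!mails_were_moved)
                return; /* current behavior: only bother when offsets
                           inside the file actually changed */

        sleep(1);
        utime(path, NULL); /* sets mtime to the (now later) current time */
}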
Anyway, this error is completely invisible to users and it doesn't have any other side effects that mbox_dirty_syncs doesn't already have. The whole error message could even be removed if it's going to become a problem..
Other changes in the snapshot:
+ mail-log plugin: Log the mailbox name, except if it's INBOX
+ dovecot-auth: Added a lot more debug logging to passdbs and userdbs
+ dovecot-auth: Added %c variable which expands to "secured" with SSL/TLS/localhost.
+ dovecot-auth: Added %m variable which expands to auth mechanism name
- Wrong endianness index files still weren't silently rebuilt
- IMAP quota plugin: GETQUOTAROOT returned the mailbox name wrong if the namespace had a prefix or if its separator was non-default
On Sat, 2007-03-10 at 21:45 +0200, Timo Sirainen wrote:
I've been doing some stress testing in the last few days and fixing all the errors I saw. I'm hoping that I've finally really fixed all the index/mbox problems.
Promising. After running a bit less than 9 hours, zero errors with ext3+maildir and 10 concurrent imaptest clients:
Logi    List    Stat    Sele    Fetc    Fet2    Stor    Dele    Expu    Appe    Logo
3430404 1716719 1715506 3430403 3430401 4898510 1715205 1114566 3430394 1280678 6860566
There's something wrong with the logout numbers though since there are twice as many as logins :)
Timo Sirainen wrote:
Promising. After running a bit less than 9 hours, zero errors with ext3+maildir and 10 concurrent imaptest clients:
I think dP alluded to this question earlier but I'm not sure if I see a response -- the default setting for login procs is:
#login_process_per_connection = yes
...with all the notes about security and so forth. The question, though, is what's the best recommended practice for efficiency and speed? What I'm wondering about is that most clients, like Thunderbird, spawn multiple threads, which cause multiple logins. Typically 3-5 per user on TBird.
So would it be correct to say that 50 people logged in at once would have ~150 login-auth procs running in order to get work done? At what point is a small tradeoff in security worth the recovery of CPU and process space?
If one login proc is already running for the first thread 'user=tengel', do the subsequent threads re-use that same proc instead of launching another? Right now with min=3 and only myself logged in, I see 6 imap-login procs on the stack, and I'm worried that we'll start overloading the system.
Thanks for any input -- just looking at how to best optimize here.
-te
--
Troy Engel | Systems Engineer
Fluid Inc. | http://www.fluid.com
On Sun, 2007-03-11 at 09:51 -0700, Troy Engel wrote:
I think dP alluded to this question earlier but I'm not sure if I see a response -- the default setting for login procs is:
I think I replied with http://wiki.dovecot.org/LoginProcess
...with all the notes about security and so forth. The question, though, is what's the best recommended practice for efficiency and speed? What I'm wondering about is that most clients, like Thunderbird, spawn multiple threads, which cause multiple logins. Typically 3-5 per user on TBird.
So would it be correct to say that 50 people logged in at once would have ~150 login-auth procs running in order to get work done?
Right.
At what point is a small tradeoff in security worth the recovery of CPU and process space?
You can decide that. :) I don't think the login processes take all that much memory. As for CPU usage it looks like my machine can handle about 130 logins/second with login_process_per_connection=yes and 1200/sec with =no. You can try that yourself with my imaptest tool (see the mail I wrote last night) with parameters "- select=0".
If one login proc is already running for the first thread 'user=tengel', do the subsequent threads re-use that same proc instead of launching another?
No. One process per connection. There's really no way to know what user is logging in before he does it. Of course it could maybe be possible to move the SSL proxying to another login process of the same user, but that's probably a bit too much trouble.
Timo Sirainen wrote:
I think I replied with http://wiki.dovecot.org/LoginProcess
Ah! I missed that, thanks. :) Reading and absorbing...
-te
--
Troy Engel | Systems Engineer
Fluid Inc. | http://www.fluid.com
On Sun, 2007-03-11 at 09:51 -0700, Troy Engel wrote:
#login_process_per_connection = yes
...with all the notes about security and so forth. The question, though, is what's the best recommended practice for efficiency and speed?
I was just updating my master/config/logging rewrite code, which has been almost-working for the last two years or so. Then I had an idea:
The biggest performance problem with login_process_per_connection=yes is creating, initializing and destroying the process. login_process_per_connection=no's problem then is that the attacker can steal other users' passwords and TCP sessions (if there ever is a security hole in the login process code or in OpenSSL).
If a login process is handling only a single connection at a time that takes care of the TCP session spying/stealing. Then if all the memory allocations are always cleared before freeing the memory (or old memory when realloc()ing), that takes care of the password stealing as well, since there should be nothing important left in the memory from the previous user's session.
I can think of two potential problems: the stack and the memory allocated by libraries. Luckily OpenSSL can be told to use different memory allocation functions, and I don't think there are any other library calls made that could leave anything important into memory. As for stack, I don't think I ever store anything important there.
Benchmarks show that this works pretty nicely. I get around 1350-1400 connections/second. Using multiple connections/process gives 1450-1500 connections/second. Cleaning the memory vs. not cleaning it doesn't seem to make any noticeable performance difference.
Hmm. Except there's also Dovecot's data stack which I almost forgot. I'll add a new data_stack_clean() function to be called whenever a client is deinitialized.
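To illustrate the memory-clearing part, here's a minimal sketch of zero-before-free allocation wrappers. The names and the size-header trick are mine, not Dovecot's; this only shows the principle that nothing from the previous user's session stays behind in freed memory (alignment subtleties and compilers optimizing the memset away are ignored here):

#include <stdlib.h>
#include <string.h>

/* Store the block size in a small header so free/realloc can clear the old
   contents before the memory is reused. */
struct zhdr { size_t size; };

void *zmalloc(size_t size)
{
        struct zhdr *h = malloc(sizeof(*h) + size);

        if (h == NULL)
                return NULL;
        h->size = size;
        return h + 1;
}

void zfree(void *p)
{
        if (p != NULL) {
                struct zhdr *h = (struct zhdr *)p - 1;

                memset(h, 0, sizeof(*h) + h->size); /* wipe before freeing */
                free(h);
        }
}

void *zrealloc(void *p, size_t new_size)
{
        void *n = zmalloc(new_size);

        if (n == NULL)
                return NULL; /* old block stays intact, as with realloc() */
        if (p != NULL) {
                struct zhdr *h = (struct zhdr *)p - 1;

                memcpy(n, p, h->size < new_size ? h->size : new_size);
                zfree(p); /* the old contents get wiped here */
        }
        return n;
}

OpenSSL can then be pointed at wrappers like these through its hook for replacing the memory allocation functions, as mentioned above.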
The benchmarks are with the rewritten code. The same behavior can be achieved with v1.0 by setting login_max_connections = 1, although only CVS HEAD has the cleanup code. Looks like I can get only 1150-1200 connections with CVS HEAD, both with login_max_connections=1 and with a higher value. So it looks like my rewrite is actually faster even though it requires doing UNIX socket connections to the log and config processes.
Oh, and as for when the rewrite will be in CVS, I'm not really sure. I'm still wondering if I should put it there soon and make the release after v1.0 be v2.0, or if I should first stabilize the current new CVS HEAD features (which shouldn't take too long), do a v1.2 release and only then put this code into CVS and make it v2.0..
Timo Sirainen wrote:
I've been doing some stress testing in the last few days and fixing all the errors I saw. I'm hoping that I've finally really fixed all the index/mbox problems. So, again I think the only thing left for v1.0 is the documentation. Please test this nightly snapshot before I release rc27 tomorrow:
I'm getting crashes from dovecot-auth every time
#0  0xff0cf600 in ?? ()
(gdb) bt
#0  0xff0cf600 in ?? ()
#1  0x1ec4c in get_uid (user=0x5bc08 "/opt/RDGdovect/etc/dovecot-ldap.conf") at main.c:75
#2  0x246e8 in checkpassword_request_half_finish (request=0x5bc30) at passdb-checkpassword.c:99
#3  0x2328c in passdb_blocking_lookup_credentials (request=0x0) at passdb-blocking.c:158
#4  0x17648 in auth_cache_parse_key (pool=0x5b838, query=0x51000 "") at auth-cache.c:44
#5  0x1ff0c in mech_cram_md5_auth_initial (auth_request=0x59fa8, data=0x51400 "", data_size=0) at mech-cram-md5.c:166
#6  0x2024c in verify_credentials (request=0x0, credentials=0xffbefb54 "ÿ¾üX") at mech-digest-md5.c:180
The really odd thing is that I'm not using LDAP this time (the authentication section for this is commented out and doesn't show up in "dovecot -n").
Chris
P.S. I compiled imaptest10.c OK, but can't use it because of this!
--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On Sun, 2007-03-11 at 17:04 +0000, Chris Wakelin wrote:
I'm getting crashes from dovecot-auth every time
#0  0xff0cf600 in ?? ()
(gdb) bt
#0  0xff0cf600 in ?? ()
#1  0x1ec4c in get_uid (user=0x5bc08 "/opt/RDGdovect/etc/dovecot-ldap.conf") at main.c:75
#2  0x246e8 in checkpassword_request_half_finish (request=0x5bc30) at passdb-checkpassword.c:99
#3  0x2328c in passdb_blocking_lookup_credentials (request=0x0) at passdb-blocking.c:158
#4  0x17648 in auth_cache_parse_key (pool=0x5b838, query=0x51000 "") at auth-cache.c:44
#5  0x1ff0c in mech_cram_md5_auth_initial (auth_request=0x59fa8, data=0x51400 "", data_size=0) at mech-cram-md5.c:166
#6  0x2024c in verify_credentials (request=0x0, credentials=0xffbefb54 "ÿ¾üX") at mech-digest-md5.c:180
It crashes at startup? That backtrace is completely corrupted. Are you sure the core file is the right one, and that you're gdbing it against the same dovecot-auth that produced it?
If you can't get a usable backtrace, could you show me dovecot -n output so I can try to reproduce the crash.
Timo Sirainen wrote:
It crashes at startup? That backtrace is completely corrupted. Are you sure the core file is the right one, and that you're gdbing it against the same dovecot-auth that produced it?
If you can't get a usable backtrace, could you show me dovecot -n output so I can try to reproduce the crash.
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Chris
#0  0xfeeb3218 in ?? ()
#1  0xfef08694 in ?? ()
#2  0x3c878 in FinalPermuteMap ()
#3  0x1ca74 in verify_plain_callback (result=360680, request=0xffbef5dc) at auth-worker-client.c:104
#4  0x1cb60 in auth_worker_handle_passv (client=0xf20f8, id=266560, args=0x41178) at auth-worker-client.c:139
#5  0x21614 in mech_ntlm_auth_free (request=0xf20f8) at mech-ntlm.c:263
#6  0x2729c in password_generate_ntlm ()
#7  0x1b518 in get_log_str (auth_request=0xf20f8, subsystem=0x580c8 "password", format=0x22300 "\220\020", va=0x39393030) at auth-request.c:881
#8  0x2246c in mech_rpa_auth_continue (auth_request=0xf20f8, data=0x580b0 "", data_size=18) at mech-rpa.c:557
#9  0x220a4 in mech_rpa_build_token2 (request=0xf20f8, size=0x580b0) at mech-rpa.c:404
#10 0x1ad1c in is_ip_in_network (network=0xf20f8 "", ip=0x580b0) at auth-request.c:645
#11 0x1d530 in auth_worker_create () at auth-worker-server.c:84
#12 0x18ce4 in master_output (context=0x89d30) at auth-master-connection.c:213
#13 0x18ec0 in auth_master_connection_destroy (_conn=0x89d30) at auth-master-connection.c:276
#14 0x34044 in io_stream_copy (outstream=0x0, instream=0x60e98, in_size=4) at ostream-file.c:525
#15 0x33864 in get_unused_space (fstream=0x60e98) at ostream-file.c:258
#16 0x21efc in mech_rpa_build_token2 (request=0x0, size=0xffbefb6c) at mech-rpa.c:367

--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
- ntlm, rpa functions being called, while you most likely didn't use NTLM or RPA auth mechanisms
- *_free() and *_destroy() functions in the middle of the backtrace
Timo Sirainen wrote:
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
(gdb) bt
#0 0xfeeb3218 in strlen () from /usr/lib/libc.so.1
#1 0xfef06520 in _doprnt () from /usr/lib/libc.so.1
#2 0xfef08694 in vsnprintf () from /usr/lib/libc.so.1
#3 0x3c878 in str_vprintfa (str=0x580e8, fmt=0x41178 "no passwd file: %s",
args=0xffbef5f0) at str.c:118
#4 0x1ca74 in get_log_str (auth_request=0x580e8,
subsystem=0x58188 "134.225.251.5", format=0x41178 "no passwd file: %s",
va=0xffbef5f0) at auth-request.c:1121
#5 0x1cb60 in auth_request_log_info (auth_request=0xf20f8,
subsystem=0x41140 "passwd-file", format=0x41178 "no passwd file: %s")
at auth-request.c:1152
#6 0x21614 in db_passwd_file_lookup (db=0x5ffb0, request=0xf20f8)
at db-passwd-file.c:416
#7 0x27298 in passwd_file_verify_plain (request=0xf20f8,
password=0x580c8 "password",
callback=0x1b320
and here's (more or less) the auth sections from "dovecot -n"
auth default:
  mechanisms: plain login
  username_translation: AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
  passdb:
    driver: passwd-file
    args: /export/mailconfig/%u/migrating
    deny: yes
  passdb:
    driver: passwd-file
    args: /export/mailconfig/%u/proxy.%Ls
  passdb:
    driver: pam
  userdb:
    driver: passwd-file
    args: /export/mailconfig/%u/proxy.%Ls
  userdb:
    driver: passwd-file
    args: /usr/local/dovecot/etc/userdb
and the migrating and proxy.imap/proxy.pop3 files didn't exist in this case (but may do for other users).
Best Wishes, Chris
--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On Sun, 2007-03-11 at 20:54 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
OK this one's actually the right core file *and* the right binary (the other was from dovecot-1.0-beta7.)
Yes, this one was correct :) This should fix it: http://dovecot.org/list/dovecot-cvs/2007-March/008096.html
I'll also build a new snapshot. I guess I'll delay rc27 until tomorrow anyway.
On Sun 11 Mar 2007 at 11:04PM, Timo Sirainen wrote:
On Sun, 2007-03-11 at 20:54 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
OK this one's actually the right core file *and* the right binary (the other was from dovecot-1.0-beta7.)
Yes, this one was correct :) This should fix it: http://dovecot.org/list/dovecot-cvs/2007-March/008096.html
I'll also build a new snapshot. I guess I'll delay rc27 until tomorrow anyway.
I've put up the pre-rc27 onto our box and so far things are looking good. However, it's the weekend, and we don't have many users busy, and no new migrations. If you can wait until Tuesday afternoon (Pacific time) I can let you know how it survives Monday's abuse, and the migration of some more users on Monday night (which should hopefully validate the endianness bugfix). I'll go remove the index and cache removal stuff from our migration script.
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Sun 11 Mar 2007 at 02:32PM, Dan Price wrote:
On Sun 11 Mar 2007 at 11:04PM, Timo Sirainen wrote:
On Sun, 2007-03-11 at 20:54 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
OK this one's actually the right core file *and* the right binary (the other was from dovecot-1.0-beta7.)
Yes, this one was correct :) This should fix it: http://dovecot.org/list/dovecot-cvs/2007-March/008096.html
I'll also build a new snapshot. I guess I'll delay rc27 until tomorrow anyway.
I've put up the pre-rc27 onto our box and so far things are looking good. However, it's the weekend, and we don't have many users busy, and no new migrations. If you can wait until Tuesday afternoon (Pacific time) I can let you know how it survives Monday's abuse, and the migration of some more users on Monday night (which should hopefully validate the endianness bugfix). I'll go remove the index and cache removal stuff from our migration script.
Timo,
Since deploying pre-rc27 over the weekend we've had one core dump (just one!), this morning. Here is the message and backtrace. Thanks for working so hard on the IMAP stability.
-dp
dovecot: Mar 13 07:14:34 Error: IMAP(foobar): file mail-cache.c: line 411 (mail_cache_lock): assertion failed: (!cache->locked)
dovecot: Mar 13 07:14:34 Error: IMAP(foobar): Raw backtrace: 0x80b9467 -> 0x8098220 -> 0x809af69 -> 0x809b016 -> 0x80a3982 -> 0x80a4314 -> 0x80a2c31 -> 0x80853fd -> 0x807dd19 -> 0x80adc7e -> 0x806cb85 -> 0x8067df4 -> 0x806bb66 -> 0x806bafa -> 0x806bcb4 -> 0x80beb09 -> 0x80be541 -> 0x8072aec -> 0x806710c
dovecot: Mar 13 07:14:34 Error: child 440461 (imap) killed with signal 6
(gdb) bt full
#0  0xfee505a7 in _lwp_kill () from /lib/libc.so.1
No symbol table info available.
#1  0xfee4d71b in thr_kill () from /lib/libc.so.1
No symbol table info available.
#2  0xfee06502 in raise () from /lib/libc.so.1
No symbol table info available.
#3  0xfede4b8d in abort () from /lib/libc.so.1
No symbol table info available.
#4  0x080b9920 in i_internal_panic_handler (fmt=0x0, args=0x8047564 "bÛ\f\b\233\001") at failures.c:403
        backtrace = 0x80eabd0 "0x80b9467 -> 0x8098220 -> 0x809af69 -> 0x809b016 -> 0x80a3982 -> 0x80a4314 -> 0x80a2c31 -> 0x80853fd -> 0x807dd19 -> 0x80adc7e -> 0x806cb85 -> 0x8067df4 -> 0x806bb66 -> 0x806bafa -> 0x806bcb4 -> 0x80b"...
#5  0x080b9467 in i_panic (format=0x80d4234 "file %s: line %d (%s): assertion failed: (%s)") at failures.c:183
No locals.
#6  0x08098220 in mail_cache_lock (cache=0xfee9b000) at mail-cache.c:469
        view = (struct mail_index_view *) 0x0
        ext = (const struct mail_index_ext *) 0x4
        i = 134510040
        ret = 135174244
        __PRETTY_FUNCTION__ = "mail_cache_lock"
#7  0x0809af69 in mail_cache_handler_init (ctx_r=0x80475d8, cache=0x8101238) at mail-cache-sync-update.c:40
        ctx = (struct mail_cache_sync_context *) 0x80fd1f8
        ret = 0
#8  0x0809b016 in mail_cache_expunge_handler (sync_ctx=0xfee505a7, seq=413, data=0x8118128, context=0x80f6964) at mail-cache-sync-update.c:85
        view = (struct mail_index_view *) 0x810c228
        ctx = (struct mail_cache_sync_context *) 0x80fd1f8
        cache = (struct mail_cache *) 0x8101238
        cache_file_seq = 135318000
        ret = 0
#9  0x080a3982 in mail_index_sync_record (ctx=0x8047730, hdr=0x810ee54, data=0xfed36f08) at mail-index-sync-update.c:306
        rec = (const struct mail_transaction_expunge *) 0xfed36f08
        end = (const struct mail_transaction_expunge *) 0xfed36f10
        ret = 135312872
        __PRETTY_FUNCTION__ = "mail_index_sync_record"
#10 0x080a4314 in mail_index_sync_update_index (sync_ctx=0x810c228, sync_only_external=false) at mail-index-sync-update.c:895
        rec = (const struct mail_index_record *) 0x0
        index = (struct mail_index *) 0x8107c28
        view = (struct mail_index_view *) 0x810c228
        map = (struct mail_index_map *) 0x81010e0
        sync_map_ctx = {view = 0x810c228, cur_ext_id = 2, ext_intro_seq = 33554433, ext_intro_offset = 24308, ext_intro_size = 40, expunge_handlers = {buffer = 0x811b658, element_size = 12}, extra_contexts = {buffer = 0x81117f0, element_size = 4}, type = MAIL_INDEX_SYNC_HANDLER_FILE, sync_handlers_initialized = 0, expunge_handlers_set = 1, expunge_handlers_used = 1, cur_ext_ignore = 0, unreliable_flags = 0, sync_only_external = 0}
        thdr = (const struct mail_transaction_header *) 0x810ee54
        data = (const void *) 0xfed36f08
        old_lock_id = 1
        i = 0
        first_append_uid = 66571
        ret = 1
        had_dirty = false
        skipped = false
        check_ext_offsets = false
        __PRETTY_FUNCTION__ = "mail_index_sync_update_index"
#11 0x080a2c31 in mail_index_sync_commit (_ctx=0x8047804) at mail-index-sync.c:669
        ctx = (struct mail_index_sync_ctx *) 0x80fd068
        index = (struct mail_index *) 0x8107c28
        hdr = (const struct mail_index_header *) 0x810823c
        seq = 33554433
        offset = 28448
        ret = 0
#12 0x080853fd in mbox_sync (mbox=0x8100bf0, flags=25) at mbox-sync.c:1816
        expunged = false
        index_sync_ctx = (struct mail_index_sync_ctx *) 0x80fd068
        sync_view = (struct mail_index_view *) 0x810c228
        sync_ctx = {mbox = 0x8100bf0, flags = 25, input = 0x8100a44, file_input = 0x810c544, write_fd = 7, orig_mtime = 1173795269, orig_size = 22378926, index_sync_ctx = 0x80fd068, sync_view = 0x810c228, t = 0x0, hdr = 0x810c26c, header = 0x8111f98, from_line = 0x811b680, base_uid_validity = 1173290687, base_uid_last = 66570, base_uid_last_offset = 252, mails = {buffer = 0x8111ac0, element_size = 52}, syncs = {buffer = 0x8111868, element_size = 20}, sync_rec = {uid1 = 0, uid2 = 0, type = 0, add_flags = 0 '\0', remove_flags = 0 '\0', keyword_idx = 0}, mail_keyword_pool = 0x80fcee0, saved_keywords_pool = 0x8101500, prev_msg_uid = 66773, next_uid = 66774, idx_next_uid = 66571, seq = 1458, idx_seq = 1661, need_space_seq = 0, expunged_space = 0, space_diff = 0, dest_first_mail = 0, first_mail_crlf_expunged = 0, delay_writes = 0, renumber_uids = 0, moved_offsets = 0}
        seq = 33554433
        offset = 28448
        lock_id = 79
        ret = 0
        changed = 1
        __PRETTY_FUNCTION__ = "mbox_sync"
#13 0x0807dd19 in mbox_transaction_commit (_t=0x8103490, flags=3) at mbox-transaction.c:59
        mbox_sync_flags = 0
        t = (struct mbox_transaction_context *) 0x8103490
        mbox = (struct mbox_mailbox *) 0x8100bf0
        lock_id = 0
        mbox_modified = false
        ret = 0
#14 0x080adc7e in mailbox_transaction_commit (_t=0x0, flags=3) at mail-storage.c:506
        t = (struct mailbox_transaction_context *) 0x0
#15 0x0806cb85 in imap_expunge (box=0x8100bf0, next_search_arg=0x0) at imap-expunge.c:46
        ctx = (struct mail_search_context *) 0x0
        t = (struct mailbox_transaction_context *) 0x0
        mail = (struct mail *) 0x0
        search_arg = {next = 0x0, type = SEARCH_DELETED, value = {subargs = 0x0, seqset = 0x0, str = 0x0}, context = 0x0, hdr_field_name = 0x0, not = 0, match_always = 0, result = -1}
        failed = false
#16 0x08067df4 in cmd_close (cmd=0x80f784c) at cmd-close.c:19
        client = (struct client *) 0x80f7808
        mailbox = (struct mailbox *) 0x8100bf0
        storage = (struct mail_storage *) 0x80f7520
#17 0x0806bb66 in client_handle_input (cmd=0x80f784c) at client.c:332
        client = (struct client *) 0x80f7808
        __PRETTY_FUNCTION__ = "client_handle_input"
#18 0x0806bafa in client_handle_input (cmd=0x80f784c) at client.c:389
        client = (struct client *) 0x80f7808
        __PRETTY_FUNCTION__ = "client_handle_input"
#19 0x0806bcb4 in _client_input (context=0x80f7808) at client.c:432
        client = (struct client *) 0x80f7808
        cmd = (struct client_command_context *) 0x80f784c
        ret = 2
        __PRETTY_FUNCTION__ = "_client_input"
#20 0x080beb09 in io_loop_handler_run (ioloop=0x80f5000) at ioloop-poll.c:199
        ctx = (struct ioloop_handler_context *) 0x80f5038
        pollfd = (struct pollfd *) 0x2
        tv = {tv_sec = 0, tv_usec = 999973}
        io = (struct io *) 0x80f6418
        t_id = 2
        msecs = 0
        ret = 0
        call = true
#21 0x080be541 in io_loop_run (ioloop=0x80f5000) at ioloop.c:323
No locals.
#22 0x08072aec in main (argc=1, argv=0x8047afc, envp=0x8047b04) at main.c:287
No locals.
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Tue 13 Mar 2007 at 11:32AM, Dan Price wrote:
Timo,
Since deploying pre-rc27 over the weekend we've had one core dump (just one!), this morning. Here is the message and backtrace. Thanks for working so hard on the IMAP stability.
mbox stability, I mean :) Not enough coffee yet today.
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Tue, 2007-03-13 at 11:32 -0700, Dan Price wrote:
dovecot: Mar 13 07:14:34 Error: IMAP(foobar): file mail-cache.c: line 411 (mail_cache_lock): assertion failed: (!cache->locked)
This should fix it: http://dovecot.org/list/dovecot-cvs/2007-March/008140.html
Timo Sirainen wrote:
On Sun, 2007-03-11 at 20:54 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Sun, 2007-03-11 at 17:23 +0000, Chris Wakelin wrote:
It crashes at login (using PAM via pam_ldap to Active Directory). And no, it was the wrong core file <sheepish grin>. Here's the proper backtrace (with "password" instead of what I actually used ;) )
Well, this one is also broken. Some things that point to an invalid backtrace:
OK this one's actually the right core file *and* the right binary (the other was from dovecot-1.0-beta7.)
Yes, this one was correct :) This should fix it: http://dovecot.org/list/dovecot-cvs/2007-March/008096.html
I'll also build a new snapshot. I guess I'll delay rc27 until tomorrow anyway.
OK that fixed it. I tried running imaptest (something like "imaptest user=username pass=password host=host.rdg.ac.uk port=5143 seed=6630 mbox=/export/mail/username/INBOX") for 300 seconds, but got
Logi List Stat Sele Fetc Fet2 Stor Dele Expu Appe Logo
100%  50%  50% 100% 100% 100%  50% 100% 100% 100% 100%
                          30%   5%
   0    0    0    0    2    0    0    0    0    0    0  10/ 10
   0    0    0    0    6    0    0    0    0    0    0  10/ 10
   0    0    0    0    1    7    0    1    1    0    1  10/ 10
   4    1    2    2    1    3    4    3    4    2   10  10/ 10
   7    4    4    6    5   10    3    4    8    4   15  10/ 10
   9    4    3    8    8   10    3    1    8    7   18  10/ 10
   7    5    3    7    8    7    3    0    6    6   11  10/ 10
   7    3    3    8    8   14    4    1    7    7   14  10/ 10
Panic: From-line not found at 8199
Abort (core dumped)
#0  0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1
#1  0xff1b598c in abort () from /usr/lib/libc.so.1
#2  0x1bc7c in default_panic_handler (format=0x16, args=0xffbef450) at failures.c:138
#3  0x1be4c in i_panic (format=0x25468 "From-line not found at %llu") at failures.c:183
#4  0x14220 in mbox_get_next_size (mbox=0x40930, size_r=0x43e00, time_r=0xffbef508) at imaptest10.c:280
#5  0x14bf8 in client_append (client=0x43d78, continued=false) at imaptest10.c:441
#6  0x16220 in client_send_next_cmd (client=0x43d78) at imaptest10.c:803
#7  0x16e7c in state_callback (client=0x43d78, cmd=0x422b0, line=0x60214 "[READ-WRITE] Select completed.", reply=REPLY_OK) at imaptest10.c:1019
#8  0x1755c in client_input (context=0x43d78) at imaptest10.c:1152
#9  0x1f024 in io_loop_handler_run (ioloop=0x408d8) at ioloop-poll.c:199
#10 0x1e844 in io_loop_run (ioloop=0x408d8) at ioloop.c:323
#11 0x19090 in main (argc=8, argv=0xffbef8ec) at imaptest10.c:1606

The mailbox appears now to consist entirely of the UW-IMAP pseudo-message, except it's been truncated.
Chris
--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On 11.3.2007, at 23.55, Chris Wakelin wrote:
OK that fixed it. I tried running imaptest (something like "imaptest user=username pass=password host=host.rdg.ac.uk port=5143 seed=6630 mbox=/export/mail/username/INBOX") for 300 seconds, but got .. Panic: From-line not found at 8199 Abort (core dumped)
What is at offset 8199? (dd if=file bs=1 skip=8199)
Anyway the imaptest's mbox parsing code isn't that great. Try giving it a better mbox, such as http://dovecot.org/archives/dovecot.mbox :)
Timo Sirainen wrote:
On 11.3.2007, at 23.55, Chris Wakelin wrote:
OK that fixed it. I tried running imaptest (something like "imaptest user=username pass=password host=host.rdg.ac.uk port=5143 seed=6630 mbox=/export/mail/username/INBOX") for 300 seconds, but got .. Panic: From-line not found at 8199 Abort (core dumped)
What is at offset 8199? (dd if=file bs=1 skip=8199)
Anyway the imaptest's mbox parsing code isn't that great. Try giving it a better mbox, such as http://dovecot.org/archives/dovecot.mbox :)
I think it was because my "mbox" was the INBOX so source and destination were the same! It hasn't crashed yet with a copy of dovecot.mbox as the source (though it does generate STORE commands with arguments that are too long).
Chris
--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On Sat, 2007-03-10 at 21:45 +0200, Timo Sirainen wrote:
I've been doing some stress testing in the last few days and fixing all the errors I saw. I'm hoping that I've finally really fixed all the index/mbox problems. So, again I think the only thing left for v1.0 is the documentation. Please test this nightly snapshot before I release rc27 tomorrow:
I'm worried about the PAM crash that Chris reported, so I won't release rc27 before that's solved.
I updated this snapshot again. Changes since yesterday:
I added the fsync_disable setting after all. It gives 20% better performance with some of my tests, and maybe even more with a lot of users.
mbox and index files should now work perfectly in out-of-quota conditions.
If some people were still having problems with Dovecot hanging, that could have been caused by the sendfile() call being used when a mail had CR+LF linefeeds. If sendfile() didn't send the whole mail, Dovecot hung.
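For reference, the generic way to cope with a short sendfile() on Linux is to loop on the remaining bytes instead of assuming a single call sends everything. This is only an illustration of that failure mode, not the actual fix above (which is about not using sendfile() at all when linefeed conversion is needed); it assumes a blocking socket as the destination:

#include <sys/sendfile.h>
#include <sys/types.h>
#include <errno.h>

/* Send 'count' bytes from in_fd (starting at *offset) to out_fd, retrying on
   partial sends. Returns 0 on success, -1 on error. */
static int sendfile_all(int out_fd, int in_fd, off_t *offset, size_t count)
{
        while (count > 0) {
                ssize_t ret = sendfile(out_fd, in_fd, offset, count);

                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        /* e.g. EINVAL: caller should fall back to a normal
                           read()/write() loop */
                        return -1;
                }
                if (ret == 0)
                        break; /* unexpected EOF on in_fd */
                count -= ret;
        }
        return count == 0 ? 0 : -1;
}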
I also just had an idea of how filesystem quota could be handled with maildir, as long as only Dovecot is used to add new files to the maildir. Currently saving is done in this order:
1. write mails to tmp/
2. lock dovecot-uidlist file
3. rename() mails from tmp/ to new/
4. write the new dovecot-uidlist's contents to dovecot-uidlist.lock
5. rename() dovecot-uidlist.lock to dovecot-uidlist
If I switched 3 and 4, only the rename()s could fail. And at least my Linux allows rename()ing when there's no disk space left. I hope that's the case for "out of quota" conditions too? Or does it create a new directory entry, which can in some situations cause a new disk block allocation which will fail?..
On Sun, 2007-03-11 at 21:32 +0200, Timo Sirainen wrote:
1. write mails to tmp/
2. lock dovecot-uidlist file
3. rename() mails from tmp/ to new/
4. write the new dovecot-uidlist's contents to dovecot-uidlist.lock
5. rename() dovecot-uidlist.lock to dovecot-uidlist
If I switched 3 and 4, only the rename()s could fail.
And I of course mean that if writing dovecot-uidlist.lock fails, the whole mail saving is aborted with an "out of disk space" error, even though the mail file itself was successfully saved to tmp/.
And at least my Linux allows rename()ing when there's no disk space left. I hope that's the case for "out of quota" conditions too? Or does it create a new directory entry which can in some situations cause a new disk block allocation which will fail?..
The rename()s from tmp/ to new/ can at least fail. But if that happens, the whole transaction can again be aborted and all the mails unlink()ed (even the ones already in new/; that's how it currently works too).
So the only problem is if the last uidlist rename() fails. But that I hope won't fail because it's done inside the same directory, and the maildir root directory is pretty static.
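Here's a sketch of the reordered sequence for a single mail, written with plain syscalls (the dovecot-uidlist locking and the real maildir filename handling are left out; the point is only that everything that can hit ENOSPC or EDQUOT happens before the first rename()):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int maildir_save_one(const char *tmp_path, const char *new_path,
                            const char *uidlist_data, size_t uidlist_len)
{
        int fd;
        ssize_t ret;

        /* old step 4, moved before the tmp/ -> new/ rename: write the new
           uidlist contents to dovecot-uidlist.lock */
        fd = open("dovecot-uidlist.lock", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;
        ret = write(fd, uidlist_data, uidlist_len);
        if (ret < 0 || (size_t)ret != uidlist_len || fsync(fd) < 0) {
                /* out of disk space / over quota: abort cheaply, the mail
                   is still only in tmp/ */
                close(fd);
                unlink("dovecot-uidlist.lock");
                unlink(tmp_path);
                return -1;
        }
        close(fd);

        /* old step 3: rename() the mail from tmp/ to new/ */
        if (rename(tmp_path, new_path) < 0) {
                unlink("dovecot-uidlist.lock");
                unlink(tmp_path);
                return -1;
        }

        /* step 5: rename() dovecot-uidlist.lock over dovecot-uidlist.
           This replaces an existing name in the same directory, which is
           the one case we hope can't fail because of quota. */
        if (rename("dovecot-uidlist.lock", "dovecot-uidlist") < 0) {
                /* if some filesystem does fail here, the save could still
                   be undone by unlinking the file from new/ */
                unlink(new_path);
                unlink("dovecot-uidlist.lock");
                return -1;
        }
        return 0;
}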
On Sun, 2007-03-11 at 21:45 +0200, Timo Sirainen wrote:
So the only problem is if the last uidlist rename() fails. But that I hope won't fail because it's done inside the same directory, and the maildir root directory is pretty static.
/mnt/Maildir/tmp% mv 82 83
mv: cannot move `82' to `83': No space left on device
/mnt/Maildir/tmp% mv 82 81
/mnt/Maildir/tmp%
So it looks like at least Linux+ext3 allows rename()s that replace existing files, but doesn't allow rename()s that simply change the file's name. This would be enough for the dovecot-uidlist.lock renaming.
Can someone try if this works the same with filesystem quotas, and with other operating systems and filesystems?
Easiest way to fill the disk space is:
dd if=/dev/zero of=zero
for i in `seq 1 1000`; do touch $i; done
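To take mv out of the picture, a tiny C test run in the full directory works too (the file names here are arbitrary; create the two files before the disk fills up, or reuse ones from the touch loop):

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
        /* assumes files "a" and "b" already exist in the current directory */
        if (rename("a", "b") < 0)       /* replaces an existing name */
                printf("rename over existing file: %s\n", strerror(errno));
        else
                printf("rename over existing file: ok\n");

        if (rename("b", "c") < 0)       /* creates a new directory entry */
                printf("rename to new name: %s\n", strerror(errno));
        else
                printf("rename to new name: ok\n");
        return 0;
}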
On March 11, 21:01, Timo Sirainen wrote:
On Sun, 2007-03-11 at 21:45 +0200, Timo Sirainen wrote:
So the only problem is if the last uidlist rename() fails. But that I hope won't fail because it's done inside the same directory, and the maildir root directory is pretty static.
/mnt/Maildir/tmp% mv 82 83
mv: cannot move `82' to `83': No space left on device
/mnt/Maildir/tmp% mv 82 81
/mnt/Maildir/tmp%
So it looks like at least Linux+ext3 allows rename()s that replace existing files, but doesn't allow rename()s that simply change the file's name. This would be enough for the dovecot-uidlist.lock renaming.
Can someone try if this works the same with filesystem quotas, and with other operating systems and filesystems?
Same here with OpenBSD and ffs:
$ mv asd0 asd1
$ echo $?
0
$ mv asd0 bsd0
mv: rename asd0 to bsd0: Disk quota exceeded
$ echo $?
1
Daniel
On Sun 11 Mar 2007 at 10:01PM, Timo Sirainen wrote:
On Sun, 2007-03-11 at 21:45 +0200, Timo Sirainen wrote:
So the only problem is if the last uidlist rename() fails. But that I hope won't fail because it's done inside the same directory, and the maildir root directory is pretty static.
/mnt/Maildir/tmp% mv 82 83
mv: cannot move `82' to `83': No space left on device
/mnt/Maildir/tmp% mv 82 81
/mnt/Maildir/tmp%
I think in general you can't depend upon this-- take for example any filesystem supporting snapshots (like say ZFS). ZFS counts the size of your metadata against the quota (this is a good thing). The size occupied by your snapshot is also counted against the quota-- which means that if you have /home/dp@yesterday and you are near the quota, a rename of something which is in the snapshot will of course need to allocate new data and/or metadata blocks, since the old state-of-the-world must be saved in the snapshot.
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Mon, 2007-03-12 at 22:31 -0700, Dan Price wrote:
/mnt/Maildir/tmp% mv 82 83
mv: cannot move `82' to `83': No space left on device
/mnt/Maildir/tmp% mv 82 81
/mnt/Maildir/tmp%
I think in general you can't depend upon this--
As long as it works for most systems, it's useful to implement. Otherwise I'd just have to leave it as it is now, i.e. it doesn't work at all on any system when the control files are in the quota partition.
Although I guess on systems where the last rename() fails, it could still go and unlink the mails from the new/ dir and reply with a "save failed" error.
participants (5)
- Chris Wakelin
- Dan Price
- Daniel
- Timo Sirainen
- Troy Engel