[Dovecot] Rebuilding indexes fails on inconsistent mdbox
Hello all,
I have a problem with an inconsistent mdbox:
Oct 24 10:43:23 two dovecot: imap-login: Login: user=<...>, method=PLAIN, rip=..., lip=..., mpid=4977, TLS
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Error: mdbox map .../mdbox/storage/dovecot.map.index corrupted: Unexpectedly lost INBOX uid=638 map_uid=809891
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Error: mdbox map .../mdbox/storage/dovecot.map.index corrupted: Unexpectedly lost INBOX uid=638 map_uid=809891
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Disconnected: Internal error occurred. Refer to server log for more information. [2012-10-24 10:43:23] bytes=115/53726
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Warning: mdbox .../mdbox/storage: Inconsistency in map index (467,31960 != 467,36768)
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Warning: mdbox .../mdbox/storage: rebuilding indexes
Oct 24 10:45:19 two dovecot: imap(listen@mjh.name): Panic: file mdbox-storage-rebuild.c: line 773 (rebuild_update_refcounts): assertion failed: (map_uid < msgs[i]->map_uid)
Oct 24 10:45:19 two dovecot: imap(listen@mjh.name): Error: Raw backtrace: /usr/lib/dovecot/libdovecot.so.0(+0x380ca) [0x7f99cf35b0ca] -> /usr/lib/dovecot/libdovecot.so.0(+0x3810e) [0x7f99cf35b10e] -> /usr/lib/dovecot/libdovecot.so.0(i_fatal+0) [0x7f99cf334a67] -> /usr/lib/dovecot/libdovecot-storage.so.0(mdbox_storage_rebuild_in_context+0x10a5) [0x7f99cf5f42d5] -> /usr/lib/dovecot/libdovecot-storage.so.0(mdbox_storage_rebuild+0x24) [0x7f99cf5f4414] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0x53565) [0x7f99cf5f4565] -> /usr/lib/dovecot/libdovecot-storage.so.0(mailbox_close+0x1a) [0x7f99cf5c8caa] -> /usr/lib/dovecot/libdovecot-storage.so.0(mailbox_free+0x13) [0x7f99cf5c8cf3] -> dovecot/imap(client_destroy+0x109) [0x7f99cfaa69e9] -> dovecot/imap(client_input+0xaa) [0x7f99cfaa6dba] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x48) [0x7f99cf366c98] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0xa7) [0x7f99cf367d27] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_run+0x28) [0x7f99cf366c28] -> /usr/lib/dovecot/libdovecot.so.0(master_service_run+0x13) [0x7f99cf354e33] -> dovecot/imap(main+0x304) [0x7f99cfa9e554] -> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f99cef8576d] -> dovecot/imap(+0x95e5) [0x7f99cfa9e5e5]
Oct 24 10:45:19 two dovecot: master: Error: service(imap): child 4977 killed with signal 6 (core dumps disabled)
I use:
Dovecot 2.0.19-0ubuntu1
Ubuntu 12.04, x86-64, kernel 3.2.0-32-generic
local XFS filesystem for the mdbox
The problem appeared out of nowhere. Messages have been continuously delivered to this mailbox on this installation since May 2012, and the mdbox was only accessed with deliver and imap/pop3 from Dovecot. About four hours after the problem first appeared, I did a hard reset of the system because it was unresponsive. The error message is exactly the same before and after the hard reset. The problem is triggered by both IMAP access and dovecot deliver access.
The whole mdbox is 6.6 GiB large and I guess that it contains about 300k-600k messages. It's an archive of public mailing lists, so I could give access to the files.
Can anybody say something about this? May the mdbox be repaired?
Regards, Milan Holzäpfel
-- Milan Holzäpfel <listen@mjh.name>
Am 24.10.2012 13:28, schrieb Milan Holzäpfel:
The whole mdbox is 6.6 GiB large and I guess that it contains about 300k-600k messages. It's an archive of public mailing lists, so I could give access to the files.
Can anybody say something about this? May the mdbox be repaired?
perhaps this helps
http://wiki2.dovecot.org/Tools/Doveadm/ForceResync
however upgrading to dovecot latest might be a good idea
Best Regards MfG Robert Schetterer
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Joerg Heidrich
On Wed, 24 Oct 2012 13:43:19 +0200 Robert Schetterer <rs@sys4.de> wrote:
Am 24.10.2012 13:28, schrieb Milan Holzäpfel:
The whole mdbox is 6.6 GiB large and I guess that it contains about 300k-600k messages. It's an archive of public mailing lists, so I could give access to the files.
Can anybody say something about this? May the mdbox be repaired?
perhaps this helps
http://wiki2.dovecot.org/Tools/Doveadm/ForceResync
however upgrading to dovecot latest might be a good idea
I tried this command, but all it will do is the "rebuilding indexes" thing that Dovecot's deliver and imapd will also do. (As I mentioned, this fails.) I haven't tried a more recent version of Dovecot so far.
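For reference, the repair step on that wiki page boils down to a single doveadm invocation. A minimal sketch, assuming Dovecot 2.x's doveadm is on PATH and using the username from the log excerpts ('*' means all of that user's mailboxes):

```shell
# Sketch of the ForceResync step from the wiki page linked above.
# Assumptions: doveadm (Dovecot 2.x) is installed; the username below is
# the one from the thread's logs; '*' selects every mailbox of that user.
force_resync() {
    # Print the command for visibility, then run it if doveadm is available.
    echo "doveadm force-resync -u $1 '*'"
    if command -v doveadm >/dev/null 2>&1; then
        doveadm force-resync -u "$1" '*'
    fi
}

force_resync 'listen@mjh.name'
```

As noted above, this only triggers the same index rebuild that deliver and imapd attempt, so it cannot help while the rebuild itself crashes.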
Regards, Milan Holzäpfel
-- Milan Holzäpfel <listen@mjh.name>
On 10/24/2012 6:28 AM, Milan Holzäpfel wrote:
I have a problem with an inconsistent mdbox: ... four hours after the problem initially appeared, I did a hard reset of the system because it was unresponsive. ... Can anybody say something about this? May the mdbox be repaired?
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
If there's a kernel or hardware problem, you should see related errors in dmesg. Please share those.
Neither Timo nor anyone here can fix your index problem if the cause lies elsewhere. You must fix the root problem first.
-- Stan
On 2012-10-24 10:01 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Hmmm... I wonder would it be possible for dovecot to automatically lower the 'niceness' for index rebuilds (on systems that support such) to avoid causing such distress?
--
Best regards,
Charles
Am 24.10.2012 16:45, schrieb Charles Marcus:
On 2012-10-24 10:01 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Hmmm... I wonder would it be possible for dovecot to automatically lower the 'niceness' for index rebuilds (on systems that support such) to avoid causing such distress?
I think you missed Stan's point: look for hardware problems first.
Best Regards MfG Robert Schetterer
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Joerg Heidrich
On 2012-10-24 12:16 PM, Robert Schetterer <rs@sys4.de> wrote:
Am 24.10.2012 16:45, schrieb Charles Marcus:
On 2012-10-24 10:01 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Hmmm... I wonder would it be possible for dovecot to automatically lower the 'niceness' for index rebuilds (on systems that support such) to avoid causing such distress?
I think you missed Stan's point: look for hardware problems first.
No, I was simply commenting on the one point about heavy load during large index rebuilds - which is why I trimmed the quoted text... maybe I could have trimmed more?
--
Best regards,
Charles
On 10/24/2012 9:45 AM, Charles Marcus wrote:
On 2012-10-24 10:01 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Hmmm... I wonder would it be possible for dovecot to automatically lower the 'niceness' for index rebuilds (on systems that support such) to avoid causing such distress?
Changing the process priority would not help. Indexing a large mailbox is an IO bound, not a compute bound, operation. With Linux, changing from the CFQ to deadline scheduler may help some with low responsiveness. But the only real solution for such a case where iowait is bringing the system to its knees is to acquire storage with far greater IOPS and concurrent IO capability. I.e. a server.
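For anyone who wants to experiment anyway, both Linux-side knobs touched on above are one-liners. A sketch, where the device name sda and the idle I/O class are examples, not details from this thread:

```shell
# Sketch of the two Linux mitigations mentioned above.
# 1) List the elevator each block device is using; the active one is
#    shown in brackets, e.g. "noop deadline [cfq]".
list_schedulers() {
    for q in /sys/block/*/queue/scheduler; do
        if [ -r "$q" ]; then
            printf '%s: %s\n' "$q" "$(cat "$q")"
        fi
    done
}
list_schedulers

# 2) Switching to the deadline scheduler (needs root; sda is an example):
#       echo deadline > /sys/block/sda/queue/scheduler
# 3) Under CFQ, I/O priority can be lowered per process, e.g.:
#       ionice -c 3 doveadm force-resync -u user '*'
```

Note that ionice classes are only honored by CFQ, which is consistent with Stan's point: under heavy iowait on a single spindle, none of this substitutes for more IOPS.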
-- Stan
On 2012-10-24 11:48 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Changing the process priority would not help. Indexing a large mailbox is an IO bound, not a compute bound, operation. With Linux, changing from the CFQ to deadline scheduler may help some with low responsiveness. But the only real solution for such a case where iowait is bringing the system to its knees is to acquire storage with far greater IOPS and concurrent IO capability. I.e. a server.
Ok, I get it, thanks for elaborating Stan...
--
Best regards,
Charles
On Wed, 24 Oct 2012 09:01:24 -0500 Stan Hoeppner <stan@hardwarefreak.com> wrote:
On 10/24/2012 6:28 AM, Milan Holzäpfel wrote:
I have a problem with an inconsistent mdbox: ... four hours after the problem initially appeared, I did a hard reset of the system because it was unresponsive. ... Can anybody say something about this? May the mdbox be repaired?
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Thanks for your suggestion. I wasn't looking for a solution for the unresponsiveness, but I failed to make that clear.
I was not patient enough to debug the unresponsiveness issue. The box was not hard locked, but any command took very long, if it completed at all. I think that it could be massive swapping, but I wouldn't expect Dovecot to be the cause.
After the reboot, Dovecot would happily re-execute the failing index rebuild on each new incoming message, which suggests that Dovecot wasn't the cause for the unresponsiveness.
If there's a kernel or hardware problem, you should see related errors in dmesg. Please share those.
The kernel had messages like
INFO: task cron:2799 blocked for more than 120 seconds.
in the dmesg. But again, I didn't mean to ask for a solution to this problem.
Regards, Milan Holzäpfel
-- Milan Holzäpfel <listen@mjh.name>
On 10/26/2012 1:29 PM, Milan Holzäpfel wrote:
On Wed, 24 Oct 2012 09:01:24 -0500 Stan Hoeppner <stan@hardwarefreak.com> wrote:
On 10/24/2012 6:28 AM, Milan Holzäpfel wrote:
I have a problem with an inconsistent mdbox: ... four hours after the problem initially appeared, I did a hard reset of the system because it was unresponsive. ... Can anybody say something about this? May the mdbox be repaired?
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS, the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
Thanks for your suggestion. I wasn't looking for a solution for the unresponsiveness, but I failed to make that clear.
It's likely all related. If you have already, or will continue to, hard reset the box, you will lose inflight data in the buffer cache, which may very likely corrupt your mdbox files and/or indexes. I'm a bit shocked you'd hard reset a *slow* responding server. Especially one that appears to be unresponsive due to massive disk IO. That's a recipe for disaster...
I was not patient enough to debug the unresponsiveness issue. The box was not hard locked, but any command took very long, if it completed at all. I think that it could be massive swapping, but I wouldn't expect Dovecot to be the cause.
This leads me to believe your filesystem root, swap partition, and Dovecot mailbox storage are all on the same disk, or small RAID set. Is this correct?
After the reboot, Dovecot would happily re-execute the failing index rebuild on each new incoming message, which suggests that Dovecot wasn't the cause for the unresponsiveness.
This operation is a tiny IO pattern compared to the 6.6GB re-indexing operation you mentioned before. So you can't make the simple assumption that "Dovecot wasn't the cause for the unresponsiveness". In fact Dovecot likely instigated the problem, though it likely isn't the "cause". I'll take a stab at that below.
If there's a kernel or hardware problem, you should see related errors in dmesg. Please share those.
The kernel had messages like
INFO: task cron:2799 blocked for more than 120 seconds.
Now we're getting some meat on this plate.
in the dmesg. But again, I didn't mean to ask for a solution to this problem.
"blocked for more than 120 seconds" is a kernel warning message, not an error message. We see this quite often on the XFS list. Rarely, this is related to a kernel bug. Most often the cause of this warning is saturated IO. In this case it appears cron blocked for 120s because it couldn't read /var/cron/crontabs/[user].
The most likely cause of this is that so many IO requests are piled up in the queue that it took more than 2 minutes for the hardware (disks) to complete them before servicing the cron process' IO requests. Dovecot re-indexing a 6.6GB mailbox, with other IO occurring concurrently, could easily cause this situation if you don't have sufficient spindle IOPS. I.e. this IO pattern will bring a single SATA disk or mirror pair to its knees.
If you currently have everything on a single SATA disk or mirror pair, the solution for eliminating the bogging down of the system, and likely the Dovecot issues related to it, is to simply separate your root filesystem, swap, and Dovecot data files onto different physical devices. For instance, moving the root filesystem and swap to a small SSD will prevent the OS unresponsiveness, even if Dovecot is bogged down with IO to the SATA disk.
With spinning rust storage, separation of root filesystem, swap, and application data to different storage IO domains is system administration 101 kind of stuff. If you're using SSD this isn't (as) critical as it's pretty hard to saturate the IO limits of an SSD.
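Before re-laying-out disks, it is worth confirming that iowait really is the bottleneck. A sketch that reads the cumulative iowait share straight from /proc/stat on any Linux box (the function name is illustrative):

```shell
# Sketch: compute the cumulative iowait percentage from /proc/stat.
# Fields on the "cpu" line are: user nice system idle iowait irq softirq ...
# A large or rising iowait share while the box crawls points at saturated disks.
iowait_pct() {
    awk '/^cpu / {
        total = 0
        for (i = 2; i <= NF; i++) total += $i   # sum all jiffy counters
        printf "%.1f\n", 100 * $6 / total       # $6 = iowait jiffies
    }' /proc/stat
}

iowait_pct
```

For a live, per-device view, `iostat -x` from the sysstat package shows `%util` and `await`, and `vmstat`'s `wa` column gives the same system-wide figure interactively.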
-- Stan
On Wed, 24 Oct 2012 13:28:11 +0200 Milan Holzäpfel <listen@mjh.name> wrote:
I have a problem with an inconsistent mdbox: [...] The problem appeared out of nowhere. [...]
That's just wrong. Two minutes before the corruption occurred for the first time, the machine was booted after power-off without prior shutdown. I didn't notice this until now, sorry for this.
The mailbox is on XFS. As far as I remember, XFS is known for leaving NULL bytes at the end of files after a system reset. At least, I found 72 bytes of NULL in a plain text log file on XFS after such an event. Do you think this may be the source of the index corruption?
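If the crash did leave zero-filled tails, a quick way to spot likely victims is to check whether storage files end in NUL bytes. A sketch with POSIX tools; the swept path is an example, not the poster's actual layout:

```shell
# Sketch: flag files whose last byte is NUL (0x00) -- the post-crash XFS
# symptom described above. A healthy plain-text or mdbox storage file
# should not end in a zero byte.
ends_in_nul() {
    # Prints "yes" if the last byte of $1 is 0x00, else "no".
    last=$(tail -c 1 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$last" = "00" ]; then echo yes; else echo no; fi
}

# Example sweep over a storage directory (path is illustrative):
# for f in ~/mdbox/storage/m.*; do
#     [ "$(ends_in_nul "$f")" = yes ] && echo "suspect: $f"
# done
```

This only checks the final byte, so it is a cheap first pass; a file could also contain a zeroed region in the middle that this would miss.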
Do you have any other suggestions for recovering the mailbox?
Regards, Milan Holzäpfel
-- Milan Holzäpfel <listen@mjh.name>
On 10/26/2012 1:30 PM, Milan Holzäpfel wrote:
On Wed, 24 Oct 2012 13:28:11 +0200 Milan Holzäpfel <listen@mjh.name> wrote:
I have a problem with an inconsistent mdbox: [...] The problem appeared out of nowhere. [...]
That's just wrong. Two minutes before the corruption occurred for the first time, the machine was booted after power-off without prior shutdown. I didn't notice this until now, sorry for this.
Ahh, more critical information. Better late than never I guess.
The mailbox is on XFS. As far as I remember, XFS is known for leaving NULL bytes at the end of files after a system reset. At least, I found 72 bytes of NULL in a plain text log file on XFS after such an event. Do you think this may be the source of the index corruption?
Very possibly.
Do you have any other suggestions for recovering the mailbox?
Other than restoring from a backup, I do not. Others might. But I will offer this suggestion: Never run a server without a properly functioning UPS and shutdown scripts.
The system in question isn't a laptop is it? I'm trying to ascertain how many server 'rules' you're breaking before making any more assumptions or giving any more advice.
-- Stan
On 24.10.2012, at 14.28, Milan Holzäpfel wrote:
Oct 24 10:43:23 two dovecot: imap-login: Login: user=<...>, method=PLAIN, rip=..., lip=..., mpid=4977, TLS
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Error: mdbox map .../mdbox/storage/dovecot.map.index corrupted: Unexpectedly lost INBOX uid=638 map_uid=809891
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Error: mdbox map .../mdbox/storage/dovecot.map.index corrupted: Unexpectedly lost INBOX uid=638 map_uid=809891
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Disconnected: Internal error occurred. Refer to server log for more information. [2012-10-24 10:43:23] bytes=115/53726
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Warning: mdbox .../mdbox/storage: Inconsistency in map index (467,31960 != 467,36768)
Oct 24 10:43:23 two dovecot: imap(listen@mjh.name): Warning: mdbox .../mdbox/storage: rebuilding indexes
The above problems aren't too bad, since Dovecot fixes itself.
Oct 24 10:45:19 two dovecot: imap(listen@mjh.name): Panic: file mdbox-storage-rebuild.c: line 773 (rebuild_update_refcounts): assertion failed: (map_uid < msgs[i]->map_uid)
Now this is a bug.
Dovecot 2.0.19-0ubuntu1
But the bug may have already been fixed in v2.1.
The whole mdbox is 6.6 GiB large and I guess that it contains about 300k-600k messages. It's an archive of public mailing lists, so I could give access to the files.
I'd try first with a recent 2.1 version and if that doesn't fix the crash the easiest way for me to fix it would be to get the files. If you put the files through http://dovecot.org/tools/mdbox-obfuscate.pl they should compress pretty nicely.
On Mon, 29 Oct 2012 17:31:42 +0200 Timo Sirainen <tss@iki.fi> wrote:
On 24.10.2012, at 14.28, Milan Holzäpfel wrote:
Oct 24 10:45:19 two dovecot: imap(listen@mjh.name): Panic: file mdbox-storage-rebuild.c: line 773 (rebuild_update_refcounts): assertion failed: (map_uid < msgs[i]->map_uid)
Now this is a bug.
Dovecot 2.0.19-0ubuntu1
But the bug may have already been fixed in v2.1.
I downloaded Dovecot 2.1.10 and compiled from source. Accessing the mailbox still fails:
$ doveadm mailbox status all xen-devel
doveadm(...): Error: mdbox map /home/test/mdbox/storage/dovecot.map.index corrupted: Unexpectedly lost xen-devel uid=108079 map_uid=809892
doveadm(...): Warning: mdbox /home/test/mdbox/storage: rebuilding indexes
doveadm(...): Panic: file mdbox-storage-rebuild.c: line 743 (rebuild_update_refcounts): assertion failed: (map_uid < msgs[i]->map_uid)
doveadm(...): Error: Raw backtrace: /home/test/prefix/dovecot/lib/dovecot/libdovecot.so.0(+0x45b3a) [0x7fd2942f9b3a] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot.so.0(default_fatal_handler+0x2a) [0x7fd2942f9bfa] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot.so.0(i_fatal+0) [0x7fd2942cdc60] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot-storage.so.0(mdbox_storage_rebuild_in_context+0x105e) [0x7fd2945838be] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot-storage.so.0(mdbox_storage_rebuild+0x24) [0x7fd294583a54] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot-storage.so.0(+0x3b525) [0x7fd294581525] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot-storage.so.0(mailbox_close+0x1a) [0x7fd2945c047a] -> /home/test/prefix/dovecot/lib/dovecot/libdovecot-storage.so.0(mailbox_free+0x13) [0x7fd2945c04c3] -> /home/test/prefix/dovecot/bin/doveadm() [0x41383d] -> /home/test/prefix/dovecot/bin/doveadm() [0x40e8e7] -> /home/test/prefix/dovecot/bin/doveadm() [0x40f3ae] -> /home/test/prefix/dovecot/bin/doveadm(doveadm_mail_try_run+0x11e) [0x40f8be] -> /home/test/prefix/dovecot/bin/doveadm(main+0x3bd) [0x40e50d] -> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fd293f1676d] -> /home/test/prefix/dovecot/bin/doveadm() [0x40e6f1]
Aborted
I will send you an Email with a download link for the files.
Thanks for your help.
Regards, Milan
-- Milan Holzäpfel <listen@mjh.name>
On 9.11.2012, at 17.18, Milan Holzäpfel wrote:
Oct 24 10:45:19 two dovecot: imap(listen@mjh.name): Panic: file mdbox-storage-rebuild.c: line 773 (rebuild_update_refcounts): assertion failed: (map_uid < msgs[i]->map_uid)
Finally looked into this and fixed: http://hg.dovecot.org/dovecot-2.1/rev/8770940057b9
participants (5)
- Charles Marcus
- Milan Holzäpfel
- Robert Schetterer
- Stan Hoeppner
- Timo Sirainen