[Dovecot] dovecot servers hanging with fuse/glusterfs errors
Hi.
I've got a clustered mail system built on GlusterFS, with Postfix, SquirrelMail and Dovecot machines sharing the same storage through GlusterFS. Each server has only Postfix, SquirrelMail or Dovecot installed on it. The thing is, the Dovecot servers hang very often, and the last thing they log is always this:
login: Unable to handle kernel paging request at 0000000000100108
RIP: [<ffffffff88020838>] :fuse:request_end+0x45/0x109
PGD 1f729067 PUD 1faae067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: ipv6 fuse dm_snapshot dm_mirror dm_mod
Pid: 678, comm: glusterfs Not tainted 2.6.18-xen #1
RIP: e030:[<ffffffff88020838>] [<ffffffff88020838>] :fuse:request_end+0x45/0x109
RSP: e02b:ffff88001f04dd68 EFLAGS: 00010246
RAX: 0000000000200200 RBX: ffff88001db9fa58 RCX: ffff88001db9fa68
RDX: 0000000000100100 RSI: ffff88001db9fa58 RDI: ffff88001f676400
RBP: ffff88001f676400 R08: 00000000204abb00 R09: ffff88001db9fb58
R10: 0000000000000008 R11: ffff88001f04dcf0 R12: 0000000000000000
R13: ffff88001db9fa90 R14: ffff88001f04ddf8 R15: 0000000000000001
FS: 00002b29187c53b0(0000) GS:ffffffff804cd000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process glusterfs (pid: 678, threadinfo ffff88001f04c000, task ffff88001fc5a860)
Stack: ffff88001db9fa58 ffff88001f676400 00000000fffffffe ffffffff88021056
       ffff88001f04def8 000000301f04de88 ffffffff8020dd40 ffff88001f04ddb8
       ffffffff80225ca3 ffff88001de23500 ffffffff803ef023 ffff88001f04de98
Call Trace:
 [<ffffffff88021056>] :fuse:fuse_dev_readv+0x385/0x435
 [<ffffffff8020dd40>] monotonic_clock+0x35/0x7d
 [<ffffffff80225ca3>] deactivate_task+0x1d/0x28
 [<ffffffff803ef023>] thread_return+0x0/0x120
 [<ffffffff802801d3>] do_readv_writev+0x271/0x294
 [<ffffffff802274c7>] default_wake_function+0x0/0xe
 [<ffffffff803f0976>] __down_read+0x12/0xec
 [<ffffffff88021120>] :fuse:fuse_dev_read+0x1a/0x1f
 [<ffffffff802804bc>] vfs_read+0xcb/0x171
 [<ffffffff802274c7>] default_wake_function+0x0/0xe
 [<ffffffff8028089b>] sys_read+0x45/0x6e
 [<ffffffff8020a436>] system_call+0x86/0x8b
 [<ffffffff8020a3b0>] system_call+0x0/0x8b

Code: 48 89 42 08 48 89 10 48 c7 41 08 00 02 20 00 f6 46 30 08 48
RIP [<ffffffff88020838>] :fuse:request_end+0x45/0x109 RSP <ffff88001f04dd68>
CR2: 0000000000100108
dovecot01gluster01 kernel: Oops: 0002 [1] SMP
dovecot01gluster01 kernel: CR2: 0000000000100108
<3>BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff80257f78>] softlockup_tick+0xd8/0xea
 [<ffffffff8020f110>] timer_interrupt+0x3a9/0x405
 [<ffffffff80258264>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80258350>] __do_IRQ+0xa4/0x105
 [<ffffffff8020b0e8>] call_softirq+0x1c/0x28
 [<ffffffff8020cecb>] do_IRQ+0x65/0x73
 [<ffffffff8034a8c1>] evtchn_do_upcall+0xac/0x12d
 [<ffffffff8020ac1e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff802d6f56>] dummy_inode_permission+0x0/0x3
 [<ffffffff8028c9b8>] do_lookup+0x63/0x173
 [<ffffffff803f1232>] .text.lock.spinlock+0x0/0x8a
 [<ffffffff8802144c>] :fuse:request_send+0x1b/0x2a8
 [<ffffffff8028f0d6>] __link_path_walk+0xdf2/0xf3c
 [<ffffffff80261275>] __do_page_cache_readahead+0x8a/0x28f
 [<ffffffff8802249f>] :fuse:fuse_dentry_revalidate+0x94/0x120
 [<ffffffff80299838>] mntput_no_expire+0x19/0x8b
 [<ffffffff8028f2f3>] link_path_walk+0xd3/0xe5
 [<ffffffff8029571e>] __d_lookup+0xb0/0xff
 [<ffffffff8028ca92>] do_lookup+0x13d/0x173
 [<ffffffff8028e687>] __link_path_walk+0x3a3/0xf3c
 [<ffffffff8028f27c>] link_path_walk+0x5c/0xe5
 [<ffffffff80219b90>] do_page_fault+0xee9/0x1215
 [<ffffffff8027e38d>] fd_install+0x25/0x5f
 [<ffffffff8025daac>] filemap_nopage+0x188/0x324
 [<ffffffff8028f6df>] do_path_lookup+0x270/0x2ec
 [<ffffffff8028e0c6>] getname+0x15b/0x1c1
 [<ffffffff8028ff52>] __user_walk_fd+0x37/0x4c
 [<ffffffff80288883>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff80219b90>] do_page_fault+0xee9/0x1215
 [<ffffffff8027e38d>] fd_install+0x25/0x5f
 [<ffffffff80239287>] do_sigaction+0x7a/0x1f3
 [<ffffffff80288a4e>] sys_newstat+0x19/0x31
 [<ffffffff80239495>] sys_rt_sigaction+0x59/0x98
 [<ffffffff8020ab73>] error_exit+0x0/0x71
 [<ffffffff8020a436>] system_call+0x86/0x8b
 [<ffffffff8020a3b0>] system_call+0x0/0x8b
These are basic, brand-new Debian Etch setups; they have nothing but Dovecot and glusterfs-client installed on them. All of my machines run the very same version of glusterfs-client, but only the Dovecot ones hang.
Do you have any idea?
Thank you.
On Mon, 2008-01-28 at 11:10 +0100, Jordi Moles wrote:
Hi.
I've got a clustered mail system built on GlusterFS, with Postfix, SquirrelMail and Dovecot machines sharing the same storage through GlusterFS. Each server has only Postfix, SquirrelMail or Dovecot installed on it. The thing is, the Dovecot servers hang very often, and the last thing they log is always this:
Just because only Dovecot exposes bugs in fuse (or glusterfs) doesn't mean we can do anything about it.
At 11:10 AM +0100 1/28/08, Jordi Moles wrote:
Hi.
I've got a clustered mail system built on GlusterFS, with Postfix, SquirrelMail and Dovecot machines sharing the same storage through GlusterFS.
Why?
I can see the Postfix/Dovecot justification (barely) but it seems pointless for SquirrelMail. What storage does SquirrelMail share with the other types of machines, and why? Since SquirrelMail normally accesses mail only via IMAP, there's no point in making the back end of the IMAP server visible to it.
Each server has only Postfix, SquirrelMail or Dovecot installed on it. The thing is, the Dovecot servers hang very often, and the last thing they log is always this:
login: Unable to handle kernel paging request at 0000000000100108
RIP: [<ffffffff88020838>] :fuse:request_end+0x45/0x109
[...]
This is almost certainly a bug in the kernel proper, the fuse module, or possibly the glusterfs client. If you're using cheap hardware it could also be caused by a hardware issue like an intermittent memory failure. This failure is at too low a level to be the fault of a misbehaving application.
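For what it's worth, the faulting address (0000000000100108) and the register contents (RAX=0000000000200200, RDX=0000000000100100) look like the kernel's list-poison values, LIST_POISON2 and LIST_POISON1, which would mean request_end() is running list_del() a second time on a request that has already been unlinked. The sketch below is purely illustrative, a userspace imitation of the 2.6.18 list.h behaviour rather than the actual fuse code, but it shows why a double delete faults at exactly LIST_POISON1 + 8:

/*
 * Hypothetical userspace imitation of the 2.6.18 kernel's list poisoning
 * (include/linux/poison.h, include/linux/list.h) to show why deleting the
 * same list entry twice faults at 0x0000000000100108: LIST_POISON1 plus
 * the offset of ->prev (8 bytes on x86_64).
 */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

static void list_del(struct list_head *entry)
{
    /* __list_del(): next->prev = prev; prev->next = next; */
    entry->next->prev = entry->prev;  /* on a 2nd call this writes to 0x100108 */
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;       /* poison so stale use is caught */
    entry->prev = LIST_POISON2;
}

int main(void)
{
    struct list_head head = { &head, &head };
    struct list_head req;

    /* link req into the list, then unlink it once (fine) */
    req.next = &head; req.prev = &head;
    head.next = &req; head.prev = &req;
    list_del(&req);
    printf("after first del: next=%p prev=%p\n",
           (void *)req.next, (void *)req.prev);

    /* second unlink: dereferences the poison values, just like the oops */
    list_del(&req);
    return 0;
}

If that reading is right, the corruption is happening entirely inside the fuse module's request list handling, and nothing Dovecot does from userspace could be the direct cause.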
[...]
These are basic, brand-new Debian Etch setups; they have nothing but Dovecot and glusterfs-client installed on them. All of my machines run the very same version of glusterfs-client, but only the Dovecot ones hang.
Do you have any idea?
The same failure across multiple machines implies a problem in software, not hardware.
The fact that Dovecot is poking this bug where Postfix and SquirrelMail are not is not particularly surprising, and doesn't really indicate a Dovecot flaw. A mailstore server like Dovecot makes more complex demands for proper behavior from a filesystem than an MTA like Postfix.
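To give a rough idea of what I mean, here is an illustration (not Dovecot's actual code, just the general shape of a maildir-style access pattern, with made-up paths) of the mix of operations an IMAP server ends up issuing against the mail store: O_EXCL dotlock creation, fcntl() range locks, mmap() of index files, and rename() for atomic delivery. Every one of these has to behave correctly over FUSE, whereas an MTA mostly just writes new files and renames them.

/*
 * Hypothetical sketch of the filesystem operations a mailstore exercises
 * on a shared mount. Paths and sizes are made up; the point is the mix of
 * operations, each of which the FUSE layer has to get exactly right.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *lockfile  = "/mnt/mailstore/testbox/dovecot.lock";  /* made-up path */
    const char *indexfile = "/mnt/mailstore/testbox/index";
    const char *tmpmail   = "/mnt/mailstore/testbox/tmp/12345";
    const char *newmail   = "/mnt/mailstore/testbox/new/12345";

    /* 1. Dotlock: create-exclusive, which must be atomic across clients. */
    int lockfd = open(lockfile, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (lockfd < 0) { perror("dotlock"); return 1; }

    /* 2. fcntl() byte-range lock on the index file. */
    int idxfd = open(indexfile, O_RDWR | O_CREAT, 0600);
    if (idxfd < 0) { perror("open index"); return 1; }
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(idxfd, F_SETLKW, &fl) < 0) perror("fcntl lock");

    /* 3. mmap() the index: reads and writes go through the page cache and FUSE. */
    if (ftruncate(idxfd, 4096) < 0) perror("ftruncate");
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, idxfd, 0);
    if (map != MAP_FAILED) {
        memcpy(map, "index-header", 12);
        msync(map, 4096, MS_SYNC);
        munmap(map, 4096);
    }

    /* 4. Atomic delivery: write into tmp/, then rename() into new/. */
    int mailfd = open(tmpmail, O_CREAT | O_WRONLY, 0600);
    if (mailfd >= 0) {
        if (write(mailfd, "Subject: test\n\nbody\n", 20) != 20) perror("write");
        close(mailfd);
        if (rename(tmpmail, newmail) < 0) perror("rename");
    }

    close(idxfd);
    close(lockfd);
    unlink(lockfile);
    return 0;
}

Running a loop of operations like these from two clients at once against the glusterfs mount would also be a quick way to see whether the FUSE layer falls over even without Dovecot in the picture.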
--
Bill Cole
bill@scconsult.com
Hi,
Actually, SquirrelMail is not mounting any FUSE device. You are right... why would it? :) Its data is all stored in a database.
As for the cheap hardware: I'm using Xen virtual machines for both the storage nodes and the clients. However, I've already tried running the nodes and the Dovecot machines on non-virtual servers, and I get the same error sooner or later.
Finally, thanks to the help of the people in the GlusterFS channel on IRC, I've upgraded all the packages to the latest development version, and I'll let you know whether that fixes it for me.
Thank you all.
participants (3)
- Bill Cole
- Jordi Moles
- Timo Sirainen