[Dovecot] 2.0, hourly performance stats
I'm getting constantly high numbers of page reclaims & involuntary context switches for dovecot/auth.
Page reclaims = minor faults = the CPU switching back to system mode. But why is the auth process doing that so excessively? Same for the large number of involuntary context switches...
Attached is my "dovecot -n" output.
Date: Sun, 07 Nov 2010 20:40:34 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 205.93 1.32 2.76 25280 0 0 0 0 0 261 1425
ssl-params 288.71 0.40 0.48 468 0 0 8 0 0 8 1
pop3 0.59 0.80 0.20 651 0 0 48 16 0 12 13
managesieve 0.51 0.00 0.28 576 0 0 416 0 0 13 7
imap 4677.82 0.19 0.89 1480 0 0 488 16 0 23550 136
pop3-login 181.77 0.32 0.28 780 0 0 8 0 0 46 9
Date: Sun, 07 Nov 2010 21:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 303.62 1.10 2.20 26208 0 0 0 0 0 235 1362
ssl-params 301.12 0.00 0.12 467 0 0 8 0 0 7 1
pop3 0.26 0.40 0.36 618 0 0 0 8 0 5 18
managesieve 0.53 0.00 0.28 576 0 0 352 0 0 13 7
imap 2856.91 0.12 0.46 655 0 0 0 0 0 13465 105
pop3-login 228.84 0.76 0.24 779 0 0 0 0 0 90 14
Date: Sun, 07 Nov 2010 22:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 243.12 0.99 4.56 23762 0 0 0 0 0 222 1419
ssl-params 301.25 0.40 0.80 471 0 0 8 0 0 6 4
pop3 0.43 0.40 0.80 623 0 0 0 8 0 6 4
managesieve 0.43 0.00 0.20 577 0 0 536 0 0 14 7
imap 3.63 0.16 0.36 669 1 0 856 0 0 63 93
pop3-login 181.81 0.60 0.40 782 0 0 0 0 0 52 53
Date: Sun, 07 Nov 2010 23:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 232.83 1.72 2.26 25613 0 0 0 0 0 254 1175
ssl-params 149.63 0.00 0.16 469 0 0 40 0 0 10 1
pop3 25.68 0.28 0.12 741 0 0 9696 16 0 109 14
managesieve 0.25 0.40 0.28 576 0 0 8 0 0 11 2
imap 0.32 0.00 0.16 625 0 0 0 0 0 11 11
pop3-login 69.88 0.48 0.24 779 0 0 0 0 0 27 6
Date: Mon, 08 Nov 2010 00:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
managesieve 0.29 0.00 0.36 576 0 0 8 0 0 7 16
imap 0.58 0.80 0.36 637 0 0 0 0 0 10 9
Date: Mon, 08 Nov 2010 01:00:01 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 421.98 1.32 1.66 24216 0 0 0 0 0 267 723
ssl-params 538.47 0.00 0.16 470 0 0 8 0 0 7 1
pop3 0.19 0.80 0.96 671 1 0 760 8 0 22 221
managesieve 0.63 0.40 0.13 578 0 0 192 0 0 9 11
managesieve-lo 18616.9 86.77 32.49 319768 1 0 168 0 0 63448 48838
imap 42.28 0.00 0.52 676 0 0 0 0 0 182 63
pop3-login 361.68 0.76 0.40 780 0 0 0 0 0 52 2
Date: Mon, 08 Nov 2010 02:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 592.76 0.97 1.40 23640 8 0 8272 0 0 307 321
ssl-params 676.93 0.00 0.24 468 0 0 136 0 0 18 10
pop3 6.10 0.24 0.44 672 1 0 1688 8 0 35 40
managesieve 0.71 0.12 0.14 572 5 0 1632 0 0 17 38
managesieve-lo 1023.50 1.45 0.46 6066 1 0 184 0 0 1104 317
imap 0.87 0.80 0.44 674 1 0 640 0 0 18 25
pop3-login 361.79 0.72 0.32 780 0 0 112 0 0 56 11
Date: Mon, 08 Nov 2010 03:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 600.71 0.96 1.18 22902 0 0 56 0 0 235 212
ssl-params 242.66 0.00 0.16 471 0 0 136 0 0 15 6
pop3 0.20 0.16 0.28 675 1 0 1312 8 0 32 15
managesieve 0.46 0.00 0.36 580 1 0 480 0 0 11 12
managesieve-lo 286.21 0.60 0.44 1048 1 0 184 0 0 50 88
imap 0.34 0.80 0.28 675 0 0 0 0 0 9 20
pop3-login 361.43 0.84 0.32 780 0 0 112 0 0 55 13
Date: Mon, 08 Nov 2010 04:00:01 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 587.34 0.90 1.52 22075 14 0 14288 0 0 443 375
ssl-params 275.27 0.00 0.12 469 0 0 136 0 0 14 15
pop3 23.90 0.37 0.80 2906 5 0 36856 24 0 3885 136
managesieve 0.45 0.12 0.16 577 1 0 472 0 0 10 20
managesieve-lo 357.38 0.14 0.52 1271 1 0 184 0 0 100 28
imap 0.42 0.00 0.40 671 0 0 0 0 0 9 56
pop3-login 363.16 0.12 0.64 782 0 0 112 0 0 263 23
Date: Mon, 08 Nov 2010 05:00:01 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 1005.53 0.91 1.60 21583 0 0 0 0 0 238 92
ssl-params 172.31 0.40 0.80 469 0 0 0 0 0 6 3
pop3 0.11 0.16 0.28 667 0 0 0 8 0 11 18
managesieve 0.42 0.80 0.20 577 0 0 520 0 0 13 4
managesieve-lo 84.84 0.80 0.52 1049 0 0 0 0 0 45 11
imap 0.51 0.12 0.76 664 0 0 0 0 0 10 11
pop3-login 361.78 0.64 0.28 773 0 0 0 0 0 13 2
Date: Mon, 08 Nov 2010 06:00:01 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 467.35 0.16 0.17 797 0 0 0 0 0 8 9
ssl-params 273.33 0.00 0.16 470 0 0 0 0 0 5 3
pop3 0.14 0.12 0.36 667 0 0 0 8 0 14 1
managesieve 0.20 0.00 0.28 576 0 0 0 0 0 10 1
managesieve-lo 640.83 0.13 0.44 1272 0 0 0 0 0 87 17
imap 0.33 0.40 0.32 665 0 0 0 0 0 10 47
pop3-login 362.87 0.60 0.24 774 0 0 0 0 0 13 3
Date: Mon, 08 Nov 2010 07:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 468.95 0.96 1.34 21628 0 0 0 0 0 243 218
ssl-params 83.87 0.00 0.12 469 0 0 0 0 0 7 1
pop3 0.11 0.16 0.24 667 0 0 0 8 0 11 6
managesieve 0.83 0.00 0.88 577 0 0 536 0 0 9 12
managesieve-lo 231.47 0.18 0.96 1381 0 0 0 0 0 106 17
imap 0.71 0.12 0.80 665 0 0 0 0 0 10 47
pop3-login 292.13 0.56 0.28 779 0 0 0 0 0 46 7
Date: Mon, 08 Nov 2010 08:00:02 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 452.63 0.83 1.44 20188 0 0 16 0 0 286 121
ssl-params 289.68 0.00 0.80 470 0 0 136 0 0 14 6
pop3 1.37 0.80 0.17 681 1 0 1936 16 0 55 28
managesieve 0.24 0.40 0.20 580 0 0 256 0 0 14 5
managesieve-lo 1952.41 4.93 1.58 18647 0 0 8 0 0 3673 1339
imap 0.17 0.40 0.19 665 0 0 0 0 0 6 48
pop3-login 96.75 0.60 0.28 777 0 0 112 0 0 41 14
-- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de
On 11/7/2010 11:34 PM, Ralf Hildebrandt wrote:
Attached is my "dovecot -n" output.
Here's another attempt to insert lower supportive appendage into open breathing & consumption orifice.
Just comparing your doveconf -n output with mine...
Why does it appear you have multiple "listen" lines - and they're not IP specific? I don't have any at all.
Based on your original post, and inferring from your 1.2 config, it looks like you're trying to run the "high-performance" login mode. I only saw the service_count parm listed in your 2.0 config - maybe specifying a couple of the other settings would have an impact? Can you try removing your "service imap-login" section, or setting "service_count = 1", and running in the default "high-security" mode to see if that makes a difference?
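For reference, the two login modes should differ roughly like this in a 2.0 config (just a sketch, not your actual settings; the process_min_avail value is my assumption):

  # default "high-security" mode: a new imap-login process per connection
  service imap-login {
    service_count = 1
  }

or, for the "high-performance" mode:

  service imap-login {
    service_count = 0          # process keeps serving new connections
    process_min_avail = 4      # e.g. one per CPU core (assumption, tune as needed)
  }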
Daniel
- Daniel L. Miller dmiller@amfes.com:
On 11/7/2010 11:34 PM, Ralf Hildebrandt wrote:
Attached is my "dovecot -n" output.
Here's another attempt to insert lower supportive appendage into open breathing & consumption orifice.
Just comparing your doveconf -n output with mine...
- Why does it appear you have multiple "listen" lines - and they're not IP specific? I don't have any at all.
You mean those?
listen(default): *
listen(imap): *
listen(pop3): *
listen(managesieve): *:4190
- Based on your original post, and inferring from your 1.2 config, it looks like you're trying to run the "high-performance" login mode.
Yes
I only saw the service_count parm listed in your 2.0 config
1.2 was also using the "high-performance" login mode
- maybe specifying a couple of the other settings would have an impact? Can you try removing your "service imap-login" section, or setting "service_count = 1", and running in the default "high-security" mode to see if that makes a difference?
For now I had to turn 2.0 off because the load was getting uncomfortably high :)
-- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de
- Ralf Hildebrandt Ralf.Hildebrandt@charite.de:
I'm getting constantly high numbers of page reclaims & involuntary context switches for dovecot/auth.
Page reclaims = minor faults = the CPU switching back to system mode. But why is the auth process doing that so excessively? Same for the large number of involuntary context switches...
Some additions:
The last time we ran 2.0 we quickly got into big trouble, which could also be seen on the VMware ESX side. The CPU load was constantly around 95%, and inside the VM, top showed the processes mainly using kernel space (system load).
This morning we didn't see that high a load, although processes were still spending too much time in kernel space. But as long as the load doesn't get too high, the ESX side doesn't show any problems, and the stats went up and down (which they didn't do the last time we had the real problems; back then they just stayed flat at a high level...).
Of course we could test it during the midday peak, but then the mail system starts to stumble under the high load and users might complain. We also have no real test scenario, because it's not easy to put "real" pressure on the machine, so we have to test it in production. But I cannot switch 2.0 on permanently; that would cause too many problems.
Anyway, even when it runs without causing problems on the ESX side, we can see the processes in kernel space. They spend way too long there, and Ralf seems to have found the reason: too many page faults. That's all we can say for now.
Regards,
Udo
Udo Wolter Geschäftsbereich IT | Abt. System Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570847 | Fax +49 30 450 7570600 Udo.Wolter@Charite.de | http://www.charite.de
On 8.11.2010, at 7.34, Ralf Hildebrandt wrote:
I'm getting constantly high numbers of page reclaims & involuntary context switches for dovecot/auth.
Is it intentional that you have both passdb shadow and passdb pam enabled?
Page reclaims = minor faults = the CPU switching back to system mode. But why is the auth process doing that so excessively? Same for the large number of involuntary context switches...
I'm not sure. But:
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 205.93 1.32 2.76 25280 0 0 0 0 0 261 1425
It's still only 2.76 seconds of system CPU from the total of 205 seconds it was running, so that can't really cause the big system CPU usage you see.
ssl-params 288.71 0.40 0.48 468 0 0 8 0 0 8 1
pop3 0.59 0.80 0.20 651 0 0 48 16 0 12 13
managesieve 0.51 0.00 0.28 576 0 0 416 0 0 13 7
imap 4677.82 0.19 0.89 1480 0 0 488 16 0 23550 136
pop3-login 181.77 0.32 0.28 780 0 0 8 0 0 46 9
Same for the rest of these processes.
What about after you stop Dovecot? Then you should see lines from the permanently running processes (master, anvil, log).
- Timo Sirainen tss@iki.fi:
On 8.11.2010, at 7.34, Ralf Hildebrandt wrote:
I'm getting constantly high numbers of page reclaims & involuntary context switches for dovecot/auth.
Is it intentional that you have both passdb shadow and passdb pam enabled?
If I leave out one of them, the master user feature stops working. Which one should I be using?
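(For reference, the wiki's v2.0 master users setup uses a dedicated passdb marked master = yes alongside the regular passdb, roughly like this; the passwd-file path here is only a placeholder, not my actual config:)

  passdb {
    driver = passwd-file
    master = yes
    args = /etc/dovecot/master-users   # placeholder path
  }

  passdb {
    driver = pam
  }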
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 205.93 1.32 2.76 25280 0 0 0 0 0 261 1425
It's still only 2.76 seconds of system CPU from the total of 205 seconds it was running, so that can't really cause the big system CPU usage you see.
Please note that I was measuring this in the morning. I'll have to keep it running a whole day in order to get "critical" values. Then I'll suffer a beating from my colleagues for thrashing the server :)
What about after you stop Dovecot? Then you should see lines from the permanently running processes (master, anvil, log).
-- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de
On Mon, 2010-11-08 at 08:34 +0100, Ralf Hildebrandt wrote:
I'm getting constantly high numbers of page reclaims & involuntary context switches for dovecot/auth.
Page reclaims = minor faults = the CPU switching back to system mode. But why is the auth process doing that so excessively? Same for the large number of involuntary context switches...
Hmm. "A page reclaim occurs when a requested page exists on the free list. A page reclaim results in a page fault being satisfied in memory."
Date: Mon, 08 Nov 2010 01:00:01 +0100
type real user sys reclaim faults swaps bin bout signals volcs involcs
auth 421.98 1.32 1.66 24216 0 0 0 0 0 267 723
managesieve-lo 18616.9 86.77 32.49 319768 1 0 168 0 0 63448 48838
The managesieve-login is pretty high here too. Much worse than the auth process. Were there tons of logins at that time? Or some brute force password guessing, or some other DoSing? Perhaps the problem is actually managesieve-login alone? Did you also set service managesieve-login { service_count=0 }?
I'm currently seeing something like this on my test machine where imaptest is running:
type real user sys recla faults swaps bin bout signals volcs involcs
master 1252.14 0.58 2.70 138271 0 0 0 16 0 30101 69
anvil 1252.13 0.27 0.23 336 0 0 0 0 0 16739 6
imap 2.34 0.11 0.20 1548 0 0 0 856 0 22 144
imap-lo 0.11 0.00 0.80 622 0 0 0 0 0 6 19
auth 1248.45 1.26 0.91 841 16 0 3248 0 0 51559 118
log 1252.84 0.86 1.12 347 0 0 8 4560 0 47245 27
config 1252.12 13.57 0.59 1061 0 0 0 0 0 36574 727
lmtp 41.43 0.40 0.80 495 11 0 2328 0 0 25 14
The config process's high user CPU% is expected. Master is doing a lot of page reclaims, which I'd guess is because it's forking a lot.
On Mon, 2010-11-08 at 19:07 +0000, Timo Sirainen wrote:
I'm currently seeing something like this on my test machine where imaptest is running:
Well, that was stupid. Looks like I suck at Perl and couldn't write an actually working script. I was looking at the low user/sys CPU usages, but since they looked ok for some processes, I just thought that Dovecot is so awesomely fast elsewhere too :) So the problem was that it reported only the first process's values, which worked well for long-running processes but not for short-lived ones.
Attached a new script. Here's another output of what I get myself:
type real user sys reclaims faults swaps bin bout signals volcs involcs count
master 1252.14 0.58 2.70 138271 0 0 0 16 0 30101 69 1
anvil 1252.13 0.27 0.23 336 0 0 0 0 0 16739 6 1
imap 21566.3 2166.63 2718.56 17362595 124 0 14088 6512656 0 212187 771221 9224
imap-lo 2271.96 4332.52 3043.32 5730026 3 0 328 0 0 59963 187152 9220
auth 3585.92 3.52 2.58 1729 54 0 5904 0 0 145214 244 2
log 1252.84 0.86 1.12 347 0 0 8 4560 0 47245 27 1
config 3590.06 36.98 1.89 2739 366 0 20872 0 0 103559 2052 2
lmtp 604.16 2.16 1.04 2749 11 0 2328 264 0 235 42 4
Most of this data is from running "rm -rf mdbox;imaptest secs=2" in a loop, and it looks like imap-login was running with service_count=1.
- Timo Sirainen tss@iki.fi:
Attached a new script. Here's another output of what I get myself:
I switched back to 2.0 and will try to keep it running until 12:00 tomorrow.
Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de
- Timo Sirainen tss@iki.fi:
The managesieve-login is pretty high here too. Much worse than the auth process. Were there tons of logins at that time? Or some brute force password guessing, or some other DoSing?
No, the port is not reachable from the outside.
Perhaps the problem is actually managesieve-login alone? Did you also set service managesieve-login { service_count=0 }?
I'm currently seeing something like this on my test machine where imaptest is running:
type real user sys recla faults swaps bin bout signals volcs involcs
master 1252.14 0.58 2.70 138271 0 0 0 16 0 30101 69
anvil 1252.13 0.27 0.23 336 0 0 0 0 0 16739 6
imap 2.34 0.11 0.20 1548 0 0 0 856 0 22 144
imap-lo 0.11 0.00 0.80 622 0 0 0 0 0 6 19
auth 1248.45 1.26 0.91 841 16 0 3248 0 0 51559 118
log 1252.84 0.86 1.12 347 0 0 8 4560 0 47245 27
config 1252.12 13.57 0.59 1061 0 0 0 0 0 36574 727
lmtp 41.43 0.40 0.80 495 11 0 2328 0 0 25 14
The config process's high user CPU% is expected. Master is doing a lot of page reclaims, which I'd guess is because it's forking a lot.
-- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de