We are occasionally seeing the NFS server's load shoot past 60. (Normal load is below 1.0.)
I have been hunting this for a while, and I believe it comes down to "deliver".
System setup:
NFS servers: x4540, Solaris 10 x64, ZFS over NFS.
NFS clients: Solaris 10 x64, postfix-2.4.1 with dovecot-1.1.11 deliver.
When I check the per-process NFS statistics, I see 4 processes (in this case on vmx04) accounting for the majority of NFS ops:
root@vmx04:/var/tmp# ./nfsclientstats.pl
process  read  write  readdir  getattr  setattr  lookup  access  create  remove  rename  mkdir  rmdir
24303       0      0        1       19        0     190     171       0       0       0      0      0
24551       0      0        1       18        0     180     162       0       0       0      0      0
26099       0      0        1       18        0     180     162       0       0       0      0      0
295         0      0        1       18        0     180     162       0       0       0      0      0
6793        3      0        0        0        0       5       5       0       0       0      0      0
7234        0      1        0        2        0       9       9       0       1       0      0      0
Checking what these processes are doing, I find the following happening:
26099: getdents64(8, 0xCE7A4000, 8192) = 8136
26099: stat64("/export/censored/mail/cur/1223013930.V4700010I69f93eM483098.vmx02.unix:2,S", 0x08047930) = 0
26099: stat64("/export/censored/mail/cur/1222066290.V4700007I67562bM241839.vmx04.unix:2,S", 0x08047930) = 0
26099: stat64("/export/censored/mail/cur/1225325373.V4700008I94a1f9M286935.vmx04.unix:2,S", 0x08047930) = 0
26099: stat64("/export/censored/mail/cur/1236170307.V4700002I67ca03M310418.vmx06.unix:2,", 0x08047930) = 0
26099: stat64("/export/censored/mail/cur/1223581462.V4700011I69ffd6M720814.vmx02.unix:2,S", 0x08047930) = 0
Very well, so it is rebuilding the dovecot.index, or recalculating the user's quota usage.
Is the directory large?
root@vmx04# ls -l /export/censored/mail/cur/|wc -l
199626
You bet! But what is annoying is that when I also check processes 24303, 24551 and 295, they are scanning the SAME user's directory:
295: stat64("/export/censored/mail/cur/1230544947.V4700004I11d1d7fM492433.vmx04.unix:2,S", 0x08047930) = 0
295: stat64("/export/censored/mail/cur/1223003964.V4700007I68932dM763546.vmx04.unix:2,S", 0x08047930) = 0
So, in vmx04 we have 4 processes working in one user's giant directory, and on the other vmx clients, many more.
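For anyone wanting to repeat this check without eyeballing the traces, here is a rough sketch of how the truss output could be grouped by directory. It assumes the lines carry the "PID: stat64(...)" prefix shown above; the function names are my own and not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches truss lines such as: 295: stat64("/path/cur/name", 0x08047930) = 0
STAT_LINE = re.compile(r'^\s*(\d+):\s+stat64\("([^"]+)"')


def pids_by_directory(truss_lines):
    """Map each directory to the set of PIDs seen calling stat64() in it."""
    seen = defaultdict(set)
    for line in truss_lines:
        match = STAT_LINE.match(line)
        if match:
            pid, path = match.groups()
            # Strip the file name to get the directory being scanned.
            seen[path.rsplit("/", 1)[0]].add(int(pid))
    return dict(seen)


def contended(truss_lines):
    """Return only the directories scanned by more than one process."""
    return {
        directory: sorted(pids)
        for directory, pids in pids_by_directory(truss_lines).items()
        if len(pids) > 1
    }
```

Feeding it the combined truss output of the busy processes should list each maildir together with every PID walking it, which makes duplicated scans easy to spot.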
Could 're-computing dovecot.index' be done such that the first "deliver" process takes a lock to do the work, and subsequent deliver processes return temporary failures until the work has finished?
Has this already been addressed in dovecot-1.1.18?
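To make the idea concrete, a minimal sketch of the behaviour I have in mind follows. This is not how dovecot works internally: the lock file name "rescan.lock" and the functions are hypothetical, and it assumes exclusive file creation (O_EXCL) is atomic on the NFS mount. Exit code 75 is EX_TEMPFAIL from sysexits.h, which postfix treats as a deferral.

```python
import os
from contextlib import contextmanager

EX_TEMPFAIL = 75  # sysexits.h: temporary failure, the MTA should retry later


class ScanInProgress(Exception):
    """Another process already holds the rescan lock."""


@contextmanager
def rescan_lock(maildir):
    # Hypothetical lock file name; not a file dovecot itself uses.
    lock_path = os.path.join(maildir, "rescan.lock")
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        raise ScanInProgress(lock_path)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def deliver(maildir, rescan):
    """Return 0 on success, EX_TEMPFAIL if another process is rescanning."""
    try:
        with rescan_lock(maildir):
            rescan(maildir)
    except ScanInProgress:
        return EX_TEMPFAIL
    return 0
```

Only the first process walks the 200k-entry directory; the others defer immediately and postfix retries them later. The sketch does not deal with stale locks left behind by a crashed process, which a real implementation would need to handle.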
Advice please.
Lund
-- 
Jorgen Lundman       | lundman@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)