Btrfs RAID-10 performance
Hello,
I sent this to the Linux kernel Btrfs mailing list and got the reply: "RAID-1 would be preferable" (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lechevalier.se/T/). May I ask you for comments, as people from the Dovecot community?
We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It takes about 50 minutes to finish.
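For reference, the weekly scrub job could look roughly like the crontab entry below; the schedule, log path and use of root's crontab are assumptions, only the scrub command itself is quoted from above:

  # m h dom mon dow   command   (root's crontab; Sunday 03:00 is an assumed time)
  0 3 * * 0   btrfs scrub start -B -d /data >> /var/log/btrfs-scrub.log 2>&1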
# uname -a
Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
The RAID is composed of 16 hard drives. The drives are connected via an AVAGO MegaRAID SAS 9361-8i, each exported as a single-drive RAID-0 device. All of them are SAS 2.5" 15k drives.
The server acts as an IMAP server, running Dovecot 2.2.27-3+deb9u6 with 4104 accounts, Mailbox format, LMTP delivery.
We run 'rsync' to a remote NAS daily. It takes about 6.5 hours to finish; 12'265'387 files last night.
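A nightly job of the kind described might look like the sketch below; the NAS hostname, target path and exact rsync options are assumptions, not taken from the original post:

  # mirror the mail store to the backup NAS, preserving hard links and ownership
  rsync -aH --delete --numeric-ids /data/vmail/ backup-nas:/backups/imap/vmail/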
Over the last half year, we have run into performance trouble. Server load grows up to 30 during rush hours, due to IO waits. We tried to attach additional hard drives (the 838G ones in the list below) and increase free space by rebalancing. I think it helped a little bit, but not dramatically.
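The drive-attach and rebalance step just described corresponds roughly to the commands below; the device names match the two 838G drives listed later, but the balance filters are an assumption (a plain 'btrfs balance start /data' rewrites every chunk):

  # add the two new drives to the existing filesystem, then spread data onto them
  btrfs device add /dev/sdq /dev/sdr /data
  btrfs balance start -dusage=75 -musage=75 /data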
Is this a reasonable setup and use case for btrfs RAID-10? If so, are there some recommendations to achieve better performance?
Thank you. With kind regards Milo
# megaclisas-status
-- Controller information --
-- ID | H/W Model | RAM | Temp | BBU | Firmware
c0 | AVAGO MegaRAID SAS 9361-8i | 1024MB | 72C | Good | FW: 24.16.0-0082
-- Array information --
-- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Path | CacheCade | InProgress
c0u0 | RAID-0 | 838G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdq | None | None
c0u1 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sda | None | None
c0u2 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdb | None | None
c0u3 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdc | None | None
c0u4 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdd | None | None
c0u5 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sde | None | None
c0u6 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdf | None | None
c0u7 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdg | None | None
c0u8 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdh | None | None
c0u9 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdi | None | None
c0u10 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdj | None | None
c0u11 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdk | None | None
c0u12 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdl | None | None
c0u13 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdm | None | None
c0u14 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdn | None | None
c0u15 | RAID-0 | 838G | 256 KB | RA,WB | Enabled | Optimal | /dev/sdr | None | None
-- Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID
c0u0p0 | HDD | SEAGATE ST900MP0006 N003WAG0Q3S3 | 837.8 Gb | Online, Spun Up | 12.0Gb/s | 53C | [8:14] | 32
c0u1p0 | HDD | HGST HUC156060CSS200 A3800XV250TJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 38C | [8:0] | 12
c0u2p0 | HDD | HGST HUC156060CSS200 A3800XV3XT4J | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 43C | [8:1] | 11
c0u3p0 | HDD | HGST HUC156060CSS200 ADB05ZG4XLZU | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 46C | [8:2] | 25
c0u4p0 | HDD | HGST HUC156060CSS200 A3800XV3DWRL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 48C | [8:3] | 14
c0u5p0 | HDD | HGST HUC156060CSS200 A3800XV3XZTL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 52C | [8:4] | 18
c0u6p0 | HDD | HGST HUC156060CSS200 A3800XV3VSKJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C | [8:5] | 15
c0u7p0 | HDD | SEAGATE ST600MP0006 N003WAF1LWKE | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 56C | [8:6] | 28
c0u8p0 | HDD | HGST HUC156060CSS200 A3800XV3XTDJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C | [8:7] | 20
c0u9p0 | HDD | HGST HUC156060CSS200 A3800XV3T8XL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 57C | [8:8] | 19
c0u10p0 | HDD | HGST HUC156060CSS200 A7030XHL0ZYP | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 61C | [8:9] | 23
c0u11p0 | HDD | HGST HUC156060CSS200 ADB05ZG4VR3P | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 60C | [8:10] | 24
c0u12p0 | HDD | SEAGATE ST600MP0006 N003WAF195KA | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 60C | [8:11] | 29
c0u13p0 | HDD | SEAGATE ST600MP0006 N003WAF1LTZW | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 56C | [8:12] | 26
c0u14p0 | HDD | SEAGATE ST600MP0006 N003WAF1LWH6 | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C | [8:13] | 27
c0u15p0 | HDD | SEAGATE ST900MP0006 N003WAG0Q414 | 837.8 Gb | Online, Spun Up | 12.0Gb/s | 47C | [8:15] | 33
# btrfs --version
btrfs-progs v4.7.3
# btrfs fi show
Label: 'DATA'  uuid: 5b285a46-e55d-4191-924f-0884fa06edd8
        Total devices 16 FS bytes used 3.49TiB
        devid  1 size 558.41GiB used 448.66GiB path /dev/sda
        devid  2 size 558.41GiB used 448.66GiB path /dev/sdb
        devid  4 size 558.41GiB used 448.66GiB path /dev/sdd
        devid  5 size 558.41GiB used 448.66GiB path /dev/sde
        devid  7 size 558.41GiB used 448.66GiB path /dev/sdg
        devid  8 size 558.41GiB used 448.66GiB path /dev/sdh
        devid  9 size 558.41GiB used 448.66GiB path /dev/sdf
        devid 10 size 558.41GiB used 448.66GiB path /dev/sdi
        devid 11 size 558.41GiB used 448.66GiB path /dev/sdj
        devid 13 size 558.41GiB used 448.66GiB path /dev/sdk
        devid 14 size 558.41GiB used 448.66GiB path /dev/sdc
        devid 15 size 558.41GiB used 448.66GiB path /dev/sdl
        devid 16 size 558.41GiB used 448.66GiB path /dev/sdm
        devid 17 size 558.41GiB used 448.66GiB path /dev/sdn
        devid 18 size 837.84GiB used 448.66GiB path /dev/sdr
        devid 19 size 837.84GiB used 448.66GiB path /dev/sdq
# btrfs fi df /data/
Data, RAID10: total=3.48TiB, used=3.47TiB
System, RAID10: total=256.00MiB, used=320.00KiB
Metadata, RAID10: total=21.00GiB, used=18.17GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
I do not attach the whole dmesg log; it is almost empty, without errors. The only BTRFS lines are about relocations, like:
BTRFS info (device sda): relocating block group 29435663220736 flags 65
BTRFS info (device sda): found 54460 extents
BTRFS info (device sda): found 54459 extents
On 7. Sep 2020, at 12.38, Miloslav Hůla <miloslav.hula@gmail.com> wrote:
Hello,
I sent this into the Linux Kernel Btrfs mailing list and I got reply: "RAID-1 would be preferable" (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...). May I ask you for the comments as from people around the Dovecot?
We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It takes about 50 minutes to finish.
# uname -a Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
RAID is a composition of 16 harddrives. Harddrives are connected via AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS 2.5" 15k drives.
Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Mailbox format, LMTP delivery.
does "Mailbox format" mean mbox?
If so, then there is your bottleneck. mbox is the slowest possible mailbox format there is.
Sami
On 07.09.2020 at 12:43, Sami Ketola wrote:
On 7. Sep 2020, at 12.38, Miloslav Hůla <miloslav.hula@gmail.com> wrote:
Hello,
I sent this into the Linux Kernel Btrfs mailing list and I got reply: "RAID-1 would be preferable" (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...). May I ask you for the comments as from people around the Dovecot?
We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It takes about 50 minutes to finish.
# uname -a Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
RAID is a composition of 16 harddrives. Harddrives are connected via AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS 2.5" 15k drives.
Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Mailbox format, LMTP delivery.
does "Mailbox format" mean mbox?
If so, then there is your bottleneck. mbox is the slowest possible mailbox format there is.
Sami
Sorry, no, it is a typo. We are using "Maildir".
"doveconf -a" attached
Milo
# 2.2.27 (c0f36b0): /etc/dovecot/dovecot.conf # Pigeonhole version 0.4.16 (fed8554) # OS: Linux 4.9.0-12-amd64 x86_64 Debian 9.13 # NOTE: Send doveconf -n output instead when asking for help. auth_anonymous_username = anonymous auth_cache_negative_ttl = 30 secs auth_cache_size = 100 M auth_cache_ttl = 30 secs auth_debug = no auth_debug_passwords = no auth_default_realm = auth_failure_delay = 2 secs auth_gssapi_hostname = auth_krb5_keytab = auth_master_user_separator = auth_mechanisms = plain auth_policy_hash_mech = sha256 auth_policy_hash_nonce = auth_policy_hash_truncate = 12 auth_policy_reject_on_fail = no auth_policy_request_attributes = login=%{orig_username} pwhash=%{hashed_password} remote=%{real_rip} auth_policy_server_api_header = auth_policy_server_timeout_msecs = 2000 auth_policy_server_url = auth_proxy_self = auth_realms = auth_socket_path = auth-userdb auth_ssl_require_client_cert = no auth_ssl_username_from_cert = no auth_stats = no auth_use_winbind = no auth_username_chars = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890.-_@ auth_username_format = %Lu auth_username_translation = auth_verbose = no auth_verbose_passwords = no auth_winbind_helper_path = /usr/bin/ntlm_auth auth_worker_max_count = 30 base_dir = /var/run/dovecot config_cache_size = 1 M debug_log_path = default_client_limit = 1000 default_idle_kill = 1 mins default_internal_user = dovecot default_login_user = dovenull default_process_limit = 100 default_vsz_limit = 256 M deliver_log_format = msgid=%m: %$ dict_db_config = director_consistent_hashing = no director_doveadm_port = 0 director_flush_socket = director_mail_servers = director_servers = director_user_expire = 15 mins director_user_kick_delay = 2 secs director_username_hash = %u disable_plaintext_auth = yes dotlock_use_excl = yes doveadm_allowed_commands = doveadm_api_key = doveadm_password = doveadm_port = 0 doveadm_socket_path = doveadm-server doveadm_username = doveadm doveadm_worker_count = 0 dsync_alt_char = _ dsync_features = dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u -U first_valid_gid = 1 first_valid_uid = 109 haproxy_timeout = 3 secs haproxy_trusted_networks = hostname = imap_capability = imap_client_workarounds = imap_hibernate_timeout = 0 imap_id_log = * imap_id_send = name * imap_idle_notify_interval = 2 mins imap_logout_format = in=%i out=%o imap_max_line_length = 64 k imap_metadata = no imap_urlauth_host = imap_urlauth_logout_format = in=%i out=%o imap_urlauth_port = 143 imapc_cmd_timeout = 5 mins imapc_features = imapc_host = imapc_list_prefix = imapc_master_user = imapc_max_idle_time = 29 mins imapc_max_line_length = 0 imapc_password = imapc_port = 143 imapc_rawlog_dir = imapc_sasl_mechanisms = imapc_ssl = no imapc_ssl_verify = yes imapc_user = import_environment = TZ CORE_OUTOFMEM CORE_ERROR LISTEN_PID LISTEN_FDS info_log_path = instance_name = dovecot last_valid_gid = 0 last_valid_uid = 0 lda_mailbox_autocreate = no lda_mailbox_autosubscribe = no lda_original_recipient_header = libexec_dir = /usr/lib/dovecot listen = *, :: lmtp_address_translate = lmtp_hdr_delivery_address = final lmtp_proxy = no lmtp_rcpt_check_quota = no lmtp_save_to_detail_mailbox = no lmtp_user_concurrency_limit = 0 lock_method = fcntl log_path = syslog log_timestamp = "%b %d %H:%M:%S " login_access_sockets = login_greeting = Dovecot ready. 
login_log_format = %$: %s login_log_format_elements = user=<%u> method=%m rip=%r lip=%l mpid=%e %c session=<%{session}> login_plugin_dir = /usr/lib/dovecot/modules/login login_plugins = login_proxy_max_disconnect_delay = 0 login_source_ips = login_trusted_networks = mail_access_groups = mail_always_cache_fields = mail_attachment_dir = mail_attachment_fs = sis posix mail_attachment_hash = %{sha1} mail_attachment_min_size = 128 k mail_attribute_dict = mail_cache_fields = flags mail_cache_min_mail_count = 0 mail_chroot = mail_debug = no mail_fsync = optimized mail_full_filesystem_access = no mail_gid = vmail mail_home = /data/vmail/user/%n mail_location = maildir:/data/vmail/user/%n/Maildir mail_log_prefix = "%s(%u): " mail_max_keyword_length = 50 mail_max_lock_timeout = 0 mail_max_userip_connections = 10 mail_never_cache_fields = imap.envelope mail_nfs_index = no mail_nfs_storage = no mail_plugin_dir = /usr/lib/dovecot/modules mail_plugins = mail_prefetch_count = 0 mail_privileged_group = mail_save_crlf = yes mail_server_admin = mail_server_comment = mail_shared_explicit_inbox = no mail_temp_dir = /tmp mail_temp_scan_interval = 1 weeks mail_uid = vmail mailbox_idle_check_interval = 30 secs mailbox_list_index = no mailbox_list_index_very_dirty_syncs = no maildir_broken_filename_sizes = no maildir_copy_with_hardlinks = yes maildir_empty_new = no maildir_stat_dirs = no maildir_very_dirty_syncs = no managesieve_client_workarounds = managesieve_implementation_string = Dovecot Pigeonhole managesieve_logout_format = bytes=%i/%o managesieve_max_compile_errors = 5 managesieve_max_line_length = 65536 managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext vacation-seconds imapflags notify master_user_separator = mbox_dirty_syncs = yes mbox_dotlock_change_timeout = 2 mins mbox_lazy_writes = yes mbox_lock_timeout = 5 mins mbox_md5 = apop3d mbox_min_index_size = 0 mbox_read_locks = fcntl mbox_very_dirty_syncs = no mbox_write_locks = fcntl dotlock mdbox_preallocate_space = no mdbox_purge_preserve_alt = no mdbox_rotate_interval = 0 mdbox_rotate_size = 2 M mmap_disable = no namespace { disabled = no hidden = no ignore_on_failure = no inbox = no list = children location = maildir:/data/vmail/user/%%n/Maildir:INDEXPVT=/data/vmail/user/%n/Maildir/Shared/%%n:SUBSCRIPTIONS=../../%n/Maildir/Shared/subscriptions-%%n order = 0 prefix = user.%%n. separator = . 
subscriptions = yes type = shared } namespace inbox { disabled = no hidden = no ignore_on_failure = no inbox = yes list = yes location = mailbox Archives { auto = no autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Archive } mailbox Drafts { auto = subscribe autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Drafts } mailbox Junk { auto = no autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Junk } mailbox Sent { auto = subscribe autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Sent } mailbox "Sent Messages" { auto = no autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Sent } mailbox Trash { auto = subscribe autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Trash } mailbox spam { auto = subscribe autoexpunge = 0 autoexpunge_max_mails = 0 comment = driver = special_use = \Junk } order = 0 prefix = INBOX. separator = . subscriptions = yes type = private } passdb { args = /etc/dovecot/deny-users auth_verbose = default default_fields = deny = yes driver = passwd-file master = no name = override_fields = pass = no result_failure = continue result_internalfail = continue result_success = return-ok skip = never } passdb { args = /etc/dovecot/dovecot-ldap.conf.ext auth_verbose = default default_fields = deny = no driver = ldap master = no name = override_fields = pass = no result_failure = continue result_internalfail = continue result_success = return-ok skip = never } plugin { acl = vfile acl_shared_dict = file:/data/vmail/global/shared-mailboxes sieve = file:/data/vmail/user/%n/sieve;active=/data/vmail/user/%n/enabled.sieve sieve_extensions = +notify +imapflags -imap4flags +vacation-seconds sieve_vacation_min_period = 0 } pop3_client_workarounds = pop3_delete_type = default pop3_deleted_flag = pop3_enable_last = no pop3_fast_size_lookups = no pop3_lock_session = no pop3_logout_format = top=%t/%p, retr=%r/%b, del=%d/%m, size=%s pop3_no_flag_updates = no pop3_reuse_xuidl = no pop3_save_uidl = no pop3_uidl_duplicates = allow pop3_uidl_format = %v.%u pop3c_host = pop3c_master_user = pop3c_password = pop3c_port = 110 pop3c_quick_received_date = no pop3c_rawlog_dir = pop3c_ssl = no pop3c_ssl_verify = yes pop3c_user = %u postmaster_address = postmaster@%d protocols = " imap lmtp sieve pop3" quota_full_tempfail = no rawlog_dir = recipient_delimiter = + rejection_reason = Your message to <%t> was automatically rejected:%n%r rejection_subject = Rejected: %s replication_dsync_parameters = -d -N -l 30 -U replication_full_sync_interval = 1 days replication_max_conns = 10 replicator_host = replicator replicator_port = 0 sendmail_path = /usr/sbin/sendmail service aggregator { chroot = . 
client_limit = 0 drop_priv_before_exec = no executable = aggregator extra_groups = fifo_listener replication-notify-fifo { group = mode = 0600 user = } group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = unix_listener replication-notify { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service anvil { chroot = empty client_limit = 0 drop_priv_before_exec = no executable = anvil extra_groups = group = idle_kill = 4294967295 secs privileged_group = process_limit = 1 process_min_avail = 1 protocol = service_count = 0 type = anvil unix_listener anvil-auth-penalty { group = mode = 0600 user = } unix_listener anvil { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service auth-worker { chroot = client_limit = 1 drop_priv_before_exec = no executable = auth -w extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 1 type = unix_listener auth-worker { group = mode = 0600 user = $default_internal_user } user = vsz_limit = 18446744073709551615 B } service auth { chroot = client_limit = 0 drop_priv_before_exec = no executable = auth extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener auth-client { group = mode = 0600 user = $default_internal_user } unix_listener auth-login { group = mode = 0600 user = $default_internal_user } unix_listener auth-master { group = mode = 0600 user = } unix_listener auth-userdb { group = mode = 0666 user = $default_internal_user } unix_listener login/login { group = mode = 0666 user = } unix_listener token-login/tokenlogin { group = mode = 0666 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service config { chroot = client_limit = 0 drop_priv_before_exec = no executable = config extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = config unix_listener config { group = mode = 0600 user = } user = vsz_limit = 18446744073709551615 B } service dict-async { chroot = client_limit = 0 drop_priv_before_exec = no executable = dict extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = unix_listener dict-async { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service dict { chroot = client_limit = 1 drop_priv_before_exec = no executable = dict extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = unix_listener dict { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service director { chroot = . 
client_limit = 0 drop_priv_before_exec = no executable = director extra_groups = fifo_listener login/proxy-notify { group = mode = 00 user = } group = idle_kill = 4294967295 secs inet_listener { address = haproxy = no port = 0 reuse_port = no ssl = no } privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener director-admin { group = mode = 0600 user = } unix_listener director-userdb { group = mode = 0600 user = } unix_listener login/director { group = mode = 00 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service dns_client { chroot = client_limit = 1 drop_priv_before_exec = no executable = dns-client extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = unix_listener dns-client { group = mode = 0666 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service doveadm { chroot = client_limit = 1 drop_priv_before_exec = no executable = doveadm-server extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 1 type = unix_listener doveadm-server { group = mode = 0600 user = } user = vsz_limit = 18446744073709551615 B } service imap-hibernate { chroot = client_limit = 0 drop_priv_before_exec = no executable = imap-hibernate extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = imap service_count = 0 type = unix_listener imap-hibernate { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service imap-login { chroot = login client_limit = 0 drop_priv_before_exec = no executable = imap-login extra_groups = group = idle_kill = 0 inet_listener imap { address = haproxy = no port = 143 reuse_port = no ssl = no } inet_listener imaps { address = haproxy = no port = 993 reuse_port = no ssl = yes } privileged_group = process_limit = 0 process_min_avail = 10 protocol = imap service_count = 0 type = login user = $default_login_user vsz_limit = 18446744073709551615 B } service imap-urlauth-login { chroot = token-login client_limit = 0 drop_priv_before_exec = no executable = imap-urlauth-login extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = imap service_count = 1 type = login unix_listener imap-urlauth { group = mode = 0666 user = } user = $default_login_user vsz_limit = 18446744073709551615 B } service imap-urlauth-worker { chroot = client_limit = 1 drop_priv_before_exec = no executable = imap-urlauth-worker extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1024 process_min_avail = 0 protocol = imap service_count = 1 type = unix_listener imap-urlauth-worker { group = mode = 0600 user = $default_internal_user } user = vsz_limit = 18446744073709551615 B } service imap-urlauth { chroot = client_limit = 1 drop_priv_before_exec = no executable = imap-urlauth extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1024 process_min_avail = 0 protocol = imap service_count = 1 type = unix_listener token-login/imap-urlauth { group = mode = 0666 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service imap { chroot = client_limit = 1 drop_priv_before_exec = no executable = imap extra_groups = group = idle_kill = 0 privileged_group = process_limit = 2048 process_min_avail = 0 protocol = imap service_count = 1 type = unix_listener imap-master { group = 
mode = 0600 user = } unix_listener login/imap { group = mode = 0666 user = } user = vsz_limit = 18446744073709551615 B } service indexer-worker { chroot = client_limit = 1 drop_priv_before_exec = no executable = indexer-worker extra_groups = group = idle_kill = 0 privileged_group = process_limit = 10 process_min_avail = 0 protocol = service_count = 0 type = unix_listener indexer-worker { group = mode = 0600 user = $default_internal_user } user = vsz_limit = 18446744073709551615 B } service indexer { chroot = client_limit = 0 drop_priv_before_exec = no executable = indexer extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener indexer { group = mode = 0666 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service ipc { chroot = empty client_limit = 0 drop_priv_before_exec = no executable = ipc extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener ipc { group = mode = 0600 user = } unix_listener login/ipc-proxy { group = mode = 0600 user = $default_login_user } user = $default_internal_user vsz_limit = 18446744073709551615 B } service lmtp { chroot = client_limit = 1 drop_priv_before_exec = no executable = lmtp extra_groups = group = idle_kill = 0 privileged_group = process_limit = 120 process_min_avail = 15 protocol = lmtp service_count = 0 type = unix_listener /var/spool/postfix/private/dovecot-lmtp { group = postfix mode = 0600 user = postfix } unix_listener lmtp { group = mode = 0666 user = } user = vmail vsz_limit = 18446744073709551615 B } service log { chroot = client_limit = 0 drop_priv_before_exec = no executable = log extra_groups = group = idle_kill = 4294967295 secs privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = log unix_listener log-errors { group = mode = 0600 user = } user = vsz_limit = 18446744073709551615 B } service managesieve-login { chroot = login client_limit = 0 drop_priv_before_exec = no executable = managesieve-login extra_groups = group = idle_kill = 0 inet_listener sieve { address = 127.0.0.1 haproxy = no port = 4190 reuse_port = no ssl = no } privileged_group = process_limit = 0 process_min_avail = 0 protocol = sieve service_count = 1 type = login user = $default_login_user vsz_limit = 18446744073709551615 B } service managesieve { chroot = client_limit = 1 drop_priv_before_exec = no executable = managesieve extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = sieve service_count = 1 type = unix_listener login/sieve { group = mode = 0666 user = } user = vsz_limit = 18446744073709551615 B } service pop3-login { chroot = login client_limit = 0 drop_priv_before_exec = no executable = pop3-login extra_groups = group = idle_kill = 0 inet_listener pop3 { address = haproxy = no port = 110 reuse_port = no ssl = no } inet_listener pop3s { address = haproxy = no port = 995 reuse_port = no ssl = yes } privileged_group = process_limit = 0 process_min_avail = 0 protocol = pop3 service_count = 1 type = login user = $default_login_user vsz_limit = 18446744073709551615 B } service pop3 { chroot = client_limit = 1 drop_priv_before_exec = no executable = pop3 extra_groups = group = idle_kill = 0 privileged_group = process_limit = 1024 process_min_avail = 0 protocol = pop3 service_count = 1 type = unix_listener login/pop3 { group = mode = 0666 user = } user = vsz_limit = 
18446744073709551615 B } service replicator { chroot = client_limit = 0 drop_priv_before_exec = no executable = replicator extra_groups = group = idle_kill = 4294967295 secs privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener replicator-doveadm { group = mode = 00 user = $default_internal_user } unix_listener replicator { group = mode = 0600 user = $default_internal_user } user = vsz_limit = 18446744073709551615 B } service ssl-params { chroot = client_limit = 0 drop_priv_before_exec = no executable = ssl-params extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = startup unix_listener login/ssl-params { group = mode = 0666 user = } unix_listener ssl-params { group = mode = 0666 user = } user = vsz_limit = 18446744073709551615 B } service stats { chroot = empty client_limit = 0 drop_priv_before_exec = no executable = stats extra_groups = fifo_listener stats-mail { group = mode = 0600 user = } fifo_listener stats-user { group = mode = 0600 user = } group = idle_kill = 4294967295 secs privileged_group = process_limit = 1 process_min_avail = 0 protocol = service_count = 0 type = unix_listener stats { group = mode = 0600 user = } user = $default_internal_user vsz_limit = 18446744073709551615 B } service tcpwrap { chroot = client_limit = 1 drop_priv_before_exec = no executable = tcpwrap extra_groups = group = idle_kill = 0 privileged_group = process_limit = 0 process_min_avail = 0 protocol = service_count = 0 type = user = $default_internal_user vsz_limit = 18446744073709551615 B } shutdown_clients = yes ssl = required ssl_ca = ssl_cert = </etc/dovecot/private/imap.chained.crt ssl_cert_username_field = commonName ssl_cipher_list = ALL:!LOW:!SSLv2:!EXP:!aNULL ssl_client_ca_dir = /etc/ssl/certs ssl_client_ca_file = ssl_client_cert = ssl_client_key = ssl_crypto_device = ssl_dh_parameters_length = 1024 ssl_key = # hidden, use -P to show it ssl_key_password = ssl_options = ssl_parameters_regenerate = 0 ssl_prefer_server_ciphers = no ssl_protocols = !SSLv3 ssl_require_crl = yes ssl_verify_client_cert = no state_dir = /var/lib/dovecot stats_carbon_interval = 30 secs stats_carbon_name = stats_carbon_server = stats_command_min_time = 1 mins stats_domain_min_time = 12 hours stats_ip_min_time = 12 hours stats_memory_limit = 16 M stats_session_min_time = 15 mins stats_user_min_time = 1 hours submission_host = syslog_facility = mail userdb { args = username_format=%n /data/vmail/global/users auth_verbose = default default_fields = home=/data/vmail/user/%n uid=vmail gid=vmail driver = passwd-file name = override_fields = result_failure = continue result_internalfail = continue result_success = return-ok skip = never } valid_chroot_dirs = verbose_proctitle = no verbose_ssl = no version_ignore = no protocol lmtp { mail_plugins = " sieve" postmaster_address = postmaster@... } protocol imap { mail_plugins = " acl imap_acl" }
Here are a few tips:
- I assume that's a 2U chassis with 24 bays. You only have one RAID card for all 24 disks? Granted, you only have 16, but usually you should assign one card per 8 drives. In our standard 2U chassis we have three HBAs, one per 8 drives. Your backplane should support that.
- Add more drives.
- Get a PCIe NVMe SSD card and move the index/control/sieve files there (see the sketch below).
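As a rough illustration of that last tip, Dovecot can keep index and control files on a different filesystem via mail_location parameters; the /ssd paths below are assumptions, only the INDEX=/CONTROL= syntax is standard:

  mail_location = maildir:/data/vmail/user/%n/Maildir:INDEX=/ssd/dovecot/index/%n:CONTROL=/ssd/dovecot/control/%n
  # sieve scripts could similarly be pointed at the SSD in the plugin {} block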
On Monday, 07/09/2020 at 08:16 Miloslav Hůla wrote:
On 07.09.2020 at 12:43, Sami Ketola wrote:
On 7. Sep 2020, at 12.38, Miloslav Hůla wrote:
Hello,
I sent this into the Linux Kernel Btrfs mailing list and I got reply: "RAID-1 would be preferable" (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...). May I ask you for the comments as from people around the Dovecot?
We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It takes about 50 minutes to finish.
# uname -a Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
RAID is a composition of 16 harddrives. Harddrives are connected via AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS 2.5" 15k drives.
Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Mailbox format, LMTP delivery.
does "Mailbox format" mean mbox?
If so, then there is your bottleneck. mbox is the slowest possible mailbox format there is.
Sami
Sorry, no, it is a typo. We are using "Maildir".
"doveconf -a" attached
Milo
Thanks for the tips!
On 07.09.2020 at 15:24, Scott Q. wrote:
- I assume that's a 2U format -24 bays. You only have 1 raid card for all 24 disks ? Granted you only have 16, but usually you should assign 1 card per 8 drives. In our standard 2U chassis we have 3 hba's per 8 drives. Your backplane should support that.
Exactly. And what's the reason/bottleneck? PCIe or card throughput?
- Add more drives
We can add two more drives, and we actually did yesterday, but we keep free slots so that we can replace drives with double-capacity ones.
- Get a pci nvme ssd card and move the indexes/control/sieve files there.
It complicates current backup and restore a little bit, but I'll probably try that.
Thank you, Milo
On Monday, 07/09/2020 at 08:16 Miloslav Hůla wrote:
Dne 07.09.2020 v 12:43 Sami Ketola napsal(a): >> On 7. Sep 2020, at 12.38, Miloslav Hůla <miloslav.hula@gmail.com <mailto:miloslav.hula@gmail.com>> wrote: >> >> Hello, >> >> I sent this into the Linux Kernel Btrfs mailing list and I got reply: "RAID-1 would be preferable" (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lechevalier.se/T/). May I ask you for the comments as from people around the Dovecot? >> >> >> We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It takes about 50 minutes to finish. >> >> # uname -a >> Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux >> >> RAID is a composition of 16 harddrives. Harddrives are connected via AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS 2.5" 15k drives. >> >> Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Mailbox format, LMTP delivery. > > does "Mailbox format" mean mbox? > > If so, then there is your bottleneck. mbox is the slowest possible mailbox format there is. > > Sami Sorry, no, it is a typo. We are using "Maildir". "doveconf -a" attached Milo
"Miloslav" == Miloslav Hůla <miloslav.hula@gmail.com> writes:
Miloslav> Hello,

Miloslav> I sent this into the Linux Kernel Btrfs mailing list and I got reply:
Miloslav> "RAID-1 would be preferable"
Miloslav> (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...).
Miloslav> May I ask you for the comments as from people around the Dovecot?

Miloslav> We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro
Miloslav> server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM.
Miloslav> We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It
Miloslav> takes about 50 minutes to finish.

Miloslav> # uname -a
Miloslav> Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64
Miloslav> GNU/Linux

Miloslav> RAID is a composition of 16 harddrives. Harddrives are connected via
Miloslav> AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS
Miloslav> 2.5" 15k drives.
Can you post the output of "cat /proc/mdstat"? Or, since you say you're using btrfs, are you using its own RAID setup? If so, please post the output of 'btrfs stats' or whatever the command is that you use to view layout info.
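(For what it's worth, the usual commands for showing a btrfs layout would be something like the following; the exact subcommands are a suggestion, not quoted from the mail above:)

  btrfs filesystem show /data     # devices and per-device allocation
  btrfs filesystem df /data       # data/metadata profiles and usage
  btrfs filesystem usage /data    # combined view (needs a reasonably recent btrfs-progs)
  btrfs device stats /data        # per-device error counters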
Miloslav> Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts,
Miloslav> Mailbox format, LMTP delivery.
How often are these accounts hitting the server?
Miloslav> We run 'rsync' to remote NAS daily. It takes about 6.5 hours to finish,
Miloslav> 12'265'387 files last night.
That's... sucky. So basically you're hitting the drives hard with random IOPS and you're probably running out of performance. How much space are you using on the filesystem?
And why not use btrfs send to ship off snapshots instead of using rsync? I'm sure that would be an improvement...
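A snapshot-based replication along those lines could look like the sketch below; it assumes /data is a subvolume and that the receiving side is also btrfs (required for 'btrfs receive'), and the snapshot names and host are illustrative only:

  # take a read-only snapshot, then ship it to the backup host
  btrfs subvolume snapshot -r /data /data/.snap-2020-09-07
  btrfs send /data/.snap-2020-09-07 | ssh backup-nas btrfs receive /backups/imap
  # later runs can send only the difference against the previous snapshot:
  #   btrfs send -p /data/.snap-2020-09-06 /data/.snap-2020-09-07 | ssh backup-nas btrfs receive /backups/imap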
Miloslav> Last half year, we encoutered into performace
Miloslav> troubles. Server load grows up to 30 in rush hours, due to
Miloslav> IO waits. We tried to attach next harddrives (the 838G ones
Miloslav> in a list below) and increase a free space by rebalace. I
Miloslav> think, it helped a little bit, not not so rapidly.
If you're IOPS bound, but not space bound, then you *really* want to get an SSD in there for the indexes and such. Basically the stuff that gets written to and read from all the time no matter what, but which isn't large in terms of space.
Also, adding in another controller card or two would also probably help spread the load across more PCI channels, and reduce contention on the SATA/SAS bus as well.
Miloslav> Is this a reasonable setup and use case for btrfs RAID-10?
Miloslav> If so, are there some recommendations to achieve better
Miloslav> performance?
- move HOT data to SSD based volume RAID 1 pair. On a separate controller.
- add more controllers, which also means you're more redundant in case one controller fails.
- Clone the system and put Dovecot IMAP director in front of the setup.
- Stop using rsync for copying to your DR site; use btrfs snapshot + send, or whatever the commands are (see the sketch after this list).
- check which dovecot backend you're using and think about moving to one which doesn't involve nearly as many files.
- Find out who your biggest users are, in terms of emails and move them to SSDs if step 1 is too hard to do at first.
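On the snapshot replication point above, the commands are roughly like this -- a sketch only, assuming /data is (or contains) a btrfs subvolume and the DR box is reachable over ssh; host, paths and snapshot names are placeholders:

  # keep snapshots in a dedicated directory
  mkdir -p /data/.snapshots
  # each night: take a read-only snapshot of the mail volume
  btrfs subvolume snapshot -r /data /data/.snapshots/2020-09-10
  # first time: full send to the DR box
  btrfs send /data/.snapshots/2020-09-10 | ssh dr-host btrfs receive /backup/imap
  # following nights: incremental send against the previous snapshot
  btrfs send -p /data/.snapshots/2020-09-10 /data/.snapshots/2020-09-11 \
      | ssh dr-host btrfs receive /backup/imap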
Can you also grab some 'iostat -dhm 30 60' output, which is 30 minutes of data over 30 second intervals? That should help you narrow down which (if any) disk is your hotspot.
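Something like this captures it with the extended stats as well (just a sketch; the log path is arbitrary):

  # extended per-device stats every 30 seconds, 60 samples (~30 minutes)
  iostat -x -d -m 30 60 > /var/tmp/iostat-rush-hour.log
  # afterwards, look for devices sitting near 100 in %util with high await values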
It's not clear to me if you have one big btrfs filesystem, or a bunch of smaller ones stitched together. In any case, it should be very easy to get better performance here.
I think someone else mentioned that you should look at your dovecot backend, and you should move to the fastest one you can find.
Good luck! John
Miloslav> # megaclisas-status Miloslav> -- Controller information -- Miloslav> -- ID | H/W Model | RAM | Temp | BBU | Firmware Miloslav> c0 | AVAGO MegaRAID SAS 9361-8i | 1024MB | 72C | Good | FW: Miloslav> 24.16.0-0082
Miloslav> -- Array information -- Miloslav> -- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Miloslav> Path | CacheCade |InProgress Miloslav> c0u0 | RAID-0 | 838G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdq | None |None Miloslav> c0u1 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sda | None |None Miloslav> c0u2 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdb | None |None Miloslav> c0u3 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdc | None |None Miloslav> c0u4 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdd | None |None Miloslav> c0u5 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sde | None |None Miloslav> c0u6 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdf | None |None Miloslav> c0u7 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdg | None |None Miloslav> c0u8 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdh | None |None Miloslav> c0u9 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdi | None |None Miloslav> c0u10 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdj | None |None Miloslav> c0u11 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdk | None |None Miloslav> c0u12 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdl | None |None Miloslav> c0u13 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdm | None |None Miloslav> c0u14 | RAID-0 | 558G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdn | None |None Miloslav> c0u15 | RAID-0 | 838G | 256 KB | RA,WB | Enabled | Optimal | Miloslav> /dev/sdr | None |None
Miloslav> -- Disk information -- Miloslav> -- ID | Type | Drive Model | Size | Status Miloslav> | Speed | Temp | Slot ID | LSI ID Miloslav> c0u0p0 | HDD | SEAGATE ST900MP0006 N003WAG0Q3S3 | 837.8 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 53C | [8:14] | 32 Miloslav> c0u1p0 | HDD | HGST HUC156060CSS200 A3800XV250TJ | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 38C | [8:0] | 12 Miloslav> c0u2p0 | HDD | HGST HUC156060CSS200 A3800XV3XT4J | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 43C | [8:1] | 11 Miloslav> c0u3p0 | HDD | HGST HUC156060CSS200 ADB05ZG4XLZU | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 46C | [8:2] | 25 Miloslav> c0u4p0 | HDD | HGST HUC156060CSS200 A3800XV3DWRL | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 48C | [8:3] | 14 Miloslav> c0u5p0 | HDD | HGST HUC156060CSS200 A3800XV3XZTL | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 52C | [8:4] | 18 Miloslav> c0u6p0 | HDD | HGST HUC156060CSS200 A3800XV3VSKJ | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 55C | [8:5] | 15 Miloslav> c0u7p0 | HDD | SEAGATE ST600MP0006 N003WAF1LWKE | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 56C | [8:6] | 28 Miloslav> c0u8p0 | HDD | HGST HUC156060CSS200 A3800XV3XTDJ | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 55C | [8:7] | 20 Miloslav> c0u9p0 | HDD | HGST HUC156060CSS200 A3800XV3T8XL | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 57C | [8:8] | 19 Miloslav> c0u10p0 | HDD | HGST HUC156060CSS200 A7030XHL0ZYP | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 61C | [8:9] | 23 Miloslav> c0u11p0 | HDD | HGST HUC156060CSS200 ADB05ZG4VR3P | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 60C | [8:10] | 24 Miloslav> c0u12p0 | HDD | SEAGATE ST600MP0006 N003WAF195KA | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 60C | [8:11] | 29 Miloslav> c0u13p0 | HDD | SEAGATE ST600MP0006 N003WAF1LTZW | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 56C | [8:12] | 26 Miloslav> c0u14p0 | HDD | SEAGATE ST600MP0006 N003WAF1LWH6 | 558.4 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 55C | [8:13] | 27 Miloslav> c0u15p0 | HDD | SEAGATE ST900MP0006 N003WAG0Q414 | 837.8 Gb | Online, Miloslav> Spun Up | 12.0Gb/s | 47C | [8:15] | 33
Miloslav> # btrfs --version Miloslav> btrfs-progs v4.7.3
Miloslav> # btrfs fi show Miloslav> Label: 'DATA' uuid: 5b285a46-e55d-4191-924f-0884fa06edd8 Miloslav> Total devices 16 FS bytes used 3.49TiB Miloslav> devid 1 size 558.41GiB used 448.66GiB path /dev/sda Miloslav> devid 2 size 558.41GiB used 448.66GiB path /dev/sdb Miloslav> devid 4 size 558.41GiB used 448.66GiB path /dev/sdd Miloslav> devid 5 size 558.41GiB used 448.66GiB path /dev/sde Miloslav> devid 7 size 558.41GiB used 448.66GiB path /dev/sdg Miloslav> devid 8 size 558.41GiB used 448.66GiB path /dev/sdh Miloslav> devid 9 size 558.41GiB used 448.66GiB path /dev/sdf Miloslav> devid 10 size 558.41GiB used 448.66GiB path /dev/sdi Miloslav> devid 11 size 558.41GiB used 448.66GiB path /dev/sdj Miloslav> devid 13 size 558.41GiB used 448.66GiB path /dev/sdk Miloslav> devid 14 size 558.41GiB used 448.66GiB path /dev/sdc Miloslav> devid 15 size 558.41GiB used 448.66GiB path /dev/sdl Miloslav> devid 16 size 558.41GiB used 448.66GiB path /dev/sdm Miloslav> devid 17 size 558.41GiB used 448.66GiB path /dev/sdn Miloslav> devid 18 size 837.84GiB used 448.66GiB path /dev/sdr Miloslav> devid 19 size 837.84GiB used 448.66GiB path /dev/sdq
Miloslav> # btrfs fi df /data/ Miloslav> Data, RAID10: total=3.48TiB, used=3.47TiB Miloslav> System, RAID10: total=256.00MiB, used=320.00KiB Miloslav> Metadata, RAID10: total=21.00GiB, used=18.17GiB Miloslav> GlobalReserve, single: total=512.00MiB, used=0.00B
Miloslav> I do not attach whole dmesg log. It is almost empty, without errors. Miloslav> Only lines about BTRFS are about relocations, like:
Miloslav> BTRFS info (device sda): relocating block group 29435663220736 flags 65 Miloslav> BTRFS info (device sda): found 54460 extents Miloslav> BTRFS info (device sda): found 54459 extents
Hi, thank you for your reply. I'll continue inline...
Dne 09.09.2020 v 3:15 John Stoffel napsal(a):
Miloslav> Hello, Miloslav> I sent this into the Linux Kernel Btrfs mailing list and I got reply: Miloslav> "RAID-1 would be preferable" Miloslav> (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...). Miloslav> May I ask you for the comments as from people around the Dovecot?
Miloslav> We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro Miloslav> server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. Miloslav> We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It Miloslav> takes about 50 minutes to finish.
Miloslav> # uname -a Miloslav> Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 Miloslav> GNU/Linux
Miloslav> RAID is a composition of 16 harddrives. Harddrives are connected via Miloslav> AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS Miloslav> 2.5" 15k drives.
Can you post the output of "cat /proc/mdstat" or since you say you're using btrfs, are you using their own RAID0 setup? If so, please post the output of 'btrfs stats' or whatever the command is you use to view layout info?
There is one PCIe RAID controller in the chassis, an AVAGO MegaRAID SAS 9361-8i, with 16x SAS 15k drives connected to it. Because the controller does not support pass-through for the drives, we use 16x RAID-0 on the controller. So we get /dev/sda ... /dev/sdp (roughly) in the OS, and on top of that we have a single btrfs RAID-10, composed of 16 devices, mounted as /data.
We have chosen this wiring for several reasons:
- easy to increase a capacity
- easy to replace drives by larger ones
- due to checksumming, btrfs does not need fsck in case of power failure
- btrfs scrub discovers a failing drive sooner than S.M.A.R.T. or the RAID controller (see the example below)
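For reference, the kind of commands this relies on (a rough sketch; output details vary with the btrfs-progs version):

  # per-device error counters (read/write/flush/corruption/generation)
  btrfs device stats /data

  # weekly cron job, plus checking the result afterwards
  btrfs scrub start -B -d /data
  btrfs scrub status -d /data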
Miloslav> Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Miloslav> Mailbox format, LMTP delivery.
How ofter are these accounts hitting the server?
The IMAP server serves a university, so there are typical rush hours from 7AM to 3PM. Load lowers during the evening, and the server is almost unused during the night.
Miloslav> We run 'rsync' to remote NAS daily. It takes about 6.5 hours to finish, Miloslav> 12'265'387 files last night.
That's.... sucky. So basically you're hitting the drives hard with random IOPs and you're probably running out of performance. How much space are you using on the filesystem?
It's not as bad as it seems. rsync runs during the night, and even when reading is high, the server load stays low. Our problems are with writes.
And why not use brtfs send to ship off snapshots instead of using rsync? I'm sure that would be an improvement...
We run a backup to an external NAS (NetApp) for a disaster recovery scenario. Moreover, the NAS is spread across multiple locations. We then create NAS snapshots, kept tens of days back. All snapshots are easily available via an NFS mount. And NAS capacity is cheaper.
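Roughly, the nightly job boils down to something like this (the real paths and options are simplified here):

  # copy the maildir tree to the NFS-mounted NAS export (paths illustrative)
  rsync -aH --delete /data/ /mnt/nas/imap-dr/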
Miloslav> Last half year, we encoutered into performace Miloslav> troubles. Server load grows up to 30 in rush hours, due to Miloslav> IO waits. We tried to attach next harddrives (the 838G ones Miloslav> in a list below) and increase a free space by rebalace. I Miloslav> think, it helped a little bit, not not so rapidly.
If you're IOPs bound, but not space bound, then you *really* want to get an SSD in there for the indexes and such. Basically the stuff that gets written/read from all the time no matter what, but which isn't large in terms of space.
Yes. We are now at 66% capacity. Adding an SSD for indexes is our next step.
Also, adding in another controller card or two would also probably help spread the load across more PCI channels, and reduce contention on the SATA/SAS bus as well.
We will probably wait to see how the SSD helps first, but as you wrote, it is a possible next step.
Miloslav> Is this a reasonable setup and use case for btrfs RAID-10? Miloslav> If so, are there some recommendations to achieve better Miloslav> performance?
- move HOT data to SSD based volume RAID 1 pair. On a seperate controller.
OK
- add more controllers, which also means you're more redundant in case one controller fails.
OK
- Clone the system and put Dovecot IMAP director in from of the setup.
I still hope that one server can handle 4105 accounts.
- Stop using rsync for copying to your DR site, use the btrfs snap send, or whatever the commands are.
I hope it is not needed in our scenario.
- check which dovecot backend you're using and think about moving to one which doesn't involve nearly as many files.
Maildir is comfortable for us. From time to time, users call us with "I accidentally deleted the folder", and it is super easy to copy it back from the backup.
- Find out who your biggest users are, in terms of emails and move them to SSDs if step 1 is too hard to do at first.
OK
Can you also grab some 'iostat -dhm 30 60' output, which is 30 minutes of data over 30 second intervals? That should help you narrow down which (if any) disk is your hotspot.
OK, thanks for the tip.
It's not clear to me if you have one big btrfs filesystem, or a bunch of smaller ones stiched together. In any case, it should be very easy to get better performance here.
I hope I've made it clear above.
I think someone else mentioned that you should look at your dovecot backend, and you should move to the fastest one you can find.
Good luck! John
Thank you for your time and advice!
Kind regards Milo
The 9361-8i does support passthrough ( JBOD mode ). Make sure you have the latest firmware.
Some controllers have a direct "pass through to OS" option for a drive, that's what I meant. I can't recall why we chose RAID-0 instead of JBOD, there was some reason, but I hope there is no difference from a single drive.
Thank you Milo
Dne 09.09.2020 v 15:51 Scott Q. napsal(a):
The 9361-8i does support passthrough ( JBOD mode ). Make sure you have the latest firmware.
Actually there is a difference: filesystems like ZFS/BTRFS prefer to see the drive directly, not a virtual drive.
I'm not sure you can change it now anymore but in the future, always use JBOD.
It's also possible that you don't have the latest firmware on the 9361-8i. If I recall correctly, they only added the JBOD option in the last firmware update.
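With a recent enough firmware it is usually just a StorCLI setting, something like this (controller, enclosure and slot numbers are placeholders; check yours with 'storcli64 show'):

  # enable JBOD support on controller 0
  storcli64 /c0 set jbod=on
  # expose an individual drive (enclosure 8, slot 0) as JBOD
  storcli64 /c0/e8/s0 set jbod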
I cannot verify it, but I think that even JBOD is propagated as a virtual device. If you create JBOD from 3 different disks, low level parameters may differ.
And old firmware is probably the reason we used RAID-0 two or three years ago.
Thank you for the ideas.
Kind regards Milo
On 2020/09/10 07:40, Miloslav Hůla wrote:
I cannot verify it, but I think that even JBOD is propagated as a virtual device. If you create JBOD from 3 different disks, low level parameters may differ.
JBOD allows each disk to be seen by the OS, as is. You wouldn't create a JBOD disk from 3 different disks -- JBOD would give you 3 separate JBOD disks for the 3 separate disks.

So for your 16 disks, you are using 1 long RAID0? You realize that if 1 disk goes out, the entire array needs to be reconstructed. Also all of your spindles can be tied up by long reads/writes -- optimal speed would come from a read 16 stripes wide spread over the 16 disks.

What would be better, IMO, is going with a RAID-10 like your subject says, using 8 pairs of mirrors and striping those. Set your stripe unit to 64K to allow the disks to operate independently. You don't want a long 16-disk stripe, as that's far from optimal for your mailbox load. What you want is the ability to have multiple I/O ops going at the same time -- independently. I think as it stands now, you are far more likely to get contention as different mailboxes are accessed, with contention happening within the span, vs. letting each 2-disk mirror potentially do a different task -- which would likely have the effect of raising your I/O ops/s.
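For illustration only, in Linux md terms that geometry would be something like the following (device names are placeholders; the same idea applies whether the striping is done in the controller or in software):

  # 8 mirrored pairs striped together with a 64K chunk
  mdadm --create /dev/md0 --level=10 --raid-devices=16 --chunk=64K /dev/sd[a-p]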
Running raid10 on top of raid0 seems really wasteful
You create an individual raid0 from each individual disk, with write buffers off, of course. That is how it goes on sh***y controllers. For some controllers a firmware upgrade will add JBOD, for some you need to flash IT firmware, for some you can switch to HBA mode. But anyway - use an HBA or a GOOD RAID controller.
-- KSB
Dne 15.09.2020 v 10:22 Linda A. Walsh napsal(a):
On 2020/09/10 07:40, Miloslav Hůla wrote:
I cannot verify it, but I think that even JBOD is propagated as a virtual device. If you create JBOD from 3 different disks, low level parameters may differ.
JBOD allows each disk to be seen by the OS, as is. You wouldn't create JBOD disk from 3 different disks -- JBOD would give you 3 separate JBOD disks for the 3 separate disks.
Yes. If I create 3 JBOD configurations from 3 100GB disks, I get 3 100GB devices in the OS. If I create 1 JBOD configuration from 3 100GB disks, I get 1 300GB device in the OS.
So for your 16 disks, you are using 1 long RAID0? You realize 1 disk goes out, the entire array needs to be reconstructed. Also all of your spindles can be tied up by long read/writes -- optimal speed would come from a read 16 stripes wide spread over the 16 disks.
No. I have 16 RAID-0 configurations from 16 disks. As I wrote, a few years ago there was no other way to propagate the 16 disks as 16 devices into the OS.
What would be better, IMO, is going with a RAID-10 like your subject says, using 8-pairs of mirrors and strip those. Set your stripe unit for 64K to allow the disks to operate independently. You don't want a long 16-disk stripe, as that's far from optimal for your mailbox load. What you want is the ability to have multiple I/O ops going at the same time -- independently. I think as it stands now, you are far more likely to get contention as different mailboxes are accessed with contention happening within the span, vs. letting each 2 disk mirror potentially doing a different task -- which would likely have the effect of raising your I/O ops/s.
The reason not to create RAID-10 in the controller was that btrfs scrubbing detects a slowly degrading disk much sooner than the controller does (verified many times). And if I created RAID-10 in the controller, btrfs scrub would still detect it soon, but I would not be able to recognize which disk it is on.
Running raid10 on top of raid0 seems really wasteful
I'm not doing that.
"Miloslav" == Miloslav Hůla <miloslav.hula@gmail.com> writes:
Miloslav> Hi, thank you for your reply. I'll continue inline...
Me too... please look for further comments. Esp about 'fio' and Netapp usage.
Miloslav> Dne 09.09.2020 v 3:15 John Stoffel napsal(a): Miloslav> Hello, Miloslav> I sent this into the Linux Kernel Btrfs mailing list and I got reply: Miloslav> "RAID-1 would be preferable" Miloslav> (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112@lec...). Miloslav> May I ask you for the comments as from people around the Dovecot?
Miloslav> We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro Miloslav> server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM. Miloslav> We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It Miloslav> takes about 50 minutes to finish.
Miloslav> # uname -a Miloslav> Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 Miloslav> GNU/Linux
Miloslav> RAID is a composition of 16 harddrives. Harddrives are connected via Miloslav> AVAGO MegaRAID SAS 9361-8i as a RAID-0 devices. All harddrives are SAS Miloslav> 2.5" 15k drives.
Can you post the output of "cat /proc/mdstat" or since you say you're using btrfs, are you using their own RAID0 setup? If so, please post the output of 'btrfs stats' or whatever the command is you use to view layout info?
Miloslav> There is a one PCIe RAID controller in a chasis. AVAGO Miloslav> MegaRAID SAS 9361-8i. And 16x SAS 15k drives conneced to Miloslav> it. Because the controller does not support pass-through for Miloslav> the drives, we use 16x RAID-0 on controller. So, we get Miloslav> /dev/sda ... /dev/sdp (roughly) in OS. And over that we have Miloslav> single btrfs RAID-10, composed of 16 devices, mounted as Miloslav> /data.
I will bet that this is one of your bottlenecks as well. Get a second or third controller and split your disks across them evenly.
Miloslav> We have chosen this wiring for severeal reasons: Miloslav> - easy to increase a capacity Miloslav> - easy to replace drives by larger ones Miloslav> - due to checksuming, btrfs does not need fsck in case of power failure Miloslav> - btrfs scrub discovers failing drive sooner than S.M.A.R.T. or RAID Miloslav> controller
Miloslav> Server serves as a IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts, Miloslav> Mailbox format, LMTP delivery.
How ofter are these accounts hitting the server?
Miloslav> IMAP serves for a univesity. So there are typical rush hours from 7AM to Miloslav> 3PM. Lowers during the evening, almost not used during the night.
I can understand this, I used to work at a Uni so I can understand the population needs.
Miloslav> We run 'rsync' to remote NAS daily. It takes about 6.5 hours to finish, Miloslav> 12'265'387 files last night.
That's.... sucky. So basically you're hitting the drives hard with random IOPs and you're probably running out of performance. How much space are you using on the filesystem?
Miloslav> It's not so sucky how it seems. rsync runs during the Miloslav> night. And even reading is high, server load stays low. We Miloslav> have problems with writes.
Ok. So putting in an SSD pair to cache things should help.
And why not use brtfs send to ship off snapshots instead of using rsync? I'm sure that would be an improvement...
Miloslav> We run backup to external NAS (NetApp) for a disaster Miloslav> recovery scenario. Moreover NAS is spreaded across multiple Miloslav> locations. Then we create NAS snapshot, tens days Miloslav> backward. All snapshots easily available via NFS mount. And Miloslav> NAS capacity is cheaper.
So why not run the backend storage on the Netapp, and just keep the indexes and such local to the system? I've run Netapps for many years and they work really well. And then you'd get automatic backups using scheduled snapshots.
Keep the index files local on disk/SSDs and put the maildirs out to NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop needing to do rsync at night.
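A sketch of what that split could look like on the Dovecot side (paths are placeholders; INDEX keeps the index files on local SSD while the maildirs live on the NFS mount):

  # 10-mail.conf (illustrative)
  mail_location = maildir:/mnt/netapp/mail/%u/Maildir:INDEX=/var/dovecot/indexes/%u
  # generally recommended when the mail store is on NFS
  mmap_disable = yes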
Miloslav> Last half year, we encoutered into performace Miloslav> troubles. Server load grows up to 30 in rush hours, due to Miloslav> IO waits. We tried to attach next harddrives (the 838G ones Miloslav> in a list below) and increase a free space by rebalace. I Miloslav> think, it helped a little bit, not not so rapidly.
If you're IOPs bound, but not space bound, then you *really* want to get an SSD in there for the indexes and such. Basically the stuff that gets written/read from all the time no matter what, but which isn't large in terms of space.
Miloslav> Yes. We are now on 66% capacity. Adding SSD for indexes is Miloslav> our next step.
This *should* give you a boost in performance. But finding a way to take before and after latency/performance measurements is key. I would look into using 'fio' to test your latency numbers. You might also want to try using XFS or even ext4 as your filesystem. I understand not wanting to 'fsck', so that might be right out.
Which leads me back to suggesting you use the Netapp as your primary storage, assuming the Netapp isn't bogged down with other work. Again, use 'fio' to run some tests and see how things look.
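A rough fio job for the small random writes a Maildir store generates, just as a starting point (the test directory, sizes and job counts are arbitrary here):

  fio --name=maildir-writes --directory=/data/fio-test \
      --rw=randwrite --bs=4k --size=1g --numjobs=4 --iodepth=16 \
      --ioengine=libaio --runtime=120 --time_based --group_reporting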
Also, adding in another controller card or two would also probably help spread the load across more PCI channels, and reduce contention on the SATA/SAS bus as well.
Miloslav> Probably we will wait how SSD helps first, but as you wrote, it is Miloslav> possible next step.
Miloslav> Is this a reasonable setup and use case for btrfs RAID-10? Miloslav> If so, are there some recommendations to achieve better Miloslav> performance?
- move HOT data to SSD based volume RAID 1 pair. On a seperate controller.
Miloslav> OK
- add more controllers, which also means you're more redundant in case one controller fails.
Miloslav> OK
- Clone the system and put Dovecot IMAP director in from of the setup.
Miloslav> I still hope that one server can handle 4105 accounts.
- Stop using rsync for copying to your DR site, use the btrfs snap send, or whatever the commands are.
Miloslav> I hope it is not needed in our scenario.
- check which dovecot backend you're using and think about moving to one which doesn't involve nearly as many files.
Miloslav> Maildir is comfortable for us. Time to time, users call us with: "I Miloslav> accidentally deleted the folder" and it is super easy to copy it back Miloslav> from backup.
- Find out who your biggest users are, in terms of emails and move them to SSDs if step 1 is too hard to do at first.
Miloslav> OK
Can you also grab some 'iostat -dhm 30 60' output, which is 30 minutes of data over 30 second intervals? That should help you narrow down which (if any) disk is your hotspot.
Miloslav> OK, thanks for the tip.
It's not clear to me if you have one big btrfs filesystem, or a bunch of smaller ones stiched together. In any case, it should be very easy to get better performance here.
Miloslav> I hope I've made it clear above.
I think someone else mentioned that you should look at your dovecot backend, and you should move to the fastest one you can find.
Good luck! John
Miloslav> Thank you for your time and advices!
Miloslav> Kind regards Miloslav> Milo
Dne 09.09.2020 v 17:52 John Stoffel napsal(a):
Miloslav> There is a one PCIe RAID controller in a chasis. AVAGO Miloslav> MegaRAID SAS 9361-8i. And 16x SAS 15k drives conneced to Miloslav> it. Because the controller does not support pass-through for Miloslav> the drives, we use 16x RAID-0 on controller. So, we get Miloslav> /dev/sda ... /dev/sdp (roughly) in OS. And over that we have Miloslav> single btrfs RAID-10, composed of 16 devices, mounted as Miloslav> /data.
I will bet that this is one of your bottlenecks as well. Get a secord or third controller and split your disks across them evenly.
That's the plan for a next step.
Miloslav> We run 'rsync' to remote NAS daily. It takes about 6.5 hours to finish, Miloslav> 12'265'387 files last night.
That's.... sucky. So basically you're hitting the drives hard with random IOPs and you're probably running out of performance. How much space are you using on the filesystem?
Miloslav> It's not so sucky how it seems. rsync runs during the Miloslav> night. And even reading is high, server load stays low. We Miloslav> have problems with writes.
Ok. So putting in an SSD pair to cache things should help.
And why not use brtfs send to ship off snapshots instead of using rsync? I'm sure that would be an improvement...
Miloslav> We run backup to external NAS (NetApp) for a disaster Miloslav> recovery scenario. Moreover NAS is spreaded across multiple Miloslav> locations. Then we create NAS snapshot, tens days Miloslav> backward. All snapshots easily available via NFS mount. And Miloslav> NAS capacity is cheaper.
So why not run the backend storage on the Netapp, and just keep the indexes and such local to the system? I've run Netapps for many years and they work really well. And then you'd get automatic backups using schedule snapshots.
Keep the index files local on disk/SSDs and put the maildirs out to NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop needing to do rsync at night.
It's an option we have in mind. As you wrote, NetApp is very solid. The main reason for local storage is that the IMAP server is completely isolated from the network. But maybe one day we will use it.
Miloslav> Last half year, we encoutered into performace Miloslav> troubles. Server load grows up to 30 in rush hours, due to Miloslav> IO waits. We tried to attach next harddrives (the 838G ones Miloslav> in a list below) and increase a free space by rebalace. I Miloslav> think, it helped a little bit, not not so rapidly.
If you're IOPs bound, but not space bound, then you *really* want to get an SSD in there for the indexes and such. Basically the stuff that gets written/read from all the time no matter what, but which isn't large in terms of space.
Miloslav> Yes. We are now on 66% capacity. Adding SSD for indexes is Miloslav> our next step.
This *should* give you a boost in performance. But finding a way to take before and after latency/performance measurements is key. I would look into using 'fio' to test your latency numbers. You might also want to try using XFS or even ext4 as your filesystem. I understand not wanting to 'fsck', so that might be right out.
Unfortunately, to quickly fix the problem and make the server usable again, we already added an SSD and moved the indexes onto it. So we have no measurements of the old state.
The situation is better, but I guess the problem still exists. It takes some time for the load to grow. We will see.
Thank you for the fio tip. I'll definitely try that.
Kind regards Milo
"Miloslav" == Miloslav Hůla <miloslav.hula@gmail.com> writes:
Miloslav> Dne 09.09.2020 v 17:52 John Stoffel napsal(a): Miloslav> There is a one PCIe RAID controller in a chasis. AVAGO Miloslav> MegaRAID SAS 9361-8i. And 16x SAS 15k drives conneced to Miloslav> it. Because the controller does not support pass-through for Miloslav> the drives, we use 16x RAID-0 on controller. So, we get Miloslav> /dev/sda ... /dev/sdp (roughly) in OS. And over that we have Miloslav> single btrfs RAID-10, composed of 16 devices, mounted as Miloslav> /data.
I will bet that this is one of your bottlenecks as well. Get a secord or third controller and split your disks across them evenly.
Miloslav> That's plan for a next step.
Miloslav> We run 'rsync' to remote NAS daily. It takes about 6.5 hours to finish, Miloslav> 12'265'387 files last night.
That's.... sucky. So basically you're hitting the drives hard with random IOPs and you're probably running out of performance. How much space are you using on the filesystem?
Miloslav> It's not so sucky how it seems. rsync runs during the Miloslav> night. And even reading is high, server load stays low. We Miloslav> have problems with writes.
Ok. So putting in an SSD pair to cache things should help.
And why not use brtfs send to ship off snapshots instead of using rsync? I'm sure that would be an improvement...
Miloslav> We run backup to external NAS (NetApp) for a disaster Miloslav> recovery scenario. Moreover NAS is spreaded across multiple Miloslav> locations. Then we create NAS snapshot, tens days Miloslav> backward. All snapshots easily available via NFS mount. And Miloslav> NAS capacity is cheaper.
So why not run the backend storage on the Netapp, and just keep the indexes and such local to the system? I've run Netapps for many years and they work really well. And then you'd get automatic backups using schedule snapshots.
Keep the index files local on disk/SSDs and put the maildirs out to NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop needing to do rsync at night.
Miloslav> It's the option we have in minds. As you wrote, NetApp is very solid. Miloslav> The main reason for local storage is, that IMAP server is completely Miloslav> isolated from network. But maybe one day will use it.
It's not completely isolated, it can rsync data to another host that has access to the Netapp. *grin*
Miloslav> Unfortunately, to quickly fix the problem and make server Miloslav> usable again, we already added SSD and moved indexes on Miloslav> it. So we have no measurements in old state.
That's ok, if it's better, then it's better. How is the load now? Looking at the output of 'iostat -x 30' might be a good thing.
Miloslav> Situation is better, but I guess, problem still exists. I Miloslav> takes some time to load be growing. We will see.
Hmm... how did you set up the new indexes volume? Did you just use btrfs again? Did you mirror your SSDs as well?
Do the indexes fill the SSD, or is there 20-30% free space? When an SSD gets fragmented, its performance can drop quite a bit. Did you put the SSDs onto a separate controller? Probably not. So now you've just increased the load on the single controller, when you really should be spreading it out more to improve things.
Another possible hack would be to move some stuff to a RAM disk, assuming your server is on a UPS/generator in case of power loss. But that's an unsafe hack.
Also, do you have quotas turned on? That's a performance hit for sure.
Miloslav> Thank you for the fio tip. I'll definitely try that.
It's a good way to test and measure how the system will react. Unfortunately, you will need to do your testing outside of normal work hours so as to not impact your users too much.
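As a starting point, a hedged sketch of a small-random-IO fio run; the directory and all job parameters are only guesses meant to roughly mimic a mail workload, so adjust them to taste and run against a scratch directory, not live mailboxes:

fio --name=mailsim --directory=/data/fio-test --rw=randrw --rwmixread=70 \
    --bs=4k --size=2g --numjobs=8 --iodepth=16 --ioengine=libaio \
    --direct=1 --runtime=300 --time_based --group_reporting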
Good luck! Please post some numbers if you get them. If you see only a few disks are 75% or more busy, then *maybe* you have a bad disk in the system, and moving off that disk or replacing it might help. Again, hard to know.
Rebalancing btrfs might also help, especially now that you've moved the indexes off that volume.
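If you do rebalance, a usage-filtered balance is one way to limit the IO impact; the 50% threshold below is an arbitrary example:

# only rewrite data/metadata chunks that are less than half full
btrfs balance start -dusage=50 -musage=50 /data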
John
On 10.09.2020 at 17:40, John Stoffel wrote:
So why not run the backend storage on the Netapp, and just keep the indexes and such local to the system? I've run Netapps for many years and they work really well. And then you'd get automatic backups using scheduled snapshots.
Keep the index files local on disk/SSDs and put the maildirs out to NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop needing to do rsync at night.
Miloslav> It's an option we have in mind. As you wrote, NetApp is very
Miloslav> solid. The main reason for local storage is that the IMAP server
Miloslav> is completely isolated from the network. But maybe one day we
Miloslav> will use it.
It's not completely isolated, it can rsync data to another host that has access to the Netapp. *grin*
:o)
Miloslav> Unfortunately, to quickly fix the problem and make the server
Miloslav> usable again, we already added an SSD and moved the indexes onto
Miloslav> it. So we have no measurements of the old state.
That's ok, if it's better, then it's better. How is the load now? Looking at the output of 'iostat -x 30' might be a good thing.
Load is between 1 and 2. We can live with that for now.
Miloslav> The situation is better, but I guess the problem still exists.
Miloslav> It takes some time for the load to grow. We will see.
Hmm... how did you set up the new indexes volume? Did you just use btrfs again? Did you mirror your SSDs as well?
Yes. Just two SSDs into free slots, propagated as two RAID-0 devices to the OS, and btrfs RAID-1 on top.
It is nasty, I know, but it was done without an outage. It is just a quick attempt to improve the situation. Our next plan is to buy more controllers, schedule an outage on a weekend and do it properly.
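For completeness, a sketch of how such a mirrored SSD index volume is typically created; the device names and mount point here are hypothetical, not the ones actually used:

# btrfs RAID-1 for both data and metadata across the two SSD LUNs
mkfs.btrfs -d raid1 -m raid1 /dev/sds /dev/sdt
mount /dev/sds /srv/dovecot-indexes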
Do the indexes fill the SSD, or is there 20-30% free space? When an SSD gets fragmented, its performance can drop quite a bit. Did you put the SSDs onto a separate controller? Probably not. So now you've just increased the load on the single controller, when you really should be spreading it out more to improve things.
The SSDs are almost empty; 2.4GB of 93GB is used after running 'doveadm index' on all mailboxes.
Another possible hack would be to move some stuff to a RAM disk, assuming your server is on a UPS/generator in case of power loss. But that's an unsafe hack.
Also, do you have quotas turned on? That's a performance hit for sure.
No, we are running without quotas.
Miloslav> Thank you for the fio tip. I'll definitely try that.
It's a good way to test and measure how the system will react. Unfortunately, you will need to do your testing outside of normal work hours so as to not impact your users too much.
Good luck! Please post some numbers if you get them. If you see only a few disks are 75% or more busy, then *maybe* you have a bad disk in the system, and moving off that disk or replacing it might help. Again, hard to know.
Rebalancing btrfs might also help, especially now that you've moved the indexes off that volume.
John
Thank you Milo
"Miloslav" == Miloslav Hůla <miloslav.hula@gmail.com> writes:
Miloslav> On 10.09.2020 at 17:40, John Stoffel wrote:
So why not run the backend storage on the Netapp, and just keep the indexes and such local to the system? I've run Netapps for many years and they work really well. And then you'd get automatic backups using scheduled snapshots.
Keep the index files local on disk/SSDs and put the maildirs out to NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop needing to do rsync at night.
Miloslav> It's an option we have in mind. As you wrote, NetApp is very
Miloslav> solid. The main reason for local storage is that the IMAP server
Miloslav> is completely isolated from the network. But maybe one day we
Miloslav> will use it.
It's not completely isolated, it can rsync data to another host that has access to the Netapp. *grin*
Miloslav> :o)
Miloslav> Unfortunately, to quickly fix the problem and make the server
Miloslav> usable again, we already added an SSD and moved the indexes onto
Miloslav> it. So we have no measurements of the old state.
That's ok, if it's better, then it's better. How is the load now? Looking at the output of 'iostat -x 30' might be a good thing.
Miloslav> Load is between 1 and 2. We can live with that for now.
Has IMAP access gotten faster or more consistent under load? That's the key takeaway, not system load, since the LoadAvg isn't really a good measure on Linux.
Basically, has your IO pattern or IO wait times improved?
Miloslav> The situation is better, but I guess the problem still exists.
Miloslav> It takes some time for the load to grow. We will see.
Hmm... how did you set up the new indexes volume? Did you just use btrfs again? Did you mirror your SSDs as well?
Miloslav> Yes. Just two SSDs into free slots, propagated as two RAID-0
Miloslav> devices to the OS, and btrfs RAID-1 on top.
Miloslav> It is nasty, I know, but it was done without an outage. It is
Miloslav> just a quick attempt to improve the situation. Our next plan is
Miloslav> to buy more controllers, schedule an outage on a weekend and do
Miloslav> it properly.
That is a good plan in any case.
Do the indexes fill the SSD, or is there 20-30% free space? When an SSD gets fragmented, its performance can drop quite a bit. Did you put the SSDs onto a separate controller? Probably not. So now you've just increased the load on the single controller, when you really should be spreading it out more to improve things.
Miloslav> The SSDs are almost empty; 2.4GB of 93GB is used after running
Miloslav> 'doveadm index' on all mailboxes.
Interesting. I wonder if there are other Dovecot files that could be moved over to increase speed, because they're still IOPS- or IO-bound?
Another possible hack would be to move some stuff to a RAM disk, assuming your server is on a UPS/generator in case of power loss. But that's an unsafe hack.
Also, do you have quotas turned on? That's a performance hit for sure.
Miloslav> No, we are running without quotas.
By quotas, I mean btrfs quotas, just to be clear.
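A quick way to check is something like the following; 'btrfs qgroup show' only returns data when quotas (qgroups) have been enabled on the filesystem:

# lists qgroups if quotas are enabled, errors out otherwise
btrfs qgroup show /data
# if they are enabled and unused, disabling them avoids the overhead
btrfs quota disable /data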
Miloslav> Thank you for the fio tip. I'll definitely try that.
Please do! Getting some numbers from there will let you at least document your changes in performance.
But overall, it sounds like you've made some progress and gotten better performance.
participants (6): John Stoffel, KSB, Linda A. Walsh, Miloslav Hůla, Sami Ketola, Scott Q.