[Dovecot] Solr clusters

Thu Nov 7 05:36:17 EET 2013

Hi all,

Has anyone implemented larger Dovecot+Solr clusters and would be willing to give some details about how it works for you? My understanding about it so far is:

 - SolrCloud isn’t usable with Dovecot. Replication isn’t useful, because nobody wants to pay for double the disk space for indexes that could be regenerated anyway. The autosharding isn’t very useful also, because: I think the shard keys could be created in two possible ways: a) Mails would be entirely randomly distributed across the cluster. This would make updates efficient, because the writes would be fully distributed across all servers. But I think it would also make reads somewhat inefficient, since all the servers would have to be searched and the results combined. Also if a server is lost, there’s no easy way to reindex back the missing data, because it would contain a piece of pretty much all the users’ data. b) Shard keys could be created so that the same user would typically go only to 1-2 servers. It would be possible (at least in theory) to find a broken server’s list of users and reindex only their data, but I’m not sure if this method is any easier than the non-SolrCloud setup.

 - Without SolrCloud you’d then need to shard the data manually. This would be easy enough to do by just assigning different users to different shards. But at some point the index is going to become too large and you need to add more shards and move some existing users to them. To keep the search performance good during the move, I guess this could be done with a script that does: 1) reindex user to new shard, 2) update userdb to point to new shard, 3) delete user from old shard, 4) doveadm fts rescan the user to remove any mails already deleted during the reindexing.

 - It seems that Solr index shouldn’t grow above 200 GB or the performance will be getting too bad? I’ve seen this in a few web pages. So each server should likely be running multiple separate Solr instances (shards).

 - http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond recommends NFS (or I guess any kind of a shared filesystem), which does seem to make sense. Since Dovecot wants to get instantly updated search results after indexing, I think it’s probably better not to separate the indexing and searching servers.

 - Would be interesting to know what kind of hardware your Solr servers currently have, how well they’re performing and what are the bottlenecks? From the above URL it appears that disk I/O is first, but if there’s enough of that available then CPU usage is second. I’m not quite sure where most of the memory goes - caching?

 - I’m guessing users are doing relatively few searches compared to how many new emails are being indexed/deleted all the time?