Solr connection timeout hardwired to 60s
Hi,
What's the recommended way to handle timeouts on large mailboxes, given the hardwired request timeout of 60s in solr-connection.c:
http_set.request_timeout_msecs = 60*1000;
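For reference, a minimal local-patch sketch of the change the question implies. Only the request_timeout_msecs assignment is taken from the line quoted above; the surrounding comments and the compile-time override macro are assumptions for illustration, since a hardwired value means a rebuild is the only way to change it:

/* solr-connection.c -- local patch sketch, not verbatim Dovecot source */
#ifndef FTS_SOLR_REQUEST_TIMEOUT_MSECS
#define FTS_SOLR_REQUEST_TIMEOUT_MSECS (60*1000)   /* upstream default: 60 seconds */
#endif

        /* ... where the http_set client settings are filled in ... */
        http_set.request_timeout_msecs = FTS_SOLR_REQUEST_TIMEOUT_MSECS;

Changing the macro to, say, (300*1000) and rebuilding the plugin would give a 5-minute timeout while keeping the override in one obvious place.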
/Peter
On 4/4/2019 2:21 AM, Peter Mogensen via dovecot wrote:
What's the recommended way to handle timeouts on large mailboxes, given the hardwired request timeout of 60s in solr-connection.c:
http_set.request_timeout_msecs = 60*1000;
I'm a denizen of the solr-user@lucene.apache.org mailing list.
For a typical Solr index, 60 seconds is an eternity. Most people aim for query times of 100 milliseconds or less, and they often achieve that goal.
If you have an index where queries really are taking longer than 60 seconds, you're most likely going to need to get better hardware for Solr. Memory is the resource that usually has the greatest impact on Solr performance. Putting the index on SSD can help, but memory will help more.
Here's a wiki page that I wrote about that topic. This wiki is going away next month, but for now you can still access it:
https://wiki.apache.org/solr/SolrPerformanceProblems
There's a section in that wiki page about asking for help on performance issues. It describes how to create a particular process listing for a screenshot. If you can get that screenshot and share it using a file sharing site (dropbox is usually a good choice), I may be able to offer some insight.
Thanks, Shawn
Hi Shawn
On 04.04.19 at 16:12, Shawn Heisey via dovecot wrote:
Here's a wiki page that I wrote about that topic. This wiki is going away next month, but for now you can still access it:
https://web.archive.org/web/20190404143817/https://wiki.apache.org/solr/Solr...
That one will last longer :).
Best Daniel
I'm a denizen of the solr-user@lucene.apache.org mailing list. [...] Here's a wiki page that I wrote about that topic. This wiki is going away next month, but for now you can still access it:
That's a great resource, Shawn.
I am about to put together a test case to provide a comprehensive FTS setup around Dovecot, with the goal of exposing proximity keyword searching, for email silos containing tens of terabytes (most of the "bulk" is attachments, each of which gets processed down to plaintext, if possible). Figure thousands of users with decades of email (80,000 to 750,000 emails per user).
My main background is in software engineering (C/C++/Python/Assembler), but I have been forced into system admin tasks during many stretches of my work. I vividly remember the tedium of dealing with Java and GC, tuning it to avoid stalls, and its ravenous appetite for RAM.
It looks like those problems are still with us, many versions later. For corporations with infinite budgets, throwing crazy money at the problem is "fine" (>1 TB RAM, all-PCIe SSDs, etc.), but I am worried that I will be shoved forcefully into a wall of having to spend a fortune just to keep FTS performing reasonably well before I even get to the 10,000-user mark.
I realise the only way to keep performance reasonable is to heavily shard the index database, but I am concerned about how well that process works in practice without a great deal of sysadmin hand-holding. I would ideally prefer the decisions of how and where to shard to be based on volume and heuristics rather than made manually. I realise that a human will be necessary to add more hardware to the pools, but what are my options for scaling the system by orders of magnitude?
What is a general rule of thumb for RAM and SSD disk requirements, as a fraction of indexed document hive size, to keep query performance at 200ms or less? How do people deal with Java GC stop-the-world pauses, other than simply doubling or tripling every instance?
I am wondering how well alternatives to Solr work in these situations (ElasticSearch, Xapian, and any others I may have missed).
Regards,
=M=
On 4/4/2019 6:42 PM, M. Balridge via dovecot wrote:
What is a general rule of thumb for RAM and SSD disk requirements, as a fraction of indexed document hive size, to keep query performance at 200ms or less? How do people deal with Java GC stop-the-world pauses, other than simply doubling or tripling every instance?
There's no hard and fast rule for exactly how much memory you need for a search engine. Some installs work well with half the index cached, others require more, some require less.
For ideal performance, you should have enough memory over and above your program requirements to cache the entire index. That can be problematic with indexes that are hundreds of gigabytes, or even terabytes. Achieving the ideal is rarely necessary, though.
With a large enough heap, it is simply impossible to avoid long stop-the-world GC pauses. With proper tuning, those full garbage collections can happen far less frequently. I've got another page about that.
https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
To handle extremely large indexes with good performance, I would recommend many servers running SolrCloud, and a sharded index. That way each individual server will not be required to handle terabytes of data. This can get very expensive very quickly. You will also need a load balancer, to eliminate single points of failure.
I am wondering how well alternatives to Solr work in these situations (ElasticSearch, Xapian, and any others I may have missed).
Assuming they are configured as similarly as possible, ElasticSearch and Solr will have nearly identical requirements, and perform similarly to each other. They are both Lucene-based, and it is Lucene that primarily drives the requirements. I know nothing about any other solutions.
With the extremely large index you have described, memory will be your Achilles heel no matter what solution you find.
It is not Java that needs the extreme amounts of memory for very large indexes. It is the operating system -- the disk cache. You might also need a fairly large heap, but the on-disk size of the index will have less of an impact on heap requirements than the number of documents in the index.
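To make the sizing advice above concrete, here is a rough back-of-envelope sketch in C. The "cache the entire index" ideal and the half-cached starting point come from earlier in this reply; the specific heap and overhead figures are assumptions chosen for illustration, not recommendations from this thread:

#include <stdio.h>

int main(void)
{
        /* Assumed example figures -- substitute your own measurements. */
        double index_on_disk_gib = 2048.0;  /* e.g. ~2 TiB of on-disk Lucene index */
        double jvm_heap_gib      = 31.0;    /* assumed Solr heap per node */
        double os_and_other_gib  = 8.0;     /* assumed OS + other processes */

        /* Ideal case: enough spare RAM for the OS page cache to hold the
         * whole index, on top of the heap and everything else. */
        double fully_cached_gib = jvm_heap_gib + os_and_other_gib + index_on_disk_gib;

        /* "Some installs work well with half the index cached" -- a common
         * starting point to measure and then adjust from. */
        double half_cached_gib = jvm_heap_gib + os_and_other_gib
                                 + index_on_disk_gib / 2.0;

        printf("fully cached index: %.0f GiB RAM per node\n", fully_cached_gib);
        printf("half cached index : %.0f GiB RAM per node\n", half_cached_gib);
        return 0;
}

Sharding across many SolrCloud nodes, as recommended above, shrinks index_on_disk_gib per node, which is what makes those totals manageable.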
Thanks, Shawn