Dovecot - FTS Solr: disk usage & position information?
Dear all,
On a local dovecot cluster currently hosting roughly 2.1TB of data, using Solr as its FTS backend, we now have 256GB of data in Solr, split into 12 shards (to which replication adds another 256GB of data through 12 additional cores).
I'm now trying to see if we can optimize that data. Looking at one core at random (22G), I see that the data is split mostly between
- .pos files: 12G
- .tim files: 4.2G
- .doc files: 3.8G
- .cfs files: 1.8G
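A breakdown like the one above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the path to a core's index directory is something you have to supply yourself, since it depends on the installation.

```python
import os
from collections import defaultdict


def usage_by_extension(index_dir):
    """Sum file sizes (in bytes) under index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files without an extension (e.g. segments_N) are grouped together.
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(totals):
    """Return one formatted line per extension, largest first."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{ext:<8} {size / 2**30:10.3f} GiB" for ext, size in ordered]


# Usage (the path is a placeholder, not a real default):
#   for line in report(usage_by_extension("/path/to/core/data/index")):
#       print(line)
```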
Looking around a bit, I found https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/... (which is unfortunately a bit outdated, I think) that explains the content of each file:
- .tim: Term Dictionary
- .tip: Term Index
- .doc: Frequencies and Skip Data
- .pos: Positions
- .pay: Payloads and Offsets
So clearly the file naming convention has changed, but if .pos really is position information ("lists of positions that each term occurs at within documents."), it sounds rather useless for the dovecot integration.
Looking at the Solr documentation on search (https://solr.apache.org/guide/8_6/the-standard-query-parser.html), it seems that position-aware queries are written as "term1 term2"~[0-9]+.
Looking at the dovecot code (https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend...), I don't see this kind of query being made, ~ only being used for fuzzy search.
Has anyone ever tried to set omitTermFreqAndPositions or omitPositions to true for the text fields in the Solr schema? It sounds like this could greatly reduce the disk space used by Solr without losing any feature. The only thing I'm not too clear about is "autoGeneratePhraseQueries", which is enabled in https://github.com/dovecot/core/blob/master/doc/solr-schema-7.7.0.xml.
Thanks in advance, Vincent Brillault
PS: I have attached the schema we are using for completeness. It's based on the one in the dovecot repo, with a bit of simplification for headers that don't really require as much massaging.
On 8/4/2021 1:24 AM, Vincent Brillault wrote:
> On a local dovecot cluster currently hosting roughly 2.1TB of data, using Solr as its FTS backend, we now have 256GB of data in Solr, split into 12 shards (to which replication adds another 256GB of data through 12 additional cores).
> I'm now trying to see if we can optimize that data. Looking at one core at random (22G), I see that the data is split mostly between
> - .pos files: 12G
> - .tim files: 4.2G
> - .doc files: 3.8G
> - .cfs files: 1.8G
> Looking around a bit, I found https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/... (which is unfortunately a bit outdated, I think) that explains the content of each file:
> - .tim: Term Dictionary
> - .tip: Term Index
> - .doc: Frequencies and Skip Data
> - .pos: Positions
> - .pay: Payloads and Offsets
This is completely off-topic for the dovecot list. I am involved with the Solr project, so I can discuss it. My message will also be off topic here.
You didn't say what version of Solr you're on. That document for Lucene 6.2.0 would be relevant for Solr 6.2.0. There are versions of that document for all Lucene releases, which have been in lock-step with Solr releases since one of the early 3.x versions. (Aside: Solr has been split into its own top-level Apache project, so there is no longer a guarantee moving forward that Solr X.Y.Z will be based on Lucene X.Y.Z)
Not all of the lucene file types will be involved on every install of Solr. It will depend on the configuration.
The .cfs file is a file where all of the other file types for a segment are compounded into a single file. Within that single file, each file type will use the same format as it would if it had its own extension. I'm not completely clear on when Lucene (under Solr's control) will choose the CFS format, but I think it happens when the segments are small, not large.
> Looking at the Solr documentation on search (https://solr.apache.org/guide/8_6/the-standard-query-parser.html), it seems that position-aware queries are written as "term1 term2"~[0-9]+.
> Looking at the dovecot code (https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend...), I don't see this kind of query being made, ~ only being used for fuzzy search.
Positions are required for a phrase query -- where the query text is in double quotes. The number after ~ on a phrase query refers to phrase slop -- think of it as a fuzziness factor for the phrase, not for each term. You noticed that dovecot's FTS Solr plugin doesn't currently use phrase queries explicitly, but there's no guarantee that this will always be the case. Position data will only be accessed if it is needed for a query, so if it is not needed it should not affect query performance. I cannot speak as to whether the FTS Solr plugin relies on the autoGeneratePhraseQueries functionality, but if it does, then you definitely want position data in the index. That functionality can do a lot to improve relevancy ranking, so I would expect it to be instrumental in good full-text searching -- disabling positions will probably not help your search results.
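To make the difference between the two uses of ~ concrete, here is a minimal sketch that only builds query strings for the standard query parser. The helper names are invented for illustration, the field name "body" is just an example, and nothing here is taken from the dovecot plugin.

```python
from urllib.parse import urlencode


def phrase_query(field, words, slop=None):
    """Phrase query: the words in double quotes, optionally followed by ~slop."""
    text = " ".join(words)
    query = f'{field}:"{text}"'
    return query if slop is None else f"{query}~{slop}"


def fuzzy_term(field, word, max_edits=None):
    """Fuzzy query on a single term: ~ directly after the term, no quotes."""
    query = f"{field}:{word}~"
    return query if max_edits is None else f"{query}{max_edits}"


def select_params(query, rows=10):
    """URL-encode the query as it would appear in a /select request."""
    return urlencode({"q": query, "rows": rows})


print(phrase_query("body", ["term1", "term2"]))     # body:"term1 term2"
print(phrase_query("body", ["term1", "term2"], 3))  # body:"term1 term2"~3
print(fuzzy_term("body", "term1", 1))               # body:term1~1
```

Only the first two forms need position data in the index; the fuzzy form works on individual terms.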
If you want an in-depth discussion beyond this email, please subscribe to the solr-user mailing list and ask there.
Note that general Solr recommendations are to have enough space available that the Solr index can triple in size temporarily -- this is to accommodate all possible scenarios for Lucene segment merging. Running Solr on systems with limited disk space is not recommended.
Solr does have an "optimize" operation which will combine all the segments into one, removing space taken up by deleted documents as it works. Lucene calls that operation "forceMerge". Running an optimize can help performance, but it's extremely resource intensive and can take a long time to run -- performance gets worse before it gets better. Also, the amount of performance gain is not usually significant.
Thanks, Shawn
Dear Shawn,
Thanks for your very complete answer!
> This is completely off-topic for the dovecot list. I am involved with the Solr project, so I can discuss it. My message will also be off topic here.
Sorry, maybe I didn't explain myself properly. I asked on the dovecot mailing list as I'm interested in:
- The interaction between Solr & dovecot: what dovecot really needs and uses from Solr.
- The reasons for the settings in the schema example in the dovecot repositories.
I think these are still interesting to be discussed on the dovecot mailing list, but I'm extremely grateful for your feedback.
> You didn't say what version of Solr you're on. That document for Lucene 6.2.0 would be relevant for Solr 6.2.0.
Indeed, I should have. I'm using Solr 8.6, which is clearly not the same as Solr 6.2.0, but when looking at more recent versions of the documentation, no information about the use of each file appeared. That's why I was mentioning it was slightly outdated.
>> I don't see this kind of query being made, ~ only being used for fuzzy search.
> Positions are required for a phrase query -- where the query text is in double quotes.
Yes, I discovered that while testing yesterday :D
[...] PhraseQuery
> Right now you noticed that dovecot's FTS Solr plugin doesn't
> explicitly use phrase queries, but there's no guarantee that this will
> always be the case. Position data will only be accessed if it is needed
> for a query, so if it is not needed it should not affect query
> performance.
Of course, if the requirements of dovecot's FTS Solr plugin change, then
the schema I'm using will have to change. This is why I'm asking here.
Solr is a powerful engine, but searches within IMAP are more restricted.
As far as I understand, dovecot does not make use of all the features of
Solr, only of a very small subset, and thus I believe it makes sense to
try to optimize the configuration to deliver what it needs without
spending too much compute or storage on features dovecot doesn't need.
> I cannot speak as to whether the FTS Solr plugin relies on
> the autoGeneratePhraseQueries functionality, but if it does, then you
> definitely want position data in the index. That functionality can do a
> lot to improve relevancy ranking, so I would expect it to be
> instrumental in good full-text searching -- disabling positions will
> probably not help your search results.
This is the main question, and what I don't really understand. If the
queries that dovecot generates from IMAP searches improve significantly
with position data, then yes, it's clearly required. If they only
improve marginally, then a cost/benefit analysis should be done.
Yesterday, I modified my test cluster to use
`omitTermFreqAndPositions="true" omitPositions="true"` instead of
`autoGeneratePhraseQueries="true"`. This is a painful operation, as it
requires dropping everything and re-indexing all the data, but at the
end of the day, after re-indexing:
- Total disk usage for the test cluster went from 16.0 GB to 9.8 GB, so
a 39% reduction in disk usage :)
- No .pos file created in the cores
Basic tests show no obvious change in the search results (after I
removed autoGeneratePhraseQueries, before that it failed in some cases).
Has any other Dovecot user tried something similar? (I've only found one
post on the internet raising the question so far :/).
> If you want an in-depth discussion beyond this email, please subscribe
> to the solr-user mailing list and ask there.
Thanks, I'll take you up on your offer for the Solr-specific part, as I
need to understand autoGeneratePhraseQueries better :)
> Note that general Solr recommendations are to have enough space
> available that the Solr index can triple in size temporarily -- this is
> to accommodate all possible scenarios for Lucene segment merging.
> Running Solr on systems with limited disk space is not recommended.
Well, it depends on what you define as "limited". I'd love to have
infinite storage, but unfortunately every resource is always limited one
way or another. Ensuring that each core can temporarily triple in size
(required e.g. if one wants to split the shards to distribute them over
more nodes) is one thing (and it can have a limited impact if the shards
are split into small enough sizes). Requiring double the size overall
with no operational benefit is another ;). I'm just trying to understand
how much storage we'll really need once the cluster is scaled to final
use.
Thanks again Shawn for your contribution, it was quite helpful!
Cheers,
Vincent
On 8/5/2021 1:00 AM, Vincent Brillault wrote:
> Indeed, I should have. I'm using Solr 8.6, which is clearly not the same as Solr 6.2.0, but when looking at more recent versions of the documentation, no information about the use of each file appeared. That's why I was mentioning it was slightly outdated.
Here's documentation from 8.6 about Lucene file formats:
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/codecs/lucene86/...
Thanks, Shawn
Dear all,
Just a status update, in case this can help others.
We went ahead, disabled the indexing of position information and re-indexed our mail data (over a couple of days to avoid overloading the systems). Before the re-indexing we had 1.33 TiB in our Solr indexes. After re-indexing, we had only 542 GiB: that's a 60% reduction of the storage requirements for our FTS indexes :)
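For reference, the reduction figures quoted in this thread can be checked with a couple of lines of arithmetic. This is just a sketch of the calculation; the only inputs are the sizes reported above and the 16.0 GB to 9.8 GB result from the earlier test cluster.

```python
GIB_PER_TIB = 1024


def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before` (same unit for both)."""
    return (before - after) / before * 100


# Production indexes: 1.33 TiB before, 542 GiB after re-indexing.
production = reduction_percent(1.33 * GIB_PER_TIB, 542)
# Test cluster: 16.0 GB before, 9.8 GB after.
test_cluster = reduction_percent(16.0, 9.8)

print(f"production:   {production:.0f}% smaller")    # production:   60% smaller
print(f"test cluster: {test_cluster:.0f}% smaller")  # test cluster: 39% smaller
```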
So far, our users have not reported any issues or measurable differences in the quality of the FTS. From further debugging, as discussed on the solr-user mailing list (https://lists.apache.org/thread.html/rcdf8bb97be0839e57928ad5fa34501ec8a7339...), I've come to the conclusion that, with the current integration between Dovecot and Solr (especially the fact that " is escaped), it's impossible to trigger phrase queries from user queries as long as autoGeneratePhraseQueries is false.
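Why escaping " rules out phrase queries can be shown with a toy example. The escaping routine below is hypothetical and is not dovecot's actual code; it only assumes that special characters, including the double quote, get a backslash in front of them before the text is sent to Solr.

```python
# Characters treated as special in this sketch (an assumption, not dovecot's list).
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_user_input(text):
    """Backslash-escape every special character in the user's search text."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


def has_unescaped_quote(query):
    """True if the query still contains a double quote that can open a phrase."""
    escaped = False
    for ch in query:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            return True
    return False


user_text = '"covid 19"'
print(escape_user_input(user_text))                       # \"covid 19\"
print(has_unescaped_quote(user_text))                     # True
print(has_unescaped_quote(escape_user_input(user_text)))  # False
```

Once the quotes are escaped, they are literal characters inside the terms, so the only remaining way to get a phrase query would be the automatic one.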
I've attached the schema.xml and solrconfig.xml we are now using with Solr 8.6.0, in case there is any interest from others. Let me know if you prefer a MR to update the xmls present in https://github.com/dovecot/core/tree/master/doc.
Cheers, Vincent
Hi Vincent,
thanks for your investigations!
On 01/09/21 11:27, Vincent Brillault wrote:
> Dear all,
> Just a status update, in case this can help others.
> We went ahead, disabled the indexing of position information and re-indexed our mail data (over a couple of days to avoid overloading the systems). Before the re-indexing we had 1.33 TiB in our Solr indexes. After re-indexing, we had only 542 GiB: that's a 60% reduction of the storage requirements for our FTS indexes :)
Does this optimization also reduce the RAM requirements on the Solr server?
> So far, our users have not reported any issues or measurable differences in the quality of the FTS. From further debugging, as discussed on the solr-user mailing list (https://lists.apache.org/thread.html/rcdf8bb97be0839e57928ad5fa34501ec8a7339...), I've come to the conclusion that, with the current integration between Dovecot and Solr (especially the fact that " is escaped), it's impossible to trigger phrase queries from user queries as long as autoGeneratePhraseQueries is false.
> I've attached the schema.xml and solrconfig.xml we are now using with Solr 8.6.0, in case there is any interest from others. Let me know if you prefer a MR to update the xmls present in https://github.com/dovecot/core/tree/master/doc.
Do the attached schema and config files also work with Solr 7.7.0? Since dovecot provides a schema and config for 7.7.0, a patch based on it would be useful for many of us.
Thanks
-- Alessio Cecchi Postmaster @ http://www.qboxmail.it https://www.linkedin.com/in/alessice
Hi Alessio,
> Does this optimization also reduce the RAM requirements on the Solr server?
Unfortunately we didn't measure this before/after the change. Since we are removing features (position information), I wouldn't expect the memory requirement to increase, but I'm no expert.
To be honest, I've not been able to measure in any sensible way the memory really required by Solr. The memory directly used by the Solr process is rather limited, but a lot of memory is used for file caches, which also feels (again, not an expert) important for good performance.
> Do the attached schema and config files also work with Solr 7.7.0? Since dovecot provides a schema and config for 7.7.0, a patch based on it would be useful for many of us.
At least the solrconfig.xml shouldn't work with 7.7.0 since I increased the luceneMatchVersion to match 8.6 and imported a few defaults from the default upstream 8.6 configuration. I think these changes could be ignored for 7.7.0.
For schema.xml, I made quite a few changes, but all seem to be backward compatible:
- Remove the unused 'boolean' field type
- Remove KeywordMarkerFilterFactory: protwords are usually empty anyway
- Use a simpler 'text_basic' field type (no StopFilterFactory, SynonymGraphFilterFactory or PorterStemFilterFactory) for processing non-human fields (all but body and subject)
- Replace autoGeneratePhraseQueries & positionIncrementGap with omitTermFreqAndPositions="true" & omitPositions="true" on TextField fieldtypes (as discussed in this thread)
- Minor modifications to WordDelimiterGraphFilterFactory when used in search, to get better matches (things like 'covid19' are indexed as ['covid', '19', 'covid19'] but only searched as 'covid19')
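The positions-related change in this list can be sketched as a small schema rewrite. This is only an illustration: the sample schema fragment below is made up (it is not the attached schema.xml), and the helper is not something that was used in this thread. The four attribute names are the ones discussed above.

```python
import xml.etree.ElementTree as ET

# Invented sample, loosely shaped like a Solr schema; not the real dovecot schema.
SAMPLE_SCHEMA = """\
<schema name="dovecot" version="2.0">
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text" class="solr.TextField"
             autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>
</schema>
"""


def drop_positions(schema_xml):
    """Return the schema with positions disabled on every solr.TextField type."""
    root = ET.fromstring(schema_xml)
    for field_type in root.iter("fieldType"):
        if field_type.get("class") != "solr.TextField":
            continue
        # Position-dependent settings no longer make sense without positions.
        field_type.attrib.pop("autoGeneratePhraseQueries", None)
        field_type.attrib.pop("positionIncrementGap", None)
        field_type.set("omitTermFreqAndPositions", "true")
        field_type.set("omitPositions", "true")
    return ET.tostring(root, encoding="unicode")


print(drop_positions(SAMPLE_SCHEMA))
```

As noted earlier in the thread, a schema change like this only takes effect after dropping the existing index and re-indexing everything.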
From taking a quick look at the documentation, I _think_ most of them are compatible with 7.7.0, but without testing, I can't guarantee it.
Cheers, Vincent
On 9/6/2021 12:58 AM, Vincent Brillault wrote:
> Hi Alessio,
>> Does this optimization also reduce the RAM requirements on the Solr server?
> Unfortunately we didn't measure this before/after the change. Since we are removing features (position information), I wouldn't expect the memory requirement to increase, but I'm no expert.
> To be honest, I've not been able to measure in any sensible way the memory really required by Solr. The memory directly used by the Solr process is rather limited, but a lot of memory is used for file caches, which also feels (again, not an expert) important for good performance.
There likely would be a decrease in the amount of memory required, especially at index time.
The best way to see what the minimum requirements really are for a Java program is through GC logs. Run the program for a really long time, exercising it hard. Gather the GC logs, and have the gceasy.io website analyze those logs. Solr comes configured out of the box to generate GC logs.
One of the graphs that gceasy.io has is "Heap After GC" ... the low points in that graph (as long as the program was busy during that timeframe) will be the minimum requirement. You would want to add some arbitrary value to that number and set your max heap to that. For instance, if I was seeing 8GB as the minimum required, I would probably want the heap to be 10GB or 12GB. Java memory management works best when it has a little breathing room. If the heap is too close to the minimum requirement in size, Java will be spending more time doing garbage collection than it spends running the application.
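The sizing rule of thumb in the previous paragraph amounts to simple arithmetic. A minimal sketch, where the function name and the headroom fractions are illustrative choices that happen to reproduce the 8GB to 10GB or 12GB example:

```python
def suggested_max_heap_gb(min_required_gb, headroom=0.25):
    """Add proportional breathing room on top of the observed minimum heap."""
    if min_required_gb <= 0 or headroom < 0:
        raise ValueError("minimum must be positive and headroom non-negative")
    return min_required_gb * (1 + headroom)


# Low point of the "Heap After GC" graph while the program was busy: 8 GB.
print(suggested_max_heap_gb(8, headroom=0.25))  # 10.0
print(suggested_max_heap_gb(8, headroom=0.5))   # 12.0
```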
> At least the solrconfig.xml shouldn't work with 7.7.0 since I increased the luceneMatchVersion to match 8.6 and imported a few defaults from the default upstream 8.6 configuration. I think these changes could be ignored for 7.7.0.
I was curious what a significantly newer luceneMatchVersion would do. So I unpacked solr 6.1.0 (which I already had downloaded), started it, created a core, and edited the config for that core so the version was 8.9.0, then restarted Solr. As expected, it came up with no problem.
Then I did another test, with my Solr install for dovecot. It's running 8.9.0, so I set lMV to 9.5.0 and restarted Solr. Then I sent myself an email to trigger an update to the Solr index. And I did a query in the Solr admin UI. Everything worked.
So having a luceneMatchVersion that's far beyond the actual Solr version won't cause any problems, and I doubt that it makes any difference in how things work. What this setting is for is the ability for some Lucene analysis components to work as they did in older versions -- so users could keep older behavior that they relied on that changed when a bug was fixed. For example, in version 4.8.0 a major bug was fixed in the word delimiter filter ... the fix for this bug caused it to work very differently than it did before. For a while after that, it was possible to set lMV to 4.7.0 or earlier and regain the old behavior.
> For schema.xml, I made quite a few changes, but all seem to be backward compatible:
> - Remove the unused 'boolean' field type
> - Remove KeywordMarkerFilterFactory: protwords are usually empty anyway
> - Use a simpler 'text_basic' field type (no StopFilterFactory, SynonymGraphFilterFactory or PorterStemFilterFactory) for processing non-human fields (all but body and subject)
> - Replace autoGeneratePhraseQueries & positionIncrementGap with omitTermFreqAndPositions="true" & omitPositions="true" on TextField fieldtypes (as discussed in this thread)
> - Minor modifications to WordDelimiterGraphFilterFactory when used in search, to get better matches (things like 'covid19' are indexed as ['covid', '19', 'covid19'] but only searched as 'covid19')
Having an unused fieldType, or even an unused field, does not cause any problems. It will use up an extremely small amount of memory as Solr converts the schema into an in-memory structure. But it makes zero difference in the index size, and any overhead at index time and query time would probably be nearly impossible to measure, unless there is a huge number of them unused.
The keyword marker filter is one of those things that most people will have no idea how to use. And if the config file for it is empty or nonexistent, it doesn't do anything. Good riddance. :)
It's also a good thing to remove the stopword filter. Unless you actually define some stopwords, it doesn't do anything. Removing stopwords was a performance enhancement that was hugely beneficial when CPU and memory capacities were a tiny fraction of what they are today. Now, it doesn't provide much benefit, and comes with some pretty well-known downsides. Also happy to see that one go. On an index the size of yours, removing really common stopwords could make the index noticeably smaller, but I bet a few of your users would notice the downsides.
For most people, there's not a lot of benefit to removing term frequencies and positions. But with an index size well over a terabyte, I understand it for your use case. I knew they caused an index to get bigger, but I wasn't aware that the percentage was so high. My dovecot index is just under 600 megabytes, and the changes I have made are the kind that makes the index bigger, not smaller. Total message count in my index is a little over 153K.
There are a lot of people in the Solr community (including me) that want to pick your brain to see what kind of challenges you encountered with such an enormous index, and what you did to overcome them.
I don't think there is any such thing as "universally correct" settings for the word delimiter filter. What works for you would cause problems for some others. It's an enormously powerful and configurable filter.
Your example of covid 19 is a perfect example of something that phrase queries are REALLY good for. Without a phrase query, a document that has covid 19 right next to each other is almost as relevant a match as a document where those two terms are in completely different sentences.
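The point about adjacency can be illustrated with a toy positional index. This is a deliberately simplified sketch (whitespace tokenization, no scoring, invented sample documents) and not how Lucene is implemented; it only shows what information is lost when positions are omitted.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_all_terms(index, terms):
    """Docs containing every term; needs no position data."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def match_phrase(index, terms):
    """Docs where the terms occur at consecutive positions; needs positions."""
    hits = set()
    for doc_id in match_all_terms(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "a": "new covid 19 cases reported",
    "b": "covid cases rose for 19 days",
}
index = build_index(docs)
print(sorted(match_all_terms(index, ["covid", "19"])))  # ['a', 'b']
print(sorted(match_phrase(index, ["covid", "19"])))     # ['a']
```

Without positions, only the first kind of match is possible, so both documents look alike to the search.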
There are some things that the fts solr plugin could be doing to get better search results. I should condense my thoughts into something formal to present to this project. One of those ideas involves phrase queries. :)
> From taking a quick look at the documentation, I _think_ most of them are compatible with 7.7.0, but without testing, I can't guarantee it.
I agree that your changes are OK for 7.7. They're probably good even back to 6.0, but I can't say that for sure without a closer look.
Although I have done some tweaking to the solrconfig and schema for my dovecot index, I haven't gone over it with a fine-tooth comb yet. A full reindex for me only takes a few minutes. I bet it takes a long time for your index!
Thanks, Shawn
participants (3): Alessio Cecchi, Shawn Heisey, Vincent Brillault