Dovecot - FTS Solr: disk usage & position information?

Tue Sep 7 04:37:53 EEST 2021

On 9/6/2021 12:58 AM, Vincent Brillault wrote:
> Hi Alessio,
> 
>> does this optimization also reduce the RAM requirements on the Solr server?
> 
> Unfortunately we didn't measure this before/after the change. Since we 
> are removing features (position information), I wouldn't expect the 
> memory requirement to increase, but I'm no expert.
> 
> To be honest, I've not been able to measure in any sensible way the 
> memory really required by Solr. The memory directly used by the Solr 
> process is rather limited, but a lot of memory is used for file caches, 
> which also feels (again not an expert) important for good performance.

There likely would be a decrease in the amount of memory required, 
especially at index time.

The best way to see what the minimum requirements really are for a Java 
program is through GC logs.  Run the program for a really long time, 
exercising it hard.  Gather the GC logs, and have the gceasy.io website 
analyze those logs.  Solr comes configured out of the box to generate GC 
logs.

One of the graphs that gceasy.io has is "Heap After GC" ... the low 
points in that graph (as long as the program was busy during that 
timeframe) show the minimum heap requirement.  You would want to add 
some headroom to that number and set your max heap to the result.  For 
instance, if I were seeing 8GB as the minimum required, I would probably 
want the heap to be 10GB or 12GB.  Java memory management works best 
when it has a little breathing room.  If the heap is too close to the 
minimum requirement in size, Java will spend more time doing 
garbage collection than it spends running the application.
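
A minimal sketch of that arithmetic, in Python.  The function name, the 
sample readings and the headroom fractions are all made up for 
illustration; the only inputs taken from the text above are the 8GB low 
point and the 10GB/12GB targets.

```python
def suggest_max_heap_gb(heap_after_gc_gb, headroom=0.25):
    """Suggest a max heap: lowest "Heap After GC" reading plus headroom.

    heap_after_gc_gb: readings (in GB) taken while the program was busy.
    headroom: fraction added on top of the low point (an assumption,
              not a value recommended by Solr or gceasy.io).
    """
    if not heap_after_gc_gb:
        raise ValueError("need at least one heap-after-GC reading")
    low_point = min(heap_after_gc_gb)
    return low_point * (1 + headroom)


# Hypothetical readings read off the graph, in GB.
readings = [9.1, 8.0, 8.6, 8.2]

print(suggest_max_heap_gb(readings, headroom=0.25))  # 10.0
print(suggest_max_heap_gb(readings, headroom=0.50))  # 12.0
```

The headroom fraction is a judgment call; the point is only that the 
max heap should sit comfortably above the lowest post-GC value.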

> At least the solrconfig.xml shouldn't work with 7.7.0 since I increased 
> the luceneMatchVersion to match 8.6 and imported a few defaults from the 
> default upstream 8.6 configuration. I think these changes could be 
> ignored for 7.7.0.

I was curious what a significantly newer luceneMatchVersion would do. 
So I unpacked Solr 6.1.0 (which I already had downloaded), started it, 
created a core, and edited the config for that core so that 
luceneMatchVersion was 8.9.0, then restarted Solr.  As expected, it came 
up with no problem.

Then I did another test, with my Solr install for dovecot.  It's running 
8.9.0, so I set luceneMatchVersion (lMV) to 9.5.0 and restarted Solr.  
Then I sent myself an 
email to trigger an update to the Solr index.  And I did a query in the 
Solr admin UI.  Everything worked.

So having a luceneMatchVersion that's far beyond the actual Solr version 
won't cause any problems, and I doubt that it makes any difference in 
how things work.  What this setting is for is the ability for some 
Lucene analysis components to work as they did in older versions -- so 
users could keep older behavior that they relied on that changed when a 
bug was fixed.  For example, in version 4.8.0 a major bug was fixed in 
the word delimiter filter ... the fix for this bug caused it to work 
very differently than it did before.  For a while after that, it was 
possible to set lMV to 4.7.0 or earlier and regain the old behavior.

> For schema.xml, I made quite a few changes, but all seem to be backward 
> compatible:
>   - Remove unused 'boolean' field type
>   - Remove KeywordMarkerFilterFactory: protwords are usually empty anyway
>   - Use a simpler 'text_basic' field type (no StopFilterFactory, 
> SynonymGraphFilterFactory or PorterStemFilterFactory) for processing 
> non-human fields (all but body and subject)
>   - Replace autoGeneratePhraseQueries & positionIncrementGap with 
> omitTermFreqAndPositions="true" & omitPositions="true" on TextField 
> fieldtypes (as discussed in this thread)
>   - Minor modifications on WordDelimiterGraphFilterFactory when used in 
> search to have better match (things like 'covid19' are indexed as 
> ['covid', '19', 'covid19'] but only searched as 'covid19')

Having an unused fieldType, or even an unused field, does not cause any 
problems.  It will use up an extremely small amount of memory as Solr 
converts the schema into an in-memory structure.  But it makes zero 
difference in the index size, and any overhead at index time and query 
time would probably be nearly impossible to measure, unless there is a 
huge number of them unused.

The keyword marker filter is one of those things that most people will 
have no idea how to use.  And if the config file for it is empty or 
nonexistent, it doesn't do anything.  Good riddance. :)

It's also a good thing to remove the stopword filter.  Unless you 
actually define some stopwords, it doesn't do anything.  Removing 
stopwords was a performance enhancement that was hugely beneficial when 
CPU and memory capacities were a tiny fraction of what they are today. 
Now, it doesn't provide much benefit, and comes with some pretty 
well-known downsides.  Also happy to see that one go.  On an index the 
size of yours, removing really common stopwords could make the index 
noticeably smaller, but I bet a few of your users would notice the 
downsides.

For most people, there's not a lot of benefit to removing term 
frequencies and positions.  But with an index size well over a terabyte, 
I understand it for your use case.  I knew they caused an index to get 
bigger, but I wasn't aware that the percentage was so high.  My dovecot 
index is just under 600 megabytes, and the changes I have made are the 
kind that make the index bigger, not smaller.  Total message count in 
my index is a little over 153K.

There are a lot of people in the Solr community (including me) who want 
to pick your brain to see what kind of challenges you encountered with 
such an enormous index, and what you did to overcome them.

I don't think there is any such thing as "universally correct" settings 
for the word delimiter filter.  What works for you would cause problems 
for some others.  It's an enormously powerful and configurable filter.
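
To make the index-time versus query-time difference from the quoted 
'covid19' example concrete, here is a toy Python imitation.  It is NOT 
the real word delimiter filter, which has many more options and rules; 
the function names are hypothetical and the splitting rule (letters 
versus digits only) is a deliberate simplification.

```python
import re


def split_parts(token):
    # Split into runs of letters and runs of digits,
    # e.g. "covid19" -> ["covid", "19"].
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def index_terms(token):
    # Index side: keep the parts AND the original token, so either
    # form can be found later.
    parts = split_parts(token)
    return parts + [token] if len(parts) > 1 else [token]


def query_terms(token):
    # Query side: search only for the token as typed.
    return [token]


print(index_terms("covid19"))  # ['covid', '19', 'covid19']
print(query_terms("covid19"))  # ['covid19']
```

With this asymmetry a search for 'covid19' hits only documents that 
contain the joined form, while a search for 'covid' still finds them 
through the split parts.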

Your covid 19 case is a perfect example of something that phrase 
queries are REALLY good for.  Without a phrase query, a document that 
has covid and 19 right next to each other is almost as relevant a match 
as a document where those two terms are in completely different 
sentences.
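
A small Python sketch of that difference.  It is a toy model, not how 
Lucene scores or stores anything: documents are plain token lists, the 
function names are hypothetical, and the two sample documents are 
invented.  A phrase match needs to know where each term sits, which is 
the position information discussed earlier in this thread.

```python
def matches_terms(tokens, terms):
    # Plain term query: every term appears somewhere in the document.
    return all(term in tokens for term in terms)


def matches_phrase(tokens, terms):
    # Phrase query: the terms appear at consecutive positions, in order.
    n = len(terms)
    return any(tokens[i:i + n] == terms
               for i in range(len(tokens) - n + 1))


doc_a = "new covid 19 cases reported".split()
doc_b = "covid rules changed . meeting moved to room 19".split()
query = ["covid", "19"]

print(matches_terms(doc_a, query), matches_terms(doc_b, query))    # True True
print(matches_phrase(doc_a, query), matches_phrase(doc_b, query))  # True False
```

Only the phrase check tells the two documents apart, and it can only do 
so because the position of each token is available.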

There are some things that the fts solr plugin could be doing to get 
better search results.  I should condense my thoughts into something 
formal to present to this project.  One of those ideas involves phrase 
queries. :)

> From taking a quick look at the documentation, I _think_ most of them 
> are compatible with 7.7.0, but without testing, I can't guarantee it.

I agree that your changes are OK for 7.7.  They're probably good even 
back to 6.0, but I can't say that for sure without a closer look.

Although I have done some tweaking to the solrconfig and schema for my 
dovecot index, I haven't gone over it with a fine-tooth comb yet.  A 
full reindex for me only takes a few minutes.  I bet it takes a long 
time for your index!

Thanks,
Shawn