Dear all,
On a local dovecot cluster currently hosting roughly 2.1TB of data and using Solr as its FTS backend, we now have 256GB of data in Solr, split into 12 shards (replication adds another 256GB of data through 12 additional cores).
I'm now trying to see if we can optimize that data. Looking at one core at random (22G), I see that the data is mostly split between:
- .pos files: 12G
- .tim files: 4.2G
- .doc files: 3.8G
- .cfs files: 1.8G
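A breakdown like the one above can be reproduced with a short script along the following lines. This is only a sketch: the function names `size_by_extension` and `format_report` are made up for illustration, and the index directory has to be supplied by whoever runs it.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Return total bytes per file extension under index_dir (recursive)."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # The extension identifies the kind of index file (.pos, .tim, ...);
            # files without one are grouped under "(none)".
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def format_report(totals):
    """Format the totals as text lines, largest extension first."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ["%-8s %8.1f GiB" % (ext, size / 2**30) for ext, size in ordered]
```

Calling `print("\n".join(format_report(size_by_extension(path))))` with `path` pointing at the index directory of a core prints one line per extension.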
Looking around a bit, I found https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/... (which is unfortunately a bit outdated, I think), which explains the content of each file:
- .tim: Term Dictionary
- .tip: Term Index
- .doc: Frequencies and Skip Data
- .pos: Positions
- .pay: Payloads and Offsets
So clearly the file naming convention has changed, but if .pos really is position information ("lists of positions that each term occurs at within documents."), it sounds rather useless for the dovecot integration.
Looking at the Solr documentation on search (https://solr.apache.org/guide/8_6/the-standard-query-parser.html), it seems that position-aware queries are written as "term1 term2"~[0-9]+.
Looking at the dovecot code (https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend...), I don't see this kind of query being made; ~ is only used for fuzzy search.
Has anyone ever tried setting omitTermFreqAndPositions or omitPositions to true for the text fields in the Solr schema? It sounds like this could considerably reduce the disk space used by Solr without losing any features. The only thing I'm not too clear about is "autoGeneratePhraseQueries", which is enabled in https://github.com/dovecot/core/blob/master/doc/solr-schema-7.7.0.xml.
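To make the "autoGeneratePhraseQueries" concern concrete, here is a rough sketch that lists the fields that would omit positions while their field type still auto-generates phrase queries. As far as I understand, a phrase query needs position data, so that combination is the one to check. The sample schema is hypothetical and heavily trimmed, and the helpers `effective_flags` and `risky_fields` are invented for this example; only the three attribute names come from the Solr schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema used only to exercise the functions;
# field and type names are illustrative, not copied from a real schema.
SAMPLE_SCHEMA = """
<schema name="dovecot" version="2.0">
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="true"/>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="body" type="text" indexed="true" stored="false"/>
  <field name="subject" type="text" indexed="true" stored="false" omitPositions="true"/>
</schema>
"""

FLAGS = ("omitPositions", "omitTermFreqAndPositions", "autoGeneratePhraseQueries")


def effective_flags(schema_xml):
    """Map field name -> {flag: bool}; a field attribute overrides its type's."""
    root = ET.fromstring(schema_xml)
    types = {t.get("name"): t for t in root.iter("fieldType")}
    result = {}
    for field in root.iter("field"):
        ftype = types.get(field.get("type"))
        flags = {}
        for flag in FLAGS:
            value = field.get(flag)
            if value is None and ftype is not None:
                value = ftype.get(flag)
            # Simplification: an unset flag counts as false, which ignores
            # whatever per-class defaults Solr itself applies.
            flags[flag] = (value == "true")
        result[field.get("name")] = flags
    return result


def risky_fields(schema_xml):
    """Fields without positions whose type still auto-generates phrase queries."""
    risky = []
    for name, flags in effective_flags(schema_xml).items():
        no_positions = flags["omitPositions"] or flags["omitTermFreqAndPositions"]
        if no_positions and flags["autoGeneratePhraseQueries"]:
            risky.append(name)
    return sorted(risky)
```

Running `risky_fields()` on the contents of a real schema file would show which fields need "autoGeneratePhraseQueries" reviewed before positions are dropped.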
Thanks in advance, Vincent Brillault
PS: I have attached the schema we are using for completeness. It's based on the one in the dovecot repo, with a bit of simplification for headers that don't really require as much massaging.