Questions about how fts-solr works

Radu Gheorghe radu.gheorghe at sematext.com
Mon Aug 8 11:30:28 UTC 2016


Hello,

I'm quite a Dovecot newbie, so please be gentle :) Though I did my
homework as well as I could, I still have some questions regarding the
Solr plugin. Specifically:
- I understand that by default, a mailbox is indexed on the first
search and then deltas are indexed in subsequent searches. Are Emails
indexed in batches or one by one? Looking at the code, I see a
hardcoded limit of 1000, and I'm guessing if the mailbox is done and
there are <1000 Emails in the buffer, it just flushes them, right?
- if I set fts_autoindex=yes, does it mean that as soon as the Email
is delivered by the MTA, it will be indexed in Solr? Or does it have
to be read by the user or touched in some other way?
- also, with fts_autoindex=yes, are Emails indexed in batches? If yes,
is there also a time limit besides the size limit, e.g. for the case
where only 100 messages were received?
- I have the same question about deletes: when do they happen, and are
they batched?
- what happens if Solr is unavailable? I know Dovecot keeps track of
what was indexed in the dovecot.index files, but does it retry? If yes,
what's the retry policy and can it be configured? Also, does it behave
the same if Solr is actually available but returns an error?
- the same question is for attachments, though I think this is general
FTS - what if Tika fails to parse the attachment? Does Dovecot still
index the Email metadata? As an aside, I'm also wondering if I could use
the Tika that comes with Solr
(https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika).
Can fts-solr handle attachments at all? I'm asking because I don't see
that field in the default schema.xml
- can I add arbitrary URI parameters to the Solr request? I see that
one could fiddle with the path, which I assume will let one have one
collection per mailbox (though I'm curious how that works with batches
- I'm guessing one batch/indexing thread per mailbox?). Specifically,
I'm interested in using the mailbox as a routing value
- if I read the code correctly, Dovecot does a soft commit when it's done
with the specific mailbox. For indexing at search time, I see why it
makes sense. If I do "autoindex", can I disable that and let Solr
autoSoftCommit every N seconds? That should improve indexing
throughput and reduce load. I see that one can already do this for
hard commits (I'd use autoCommit there, though a hard commit is also
triggered when ramBufferSizeMB gets hit)
- when querying, can I sort by an arbitrary field, such as the date? I
saw I can sort by score, but I can't find anything in the code that
would suggest sorting by other fields is supported
- also when querying, can I specify which fields to return? I see that
the plugin asks for Email ID, so I'm guessing it fetches things like
from/to from the Email itself. I'm thinking that if I want to sort by
those values I need to set docValues=true on them, to save memory. In
that case, I might as well retrieve the original string from
docValues, which should be a whole lot faster
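
To make the batching question concrete, this is roughly how I picture
the behaviour, written as a small Python sketch. It is not Dovecot's
code: the function name, the send_batch callback and the way messages
are represented are all made up for illustration, and the limit of
1000 is only my reading of the plugin source.

```python
from typing import Callable, Iterable, List

# Assumption: the hardcoded batch size I believe I saw in the plugin.
BATCH_LIMIT = 1000


def index_in_batches(
    messages: Iterable[str],
    send_batch: Callable[[List[str]], None],
    limit: int = BATCH_LIMIT,
) -> int:
    """Buffer messages and pass them to send_batch in chunks of at most
    `limit`. Returns the number of batches that were sent."""
    buffer: List[str] = []
    batches = 0
    for msg in messages:
        buffer.append(msg)
        if len(buffer) >= limit:
            # Size limit reached: send a full batch and start a new buffer.
            send_batch(buffer)
            batches += 1
            buffer = []
    if buffer:
        # Mailbox is done with fewer than `limit` messages left: flush them.
        send_batch(buffer)
        batches += 1
    return batches
```

With 2500 messages this would send batches of 1000, 1000 and 500. What
I can't tell from the sketch (or the code) is whether anything other
than "mailbox done" triggers that last flush.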

My plan is to do all sorts of tests, but having a better background on
how it works will certainly help.
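
For those tests, the kind of update request I would like to end up
with looks roughly like the sketch below. The helper and the base URL
are hypothetical and have nothing to do with how fts-solr builds its
URLs today; _route_, softCommit and commitWithin are the Solr request
parameters I have in mind, and using the mailbox name as the routing
value is only my idea.

```python
from typing import Optional
from urllib.parse import urlencode


def solr_update_url(
    base: str,
    route: Optional[str] = None,
    soft_commit: bool = False,
    commit_within_ms: Optional[int] = None,
) -> str:
    """Build a Solr /update URL with optional routing and commit parameters.

    Hypothetical helper for illustration only; `base` is the URL of the
    Solr core or collection.
    """
    params = {}
    if route is not None:
        # Routing value for SolrCloud, e.g. the mailbox name.
        params["_route_"] = route
    if soft_commit:
        # Explicit soft commit, as opposed to relying on autoSoftCommit.
        params["softCommit"] = "true"
    if commit_within_ms is not None:
        # Let Solr commit within the given number of milliseconds.
        params["commitWithin"] = str(commit_within_ms)
    query = urlencode(params)
    return f"{base}/update?{query}" if query else f"{base}/update"
```

Leaving out both commit parameters is the case I care about most: the
request would carry only the routing value, and autoSoftCommit on the
Solr side would decide when documents become searchable.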

Any pointers, feedback, encouragement, etc. are certainly welcome -
thanks in advance!

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/