Hello,
I'm quite a Dovecot newbie, so please be gentle :) Though I did my homework as well as I could, I still have some questions regarding the Solr plugin. Specifically:
- I understand that by default, a mailbox is indexed on the first search and then deltas are indexed in subsequent searches. Are Emails indexed in batches or one by one? Looking at the code, I see a hardcoded limit of 1000, and I'm guessing if the mailbox is done and there are <1000 Emails in the buffer, it just flushes them, right?
- if I set fts_autoindex=yes, does it mean that as soon as the Email is delivered by the MTA, it will be indexed in Solr? Or does it have to be read by the user or touched in any way?
- also, with fts_autoindex=yes, are Emails indexed in batches? if yes, is there also a time limit besides the size limit? e.g. if only 100 messages were received
- I have the same question about deletes: when do they happen, and are they batched?
- what happens if Solr is unavailable? I know Dovecot keeps track of what is indexed in dovecot.indexed files, but does it retry? If yes, what's the retry policy and can it be configured? Also, does it behave the same if Solr is actually available but throws an error?
- the same question goes for attachments, though I think this is general FTS - what if Tika fails to parse the attachment? Does Dovecot still index the Email metadata? As an aside, I'm also wondering if I could use the Tika that comes with Solr (https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Ce...). Can fts-solr handle attachments at all? I'm asking because I don't see that field in the default schema.xml
- can I add arbitrary URI parameters to the Solr request? I see that one could fiddle with the path, which I assume will let one have one collection per mailbox (though I'm curious how that works with batches - I'm guessing one batch/indexing thread per mailbox?). Specifically, I'm interested in using the mailbox as a routing value
- if I read the code correctly, Dovecot does a soft commit when it's done with the specific mailbox. For indexing at search time, I see why it makes sense. If I do "autoindex", can I disable that and let Solr autoSoftCommit every N seconds? That should improve indexing throughput and reduce load. I see that one can already do this for hard commits (I'd use autoCommit there, though a hard commit is also triggered when ramBufferSizeMB gets hit)
- when querying, can I sort by an arbitrary field, such as the date? I saw I can sort by score, but I can't find anything in the code that would suggest it's supported
- also when querying, can I specify which fields to return? I see that the plugin asks for the Email ID, so I'm guessing it fetches things like from/to from the Email itself. I'm thinking that if I want to sort by those values I need to set docValues=true on them, to save memory. In that case, I might as well retrieve the original string from docValues, which should be a whole lot faster
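To make the batching question concrete, here is how I picture it, as a small Python sketch. This is only my mental model and not Dovecot's actual code: the names index_mailbox, send_batch and BATCH_LIMIT are made up for illustration, and the 1000 is the hardcoded limit I think I saw.

```python
from typing import Callable, Iterable, List

BATCH_LIMIT = 1000  # assumption: the hardcoded limit I believe I saw in the plugin


def index_mailbox(mails: Iterable[str],
                  send_batch: Callable[[List[str]], None],
                  limit: int = BATCH_LIMIT) -> int:
    """Buffer mails, send one request per full buffer, then flush the remainder.

    Returns the number of requests sent. Hypothetical model, not Dovecot code.
    """
    buffer: List[str] = []
    requests = 0
    for mail in mails:
        buffer.append(mail)
        if len(buffer) >= limit:
            send_batch(buffer)
            requests += 1
            buffer = []
    if buffer:  # mailbox is done and fewer than `limit` mails are left over
        send_batch(buffer)
        requests += 1
    return requests


# Example: 2500 mails should give two full batches and one partial flush.
sent: List[int] = []
count = index_mailbox((f"mail-{i}" for i in range(2500)),
                      lambda batch: sent.append(len(batch)))
print(count, sent)
```

If that model is right, the open part for autoindex is whether anything besides "buffer full" or "mailbox done" (a timer, for instance) triggers the final flush.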
My plan is to do all sorts of tests, but having a better background on how it works will certainly help.
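For those tests, this is roughly the kind of query I would send to Solr directly to compare against what the plugin sends. It is a sketch under assumptions: the base URL and the field names (body, box, uid, from, to, date) are placeholders that may not match the default schema.xml, and solr_select_url is just a helper name I made up. The q, fq, fl, sort and _route_ parameters are standard Solr request parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base: str, mailbox: str, text: str) -> str:
    """Build a /select URL with explicit fields, sort order and routing key.

    Field names here are assumptions and may differ from the real schema.
    """
    params = {
        "q": f"body:{text}",       # full-text clause (hypothetical field name)
        "fq": f"box:{mailbox}",    # restrict to a single mailbox
        "fl": "uid,from,to",       # fields to return
        "sort": "date desc",       # sort by a field instead of score
        "_route_": f"{mailbox}!",  # SolrCloud routing key (compositeId style)
    }
    return f"{base}/select?{urlencode(params)}"


# Hypothetical host and collection name, only for illustration.
url = solr_select_url("http://localhost:8983/solr/dovecot", "INBOX", "invoice")
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["sort"], decoded["_route_"])
```

Whether fts-solr lets me inject parameters like these into its own requests is exactly what I'm asking about above.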
Any pointers, feedback, encouragement, etc. are certainly welcome - thanks in advance!
Best regards, Radu
Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/
Radu Gheorghe