Solr -> Xapian ?

Joan Moreau jom at grosjo.net
Fri Jan 11 19:27:29 EET 2019


There is no point in a separate plugin; the purpose is to replace
squat as the default FTS (Solr being a nightmare).

On 2019-01-11 18:23, Aki Tuomi wrote:

> I would recommend making this a standalone plugin for now instead of trying to keep it in core fts.  
> 
> Aki 
> 
>> On 11 January 2019 at 18:40 Joan Moreau via dovecot < dovecot at dovecot.org> wrote: 
>> 
>> I managed to deal with the namespace issue (updated Makefile.am). 
>> 
>> However, I now hit: 
>> 
>> ../../../src/lib/compat.h:207:19: error: conflicting declaration of 
>> 'ssize_t i_my_pread(int, void*, size_t, __off_t)' with 'C' linkage 
>> # define pread i_my_pread 
>> ^~~~~~~~~~ 
>> ../../../src/lib/compat.h:210:9: note: previous declaration with 'C++' 
>> linkage 
>> ssize_t i_my_pread(int fd, void *buf, size_t count, off_t offset); 
>> ^~~~~~~~~~ 
>> ../../../src/lib/compat.h:208:20: error: conflicting declaration of 
>> 'ssize_t i_my_pwrite(int, const void*, size_t, __off_t)' with 'C' 
>> linkage 
>> # define pwrite i_my_pwrite 
>> 
>> Any help welcome 
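
A possible cause of the two errors above, offered as an assumption rather than a confirmed diagnosis: the C header is first parsed outside any extern "C" block, so i_my_pread gets C++ linkage, and a system header that declares pread inside extern "C" is parsed later with the pread macro active. The usual remedy is to include the system header first and to wrap the C declarations in extern "C". A minimal sketch, in which demo_pread is a hypothetical stand-in for the function declared in compat.h:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>  // system header first, before any macro could rename pread

// Wrap C declarations so the C++ compiler gives them C linkage everywhere.
// demo_pread is a hypothetical stand-in for a function from a C header.
extern "C" {
ssize_t demo_pread(int fd, void *buf, size_t count, off_t offset);
}

// The definition carries the same C linkage, so both declarations agree.
extern "C" ssize_t demo_pread(int fd, void *buf, size_t count, off_t offset)
{
    assert(count == 0 || buf != nullptr);  // a buffer is needed to read data
    return pread(fd, buf, count, offset);
}
```

In a real plugin the same pattern would mean putting the #include lines for the C headers inside one extern "C" { ... } block, after the system headers.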
>> 
>> Hi, 
>> 
>> I figured out the "namespace" issue 
>> 
>> Remaining questions are : 
>> 
>> 1 - What does "subargs" represent in mail_search_args? 
>> 
>> 2 - For rescan: who is responsible for passing the emails again? Does 
>> the Dovecot core send all the emails to index again, or should the FTS 
>> backend somehow access the mailbox and read all emails? Wouldn't simply 
>> saying "delete all indexes and get_last_uid is now 0" be the easy way? Or 
>> must the FTS backend process all emails (and block the current thread, as 
>> a mailbox may be quite large)? 
>> 
>> 3 - For get_last_uid: this is very unclear. "If there is a 
>> gap, then indexer first indexes all the missing" -> this means that at a 
>> certain point the indexer may be rebuilding a previous email, so the *last* 
>> UID is something different from the max. And how does the indexer know 
>> whether there is a gap without calling the FTS backend (which it does not, 
>> as there is no function for that)? 
>> 
>> 4 - How do I update configure.ac and additional files to add a 
>> "--with-xapian" option, which will test for libxapian's presence and add 
>> it to the build? 
>> 
>> Thank you 
>> 
>> On 2019-01-08 04:24, Timo Sirainen wrote: 
>> 
>> On 7 Jan 2019, at 16.05, Joan Moreau via dovecot < dovecot at dovecot.org> 
>> wrote: 
>> Hi 
>> 
>> Can anyone answer specifically? 
>> 
>> Q1 : get_last_uid -> Is this the last UID indexed (which may not be the 
>> greatest value), or the greatest value (which may not be the latest)? (The 
>> code of existing plugins is unclear about this; Solr looks for the 
>> greatest, for instance.) 
>> All the mails are always supposed to be indexed from the beginning to 
>> the last indexed mail. If there's a gap, indexer first indexes all the 
>> missing mails. So the latest UID is supposed to be the greatest UID. 
>> (Supporting out-of-order indexing would be rather difficult to keep 
>> track of.) 
>> 
>> Q2 : When indexing an email, the data is not passed by "build_key". Why 
>> so? What is the link with "build_more"? 
>> The idea is that it calls something like: 
>> 
>> - build_key(type=hdr, hdr_name=From) 
>> - build_more(" tss at iki.fi") 
>> - build_key(type=hdr, hdr_name=Subject) 
>> - build_more("Re: Solr -> Xapian ?") 
>> - build_key(type=body_part) 
>> - build_more("message body piece") 
>> - build_more("message body piece2") 
>> ... 
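
The call sequence above can be pictured with a toy update context: build_key() selects the field, and each build_more() appends a chunk to it. This is only a sketch; DemoUpdateContext and its members are hypothetical and are not Dovecot's real structures or signatures.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a backend update context; not Dovecot's real struct.
struct DemoUpdateContext {
    std::string current_field;                  // set by build_key()
    std::map<std::string, std::string> fields;  // text accumulated per field

    // Announces which header (or "body") the following data belongs to.
    void build_key(const std::string &field) { current_field = field; }

    // Appends one chunk of data to the field selected by the last build_key().
    void build_more(const std::string &chunk)
    {
        assert(!current_field.empty());  // build_key() must come first
        fields[current_field] += chunk;
    }
};

// Replays the call sequence listed in the message.
inline DemoUpdateContext demo_index_one_mail()
{
    DemoUpdateContext ctx;
    ctx.build_key("From");
    ctx.build_more("tss at iki.fi");
    ctx.build_key("Subject");
    ctx.build_more("Re: Solr -> Xapian ?");
    ctx.build_key("body");
    ctx.build_more("message body piece");
    ctx.build_more("message body piece2");
    return ctx;
}
```

The point of the split is that a large body can arrive in several build_more() calls without the backend ever holding the whole message in one buffer.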
>> 
>> Q3 : Searching/Lookup : The header in which to look (must be at 
>> least among "cc, to, from, subject, body") does not appear in the 
>> 'struct' data. Where can I find it? 
>> lookup() gets struct mail_search_arg *args, which contains the entire 
>> IMAP SEARCH query. This could be used for more or less complex query 
>> builders. 
>> 
>> In case of a single header search, you should have 
>> args->args->hdr_field_name contain the header name and 
>> args->args->value.str contain the content you're searching for. 
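
A query builder over such an argument list might look like the sketch below. The DemoSearchArg struct is a simplified, hypothetical stand-in: it borrows the hdr_field_name and value string mentioned above and assumes that OR and sub-group nodes keep their children in a subargs list, but it is not the real struct mail_search_arg, and the output syntax is invented for illustration.

```cpp
#include <cassert>
#include <string>

// Simplified, hypothetical stand-in for a search argument node.
enum DemoArgType { DEMO_HEADER, DEMO_BODY, DEMO_OR, DEMO_SUB };

struct DemoSearchArg {
    DemoArgType type;
    std::string hdr_field_name;    // used by DEMO_HEADER
    std::string str;               // text to search for
    const DemoSearchArg *subargs;  // children of DEMO_OR / DEMO_SUB
    const DemoSearchArg *next;     // next sibling in the list
};

// Renders a sibling list as one query string; siblings are joined with `joiner`.
std::string demo_build_query(const DemoSearchArg *arg,
                             const std::string &joiner = " AND ")
{
    std::string out;
    for (; arg != nullptr; arg = arg->next) {
        if (!out.empty())
            out += joiner;
        switch (arg->type) {
        case DEMO_HEADER:
            out += arg->hdr_field_name + ":\"" + arg->str + "\"";
            break;
        case DEMO_BODY:
            out += "body:\"" + arg->str + "\"";
            break;
        case DEMO_OR:
        case DEMO_SUB:
            assert(arg->subargs != nullptr);  // group nodes need children
            out += "(" + demo_build_query(arg->subargs,
                       arg->type == DEMO_OR ? " OR " : " AND ") + ")";
            break;
        }
    }
    return out;
}
```

Top-level siblings are combined with AND here, which matches how IMAP SEARCH treats multiple search keys.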
>> 
>> Q4 : Refresh : this is very unclear. How could the backend not have the 
>> "latest" view of the index? What is the real meaning of this function? 
>> In case of Xapian it might not matter if it automatically refreshes its 
>> indexes between each query. But with some other indexes this could 
>> happen: 
>> 
>> - IMAP session is opened 
>> - IMAP SEARCH is run, which opens and searches the index 
>> - a new mail is delivered to the mailbox and indexed 
>> - IMAP SEARCH is run. Without refresh() it doesn't see the newly 
>> indexed mail and doesn't include it in the search results. 
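
The stale-view scenario above can be reproduced with a toy reader that searches a snapshot taken when it was opened. DemoIndex and DemoReader are hypothetical names for illustration only; a real backend would re-read its on-disk index instead of copying a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory "index": just the list of indexed UIDs.
struct DemoIndex {
    std::vector<unsigned> uids;
};

// A reader that searches a snapshot taken when it was opened or last refreshed.
struct DemoReader {
    const DemoIndex *index;
    std::vector<unsigned> snapshot;

    explicit DemoReader(const DemoIndex &idx) : index(&idx), snapshot(idx.uids) {}

    // Re-reads the index so that later searches see newly indexed mails.
    void refresh()
    {
        assert(index != nullptr);
        snapshot = index->uids;
    }

    // Number of mails a search through this reader could currently match.
    std::size_t visible_count() const { return snapshot.size(); }
};
```

Xapian's Database::reopen() appears to play a comparable role for a read-only database handle, so a Xapian backend's refresh() could plausibly be a thin wrapper around it.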
>> 
>> Q5 : Rescan : is it just about removing all indexes for a specific 
>> mailbox? 
>> It's run when "doveadm fts rescan" is run manually. Usually that's only 
>> run manually to fix up some brokenness. So it's intended to verify that 
>> the current mailbox contents match the FTS indexes: 
>> - If there are any mails in FTS index that no longer exist in the 
>> actual mailbox, delete those mails from FTS 
>> - If FTS is missing any mails in the middle of the mailbox, make sure 
>> that the next mailbox indexing will index those missing mails. I think 
>> currently this basically means reindexing all the mails since the first 
>> missing mail, even the mails that are already in the index. 
>> 
>> fts-lucene implements this, but other FTS backends are lazy and simply 
>> rebuild all mails. Actually fts-solr is bad because it doesn't even 
>> delete the extra mails. 
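
The two rescan duties described above reduce to two set differences over sorted UID lists. The sketch below assumes the backend can enumerate the UIDs it has indexed; demo_rescan and DemoRescanPlan are hypothetical names, not part of the FTS API.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

struct DemoRescanPlan {
    std::vector<unsigned> to_delete;  // in the FTS index but gone from the mailbox
    unsigned reindex_from;            // first mailbox UID missing from the index, 0 if none
};

// Both inputs must be sorted ascending, as UIDs within a mailbox are.
DemoRescanPlan demo_rescan(const std::vector<unsigned> &mailbox_uids,
                           const std::vector<unsigned> &indexed_uids)
{
    assert(std::is_sorted(mailbox_uids.begin(), mailbox_uids.end()));
    assert(std::is_sorted(indexed_uids.begin(), indexed_uids.end()));

    DemoRescanPlan plan;
    plan.reindex_from = 0;

    // Mails the index still holds although the mailbox no longer has them.
    std::set_difference(indexed_uids.begin(), indexed_uids.end(),
                        mailbox_uids.begin(), mailbox_uids.end(),
                        std::back_inserter(plan.to_delete));

    // Mails the mailbox has but the index lacks; indexing restarts at the first.
    std::vector<unsigned> missing;
    std::set_difference(mailbox_uids.begin(), mailbox_uids.end(),
                        indexed_uids.begin(), indexed_uids.end(),
                        std::back_inserter(missing));
    if (!missing.empty())
        plan.reindex_from = missing.front();
    return plan;
}
```

With a non-zero reindex_from, the backend would then report reindex_from - 1 as its last UID, so the next indexing run starts at the first missing mail.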
>> 
>> Q6 : lookup_multi : isn't the function the same for all plugins (see 
>> below)? And finally, for fts_backend_xxxx_lookup_multi, why is that 
>> backend dependent? 
>> This function is called only when searching in virtual folders. So for 
>> example the virtual "All mails" folder, which would contain all mails in 
>> all folders. In that case boxes[] would contain a list of all the user's 
>> folders, except Trash and Spam. If lookup_multi() isn't implemented 
>> (left to NULL), the search is run separately via lookup() for each 
>> folder. With lookup_multi() there can be just one lookup, and the 
>> backend can filter only the wanted folders and return them directly. So 
>> it's an optimization for FTS indexes that support user-global searches 
>> rather than only per-folder searches. 
>> 
>> static int fts_backend_xapian_lookup_multi(struct fts_backend *_backend, 
>>     struct mailbox *const boxes[], struct mail_search_arg *args, 
>>     enum fts_lookup_flags flags, struct fts_multi_result *result) 
>> { 
>>     unsigned int i; 
>> 
>>     /* assumes result->box_results[] is already allocated, one per box */ 
>>     for (i = 0; boxes[i] != NULL; i++) { 
>>         /* run the ordinary per-mailbox lookup for each folder */ 
>>         if (fts_backend_xapian_lookup(_backend, boxes[i], args, flags, 
>>                                       &result->box_results[i]) < 0) 
>>             return -1; 
>>     } 
>>     return 0; 
>> } 
>> See fts_backend_lookup_multi() - if you leave lookup_multi=NULL it 
>> basically does this. 
>> 
>> For "rescan" and "optimize", wouldn't it be the Dovecot core that 
>> indicates which mails are to be dismissed (expunged), or asks again for 
>> indexing of a particular (or every) UID? Why would the backend be aware 
>> of the transactions on the mailbox? 
>> rescan() is about fixing up a more or less broken index, or simply to 
>> verify that it's all ok. So core doesn't know what messages exist in the 
>> FTS index and can't request specific reindexing or expunging. I guess an 
>> alternative API could have been to have functions that iterate through 
>> all mails in the index, and use that to implement rescan in core. Now 
>> thinking about it, that sounds like a simpler and better way. 
>> 
>> optimize() is currently done only when explicitly running "doveadm fts 
>> optimize", which requests running a slower index optimization. Depends 
>> on the FTS backend whether this is useful or not. 
>> 
>> There is already "fts_backend_xxx_update_expunge", so I believe the 
>> management of the expunged messages is *NOT* in the backend, right? 
>> Normally when mails are expunged, update_expunge() is called to notify 
>> FTS backend that it should delete the mail also from FTS index. 
>> 
>> .flags = FTS_BACKEND_FLAG_NORMALIZE_INPUT, -> what other flags? 
>> You probably want to use FTS_BACKEND_FLAG_FUZZY_SEARCH only like Solr. 
>> See enum fts_backend_flags in fts-api-private.h
> 
> --- 
> Aki Tuomi