Solr -> Xapian ?

Joan Moreau jom at grosjo.net
Fri Jan 11 21:23:34 EET 2019


The patch below resolves the compilation error:

$ diff -p compat.h compat.h.joan
*** compat.h 2019-01-11 20:21:00.726625427 +0100
--- compat.h.joan 2019-01-11 20:14:41.729109919 +0100
*************** struct iovec;
*** 202,207 ****
--- 202,211 ----
ssize_t i_my_writev(int fd, const struct iovec *iov, int iov_len);
#endif

+ #ifdef __cplusplus
+ extern "C" {
+ #endif
+ 
#if !defined(HAVE_PREAD) || defined(PREAD_WRAPPERS) || defined(PREAD_BROKEN)
# ifndef IN_COMPAT_C
# define pread i_my_pread
*************** ssize_t i_my_pread(int fd, void *buf, si
*** 211,216 ****
--- 215,225 ----
ssize_t i_my_pwrite(int fd, const void *buf, size_t count, off_t offset);
#endif

+ #ifdef __cplusplus
+ }
+ #endif
+ 
+ 
#ifndef HAVE_SETEUID
# define seteuid i_my_seteuid
int i_my_seteuid(uid_t euid); 
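
For context, the guard added above is the usual idiom for letting a C header be included from C++ code: inside extern "C" the declarations keep C linkage, so they should no longer clash with declarations of the same name that already have C linkage. A minimal sketch of the idiom follows; demo_pread is a hypothetical stand-in that reads from a memory buffer and is not Dovecot's i_my_pread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Header part: the guard is a no-op in C and gives C linkage in C++. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical pread-like helper reading from a memory buffer. */
ssize_t demo_pread(const char *src, size_t src_len,
                   void *buf, size_t count, off_t offset);

#ifdef __cplusplus
}
#endif

/* Implementation part. */
ssize_t demo_pread(const char *src, size_t src_len,
                   void *buf, size_t count, off_t offset)
{
    size_t avail;

    assert(src != NULL && buf != NULL);
    if (offset < 0)
        return -1;              /* invalid offset */
    if ((size_t)offset >= src_len)
        return 0;               /* at or past the end: nothing to read */
    avail = src_len - (size_t)offset;
    if (count > avail)
        count = avail;          /* short read near the end */
    memcpy(buf, src + offset, count);
    return (ssize_t)count;
}
```

The same file compiles unchanged as C or as C++, which is the point of wrapping only the declarations.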

To integrate the plugin into the source tree, the following diffs do the
job:

$ diff -p configure.ac configure.ac.joan
*** configure.ac 2019-01-11 20:19:47.905942264 +0100
--- configure.ac.joan 2019-01-11 17:54:58.433381828 +0100
*************** AS_HELP_STRING([--with-solr], [Build wit
*** 172,177 ****
--- 172,184 ----
TEST_WITH(solr, $withval),
want_solr=no)

+ AC_ARG_WITH(xapian,
+ AS_HELP_STRING([--with-xapian], [Build with Xapian full text search support]),
+ TEST_WITH(xapian, $withval),
+ want_xapian=auto)
+ AM_CONDITIONAL(BUILD_XAPIAN, test "$want_xapian" = "yes")
+ 
+ 
AC_ARG_WITH(sodium,
AS_HELP_STRING([--with-sodium], [Build with libsodium support (enables argon2, default: auto)]),
TEST_WITH(sodium, $withval),
*************** DOVECOT_WANT_SOLR
*** 746,751 ****
--- 753,759 ----
DOVECOT_WANT_CLUCENE
DOVECOT_WANT_STEMMER
DOVECOT_WANT_TEXTCAT
+ DOVECOT_WANT_XAPIAN

DOVECOT_WANT_ICU

*************** fi
*** 757,762 ****
--- 765,774 ----
if test $have_solr = no; then
not_fts="$not_fts solr"
fi
+ if test $have_xapian = no; then
+ not_fts="$not_fts xapian"
+ fi
+ 

dnl **
dnl ** Settings
*************** src/plugins/fs-compress/Makefile
*** 899,904 ****
--- 911,917 ----
src/plugins/fts/Makefile
src/plugins/fts-lucene/Makefile
src/plugins/fts-solr/Makefile
+ src/plugins/fts-xapian/Makefile
src/plugins/fts-squat/Makefile
src/plugins/last-login/Makefile
src/plugins/lazy-expunge/Makefile 

$ diff -p Makefile.am Makefile.am.joan
*** Makefile.am 2019-01-11 20:22:23.910740574 +0100
--- Makefile.am.joan 2019-01-11 17:51:19.051153270 +0100
*************** DISTCLEANFILES = \
*** 99,105 ****
distcheck-hook:
if which scan-build > /dev/null; then \
cd $(distdir)/_build; \
! scan-build -o scan-reports ../configure --with-ldap=auto --with-pgsql=auto --with-mysql=auto --with-sqlite=auto --with-solr=auto --with-gssapi=auto --with-libwrap=auto; \
rm -rf scan-reports; \
scan-build -o scan-reports make 2>&1 || exit 1; \
if ! rmdir scan-reports 2>/dev/null; then \
--- 99,105 ----
distcheck-hook:
if which scan-build > /dev/null; then \
cd $(distdir)/_build; \
! scan-build -o scan-reports ../configure --with-ldap=auto --with-pgsql=auto --with-mysql=auto --with-sqlite=auto --with-solr=auto --with-xapian=auto --with-gssapi=auto --with-libwrap=auto; \
rm -rf scan-reports; \
scan-build -o scan-reports make 2>&1 || exit 1; \
if ! rmdir scan-reports 2>/dev/null; then \ 

What about the other questions?

1 - What does "subargs" represent in mail_search_args?

2 - For rescan: who is responsible for passing the emails again? Does the
Dovecot core send all the emails to be indexed again, or should the FTS
backend somehow access the mailbox and read all the emails itself? Wouldn't
the easy way be to just say "delete the whole index, and get_last_uid is
now 0"? Or must the FTS backend process all emails (and block the current
thread, as a mailbox may be quite large)?

3 - For get_last_uid: this is still very unclear. "If there is a gap, then
indexer first indexes all the missing" -> this means that at a certain
point the indexer may be rebuilding an earlier email, so the *last* UID is
something different from the max. And how does the indexer know whether
there is a gap without calling the FTS backend (which it does not, as
there is no function for that)?

Thank you 

On 2019-01-11 18:27, Joan Moreau wrote:

> There is no point in a separate plugin; the purpose is to replace squat as the default FTS backend (Solr being a nightmare). 
> 
> On 2019-01-11 18:23, Aki Tuomi wrote: 
> I would recommend making this a standalone plugin for now instead of trying to keep it in core fts.  
> 
> Aki 
> On 11 January 2019 at 18:40 Joan Moreau via dovecot < dovecot at dovecot.org> wrote: 
> 
> I managed to deal with the namespace issue (updated Makefile.am) 
> 
> However, I now get: 
> 
> ../../../src/lib/compat.h:207:19: error: conflicting declaration of 
> 'ssize_t i_my_pread(int, void*, size_t, __off_t)' with 'C' linkage 
> # define pread i_my_pread 
> ^~~~~~~~~~ 
> ../../../src/lib/compat.h:210:9: note: previous declaration with 'C++' 
> linkage 
> ssize_t i_my_pread(int fd, void *buf, size_t count, off_t offset); 
> ^~~~~~~~~~ 
> ../../../src/lib/compat.h:208:20: error: conflicting declaration of 
> 'ssize_t i_my_pwrite(int, const void*, size_t, __off_t)' with 'C' 
> linkage 
> # define pwrite i_my_pwrite 
> 
> Any help welcome 
> 
> Hi, 
> 
> I figured out the "namespace" issue 
> 
> Remaining questions are : 
> 
> 1 - What does "subargs" represent in mail_search_args? 
> 
> 2 - For rescan: who is responsible for passing the emails again? Does 
> the Dovecot core send all the emails to be indexed again, or should the 
> FTS backend somehow access the mailbox and read all the emails itself? 
> Wouldn't the easy way be to just say "delete the whole index, and 
> get_last_uid is now 0"? Or must the FTS backend process all emails (and 
> block the current thread, as a mailbox may be quite large)? 
> 
> 3 - For get_last_uid: this is still very unclear. "If there is a gap, 
> then indexer first indexes all the missing" -> this means that at a 
> certain point the indexer may be rebuilding an earlier email, so the 
> *last* UID is something different from the max. And how does the indexer 
> know whether there is a gap without calling the FTS backend (which it 
> does not, as there is no function for that)? 
> 
> 4 - How should configure.ac and the additional files be updated to add 
> "--with-xapian", which will test for the presence of libxapian and add 
> it to the build? 
> 
> Thank you 
> 
> On 2019-01-08 04:24, Timo Sirainen wrote: 
> 
> On 7 Jan 2019, at 16.05, Joan Moreau via dovecot < dovecot at dovecot.org> 
> wrote: 
> Hi 
> 
> Can anyone answer these specifically? 
> 
> Q1 : get_last_uid -> Is this the last UID indexed (which may not be the 
> greatest value), or the greatest value (which may not be the latest)? 
> The code of the existing plugins is unclear about this; Solr looks for 
> the greatest, for instance. 
> All the mails are always supposed to be indexed from the beginning to 
> the last indexed mail. If there's a gap, indexer first indexes all the 
> missing mails. So the latest UID is supposed to be the greatest UID. 
> (Supporting out-of-order indexing would be rather difficult to keep 
> track of.) 
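
One way to read the answer above: since indexing always proceeds in UID order, the single number returned by get_last_uid should be enough for the caller to work out what is still missing, by comparing it with the mailbox's own UID list. A minimal sketch under that assumption; demo_first_unindexed and its arguments are hypothetical and are not part of the Dovecot API.

```c
#include <stddef.h>
#include <stdint.h>

/* uids[] holds the mailbox UIDs in ascending order; last_uid is the value
 * the backend reported as its last indexed UID (0 = nothing indexed).
 * Returns the position of the first UID greater than last_uid, i.e. the
 * first mail that still needs indexing, or n if the index is up to date. */
size_t demo_first_unindexed(const uint32_t *uids, size_t n, uint32_t last_uid)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first element > last_uid. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (uids[mid] <= last_uid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

With this reading, "last" and "greatest" coincide as long as the backend never indexes out of order.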
> 
> Q2 : When indexing an email, the data is not passed by "build_key". Why 
> so? What is the link with "build_more"? 
> The idea is that it calls something like: 
> 
> - build_key(type=hdr, hdr_name=From) 
> - build_more(" tss at iki.fi") 
> - build_key(type=hdr, hdr_name=Subject) 
> - build_more("Re: Solr -> Xapian ?") 
> - build_key(type=body_part) 
> - build_more("message body piece") 
> - build_more("message body piece2") 
> ... 
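
The call sequence above suggests a small state machine in the backend: build_key selects the field, and every following build_more adds data to that field until the next build_key. A minimal sketch of that idea; all demo_* names, the fixed-size buffers and the "field=data" output format are assumptions for illustration and not Dovecot's real structures.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum demo_key_type { DEMO_KEY_HDR, DEMO_KEY_BODY_PART };

struct demo_build_ctx {
    char field[32];   /* field selected by the last demo_build_key() */
    char out[512];    /* collected "field=data" lines */
    size_t used;      /* bytes used in out[], excluding the NUL */
};

/* Select the field that the following demo_build_more() calls belong to. */
void demo_build_key(struct demo_build_ctx *ctx, enum demo_key_type type,
                    const char *hdr_name)
{
    assert(type != DEMO_KEY_HDR || hdr_name != NULL);
    snprintf(ctx->field, sizeof(ctx->field), "%s",
             type == DEMO_KEY_HDR ? hdr_name : "body");
}

/* Append one piece of data to the current field. Returns 0 on success,
 * -1 if the piece does not fit (the buffer is then left unchanged). */
int demo_build_more(struct demo_build_ctx *ctx, const char *data, size_t len)
{
    size_t room = sizeof(ctx->out) - ctx->used;
    int n = snprintf(ctx->out + ctx->used, room, "%s=%.*s\n",
                     ctx->field, (int)len, data);

    if (n < 0 || (size_t)n >= room) {
        ctx->out[ctx->used] = '\0';   /* roll back a truncated write */
        return -1;
    }
    ctx->used += (size_t)n;
    return 0;
}
```

A real backend would hand the data to its indexer instead of a text buffer; the point is only that the key carries the "where" and build_more carries the "what".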
> 
> Q3 : Searching/Lookup : The header in which to look (which must be at 
> least one of "cc, to, from, subject, body") does not appear in the 
> 'struct' data. Where can I find it? 
> lookup() gets struct mail_search_arg *args, which contains the entire 
> IMAP SEARCH query. This could be used for more or less complex query 
> builders. 
> 
> In case of a single header search, you should have 
> args->args->hdr_field_name contain the header name and 
> args->args->value.str contain the content you're searching for. 
> 
> Q4 : Refresh : this is very unclear. How could the view of the index 
> not be the "latest" one? What is the real meaning of this function? 
> In case of Xapian it might not matter if it automatically refreshes its 
> indexes between each query. But with some other indexes this could 
> happen: 
> 
> - IMAP session is opened 
> - IMAP SEARCH is run, which opens and searches the index 
> - a new mail is delivered to the mailbox and indexed 
> - IMAP SEARCH is run. Without refresh() it doesn't see the newly 
> indexed mail and doesn't include it in the search results. 
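
The scenario above can be pictured as a reader that holds a snapshot of the index and only sees later additions after an explicit refresh. A minimal sketch with hypothetical demo_* types; it does not reflect how Xapian or any Dovecot backend stores its data.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DOCS 16

struct demo_index {
    uint32_t uids[DEMO_MAX_DOCS];
    size_t count;
};

/* A reader only sees the first `visible` entries of the index. */
struct demo_reader {
    const struct demo_index *index;
    size_t visible;
};

/* Index one more mail. Returns -1 when the demo index is full. */
int demo_index_add(struct demo_index *idx, uint32_t uid)
{
    if (idx->count >= DEMO_MAX_DOCS)
        return -1;
    idx->uids[idx->count++] = uid;
    return 0;
}

/* Opening a reader takes a snapshot of the current index size. */
void demo_reader_open(struct demo_reader *r, const struct demo_index *idx)
{
    r->index = idx;
    r->visible = idx->count;
}

/* refresh(): move the snapshot forward to the current state. */
void demo_reader_refresh(struct demo_reader *r)
{
    r->visible = r->index->count;
}

/* Search within the snapshot only. Returns 1 if found, 0 otherwise. */
int demo_reader_has_uid(const struct demo_reader *r, uint32_t uid)
{
    size_t i;

    for (i = 0; i < r->visible; i++) {
        if (r->index->uids[i] == uid)
            return 1;
    }
    return 0;
}
```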
> 
> Q5 : Rescan : is it just about removing all indexes for a specific 
> mailbox? 
> It's run when "doveadm fts rescan" is run manually. Usually that's only 
> run manually to fix up some brokenness. So it's intended to verify that 
> the current mailbox contents match the FTS indexes: 
> - If there are any mails in FTS index that no longer exist in the 
> actual mailbox, delete those mails from FTS 
> - If FTS is missing any mails in the middle of the mailbox, make sure 
> that the next mailbox indexing will index those missing mails. I think 
> currently this basically means reindexing all the mails since the first 
> missing mail, even the mails that are already in the index. 
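
The two rescan duties described above can be sketched as a single merge walk over two sorted UID lists: one from the mailbox and one from the FTS index. This is only an illustration under the assumption that both lists are available in ascending order; demo_rescan and its result struct are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_STALE 16

struct demo_rescan_result {
    uint32_t stale[DEMO_MAX_STALE]; /* indexed UIDs no longer in the mailbox */
    size_t n_stale;
    uint32_t reindex_from;          /* first mailbox UID missing from the
                                       index, 0 if nothing is missing */
};

static void demo_add_stale(struct demo_rescan_result *res, uint32_t uid)
{
    if (res->n_stale < DEMO_MAX_STALE)
        res->stale[res->n_stale++] = uid;
}

/* box[] and idx[] must both be sorted in ascending order. */
void demo_rescan(const uint32_t *box, size_t nbox,
                 const uint32_t *idx, size_t nidx,
                 struct demo_rescan_result *res)
{
    size_t i = 0, j = 0;

    res->n_stale = 0;
    res->reindex_from = 0;
    while (i < nbox && j < nidx) {
        if (box[i] == idx[j]) {
            i++; j++;                       /* present on both sides */
        } else if (box[i] < idx[j]) {
            if (res->reindex_from == 0)
                res->reindex_from = box[i]; /* first gap in the index */
            i++;
        } else {
            demo_add_stale(res, idx[j]);    /* mail was expunged */
            j++;
        }
    }
    for (; j < nidx; j++)
        demo_add_stale(res, idx[j]);
    if (i < nbox && res->reindex_from == 0)
        res->reindex_from = box[i];         /* mails not indexed yet */
}
```

The caller would then delete the stale UIDs from the index and restart indexing at reindex_from, which matches the "reindex everything since the first missing mail" behaviour mentioned above.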
> 
> fts-lucene implements this, but other FTS backends are lazy and simply 
> rebuild all mails. Actually fts-solr is bad because it doesn't even 
> delete the extra mails. 
> 
> Q6 : lookup_multi : isn't the function the same for all plugins (see 
> below)? And finally, for fts_backend_xxxx_lookup_multi, why is that 
> backend dependent? 
> This function is called only when searching in virtual folders. So for 
> example the virtual "All mails" folder, which would contain all mails in 
> all folders. In that case the boxes[] would contain a list of user's all 
> folders, except Trash and Spam. If lookup_multi() isn't implemented 
> (left to NULL), the search is run separately via lookup() for each 
> folder. With lookup_multi() there can be just one lookup, and the 
> backend can filter only the wanted folders and return them directly. So 
> it's an optimization for FTS indexes that support user-global searches 
> rather than only per-folder searches. 
> 
> static int fts_backend_xapian_lookup_multi(struct fts_backend *_backend, 
> struct mailbox *const boxes[], struct mail_search_arg *args, enum 
> fts_lookup_flags flags, struct fts_multi_result *result) 
> { 
> /* Run the per-mailbox lookup for each entry of the NULL-terminated 
>    boxes[] list. Assumes result->box_results already has one entry 
>    allocated per mailbox. */ 
> unsigned int i; 
> 
> for (i = 0; boxes[i] != NULL; i++) 
> { 
> if (fts_backend_xapian_lookup(_backend, boxes[i], args, flags, &result->box_results[i]) < 0) 
> return -1; 
> } 
> return 0; 
> } 
> See fts_backend_lookup_multi() - if you leave lookup_multi=NULL it 
> basically does this. 
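
The fallback described above is a common optional-callback pattern: if the backend leaves the multi-mailbox entry point NULL, the caller loops over the single-mailbox one. A minimal sketch with hypothetical demo_* types; mailboxes are plain names here and results are reduced to a hit count, which is not how the real fts API represents them.

```c
#include <stddef.h>
#include <string.h>

struct demo_backend;

struct demo_backend_vfuncs {
    int (*lookup)(struct demo_backend *b, const char *box, int *hits);
    /* Optional: may be NULL if the backend has no user-global search. */
    int (*lookup_multi)(struct demo_backend *b, const char *const boxes[],
                        int *hits);
};

struct demo_backend {
    struct demo_backend_vfuncs v;
    int lookup_calls;   /* counts single-mailbox lookups, for the demo */
};

/* Search all mailboxes in the NULL-terminated boxes[] list; hits[] must
 * have one slot per mailbox. Returns 0 on success, -1 on failure. */
int demo_lookup_multi(struct demo_backend *b, const char *const boxes[],
                      int *hits)
{
    size_t i;

    if (b->v.lookup_multi != NULL)
        return b->v.lookup_multi(b, boxes, hits);
    /* Fallback: one lookup per mailbox. */
    for (i = 0; boxes[i] != NULL; i++) {
        if (b->v.lookup(b, boxes[i], &hits[i]) < 0)
            return -1;
    }
    return 0;
}

/* Toy single-mailbox lookup: "finds" as many hits as the name is long. */
int demo_lookup_one(struct demo_backend *b, const char *box, int *hits)
{
    b->lookup_calls++;
    *hits = (int)strlen(box);
    return 0;
}
```

Under this pattern a backend only needs to provide lookup_multi when it can answer for several mailboxes in one query.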
> 
> For "rescan" and "optimize", wouldn't it be the Dovecot core that 
> indicates which mails are to be dismissed (expunged), or asks again for 
> indexing of a particular UID (or all of them)? Why would the backend be 
> aware of the transactions on the mailbox? 
> rescan() is about fixing up a more or less broken index, or simply to 
> verify that it's all ok. So core doesn't know what messages exist in the 
> FTS index and can't request specific reindexing or expunging. I guess an 
> alternative API could have been to have functions that iterate through 
> all mails in the index, and use that to implement rescan in core. Now 
> thinking about it, that sounds like a simpler and better way. 
> 
> optimize() is currently done only when explicitly running "doveadm fts 
> optimize", which requests running a slower index optimization. Depends 
> on the FTS backend whether this is useful or not. 
> 
> There is already "fts_backend_xxx_update_expunge", so I believe the 
> management of the expunged messages is *NOT* in the backend, right? 
> Normally when mails are expunged, update_expunge() is called to notify 
> FTS backend that it should delete the mail also from FTS index. 
> 
> .flags = FTS_BACKEND_FLAG_NORMALIZE_INPUT, -> what other flags? 
> You probably want to use FTS_BACKEND_FLAG_FUZZY_SEARCH only like Solr. 
> See enum fts_backend_flags in fts-api-private.h 
> 
> --- 
> Aki Tuomi