[Dovecot] FTS Plugin design

Timo Sirainen tss at iki.fi
Wed Apr 22 19:38:59 EEST 2009


On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote:
> Hi,
> 
> Almost all of the full text search engines (C/C++) I looked at (Swish-E,
> Wumpus, Lemur and Xapian) do not use any kind of library or parser.
> Instead, they use other applications like pdftotext, catdoc, catppt
> (etc.) and call them with execvp (or equivalent). Using this approach
> in my project has some pros and cons:
> 
> Pros:
> - The existing libraries to extract the content of pdf, doc (etc) are
> not very stable.
> - Easier to handle errors (even if those applications crash, Dovecot
> will still be running)

Hmm. I hadn't thought of this before. Yeah, if they're not stable it's
probably not a good idea to run in the same process as the rest of
Dovecot. But I guess there could be some kind of a separate text
extracting process that the fts plugin would talk to. If that process
dies, it could get restarted automatically and fts could maybe retry. If
it dies again, fts could log it and just skip over it.
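
A rough sketch of that retry-then-skip behaviour, in Python rather than
Dovecot's C. It is only an illustration: it spawns a fresh extractor per
attempt instead of talking to a long-lived process, and the function
name, the attempt count and the timeout are assumptions, not anything
that exists in Dovecot.

```python
import logging
import subprocess
import sys

log = logging.getLogger("fts-extract")  # hypothetical logger name


def extract_text(argv, data, max_attempts=2, timeout=30):
    """Feed `data` to an extractor command on stdin and return its output.

    Returns the decoded text, or None if every attempt failed, so the
    caller can skip the attachment and keep indexing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            proc = subprocess.run(
                argv,
                input=data,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Could not start the extractor, or it hung: treat as a failure.
            log.warning("extractor failed (attempt %d): %s", attempt, exc)
            continue
        if proc.returncode == 0:
            return proc.stdout.decode("utf-8", errors="replace")
        # Non-zero exit status; on POSIX a negative value means a signal.
        log.warning("extractor exited with status %d (attempt %d)",
                    proc.returncode, attempt)
    log.error("extractor kept failing, skipping this attachment")
    return None


if __name__ == "__main__":
    # Stand-in "extractor": a child Python process that upper-cases stdin.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(extract_text(demo, b"hello"))
```

The point is that a crash in the extractor only ever costs one
attachment; the process doing the indexing never sees it.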

> - Some programs to parse special formats (e.g. catppt and pdftotext)
> do not accept input from stdin (we need to create temporary files).

Maybe those programs could be changed to read from stdin, and we could
just require the newer versions?

> What approach would be better? Using applications like pdftotext and
> catdoc or, on the other hand, use their libraries and do it almost
> from scratch?

I think the API that the fts plugin uses to do the conversion should be
generic enough that both approaches would work. Then it would be easier
to implement one or the other, or eventually both.
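
To make "generic enough" a bit more concrete, here is a minimal sketch
in Python (not Dovecot's C, and not an existing Dovecot API). Every name
in it (TextExtractor, CommandExtractor, LibraryExtractor, EXTRACTORS,
extract_for_indexing) is hypothetical. The caller only sees a
bytes-in, text-out interface, so an external program and an in-process
library are interchangeable behind it.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional


class TextExtractor(ABC):
    """Hypothetical interface: attachment bytes in, plain text out."""

    @abstractmethod
    def extract(self, data: bytes) -> str:
        ...


class CommandExtractor(TextExtractor):
    """Runs an external program, feeding the attachment on stdin."""

    def __init__(self, argv: List[str]) -> None:
        self.argv = argv

    def extract(self, data: bytes) -> str:
        # check=True raises CalledProcessError on a non-zero exit status.
        proc = subprocess.run(self.argv, input=data,
                              stdout=subprocess.PIPE, check=True)
        return proc.stdout.decode("utf-8", errors="replace")


class LibraryExtractor(TextExtractor):
    """Wraps an in-process conversion function (e.g. a library binding)."""

    def __init__(self, func: Callable[[bytes], str]) -> None:
        self.func = func

    def extract(self, data: bytes) -> str:
        return self.func(data)


# MIME type -> extractor; which backend is behind a type is a config detail.
EXTRACTORS: Dict[str, TextExtractor] = {}


def extract_for_indexing(mime_type: str, data: bytes) -> Optional[str]:
    """Return text for the indexer, or None if the type is not handled."""
    extractor = EXTRACTORS.get(mime_type)
    if extractor is None:
        return None
    return extractor.extract(data)


if __name__ == "__main__":
    # Placeholder backends standing in for a real converter and library.
    EXTRACTORS["text/x-demo-cmd"] = CommandExtractor(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read()[::-1])"])
    EXTRACTORS["text/x-demo-lib"] = LibraryExtractor(
        lambda raw: raw.decode("utf-8").upper())
    print(extract_for_indexing("text/x-demo-cmd", b"abc"))
    print(extract_for_indexing("text/x-demo-lib", b"abc"))
```

Swapping a backend then only means changing what is registered for a
MIME type; the indexing code that calls extract_for_indexing() stays the
same.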