Re: [Dovecot] FTS Plugin design

22 Apr 2009

On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote:
...
Hi,
Almost all of the full text search engines (C/C++) I looked at (Swish-E,
Wumpus, Lemur and Xapian) do not use any kind of library or parser.
Instead, they use other applications like pdftotext, catdoc, catppt (etc.)
and call them with execvp (or equivalent). Using this approach in my
project has some pros and cons:
Pros:

- The existing libraries to extract the content of pdf, doc (etc.) are
  not very stable.
- Easier to handle errors (even if those applications crash, Dovecot
  will still be running).
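
To make the crash-isolation point concrete, here is a minimal Python sketch (not Dovecot code) of running a converter as a child process, with Python's subprocess module standing in for fork plus execvp. The function name extract_text and the timeout value are assumptions made for illustration, and the sketch assumes a converter that reads the document on stdin and writes plain text to stdout.

```python
import subprocess
import sys


def extract_text(argv, data, timeout=30.0):
    """Run an external converter, feeding `data` (bytes) on stdin.

    Returns the converter's stdout decoded as text, or None if the
    converter could not be started, timed out, crashed, or exited
    with a non-zero status. The calling process keeps running in
    every one of those cases.
    """
    try:
        proc = subprocess.run(
            argv,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"converter {argv[0]!r} did not complete: {exc}", file=sys.stderr)
        return None
    if proc.returncode != 0:
        # On POSIX a negative return code means the child was killed
        # by a signal (for example a segfault inside the converter).
        print(f"converter {argv[0]!r} failed with status {proc.returncode}",
              file=sys.stderr)
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

A caller would pass the real converter's argument vector; whatever happens inside the converter, the worst outcome for the caller is a None result.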

Hmm. I hadn't thought of this before. Yeah, if they're not stable it's
probably not a good idea to run in the same process as the rest of
Dovecot. But I guess there could be some kind of a separate text
extracting process that the fts plugin would talk to. If that process
dies, it could get restarted automatically and fts could maybe retry;
if it dies again, log it and just skip over it.
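
The restart-and-retry idea could look roughly like the following Python sketch. Everything here is hypothetical: the class name ExtractorClient, the one-line-in, one-line-out protocol, and the default of two attempts are assumptions for illustration, not an existing Dovecot interface.

```python
import subprocess
import sys


class ExtractorClient:
    """Talks to a long-lived extraction worker over a pipe.

    Hypothetical protocol: one request line in, one reply line out.
    """

    def __init__(self, worker_argv):
        self.worker_argv = worker_argv
        self.proc = None
        self.starts = 0  # how many times a worker has been launched

    def _start(self):
        self.proc = subprocess.Popen(
            self.worker_argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        )
        self.starts += 1

    def close(self):
        """Kill the worker (if any) and release its pipes."""
        if self.proc is None:
            return
        self.proc.kill()
        self.proc.wait()
        for pipe in (self.proc.stdin, self.proc.stdout):
            try:
                pipe.close()
            except OSError:
                pass  # flushing into a dead worker's pipe can fail
        self.proc = None

    def _request_once(self, request):
        # Restart the worker automatically if it is not running.
        if self.proc is None or self.proc.poll() is not None:
            self.close()
            self._start()
        try:
            self.proc.stdin.write(request + "\n")
            self.proc.stdin.flush()
            reply = self.proc.stdout.readline()
        except OSError:
            reply = ""
        if not reply.endswith("\n"):
            # EOF or a partial line: the worker died mid-request.
            self.close()
            return None
        return reply.rstrip("\n")

    def extract(self, request, attempts=2):
        """Try the request, retrying on a fresh worker; skip if all fail."""
        for attempt in range(1, attempts + 1):
            reply = self._request_once(request)
            if reply is not None:
                return reply
            print(f"extraction worker died (attempt {attempt})", file=sys.stderr)
        print(f"skipping {request!r} after {attempts} attempts", file=sys.stderr)
        return None
```

With attempts=2 this matches the behaviour described above: one automatic restart and retry, then a log line and a skip, and the next request simply gets a fresh worker.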
...

Some programs to parse special formats (e.g. catppt and pdftotext)
do not accept input from stdin (we need to create temporary files).

Maybe those programs could be changed, and we would just require the
newer versions?
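
Until such versions exist, the temporary-file workaround is fairly small. The Python sketch below is an illustration only: the function name extract_via_tempfile and the "{input}" placeholder convention are assumptions, and the pdftotext invocation in the usage note reflects my understanding of its command line rather than anything stated in this thread.

```python
import os
import subprocess
import sys
import tempfile


def extract_via_tempfile(argv_template, data, suffix="", timeout=30.0):
    """Run a converter that only accepts a file path as input.

    `argv_template` is an argument list in which the element "{input}"
    is replaced by the path of a temporary file holding `data`.
    Returns the converter's stdout as text, or None on any failure.
    The temporary file is removed in all cases.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        argv = [path if arg == "{input}" else arg for arg in argv_template]
        try:
            proc = subprocess.run(
                argv,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            return None
        if proc.returncode != 0:
            return None
        return proc.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(path)
```

For pdftotext the template would presumably be ["pdftotext", "{input}", "-"], with "-" asking for the text on stdout; a tool that simply takes the file as its last argument would use ["catppt", "{input}"].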
...
What approach would be better? Using applications like pdftotext and
catdoc or, on the other hand, using their libraries and doing it almost
from scratch?

I think the API that the fts plugin uses to do the conversion should be
generic enough that both approaches would work. Then it would be easier
to implement one or the other, or eventually both.
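
As a rough illustration of what "generic enough" might mean, here is a Python sketch in which the caller sees only one interface and does not care whether a converter is an external program or an in-process library. All names (TextExtractor, CommandExtractor, LibraryExtractor, extract_attachment) and the MIME-type registry are hypothetical, not a proposal for the actual plugin API.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional


class TextExtractor(ABC):
    """The only interface the indexing code would need to know about."""

    @abstractmethod
    def extract(self, data: bytes) -> Optional[str]:
        """Return extracted text, or None if extraction failed."""


class CommandExtractor(TextExtractor):
    """Backend 1: run an external program (stdin in, stdout out)."""

    def __init__(self, argv: List[str], timeout: float = 30.0):
        self.argv = argv
        self.timeout = timeout

    def extract(self, data: bytes) -> Optional[str]:
        try:
            proc = subprocess.run(
                self.argv,
                input=data,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                timeout=self.timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            return None
        if proc.returncode != 0:
            return None
        return proc.stdout.decode("utf-8", errors="replace")


class LibraryExtractor(TextExtractor):
    """Backend 2: call a parsing library in-process.

    Ordinary exceptions are caught here, but a hard crash inside the
    library would still take the whole process down.
    """

    def __init__(self, func: Callable[[bytes], str]):
        self.func = func

    def extract(self, data: bytes) -> Optional[str]:
        try:
            return self.func(data)
        except Exception:
            return None


def extract_attachment(registry: Dict[str, TextExtractor],
                       mime_type: str, data: bytes) -> Optional[str]:
    """Pick a backend by MIME type; unknown types are skipped."""
    extractor = registry.get(mime_type)
    if extractor is None:
        return None
    return extractor.extract(data)
```

Switching a format from one approach to the other is then a one-line change in the registry, which is the property that makes it cheap to start with one backend and add the second later.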

Timo Sirainen