Re: [Dovecot] FTS Plugin design

22 Apr 2009

Hi,
Almost all of the full text search engines (C/C++) I looked at (Swish-E,
Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead,
they call external applications like pdftotext, catdoc, catppt (etc.) with
execvp (or equivalent). Using this approach in my project has some pros and
cons:
Pros:

- The existing libraries to extract the content of pdf, doc (etc.) are not
  very stable.
- Easier to handle errors (even if those applications crash, dovecot will
  still be running).
- Less development time.
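
To make the error-handling point concrete, here is a minimal sketch of the fork + execvp pattern, assuming a POSIX system. The helper name run_extractor is hypothetical (it is not a Dovecot API), and the converter is whatever command is passed in argv. The parent reads the converter's stdout through a pipe and then inspects the wait status, so a converter that dies from a signal only shows up as a -1 return value in the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run an external converter (argv[0] is looked up in PATH by execvp) and
 * copy at most cap-1 bytes of its stdout into out, NUL-terminated.
 * Returns the child's exit status (0-255), or -1 if the call could not be
 * set up or the child was killed by a signal (i.e. it crashed).
 * 127 is used here when execvp itself fails, following the shell convention.
 * run_extractor is a hypothetical helper name for this sketch.
 */
int run_extractor(char *const argv[], char *out, size_t cap)
{
    int fds[2];
    int status;
    pid_t pid;
    size_t used = 0;

    assert(argv != NULL && argv[0] != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: point stdout at the pipe, then become the converter. */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) < 0)
                _exit(127);
            close(fds[1]);
        }
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp failed */
    }

    /* Parent: read until EOF. Output beyond cap-1 bytes is drained and
     * discarded so the child never blocks on a full pipe. */
    close(fds[1]);
    for (;;) {
        char buf[4096];
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            break;
        for (ssize_t i = 0; i < n && used + 1 < cap; i++)
            out[used++] = buf[i];
    }
    out[used] = '\0';
    close(fds[0]);

    /* Reap the child; a signal death (crash) is reported as -1. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller could pass something like {"pdftotext", path, "-", NULL} as argv (with "-" asking pdftotext to write the text to stdout) and simply skip the attachment when the return value is not 0. A real indexer would probably also want a timeout and resource limits on the child, which this sketch leaves out.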

Cons:

- Some programs that parse special formats (e.g. catppt and pdftotext) do
  not accept input from stdin, so we need to create temporary files.
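
The temporary-file step could look roughly like the sketch below, assuming POSIX mkstemp() is available. The function name write_attachment_tmp and the path template are hypothetical, chosen only for illustration. The decoded attachment is written to a uniquely named file whose path can then be handed to the converter as a normal command line argument.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes of attachment data to a fresh temporary file.
 * path_template must be a writable buffer ending in "XXXXXX"; mkstemp()
 * overwrites those characters with the name it actually created.
 * Returns 0 on success, -1 on failure (the file is removed on failure).
 * write_attachment_tmp is a hypothetical name used for this sketch.
 */
int write_attachment_tmp(char *path_template, const void *data, size_t len)
{
    const char *p = data;
    int fd;

    assert(path_template != NULL && (data != NULL || len == 0));
    fd = mkstemp(path_template); /* creates and opens a uniquely named file */
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is on disk. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(path_template);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    if (close(fd) < 0) {
        unlink(path_template);
        return -1;
    }
    return 0;
}
```

The caller owns the file afterwards and should unlink() it once the converter has finished, whether or not the conversion succeeded.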

Which approach would be better: using applications like pdftotext and catdoc
or, on the other hand, using their libraries and doing it almost from scratch?
Regards
Rui Carneiro
On Tue, Apr 21, 2009 at 5:52 PM, Rui Carneiro <rui.arc@gmail.com> wrote:
...
Great idea!
I will report back soon.
On Tue, Apr 21, 2009 at 5:32 PM, Timo Sirainen <tss@iki.fi> wrote:
...
I've no idea, but you could at least look at some of the other full text
search engines. I remember them advertising indexing support for all kinds
of formats. Maybe they're using some specific library or maybe it would be
easy to extract their parsing code.
--
mobile: +351 963446125
mail: rui.arc@gmail.com
mail: ei04073@fe.up.pt
website: http://paginas.fe.up.pt/~ei04073