[Dovecot] FTS Plugin design

Wed Apr 22 17:51:45 EEST 2009

Hi,

Almost all of the full text search engines (C/C++) I looked at (Swish-E,
Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead,
they call external applications like pdftotext, catdoc and catppt with execvp
(or equivalent). Using this approach in my project has some pros and cons:

Pros:
- The existing libraries to extract the content of pdf, doc (etc) are not
very stable.
- Easier to handle errors (even if those applications crash, dovecot will
still be running)
- Less development time
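
To illustrate the error-handling point, here is a minimal sketch of the
fork/execvp pattern, assuming a POSIX system. run_converter is a hypothetical
helper name, not an existing Dovecot function, and the fixed-size output
buffer is a simplification. A crash in the converter only shows up as a
signal status in the parent, which keeps running:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Dovecot): run an external converter
 * given as an argv[] array and collect its stdout in buf.
 * Returns the number of bytes stored (output beyond size is discarded),
 * or -1 if the converter could not be started, crashed or exited non-zero. */
ssize_t run_converter(char *const argv[], char *buf, size_t size)
{
    int fds[2], status;
    pid_t pid;
    char chunk[4096];
    size_t total = 0;
    ssize_t n;

    assert(argv != NULL && argv[0] != NULL && buf != NULL);

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: route stdout into the pipe, then become the converter. */
        close(fds[0]);
        if (dup2(fds[1], STDOUT_FILENO) < 0)
            _exit(127);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp() failed */
    }

    /* Parent: read until EOF so the child never blocks on a full pipe. */
    close(fds[1]);
    while ((n = read(fds[0], chunk, sizeof(chunk))) > 0) {
        size_t room = size - total;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(buf + total, chunk, take);
        total += take;
    }
    close(fds[0]);

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status)) {
        /* The converter crashed; the calling process is unaffected. */
        fprintf(stderr, "converter killed by signal %d\n", WTERMSIG(status));
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (ssize_t)total;
}
```

A caller would build an argv array such as {"catdoc", path, NULL} and treat
a return value of -1 as "skip this attachment".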

Cons:
- Some programs to parse special formats (e.g. catppt and pdftotext) do not
accept input from stdin (we need to create temporary files).
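
The temporary-file workaround could look roughly like this. It is only a
sketch, assuming POSIX mkstemp() is available; write_temp_attachment is a
hypothetical helper name and the /tmp path in the usage note is just an
example location:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of Dovecot): write an attachment held in
 * memory to a fresh temporary file. path_tmpl must be a writable string
 * ending in "XXXXXX"; mkstemp() overwrites it with the real file name.
 * Returns 0 on success, -1 on error. The caller must unlink() the file
 * once the converter has finished with it. */
int write_temp_attachment(char *path_tmpl, const void *data, size_t len)
{
    const char *p = data;
    int fd;

    assert(path_tmpl != NULL && strlen(path_tmpl) >= 6);
    assert(data != NULL || len == 0);

    fd = mkstemp(path_tmpl); /* creates and opens a uniquely named file */
    if (fd < 0) {
        perror("mkstemp");
        return -1;
    }

    /* write() may store fewer bytes than requested, so loop until done. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            perror("write");
            close(fd);
            unlink(path_tmpl);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (close(fd) < 0) {
        unlink(path_tmpl);
        return -1;
    }
    return 0;
}
```

With a template such as char path[] = "/tmp/fts-attach-XXXXXX", the filled-in
path can then be passed as an argument to the external converter and removed
with unlink() afterwards.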

Which approach would be better: using applications like pdftotext and catdoc
or, on the other hand, using their libraries and doing it almost from scratch?

Regards
Rui Carneiro

On Tue, Apr 21, 2009 at 5:52 PM, Rui Carneiro <rui.arc at gmail.com> wrote:

> Great idea!
>
> I will give news soon.
>
>
> On Tue, Apr 21, 2009 at 5:32 PM, Timo Sirainen <tss at iki.fi> wrote:
>
>> I've no idea, but you could at least look at some of the other full text
>> search engines. I remember them advertising indexing support for all kinds
>> of formats. Maybe they're using some specific library or maybe it would be
>> easy to extract their parsing code.
>>
>

-- 
mobile: +351 963446125
mail: rui.arc at gmail.com
mail: ei04073 at fe.up.pt
website: http://paginas.fe.up.pt/~ei04073