Hi,
Almost all of the full text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they rely on external applications like pdftotext, catdoc, catppt (etc.) and call them with execvp (or equivalent). Using this approach in my project has some pros and cons:
Pros:
- The existing libraries for extracting the content of pdf, doc (etc.) are not very stable.
- Errors are easier to handle (even if one of those applications crashes, dovecot will still be running).
- Less development time.
Cons:
- Some programs that parse special formats (e.g. catppt and pdftotext) do not accept input from stdin, so we need to create temporary files.
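To make the external-program option concrete, here is a rough sketch of how it could look, including the temporary file workaround. It is only an illustration under my own assumptions: extract_text() and the /tmp/fts-attach-XXXXXX template are names I made up (nothing here is Dovecot API), and the error handling is minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Dovecot): write `data` to a temporary
 * file, run `prog <tmpfile> [last_arg]` and collect its stdout in `out`.
 * Returns the number of bytes stored in `out` (output beyond `outcap` is
 * discarded), or -1 on failure or when the program exits non-zero. */
static ssize_t extract_text(const char *prog, const char *last_arg,
                            const void *data, size_t len,
                            char *out, size_t outcap)
{
    assert(prog != NULL && out != NULL);

    /* The converter cannot read stdin, so spill the attachment to disk. */
    char path[] = "/tmp/fts-attach-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);

    int pfd[2];
    if (pipe(pfd) < 0) {
        unlink(path);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* Child: stdin from /dev/null, stdout into the pipe, then exec. */
        char *argv[] = { (char *)prog, path, (char *)last_arg, NULL };
        int devnull = open("/dev/null", O_RDONLY);
        if (devnull >= 0)
            dup2(devnull, STDIN_FILENO);
        dup2(pfd[1], STDOUT_FILENO);
        close(pfd[0]);
        close(pfd[1]);
        execvp(prog, argv);
        _exit(127); /* only reached if execvp failed */
    }

    /* Parent: read until EOF; keep what fits, drain the rest. */
    close(pfd[1]);
    size_t total = 0;
    char scratch[4096];
    for (;;) {
        ssize_t n;
        if (total < outcap)
            n = read(pfd[0], out + total, outcap - total);
        else
            n = read(pfd[0], scratch, sizeof scratch);
        if (n <= 0)
            break;
        if (total < outcap)
            total += (size_t)n;
    }
    close(pfd[0]);

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    unlink(path);
    if (waited < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1; /* a crash in the converter ends up here, not in dovecot */
    return (ssize_t)total;
}
```

For a pdf this would be called as extract_text("pdftotext", "-", data, len, buf, sizeof buf), since pdftotext writes to stdout when the output file is "-". For catdoc, which prints to stdout by default, last_arg would be NULL.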
Which approach would be better: calling applications like pdftotext and catdoc, or using their libraries and writing the extraction almost from scratch?
Regards,
Rui Carneiro
On Tue, Apr 21, 2009 at 5:52 PM, Rui Carneiro <rui.arc@gmail.com> wrote:
Great idea!
I will report back soon.
On Tue, Apr 21, 2009 at 5:32 PM, Timo Sirainen <tss@iki.fi> wrote:
I've no idea, but you could at least look at some of the other full text search engines. I remember them advertising indexing support for all kinds of formats. Maybe they're using some specific library or maybe it would be easy to extract their parsing code.
-- mobile: +351 963446125 mail: rui.arc@gmail.com mail: ei04073@fe.up.pt website: http://paginas.fe.up.pt/~ei04073