On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote:
Hi,
Almost all of the full text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they use other applications like pdftotext, catdoc, catppt (etc.) and call them with execvp (or equivalent). Using this approach in my project has some pros and cons:
Pros:
- The existing libraries to extract the content of pdf, doc (etc.) files are not very stable.
- Easier to handle errors (even if those applications crash, Dovecot will still be running)
Hmm. I hadn't thought of this before. Yeah, if they're not stable it's probably not a good idea to run them in the same process as the rest of Dovecot. But I guess there could be some kind of a separate text extracting process that the fts plugin would talk to. If that process dies, it could get restarted automatically and fts could maybe retry. If it dies again, fts could log the failure and just skip over that content.
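To make the retry-then-skip idea concrete, here is a minimal sketch in Python (not Dovecot code). It is a simplified variant: it starts one short-lived child per document instead of talking to a long-lived extractor process. The name `extract_text` and the `retries` and `timeout` parameters are made up for illustration.

```python
import subprocess
import sys


def extract_text(argv, data, retries=1, timeout=30):
    """Run a converter command in a child process, feeding `data` on stdin.

    Returns the converter's stdout as text, or None if every attempt
    failed. A crash only kills the child, never the caller.
    """
    for attempt in range(retries + 1):
        try:
            proc = subprocess.run(
                argv,
                input=data,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Command missing, not executable, or hung past the timeout.
            print(f"extractor attempt {attempt + 1} failed: {exc}",
                  file=sys.stderr)
            continue
        if proc.returncode == 0:
            return proc.stdout.decode("utf-8", errors="replace")
        # Non-zero exit status, or a negative value if killed by a signal.
        print(f"extractor attempt {attempt + 1} exited with "
              f"{proc.returncode}", file=sys.stderr)
    # All attempts failed: the caller skips this content.
    return None
```

With `retries=1` the command is tried twice, which matches "retry, and if it dies again log it and skip". The caller treats `None` as "index nothing for this part".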
Cons:
- Some programs to parse special formats (e.g. catppt and pdftotext) do not accept input from stdin (we need to create temporary files).
Maybe those programs could be changed to accept input from stdin, and we would just require the newer versions?
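Until such versions exist, the temporary file workaround could look roughly like this Python sketch. The helper name `run_with_temp_file` and the `{path}` placeholder convention are my own inventions for the example, not something any of these tools define.

```python
import os
import subprocess
import sys
import tempfile


def run_with_temp_file(argv_template, data, suffix=""):
    """Run a converter that only reads from a named file.

    `data` is written to a temporary file, every "{path}" in
    `argv_template` is replaced by that file's path, and the
    converter's stdout is returned as bytes. The temporary file is
    removed even if the converter fails.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        argv = [arg.replace("{path}", path) for arg in argv_template]
        proc = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
        return proc.stdout
    finally:
        os.unlink(path)
```

A call might look like `run_with_temp_file(["pdftotext", "{path}", "-"], pdf_bytes, suffix=".pdf")`, assuming the installed pdftotext accepts `-` as the output file to mean stdout. That should be checked against the version actually deployed.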
Which approach would be better? Using applications like pdftotext and catdoc or, on the other hand, using their libraries and doing it almost from scratch?
I think the API that the fts plugin uses to do the conversion should be generic enough that both approaches would work. Then it would be easier to implement one or the other, or eventually both.
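As a rough illustration of what "generic enough" could mean, here is a Python sketch where an extractor is just a callable from bytes to text, so an external command and an in-process library function plug in the same way. `ExtractorRegistry`, `command_extractor` and the content-type keys are hypothetical names, not an existing Dovecot API.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Optional

# An extractor is anything that turns raw bytes into plain text.
Extractor = Callable[[bytes], str]


def command_extractor(argv: List[str]) -> Extractor:
    """Build an extractor that pipes data through an external command."""
    def extract(data: bytes) -> str:
        proc = subprocess.run(argv, input=data,
                              stdout=subprocess.PIPE, check=True)
        return proc.stdout.decode("utf-8", errors="replace")
    return extract


class ExtractorRegistry:
    """Maps a content type to whichever extractor handles it."""

    def __init__(self) -> None:
        self._by_type: Dict[str, Extractor] = {}

    def register(self, content_type: str, extractor: Extractor) -> None:
        self._by_type[content_type] = extractor

    def extract(self, content_type: str, data: bytes) -> Optional[str]:
        extractor = self._by_type.get(content_type)
        if extractor is None:
            return None  # no converter known for this type
        return extractor(data)
```

The indexing side only ever calls `registry.extract(content_type, data)`. Whether a given type is backed by `command_extractor([...])` or by a plain function wrapping a library is decided at registration time, so the two approaches can be mixed or swapped later.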