[Dovecot] FTS Plugin design

Wed May 13 19:26:01 EEST 2009

On Tue, 2009-05-05 at 12:08 +0100, Rui Carneiro wrote:
> >  - fts_build_mail() indexes a single mail. It parses the messages and
> > returns the data in small blocks. For text/* and message/rfc822 parts
> > those blocks are currently sent to FTS backend. This is where I think
> > you should look into hooking your attachment parsing. Change
> > fts_build_want_index_part() to look for more content-types that you're
> > interested in and then before feeding the blocks to FTS backend put them
> > through your own converter function, something like:
> >
> > int attachment_extract_text(struct attachment_extract_context *ctx,
> > const struct message_block *input, struct message_block *output);
> 
> 
> Let's take the example of an application/pdf content-type. Before I can
> convert the PDF data to text I need to gather all of it first. The current
> process feeds the FTS backend small parts of data and appends them
> in the "build_more" functions (e.g. fts_backend_solr_build_more()).

Right.

> So where should I call attachment_extract_text()? In
> fts_backend_solr_build_more(), without appending to cmd until the data is
> extracted? Or gather all the data earlier (e.g. in fts_build_mail()) and
> send it all at once to the FTS backend?

Since others already mentioned that many formats pretty much require
having the entire file available, I guess it's better to just save each
attachment to a file at some point. So if I wrote the code it would
probably work something like:

1. You notice a non-text/* content-type and initialize text extraction
for the MIME part. Like:

struct attachment_extract_context *
attachment_extract_init(const char *content_type);

2. After this you feed all the input belonging to that MIME part to:

int attachment_extract_add(struct attachment_extract_context *ctx,
                           const struct message_block *input);

Don't output anything to the FTS backend at this point. The
attachment_extract_add() would probably just write to a temporary
file.

3. Finally you'll notice that the MIME part ends (either you get headers
for the next MIME part or the entire message ends). Then finish the
extraction, which actually runs whatever conversion binaries are needed:

int attachment_extract_finish(struct attachment_extract_context *ctx);

4. Get the resulting text to fts_backend_build_more() somehow. Either
through some attachment_extract_add_to_fts() that adds it internally, or
through some kind of iterator that returns the text in smaller blocks.
Either would work.

That kind of API would also make it pretty easy to change the code later
so that specific content types skip the temporary file when it isn't
required.
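
To make the four steps concrete, here is a rough sketch of how they could
fit together. It is not Dovecot code. struct message_block is reduced to a
data pointer and a size, attachment_extract_next() and
attachment_extract_deinit() are made-up names for the iterator variant of
step 4 and for cleanup, and attachment_extract_finish() only rewinds the
temporary file where a real version would run a converter chosen by
content type.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: simplified stand-in for Dovecot's struct message_block. */
struct message_block {
	const unsigned char *data;
	size_t size;
};

struct attachment_extract_context {
	char *content_type;
	FILE *tmp;	/* buffered attachment bytes */
	int finished;	/* set once attachment_extract_finish() succeeded */
};

/* Step 1: start extraction for one MIME part. */
struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
	struct attachment_extract_context *ctx = calloc(1, sizeof(*ctx));

	if (ctx == NULL)
		return NULL;
	ctx->content_type = malloc(strlen(content_type) + 1);
	ctx->tmp = tmpfile();
	if (ctx->content_type == NULL || ctx->tmp == NULL) {
		if (ctx->tmp != NULL)
			fclose(ctx->tmp);
		free(ctx->content_type);
		free(ctx);
		return NULL;
	}
	strcpy(ctx->content_type, content_type);
	return ctx;
}

/* Step 2: buffer the input. Nothing goes to the FTS backend yet. */
int attachment_extract_add(struct attachment_extract_context *ctx,
			   const struct message_block *input)
{
	assert(ctx != NULL && input != NULL);
	if (ctx->finished)
		return -1;
	if (fwrite(input->data, 1, input->size, ctx->tmp) != input->size)
		return -1;
	return 0;
}

/* Step 3: placeholder. A real version would run the conversion binary
   here and keep its text output; this sketch just rewinds the file. */
int attachment_extract_finish(struct attachment_extract_context *ctx)
{
	assert(ctx != NULL);
	if (fflush(ctx->tmp) != 0)
		return -1;
	rewind(ctx->tmp);
	ctx->finished = 1;
	return 0;
}

/* Step 4 (hypothetical iterator): copy the next block of text into buf.
   Returns the number of bytes copied, 0 at the end or before finish. */
size_t attachment_extract_next(struct attachment_extract_context *ctx,
			       unsigned char *buf, size_t buf_size)
{
	assert(ctx != NULL && buf != NULL);
	if (!ctx->finished)
		return 0;
	return fread(buf, 1, buf_size, ctx->tmp);
}

/* Hypothetical cleanup; closing the tmpfile() stream also removes it. */
void attachment_extract_deinit(struct attachment_extract_context *ctx)
{
	fclose(ctx->tmp);
	free(ctx->content_type);
	free(ctx);
}
```

The caller would then loop over attachment_extract_next() and pass each
block on to fts_backend_build_more(). Since the temporary file is hidden
behind the context, a content type that can be converted as a stream could
later get a different implementation without changing the callers.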