Quoting Timo Sirainen tss@iki.fi:
- You notice a non-text/* content-type and initialize text extraction for the MIME part. Like:
struct attachment_extract_context * attachment_extract_init(const char *content_type);
- After this you feed all the input belonging to that MIME part to:
int attachment_extract_add(struct attachment_extract_context *ctx, const struct message_block *input);
Don't output anything to FTS backend at this point. The attachment_extract_add() would probably just write to a temporary file.
- Finally you'll notice that the MIME part ends (either you get headers for the next MIME part or the entire message ends). Then finish the extraction, which actually executes whatever conversion binaries are needed:
int attachment_extract_finish(struct attachment_extract_context *ctx);
- Get the resulting text to fts_backend_build_more() somehow. Either some attachment_extract_add_to_fts() which internally adds it, or some kind of an iterator that returns the text in smaller blocks. Either would work.
That kind of an API would also make it fairly easy to change later so that it does not write temporary files for specific content types when that isn't required.
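The steps above could be sketched roughly as follows. This is only an illustration, not Dovecot code: struct message_block is reduced to a stand-in with just data and size, the layout of the context struct is made up, and attachment_extract_next() and attachment_extract_deinit() are hypothetical names for the iterator variant and the cleanup. The finish step does not run any conversion binary here; it only rewinds the temporary file, so the "extracted text" is simply the buffered input.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for Dovecot's struct message_block:
   only the payload fields this sketch needs. */
struct message_block {
    const unsigned char *data;
    size_t size;
};

/* Hypothetical layout; the proposal only names the struct. */
struct attachment_extract_context {
    char *content_type;
    FILE *tmp;      /* buffered bytes of the MIME part */
    int finished;   /* set once attachment_extract_finish() succeeded */
};

struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
    struct attachment_extract_context *ctx = calloc(1, sizeof(*ctx));
    size_t len = strlen(content_type) + 1;

    if (ctx == NULL)
        return NULL;
    ctx->content_type = malloc(len);
    ctx->tmp = tmpfile();
    if (ctx->content_type == NULL || ctx->tmp == NULL) {
        if (ctx->tmp != NULL)
            fclose(ctx->tmp);
        free(ctx->content_type);
        free(ctx);
        return NULL;
    }
    memcpy(ctx->content_type, content_type, len);
    return ctx;
}

/* Only buffers the input; nothing is sent to the FTS backend yet. */
int attachment_extract_add(struct attachment_extract_context *ctx,
                           const struct message_block *input)
{
    assert(ctx != NULL && !ctx->finished);
    if (input->size == 0)
        return 0;
    if (fwrite(input->data, 1, input->size, ctx->tmp) != input->size)
        return -1;
    return 0;
}

/* A real implementation would run the converter chosen for
   ctx->content_type here and keep its text output. This sketch
   leaves the buffered bytes unchanged and rewinds for reading. */
int attachment_extract_finish(struct attachment_extract_context *ctx)
{
    assert(ctx != NULL);
    if (fflush(ctx->tmp) != 0)
        return -1;
    rewind(ctx->tmp);
    ctx->finished = 1;
    return 0;
}

/* Hypothetical iterator: copies up to buf_size bytes of the result
   into buf and returns the number of bytes copied, 0 at the end. */
size_t attachment_extract_next(struct attachment_extract_context *ctx,
                               unsigned char *buf, size_t buf_size)
{
    assert(ctx != NULL && ctx->finished);
    return fread(buf, 1, buf_size, ctx->tmp);
}

void attachment_extract_deinit(struct attachment_extract_context *ctx)
{
    if (ctx == NULL)
        return;
    fclose(ctx->tmp);
    free(ctx->content_type);
    free(ctx);
}
```

With the iterator variant, the caller would loop over attachment_extract_next() after the finish call and hand each block on to the FTS backend. Skipping the temporary file for some content types would then only change what happens inside add and finish, not the callers.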
I tried your approach and I think it is working pretty well. Now I only need to look carefully at the output of the external programs and build the XML correctly to send to Solr.
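For the XML part, the main thing to watch is that converter output can contain markup characters and stray control bytes. A minimal sketch of the escaping, assuming the text ends up inside a <field> element of a Solr XML update message; xml_escape() is a hypothetical helper name, and replacing control characters with spaces is just one possible choice:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated, NUL-terminated copy of the first len
   bytes of text that is safe to place in XML character data or an
   attribute value. Returns NULL if allocation fails. The caller
   frees the result. */
char *xml_escape(const char *text, size_t len)
{
    char *out;
    size_t i, o = 0;

    assert(text != NULL || len == 0);
    /* Worst case: every byte expands to a 6-byte entity. */
    out = malloc(len * 6 + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        const char *rep = NULL;

        switch (c) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&apos;"; break;
        default:   break;
        }
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(out + o, rep, n);
            o += n;
        } else if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
            /* Control characters not allowed in XML 1.0:
               replaced with a space in this sketch. */
            out[o++] = ' ';
        } else {
            out[o++] = (char)c;
        }
    }
    out[o] = '\0';
    return out;
}
```

This does not check that the bytes are valid UTF-8, which the output of the external programs may also need.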
Thanks Timo
Regards, Rui Carneiro
-- Portugalmail, Comunicações S.A. www.portugalmail.net