Hi again,
On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen <tss@iki.fi> wrote:
- fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then before feeding the blocks to FTS backend put them through your own converter function, something like:
int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);
Let's take an application/pdf content type as an example. Before I can convert the PDF data to text, I need to gather all of it first. The current process feeds the FTS backend small blocks of data, which the "build_more" functions (e.g. fts_backend_solr_build_more()) append one at a time.
So where should I call attachment_extract_text()? In fts_backend_solr_build_more(), holding off on appending to cmd until the data has been extracted? Or should I gather all the data earlier (e.g. in fts_build_mail()) and send it to the FTS backend all at once?
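To make the second option concrete, here is a rough sketch of the buffering I have in mind. It is standalone C, not Dovecot code: struct demo_block, struct demo_extract_context, demo_convert() and demo_extract_text() are hypothetical names, and I am assuming that a block with size 0 marks the end of the MIME part. The converter is only a placeholder that copies the bytes; a real one would run the PDF-to-text step there.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed block of message data.
   Assumption: size == 0 marks the end of the current MIME part. */
struct demo_block {
	const unsigned char *data;
	size_t size;
};

/* Hypothetical per-part state: all attachment bytes seen so far. */
struct demo_extract_context {
	unsigned char *buf;
	size_t used, alloc;
};

/* Placeholder converter. A real one would turn the PDF into text;
   this one just copies the input so the sketch stays runnable. */
static int demo_convert(const unsigned char *in, size_t in_size,
			unsigned char **out, size_t *out_size)
{
	*out = malloc(in_size > 0 ? in_size : 1);
	if (*out == NULL)
		return -1;
	if (in_size > 0)
		memcpy(*out, in, in_size);
	*out_size = in_size;
	return 0;
}

/* Returns 0 while more input is needed, 1 once output holds the
   extracted text (the caller frees output->data), -1 on error. */
static int demo_extract_text(struct demo_extract_context *ctx,
			     const struct demo_block *input,
			     struct demo_block *output)
{
	unsigned char *text;
	size_t text_size;

	assert(ctx != NULL && input != NULL && output != NULL);

	if (input->size > 0) {
		/* Middle of the part: only accumulate, emit nothing yet. */
		if (ctx->used + input->size > ctx->alloc) {
			size_t new_alloc = (ctx->used + input->size) * 2;
			unsigned char *p = realloc(ctx->buf, new_alloc);
			if (p == NULL)
				return -1;
			ctx->buf = p;
			ctx->alloc = new_alloc;
		}
		memcpy(ctx->buf + ctx->used, input->data, input->size);
		ctx->used += input->size;
		return 0;
	}

	/* End of the part: convert everything gathered so far in one go. */
	if (demo_convert(ctx->buf, ctx->used, &text, &text_size) < 0)
		return -1;
	free(ctx->buf);
	ctx->buf = NULL;
	ctx->used = ctx->alloc = 0;

	output->data = text;
	output->size = text_size;
	return 1;
}
```

With something like this, the caller would pass data on to the backend's build_more function only when the return value is 1, so the backend never sees raw PDF bytes. The cost is that the whole attachment stays in memory until the part ends.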
I hope I've made myself clear.
Regards, Rui Carneiro
Portugalmail, Comunicações S.A. www.portugalmail.net