On Tue, 2009-05-05 at 12:08 +0100, Rui Carneiro wrote:
- fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then before feeding the blocks to FTS backend put them through your own converter function, something like:
int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);
Let's take the example of an application/pdf content-type. Before I can convert the PDF data to text I need to gather all of it first. The current process feeds the FTS backend small blocks of data and appends them in the "build_more" functions (e.g. fts_backend_solr_build_more()).
Right.
So where should I call attachment_extract_text()? In fts_backend_solr_build_more(), without appending anything to cmd until the data has been extracted? Or should I gather all the data earlier (e.g. in fts_build_mail()) and send it to the FTS backend all at once?
Since others already mentioned that many formats pretty much require having the entire file available, I guess it's better to just save all the attachments to file at some point. So if I wrote the code it would probably work something like:
- You notice a non-text/* content-type and initialize text extraction for the MIME part. Like:
struct attachment_extract_context * attachment_extract_init(const char *content_type);
- After this you feed all the input belonging to that MIME part to:
int attachment_extract_add(struct attachment_extract_context *ctx, const struct message_block *input);
Don't output anything to the FTS backend at this point. attachment_extract_add() would probably just write to a temporary file.
- Finally you'll notice that the MIME part ends (either you get headers for the next MIME part or the entire message ends). Then finish the extraction, which actually executes whatever conversion binaries are needed:
int attachment_extract_finish(struct attachment_extract_context *ctx);
- Get the resulting text to fts_backend_build_more() somehow. Either some attachment_extract_add_to_fts() which adds it internally, or some kind of an iterator that returns the text in smaller blocks. Either would work.
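The steps above could be sketched roughly like this. It's only a minimal sketch under a few assumptions: struct message_block is a cut-down stand-in with just data and size, the conversion step is a placeholder that passes the bytes through unchanged instead of running a converter binary, and attachment_extract_read() and attachment_extract_deinit() are hypothetical names for the iterator and cleanup.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for Dovecot's struct message_block, reduced to the two
   fields this sketch uses. */
struct message_block {
	const unsigned char *data;
	size_t size;
};

/* Hypothetical context: one per MIME part being extracted. */
struct attachment_extract_context {
	char *content_type;
	FILE *spool;   /* temporary file holding the raw attachment */
	int finished;  /* set once attachment_extract_finish() succeeds */
};

struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
	struct attachment_extract_context *ctx;
	size_t len;

	assert(content_type != NULL);
	ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL)
		return NULL;
	len = strlen(content_type) + 1;
	ctx->content_type = malloc(len);
	ctx->spool = tmpfile();
	if (ctx->content_type == NULL || ctx->spool == NULL) {
		if (ctx->spool != NULL)
			fclose(ctx->spool);
		free(ctx->content_type);
		free(ctx);
		return NULL;
	}
	memcpy(ctx->content_type, content_type, len);
	return ctx;
}

/* Spool one block to the temporary file. Nothing is sent to the FTS
   backend yet. Returns 0 on success, -1 on write failure. */
int attachment_extract_add(struct attachment_extract_context *ctx,
			   const struct message_block *input)
{
	assert(ctx != NULL && input != NULL && !ctx->finished);
	if (input->size == 0)
		return 0;
	if (fwrite(input->data, 1, input->size, ctx->spool) != input->size)
		return -1;
	return 0;
}

/* Called when the MIME part ends. A real version would run the
   converter for ctx->content_type on the spooled file here; this
   placeholder just rewinds the file so the raw bytes are read back
   as the "extracted text". */
int attachment_extract_finish(struct attachment_extract_context *ctx)
{
	assert(ctx != NULL);
	if (fflush(ctx->spool) != 0 || fseek(ctx->spool, 0L, SEEK_SET) != 0)
		return -1;
	ctx->finished = 1;
	return 0;
}

/* Hypothetical iterator: copies up to size bytes of text into buf and
   returns the number of bytes copied, 0 when there is nothing left. */
size_t attachment_extract_read(struct attachment_extract_context *ctx,
			       void *buf, size_t size)
{
	assert(ctx != NULL && ctx->finished);
	return fread(buf, 1, size, ctx->spool);
}

void attachment_extract_deinit(struct attachment_extract_context **_ctx)
{
	struct attachment_extract_context *ctx = *_ctx;

	*_ctx = NULL;
	if (ctx == NULL)
		return;
	fclose(ctx->spool);
	free(ctx->content_type);
	free(ctx);
}
```

The caller would loop over attachment_extract_read() after finishing and pass each returned block on to fts_backend_build_more(), so the backend still sees the text in small pieces.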
That kind of an API would also make it pretty easy to change things later so that temporary files aren't written for specific content types that don't require them.
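One way that per-content-type choice might look is a small lookup table consulted at init time. This is only an illustration: the enum, the table and attachment_extract_mode_for() are hypothetical names, and the listed content types and their modes are examples rather than a claim about which formats can really be converted as a stream.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h> /* strcasecmp() (POSIX) */

enum attachment_extract_mode {
	ATTACHMENT_EXTRACT_NONE,   /* content type isn't handled at all */
	ATTACHMENT_EXTRACT_STREAM, /* convertible block by block, no temp file */
	ATTACHMENT_EXTRACT_SPOOL   /* converter needs the whole file on disk */
};

struct attachment_type_rule {
	const char *content_type;
	enum attachment_extract_mode mode;
};

/* Illustrative entries only. */
static const struct attachment_type_rule attachment_type_rules[] = {
	{ "application/pdf", ATTACHMENT_EXTRACT_SPOOL },
	{ "application/msword", ATTACHMENT_EXTRACT_SPOOL },
	{ "application/xhtml+xml", ATTACHMENT_EXTRACT_STREAM }
};

/* Look up how a MIME part should be handled. Content types are
   compared case-insensitively; unknown types are not extracted. */
enum attachment_extract_mode
attachment_extract_mode_for(const char *content_type)
{
	size_t i, count;

	assert(content_type != NULL);
	count = sizeof(attachment_type_rules) / sizeof(attachment_type_rules[0]);
	for (i = 0; i < count; i++) {
		if (strcasecmp(attachment_type_rules[i].content_type,
			       content_type) == 0)
			return attachment_type_rules[i].mode;
	}
	return ATTACHMENT_EXTRACT_NONE;
}
```

attachment_extract_init() could then open a temporary file only for the SPOOL case, while the callers of init/add/finish stay the same.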