Hi again,
Does anyone know of good libraries for handling the content of files such as pdf, ppt, doc, etc.? I am already indexing attachments; all I need now is to extract their text.
Regards, Rui Carneiro
On Mon, Apr 20, 2009 at 3:29 PM, Rui Carneiro rui.arc@gmail.com wrote:
Hi,
The problem was with the flag value. My hex-to-binary conversion was wrong.
Regards, Rui Carneiro
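A note on the flag issue above, as a minimal sketch rather than Dovecot's actual code: the value 0x16 asked about in the earlier message quoted below is binary 0001 0110, so it sets three bits (0x10, 0x04 and 0x02) instead of one. Any flag that already uses one of those bits would then appear to be set as well. The enum names and values here are hypothetical and are not taken from message_part_flags; only the value 0x16 comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag set for illustration only; these are NOT the real
   message_part_flags values. Each flag occupies one distinct bit. */
enum demo_part_flags {
	DEMO_FLAG_MULTIPART = 0x01, /* binary 0000 0001 */
	DEMO_FLAG_SIGNED    = 0x02, /* binary 0000 0010 */
	DEMO_FLAG_RFC822    = 0x04, /* binary 0000 0100 */
	DEMO_FLAG_TEXT      = 0x08, /* binary 0000 1000 */
	DEMO_FLAG_OTHER     = 0x10  /* binary 0001 0000 */
};

/* Returns true when value has exactly one bit set, which is what an
   independent bit flag needs. */
bool demo_is_single_bit(unsigned int value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* Returns true when candidate shares at least one bit with a mask of
   flags that are already in use. */
bool demo_overlaps(unsigned int candidate, unsigned int used_mask)
{
	return (candidate & used_mask) != 0;
}
```

With these assumed values, demo_is_single_bit(0x16) is false and demo_overlaps(0x16, DEMO_FLAG_RFC822) is true, whereas a value such as 0x20 passes both checks against the five flags above. A new flag should therefore be the next unused power of two in the real enum.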
On Fri, Apr 17, 2009 at 10:03 AM, Rui Carneiro rui.arc@gmail.com wrote:
Thank you for all the tips. The design looks clearer to me now.
I have one more question. I looked into fts_build_want_index_part() and saw that I need to add some flags to message_part_flags. What values should I choose? My first approach was to follow your scheme and set MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?
I have already changed parse_content_type() to set ctx->part->flags correctly, but if I use my custom flag, Dovecot assumes that all attachment lines are headers. I also tried setting ctx->part->flags to TEXT for those parts, and the fts backend was then fed all the attachment lines correctly.
I don't know if this is related to the value of MESSAGE_PART_FLAG_ATTACHMENT or if I am missing something (like setting block.hdr = NULL, or some more code to handle new flags).
Thank you, Rui Carneiro
On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen tss@iki.fi wrote:
On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:
I don't yet understand the plugin's design or how plugins are called from the core system, and I was wondering if anyone could help me with that.
fts-storage.c hooks into all the functions in mail-storage API that it needs to. Currently indexing isn't done while messages are being saved, but instead just before searching. The searching functions are:
fts_mailbox_search_init() tries to figure out if FTS can optimize the search. If it can, it tries to figure out if the FTS index is up-to-date and, if not, starts the search.
fts_mailbox_search_next_nonblock() continues the indexing (or searching after indexing) for a while. The idea is that IMAP connection is able to process other commands while doing a long-running search. So fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It would be nice if that value was dynamically calculated and also based on bytes instead of messages, but that's maybe too much trouble.
fts_mailbox_search_next_update_seq() uses the fts search results and updates mail-storage's search stuff so that it doesn't go through messages that don't match.
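The batching idea described for fts_mailbox_search_next_nonblock() can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: the struct, the function name and DEMO_NONBLOCK_COUNT are hypothetical, and only the batch size of 50 comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the batch size of 50 mentioned for FTS_SEARCH_NONBLOCK_COUNT;
   the macro name itself is made up for this sketch. */
#define DEMO_NONBLOCK_COUNT 50

/* Hypothetical progress state kept between calls. */
struct demo_index_state {
	size_t next_msg;   /* index of the next message to process */
	size_t total_msgs; /* number of messages in the mailbox */
	size_t indexed;    /* messages indexed so far */
};

/* Processes at most DEMO_NONBLOCK_COUNT messages and then returns, so
   the caller can handle other work before calling again.
   Returns 1 when all messages are done, 0 when more work remains. */
int demo_index_step(struct demo_index_state *st)
{
	size_t budget = DEMO_NONBLOCK_COUNT;

	while (budget > 0 && st->next_msg < st->total_msgs) {
		/* A real implementation would parse and index the
		   message here; this sketch only counts it. */
		st->next_msg++;
		st->indexed++;
		budget--;
	}
	return st->next_msg >= st->total_msgs;
}
```

A caller would invoke demo_index_step() repeatedly until it returns 1. Basing the budget on bytes rather than messages, as suggested above, would mean decrementing the budget by each message's size instead of by one.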
fts_build_mail() indexes a single mail. It parses the message and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to the FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then, before feeding the blocks to the FTS backend, put them through your own converter function, something like:
int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);
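To make the shape of such a converter concrete, here is a minimal self-contained sketch that follows the prototype above. Everything in it is an assumption for illustration: struct demo_block is a reduced stand-in for message_block with only a data pointer and a size, struct demo_extract_context and its fixed buffer are invented, and the "extraction" is a placeholder that keeps printable ASCII and whitespace. Real pdf, ppt or doc handling would need a format-specific library in place of the filter loop.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a message block: only raw data and size. */
struct demo_block {
	const unsigned char *data;
	size_t size;
};

/* Hypothetical converter state: a scratch buffer that owns the output. */
struct demo_extract_context {
	unsigned char buf[4096];
};

/* Placeholder converter: copies printable ASCII, newlines and tabs from
   input into the context buffer and points output at the result.
   Returns 0 on success, -1 if the input does not fit in the buffer. */
int demo_extract_text(struct demo_extract_context *ctx,
		      const struct demo_block *input,
		      struct demo_block *output)
{
	size_t i, out = 0;

	if (input->size > sizeof(ctx->buf))
		return -1;

	for (i = 0; i < input->size; i++) {
		unsigned char c = input->data[i];

		if (isprint(c) || c == '\n' || c == '\t')
			ctx->buf[out++] = c;
	}
	output->data = ctx->buf;
	output->size = out;
	return 0;
}
```

The output block points into the context's buffer, so it stays valid only until the next call with the same context. That matches a block-at-a-time pipeline where each converted block is handed to the backend before the next one is parsed.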
-- mobile: +351 963446125 mail: rui.arc@gmail.com mail: ei04073@fe.up.pt website: http://paginas.fe.up.pt/~ei04073