[Dovecot] FTS Plugin design

Rui Carneiro rui.arc at gmail.com
Tue Apr 21 13:25:47 EEST 2009


Hi again,

Does anyone know of good libraries for handling the content of files like pdf,
ppt, doc, etc.? I am already indexing attachments; all I need now is to extract
the text from them.

Regards,
Rui Carneiro

On Mon, Apr 20, 2009 at 3:29 PM, Rui Carneiro <rui.arc at gmail.com> wrote:

> Hi,
>
> The problem was with the flag. My hex-to-binary conversion was wrong.
>
> Regards,
> Rui Carneiro
>
>
>
> On Fri, Apr 17, 2009 at 10:03 AM, Rui Carneiro <rui.arc at gmail.com> wrote:
>
>> Thank you for all the tips. The design looks clearer to me now.
>>
>> I have one more question. I looked into fts_build_want_index_part() and
>> saw that I need to add some flags to message_part_flags. What values should
>> I choose? My first approach was to follow your schema and set
>> MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?
>>
>> I have already changed parse_content_type() to set ctx->part->flags
>> correctly, but if I choose my custom flag Dovecot assumes that all attachment
>> lines are headers. I also tried setting ctx->part->flags to TEXT, and then
>> the fts_backend was fed all attachment lines correctly.
>>
>> I don't know if this is related to the value of
>> MESSAGE_PART_FLAG_ATTACHMENT or if I am missing something (like setting
>> block.hdr = NULL or some more code to handle new flags).
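
One thing worth checking with a value like 0x16: it is not a single bit.
0x16 is binary 10110, i.e. 0x10 | 0x04 | 0x02, so OR-ing it into a flags
field also sets whatever flags own 0x04 and 0x02. Below is a minimal sketch
of that check. The EXAMPLE_* names and values are hypothetical and are not
copied from the real message_part_flags enum; only the bit arithmetic is the
point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag layout for illustration only; these are NOT taken
   from Dovecot's message_part_flags. Each flag owns exactly one bit. */
enum example_part_flags {
	EXAMPLE_FLAG_A          = 0x01,
	EXAMPLE_FLAG_B          = 0x02,
	EXAMPLE_FLAG_C          = 0x04,
	EXAMPLE_FLAG_D          = 0x08,
	EXAMPLE_FLAG_ATTACHMENT = 0x10  /* decimal 16, the next free bit */
};

/* True if value has exactly one bit set, i.e. is usable as a flag. */
static bool is_single_bit(unsigned int value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* Returns the bits of candidate that collide with already-used bits. */
static unsigned int colliding_bits(unsigned int candidate, unsigned int used)
{
	return candidate & used;
}

/* Sets a flag, refusing multi-bit values such as 0x16. */
static unsigned int add_flag(unsigned int flags, unsigned int flag)
{
	assert(is_single_bit(flag));
	return flags | flag;
}
```

With the layout above, colliding_bits(0x16, 0x0f) yields 0x06, so the
candidate would overlap EXAMPLE_FLAG_B and EXAMPLE_FLAG_C, while 0x10 does
not collide with anything.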
>>
>> Thank you,
>> Rui Carneiro
>>
>>
>> On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen <tss at iki.fi> wrote:
>>
>>> On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:
>>> > I haven't yet understood the plugin's design or how the plugins are
>>> > called from the core system, and I was wondering if anyone could help me
>>> > with that.
>>>
>>> fts-storage.c hooks into all the functions in mail-storage API that it
>>> needs to. Currently indexing isn't done while messages are being saved,
>>> but instead just before searching. The searching functions are:
>>>
>>>  - fts_mailbox_search_init() tries to figure out if FTS can optimize the
>>> search. If it can, it tries to figure out if the FTS index is up-to-date
>>> and if not, starts the search.
>>>
>>>  - fts_mailbox_search_next_nonblock() continues the indexing (or
>>> searching after indexing) for a while. The idea is that IMAP connection
>>> is able to process other commands while doing a long-running search. So
>>> fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It
>>> would be nice if that value was dynamically calculated and also based on
>>> bytes instead of messages, but that's maybe too much trouble.
>>>
>>>  - fts_mailbox_search_next_update_seq() uses the fts search results and
>>> updates mail-storage's search stuff so that it doesn't go through
>>> messages that don't match.
>>>
>>>  - fts_build_mail() indexes a single mail. It parses the messages and
>>> returns the data in small blocks. For text/* and message/rfc822 parts
>>> those blocks are currently sent to FTS backend. This is where I think
>>> you should look into hooking your attachment parsing. Change
>>> fts_build_want_index_part() to look for more content-types that you're
>>> interested in and then before feeding the blocks to FTS backend put them
>>> through your own converter function, something like:
>>>
>>> int attachment_extract_text(struct attachment_extract_context *ctx,
>>> const struct message_block *input, struct message_block *output);
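
To make the shape of such a converter concrete, here is a rough sketch under
stated assumptions. The example_* structs are simplified stand-ins, not
Dovecot's real struct message_block or any existing context type, and the
"conversion" is only a placeholder that keeps printable ASCII. A real
converter would call a format-specific extractor at that point.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT Dovecot's real structs. */
struct example_block {
	const unsigned char *data;
	size_t size;
};

struct example_extract_context {
	unsigned char buf[4096];  /* output storage owned by the context */
};

/* Placeholder converter: copies printable ASCII, newlines and tabs, and
   emits at most one space for each run of other bytes. Returns 0 on
   success, -1 if the input block does not fit into the context buffer. */
static int example_extract_text(struct example_extract_context *ctx,
				const struct example_block *input,
				struct example_block *output)
{
	size_t i, n = 0;

	assert(ctx != NULL && input != NULL && output != NULL);
	if (input->size > sizeof(ctx->buf))
		return -1;
	for (i = 0; i < input->size; i++) {
		unsigned char c = input->data[i];

		if (isprint(c) || c == '\n' || c == '\t')
			ctx->buf[n++] = c;
		else if (n > 0 && ctx->buf[n - 1] != ' ')
			ctx->buf[n++] = ' ';
	}
	output->data = ctx->buf;
	output->size = n;
	return 0;
}
```

The output block points into the context's buffer, so it stays valid only
until the next call with the same context. That matches a pipeline where
each converted block is handed to the backend before the next one is read.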
>>>
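
The batching described for fts_mailbox_search_next_nonblock() can be
pictured roughly as below. This is only a sketch: the struct and function
names are hypothetical, and just the batch size of 50 comes from the text
above. Each call indexes a bounded number of messages and then returns, so
the caller can handle other commands before calling again.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_NONBLOCK_COUNT 50  /* batch size quoted in the text above */

/* Hypothetical indexing state; not Dovecot's real build context. */
struct example_build_context {
	size_t next_seq;  /* next message to index, 0-based */
	size_t total;     /* number of messages in the mailbox */
	size_t indexed;   /* messages indexed so far */
};

/* Indexes at most EXAMPLE_NONBLOCK_COUNT messages, then returns so the
   caller can service other work. Returns true once everything is indexed. */
static bool example_build_more(struct example_build_context *ctx)
{
	size_t budget = EXAMPLE_NONBLOCK_COUNT;

	while (budget > 0 && ctx->next_seq < ctx->total) {
		/* a real implementation would index message next_seq here */
		ctx->next_seq++;
		ctx->indexed++;
		budget--;
	}
	return ctx->next_seq >= ctx->total;
}
```

Switching to a byte-based budget, as suggested above, would mean
decrementing the budget by each message's size instead of by one.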



-- 
mobile: +351 963446125
mail: rui.arc at gmail.com
mail: ei04073 at fe.up.pt
website: http://paginas.fe.up.pt/~ei04073

