[Dovecot] FTS Plugin design

Rui Carneiro

13 Apr 2009

1:18 p.m.

Hi all,

I am currently working on some changes to the Solr plugin. I want the plugin to index attachment content as well. I have started reading the plugin's source, but I am having trouble understanding how it works.

I don't yet understand the plugin's design or how plugins are called from the core system, and I was wondering if anyone could help me with that.

Sorry if these questions sound stupid, but I am a newcomer to Dovecot.

Regards, Rui Carneiro

Timo Sirainen

16 Apr

1:23 a.m.

On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:

...

I don't yet understand the plugin's design or how plugins are called from the core system, and I was wondering if anyone could help me with that.

fts-storage.c hooks into all the functions in mail-storage API that it needs to. Currently indexing isn't done while messages are being saved, but instead just before searching. The searching functions are:

fts_mailbox_search_init() tries to figure out if FTS can optimize the search. If it does, it tries to figure out if FTS index is up-to-date and if not, starts the search.
fts_mailbox_search_next_nonblock() continues the indexing (or searching after indexing) for a while. The idea is that IMAP connection is able to process other commands while doing a long-running search. So fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It would be nice if that value was dynamically calculated and also based on bytes instead of messages, but that's maybe too much trouble.
fts_mailbox_search_next_update_seq() uses the fts search results and updates mail-storage's search stuff so that it doesn't go through messages that don't match.
fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then before feeding the blocks to FTS backend put them through your own converter function, something like:

int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);
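
A minimal sketch of the content-type decision described above. It is not the plugin's code: the real fts_build_want_index_part() has its own signature, and the helper names and the list of attachment types below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical list of attachment types worth converting to text. */
static const char *const extractable_types[] = {
	"application/pdf",
	"application/msword",
	"application/vnd.ms-powerpoint",
};

/* text/ and message/rfc822 parts can be fed to the FTS backend as-is. */
static bool is_directly_indexable(const char *content_type)
{
	assert(content_type != NULL);
	return strncasecmp(content_type, "text/", 5) == 0 ||
		strcasecmp(content_type, "message/rfc822") == 0;
}

/* Parts that must pass through a converter before being indexed. */
static bool is_extractable(const char *content_type)
{
	size_t i, count = sizeof(extractable_types) / sizeof(extractable_types[0]);

	assert(content_type != NULL);
	for (i = 0; i < count; i++) {
		if (strcasecmp(content_type, extractable_types[i]) == 0)
			return true;
	}
	return false;
}

/* The widened "do we want to index this part?" check. */
static bool want_index_content_type(const char *content_type)
{
	return is_directly_indexable(content_type) ||
		is_extractable(content_type);
}
```

Blocks from parts where is_extractable() is true would be routed through the converter; everything else that passes want_index_content_type() goes to the backend unchanged.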

Rui Carneiro

17 Apr

12:03 p.m.

Thank you for all the tips. The design looks clearer to me now.

I have one more question. I looked into fts_build_want_index_part() and saw that I need to add some flags to message_part_flags. What values should I choose? My first approach was to follow your scheme and set MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?

I have already changed parse_content_type() to set ctx->part->flags correctly, but if I choose my custom flag Dovecot assumes that all attachment lines are headers. I also tried setting ctx->part->flags to TEXT, and the FTS backend was fed correctly with all attachment lines.

I don't know if this is related to the value of MESSAGE_PART_FLAG_ATTACHMENT or if I am missing something (like setting block.hdr = NULL, or some more code to handle new flags).

Thank you, Rui Carneiro

-- mobile: +351 963446125 mail: rui.arc@gmail.com mail: ei04073@fe.up.pt website: http://paginas.fe.up.pt/~ei04073

Rui Carneiro

20 Apr

5:29 p.m.

Hi,

The problem was the flag. My hex-to-binary conversion was wrong.

Regards, Rui Carneiro
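
For context on why 0x16 cannot work as a flag: it is decimal 22, binary 10110, so it sets three bits at once (0x10, 0x04 and 0x02) instead of one. The sketch below shows the check; the flag names and values are illustrative assumptions, not Dovecot's actual message_part_flags, and the thread does not say which overlapping bit caused the header confusion.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout only; the real enum lives in Dovecot's lib-mail
   and its values may differ. */
enum part_flags {
	PART_FLAG_MULTIPART  = 0x01,
	PART_FLAG_RFC822     = 0x04,
	PART_FLAG_TEXT       = 0x08,
	PART_FLAG_ATTACHMENT = 0x80  /* hypothetical new flag on a free bit */
};

/* A usable flag value has exactly one bit set. */
static bool is_single_bit(unsigned int value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* True if the candidate shares any bit with flags already in use. */
static bool collides(unsigned int candidate, unsigned int used_mask)
{
	assert(used_mask != 0);
	return (candidate & used_mask) != 0;
}
```

With this layout, is_single_bit(0x16) is false and collides(0x16, PART_FLAG_RFC822) is true, whereas a value such as 0x10 or 0x80 passes the single-bit check.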

Rui Carneiro

21 Apr

1:25 p.m.

Hi again,

Does anyone know some good libraries to handle the content of files like pdf, ppt, doc, etc.? I am already indexing attachments; all I need now is to extract their text.

Regards, Rui Carneiro

Timo Sirainen

7:32 p.m.

On Apr 21, 2009, at 6:25 AM, Rui Carneiro wrote:

...

Does anyone know some good libraries to handle the content of files like
pdf, ppt, doc, etc.? I am already indexing attachments; all I need now is
to extract their text.

I've no idea, but you could at least look at some of the other full
text search engines. I remember them advertising indexing support for
all kinds of formats. Maybe they're using some specific library or
maybe it would be easy to extract their parsing code.

Rui Carneiro

7:52 p.m.

Great idea!

I will report back soon.

Rui Carneiro

22 Apr

5:51 p.m.

Hi,

Almost all the full-text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they use other applications like pdftotext, catdoc, catppt, etc. and call them with execvp (or equivalent). Using this approach in my project has some pros and cons:

Pros:

The existing libraries to extract the content of pdf, doc, etc. are not very stable.
Easier to handle errors (even if those applications crash, Dovecot will still be running)
Less development time

Cons:

Some programs that parse special formats (e.g. catppt and pdftotext) do not accept input from stdin (we need to create temporary files).

Which approach would be better: using applications like pdftotext and catdoc or, on the other hand, using their libraries and doing it almost from scratch?

Regards, Rui Carneiro
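
A minimal sketch of the external-application approach, assuming a converter that writes its text to stdout. The function name run_converter() is hypothetical; the point is that the converter runs in a child process, so a crash there is seen only as a failed exit status by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] with its arguments and stores up to outsize-1 bytes of its
   stdout in out (NUL-terminated). Returns the number of bytes stored, or
   -1 if the child could not be started, crashed or exited non-zero. */
static ssize_t run_converter(char *const argv[], char *out, size_t outsize)
{
	int fds[2], status;
	size_t used = 0;
	ssize_t n;
	char discard[4096];
	pid_t pid;

	assert(argv != NULL && argv[0] != NULL && outsize > 0);
	if (pipe(fds) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: redirect stdout into the pipe, then become the converter */
		dup2(fds[1], STDOUT_FILENO);
		close(fds[0]);
		close(fds[1]);
		execvp(argv[0], argv);
		_exit(127); /* only reached if exec failed */
	}
	close(fds[1]);
	/* keep what fits and drain the rest so the child never blocks */
	for (;;) {
		if (used < outsize - 1) {
			n = read(fds[0], out + used, outsize - 1 - used);
			if (n > 0)
				used += (size_t)n;
		} else {
			n = read(fds[0], discard, sizeof(discard));
		}
		if (n <= 0)
			break;
	}
	close(fds[0]);
	out[used] = '\0';
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return (ssize_t)used;
}
```

A caller would pass something like { "pdftotext", path, "-", NULL }; the exact arguments each converter needs are not covered in this thread and should be checked against that program's own documentation.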

Timo Sirainen

7:38 p.m.

On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote:

...

Hi,

Almost all the full-text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they use other applications like pdftotext, catdoc, catppt, etc. and call them with execvp (or equivalent). Using this approach in my project has some pros and cons:

Pros:

The existing libraries to extract the content of pdf, doc (etc) are not very stable.

Easier to handle errors (even if those applications crash, Dovecot will still be running)

Hmm. I hadn't thought of this before. Yeah, if they're not stable it's probably not a good idea to run in the same process as the rest of Dovecot. But I guess there could be some kind of a separate text extracting process that fts plugin would talk to. If that process dies it could get restarted automatically and fts could maybe retry and if it dies again log it and just skip over it.

...

Some programs that parse special formats (e.g. catppt and pdftotext) do not accept input from stdin (we need to create temporary files).

Maybe those programs could be changed and just require the newer versions?..

...

Which approach would be better: using applications like pdftotext and catdoc or, on the other hand, using their libraries and doing it almost from scratch?

I think the API that fts plugin uses to do the conversion should be generic enough that both approaches would work. Then it would be easier to implement one or another or both eventually.
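
One way such a generic API could look, sketched under assumptions: a backend is just a function pointer plus context, so an external program, a helper process or an in-process library could all sit behind it. The retry-once-then-skip policy follows the idea described earlier in this message. All names here are hypothetical, and flaky_extract() exists only to demonstrate the retry path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend interface. */
struct extract_backend {
	const char *name;
	/* returns 0 on success, -1 on failure (e.g. the helper died) */
	int (*extract)(void *context, const char *path);
	void *context;
};

enum extract_result { EXTRACT_OK, EXTRACT_SKIPPED };

/* Tries the backend, retries once after a failure, then gives up so that
   one bad attachment cannot block indexing of the remaining mails. */
static enum extract_result
extract_with_retry(const struct extract_backend *backend, const char *path)
{
	int attempt;

	assert(backend != NULL && backend->extract != NULL);
	for (attempt = 0; attempt < 2; attempt++) {
		if (backend->extract(backend->context, path) == 0)
			return EXTRACT_OK;
	}
	/* a real plugin would log the failure here */
	return EXTRACT_SKIPPED;
}

/* Demo backend: fails while the int countdown in context is positive. */
static int flaky_extract(void *context, const char *path)
{
	int *failures_left = context;

	(void)path;
	if (*failures_left > 0) {
		(*failures_left)--;
		return -1;
	}
	return 0;
}
```

A backend that fails once succeeds on the second attempt; one that keeps failing is skipped after two tries.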

Rui Carneiro

8:23 p.m.

On Wed, Apr 22, 2009 at 5:38 PM, Timo Sirainen <tss@iki.fi> wrote:

...

Maybe those programs could be changed and just require the newer versions?..

I will talk with the developers of those applications about the possibility of supporting stdin input (if not supported yet).

I think the API that fts plugin uses to do the conversion should be generic enough that both approaches would work. Then it would be easier to implement one or another or both eventually.

I think I will try the external applications approach. I don't have much development time available. I will make the API as generic as I can to allow for improvements in the future.

Regards, Rui Carneiro

Steffen Kaiser

23 Apr

1:42 p.m.

On Wed, 22 Apr 2009, Rui Carneiro wrote:

...

I will talk with the developers of those applications about the possibility of supporting stdin input (if not supported yet).

I think the API that fts plugin uses to do the conversion should be generic enough that both approaches would work. Then it would be easier to implement one or another or both eventually.

I think I will try the external applications approach. I don't have much development time available.

Actually, considering what the xls-to-HTML converter did lately to our webmail frontend, I suggest indexing "alien" formats asynchronously, maybe in a low-priority process, not only to avoid potentially long conversion times and resource requirements, but also to prevent MUAs from re-initiating the search and forcing the IMAP server to index the same file simultaneously.

Bye,

Steffen Kaiser
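
One simple way to keep two processes from converting the same attachment at once is an exclusive lock file. This is only a sketch of that idea, not something proposed in the thread; the function names and the lock path scheme are assumptions, and a real implementation would also need to deal with stale locks left behind by a crashed worker.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Tries to claim a conversion job by creating lock_path exclusively.
   Returns true if this process now owns the job, false if another
   process already holds it or the lock could not be created. */
static bool claim_conversion(const char *lock_path)
{
	int fd;

	assert(lock_path != NULL);
	fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return false;
	close(fd);
	return true;
}

/* Releases the claim once the text has been extracted and indexed. */
static void release_conversion(const char *lock_path)
{
	assert(lock_path != NULL);
	(void)unlink(lock_path);
}
```

A second search arriving while the first conversion is still running would see claim_conversion() fail and could wait or skip instead of starting the same work again. The worker itself could lower its priority with nice() before converting.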

tomas＠tuxteam.de

7:47 a.m.

On Wed, Apr 22, 2009 at 03:51:45PM +0100, Rui Carneiro wrote:

[...]

...

Cons:

Some programs that parse special formats (e.g. catppt and pdftotext) do not accept input from stdin (we need to create temporary files).

[from the peanut gallery here]

Note that some formats might require to seek to some point in the file [1] (typically the end), so reading from stdin is awkward (it would require stdin to be seekable, so either the app or the caller would have to put the whole file somewhere anyway).

[1] Notably PDF has some index tables at EOF - 1k if I remember correctly.

Regards

-- tomás
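
The seekability problem can be shown in a few lines: a pipe cannot be repositioned, so whoever needs to seek has to copy the stream somewhere seekable first. The helper below is a hypothetical sketch of that copy using a temporary file; the path template is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies everything readable from in_fd into a fresh temporary file and
   returns a descriptor positioned at offset 0, or -1 on error. The file
   is unlinked right away, so it vanishes when the descriptor is closed. */
static int spool_to_seekable(int in_fd)
{
	char path[] = "/tmp/fts-spool-XXXXXX";
	char buf[4096];
	ssize_t n, written, ret;
	int out_fd;

	assert(in_fd >= 0);
	out_fd = mkstemp(path);
	if (out_fd < 0)
		return -1;
	unlink(path);
	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		for (written = 0; written < n; written += ret) {
			ret = write(out_fd, buf + written, (size_t)(n - written));
			if (ret <= 0) {
				close(out_fd);
				return -1;
			}
		}
	}
	if (n < 0 || lseek(out_fd, 0, SEEK_SET) < 0) {
		close(out_fd);
		return -1;
	}
	return out_fd;
}
```

On the returned descriptor, lseek(fd, 0, SEEK_END) works and reports the file size, which is what a parser needs in order to find tables stored near the end of the file. The same call on the original pipe fails.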

Rui Carneiro

5 May

2:08 p.m.

Hi again,

On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen <tss@iki.fi> wrote:

...

fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then before feeding the blocks to FTS backend put them through your own converter function, something like:

int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);

Let's take the example of an application/pdf content type. Before I can convert the PDF data to text, I need to gather all of it. The current process feeds the FTS backend with small pieces of data and appends them in the "build_more" functions (e.g. fts_backend_solr_build_more()).

So where should I call attachment_extract_text()? In fts_backend_solr_build_more(), without appending to cmd until the data has been extracted? Or should I gather all the data earlier (e.g. in fts_build_mail()) and send it all at once to the FTS backend?

I hope I've made myself clear.

Regards, Rui Carneiro

Portugalmail, Comunicações S.A. www.portugalmail.net

Timo Sirainen

13 May

7:26 p.m.

On Tue, 2009-05-05 at 12:08 +0100, Rui Carneiro wrote:

...

...

fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then before feeding the blocks to FTS backend put them through your own converter function, something like:

int attachment_extract_text(struct attachment_extract_context *ctx, const struct message_block *input, struct message_block *output);

Let's take the example of an application/pdf content type. Before I can convert the PDF data to text, I need to gather all of it. The current process feeds the FTS backend with small pieces of data and appends them in the "build_more" functions (e.g. fts_backend_solr_build_more()).

Right.

...

So where should I call attachment_extract_text()? In fts_backend_solr_build_more(), without appending to cmd until the data has been extracted? Or should I gather all the data earlier (e.g. in fts_build_mail()) and send it all at once to the FTS backend?

Since others already mentioned that many formats pretty much require having the entire file available, I guess it's better to just save all the attachments to file at some point. So if I wrote the code it would probably work something like:

You notice a non-text/* content-type and initialize text extraction for the MIME part. Like:

struct attachment_extract_context * attachment_extract_init(const char *content_type);

After this you feed all the input belonging to that MIME part to:

int attachment_extract_add(struct attachment_extract_context *ctx, const struct message_block *input);

Don't output anything to FTS backend at this point. The attachment_extract_add() would probably just basically write to a temporary file.

Finally you'll notice that the MIME part ends (either you get headers for the next MIME part or the entire message ends). Then finish the extraction, which actually executes whatever conversion binaries are needed:

int attachment_extract_finish(struct attachment_extract_context *ctx);

Get the resulting text to fts_backend_build_more() somehow. Either some attachment_extract_add_to_fts() which internally adds it or some kind of an iterator that returns the text in smaller blocks. Either would work..

That kind of an API would also make it possible to pretty easily modify in future to not write temporary files for specific content types if it's not required.
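
A minimal sketch of how the three calls above could fit together when backed by a temporary file. The function names and argument order come from the prototypes in this message; everything else is an assumption: the struct message_block here is a simplified stand-in for Dovecot's real one, the context layout and temp path are invented, and the converter step is left as a comment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified stand-in; the real struct in Dovecot has more fields. */
struct message_block {
	const unsigned char *data;
	size_t size;
};

/* Hypothetical context layout. */
struct attachment_extract_context {
	char content_type[128];
	char path[64]; /* temporary file holding the raw attachment */
	int fd;
};

struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
	struct attachment_extract_context *ctx;

	assert(content_type != NULL);
	ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL)
		return NULL;
	strncpy(ctx->content_type, content_type, sizeof(ctx->content_type) - 1);
	strcpy(ctx->path, "/tmp/fts-attachment-XXXXXX");
	ctx->fd = mkstemp(ctx->path);
	if (ctx->fd < 0) {
		free(ctx);
		return NULL;
	}
	return ctx;
}

/* Appends one decoded block to the temporary file; nothing is sent to the
   FTS backend yet. Returns 0 on success, -1 on write failure. */
int attachment_extract_add(struct attachment_extract_context *ctx,
			   const struct message_block *input)
{
	size_t done = 0;
	ssize_t ret;

	assert(ctx != NULL && input != NULL);
	while (done < input->size) {
		ret = write(ctx->fd, input->data + done, input->size - done);
		if (ret <= 0)
			return -1;
		done += (size_t)ret;
	}
	return 0;
}

/* End of the MIME part: the whole attachment is on disk, so this is where
   a converter would be run on ctx->path. Here it only cleans up. */
int attachment_extract_finish(struct attachment_extract_context *ctx)
{
	int ret;

	assert(ctx != NULL);
	ret = close(ctx->fd);
	/* ... run the converter on ctx->path and pass its text to FTS ... */
	unlink(ctx->path);
	free(ctx);
	return ret;
}
```

Because callers only see init, add and finish, a later version could keep small or stream-friendly formats in memory without changing the calling code.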

Rui Carneiro

15 May

7:47 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

I tried your approach and I think it is working pretty well. Now I only need to look carefully at the output of the external programs and build the XML correctly to send to Solr.

Thanks Timo

Regards, Rui Carneiro

Rui Carneiro

18 May

1:42 p.m.

Hi again,

I am having some trouble writing all the data to a file. When I finish writing the data, I try to open the file and it is corrupted.

The first thing I noticed is that all characters are converted to uppercase, which destroys the file format.

Where are the characters uppercased? Any other idea why the files are getting corrupted?

Thank you, Rui Carneiro

Timo Sirainen

6:10 p.m.

On May 18, 2009, at 6:42 AM, Rui Carneiro wrote:

...

I am having some trouble writing all the data to a file. When I finish
writing the data, I try to open the file and it is corrupted.

The first thing I noticed is that all characters are converted to
uppercase, which destroys the file format.

Where are the characters uppercased?

Hmm. I'll see about getting it fixed in a better way, but for now you
could just change:

decoder = message_decoder_init(TRUE);

to:

decoder = message_decoder_init(FALSE);

I'm thinking about making message_decoder uppercase only text/* body
parts.

...

Any other idea why files are getting corrupted?

Nope. If you still see corruption, try with some simple test mails and
see if it's adding garbage, losing contents or adding more content.

Rui Carneiro

7:35 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

Nope. If you still see corruption, try with some simple test mails and see if it's adding garbage, losing contents or adding more content.

I tried something more advanced than that. I hexdumped my PDF test file, and on the first line I get:

00000000 25 50 44 46 2d 31 2e 33 0a 25 e2 e3 cf d3 0a 31

Where "e2 e3 cf d3" is binary data. When I do the same for my copied file I get:

00000000 25 50 44 46 2d 31 2e 33 0a 25 ef bf bd 0a 31 20

It is weird, but the binary data changed.

I also logged the 11th byte of the first block.data just before fts_backend_build_more(), and the value is EF (the correct one would be E2).

I think the binary data is being corrupted somewhere before fts_backend_build_more(), and I have no idea where.

Any help would be appreciated.

Thank you, Rui Carneiro

Timo Sirainen

8:10 p.m.

On Mon, 2009-05-18 at 17:35 +0100, Rui Carneiro wrote:

...

I think binary data is being corrupted anywhere before fts_backend_build_more() and I don't have any idea where.

All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it tries to force giving only valid UTF-8 output. I guess it should have some flag or something that makes it do that only for text/* parts, not for binary parts. OK, implemented, see if it works with this and using the flag:

http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d
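
For reference, EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character, which fits the explanation above. The sketch below is not Dovecot's decoder; it is a simplified sanitizer that assumes each run of invalid bytes collapses into a single U+FFFD, which is the behavior the two hexdumps suggest. Function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the UTF-8 sequence starting at p (1-4), or 0 if the bytes do
   not form a well-formed sequence within the remaining len bytes.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_seq_len(const unsigned char *p, size_t len)
{
	size_t need, i;

	if (p[0] < 0x80)
		return 1;
	else if ((p[0] & 0xe0) == 0xc0)
		need = 2;
	else if ((p[0] & 0xf0) == 0xe0)
		need = 3;
	else if ((p[0] & 0xf8) == 0xf0)
		need = 4;
	else
		return 0;
	if (need > len)
		return 0;
	for (i = 1; i < need; i++) {
		if ((p[i] & 0xc0) != 0x80)
			return 0;
	}
	return need;
}

/* Copies in to out, replacing each run of invalid bytes with one U+FFFD
   (EF BF BD). out needs room for 3 * len bytes. Returns bytes written. */
static size_t utf8_force_valid(const unsigned char *in, size_t len,
			       unsigned char *out)
{
	size_t i = 0, o = 0, n, k;
	bool in_bad_run = false;

	assert(in != NULL && out != NULL);
	while (i < len) {
		n = utf8_seq_len(in + i, len - i);
		if (n == 0) {
			if (!in_bad_run) {
				out[o++] = 0xef;
				out[o++] = 0xbf;
				out[o++] = 0xbd;
			}
			in_bad_run = true;
			i++;
			continue;
		}
		in_bad_run = false;
		for (k = 0; k < n; k++)
			out[o++] = in[i++];
	}
	return o;
}
```

Fed the bytes 25 e2 e3 cf d3 0a from the original PDF, this produces 25 ef bf bd 0a, the same pattern seen in the copied file. That is why a binary attachment must bypass this kind of text cleanup.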

Rui Carneiro

19 May

4:40 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it tries to force giving only valid UTF-8 output. I guess it should have some flag or something that makes it do that only for text/* parts, not for binary parts. OK, implemented, see if it works with this and using the flag:

http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d

It is working now, but I needed to make some changes to your code.

When you check charset_utf8 and charset_trans there is a problem in the attachment case. Attachment parts do not have any charset defined in their headers so, by default, charset_utf8=1 and charset_trans is garbage (I have no idea where that garbage comes from).

To avoid this problem, move the lines of code that set ctx->binary_input to the beginning of the function.

Please see the attachment and check for any problems that may exist.

Thank you, Rui Carneiro

Timo Sirainen

10:51 p.m.

On Tue, 2009-05-19 at 14:40 +0100, Rui Carneiro wrote:

...

...
http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d

It is working now, but I needed to make some changes to your code.

OK.

...

Please see the attachment and check for any problems that may exist.

You forgot the attachment.

Rui Carneiro

11:40 p.m.

On Tue, May 19, 2009 at 8:51 PM, Timo Sirainen <tss@iki.fi> wrote:

...

You forgot the attachment.

Oh, sorry. I am not at the office now (almost 10 pm here); I will send it tomorrow morning.

Rui Carneiro

Rui Carneiro

20 May

11:18 a.m.

Now with the attachment.

Rui Carneiro

22 May

8:24 p.m.

Hi Timo,

I have almost finished the changes to the fts plugin. So far it seems to work fine with attachments (extracting them and sending them to Solr). I only have a problem with the maximum size of the command (cmd) that we can send to Solr:

#define SOLR_CMDBUF_SIZE (1024*64)

Currently, if we send a message bigger than this value, the fts plugin crashes.

Is there anything in your TODO list that solves this problem?

Regards, Rui Carneiro

PS: I will send you my code for your approval as soon as possible :)

Timo Sirainen

8:29 p.m.

On Fri, 2009-05-22 at 18:24 +0100, Rui Carneiro wrote:

...

The problem is something else. The Solr code simply tries to keep the send buffer smaller than that, nothing would break if you sent a larger buffer. Show gdb backtrace of the crash?
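
To illustrate the difference between a flush threshold and a hard capacity, here is a hypothetical growable buffer that accepts input of any size and merely flushes once a soft limit is reached. It is not the Solr backend's code; the struct, the function names and the stand-in flush are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CMDBUF_SOFT_LIMIT (1024*64) /* same value as SOLR_CMDBUF_SIZE */

/* Hypothetical growable command buffer. */
struct cmdbuf {
	char *data;
	size_t used, allocated;
	unsigned int flushes; /* how many times the buffer was sent */
};

/* Stand-in for sending the buffered data to the server. */
static void cmdbuf_flush(struct cmdbuf *buf)
{
	if (buf->used > 0)
		buf->flushes++;
	buf->used = 0;
}

/* Appends data, growing the buffer as needed, and flushes once the soft
   limit is reached. Input larger than the limit is still accepted.
   Returns 0 on success, -1 if memory runs out. */
static int cmdbuf_append(struct cmdbuf *buf, const char *data, size_t size)
{
	char *grown;

	assert(buf != NULL && data != NULL);
	if (size == 0)
		return 0;
	if (buf->used + size > buf->allocated) {
		grown = realloc(buf->data, buf->used + size);
		if (grown == NULL)
			return -1;
		buf->data = grown;
		buf->allocated = buf->used + size;
	}
	memcpy(buf->data + buf->used, data, size);
	buf->used += size;
	if (buf->used >= CMDBUF_SOFT_LIMIT)
		cmdbuf_flush(buf);
	return 0;
}
```

With a design like this, appending 128 kB in one call works and simply triggers a flush, so a crash that disappears when the constant is raised points to a bug elsewhere.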

Rui Carneiro

8:57 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

The problem is something else. The Solr code simply tries to keep the send buffer smaller than that, nothing would break if you sent a larger buffer. Show gdb backtrace of the crash?

I blamed the buffer size because when I increased it, Dovecot didn't crash.

It's Friday, and I won't be able to get the gdb backtrace over the weekend, but it will be the first thing I do on Monday morning.

Regards, Rui Carneiro

Timo Sirainen

9:03 p.m.

On Fri, 2009-05-22 at 18:57 +0100, Rui Carneiro wrote:

...

Quoting Timo Sirainen <tss@iki.fi>:

...
The problem is something else. The Solr code simply tries to keep the send buffer smaller than that, nothing would break if you sent a larger buffer. Show gdb backtrace of the crash?

I blamed the buffer size because when I increased it, Dovecot didn't crash.

I guess it works around some other bug then. If it's a memory-related bug you could also see if valgrind complains about something:

protocol imap {
  ..
  mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap
}

Rui Carneiro

25 May

4:20 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

I guess it works around some other bug then. If it's a memory-related bug you could also see if valgrind complains something:

protocol imap {
  ..
  mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap
}

Here is the output (I cloned http://hg.dovecot.org/dovecot-1.2 and made no changes for this test):

ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 123 from 2)
malloc/free: in use at exit: 94,040 bytes in 1,032 blocks.
malloc/free: 1,704 allocs, 672 frees, 1,042,476 bytes allocated.
For counts of detected errors, rerun with: -v
searching for pointers to 1,032 not-freed blocks.
checked 111,072 bytes.

88,161 (328 direct, 87,833 indirect) bytes in 1 blocks are definitely lost in loss record 30 of 45
   at 0x4C24384: calloc (vg_replace_malloc.c:397)
   by 0x4AF165: pool_system_malloc (mempool-system.c:77)
   by 0x63E0DA2: ???
   by 0x63DF91D: ???
   by 0x5DBAF27: ???
   by 0x5DBBE50: ???
   by 0x46BBFF: mailbox_transaction_begin (mail-storage.c:794)
   by 0x42976F: imap_search_start (imap-search.c:540)
   by 0x4206D7: cmd_search (cmd-search.c:50)
   by 0x4232CB: client_command_input (client.c:608)
   by 0x423389: client_command_input (client.c:657)
   by 0x4239F4: client_handle_input (client.c:698)

LEAK SUMMARY:
   definitely lost: 328 bytes in 1 blocks.
   indirectly lost: 87,833 bytes in 1,016 blocks.
   possibly lost: 0 bytes in 0 blocks.
   still reachable: 5,879 bytes in 15 blocks.
   suppressed: 0 bytes in 0 blocks.
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-reachable=yes

Timo Sirainen

26 May

2:39 a.m.

On Mon, 2009-05-25 at 14:20 +0100, Rui Carneiro wrote:

...

ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 123 from 2)

So valgrind didn't find anything wrong. What does gdb show as the backtrace?

Rui Carneiro

12:46 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

So valgrind didn't find anything wrong.

Should we ignore the LEAK SUMMARY?

...

What does gdb show as the backtrace?

My gdb is not writing where it should (or not writing at all). Shouldn't this be enough?

mail_executable = /usr/local/libexec/dovecot/gdbhelper /usr/local/libexec/dovecot/imap

The crash occurs after everything has been indexed, when imap is returning the result.

Thank you, Rui Carneiro

Timo Sirainen

6:32 p.m.

On May 26, 2009, at 5:46 AM, Rui Carneiro wrote:

...

Quoting Timo Sirainen <tss@iki.fi>:

...
So valgrind didn't find anything wrong.

Should we ignore the LEAK SUMMARY?

At least for now. Memory leaks don't cause crashes.

...

...
What does gdb show as the backtrace?

My gdb is not writing where it should (or not writing at all).
Shouldn't this be enough?

mail_executable = /usr/local/libexec/dovecot/gdbhelper /usr/local/libexec/dovecot/imap

It's not writing /tmp/gdbhelper* files when crashing? Anyway there's
also one guaranteed way to get backtrace. Remove the gdbhelper and
then run:

gdb -p `pidof imap`
cont
<make it crash>
bt full

Rui Carneiro

8:49 p.m.

Quoting Timo Sirainen <tss@iki.fi>:

...

At least for now. Memory leaks don't cause crashes.

Ok.

...

gdb -p `pidof imap`
cont
<make it crash>
bt full

I think it won't be necessary. It is not crashing anymore. Maybe it was a bug in my code.

Tomorrow (or the day after) I will send you the code.

Thank you for all the support!

Regards, Rui Carneiro
