[Dovecot] Japanese Search
Hello,
I'm trying to debug the SEARCH command with the Japanese 'ISO-2022-JP' charset.
ISO-2022-JP text needs to keep its character case until it is converted to UTF-8.
For example:
- '\033[$B%s\033[(B' is the character pronounced 'N'
- '\033[$B%S\033[(B' is the character pronounced 'Bi'
This causes trouble when searching for Japanese text. I found that add_new() in imap/imap-search.c uppercases the strings, and I fixed that, but the search results were not affected. ;-)
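The effect can be reproduced without Dovecot. What follows is a minimal sketch, not Dovecot's code: ascii_upper_copy() is a hypothetical helper that uppercases ASCII letters byte by byte, the way a charset-unaware uppercasing step would. It assumes the standard ISO-2022-JP escape sequences ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Dovecot): copy src into dst while
   uppercasing ASCII letters only. dst must hold strlen(src) + 1 bytes. */
static void ascii_upper_copy(char *dst, const char *src)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; src[i] != '\0'; i++) {
		unsigned char c = (unsigned char)src[i];

		/* Bytes >= 0x80 are copied as-is, so UTF-8 multi-byte
		   sequences survive unchanged. */
		dst[i] = (c < 0x80) ? (char)toupper(c) : (char)c;
	}
	dst[i] = '\0';
}

/* ISO-2022-JP: ESC $ B, the JIS X 0208 byte pair 0x25 0x73 ("%s"),
   then ESC ( B. */
static const char jis_n[] = "\033$B%s\033(B";

/* The same character encoded as UTF-8 (U+30F3). */
static const char utf8_n[] = "\xE3\x83\xB3";
```

Uppercasing jis_n turns the byte pair "%s" into "%S", which encodes a different character, while utf8_n comes back unchanged because all of its bytes are above 0x7F. That is why the key has to be converted to UTF-8 before it is uppercased.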
I hope I can fix this problem in a few days, but I'm not familiar with Dovecot's source. If you have any hints, please tell me.
thanks,
Kazuo Moriwaka moriwaka@valinux.co.jp
On 20.12.2004, at 04:04, Kazuo Moriwaka wrote:
> I'm trying to debug the SEARCH command with the Japanese 'ISO-2022-JP' charset.
> ISO-2022-JP text needs to keep its character case until it is converted to UTF-8.
> For example:
> - '\033[$B%s\033[(B' is the character pronounced 'N'
> - '\033[$B%S\033[(B' is the character pronounced 'Bi'
> This causes trouble when searching for Japanese text. I found that add_new() in imap/imap-search.c uppercases the strings, and I fixed that, but the search results were not affected. ;-)
I'll fix add_new(), but I'm not sure what else could be there. That value gets passed as the key parameter to message_body_search() and message_header_search_init(). Those call charset_to_ucase_utf8_string() to get an uppercase UTF-8 string from it, which is then compared to the text found in messages.
Did you check whether the ISO-2022-JP text is converted correctly to UTF-8 at all? Looking at charset_to_ucase_utf8() in lib-charset/charset-iconv.c might show something.
Hello,
From: Timo Sirainen <tss@iki.fi> Subject: Re: [Dovecot] Japanese Search Date: Mon, 20 Dec 2004 06:54:15 +0200
> I'll fix add_new(), but I'm not sure what else could be there. That value gets passed as the key parameter to message_body_search() and message_header_search_init(). Those call charset_to_ucase_utf8_string() to get an uppercase UTF-8 string from it, which is then compared to the text found in messages.
> Did you check whether the ISO-2022-JP text is converted correctly to UTF-8 at all? Looking at charset_to_ucase_utf8() in lib-charset/charset-iconv.c might show something.
Thank you for your fix, and I'm sorry for my mistake.
The problem was already fixed by the add_new() change, but I didn't notice because I had mixed up the binary files.
Now I can search headers with Japanese strings :-) (I tested some Subject and From headers.) Searching message bodies still doesn't work, though. I'll look into it.
thanks,
Kazuo Moriwaka moriwaka@valinux.co.jp
Hello,
From: Kazuo Moriwaka <moriwaka@valinux.co.jp> Subject: Re: [Dovecot] Japanese Search Date: Mon, 20 Dec 2004 16:08:54 +0900 (JST)
> Searching message bodies still doesn't work, though. I'll look into it.
While working on this, I found a missing NULL check: hdr_search_ctx can be NULL when the key is not valid. I made a patch for it.
best regards,
Kazuo Moriwaka moriwaka@valinux.co.jp
===================================================================
RCS file: /home/cvs/dovecot/src/lib-mail/message-body-search.c,v
retrieving revision 1.20
diff -c -r1.20 message-body-search.c
*** message-body-search.c	7 Nov 2004 15:21:29 -0000	1.20
--- message-body-search.c	20 Dec 2004 09:46:46 -0000
***************
*** 120,126 ****
  		if (hdr->eoh)
  			continue;
  
! 		if (!ctx->ignore_header) {
  			if (message_header_search(hdr->value, hdr->value_len,
  						  hdr_search_ctx)) {
  				found = TRUE;
--- 120,126 ----
  		if (hdr->eoh)
  			continue;
  
! 		if (!ctx->ignore_header && hdr_search_ctx) {
  			if (message_header_search(hdr->value, hdr->value_len,
  						  hdr_search_ctx)) {
  				found = TRUE;
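The guard in this patch can be shown in isolation. The sketch below is only an illustration under stated assumptions: struct hdr_search_sketch, hdr_search_sketch_init() and header_matches() are hypothetical stand-ins, not Dovecot's types or functions, and a plain strstr() replaces the real header matching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Dovecot's header search context. */
struct hdr_search_sketch {
	const char *key;
};

/* Returns NULL for an invalid (here: NULL) key, mirroring the case
   described above where no context can be created. */
static struct hdr_search_sketch *
hdr_search_sketch_init(struct hdr_search_sketch *storage, const char *key)
{
	assert(storage != NULL);
	if (key == NULL)
		return NULL;
	storage->key = key;
	return storage;
}

/* Searches a header value only when header matching is enabled AND a
   context exists; a NULL context is treated as "no match" instead of
   being dereferenced. */
static bool header_matches(bool ignore_header,
			   const struct hdr_search_sketch *ctx,
			   const char *value)
{
	if (ignore_header || ctx == NULL)
		return false;
	return strstr(value, ctx->key) != NULL;
}
```

With this shape, a caller that failed to create a context still gets a well-defined "not found" result rather than a crash.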
Hello,
From: Kazuo Moriwaka <moriwaka@valinux.co.jp> Subject: Re: [Dovecot] Japanese Search Date: Mon, 20 Dec 2004 19:00:41 +0900 (JST)
> While working on this, I found a missing NULL check: hdr_search_ctx can be NULL when the key is not valid. I made a patch for it.
It looks like message-header-search.c needs a patch as well. Please check it.
thanks,
Kazuo Moriwaka moriwaka@valinux.co.jp
Index: message-header-search.c
===================================================================
RCS file: /home/cvs/dovecot/src/lib-mail/message-header-search.c,v
retrieving revision 1.12
diff -c -r1.12 message-header-search.c
*** message-header-search.c	5 Jan 2003 13:09:52 -0000	1.12
--- message-header-search.c	20 Dec 2004 10:30:39 -0000
***************
*** 48,53 ****
--- 48,54 ----
  	if (key == NULL) {
  		/* invalid key */
+ 		p_free(ctx);
  		return NULL;
  	}
Hello,
From: Kazuo Moriwaka <moriwaka@valinux.co.jp> Subject: Re: [Dovecot] Japanese Search Date: Mon, 20 Dec 2004 16:08:54 +0900 (JST)
> Now I can search headers with Japanese strings :-) (I tested some Subject and From headers.) Searching message bodies still doesn't work, though. I'll look into it.
I found the reason why imapd cannot search message bodies in Japanese. charset_to_ucase_utf8_string() converts the key to a UTF-8 string, but context->charset keeps the old value (e.g. "iso-2022-jp").
On the second call to charset_to_ucase_utf8_string(), the charset (iso-2022-jp) and the key value (a UTF-8 encoded string) no longer match. This causes: iconv returns an error -> the key is treated as invalid -> the search fails.
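The mismatch is visible from the bytes alone. ISO-2022-JP is a 7-bit encoding, so a Japanese key that has already been converted to UTF-8, where every byte of a non-ASCII character has the high bit set, is not valid ISO-2022-JP input. The check below is a simplified stand-in for what a real converter rejects, not iconv's actual logic: looks_like_7bit_input() is a hypothetical name, and the sketch does not validate escape sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: returns true if no byte has the high bit set,
   which is a necessary condition for ISO-2022-JP input. */
static bool looks_like_7bit_input(const unsigned char *s, size_t len)
{
	size_t i;

	assert(s != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if ((s[i] & 0x80) != 0)
			return false;
	}
	return true;
}

/* The key as the client sent it: ESC $ B, "%s", ESC ( B. */
static const unsigned char key_iso2022jp[] = {
	0x1B, '$', 'B', '%', 's', 0x1B, '(', 'B'
};

/* The same key after the first conversion: U+30F3 as UTF-8. */
static const unsigned char key_utf8[] = { 0xE3, 0x83, 0xB3 };
```

The original key passes the check, while the converted key does not, so handing the converted key to a second conversion that still claims "iso-2022-jp" as the source charset has to fail.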
I attach a patch for this problem. I'm sorry that this patch is just a quick hack.
For message-header-search.c, I feel the patch isn't needed. It's just to be safe.
best regards,
Kazuo Moriwaka moriwaka@valinux.co.jp
Index: lib-mail/message-body-search.c
===================================================================
RCS file: /home/cvs/dovecot/src/lib-mail/message-body-search.c,v
retrieving revision 1.21
diff -r1.21 message-body-search.c
376c376
< 	ctx->charset = charset;
---
> 	ctx->charset = "UTF-8";
Index: lib-mail/message-header-search.c
===================================================================
RCS file: /home/cvs/dovecot/src/lib-mail/message-header-search.c,v
retrieving revision 1.13
diff -r1.13 message-header-search.c
57c57
< 	ctx->key_charset = p_strdup(pool, charset);
---
> 	ctx->key_charset = p_strdup(pool, "UTF-8");
Hello,
From: Kazuo Moriwaka <moriwaka@valinux.co.jp> Subject: Re: [Dovecot] Japanese Search Date: Tue, 21 Dec 2004 12:30:54 +0900 (JST)
> For message-header-search.c, I feel the patch isn't needed. It's just to be safe.
I read the code more carefully: the message-header-search.c patch doesn't serve my purpose and isn't needed. I'm sorry for my careless work.
regards,
Kazuo Moriwaka moriwaka@valinux.co.jp
Hello,
From: Kazuo Moriwaka <moriwaka@valinux.co.jp> Subject: Re: [Dovecot] Japanese Search Date: Tue, 21 Dec 2004 12:30:54 +0900 (JST)
> I found the reason why imapd cannot search message bodies in Japanese. charset_to_ucase_utf8_string() converts the key to a UTF-8 string, but context->charset keeps the old value (e.g. "iso-2022-jp").
> On the second call to charset_to_ucase_utf8_string(), the charset (iso-2022-jp) and the key value (a UTF-8 encoded string) no longer match. This causes: iconv returns an error -> the key is treated as invalid -> the search fails.
Let me add some more description.
The two calls to charset_to_ucase_utf8_string() are triggered by commands like the following:

  a001 search charset ***** body "*****"

When imapd receives it, it calls message_body_search().
The call flow looks like this:
message_body_search()
 +-> message_body_search_init()
 |    +-> charset_to_ucase_utf8_string()  <-- 1st (key is 'charset')
 +-> message_body_search_ctx()
      +-> message_search_header()
           +-> message_header_search_init()
                +-> charset_to_ucase_utf8_string()  <-- 2nd (key is utf-8)
My last patch is just a quick hack to avoid this. I think the search key should be initialized (converted to UTF-8) in or near index_storage_search_init() or imap_search().
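One way to picture the "convert once" idea is to keep the charset label together with the key and update both at the same time. This is only a sketch under assumptions: struct search_key_sketch, key_needs_conversion() and key_set_converted() are hypothetical names, not Dovecot APIs, and the actual conversion is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical search key: the bytes plus the charset they are
   currently encoded in. */
struct search_key_sketch {
	const char *data;
	const char *charset;
};

/* True while the key is still in the client's charset. */
static bool key_needs_conversion(const struct search_key_sketch *key)
{
	assert(key != NULL && key->charset != NULL);
	return strcasecmp(key->charset, "UTF-8") != 0;
}

/* Store the converted bytes and relabel the key in one step, so that
   later code never sees UTF-8 data tagged with the old charset. */
static void key_set_converted(struct search_key_sketch *key,
			      const char *utf8_data)
{
	assert(key != NULL && utf8_data != NULL);
	key->data = utf8_data;
	key->charset = "UTF-8";
}
```

Any later init function can then test key_needs_conversion() and skip the second conversion instead of failing on it.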
thanks,
Kazuo Moriwaka moriwaka@valinux.co.jp
On Fri, 2004-12-24 at 15:26 +0900, Kazuo Moriwaka wrote:
> The two calls to charset_to_ucase_utf8_string() are triggered by commands like the following:
>
>   a001 search charset ***** body "*****"
>
> When imapd receives it, it calls message_body_search().
> The call flow looks like this:
> message_body_search()
>  +-> message_body_search_init()
>  |    +-> charset_to_ucase_utf8_string()  <-- 1st (key is 'charset')
>  +-> message_body_search_ctx()
>       +-> message_search_header()
>            +-> message_header_search_init()
>                 +-> charset_to_ucase_utf8_string()  <-- 2nd (key is utf-8)
This happens only when it's searching MIME part headers, so it shouldn't affect the actual body searching? I did several tests and looks like it all works, except that one. I used this patch:

--- lib-mail/message-body-search.c	20 Dec 2004 12:51:18 -0000	1.21
+++ lib-mail/message-body-search.c	6 Jan 2005 21:39:27 -0000
@@ -109,8 +109,7 @@
 	hdr_search_ctx = message_header_search_init(pool_datastack_create(),
 						    ctx->body_ctx->key,
-						    ctx->body_ctx->charset,
-						    NULL);
+						    "UTF-8", NULL);
 	if (hdr_search_ctx == NULL) {
 		/* Invalid key. */
 		return FALSE;
Hello,
From: Timo Sirainen <tss@iki.fi> Subject: Re: [Dovecot] Japanese Search Date: Thu, 06 Jan 2005 23:39:49 +0200
> This happens only when it's searching MIME part headers, so it shouldn't affect the actual body searching? I did several tests and looks like it all works, except that one. I used this patch:
Thank you very much for your work. My test maildir contains some messages with multiple MIME parts. I'm sorry I forgot to mention that. I tested the patch against my Maildir with some Japanese search test cases, and it works well.
thanks,
Kazuo Moriwaka <moriwaka@valinux.co.jp>
participants (2)
- Kazuo Moriwaka
- Timo Sirainen