[Dovecot] fts squat non-english search for 2 words
Hello,
It looks like I have encountered a bug or a misconfiguration. fts_squat search in subject and body works excellently for English mail. For non-English mail (in particular, Russian) it works only when the query consists of one word. Phrases of two or more words always return nothing. Example: a search for "planet" ("планета") returns results, a search for "Earth" ("Земля") also returns results, but "planet Earth" ("планета Земля") returns nothing, even though there are emails containing the exact phrase "planet Earth". The problem occurs only for non-English queries, both when searching subjects and when searching message bodies. I tried the Horde 3.2 webmail client and Thunderbird. With the fts plugin turned off, phrases of two or more Russian words are found correctly, so the problem is in Squat. Is this a bug or a known configuration issue?
OS: Debian 5.0, installed as an OpenVZ container inside Ubuntu 8.04. Dovecot: 1.2.4, from backports.org. Filesystem: ext3.
dovecot -n
# 1.2.4: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.24-24-openvz i686 Debian 5.0.3 simfs
log_timestamp: %Y-%m-%d %H:%M:%S
protocols: imap imaps pop3 pop3s
ssl_cert_file: /etc/ssl/test123/test123.full.crt
ssl_key_file: /etc/ssl/test123/priv/test123.key
disable_plaintext_auth: no
login_dir: /var/run/dovecot/login
login_executable(default): /usr/lib/dovecot/imap-login
login_executable(imap): /usr/lib/dovecot/imap-login
login_executable(pop3): /usr/lib/dovecot/pop3-login
last_valid_uid: 500
mail_privileged_group: mail
mail_location: maildir:/var/mail/%u
mbox_write_locks: fcntl dotlock
mail_executable(default): /usr/lib/dovecot/imap
mail_executable(imap): /usr/lib/dovecot/imap
mail_executable(pop3): /usr/lib/dovecot/pop3
mail_plugins(default): quota imap_quota fts fts_squat
mail_plugins(imap): quota imap_quota fts fts_squat
mail_plugins(pop3):
mail_plugin_dir(default): /usr/lib/dovecot/modules/imap
mail_plugin_dir(imap): /usr/lib/dovecot/modules/imap
mail_plugin_dir(pop3): /usr/lib/dovecot/modules/pop3
imap_client_workarounds(default): outlook-idle delay-newmail
imap_client_workarounds(imap): outlook-idle delay-newmail
imap_client_workarounds(pop3):
pop3_client_workarounds(default):
pop3_client_workarounds(imap):
pop3_client_workarounds(pop3): outlook-no-nuls oe-ns-eoh
lda:
  postmaster_address: postmaster@test123.ru
  hostname: test123.ru
  sendmail_path: /usr/sbin/sendmail
  auth_socket_path: /var/run/dovecot/auth-master
  mail_plugins: quota sieve
  log_path:
  info_log_path:
auth default:
  mechanisms: plain login
  user: nobody
  passdb:
    driver: sql
    args: /etc/dovecot/dovecot-sql.conf
  userdb:
    driver: passwd
  userdb:
    driver: sql
    args: /etc/dovecot/dovecot-sql.conf
  userdb:
    driver: prefetch
  socket:
    type: listen
    client:
      path: /var/spool/postfix/private/auth
      mode: 432
      user: postfix
      group: mail
    master:
      path: /var/run/dovecot/auth-master
      mode: 432
      user: vmail
      group: mail
plugin:
  acl: vfile:/etc/dovecot/acls
  trash: /etc/dovecot/trash.conf
  fts: squat
  fts_squat: partial=4 full=20
(With full=10 the problem persists.)
Maybe I asked the wrong question. OK, does anybody use fts_squat for non-English emails? Can you find emails with a query of 2 WORDS, such as "planet Earth"? On my system it works only when both words are in the Latin alphabet; otherwise it returns nothing. For Latin queries, it even finds emails containing both Latin and Russian letters (UTF-8 encoding). For non-Latin queries, the query must consist of one word only.
Thanks for any ideas.
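For anyone who wants to reproduce this without Horde or Thunderbird in the way, here is a minimal sketch of how a two-word non-ASCII phrase can be sent in a raw IMAP SEARCH. It assumes the standard `CHARSET UTF-8` form with a literal (RFC 3501); the helper name `build_search_command` and the tag `A1` are made up for illustration, and real clients may build the command differently.

```python
from typing import Tuple


def build_search_command(tag: str, field: str, phrase: str) -> Tuple[bytes, bytes]:
    """Return (command line announcing a literal, literal payload).

    The client sends the first part, waits for the server's "+"
    continuation, then sends the second part.
    """
    payload = phrase.encode("utf-8")
    # {N} announces an N-byte literal; N counts bytes, not characters,
    # so each Cyrillic letter contributes 2 to the length.
    line = f"{tag} SEARCH CHARSET UTF-8 {field} {{{len(payload)}}}\r\n"
    return line.encode("ascii"), payload + b"\r\n"


line, payload = build_search_command("A1", "BODY", "планета Земля")
print(line)     # the command line with the literal length
print(payload)  # the UTF-8 encoded phrase, including the space
```

Sending the same command once with one word and once with two words should show whether the empty result comes from the server or from the client.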
I'll try to look into this when I have a bit more time..
On Wed, 2009-11-18 at 16:19 +0700, vuser1@test123.ru wrote:
Maybe I asked the wrong question. OK, does anybody use fts_squat for non-English emails? Can you find emails with a query of 2 WORDS, such as "planet Earth"? On my system it works only when both words are in the Latin alphabet; otherwise it returns nothing. For Latin queries, it even finds emails containing both Latin and Russian letters (UTF-8 encoding). For non-Latin queries, the query must consist of one word only.
Timo, thank you for the answer. Meanwhile I was trying to set up Horde + Dovecot + search. The next step was Dovecot 1.2.4 + Solr 1.4. It works! Now it can find two non-Latin words.
- I cannot search by substrings: neither "plane" nor "plane*" finds "planet".
- Solr itself can use "plane*" to find "planet", so I think Dovecot internally cuts or masks metacharacters.
I see on the wiki that you plan to implement an IMAP extension for this. Have you considered allowing users to use "*" wildcards with the Solr backend? If Dovecot already "breaks" IMAP search, why not let people use "plane* Ear*" to find "planet Earth"?
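To make the idea concrete, here is a rough sketch of what such a translation could look like. It is only an illustration of the proposal, not how Dovecot's Solr backend works: the function names and the `body` field name are hypothetical, and the escaped character set is the usual Lucene/Solr query-parser metacharacter list. Everything is escaped except a single trailing "*", which is passed through as a prefix wildcard.

```python
import re

# Characters with special meaning in the Lucene/Solr standard query parser.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def to_solr_term(word: str) -> str:
    """Escape one user-typed word, keeping one trailing '*' as a wildcard."""
    wildcard = len(word) > 1 and word.endswith("*")
    core = word[:-1] if wildcard else word
    escaped = _SPECIAL.sub(r"\\\1", core)
    return escaped + "*" if wildcard else escaped


def to_solr_query(text: str, field: str = "body") -> str:
    """Join the words with AND; 'field' is a hypothetical schema field."""
    return " AND ".join(f"{field}:{to_solr_term(w)}" for w in text.split())


print(to_solr_query("plane* Ear*"))
print(to_solr_query("a:b"))
```

A lone "*" or a "*" in the middle of a word is still escaped, so the only new behavior is the trailing prefix match.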
On Sun, 2009-11-22 at 20:35 +0700, vuser1@test123.ru wrote:
Timo, thank you for the answer. Meanwhile I was trying to set up Horde + Dovecot + search. The next step was Dovecot 1.2.4 + Solr 1.4. It works! Now it can find two non-Latin words.
- I cannot search by substrings: neither "plane" nor "plane*" finds "planet".
Can you check whether the attached patch helps?
- Solr itself can use "plane*" to find "planet", so I think Dovecot internally cuts or masks metacharacters.
Yes, and I don't really like changing that. Seems like it could make things even worse..
On Wed, 2009-11-18 at 00:53 +0700, vuser1@test123.ru wrote:
It looks like I have encountered a bug or a misconfiguration. fts_squat search in subject and body works excellently for English mail. For non-English mail (in particular, Russian) it works only when the query consists of one word. Phrases of two or more words always return nothing. Example: a search for "planet" ("планета") returns results, a search for "Earth" ("Земля") also returns results, but "planet Earth" ("планета Земля") returns nothing, even though there are emails containing the exact phrase "planet Earth". The problem occurs only for non-English queries, both when searching subjects and when searching message bodies.
This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/6541fcc3bf54
Timo, many thanks for this! I finally installed Dovecot 1.2.9 from Debian backports, and your fix has solved the problem. But look at this; it happens for both English and Russian emails:
- I have a test mailbox with ~27000 emails, big and small, 13 GB in total.
- Search (squat) for single word "planet" runs for 2-4 seconds.
- Search for another word "Earth" runs fast as well.
- Search for "planet Earth" runs for more than 3 minutes! And it uses a lot of I/O - server's HDD LED constantly blinks during the search.
I use the Horde/IMP mail client. I can't believe the problem lies in Squat's internal design; there must be something wrong in the implementation of the algorithm. With Thunderbird/Win32 the search delay is the same. Moreover, Thunderbird can't search for Russian words at all: it always returns no results. There are things to stabilize.
I must say that Squat is my preferred FTS engine; as you know, the Solr engine has issues. I am very interested in easy and powerful IMAP search and would like to help you make it even better, as a tester. Anyway, thank you for a great product!
-----Original Message-----
From: dovecot-bounces+vuser1=test123.ru@dovecot.org [mailto:dovecot-bounces+vuser1=test123.ru@dovecot.org] On Behalf Of Timo Sirainen
Sent: Tuesday, November 24, 2009 12:52 AM
To: vuser1@test123.ru
Cc: dovecot@dovecot.org
Subject: Re: [Dovecot] fts squat non-english search for 2 words
This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/6541fcc3bf54
On Thu, 2010-01-07 at 17:07 +0700, vuser1@test123.ru wrote:
Timo, many thanks for this! I finally installed Dovecot 1.2.9 from Debian backports, and your fix has solved the problem. But look at this; it happens for both English and Russian emails:
- I have a test mailbox with ~27000 emails, big and small, 13 GB in total.
- Search (squat) for single word "planet" runs for 2-4 seconds.
- Search for another word "Earth" runs fast as well.
- Search for "planet Earth" runs for more than 3 minutes! And it uses a lot of I/O - server's HDD LED constantly blinks during the search.
I suppose searching "planet Earth" in your client means the same as planet OR Earth. ORs aren't currently supported by Squat. It's been in TODO for a while..
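For reference, a small sketch of the three ways a client might turn "planet Earth" into IMAP SEARCH keys, plus a client-side way to emulate OR by merging two single-word searches. Which form Horde/IMP or Thunderbird actually sends is not confirmed in this thread; the helper names are made up, the merge step is only an assumed workaround, and the quoted-string form shown here is for ASCII words only.

```python
from typing import Iterable, List


def phrase_search(field: str, words: List[str]) -> str:
    """One key containing the whole phrase."""
    phrase = " ".join(words)
    return f'{field} "{phrase}"'


def and_search(field: str, words: List[str]) -> str:
    """Several keys in a row are implicitly ANDed by IMAP."""
    return " ".join(f'{field} "{w}"' for w in words)


def or_search(field: str, words: List[str]) -> str:
    """IMAP OR is a binary prefix operator, so nest it for 3+ words."""
    keys = [f'{field} "{w}"' for w in words]
    query = keys[-1]
    for key in reversed(keys[:-1]):
        query = f"OR {key} {query}"
    return query


def merge_or(results: Iterable[str]) -> List[int]:
    """Union of space-separated message numbers from single-word searches."""
    merged = set()
    for ids in results:
        merged.update(int(x) for x in ids.split())
    return sorted(merged)


words = ["planet", "Earth"]
print(phrase_search("BODY", words))
print(and_search("BODY", words))
print(or_search("BODY", words))
print(merge_or(["1 5 9", "5 12"]))
```

If the client really sends the OR form, running the two single-word searches separately and merging the results as in `merge_or` would keep each search on the fast indexed path, at the cost of two round trips.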
participants (2)
- Timo Sirainen
- vuser1@test123.ru