Strange indexing behavior on HTML emails ..
Hi,
Following up on the issue I posted about a while back:
http://www.dovecot.org/list/dovecot/2014-August/097362.html
I did further testing today on a fresh Debian install with the latest Dovecot and observed an undesired behavior. I am using fts_lucene and ran the following sequence of commands on an empty test account, me@myself.com:
doveadm expunge -u 'my@myself.com' mailbox 'INBOX' all
cat test.eml | /usr/lib/dovecot/dovecot-lda -e -f you@yourself.com -d me@myself.com
doveadm search -u 'akash@mailjol.in' mailbox 'INBOX' text ABCD
The search command finds or fails to find the email depending on slight variations in the content of test.eml. Here are the results:
test.eml content:
From: you@yourself.com
To: me@myself.com
Subject: Test Message
Content-Type: text/html

<div id="mydiv">ABCD 1234</div>
RESULT: OK. The email is found.
test.eml content (double quotes inside div tag replaced with single):
From: you@yourself.com
To: me@myself.com
Subject: Test Message
Content-Type: text/html

<div id='mydiv'>ABCD 1234</div>
RESULT: None. The email isn't found.
test.eml content (single quotes in div, but Content-Type header removed):
From: you@yourself.com
To: me@myself.com
Subject: Test Message

<div id='mydiv'>ABCD 1234</div>
RESULT: OK. The email is found.
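For anyone who wants to repeat the test, the three test.eml variants above can be generated with a short script. This is only a sketch: the function name, the file names, and the idea of writing one file per variant are assumptions made for illustration and were not part of the original test. Each file would then be piped to dovecot-lda as in the command sequence above.

```python
def build_test_message(quote='"', html_header=True):
    """Return one test.eml variant as a string.

    quote       -- quote character used around the div's id attribute
    html_header -- whether to include the Content-Type: text/html header
    """
    headers = [
        "From: you@yourself.com",
        "To: me@myself.com",
        "Subject: Test Message",
    ]
    if html_header:
        headers.append("Content-Type: text/html")
    body = f"<div id={quote}mydiv{quote}>ABCD 1234</div>"
    # A blank line separates the headers from the body.
    return "\n".join(headers) + "\n\n" + body + "\n"


# Hypothetical file names, one per case described in the message.
VARIANTS = {
    "double-quoted.eml": build_test_message('"', True),
    "single-quoted.eml": build_test_message("'", True),
    "single-quoted-no-content-type.eml": build_test_message("'", False),
}

if __name__ == "__main__":
    for name, text in VARIANTS.items():
        with open(name, "w") as fh:
            fh.write(text)
```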
What could be the reason for this?
-Akash
The issue is probably linked to:
http://www.dovecot.org/list/dovecot-cvs/2014-May/024462.html
But that changeset dates from 2014 and I am using Dovecot 2.2.19, so I don't understand why I am still seeing this behavior.
-Akash
I tried the latest source from HG, this time with Solr in addition to Lucene, which I had tested previously. The problem with single quotes in HTML is still there.
The revision:
http://hg.dovecot.org/dovecot-2.2/rev/ad028a950248
should have solved it, but the relevant code no longer exists in src/plugins/fts/fts-parser-html.c. It seems to have been moved into lib-mail. The file src/lib-mail/mail-html2text.c does contain something about single quotes, but to no avail. Can someone at least confirm the existence of this issue?
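To show how this class of bug can produce exactly the symptoms reported, here is a minimal HTML-to-text state machine. It is a hypothetical illustration only: it is not Dovecot's mail-html2text.c code, and the `buggy` switch models just one assumed way that single-quote handling could go wrong (waiting for the wrong closing character). In that mode a single-quoted attribute never terminates, so the rest of the message is swallowed and nothing reaches the indexer. It may also be why the variant without a Content-Type header is found: presumably it is not treated as HTML and never passes through such a parser.

```python
def html_to_text(html, buggy=False):
    """Strip tags from html with a tiny state machine (illustration only)."""
    TEXT, TAG, DQUOTED, SQUOTED = range(4)
    state = TEXT
    out = []
    for ch in html:
        if state == TEXT:
            if ch == "<":
                state = TAG
            else:
                out.append(ch)
        elif state == TAG:
            if ch == ">":
                state = TEXT
            elif ch == '"':
                state = DQUOTED
            elif ch == "'":
                state = SQUOTED
        elif state == DQUOTED:
            if ch == '"':
                state = TAG
        elif state == SQUOTED:
            # Assumed bug: the single-quoted state waits for a double quote,
            # so it never returns to TAG for input like id='mydiv'.
            closer = '"' if buggy else "'"
            if ch == closer:
                state = TAG
    return "".join(out)


if __name__ == "__main__":
    for sample in ('<div id="mydiv">ABCD 1234</div>',
                   "<div id='mydiv'>ABCD 1234</div>"):
        print(repr(html_to_text(sample, buggy=True)),
              repr(html_to_text(sample)))
```

With `buggy=True`, the double-quoted sample still yields its text while the single-quoted one yields an empty string, which matches the pattern of search results described earlier in the thread.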
On Wed, Oct 14, 2015 at 08:33:56PM +0530, Akash wrote:
> Tried latest source from HG and with solr also apart from lucene which I tested previously. The problem with single quotes in HTML is still there.
> The revision:
> http://hg.dovecot.org/dovecot-2.2/rev/ad028a950248
> should have solved it but the relevant code no longer exists in src/plugins/fts/fts-parser-html.c. Seems like it has been moved into lib-mail. The file src/lib-mail/mail-html2text.c does contain something about single quotes but to no avail. Can someone at least confirm existence of this issue?
Thanks for the report. Bug found. My bad. A patch is working its way through the internal process, and will be in the public tree soon.
Cheers, Phil
On Thu, Oct 15, 2015 at 02:19:22PM +0200, Jean-Baptiste Vignaud wrote:
> > Thanks for the report. Bug found. My bad. A patch is working its way through the internal process, and will be in the public tree soon.
> Hello, will this patch require reindexing Lucene?
Yes, unfortunately, it does.
Phil
participants (3): Akash, Jean-Baptiste Vignaud, Phil Carmody