[Dovecot] Messages causing an error from Solr.

Erik Hetzner ehetzner at gmail.com
Mon Aug 23 21:47:10 EEST 2010


At Fri, 20 Aug 2010 20:39:47 +0100,
Timo Sirainen wrote:
> 
> On Fri, 2010-08-20 at 09:53 -0700, Erik Hetzner wrote:
> 
> > For what it’s worth, here are the “invalid XML character”s being
> > complained about by Solr:
> 
> Oh. It's not about illegal UTF8 sequences, but about some unicode
> characters actually not being valid for XML. Hopefully these help:
> 
> http://hg.dovecot.org/dovecot-1.2/rev/5efba9f9f0a7
> http://hg.dovecot.org/dovecot-1.2/rev/cf0da2cd31fb

Hi Timo,

Unfortunately, the second changeset (cf0da2cd31fb) seems to have
introduced a bug that results in every other character being dropped
from strings before they are indexed. For instance, my username `egh`
becomes `eh`, `spam` becomes `sa`, `drafts` becomes `dat`, etc.
Furthermore, I am not sure that the UTF-8 code is working as expected.
Attached is a patch that fixes the problem of every second character
being dropped and results in a Solr index that can be searched for
Unicode characters (at least, I tested it with Latin accents and with
Greek).
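
To make the symptom concrete, here is a rough sketch in Python (not
the Dovecot C code, and not the attached patch). It assumes the filter
is meant to keep only code points allowed by the XML 1.0 "Char"
production, and it reproduces the dropped characters with a
hypothetical double-advance loop. I am only guessing that something of
this kind is the cause; the function names are made up for
illustration.

```python
def is_valid_xml_char(cp: int) -> bool:
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Intended behaviour: drop only the code points XML 1.0 does not allow."""
    return "".join(ch for ch in text if is_valid_xml_char(ord(ch)))


def buggy_strip(text: str) -> str:
    """Hypothetical bug: the index is advanced twice per iteration,
    so every second character is skipped regardless of validity."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1  # advance past the character just read
        if is_valid_xml_char(ord(ch)):
            out.append(ch)
        i += 1  # assumed mistake: a second advance skips the next character
    return "".join(out)


if __name__ == "__main__":
    for word in ("egh", "spam", "drafts"):
        print(word, "->", buggy_strip(word), "(buggy),",
              strip_invalid_xml_chars(word), "(intended)")
```

With the double advance, `egh`, `spam` and `drafts` come out as `eh`,
`sa` and `dat`, which matches what I see in the index, while the
intended filter leaves them (and accented Latin or Greek text)
untouched.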

best, Erik

-------------- next part --------------
A non-text attachment was scrubbed...
Name: solr_unicode.diff
Type: application/octet-stream
Size: 759 bytes
Desc: not available
Url : http://dovecot.org/pipermail/dovecot/attachments/20100823/600e4044/attachment.obj 