[Dovecot] pigeonhole, regex, UTF-8

Tue Jul 13 20:51:47 EEST 2010

  On 07/13/2010 10:16 AM, Stephan Bosch wrote:
> The standard regexp library does not support unicode and I was not 
> planning to write my own regexp compiler any time soon.
I wouldn't want to write one either.
> As a matter of fact, I haven't looked at TRE before. I'm quite 
> interested though, since it is backwards compatible with POSIX and 
> seems to be available in most systems. I'll give it a closer look, 
> also in terms of compatibility with the latest draft of the Sieve 
> regex extension specification.
>
> Regards,
>
> Stephan.
>

There are a few odd things about the wide character support in TRE. 
Either you need to convert each message to wchar_t and make sure you set 
the system encoding to wchar_t, or you need to set the system encoding 
for each message, which may or may not mess up your UTF-8 regex.

My project is an Internet Classifier (used with things like Squid proxy 
to make a filter). I convert everything to wchar_t (using iconv with 
info gathered from headers) and use the wide character versions of the 
functions. That way I know everything is just fine. I then have the 
program set the system encoding (at least the environment variable for 
the given session) to UTF-8 before I do any of the regex compiling. 
Everything works wonderfully and quite quickly.
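
The conversion step described above could look roughly like the sketch 
below. It is only an illustration, not the classifier's actual code: the 
helper name to_wide is hypothetical, and it assumes an iconv that accepts 
"WCHAR_T" as a target encoding name (glibc does). The charset argument 
stands for whatever was gathered from the headers.

```c
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert `len` bytes of `src`, encoded in `charset`,
 * into a newly allocated, NUL-terminated wchar_t string.
 * Returns NULL if the charset is unknown or the input is invalid.
 * Assumes the iconv implementation knows the "WCHAR_T" encoding name. */
wchar_t *to_wide(const char *charset, const char *src, size_t len)
{
    iconv_t cd = iconv_open("WCHAR_T", charset);
    if (cd == (iconv_t)-1)
        return NULL;

    /* For common charsets each input byte yields at most one wide
     * character, so len + 1 elements leave room for the terminator.
     * calloc zero-fills the buffer, so the result is always terminated. */
    wchar_t *out = calloc(len + 1, sizeof(wchar_t));
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)src;
    size_t in_left = len;
    char *out_ptr = (char *)out;
    size_t out_left = len * sizeof(wchar_t);

    /* A single call is enough because the whole input is in memory;
     * any error (invalid or truncated sequence, no space) is a failure. */
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

The returned buffer is what would then be handed to TRE's wide-character 
functions (tre_regwcomp and tre_regwexec in recent TRE releases); the 
caller frees it with free().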

I am not sure TRE is available on all systems where Dovecot is designed 
to be compiled. I know it is for most, if not all, Unix-like systems. I 
use it in Fedora.

Anyway, thank you for your work on pigeonhole.

Trever