[Dovecot] pigeonhole, regex, UTF-8


Trever L. Adams

13 Jul 2010

4:11 p.m.

Hello,

I am just learning about pigeonhole and thinking of using it. I see that regex doesn't support UTF-8. Any particular reason for this?

If the problem is the library, have you looked at TRE? I am using it in a project (in wchar_t mode, because elsewhere all data is converted to wchar_t). It does work with UTF-8.

Thanks, Trever


Stephan Bosch

13 Jul

7:16 p.m.

Trever L. Adams wrote:

...

> Hello,
>
> I am just learning about pigeonhole and thinking of using it. I see that regex doesn't support UTF-8. Any particular reason for this?

The standard regexp library does not support Unicode, and I was not planning to write my own regexp compiler any time soon.

> If the problem is the library, have you looked at TRE? I am using it in a project (in wchar_t mode, because elsewhere all data is converted to wchar_t). It does work with UTF-8.

As a matter of fact, I haven't looked at TRE before. I'm quite interested though, since it is backwards compatible with POSIX and seems to be available on most systems. I'll give it a closer look, also in terms of compatibility with the latest draft of the Sieve regex extension specification.

Regards,

Stephan.
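
To make the limitation above concrete, here is a minimal sketch in Python (not Pigeonhole or POSIX regex code) of what goes wrong when a regex engine works on bytes rather than characters. Python's `re` module is used only as a stand-in: its bytes patterns behave like a byte-oriented engine, and its str patterns like a Unicode-aware one. The sample word is an assumption chosen for illustration.

```python
import re

word = "café"
utf8_bytes = word.encode("utf-8")  # 'é' becomes two bytes, so five bytes in total

# Byte-oriented matching: '.' consumes exactly one byte, so "caf."
# cannot cover the two-byte encoding of 'é'.
byte_hit = re.fullmatch(rb"caf.", utf8_bytes)

# Character-oriented matching: '.' consumes one code point.
text_hit = re.fullmatch(r"caf.", word)

print("byte-oriented match:", byte_hit is not None)
print("character-oriented match:", text_hit is not None)
```

The same mismatch affects bracket expressions and case-insensitive matching whenever the engine has no notion of multi-byte characters.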

Trever L. Adams

8:51 p.m.

On 07/13/2010 10:16 AM, Stephan Bosch wrote:

...

> The standard regexp library does not support Unicode, and I was not planning to write my own regexp compiler any time soon.

I wouldn't want to write one either.

> As a matter of fact, I haven't looked at TRE before. I'm quite interested though, since it is backwards compatible with POSIX and seems to be available on most systems. I'll give it a closer look, also in terms of compatibility with the latest draft of the Sieve regex extension specification.
>
> Regards,
>
> Stephan.

There are a few odd things about the wide character support in TRE. Either you need to convert each message to wchar_t and make sure you set the system encoding to wchar_t, or you need to set the system encoding for each message, which may or may not mess up your UTF-8 regex.

My project is an Internet Classifier (used with things like Squid proxy to make a filter). I convert everything to wchar_t (using iconv with info gathered from headers) and use the wide character versions of the functions. That way I know everything is just fine. I then have the program set the system encoding (at least the environment variable for the given session) to UTF-8 before I do any of the regex compiling. Everything works wonderfully and quite quickly.
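
The workflow described above (convert every input to one internal character representation, then match) can be sketched as follows. This is an illustration in Python, not the classifier's actual code, which uses iconv, wchar_t and TRE's wide-character functions. The function name `to_text`, the sample pattern and the sample inputs are all hypothetical, and the charset is assumed to have been read from the headers already.

```python
import re

def to_text(raw: bytes, declared_charset: str) -> str:
    """Decode raw bytes into a character string, using the charset
    declared in the headers (passed in by the caller)."""
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(declared_charset, errors="replace")

# One pattern, compiled once, applied to character strings only.
PATTERN = re.compile(r"caf.")

# The same word arriving in two different encodings.
samples = [
    (b"caf\xe9", "iso-8859-1"),
    (b"caf\xc3\xa9", "utf-8"),
]

matches = [PATTERN.fullmatch(to_text(raw, cs)) is not None for raw, cs in samples]
print(matches)
```

Because decoding happens first, the pattern never has to know which encoding the input arrived in.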

I am not sure TRE is available on all systems where Dovecot is designed to be compiled. I know it is for most, if not all, Unix-like systems. I use it in Fedora.

Anyway, thank you for your work on pigeonhole.

Trever

Perry E. Metzger

14 Jul

8:32 p.m.

On Tue, 13 Jul 2010 18:16:58 +0200 Stephan Bosch <stephan@rename-it.nl> wrote:

...

> As a matter of fact, I haven't looked at TRE before. I'm quite interested though, since it is backwards compatible with POSIX and seems to be available in most systems. I'll give it a closer look, also in terms of compatibility with the latest draft of the Sieve regex extension specification.

TRE has another significant advantage -- the algorithms it uses scale (for most regexes) linearly, instead of the exponential algorithms that Spencer-descended regex libraries often use. The difference in performance can be quite remarkable.

-- Perry E. Metzger perry@piermont.com
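
The scaling difference mentioned above can be illustrated with a toy comparison. This is a sketch only: it is neither TRE's code nor that of any Spencer-derived library. The pattern representation (a list of `(char, optional)` pairs) and both function names are invented for the example. It matches the pattern `a?` repeated n times followed by `a` repeated n times against a string of n `a` characters, once by recursive backtracking and once by tracking the set of reachable pattern positions, and counts the work done by each.

```python
def backtrack_match(pattern, text):
    """Recursive backtracking matcher; returns (matched, recursive calls made)."""
    steps = 0

    def go(i, j):
        nonlocal steps
        steps += 1
        if i == len(pattern):
            return j == len(text)
        ch, optional = pattern[i]
        # Greedy: first try to consume a character, then try skipping the token.
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        return optional and go(i + 1, j)

    return go(0, 0), steps


def closure(pattern, states):
    """Add every position reachable by skipping optional tokens."""
    result = set()
    for s in states:
        result.add(s)
        while s < len(pattern) and pattern[s][1]:
            s += 1
            result.add(s)
    return result


def state_set_match(pattern, text):
    """Match by advancing a set of pattern positions one input character at a
    time; returns (matched, number of state visits)."""
    steps = 0
    states = closure(pattern, {0})
    for c in text:
        advanced = set()
        for s in states:
            steps += 1
            if s < len(pattern) and pattern[s][0] == c:
                advanced.add(s + 1)
        states = closure(pattern, advanced)
    return len(pattern) in states, steps


n = 12
pattern = [("a", True)] * n + [("a", False)] * n  # a?{n} followed by a{n}
text = "a" * n

bt_ok, bt_steps = backtrack_match(pattern, text)
ss_ok, ss_steps = state_set_match(pattern, text)
print("backtracking:", bt_ok, bt_steps)
print("state set:   ", ss_ok, ss_steps)
```

Both approaches report a match, but the backtracker has to explore every consume-or-skip combination of the optional tokens (at least 2^n calls), while the state-set version visits at most one state per pattern position for each input character.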
