Sieve regexp matches wrong

Sat May 23 10:28:43 EEST 2020

kverizhnikov writes:

> Hello!
>
> Recently I've run into a strange issue. I want to filter out mail
> that does not contain Cyrillic characters. I would not like to receive
> email in foreign languages other than Russian, so I'm using the rule
> below, but it does not work when the text of the mail contains the
> Unicode character U+2019 (’), the right single quotation mark.
>
>     require ["body","regex"];
>
>     # rule:[Regexp test]
>     if not body :text :regex ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ].*"
>     {
>         discard;
>         stop;
>     }

This must be because the regexp extension does not support Unicode.

Non-ASCII characters encoded in UTF-8 are treated as multiple
octets. For example, А (U+0410) is D0 90 and Й (U+0419) is D0 99. So
what your bracket expression really says is: match any single octet that
is D0, D1, or between 80 and BF. It does not match on Unicode characters
as you expected.

So your bracket expression includes the octet 99 and will match any body
that contains a 99 octet. The ’ (U+2019) encodes to E2 80 99, so it
matches your regular expression because it contains the octet 99.
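
To see this outside of Sieve, here is a small Python sketch that emulates
a bracket expression applied octet by octet. It is only an illustration
of the byte-level behaviour described above, not the code the regex
extension actually runs; the helper names (utf8_hex,
byte_bracket_matches) are made up for this example.

```python
# Letters from the bracket expression in the original rule.
CYRILLIC = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex octets."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# Every octet that a matcher without Unicode support sees inside the brackets.
bracket_octets = set(CYRILLIC.encode("utf-8"))


def byte_bracket_matches(text):
    """Emulate the bracket expression matching single octets, not characters."""
    return any(b in bracket_octets for b in text.encode("utf-8"))


print(utf8_hex("А"))  # D0 90
print(utf8_hex("Й"))  # D0 99
print(utf8_hex("’"))  # E2 80 99
print(byte_bracket_matches("It’s only English"))  # True
print(byte_bracket_matches("plain ASCII"))        # False
```

The English sentence matches only because of the octets in ’, which is
the false positive you are seeing.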

As a workaround you can use the alternation syntax:

    .*(а|А|б|Б|в|В|г|Г|д|Д|е|Е|ё|Ё|ж|Ж|з|З|и|И|й|Й|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш|щ|Щ|ъ|Ъ|ы|Ы|ь|Ь|э|Э|ю|Ю|я|Я).*

But this is still fragile.
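
To check the alternation at the octet level, the following Python sketch
builds the same list of alternatives as a bytes pattern, so each
alternative is a whole two-octet sequence rather than a single octet.
Python's re module is only a stand-in here and may differ from the regex
library your server uses; has_cyrillic is a hypothetical helper name.

```python
import re

# Letters from the alternation above.
CYRILLIC = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"

# Join the UTF-8 encoding of each letter with "|" to form a bytes pattern,
# so matching is done on octets, as in a matcher without Unicode support.
alternation = re.compile(
    b"|".join(re.escape(ch.encode("utf-8")) for ch in CYRILLIC)
)


def has_cyrillic(text):
    """Return True if any full two-octet Cyrillic sequence occurs in text."""
    return alternation.search(text.encode("utf-8")) is not None


print(has_cyrillic("It’s only English"))  # False
print(has_cyrillic("Привет"))             # True
```

Here the ’ no longer matches, because E2 80 99 does not contain any of
the complete sequences listed in the alternation.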

Kim Minh.