Re: Sieve regexp matches wrong

23 May 2020


kverizhnikov writes:
...
Hello!
Recently I've run into a strange issue. I want to filter out mail that
does not contain Cyrillic characters: I don't want to receive email in
any foreign language, only in Russian. I'm using the rule below, but it
does not work when the text of the mail contains the Unicode character
U+2019 (’, right single quotation mark).
require ["body","regex"];
# rule:[Regexp test]
if not body :text :regex ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ].*"
{
    discard;
    stop;
}
This must be because the regex extension does not support Unicode.
Non-ASCII characters encoded in UTF-8 are treated as multiple octets.
For example, А (U+0410) is D0 90 and Й (U+0419) is D0 99. What your
bracket expression really says is: match any octet that is D0 or D1, or
that lies between 80 and BF. It does not match on Unicode characters as
you expected.
So your set includes the byte 99 and will match any body that contains
a 99. The ’ encodes to E2 80 99, so it matches your regular expression
because it contains the octet 99.
As a workaround you can use the alternation syntax:
.*(а|А|б|Б|в|В|г|Г|д|Д|е|Е|ё|Ё|ж|Ж|з|З|и|И|й|Й|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш|щ|Щ|ъ|Ъ|ы|Ы|ь|Ь|э|Э|ю|Ю|я|Я).*
But this is still fragile.
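For comparison, a small Python sketch of why the alternation behaves
better under the same assumption of an octet-oriented engine: each
alternative is the complete two-byte sequence of one letter, so a stray
99 on its own no longer matches. The helper has_cyrillic is a made-up
name for this illustration only.

```python
import re

LETTERS = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"

# Assumption: bytes pattern stands in for an octet-oriented regex engine.
# Each alternative is the full UTF-8 encoding of a single letter.
alternation = re.compile(
    b"(" + b"|".join(ch.encode("utf-8") for ch in LETTERS) + b")"
)

def has_cyrillic(text: str) -> bool:
    """Return True if the UTF-8 bytes of text contain one of the letters."""
    return alternation.search(text.encode("utf-8")) is not None

print(has_cyrillic("it\u2019s"))       # False
print(has_cyrillic("Привет, it\u2019s me"))  # True
```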
Kim Minh.

Kim Minh Kaplan