kverizhnikov writes:
Hello!
Recently I ran into a strange issue. I want to filter out mail that does not contain Cyrillic symbols. I don't want to receive email in foreign languages other than Russian, so I'm using the rule below, but it does not work when the text of the mail contains the Unicode character U+2019 (’, right single quotation mark).
require ["body","regex"];
# rule:[Regexp test]
if not body :text :regex ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ].*"
{
 discard;
 stop;
}
This must be because the regexp extension does not support Unicode.
Non-ASCII characters encoded in UTF-8 are treated as multiple octets. For example, А (U+0410) is D0 90 and Й (U+0419) is D0 99. So what your bracket expression really says is: match any single octet that occurs in the encoding of one of those letters, that is D0, D1, or anything from 80 to BF. It does not match whole Unicode characters as you expected.
So your set includes the octet 99 and will match any body that contains a 99. The ’ (U+2019) encodes to E2 80 99, so it matches your regular expression because it contains the octets 80 and 99, even though the mail has no Cyrillic in it. That is why your "not ... discard" rule lets the message through.
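To see the effect outside of Sieve, here is a small Python sketch. It is only an illustration: Python's `re` in bytes mode stands in for a byte-oriented regex engine, and whether your Sieve implementation behaves exactly like this is an assumption. The names `LETTERS`, `byte_class` and `matches_bytewise` are made up for the example.

```python
import re

# The letters from the bracket expression in the rule.
LETTERS = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"

# A byte-oriented engine sees the bracket expression as a set of
# single octets, not as a set of characters.
octets = sorted(set(LETTERS.encode("utf-8")))

# Build the equivalent byte-level bracket expression.
byte_class = re.compile(
    b"[" + b"".join(re.escape(bytes([o])) for o in octets) + b"]"
)


def matches_bytewise(text: str) -> bool:
    """Return True if any single octet of the UTF-8 text is in the set."""
    return byte_class.search(text.encode("utf-8")) is not None


if __name__ == "__main__":
    print(" ".join(f"{o:02X}" for o in octets))
    print("’".encode("utf-8").hex(" "))
    print(matches_bytewise("It’s fine"))    # no Cyrillic, still matches
    print(matches_bytewise("plain ASCII"))  # no non-ASCII octets at all
```

Under this model any body that contains a multi-byte UTF-8 character matches, since every continuation octet lies in 80 to BF.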
As a workaround you can use the alternation syntax:
.*(а|А|б|Б|в|В|г|Г|д|Д|е|Е|ё|Ё|ж|Ж|з|З|и|И|й|Й|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш|щ|Щ|ъ|Ъ|ы|Ы|ь|Ь|э|Э|ю|Ю|я|Я).*
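The alternation works even in a byte-oriented engine because each alternative is the complete octet sequence of one letter, so a lone 80 or 99 no longer matches. A rough Python sketch of that idea, again using `re` in bytes mode as a stand-in for the Sieve engine (an assumption), with made-up names `alternation` and `has_cyrillic_bytewise`:

```python
import re

# The same letters as in the alternation.
LETTERS = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"

# Each alternative is the full UTF-8 octet sequence of one letter,
# for example D0 90 for А, instead of a set of single octets.
alternation = re.compile(
    b"|".join(re.escape(ch.encode("utf-8")) for ch in LETTERS)
)


def has_cyrillic_bytewise(text: str) -> bool:
    """Return True if the UTF-8 text contains one of the listed letters."""
    return alternation.search(text.encode("utf-8")) is not None


if __name__ == "__main__":
    print(has_cyrillic_bytewise("It’s fine"))      # E2 80 99 matches no alternative
    print(has_cyrillic_bytewise("Привет, it’s me"))
```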
But this is still fragile.
Kim Minh.