Sieve regexp matches wrong
Hello!
I have recently run into a strange issue. I want to filter out mail that does not contain any Cyrillic characters: I do not want to receive email in any language other than Russian. I am using the rule below, but it does not work when the text of the mail contains the Unicode character U+2019 (’, right single quotation mark).
require ["body","regex"];
# rule:[Regexp test]
if not body :text :regex ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ].*"
{
    discard;
    stop;
}
I checked this behavior on different versions of Dovecot and Pigeonhole, and it was the same in all cases. If I replace U+2019 with, for instance, an apostrophe, the rule works as I expect: the mail is discarded.
Below is some output from the sieve-test utility. The body consists of a single phrase, *Test’test*.
First test: *Test'test*, with an apostrophe instead of U+2019
root@a4e4b17d33a1:/srv/mail# tail -2 1586937347.M574837P24389.vps.kveri.ru\,S\=1904\,W\=1944\:2\,S
Test'test
sieve-test output
- Script metadata (block: 0):
class = file
class.version = 0
location = /srv/mail/roundcube.sieve
- Required extensions (block: 1):
0: body (id: 18)
1: regex (id: 13)
- Main program (block: 2):
Address Line Code
00000000: DEBUG BLOCK: 3
00000001: EXTENSIONS [2]:
00000002: body
00000004: regex
00000006: 3: BODY
00000007: BODY-TRANSFORM: TEXT
0000000b: match type: regex
0000000d: key list: STR[138] ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРс...
0000009b: 3: JMPTRUE 6 [000000a2]
000000a0: 5: DISCARD
000000a1: 6: STOP
000000a2: 6: [End of code]
Performed actions:
* discard
Implicit keep:
(none)
In this case the rule works as I expect.
Second test: *Test’test*, with ’ (U+2019) instead of the apostrophe
root@a4e4b17d33a1:/srv/mail# tail -2 1586937347.M574837P24389.vps.kveri.ru\,S\=1904\,W\=1944\:2\,S
Test’test
sieve-test output
- Script metadata (block: 0):
class = file
class.version = 0
location = /srv/mail/roundcube.sieve
- Required extensions (block: 1):
0: body (id: 18)
1: regex (id: 13)
- Main program (block: 2):
Address Line Code
00000000: DEBUG BLOCK: 3
00000001: EXTENSIONS [2]:
00000002: body
00000004: regex
00000006: 3: BODY
00000007: BODY-TRANSFORM: TEXT
0000000b: match type: regex
0000000d: key list: STR[138] ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРс...
0000009b: 3: JMPTRUE 6 [000000a2]
000000a0: 5: DISCARD
000000a1: 6: STOP
000000a2: 6: [End of code]
Performed actions:
(none)
Implicit keep:
* store message in folder: INBOX
In this case the email was stored in INBOX, but I expected it to be discarded. As I said, this behavior does not depend on the Dovecot or Pigeonhole version: I have tried Dovecot 2.3.7, 2.2.30.x, 2.3.9.3 and 2.3.10 and Pigeonhole 0.5.7.2, 0.5.9 and 0.5.10, both on a fresh install and in a Docker container. The dovecot-sysreport was taken from the last one. What am I doing wrong? Is this a Pigeonhole bug or something like that?
kverizhnikov writes:
require ["body","regex"];
# rule:[Regexp test]
if not body :text :regex ".*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ].*"
{
    discard;
    stop;
}
This must be because the regex extension does not support Unicode.
Non-ASCII characters encoded in UTF-8 are treated as multiple octets. For example, А (U+0410) is D0 90 and Й (U+0419) is D0 99. So what your bracket expression really says is: match any single octet that occurs in the encoding of one of the listed letters, that is D0, D1, or anything from 80 to BF. It does not match on Unicode characters as you expected.
So your bracket expression includes the octet 99 and will match any body that contains a 99. The ’ encodes to E2 80 99, so the body matches your regular expression because it contains the octet 99. Since your test is negated with "not", the mail is then not discarded.
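The octet arithmetic above can be checked with a small Python sketch. This is only an emulation under the assumption that the bracket expression is evaluated octet by octet, as described; it is not Pigeonhole's actual code, and the names LETTERS, OCTETS, BYTE_CLASS and octet_class_matches are made up for the illustration.

```python
import re

# The letters from the bracket expression in the original rule.
LETTERS = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"

# Every distinct octet that appears in the UTF-8 encoding of those letters.
OCTETS = sorted(set(LETTERS.encode("utf-8")))

# Hypothetical stand-in for an octet-oriented bracket expression:
# a character class over single bytes, built from \xNN escapes.
BYTE_CLASS = re.compile(b"[" + b"".join(b"\\x%02x" % o for o in OCTETS) + b"]")


def octet_class_matches(text: str) -> bool:
    """Return True if any single octet of the UTF-8 text is in the class."""
    return BYTE_CLASS.search(text.encode("utf-8")) is not None


if __name__ == "__main__":
    # Show which octets the class really contains.
    print(" ".join("%02X" % o for o in OCTETS))
    # ASCII apostrophe, U+2019, and a genuinely Cyrillic word.
    for body in ("Test'test", "Test\u2019test", "Тест"):
        print(repr(body), octet_class_matches(body))
```

With the apostrophe nothing matches, so the negated test fires and the mail is discarded. With U+2019 the octets 80 and 99 are in the class, so the test matches even though the body has no Cyrillic letter.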
As a workaround you can use the alternation syntax:
.*(а|А|б|Б|в|В|г|Г|д|Д|е|Е|ё|Ё|ж|Ж|з|З|и|И|й|Й|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш|щ|Щ|ъ|Ъ|ы|Ы|ь|Ь|э|Э|ю|Ю|я|Я).*
But this is still fragile.
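To see why the alternation behaves differently, here is a rough Python emulation, again assuming octet-wise matching. Each alternative is the full two-octet encoding of one letter, so a lone 99 no longer matches. The helper names to_hex_escapes, ALTERNATION and has_cyrillic are invented for this sketch.

```python
import re

# The same letters as in the workaround pattern.
LETTERS = "аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ"


def to_hex_escapes(data: bytes) -> bytes:
    """Turn raw octets into \\xNN escapes usable in a bytes regex."""
    return b"".join(b"\\x%02x" % o for o in data)


# One alternative per letter; each alternative is that letter's
# complete UTF-8 octet sequence (two octets for these letters).
ALTERNATION = re.compile(
    b"(?:" + b"|".join(to_hex_escapes(ch.encode("utf-8")) for ch in LETTERS) + b")"
)


def has_cyrillic(text: str) -> bool:
    """Return True if the UTF-8 text contains one of the listed letters."""
    return ALTERNATION.search(text.encode("utf-8")) is not None


if __name__ == "__main__":
    for body in ("Test'test", "Test\u2019test", "Привет"):
        print(repr(body), has_cyrillic(body))
```

Here neither the apostrophe nor U+2019 matches, so in the Sieve rule both test mails would be discarded, while a body with a real Cyrillic letter still matches.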
Kim Minh.
participants (2)
- Kim Minh Kaplan
- kverizhnikov