email-address Tokenizer Splits Hyphenated Local-Parts, Causing Inaccurate FTS Search Results
I’m using Dovecot FTS with the flatcurve backend in a mailcow: dockerized setup. When searching for an email address with a hyphenated local-part (e.g., ma-g@example.com), the email-address tokenizer splits the local-part on hyphens, producing the tokens ma, g, and ma-g@example.com. This prevents searching for ma-g as a single term. With fts_flatcurve_substring_search = yes, searching for ma-g matches unrelated addresses containing ma (e.g., manager@example.com), leading to irrelevant results.

Dovecot version: 2.3.21.1 (d492236fa0)

Including only the relevant part of the Dovecot config:

plugin {
  fts = flatcurve
  fts_autoindex = yes
  fts_autoindex_exclude = \Junk
  fts_autoindex_exclude2 = \Trash
  fts_autoindex_max_recent_msgs = 999999
  fts_tokenizers = generic email-address
  fts_tokenizer_email_address = maxlen=100
  fts_tokenizer_generic = algorithm=simple maxlen=100
  fts_flatcurve_substring_search = yes
  fts_languages = en es de ru
  fts_filters = normalizer-icu snowball stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords
  fts_filters_ru = lowercase snowball stopwords
  fts_index_timeout = 300s
}

service indexer-worker {
  process_limit = 12
  vsz_limit = 512 MB
}

Steps to Reproduce: Index an email with ma-g@example.com in the From field, and also index emails whose From field contains "ma" and "g" (e.g., manager@example.com).
Check tokenization:
doveadm fts tokenize -u user@example.com "ma-g@example.com"
Output:
ma g example com ma-g@example.com
Search:
doveadm search -u user@example.com FROM ma-g
Results include manager@example.com due to ma matching.
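To make the failure mode concrete, here is a rough Python model of what I think is happening. It is not Dovecot's code: the function names (tokenize_address, substring_hits) are my own, and I am assuming that the query string is split the same way as the indexed text and that substring search checks whether a query token occurs inside an indexed token.

```python
import re


def tokenize_address(text):
    """Toy model of the reported token set: the pieces obtained by splitting
    on non-alphanumeric characters, followed by the whole address."""
    text = text.lower()
    tokens = [p for p in re.split(r"[^0-9a-z]+", text) if p]
    if "@" in text:
        # The email-address tokenizer also keeps the full address as one token.
        tokens.append(text)
    return tokens


def substring_hits(query, indexed_tokens):
    """For each query token, list the indexed tokens that contain it.
    Assumption: the query is split exactly like indexed text."""
    return {q: [t for t in indexed_tokens if q in t]
            for q in tokenize_address(query)}


if __name__ == "__main__":
    print(tokenize_address("ma-g@example.com"))
    print(substring_hits("ma-g", tokenize_address("manager@example.com")))
```

In this model the query ma-g becomes the two tokens ma and g, and both of them occur inside the indexed token manager, so manager@example.com is a hit regardless of whether the query terms are combined with AND or OR.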
Expected Behavior: FROM ma-g should match only emails with ma-g@example.com, treating ma-g as a single term or exact local-part.
Expected tokens:
doveadm fts tokenize -u user@example.com "ma-g@example.com"
Output:
ma-g ma g example com ma-g@example.com
Actual Behavior: The tokenizer splits ma-g into ma and g. Substring search matches "ma" or "g" in unrelated addresses (e.g., manager@example.com, greg@example.com). Without substring search, ma-g matches nothing unless searching the full address.
Impact: Searching for short hyphenated local-parts is unreliable, especially when they contain common fragments like ma, which flood the results with irrelevant matches.
Request: Add a configuration option, such as "fts_tokenizer_email_address_keep_hyphenated = yes|no" (default: no, for compatibility), to include the hyphenated local-part of an email address as an additional token. For example, with "yes", tokenizing "ma-g@example.com" would produce "ma-g", "ma", "g", "example", "com", and "ma-g@example.com". This allows searches for "FROM ma-g" to match emails with "ma-g@example.com" exactly, while preserving "ma" and "g" for substring searches. Consider "yes" as a future default: RFC 5322 allows hyphens within a local-part, and users expect precise results when searching for common hyphenated addresses like "first-last@domain.com". If the default changes, please provide upgrade notes for users relying on the current token set.
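A minimal Python sketch of the requested behavior, to show the token set I have in mind. This is only an illustration: the keep_hyphenated flag mirrors the proposed (not existing) setting, tokenize_with_option is a made-up name, and the splitting rule is a simplification of what the real tokenizers do.

```python
import re


def tokenize_with_option(text, keep_hyphenated=False):
    """Toy tokenizer: with keep_hyphenated=True, a hyphenated local-part is
    emitted as an extra token ahead of the usual split pieces."""
    text = text.lower()
    local, at, _domain = text.partition("@")
    tokens = []
    if keep_hyphenated and at and "-" in local:
        # Proposed addition: the intact local-part, e.g. "ma-g".
        tokens.append(local)
    # Current behavior as reported: pieces split on non-alphanumerics...
    tokens += [p for p in re.split(r"[^0-9a-z]+", text) if p]
    if at:
        # ...plus the full address as a single token.
        tokens.append(text)
    return tokens


if __name__ == "__main__":
    print(tokenize_with_option("ma-g@example.com", keep_hyphenated=True))
    print(tokenize_with_option("manager@example.com", keep_hyphenated=True))
```

With the flag enabled, an exact-term lookup for ma-g finds the token only in the first address, while ma and g remain available for substring searches. With the flag disabled, the output is the same token set shown under "Check tokenization" above.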
Is there any workaround to search hyphenated local-parts accurately?
Best regards, Daniel Levin