Hi everyone!
I'm not sure whether this is a design decision, a bug, something not yet implemented, or malformed input on my side.
My use case is the following: users who forward their mail outside the domain I manage have the option to report spam as an attachment, so the spam classifier can still learn from it. I prefer attachments over forwarding because each MUA forwards in its own way, and it is difficult to reliably reconstruct some headers and parts of the email structure that can be important for the spam classifier.
Each input message is processed with the Sieve extensions of RFC 5703 to extract the attachment. I ask users to produce a file with the .eml extension through any export-like feature of their MUA.
Below is an example EML file produced by Tuta (Thunderbird does a similar thing with a different encoding for the body). I'll use a bunch of = characters as a separator in the rest of this message.
===============================================
[...bunch of header...]
X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYSPR04MB7035
Content-Type: multipart/related; boundary="------------79Bu5A16qPEYcVIZL@tutanota"

--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64

DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the body...]
Once attached and sent, the received message looks like this:
[...headers of the actual email...]
X-Infomaniak-Routing: alpha

This is a multi-part message in MIME format.
--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: message/rfc822; name="a.eml"
Content-Disposition: attachment; filename="a.eml"
Content-Transfer-Encoding: 7bit

[...headers of the spam....]
Subject: Re: //Re: AI + Ranking ///
Thread-Topic: //Re: AI + Ranking ///
[....]
Content-Type: multipart/related; boundary="------------79Bu5A16qPEYcVIZL@tutanota"

--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64

DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the spam....]
--------------79Bu5A16qPEYcVIZL@tutanota--

--------------5jyKrhQ08xXUif7LHhgg648N--
I face two issues:
- Only the base64-decoded string can be retrieved through foreverypart/extracttext;
- It has a lot of newlines.
The second issue is not a big deal. The original spam has a lot of HTML junk, and each line of HTML without an inner text node is stripped, but its newline is kept. I don't know if this is intended.
The first issue is more of a big deal. To try to confirm it, I wrote an oversimplified version of the Sieve script (I skip the configuration, but the plugins are loaded, etc.; please ask if it would be useful).
===============================================
[...require...]
foreverypart {
    extracttext "eml";
    debug_log "PART ============================== ${eml}";
}
mail_debug is activated. In the logs, I can read:
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve: Hello,
Info: sieve:
Info: sieve: Hope you're doing well.
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve: Would you like attract more traffic with our AEO + GEO + SEO services. AI-driven search is here—don’t miss out.
So multiple parts are analyzed, but only the last one yields any text. The RFC says: "If the transfer encoding or character set is unrecognized by the implementation or recognized but invalid, an empty string will result." But this is beyond my knowledge.
I activated Sieve tracing at the "matching" level, and the trace file looks like this (it comes from another test, but the result was the same):
===============================================
Sieve trace log for message delivery:

Username: REDACTED
Session ID: ozNMOEPNl2nlAAAAO14Lzw
Sender: REDACTED
Final recipient: REDACTED
Default mailbox: INBOX

## Started executing script 'move-spam'

33: foreverypart loop begin
33: loop ends at line 36
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = "Hi,
Just checking in to see if you had a chance to review my earlier message.
[...body spam....]
36: debug_log "PART ============================== ????????????Hi, ?? ??Just checking in t..."
36: foreverypart loop end
36: no more message parts
36: exiting loops at line 36
## Finished executing script 'move-spam'
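As a cross-check on the five loop iterations in the trace, here is a minimal Python sketch (not Pigeonhole code) that walks a similar message with the standard email package. The message in RAW is a shortened, hypothetical stand-in: the outer multipart/mixed type, the "outer"/"inner" boundaries, the subjects and the one-line HTML body are all assumptions, since the real headers are elided above. It only shows how many MIME entities such a structure contains and which of them carry text of their own; whether Pigeonhole enumerates parts the same way is something I can't confirm.

```python
import base64
from email import message_from_string

# Hypothetical one-line HTML body standing in for the real spam content.
HTML_B64 = base64.b64encode(b"<p>Hello,</p>").decode("ascii")

# Shortened stand-in for the received message (structure assumed, see above).
RAW = "\n".join([
    "Subject: spam report",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',  # assumed outer type
    "",
    "This is a multi-part message in MIME format.",
    "--outer",
    "Content-Type: text/plain; charset=UTF-8; format=flowed",
    "Content-Transfer-Encoding: 7bit",
    "",
    "",
    "--outer",
    'Content-Type: message/rfc822; name="a.eml"',
    'Content-Disposition: attachment; filename="a.eml"',
    "Content-Transfer-Encoding: 7bit",
    "",
    "Subject: spam",
    "MIME-Version: 1.0",
    'Content-Type: multipart/related; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/html; charset=UTF-8",
    "Content-Transfer-Encoding: base64",
    "",
    HTML_B64,
    "--inner--",
    "",
    "--outer--",
    "",
])


def describe_parts(raw):
    """Return (content type, decoded text) for every MIME entity in raw."""
    rows = []
    for part in message_from_string(raw).walk():
        if part.is_multipart():
            # Containers (multipart/*, message/rfc822) hold other entities
            # and have no text body of their own.
            text = ""
        else:
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace").strip()
        rows.append((part.get_content_type(), text))
    return rows


if __name__ == "__main__":
    for content_type, text in describe_parts(RAW):
        print(f"{content_type!s:20} {text!r}")
```

With this stand-in, describe_parts(RAW) lists five entities (multipart/mixed, text/plain, message/rfc822, multipart/related, text/html), and only the last one has a non-empty body, because the covering text/plain part is empty and the other three are containers.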
Hypotheses:
A. Could it be because the mail headers are indistinguishable from the part headers?
B. Or is it related to Pigeonhole not yet implementing the enclose extension, which does this the other way around?
I hope this message is not too hard to read with all the code/logs. Please tell me if there is a better way of doing it.
And thanks in advance for any help, because I don't really know what to do, and taking a (superficial) look at the Pigeonhole code didn't help.