Processing incoming mail efficiently
I’ve asked a related question on this list before but I now have a much better handle on what I’m doing and I realize that I still don’t know the answer, so I’m going to ask this again in a slightly different form.
I’m writing a spam filter, so obviously I need to feed incoming mail to it somehow. The “obvious” way to do this is with a sieve script using the pipe extension. There are two problems with this:
This will always pipe the entire file no matter how big it is. The filter will often not need to process the body of the message, only the headers, or only the first part of a multipart MIME message. Is there any way to allow my filter to open the file in which the message is stored rather than piping it a copy of the message?
Once the filter has processed the message and decided if it’s spam it still needs to move the message to the appropriate folder (INBOX or Junk). To do this it needs to somehow correlate the *content* of the message that was piped to it with the UID of the message that needs to be moved. One way to do this is to pull out the message-id header and then use doveadm to find the file containing the message with that message-id, but there are two problems with this. First, not all messages have message-ids. I can work around this by adding my own message-id to messages that don’t already have them, but this just feels wrong. And second, unless dovecot keeps an index of message-ids (does it?) then this will be horribly inefficient because it will have to essentially grep for the message id every time I want to move a message. So it seems like there has to be a better way, but I can’t think of what that would be.
I figure this has to be a solved problem because I am obviously not the first person to write a spam filter for dovecot. What is the Right Way to do this?
Thanks, rg
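A minimal sketch of the two ideas raised above, assuming the filter receives the message as a byte stream (as it would through a pipe): it parses only the header block, and falls back to a generated Message-ID when the message has none. The function names and the `example.invalid` domain are placeholders for illustration; nothing here is part of Dovecot or Pigeonhole.

```python
import io
from email.parser import BytesParser
from email.utils import make_msgid


def read_header_block(stream):
    """Read lines up to and including the blank line that ends the headers."""
    lines = []
    for line in stream:
        lines.append(line)
        if line in (b"\n", b"\r\n"):
            break
    return b"".join(lines)


def message_id_or_new(stream, domain="example.invalid"):
    """Return (message_id, was_generated) using only the header block.

    The body is left unread on the stream. The fallback domain is a
    hypothetical placeholder.
    """
    raw_headers = read_header_block(stream)
    headers = BytesParser().parsebytes(raw_headers, headersonly=True)
    existing = headers["Message-ID"]
    if existing:
        return existing.strip(), False
    return make_msgid(domain=domain), True
```

For example, `message_id_or_new(io.BytesIO(raw_message))` returns the existing ID when there is one; the second element of the tuple tells the caller whether a header would still have to be added to the stored message.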
On 30-01-2021 17:49, Ron Garret wrote:
I’ve asked a related question on this list before but I now have a much better handle on what I’m doing and I realize that I still don’t know the answer, so I’m going to ask this again in a slightly different form.
I’m writing a spam filter, so obviously I need to feed incoming mail to it somehow. The “obvious” way to do this is with a sieve script using the pipe extension. There are two problems with this:
This will always pipe the entire file no matter how big it is. The filter will often not need to process the body of the message, only the headers, or only the first part of a multipart MIME message. Is there any way to allow my filter to open the file in which the message is stored rather than piping it a copy of the message?
Once the filter has processed the message and decided if it’s spam it still needs to move the message to the appropriate folder (INBOX or Junk). To do this it needs to somehow correlate the *content* of the message that was piped to it with the UID of the message that needs to be moved. One way to do this is to pull out the message-id header and then use doveadm to find the file containing the message with that message-id, but there are two problems with this. First, not all messages have message-ids. I can work around this by adding my own message-id to messages that don’t already have them, but this just feels wrong. And second, unless dovecot keeps an index of message-ids (does it?) then this will be horribly inefficient because it will have to essentially grep for the message id every time I want to move a message. So it seems like there has to be a better way, but I can’t think of what that would be.
Normally the flow is a bit different:
You configure the spam/content filter in your MTA (for instance SMTP-proxy, pre-queue, milter or post-queue content filter). The main benefit of doing this type of work in the MTA is that you have the ability to reject blatant spam messages during the SMTP stage. This means that you don't have to store the spam at all: you simply tell the sending server that you don't want to accept the message, and the sending server will have to deal with that decision (e.g. by sending a non-delivery notice to the sender).
The spam filter will add headers to the incoming message. If you decide to accept it, you can configure Sieve to deliver the message to the Inbox or the Junk folder. A nice implementation is available (see https://doc.dovecot.org/configuration_manual/sieve/extensions/spamtest_virus...), but you can of course wrangle your own sieve recipes.
Spam scanning during the delivery phase (e.g. with a sieve filter) is less common because it has a few downsides: by then the message has already been accepted, so it can no longer be rejected during the SMTP stage and has to be stored.
So to answer your questions:
Your content filter can be a spam filter, but it might also be an antivirus scanner. The latter is of course very interested in the complete e-mail including all attachments. So most setups try to send the complete message. There are also implementations that ignore messages with a size above a certain threshold, or others which just ignore the data after a certain threshold. What filter are you trying to implement? Something off the shelf, or a homebrewed one? Why is it so hard to consume the whole message? Please explain :)
The normal flow is a bit different (as described above), but in general: the spam filter decides. Some (existing) filters take the whole message from the MTA, add headers and re-inject the message. Other filters use a mechanism (e.g. the milter protocol) which allows them to consume only a part of the message, and in response they instruct the MTA to add the result headers. This means that the filter must support the protocol used to talk to the MTA, but it doesn't have to take care of re-delivering the message.
We need to know about the actual problem you're trying to solve. It sounds a lot like you're trying to reinvent things that have been solved many times before. Please give a broader explanation of your specific problem and we can give you better advice :)
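As a rough illustration of the "take the whole message, add headers, re-inject" style of filter, here is a sketch of the tagging step only, using the Python standard library. The header names `X-Spam-Flag` and `X-Spam-Score` are an assumed convention chosen for the example, and the verdict and score are supplied by the caller; how the message gets back to the MTA is left out.

```python
from email import message_from_bytes


def tag_message(raw, is_spam, score):
    """Return the message bytes with the filter's verdict headers added.

    Header names are illustrative assumptions, not mandated by any MTA.
    """
    msg = message_from_bytes(raw)
    # Drop any copies already present so a sender cannot forge the verdict.
    del msg["X-Spam-Flag"]
    del msg["X-Spam-Score"]
    msg["X-Spam-Flag"] = "YES" if is_spam else "NO"
    msg["X-Spam-Score"] = f"{score:.1f}"
    return msg.as_bytes()
```

A milter-based filter would not rewrite the message itself; it would instead ask the MTA to add the equivalent headers.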
Kind regards,
Tom
Sorry, I left out a few details.
The filter actually has two parts, one of which is on the MTA side (a milter). That part does things like keep track of outgoing mail from authorized users so that it knows when an incoming message has a subject line that a user has sent out or is from a sender that a user has previously sent a message to. Those are two very reliable ham signals.
The reason there is also a filter on the LDA side is that one of the filtering strategies I’m using is looking for two messages from two different previously unknown senders with the same subject received within a few minutes of each other. This turns out to be a very reliable spam signal. But it requires messages of unknown provenance to be held in temporary storage for a while to see if another matching message comes in. The held message then needs to be processed as spam after the fact.
rg
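The same-subject heuristic described above could be sketched roughly as follows. This is an in-memory illustration only: the class name, the 600-second default window, and the `msg_key` identifiers are assumptions, and it presumes that messages from known senders have already been filtered out upstream. It returns the keys of earlier held messages that match, since those are the ones that need to be reclassified after the fact.

```python
from collections import defaultdict


class DuplicateSubjectDetector:
    """Flag messages that share a subject with a recent message from a different sender."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        # normalized subject -> list of (timestamp, sender, msg_key)
        self._seen = defaultdict(list)

    @staticmethod
    def _normalize(subject):
        # Case-fold and collapse whitespace so trivial variations still match.
        return " ".join(subject.lower().split())

    def observe(self, msg_key, sender, subject, now):
        """Record a message and return keys of earlier matches from other senders."""
        key = self._normalize(subject)
        sender = sender.lower()
        # Forget entries that have aged out of the window.
        recent = [e for e in self._seen[key] if now - e[0] <= self.window]
        matches = [k for (_, s, k) in recent if s != sender]
        recent.append((now, sender, msg_key))
        self._seen[key] = recent
        return matches
```

A non-empty result means both the new message and every returned key should be treated as spam; an empty result means the message stays held until the window expires.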
On 30-01-2021 19:11, Ron Garret wrote:
Sorry, I left out a few details.
The filter actually has two parts, one of which is on the MTA side (a milter). That part does things like keep track of outgoing mail from authorized users so that it knows when an incoming message has a subject line that a user has sent out or is from a sender that a user has previously sent a message to. Those are two very reliable ham signals.
The reason there is also a filter on the LDA side is that one of the filtering strategies I’m using is looking for two messages from two different previously unknown senders with the same subject received within a few minutes of each other. This turns out to be a very reliable spam signal. But it requires messages of unknown provenance to be held in temporary storage for a while to see if another matching message comes in. The held message then needs to be processed as spam after the fact.
If you don't want to deliver the message to the inbox of the recipient, you should just do that: don't deliver it. Put it in some quarantine, and when you're sure you want it to end up in the mailbox of the user, pick up the message from quarantine, put it back in the mail queue, and have it delivered using the normal delivery route.
How you set up the quarantine is up to you. This could be a simple mailbox, which is reprocessed using a sieve filter (as you suggested). The most logical routine would then be to consume the message by the sieve filter, and then re-inject it in the mail delivery queue. But there are probably better solutions.
I suggest that you look into existing OSS quarantine solutions and learn from them, amavis or rspamd come to mind. IMHO you're still trying to re-invent the wheel :)
Kind regards, Tom
On Jan 30, 2021, at 11:54 AM, Tom Hendrikx <tom@whyscream.net> wrote:
IMHO you're still trying to re-invent the wheel :)
I don’t deny that. The goal of this project is as much (maybe more) to be a learning experience as it is to produce something useful.
FWIW, there are two reasons I don’t want to use a non-user-visible quarantine. First, there is always the possibility of a false positive, so all email must be made accessible to the user somehow. And second, there are occasions when you are expecting an email that looks spammy and you need to be able to get to it in a timely manner. The most common use case here is password reset links or 2FA authorization codes. It is not possible for a spam filter to distinguish a legitimate email of this type from a phishing attack. Only the user knows if they recently requested a password reset. But *most* password reset emails are phishing attacks (at least most of the ones I get are) so I don’t want to see them by default.
rg
Just for the record, here is my approach to my spam filter design just in case anyone is interested.
Design goals:
It should Just Work with an absolute minimum amount of user intervention and training required. That said, there are cases where user intervention will be necessary. In particular, emails that are sent in order to verify that the user controls their email address (password resets, 2FA auth codes, subscription confirmations) tend to be indistinguishable from phishing attacks and so some amount of user intervention will be needed.
It should be entirely server-side. This project was motivated in large part because my current spam filter is SpamSieve, which works great, but it runs on my laptop, so whenever my laptop is off-line so is my spam filter. So if I check my mail from my phone and my laptop is off-line I get a ton of spam.
I want to use as much off-the-shelf software as possible, but I also want to be able to have full control over the system. That means that whatever off-the-shelf software I use has to either Just Work, or be written in a language that I am proficient in. That rules out a lot of stuff because one of the languages I am *not* proficient in is Perl, and a lot of off-the-shelf spam filter stuff is written in Perl, and doesn’t do what I want out of the box.
Approach:
Seed the process with a source of reliable ham and reliable spam that does not require user labeling. The reliable ham is provided by keeping track of outgoing messages, which are presumed to be ham, and messages inbound to a honeypot address, which are presumed to be spam. The honeypot spam training corpus is shared among all users. The outgoing ham corpus is user-specific because who knows, someone may actually want information about Viagra.
The filter consists of two parts, a milter on the MTA side and a set of minimal sieve scripts on the LDA side that connect to a more or less traditional Bayesian filter. The milter tracks outgoing mail, and tags “easy” spam and ham on the incoming side. “Easy” ham consists of messages from senders to whom the user has previously sent messages, or with subjects that are “Re:” a subject about which the user has previously sent a message. Easy spam is things like Chinese text (at least that’s easy spam for me — that would obviously not work for a user in China, but this is mainly for my personal use) and certain super-spammy TLDs like .ru, .biz, etc. The milter also does greylisting. It’s written in Python using the pymilter library. The LDA-side Bayesian filter is written in Common Lisp. State is stored in a shared DB (currently SQLite3, but I’m probably going to switch to MariaDB because SQLite3 doesn’t play well with threads.)
What is left over after the milter is a set of messages from addresses with which the user has never corresponded. Those get put into an INCOMING folder, where they are processed by the Bayesian filter. The reason for putting the Bayesian filter on the LDA side is that this filter also applies the heuristic that if a message is received from two different unknown senders with the same subject within a relatively short period of time (like 10-15 minutes) then both of those messages are almost certainly spam. Empirically, an MTA-side Bayesian filter misses a lot of easy spam because it cannot apply this heuristic. I mean yes, it’s possible, but it causes a lot of problems, not least of which is that timely things like password resets and 2FA auths get held up along with potential spam. (Greylisting has this problem too. I’m actually still trying to decide what to do about that.)
User training input is provided in the usual way, by having dovecot-sieve scripts that intercept messages being moved from INBOX to Junk and vice versa. (I have not yet decided what to do about messages that the user moves out of INCOMING.)
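The "easy ham" test from the approach above (a sender the user has written to before, or a "Re:" to a subject the user has sent) might look something like this. It is a single-user, in-memory sketch; the class and function names are invented for illustration, and the real system keeps this state in a shared database rather than in Python sets.

```python
import re

# One or more leading "Re:" prefixes, in any letter case.
_RE_PREFIX = re.compile(r"^\s*(re\s*:\s*)+", re.IGNORECASE)


def base_subject(subject):
    """Strip reply prefixes, case-fold, and collapse whitespace."""
    return " ".join(_RE_PREFIX.sub("", subject).lower().split())


class OutgoingLog:
    """Per-user record of outgoing recipients and subjects (hypothetical helper)."""

    def __init__(self):
        self.recipients = set()
        self.subjects = set()

    def record_outgoing(self, recipients, subject):
        self.recipients.update(r.lower() for r in recipients)
        self.subjects.add(base_subject(subject))

    def is_easy_ham(self, sender, subject):
        # Signal 1: the user has previously written to this sender.
        if sender.lower() in self.recipients:
            return True
        # Signal 2: a reply to a subject the user has previously sent.
        is_reply = _RE_PREFIX.match(subject) is not None
        return is_reply and base_subject(subject) in self.subjects
```

Requiring the "Re:" prefix for the subject signal is a design choice in this sketch: a stranger who merely reuses a subject the user once sent is not treated as ham.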
That’s where I’m at. Currently stuck on trying to figure out how to get the LDA-side filter to move messages. My baseline plan was to use external calls to doveadm, but it seems like there has to be a better way. Any and all advice and commentary much appreciated.
rg
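For reference, the baseline plan of shelling out to doveadm could be sketched as below. The argument layout follows my understanding of `doveadm move` (destination mailbox followed by a search query, here `mailbox <source> header Message-ID <id>`); treat that as an assumption and check it against the doveadm-move and doveadm-search-query man pages for your version. The function names are placeholders, and the sketch does nothing about the lookup-cost concern raised earlier in the thread.

```python
import subprocess


def doveadm_move_argv(user, source, destination, message_id):
    """Build the argv for moving one message, selected by its Message-ID header."""
    return [
        "doveadm", "move", "-u", user, destination,
        "mailbox", source, "header", "Message-ID", message_id,
    ]


def move_message(user, source, destination, message_id, run=subprocess.run):
    """Run doveadm; `run` is injectable so the call can be exercised without Dovecot."""
    argv = doveadm_move_argv(user, source, destination, message_id)
    # Passing a list (no shell) avoids quoting problems with '<' and '>' in the ID.
    return run(argv, check=True)
```

Usage would be along the lines of `move_message("ron", "INCOMING", "Junk", "<abc@example.org>")`, with the user and mailbox names here being examples only.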
-----Original Message----- From: dovecot <dovecot-bounces@dovecot.org> On Behalf Of Ron Garret Sent: 30 January 2021 17:49 To: Dovecot <dovecot@dovecot.org> Subject: Processing incoming mail efficiently
I’ve asked a related question on this list before but I now have a much better handle on what I’m doing and I realize that I still don’t know the answer, so I’m going to ask this again in a slightly different form.
I’m writing a spam filter, so obviously I need to feed incoming mail to it somehow. The “obvious” way to do this is with a sieve script using the pipe extension. There are two problems with this:
No, that is not obvious; this would imply a dependency on sieve.
- This will always pipe the entire file no matter how big it is. The filter will often not need to process the body of the message,
Yes, because your starting point is wrong. Using mailfromd you can process a specific milter state; see envfrom, envrcpt, etc.
https://puszcza.gnu.org.ua/software/mailfromd/manual/mailfromd.html#handler-...
only the headers, or only the first part of a multipart MIME message. Is there any way to allow my filter to open the file in which the message is stored rather than piping it a copy of the message?
- Once the filter has processed the message and decided if it’s spam it still needs to move the message to the appropriate folder (INBOX or Junk). To do this it needs to somehow correlate the *content* of the message that was piped to it with the UID of the message that needs to be moved. One way to do this is to pull out the message-id header and then use doveadm
No, in whatever milter state you are processing, you can add a message header 'This is spam'. Then you make just one sieve rule that moves messages based on the existence of that specific header.
to find the file containing the message with that message-id, but there are two problems with this. First, not all messages have message-ids. I can work around this by adding my own
First you have to crawl before you walk. So learn how to crawl. It does not make sense to try to build something if you do not know the specifics.
message-id to messages that don’t already have them, but this just feel wrong. And second, unless dovecot keeps an index of message-ids (does it?) then this will be horribly inefficient because it will have to essentially grep for the message id every time I want to move a message. So it seems like there has to be a better way, but I can’t think of what that would be.
Start playing with mailfromd. It has a scripting language to configure it, and all the tools (functions) are available to do whatever you can think of.
https://puszcza.gnu.org.ua/software/mailfromd/manual/mailfromd.html#Filter-S...
I figure this has to be a solved problem because I am obviously not the first person to write a spam filter for dovecot. What is the Right Way to do this?
As written above
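The header-based routing suggested in this reply boils down to one decision: if the marker header exists, file the message as spam. The sketch below expresses that decision in Python purely for illustration; in an actual deployment it would be a single Sieve rule (an `exists` test followed by `fileinto`). The header name `X-Filter-Spam` and the folder names are assumptions.

```python
from email import message_from_bytes


def target_folder(raw, marker_header="X-Filter-Spam", spam_folder="Junk"):
    """Pick the delivery folder based only on whether the marker header exists."""
    msg = message_from_bytes(raw)
    # Header lookup is case-insensitive; the header's value is ignored.
    return spam_folder if marker_header in msg else "INBOX"
```

Because the delivery side only checks for the header, the milter remains the single place where the spam decision is made.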
participants (3): Marc Roos, Ron Garret, Tom Hendrikx