[Dovecot] Optimisation opportunity for IMAP searches
Hello,
I love Dovecot, but when developing a small IMAP tool, I ran into searching behaviour can easily be optimised. Please forgive a rather detailed suggestion. This was on Dovecot 1.2.15 on Debian Squeeze.
My tool? It's called "midget" and retrieves documents from an IMAP box based on their mid: or cid: identifier, as per RFC 2392. I thought this would be useful to retrieve email attachments into a remote shell environment. When using Kerberos, the credentials and a strong hint of one's mail address are already present anyway, DNS SRV does the rest. If you want to see the early code, I'll be happy to share it here.
The three formats of the URIs in RFC 2392 are:
- mid:messageid
- cid:contentid
- mid:messageid/contentid
Searching for a mid: with only a Message-ID part is quick and probably indexed,
(HEADER Message-ID <messageid>)
Searching for a cid: meant ploughing through body text to get to the attachments, so it was a slow and costly and needed checking of the outcome:
(TEXT <contentid>)
The quicker URI form to get to a Content-Id should be to mention it at the end of a mid: URI, and search for
(HEADER Message-ID <messageid>)(TEXT <conentid>)
Much to my surprise, this took about as long as the cid: query -- ploughing through lots of body text while it only needed to look through a one message bodies! As a human, I can see a straightforward improvement. However, it is more difficult to find a general solution -- but that is what I'm describing below, in the hope that it will help to improve Dovecot's search performance.
In general, searches are Boolean expressions composed with implied AND and explicit (NOT …) and (OR ….) constructs.
Using the theorems of The Morgan, push all (NOT …) constructs inside as far as possible, until they surround elementary statements. Some may have a trivial solution, such as (NOT (ALL)). Others will retain the NOT but that might be handled with a different use of the indexes. In general, this phase turns all AND and OR constructs into positive logic, without duplicating search terms.
Estimate the effort involved in every part of the calculation. (HEADER) is lighter than (TEXT) and headers may be further split into with / without index. Larger sets could be more work to search than smaller ones. In general, an estimate of the number of comparisons could be a good metric for work to be done.
For AND compositions, start with the simplest one in the AND compostition, and continue with increasingly heavier ones. Apply later conditions only on the output of the previous conditions, either after collecting all of it or everytime something is found. Whatever is optimal -- bulk operations are more efficient, but may collect large sets to handle.
For OR compositions, either start a thread for each part, or switch focus and collect sorted streams into one collectively sorted stream.
I've been doing this stuff all through my PhD thesis work, so it come easy to me :-) But this optimisation does not appear to be implemented yet, so I thought I'd suggest it. I hope this is useful to Dovecot!
Best wishes,
Rick van Rein
Hi,
The search code works pretty much like you described. I was surprised to find out that in v2.2 it still didn't work correctly, but it was quite a small fix: http://hg.dovecot.org/dovecot-2.2/rev/4b0a736bc40c
On 11.10.2013, at 9.07, Rick van Rein (OpenFortress) rick@openfortress.nl wrote:
Hello,
I love Dovecot, but when developing a small IMAP tool, I ran into searching behaviour can easily be optimised. Please forgive a rather detailed suggestion. This was on Dovecot 1.2.15 on Debian Squeeze.
My tool? It's called "midget" and retrieves documents from an IMAP box based on their mid: or cid: identifier, as per RFC 2392. I thought this would be useful to retrieve email attachments into a remote shell environment. When using Kerberos, the credentials and a strong hint of one's mail address are already present anyway, DNS SRV does the rest. If you want to see the early code, I'll be happy to share it here.
The three formats of the URIs in RFC 2392 are:
- mid:messageid
- cid:contentid
- mid:messageid/contentid
Searching for a mid: with only a Message-ID part is quick and probably indexed,
(HEADER Message-ID <messageid>)
Searching for a cid: meant ploughing through body text to get to the attachments, so it was a slow and costly and needed checking of the outcome:
(TEXT <contentid>)
The quicker URI form to get to a Content-Id should be to mention it at the end of a mid: URI, and search for
(HEADER Message-ID <messageid>)(TEXT <conentid>)
Much to my surprise, this took about as long as the cid: query -- ploughing through lots of body text while it only needed to look through a one message bodies! As a human, I can see a straightforward improvement. However, it is more difficult to find a general solution -- but that is what I'm describing below, in the hope that it will help to improve Dovecot's search performance.
In general, searches are Boolean expressions composed with implied AND and explicit (NOT …) and (OR ….) constructs.
Using the theorems of The Morgan, push all (NOT …) constructs inside as far as possible, until they surround elementary statements. Some may have a trivial solution, such as (NOT (ALL)). Others will retain the NOT but that might be handled with a different use of the indexes. In general, this phase turns all AND and OR constructs into positive logic, without duplicating search terms.
Estimate the effort involved in every part of the calculation. (HEADER) is lighter than (TEXT) and headers may be further split into with / without index. Larger sets could be more work to search than smaller ones. In general, an estimate of the number of comparisons could be a good metric for work to be done.
For AND compositions, start with the simplest one in the AND compostition, and continue with increasingly heavier ones. Apply later conditions only on the output of the previous conditions, either after collecting all of it or everytime something is found. Whatever is optimal -- bulk operations are more efficient, but may collect large sets to handle.
For OR compositions, either start a thread for each part, or switch focus and collect sorted streams into one collectively sorted stream.
I've been doing this stuff all through my PhD thesis work, so it come easy to me :-) But this optimisation does not appear to be implemented yet, so I thought I'd suggest it. I hope this is useful to Dovecot!
Best wishes,
Rick van Rein
Hi Timo,
The search code works pretty much like you described.
Yeah, that's what I'd assumed from Dovecot until I ran into these strange findings.
I was surprised to find out that in v2.2 it still didn't work correctly, but it was quite a small fix: http://hg.dovecot.org/dovecot-2.2/rev/4b0a736bc40c
Cool. The future is smiling, and we both did something useful today ;-)
Thanks! -Rick
participants (2)
-
Rick van Rein (OpenFortress)
-
Timo Sirainen