[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

Wed Aug 31 16:24:24 EEST 2011

On Mon, 2011-05-23 at 13:11 +0200, Antonio Perez-Aranda wrote:
> Indexing mail attachments with Dovecot + Solr.

I've been looking at this and wondering about a few things:

The example solrconfig.xml contains:

>   <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
> ..
>       <!-- capture link hrefs but ignore div attributes -->
>       <str name="captureAttr">true</str>
>       <str name="fmap.a">links</str>
>       <str name="fmap.div">ignored_</str>
>     </lst>

To me it looks like this requires a "links" field to exist, which is
used for.. I guess the content between <a>..</a> tags? Or also for the
href URLs? In any case there's no "links" field in the schema.xml, so I
don't think this works?

Similarly it looks like stuff between <div>..</div> is ignored here,
which doesn't seem like a good idea.

> There is a new property for the section plugin to filter the mimetypes
> that you want to index.
>  * fts_solr_mimetype
>        files with this mimetype will be sent to solr.

In v2.1 I've added a generic "fts decoder" script that can handle
attachment decoding. The script contains stuff like:

formats='application/pdf pdf
application/x-pdf pdf
application/msword doc
..

So there already exists a place that lists the supported MIME types and
their filename extensions. If there's an application/octet-stream part
with filename=foo.pdf, Dovecot's fts code can change the MIME type to
application/pdf. This sounds like it could be useful for the Solr
attachments too. Maybe instead of adding the fts_solr_mimetype setting,
the script could be modified a bit so that it would even allow mixed
Solr/script attachment extraction. For example:

formats='+application/pdf pdf
+application/x-pdf pdf
application/msword doc'

The "+" prefix could tell that the FTS backend (Solr) handles the MIME
type instead of the script. So with the above config Solr would
decode .pdfs, but the script would decode .docs.
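
A rough sketch of that routing logic, written in Python purely for
illustration (the real decoder is a shell script, and none of this is
Dovecot code). The names parse_formats and route_attachment and the
"backend"/"script" labels are made up here:

```python
# Hypothetical sketch of the "+" prefix idea; not Dovecot's actual code.
FORMATS = """\
+application/pdf pdf
+application/x-pdf pdf
application/msword doc"""


def parse_formats(text):
    """Parse a formats list into (handlers, by_extension).

    handlers maps MIME type -> "backend" (Solr) or "script".
    by_extension maps filename extension -> first MIME type listed for it.
    """
    handlers = {}
    by_extension = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        mime_type, extension = fields
        handler = "script"
        if mime_type.startswith("+"):
            # "+" means the FTS backend decodes this type itself.
            handler = "backend"
            mime_type = mime_type[1:]
        handlers[mime_type] = handler
        by_extension.setdefault(extension, mime_type)
    return handlers, by_extension


def route_attachment(mime_type, filename, handlers, by_extension):
    """Return (effective MIME type, handler); handler is None if unsupported."""
    if mime_type == "application/octet-stream" and filename and "." in filename:
        # Guess the real type from the filename extension.
        extension = filename.rsplit(".", 1)[1].lower()
        mime_type = by_extension.get(extension, mime_type)
    return mime_type, handlers.get(mime_type)
```

With the list above, an application/octet-stream part named foo.pdf
would be remapped to application/pdf and handed to the backend, while
application/msword would still go to the script.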

I was also thinking that the attachment documents could contain some
description fields as well, which could be useful if you're searching
the Solr index directly instead of via Dovecot. Maybe fields like
"attachment_filename" (parsed from the Content-Disposition: header) and
"attachment_description" (parsed from the Content-Description: header).
They could of course be empty if those headers don't exist (and the
fields probably should be optional anyway).
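
To show what I mean, here's a minimal sketch using Python's standard
email package (again only an illustration, not how Dovecot would do
it). The attachment_fields helper and the sample part are made up; the
field names are the ones suggested above:

```python
from email import message_from_string


def attachment_fields(part):
    """Build the optional description fields for one attachment part.

    Missing headers simply leave the corresponding field out.
    """
    fields = {}
    # get_filename() reads the filename parameter of Content-Disposition
    # (and falls back to the name parameter of Content-Type).
    filename = part.get_filename()
    if filename:
        fields["attachment_filename"] = filename
    description = part.get("Content-Description")
    if description:
        fields["attachment_description"] = description.strip()
    return fields


# Made-up sample attachment part for demonstration.
SAMPLE_PART = message_from_string(
    "Content-Type: application/pdf\n"
    'Content-Disposition: attachment; filename="report.pdf"\n'
    "Content-Description: Quarterly report\n"
    "\n"
    "(pdf data would go here)\n"
)
sample_fields = attachment_fields(SAMPLE_PART)
```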

Also there should be an "attachment_part" field that would contain the
IMAP MIME part number of the attachment (e.g. "2.1.3"), so it would be
easy to find and fetch the attachment. This could also be used as part
of the ID string instead of the attachment_count.
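
For reference, a simplified sketch of how such part numbers come out
of a MIME tree, using Python's email package. The iter_leaf_parts
helper and the sample message are made up, and message/rfc822 parts
are treated as plain leaves here, which is a simplification of the
IMAP numbering rules:

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def iter_leaf_parts(msg, prefix=""):
    """Yield (part_number, part) for every non-multipart part.

    Children of a multipart are numbered 1..n, and nested multiparts
    extend the number with a dot, e.g. "2.1". A non-multipart message
    has the single part "1". Simplification: message/rfc822 parts are
    not descended into.
    """
    if not msg.is_multipart() or msg.get_content_maintype() != "multipart":
        yield (prefix or "1", msg)
        return
    for index, child in enumerate(msg.get_payload(), start=1):
        number = f"{prefix}.{index}" if prefix else str(index)
        yield from iter_leaf_parts(child, number)


# Made-up sample: a text body plus a nested multipart holding a PDF.
outer = MIMEMultipart("mixed")
outer.attach(MIMEText("hello"))
inner = MIMEMultipart("mixed")
inner.attach(MIMEApplication(b"%PDF", "pdf"))
inner.attach(MIMEText("note"))
outer.attach(inner)

part_numbers = [(n, p.get_content_type()) for n, p in iter_leaf_parts(outer)]
```

Here the PDF ends up as part "2.1", which is the value that would go
into attachment_part.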