[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

Antonio Perez-Aranda aperezaranda at yaco.es
Mon May 23 14:11:27 EEST 2011


Indexing mail attachments with Dovecot + Solr.

This patch has been tested with these versions:
 * dovecot 2.0.9
 * apache-solr 1.4.1

This is a patch for the fts-solr plugin, which indexes mail messages
for Dovecot with Solr. Upstream, the plugin does not index
attachments. With this patch, you can index mails and their
attachments (PDF, Word documents, OpenOffice documents...). The patch
and the provided Solr config also offer other goodies, such as
synonyms and stemming (Spanish by default).

Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler)
 * http://wiki.apache.org/solr/ExtractingRequestHandler
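
As a rough illustration of what a Solr Cell request looks like, here is
a minimal Python sketch that sends one attachment to the
ExtractingRequestHandler. It is not the code in the patch (the plugin
is written in C). The handler path /update/extract, the literal.id
parameter, and the function names are assumptions based on the stock
Solr example configuration; adjust them to your own solrconfig.xml.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, commit=False):
    """Build the URL for a Solr Cell extract request.

    Assumes the handler is mounted at /update/extract, as in the
    stock Solr example config.
    """
    params = {"literal.id": doc_id}  # unique key of the new document
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_attachment(solr_base, doc_id, data, mimetype):
    """POST raw attachment bytes to Solr; Tika extracts the text."""
    req = Request(
        build_extract_url(solr_base, doc_id, commit=True),
        data=data,
        headers={"Content-Type": mimetype},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status
```

For example, post_attachment("http://solrhost:8983/solr/", "msg-1",
pdf_bytes, "application/pdf") would ask Solr to extract and index the
text of one PDF under the hypothetical id "msg-1".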

Stemming is provided by SnowballPorterFilterFactory and synonyms by
SynonymFilterFactory, both described in Solr Language Analysis:
 * http://wiki.apache.org/solr/LanguageAnalysis

We have tested Solr with Tomcat and Jetty. In our tests, Tomcat
handled UTF-8 and larger POST requests better.

Supported attachment file formats:
 * http://tika.apache.org/0.9/formats.html

At present, attachments nested inside other attachments (for example,
attachments inside forwarded "eml" attachments) are not indexed. Also,
keep in mind that there are many types of files, and many variants of
the same file type. For example, some PDF files cannot be read by the
PDF reader that Solr uses.

Config:

Two new options are added to the fts_solr setting:
 * index-attachments
       Enables attachment indexing.
 * manual-update
       Disables indexing on user search. You can trigger indexing with
       the doveadm search or doveadm index commands.
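
To make the syntax concrete: the fts_solr value is a space-separated
list holding a url=... pair plus bare option names. The Python sketch
below only illustrates that layout; it is not the plugin's parser, and
the function name is made up for the example.

```python
def parse_fts_solr(value):
    """Split an fts_solr setting value into (url, set of option flags).

    Illustrative only: assumes tokens are whitespace-separated and
    that everything other than url=... is a bare flag.
    """
    url = None
    flags = set()
    for token in value.split():
        if token.startswith("url="):
            url = token[len("url="):]
        else:
            flags.add(token)
    return url, flags


url, flags = parse_fts_solr(
    "url=http://solrhost:8983/solr/ index-attachments manual-update"
)
print(url)
print(sorted(flags))
```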

There is also a new setting for the plugin section that filters which
MIME types are indexed:
 * fts_solr_mimetype
       Files with one of the listed MIME types will be sent to Solr.
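
The filtering idea can be sketched in a few lines of Python. This is an
illustration under assumptions, not the patch's logic: it assumes the
setting is a whitespace-separated list, that matching is exact on the
type/subtype, and that comparison is case-insensitive. The function
names are hypothetical.

```python
def parse_mimetypes(setting_value):
    """Turn an fts_solr_mimetype value into a set of allowed types."""
    return {t.lower() for t in setting_value.split()}


def should_index(content_type, allowed):
    """Return True if an attachment's Content-Type is in the allowed set.

    Parameters such as '; name="file.pdf"' are dropped before matching
    (an assumption of this sketch).
    """
    base = content_type.split(";", 1)[0].strip().lower()
    return base in allowed


allowed = parse_mimetypes(
    "application/x-pdf "
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
print(should_index('application/x-pdf; name="report.pdf"', allowed))
print(should_index("image/png", allowed))
```

Under these assumptions, a type that is not listed (image/png above) is
simply skipped and never sent to Solr.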

After integrating the provided solr directory into your Solr config,
and building Dovecot with fts-solr support and with
fts-solr-attachments-r885.patch applied, update your Dovecot config by
adding the following to dovecot.conf:

...
mail_plugins = $mail_plugins fts fts_solr

plugin {
   fts = solr
   fts_solr = url=http://solrhost:8983/solr/ break-imap-search index-attachments
   fts_solr_mimetype = application/x-pdf application/vnd.openxmlformats-officedocument.wordprocessingml.document
}
...



-- 
Antonio Pérez-Aranda Alcaide
aperezaranda at yaco.es

Yaco Sistemas S.L.
http://www.yaco.es/
C/ Rioja 5, 41001 Sevilla
Teléfono +34 954 50 00 57
Fax      +34 954 50 09 29

