Indexing mail attachments with Dovecot + Solr.
This patch has been tested with these versions:
- dovecot 2.0.9
- apache-solr 1.4.1
This is a patch for the fts-solr plugin, which indexes mail messages for Dovecot with Solr. Upstream, the plugin does not index attachments. With this patch, you can index mails and their attachments (PDF, docs, OpenOffice docs...). The patch and the provided Solr config also give you other goodies, such as synonyms and stemming (Spanish by default).
Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler).
Synonyms and stemming are provided by Solr Language Analysis filters such as SnowballPorterFilterFactory.
We have tested Solr with Tomcat and Jetty. In our tests, Tomcat handled UTF-8 and large POST requests better.
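To make the Solr Cell step more concrete, here is a minimal sketch of how a request URL for the ExtractingRequestHandler could be built. It is an illustration, not what the patch does internally. Assumptions: the handler is mounted at /update/extract (the usual location in the Solr example config), the target field is called "body", and the helper name and document id are hypothetical. The parameters literal.*, uprefix, fmap.* and commit are standard Solr Cell parameters.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, commit=False):
    """Build a Solr Cell URL for one attachment (hypothetical helper)."""
    params = {
        "literal.id": doc_id,    # value stored in the document's unique key
        "uprefix": "attr_",      # prefix for extracted fields the schema lacks
        "fmap.content": "body",  # ASSUMPTION: Tika's text goes to a "body" field
    }
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# "msg-42-part-2" is a made-up id, used only for illustration.
url = build_extract_url("http://solrhost:8983/solr/", "msg-42-part-2")
print(url)
```

The attachment bytes would then be sent as the body of a POST to that URL. Attachments can be large, which is where the servlet container's POST size limits matter.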
Supported attachment file formats:
At present, attachments nested inside other attachments (for example, attachments inside forwarded "eml" attachments) are not indexed. Also keep in mind that there are many file types, and many variants of each type. For example, some PDF files cannot be read by Solr's PDF reader.
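The nesting limitation can be illustrated with Python's standard email package. This sketch only shows what "attachments in attachments" means; it is not the plugin's code. The function name and the sample message are hypothetical.

```python
from email.message import EmailMessage


def classify_attachments(msg):
    """Return (top_level, nested) attachment filenames for a message."""
    top_level, nested = [], []
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            # A forwarded mail: its own attachments are one level deeper.
            inner = part.get_payload(0)
            for sub in inner.walk():
                if sub.get_filename():
                    nested.append(sub.get_filename())
        else:
            top_level.append(part.get_filename())
    return top_level, nested


# Build a forwarded mail that carries its own PDF attachment.
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")
inner.add_attachment(b"%PDF-1.4", maintype="application", subtype="pdf",
                     filename="nested.pdf")

outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(b"data", maintype="application",
                     subtype="octet-stream", filename="top.bin")
outer.add_attachment(inner)  # attached as message/rfc822

top_level, nested = classify_attachments(outer)
print("top-level:", top_level)
print("nested:", nested)
```

In these terms, only files like those in the first list are candidates for indexing. Files in the second list are the ones currently skipped.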
Config:
Two new options are added to the fts_solr property:
- index-attachments: Enable attachment indexing.
- manual-update: Do not index when a user searches. You can trigger indexing yourself with the doveadm search or doveadm index commands.
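With manual-update enabled, indexing has to be started from outside, for example from a cron job. The sketch below only builds the command lines. The helper names, user and mailbox are hypothetical, and the exact arguments accepted by doveadm depend on your Dovecot version, so check doveadm help before running anything.

```python
import shlex


def doveadm_index_argv(user, mailbox):
    """Command line asking Dovecot to index one mailbox for one user."""
    return ["doveadm", "index", "-u", user, mailbox]


def doveadm_search_argv(user, mailbox, text):
    """Command line for a text search in one mailbox of one user."""
    return ["doveadm", "search", "-u", user, "mailbox", mailbox, "text", text]


# "bob" and "INBOX" are example values, not taken from the patch.
commands = [
    doveadm_index_argv("bob", "INBOX"),
    doveadm_search_argv("bob", "INBOX", "invoice"),
]
for argv in commands:
    # Quote each argument so the printed line is safe to paste in a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the list form to subprocess.run avoids shell quoting problems with mailbox names that contain spaces.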
There is also a new property for the plugin section that filters which MIME types are indexed:
- fts_solr_mimetype: Attachments with one of the listed MIME types are sent to Solr.
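The effect of the MIME type filter can be sketched as follows. This is an assumption about the behaviour, not the plugin's actual code: the value is treated as a space-separated list (as in the sample config), and matching is exact and case-insensitive after dropping Content-Type parameters. The function names are hypothetical.

```python
def parse_mimetypes(setting):
    """Split a space-separated fts_solr_mimetype value into a set."""
    return {mimetype.lower() for mimetype in setting.split()}


def should_send_to_solr(content_type, allowed):
    """True if the attachment's Content-Type is in the allowed set."""
    # Drop parameters such as '; name="file.pdf"' before comparing.
    base = content_type.split(";", 1)[0].strip().lower()
    return base in allowed


allowed = parse_mimetypes(
    "application/x-pdf "
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
print(should_send_to_solr('application/x-pdf; name="report.pdf"', allowed))
print(should_send_to_solr("image/png", allowed))
```

Under these assumptions, a type that is not listed (for example application/pdf when only application/x-pdf is configured) would not be sent, so list every variant you expect to receive.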
After merging the provided solr directory into your Solr configuration, and building Dovecot with fts-solr support and with fts-solr-attachments-r885.patch applied, update your Dovecot configuration by adding the following to dovecot.conf:
...
mail_plugins = $mail_plugins fts fts_solr

plugin {
  fts = solr
  fts_solr = url=http://solrhost:8983/solr/ break-imap-search index-attachments
  fts_solr_mimetype = application/x-pdf application/vnd.openxmlformats-officedocument.wordprocessingml.document
}
...
-- Antonio Pérez-Aranda Alcaide aperezaranda@yaco.es
Yaco Sistemas S.L. http://www.yaco.es/ C/ Rioja 5, 41001 Sevilla Teléfono +34 954 50 00 57 Fax +34 954 50 09 29