[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

Antonio Perez-Aranda aperezaranda at yaco.es
Mon May 23 14:14:01 EEST 2011


Sorry, I forgot to include the attachment.

2011/5/23 Antonio Perez-Aranda <aperezaranda at yaco.es>:
> Indexing mail attachments with Dovecot + Solr.
>
> This patch has been tested with these versions:
>  * dovecot 2.0.9
>  * apache-solr 1.4.1
>
> This is a patch for the fts-solr plugin (which indexes mail messages
> for Dovecot with Solr). Upstream, the plugin does not index
> attachments; with this patch, you can index mails and their
> attachments (pdf, docs, OpenOffice docs...). You also get other
> goodies with this patch and the provided Solr config, such as
> synonyms and stemming (Spanish by default).
>
> Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler)
>  * http://wiki.apache.org/solr/ExtractingRequestHandler
>
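
To illustrate what a request to the ExtractingRequestHandler looks like, here is a minimal Python sketch that builds (but does not send) a POST to /update/extract. It is not taken from the patch: the function name, the document id and the use of "id" as the unique key field are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, payload, content_type):
    """Build (but do not send) a POST to Solr Cell's /update/extract handler."""
    # literal.<field>=<value> sets a fixed field on the indexed document next
    # to the text that Tika extracts. Using "id" as the key field is an
    # assumption; it depends on the Solr schema in use.
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    url = solr_url.rstrip("/") + "/update/extract?" + query
    return Request(url, data=payload,
                   headers={"Content-Type": content_type}, method="POST")


# Hypothetical id and payload, for illustration only.
req = build_extract_request("http://solrhost:8983/solr/", "msg-42-part-2",
                            b"%PDF-1.4 ...", "application/pdf")
```

Passing the request to urllib.request.urlopen() would actually submit it to a running Solr instance.
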
> Synonyms and Stemming are provided by SnowballPorterFilterFactory from
> Solr Language Analysis:
>  * http://wiki.apache.org/solr/LanguageAnalysis
>
> We have tested Solr with Tomcat and Jetty. Tomcat handles UTF-8 and
> larger POST requests better.
>
> Supported attachment file formats:
>  * http://tika.apache.org/0.9/formats.html
>
> At present, attachments inside attachments (for example, attachments
> in forwarded "eml" attachments) are not indexed. Also, keep in mind
> that there are many file types, and many variants of the same file
> type. For example, some PDF files are "not readable" by Solr's PDF
> reader.
>
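
To make the "attachments in attachments" limitation concrete, here is a small Python sketch using the standard email package. It only illustrates the idea and is not how the patch walks the message: the function name is hypothetical, and it simply lists the attachments of a message while refusing to descend into an attached message/rfc822 part.

```python
from email.message import EmailMessage


def attachments_outside_nested_messages(part, found=None):
    """Collect attachment filenames, skipping anything inside a forwarded message."""
    if found is None:
        found = []
    if part.get_content_maintype() == "message":
        # A forwarded "eml": its own attachments are deliberately not visited.
        return found
    if part.is_multipart():
        for child in part.get_payload():
            attachments_outside_nested_messages(child, found)
    elif part.get_content_disposition() == "attachment":
        found.append(part.get_filename())
    return found


# Build a message with one direct attachment and one forwarded message
# that carries its own attachment.
inner = EmailMessage()
inner["Subject"] = "forwarded"
inner.set_content("inner body")
inner.add_attachment(b"x", maintype="application", subtype="pdf",
                     filename="nested.pdf")

outer = EmailMessage()
outer.set_content("outer body")
outer.add_attachment(b"y", maintype="application", subtype="pdf",
                     filename="top.pdf")
outer.add_attachment(inner)  # attached as message/rfc822

names = attachments_outside_nested_messages(outer)
```

Here only top.pdf is collected; nested.pdf sits inside the forwarded message and is skipped.
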
> Config:
>
> Two new options are added to the fts_solr setting:
>  * index-attachments
>       Enable attachment indexing.
>  * manual-update
>       Do not index when a user searches. You can trigger indexing
>       with the doveadm search or doveadm index commands.
>
> There is a new setting for the plugin section that filters the MIME
> types you want to index:
>  * fts_solr_mimetype
>       Files with these MIME types will be sent to Solr.
>
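
As a rough sketch of what such a MIME type filter amounts to, the Python below treats the fts_solr_mimetype value as a space-separated allow-list. The function names are hypothetical, and the exact, case-insensitive matching is an assumption; the patch itself may match differently.

```python
def parse_mimetype_filter(setting):
    """Split a space-separated fts_solr_mimetype value into a set of types."""
    return {token.lower() for token in setting.split()}


def should_send_to_solr(content_type, allowed):
    """Return True if the part's MIME type is in the allow-list."""
    # Drop parameters such as '; name="report.pdf"' before comparing.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in allowed


allowed = parse_mimetype_filter(
    "application/x-pdf "
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
```

With this allow-list, a part declared as application/x-pdf would be sent to Solr and an image/png part would not.
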
> After integrating the solr directory into your Solr config, and building
> Dovecot with fts-solr support and with fts-solr-attachments-r885.patch
> applied, update your Dovecot config by adding the following to your
> dovecot.conf:
>
> ...
> mail_plugins = $mail_plugins fts fts_solr
>
> plugin {
>   fts = solr
>   fts_solr = url=http://solrhost:8983/solr/ break-imap-search index-attachments
>   fts_solr_mimetype = application/x-pdf application/vnd.openxmlformats-officedocument.wordprocessingml.document
> }
> ...
>
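
Note that the fts_solr value is a single line of space-separated tokens: one url=... entry plus bare option names. The Python sketch below shows that structure. It is an illustration only, with a hypothetical function name, and not Dovecot's actual parser.

```python
def parse_fts_solr(setting):
    """Split an fts_solr value into its url and its set of bare flags."""
    url = None
    flags = set()
    for token in setting.split():
        if token.startswith("url="):
            url = token[len("url="):]
        else:
            flags.add(token)
    return url, flags


url, flags = parse_fts_solr(
    "url=http://solrhost:8983/solr/ break-imap-search index-attachments"
)
```

For the example config, this gives the Solr URL plus the break-imap-search and index-attachments flags.
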
>
>
> --
> Antonio Pérez-Aranda Alcaide
> aperezaranda at yaco.es
>
> Yaco Sistemas S.L.
> http://www.yaco.es/
> C/ Rioja 5, 41001 Sevilla
> Teléfono +34 954 50 00 57
> Fax      +34 954 50 09 29
>



-------------- next part --------------
A non-text attachment was scrubbed...
Name: fts-solr-attachments-r885.tar.gz
Type: application/x-gzip
Size: 28370 bytes
Desc: not available
URL: <http://dovecot.org/pipermail/dovecot/attachments/20110523/c89bac0c/attachment-0001.gz>
