Dovecot fts script with solr

Aki Tuomi aki.tuomi at dovecot.fi
Mon May 21 12:39:49 EEST 2018



On 06.05.2018 13:13, André Rodier wrote:
> Hello again,
>
> I have created a parser script, a little bit more advanced than the
> one provided with Dovecot. The main feature is probably to index
> documents inside zip/rar/tgz archives...
>
> I am using Ansible, swaks and doveadm to run automatic tests for each
> supported content. For specific reasons, I am not yet able to add
> Apache Tika to the distribution. However, I already made some tests
> with it. For now, I want to talk about the indexing script.
>
> I have also noticed a few odd behaviours. I describe them at the
> end, although I am not 100% sure where they come from. I realised
> last week that using QEMU snapshots was not working as expected, so I
> am now more careful with this feature.
>
> For interested developers and users, and to help the Dovecot team
> members understand my questions, here is how the tests work:
>
> To run my tests, I have a set of files in various formats, with a UUID
> inside. They are office files, text files, or even archives with a
> text file inside...
>
> The first test I am running is the script alone. I check that the
> script can convert the file to text, and then I use grep to check the
> UUID is present. This works *perfectly* for all the content, except
> ppt, but it's minor.
>
> The second test is a full end-to-end run:
> - I use swaks to send the email with an attachment and the appropriate
> mime type.
> - I then refresh the index using doveadm rescan.
> - I check that fts search returns a line, with doveadm fts search.
> - I then expunge the mailbox to be sure that the next test is valid.
>
> For the second test, it works almost all the time, except in the
> following situations:
> - When the attachment is an email (mime type message/rfc822)
> - RTF (could be a bug in my script)
> - Text file in UTF-16 (even if this file is converted to UTF-8)
>
> *Questions:*
> 1 - Is there any limitation or special case for the mime message/rfc822

Not that I can see in the decoder.

> 2 - Is the mime type received coming from the email headers?

The MIME type comes from the mail header, unless it is
"application/octet-stream", in which case autodetection is attempted
based on the file suffix.
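
To illustrate the rule above, here is a minimal Python sketch. It is not
Dovecot's code; the function names, the "content/type suffix" line format
of the listing and the exact string comparison are assumptions made for
the example.

```python
# Illustrative sketch only, not Dovecot's implementation.
# Assumption: the script, called without arguments, prints one
# "content/type suffix" pair per line.

OCTET_STREAM = "application/octet-stream"


def parse_mappings(listing):
    """Turn 'content/type suffix' lines into a {suffix: content_type} dict."""
    mappings = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 2:
            content_type, suffix = parts
            mappings[suffix.lower()] = content_type
    return mappings


def effective_type(header_type, filename, mappings):
    """Return the type to decode with, or None if it cannot be determined."""
    # The header value is used as-is unless it is the generic type.
    if header_type != OCTET_STREAM:
        return header_type
    # Generic type: fall back to the file suffix, if there is one.
    _, dot, suffix = filename.rpartition(".")
    if not dot:
        return None
    return mappings.get(suffix.lower())


if __name__ == "__main__":
    m = parse_mappings("application/pdf pdf\napplication/msword doc\n")
    print(effective_type(OCTET_STREAM, "report.pdf", m))
```

In this sketch, an attachment declared as "application/octet-stream" is
only decoded when its suffix appears in the listing.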

> 3 - When the script is called without arguments, what is the purpose
> of the extension at the end of each supported mime types?

The idea is to provide mappings for the decoder, so that if the content
type is "application/octet-stream", autodetection can be performed.

> 4 - Can I return a wildcard in the supported mime types, for instance
> "text/* *" ?

Content type matching is done with strcmp, so a wildcard entry will not
match. That is probably a bit suboptimal; I will make a note of this.
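
As a rough illustration in Python (again not Dovecot's code, and the
function names are made up for the example), an exact comparison treats
"text/*" as a literal string, whereas glob-style matching would be needed
for the wildcard to take effect:

```python
# Illustrative sketch only: exact comparison versus glob-style matching.
from fnmatch import fnmatchcase


def exact_match(supported, actual):
    """Mimic a strcmp-style test: the two strings must be identical."""
    return supported == actual


def wildcard_match(pattern, actual):
    """Hypothetical alternative: treat the supported type as a glob."""
    return fnmatchcase(actual, pattern)


if __name__ == "__main__":
    print(exact_match("text/*", "text/plain"))     # False
    print(wildcard_match("text/*", "text/plain"))  # True
```

With exact comparison, each content type therefore has to be listed
explicitly.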

> 5 - I would like to handle attachments of type
> application/octet-stream. I have added "application/octet-stream *",
> but I am not sure whether Dovecot will pass attachments with this
> MIME type or not.
>

application/octet-stream is already handled in the code.

> *Notes:*
> 1 - I used netcat to monitor the Solr server. I noticed that
> sometimes, the data sent to the Solr server only contains the headers
> of the email, not the text returned by the parser. Especially with
> rfc822 messages. I will do more tests.
> 2 - I have just finished writing the script. It is not yet
> refactored, but at least it is well documented. I will do a full
> security audit later. I am currently testing an associated AppArmor
> profile.
> 3 - I will run more intensive tests of the script on bigger mailboxes
> with more attachments.
> 4 - I may rewrite the script in Python
> 5 - Suggestions welcome.
>
> I initially attached the current version of the script, but the email
> is probably pending review... In that case, the latest development
> version is on GitHub:
> https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/files/fts/decode2text
> The configuration of supported MIME types is a simple file, available
> on GitHub as well:
> https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/templates/fts/mime-supported.conf
>
> Thanks for your advice or suggestions.

Aki Tuomi
Dovecot oy


