Dovecot fts script with solr

Sun May 6 13:13:01 EEST 2018

Hello again,

I have created a parser script, a little bit more advanced than the one 
provided with Dovecot. The main feature is probably to index documents 
inside zip/rar/tgz archives...

I am using Ansible, swaks and doveadm to run automatic tests for each 
supported content type. For specific reasons, I am not yet able to add 
Apache Tika to the distribution. However, I have already run some tests 
with it. For now, I want to talk about the indexing script.

I have also noticed a few weird behaviours. I will mention them at the 
end, although I am not 100% sure where they are coming from. I realised 
last week that using QEMU snapshots was not working as expected, so I am 
now more careful with this feature.

For developers or users who might be interested, and so that the Dovecot 
team members can understand my questions, here is how the tests work:

To run my tests, I have a set of files in various formats, with a UUID 
inside. They are office files, text files, or even archives with a text 
file inside...

The first test I am running is the script alone. I check that the script 
can convert the file to text, and then I use grep to check the UUID is 
present. This works *perfectly* for all the content, except ppt, but 
it's minor.
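
A rough sketch of this first test in Python, for illustration only (the 
real tests are Ansible tasks). It assumes the parser reads the attachment 
on stdin and writes UTF-8 text on stdout; `converted_text_contains` and 
`ECHO_CMD` are hypothetical names, and the echo command is only a 
stand-in for the real script:

```python
import subprocess
import sys


def converted_text_contains(converter_cmd, data, marker):
    """Pipe `data` to the converter and check that `marker` is in its output."""
    result = subprocess.run(
        converter_cmd,
        input=data,
        stdout=subprocess.PIPE,
        check=True,  # a non-zero exit status from the converter raises an error
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return marker in text


# Stand-in converter that copies stdin to stdout unchanged. Replace it with
# the real parser command and its MIME type argument (assumed interface).
ECHO_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
```

Calling `converted_text_contains(cmd, file_bytes, uuid)` once per sample 
file plays the same role as the grep check.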

The second test is full:
- I use swaks to send the email with an attachment and the appropriate 
mime type.
- I then refresh the index using doveadm rescan.
- I check that fts search returns a line, with doveadm fts search.
- I then expunge the mailbox to be sure that the next test is valid.
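
To make the first step concrete, this is roughly the kind of message 
that ends up being sent. It is a minimal sketch that uses Python's 
standard `email` package instead of swaks; `build_test_message`, the 
addresses and the file name are placeholders, not part of my test suite:

```python
from email.message import EmailMessage


def build_test_message(sender, recipient, filename, payload, mime_type):
    """Build a message with one attachment declared with the given MIME type."""
    maintype, subtype = mime_type.split("/", 1)
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "FTS test: " + filename
    msg.set_content("Test message with one attachment.")
    # The declared type ends up in the Content-Type header of the attachment
    # part; `payload` must be bytes.
    msg.add_attachment(
        payload, maintype=maintype, subtype=subtype, filename=filename
    )
    return msg
```

Printing `msg.as_string()` shows the Content-Type header carried by the 
attachment part.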

For the second test, it works almost all the time, except in the 
following situations:
- When the attachment is an email (mime type message/rfc822)
- RTF (could be a bug in my script)
- Text file in UTF16 (Even if this file is converted to UTF8)
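
For the UTF-16 case, the conversion I mean is along these lines. This is 
a minimal Python sketch, not the code from the script; `utf16_to_utf8` 
is a hypothetical helper and it assumes the file starts with a 
byte-order mark:

```python
def utf16_to_utf8(data):
    """Decode UTF-16 bytes (the BOM selects the byte order) and re-encode as UTF-8."""
    # Python's "utf-16" codec consumes the BOM, so the result has none.
    return data.decode("utf-16").encode("utf-8")
```

A plain byte search for the UUID fails on the raw UTF-16 file, because 
every ASCII character is stored as two bytes, but succeeds on the 
converted output.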

*Questions:*
1 - Is there any limitation or special case for the MIME type 
message/rfc822?
2 - Does the MIME type passed to the script come from the email headers?
3 - When the script is called without arguments, what is the purpose of 
the extension at the end of each supported MIME type?
4 - Can I return a wildcard in the supported MIME types, for instance 
"text/* *"?
5 - I would like to handle attachments of type 
application/octet-stream. I have added "application/octet-stream *", but 
I am not sure whether Dovecot will pass attachments with this MIME type 
to the script or not.

*Notes:*
1 - I used netcat to monitor the Solr server. I noticed that sometimes 
the data sent to the Solr server only contains the headers of the email, 
not the text returned by the parser, especially with rfc822 messages. I 
will do more tests.
2 - I have just finished writing the script. It is not yet refactored, 
but at least it is well documented. I will do a full security audit 
later. I am currently testing an associated AppArmor profile.
3 - I will run more intensive tests of the script on bigger mailboxes 
with more attachments.
4 - I may rewrite the script in Python.
5 - Suggestions welcome.

I initially attached the current version of the script, but that email 
is probably pending review. In the meantime, the latest development 
version is on GitHub: 
https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/files/fts/decode2text
The list of supported MIME types is configured in a simple file, also 
available on GitHub: 
https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/templates/fts/mime-supported.conf

Thanks for any advice or suggestions.