Dovecot fts script with solr
Hello again,
I have created a parser script, a little bit more advanced than the one provided with Dovecot. The main feature is probably to index documents inside zip/rar/tgz archives...
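As an illustration only, here is a minimal Python sketch of how the members of an archive could be enumerated before each one is converted to text. It is not the script discussed in this thread: the function name `iter_archive_members` is hypothetical, only zip and tar (including tgz) are covered because the standard library has no rar support, and the archive is assumed to fit in memory.

```python
import io
import tarfile
import zipfile


def iter_archive_members(data: bytes):
    """Yield (name, content) for each regular file in a zip or tar archive."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield info.filename, archive.read(info)
        return
    buffer.seek(0)
    # Mode "r:*" lets tarfile detect gzip/bz2/xz compression by itself.
    with tarfile.open(fileobj=buffer, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()
```

Each yielded member could then be passed to whatever converter handles its type.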
I am using Ansible, swaks and doveadm to run automatic tests for each supported content. For specific reasons, I am not yet able to add Apache Tika to the distribution. However, I already made some tests with it. For now, I want to talk about the indexing script.
I have also noticed a few weird behaviours. I will mention them at the end, although I am not 100% sure where they come from. I realised last week that QEMU snapshots were not working as expected, so I am now more careful with this feature.
For developers or users who may be interested, and so that the Dovecot team members can understand my questions, here is how the tests work:
To run my tests, I have a set of files in various formats, with a UUID inside. They are office files, text files, or even archives with a text file inside...
The first test I am running is the script alone. I check that the script can convert the file to text, and then I use grep to check the UUID is present. This works *perfectly* for all the content, except ppt, but it's minor.
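The check described above (convert, then grep for the UUID) could look roughly like the following Python sketch. The function name `text_contains_uuid` and the regular expression are assumptions made for illustration; the actual test uses grep on the script's output.

```python
import re

# Canonical textual UUID form: 8-4-4-4-12 hexadecimal digits.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def text_contains_uuid(text: str, expected_uuid: str) -> bool:
    """Return True if the converted text contains the expected UUID."""
    found = {match.group(0).lower() for match in UUID_PATTERN.finditer(text)}
    return expected_uuid.lower() in found
```

Comparing case-insensitively avoids false negatives if a converter changes the case of the hexadecimal digits.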
The second test is full:
- I use swaks to send the email with an attachment and the appropriate mime type.
- I then refresh the index using doveadm rescan.
- I check that fts search returns a line, with doveadm fts search.
- I then expunge the mailbox to be sure that the next test is valid.
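The first step of this list, sending a message whose attachment carries an explicit mime type, can be sketched in Python with the standard `email` package. This is only an illustration: the tests described here use swaks, and the function name `build_test_message`, the subject line and the body text are hypothetical.

```python
from email.message import EmailMessage


def build_test_message(sender, recipient, filename, payload, mime_type):
    """Build a message with one attachment labelled with the given mime type."""
    maintype, subtype = mime_type.split("/", 1)
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"FTS test: {filename}"
    message.set_content("Test message with one attachment.")
    # The Content-Type header of the attachment part is set explicitly,
    # so the receiving side sees exactly the type under test.
    message.add_attachment(
        payload, maintype=maintype, subtype=subtype, filename=filename
    )
    return message
```

The resulting object could then be handed to `smtplib.SMTP.send_message` to deliver it to the test mailbox.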
For the second test, it works almost all the time, except in the following situations:
- When the attachment is an email (mime type message/rfc822)
- RTF (could be a bug in my script)
- Text file in UTF16 (Even if this file is converted to UTF8)
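For the UTF16 case, a conversion step along the following lines is one way to normalise the text before it is returned for indexing. This is a hedged sketch, not the conversion used by the script: the function name `to_utf8` is hypothetical, and it assumes the UTF-16 file starts with a byte order mark and that anything without one is already UTF-8.

```python
import codecs


def to_utf8(data: bytes) -> bytes:
    """Return the text re-encoded as UTF-8 without a byte order mark."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and selects the endianness.
        text = data.decode("utf-16")
    elif data.startswith(codecs.BOM_UTF8):
        text = data.decode("utf-8-sig")
    else:
        # Assumption: input without a BOM is already UTF-8.
        text = data.decode("utf-8")
    return text.encode("utf-8")
```

A leftover byte order mark or stray NUL bytes in the output are worth ruling out when a converted file still fails to be found by the search.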
*Questions:*
1 - Is there any limitation or special case for the mime type message/rfc822?
2 - Does the mime type received by the script come from the email headers?
3 - When the script is called without arguments, what is the purpose of the extension at the end of each supported mime type?
4 - Can I return a wildcard in the supported mime types, for instance "text/* *"?
5 - I would like to handle attachments of type application/octet-stream. I have added "application/octet-stream *", but I am not sure whether Dovecot will pass attachments with this mime type to the script.
*Notes:*
1 - I used netcat to monitor the Solr server. I noticed that sometimes the data sent to the Solr server only contains the headers of the email, not the text returned by the parser, especially with rfc822 messages. I will do more tests.
2 - I have just finished writing the script. It is not yet refactored, but at least it is well documented. I will do a full security audit later. I am currently testing an associated AppArmor profile.
3 - I will run more intensive tests of the script on bigger mailboxes with more attachments.
4 - I may rewrite the script in Python.
5 - Suggestions welcome.
I initially attached the current version of the script, but the email is probably pending review... In any case, the latest development version is on GitHub: https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dov... The configuration of supported mime types is a simple file, also available on GitHub: https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dov...
Thanks for any advice or suggestions.
On 06.05.2018 13:13, André Rodier wrote:
*Questions:* 1 - Is there any limitation or special case for the mime type message/rfc822?
Not that I can see in the decoder.
2 - Is the mime type received coming from the email headers?
The mime type received comes from the mail header, unless it is "application/octet-stream", in which case autodetection is attempted based on the file suffix.
3 - When the script is called without arguments, what is the purpose of the extension at the end of each supported mime types?
The idea is to provide mappings for the decoder, so that if the content type is "application/octet-stream", autodetection can be performed.
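The behaviour described in these two answers can be sketched as follows. This is a rough Python illustration, not Dovecot's decoder code: the function names are hypothetical, and the listing is assumed to consist of lines of the form "content-type extension", as suggested by question 3.

```python
import os


def parse_supported_types(listing: str) -> dict:
    """Map extension -> content type from lines of the form '<type> <ext>'."""
    mapping = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2:
            content_type, extension = fields
            mapping[extension.lower()] = content_type.lower()
    return mapping


def effective_content_type(header_type, filename, by_extension):
    """Use the header type, falling back to the suffix for octet-stream."""
    header_type = header_type.lower()
    if header_type != "application/octet-stream":
        return header_type
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    # Unknown suffix: keep the generic type from the header.
    return by_extension.get(extension, header_type)
```

Under these assumptions, an attachment named report.pdf sent as application/octet-stream would be treated as application/pdf, provided the listing contains a pdf entry.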
4 - Can I return a wildcard in the supported mime types, for instance "text/* *" ?
Content type matching is done with strcmp, which is probably a bit suboptimal. I will have to make a note of this.
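To make the difference concrete, the following Python sketch contrasts exact string comparison (the equivalent of strcmp) with glob-style matching. It only illustrates why a "text/*" entry would not match under exact comparison; the function names are hypothetical and nothing here reflects how Dovecot might implement wildcards.

```python
from fnmatch import fnmatchcase


def exact_match(content_type, supported):
    """strcmp-like behaviour: the type must equal one entry verbatim."""
    return content_type.lower() in (entry.lower() for entry in supported)


def wildcard_match(content_type, patterns):
    """Glob-style behaviour: 'text/*' matches any text subtype."""
    content_type = content_type.lower()
    return any(fnmatchcase(content_type, p.lower()) for p in patterns)
```

With exact comparison, "text/*" is just another literal string, so each subtype has to be listed explicitly.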
5 - I would like to handle attachments of types application/octet-stream. I have added "application/octet-stream *", but I am not sure if dovecot will pass the attachments with these mime type or not.
application/octet-stream is already handled in the code.
Aki Tuomi Dovecot oy