Re: Dovecot fts script with solr

21 May 2018

      On 06.05.2018 13:13, André Rodier wrote:
...
Hello again,
I have created a parser script, a little bit more advanced than the
one provided with Dovecot. The main feature is probably to index
documents inside zip/rar/tgz archives...
I am using Ansible, swaks and doveadm to run automatic tests for each
supported content. For specific reasons, I am not yet able to add
Apache Tika to the distribution. However, I already made some tests
with it. For now, I want to talk about the indexing script.
I also have noticed a few weird behaviours. I will mention them at the
end, albeit I am not 100% sure where they are coming from. I realised
last week that using QEMU snapshots was not working as expected, so I
am now more careful with this feature.
For the developers or users who may be interested, and so that the
Dovecot team members can understand my questions, here is how the
tests work:
To run my tests, I have a set of files in various formats, with a UUID
inside. They are office files, text files, or even archives with a
text file inside...
The first test I am running is the script alone. I check that the
script can convert the file to text, and then I use grep to check the
UUID is present. This works *perfectly* for all the content, except
ppt, but it's minor.
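The first test described above can be sketched roughly as follows. This is only an illustration: the script path, the calling convention (MIME type as first argument, file content on standard input) and the function names are assumptions, not taken from the actual script.

```python
import subprocess


def text_contains_uuid(text: str, uuid: str) -> bool:
    """Case-insensitive check, the equivalent of `grep -i` for the UUID."""
    return uuid.lower() in text.lower()


def script_extracts_uuid(script: str, mime_type: str, path: str, uuid: str) -> bool:
    """Run the parser script on one file and look for the UUID in its output.

    Assumption: the script takes the MIME type as its first argument and
    reads the attachment from stdin. Adjust to the real calling convention.
    """
    with open(path, "rb") as fh:
        result = subprocess.run(
            [script, mime_type], stdin=fh, stdout=subprocess.PIPE, check=True
        )
    # Undecodable bytes are replaced so that a bad conversion fails the
    # UUID check instead of raising an exception.
    return text_contains_uuid(result.stdout.decode("utf-8", errors="replace"), uuid)
```

Keeping the text check separate from the script invocation makes it possible to test the comparison without any sample file.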
The second test is full:

I use swaks to send the email with an attachment and the appropriate
mime type.
I then refresh the index using doveadm rescan.
I check that fts search returns a line, with doveadm fts search.
I then expunge the mailbox to be sure that the next test is valid.
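
As a rough sketch, the four steps above could be driven from Python like this. The exact commands used in the real tests are not shown in this message, so the swaks and doveadm invocations below are assumptions (including the mailbox name INBOX and the use of "doveadm fts rescan" and "doveadm search"), and swaks attachment syntax differs between versions.

```python
def build_test_commands(user, server, attachment, mime_type, uuid):
    """Return one argument vector per step of the end-to-end test.

    All parameters are hypothetical test values supplied by the caller.
    """
    return [
        # 1. Send the mail with the attachment and an explicit MIME type
        #    (newer swaks releases may want the file given as "@path").
        ["swaks", "--to", user, "--server", server,
         "--attach-type", mime_type, "--attach", attachment],
        # 2. Ask Dovecot to rescan the full-text index for this user.
        ["doveadm", "fts", "rescan", "-u", user],
        # 3. Search message bodies for the UUID; any output line is a hit.
        ["doveadm", "search", "-u", user, "mailbox", "INBOX", "body", uuid],
        # 4. Empty the mailbox so that the next test starts clean.
        ["doveadm", "expunge", "-u", user, "mailbox", "INBOX", "all"],
    ]
```

Each vector can then be passed to subprocess.run(cmd, check=True), with the output of step 3 captured to decide whether the test passed.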

For the second test, it works almost all the time, except in the
following situations:

When the attachment is an email (mime type message/rfc822)
RTF (could be a bug in my script)
Text file in UTF16 (Even if this file is converted to UTF8)

*Questions:*
1 - Is there any limitation or special case for the MIME type
message/rfc822?
Not that I can see in the decoder.
...
2 - Is the mime type received coming from the email headers?
Mime type received comes from mail header, unless it's
"application/octet-stream", in which case autodetection is attempted
based on file suffix.
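To illustrate the rule just described, here is a minimal sketch. It is not Dovecot's code: the function name, the extension map and the lower-casing of the suffix are assumptions made for the example.

```python
import os

OCTET_STREAM = "application/octet-stream"


def effective_mime_type(header_type: str, filename: str, ext_map: dict) -> str:
    """Pick the MIME type to hand to the decoder.

    header_type: the type declared in the mail header.
    ext_map: hypothetical mapping from file suffix to MIME type.
    """
    if header_type != OCTET_STREAM:
        # The declared type is used as-is.
        return header_type
    # Generic type: try to guess from the file suffix instead.
    ext = os.path.splitext(filename or "")[1].lstrip(".").lower()
    return ext_map.get(ext, header_type)
```

With this sketch, a PDF sent as application/octet-stream under the name report.pdf would be treated as application/pdf, provided the map contains an entry for "pdf".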
...
3 - When the script is called without arguments, what is the purpose
of the extension at the end of each supported MIME type?
The idea is to provide mappings for the decoder, so that if the content
type is "application/octet-stream", autodetection can be performed.
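A small sketch of how such a listing can be turned into a suffix lookup table. It assumes one "mime/type extension" pair per line, which matches the description above but is not verified against the actual script output; the function name is made up for the example.

```python
def parse_supported_types(listing: str) -> dict:
    """Build an extension -> MIME type map from the script's listing.

    Assumption: each non-empty line has exactly two whitespace-separated
    fields, the MIME type followed by the file extension.
    """
    mapping = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        mime_type, ext = parts
        # Keep the first type seen if an extension is listed twice.
        mapping.setdefault(ext, mime_type)
    return mapping
```

The resulting dictionary is the kind of table needed to resolve an application/octet-stream attachment from its file name.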
...
4 - Can I return a wildcard in the supported mime types, for instance
"text/* *" ?
Content type matching is done with strcmp, so a wildcard entry is
compared literally, which is probably a bit suboptimal. I will have to
make a note of this.
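The difference between exact comparison and wildcard matching can be shown with a short sketch. The glob-style variant is only an illustration of what was asked for; it is not something Dovecot is stated to support, and both function names are hypothetical.

```python
from fnmatch import fnmatchcase


def exact_match(supported, content_type):
    """strcmp-like behaviour: the strings must be identical."""
    return content_type in supported


def wildcard_match(patterns, content_type):
    """Hypothetical alternative: shell-style patterns such as text/*."""
    return any(fnmatchcase(content_type, pattern) for pattern in patterns)
```

With exact comparison, an entry "text/*" only matches a content type that is literally "text/*", so it never matches "text/plain".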
...
5 - I would like to handle attachments of type
application/octet-stream. I have added "application/octet-stream *",
but I am not sure whether Dovecot will pass attachments with this MIME
type or not.
application/octet-stream is already handled in code.
...
*Notes:*
1 - I used netcat to monitor the solr server. I realise that
sometimes, the data sent to the solr server only contains the headers
of the email, not the text returned by the parser. Especially with
rfc822 messages. I will do more tests.
2 - I have just finished writing the script. It is not yet refactored,
but at least it is well documented. I will do a full security audit
later. I am currently testing an associated AppArmor profile.
3 - I will run more intensive tests of the script on bigger mailboxes
with more attachments.
4 - I may rewrite the script in Python.
5 - Suggestions welcome.
I initially attached the current version of the script, but the email
is probably pending review. In that case, the latest development
version is on GitHub:
https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dov...
The configuration of supported MIME types is a simple file, available
on GitHub as well:
https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dov...
Thanks for your advice and suggestions.
Aki Tuomi
Dovecot Oy