Hi
As mentioned in a previous thread, the solr + tika combination has caused me some issues due to attachment size. While tika seems to be able to parse large attachments, the resulting volume of text can overwhelm the solr server.
One solution would be to throw resources at the problem, but in my case such large attachments don't contain anything worth indexing. Additionally, I don't want people to be able to randomly crash my solr server by sending large compressed attachments that expand into huge volumes of text for solr. Having sane limits on what can be indexed is also a safety feature.
Attached is a first attempt to address the problem. I did not find an easy way to get the actual attachment sizes, so I used a piece of information that was already available: the overall message size. It may not be ideal, but it at least introduces limits where none existed.
I have introduced two new parameters for the plugin section, for example:
plugin {
  fts_max_size = 2M
  fts_max_size_tika = 1M
}
They can be used separately or together. Both sizes refer to the overall message size. The meaning is:
fts_max_size - do not parse message bodies if the message size exceeds this value. A value of 0 indicates no limit. If the message body is not parsed, attachments are also not parsed.
fts_max_size_tika - do not parse message attachments with tika if the message size exceeds this value. A value of 0 indicates no limit.
If both settings are used, it makes sense to have fts_max_size > fts_max_size_tika. If fts_max_size is the smaller of the two, any message above it is not indexed at all (neither body nor attachments), so fts_max_size_tika has no effect.
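To make the interaction between the two limits concrete, here is a minimal sketch of the decision logic as described above. It is not the code from the patch (which is C inside the fts plugin); the function name indexing_plan and the constant MB are made up for illustration, and I am assuming that 2M means 2 * 1024 * 1024 bytes.

```python
MB = 1024 * 1024  # assumption: "1M" in the config means 1024 * 1024 bytes


def indexing_plan(message_size, fts_max_size=0, fts_max_size_tika=0):
    """Return (index_body, use_tika) for a message of message_size bytes.

    Hypothetical helper, not part of the patch. Both limits are compared
    against the overall message size, and a limit of 0 means "no limit".
    """
    def within(limit):
        # A message is skipped only if its size exceeds a non-zero limit.
        return limit == 0 or message_size <= limit

    index_body = within(fts_max_size)
    # If the body is not parsed, attachments are not parsed either.
    use_tika = index_body and within(fts_max_size_tika)
    return index_body, use_tika


if __name__ == "__main__":
    # With fts_max_size = 2M and fts_max_size_tika = 1M:
    for size in (512 * 1024, 1536 * 1024, 3 * MB):
        print(size, indexing_plan(size, 2 * MB, 1 * MB))
```

With the example settings, a 512K message gets body and attachments indexed, a 1.5M message gets only the body indexed, and a 3M message is skipped entirely. Swapping the two values (fts_max_size = 1M, fts_max_size_tika = 2M) shows why the tika limit then never matters: everything over 1M is already skipped.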
The difference (fts_max_size - fts_max_size_tika) places an upper bound on the size of the non-attachment body text that will be indexed. However, any attachments over fts_max_size will automatically consume this limit, and no body text will be indexed for those messages. I've only updated the tika parser, not the script parser, though the script parser could potentially benefit from this approach.
The attached patch also includes the rolled-up patch for using basic auth with the tika server and the previously posted patch (not mine) which fixes an assert when using solr and tika together.
John