On 15/11/2020 21:54, PGNet Dev wrote:
On 11/15/20 12:21 PM, John Fawcett wrote:
I'm using tika-server.jar installed as a service

yup. same here.

atm, listening on localhost, with Dovecot -> Tika direct, no proxy.

similarly fragile under load.  throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.

one at a time seems OK ...

Dovecot currently implements separate integrations, first the
attachments are sent to tika, then the results are sent to solr.
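
To make the two-step model concrete, here is a rough sketch in Python. It is not what Dovecot does internally, just an illustration of the order of operations. The URLs, the core name "dovecot", the field names "id" and "body", and the helper names are all assumptions for the example. Tika server is assumed to accept a PUT on /tika and Solr a JSON POST on its update handler.

```python
import json
import urllib.request

# Assumed endpoints: adjust to the local install.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/dovecot/update"


def tika_request(attachment: bytes, content_type: str) -> urllib.request.Request:
    """Step 1: build the request that hands the raw attachment to Tika."""
    return urllib.request.Request(
        TIKA_URL,
        data=attachment,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """Step 2: build the request that sends the extracted text to Solr."""
    body = json.dumps([{"id": doc_id, "body": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def http_send(req: urllib.request.Request) -> str:
    """Send a request and return the response body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", "replace")


def index_attachment(doc_id, attachment, content_type, send=http_send):
    """Tika first, then Solr: the two calls are independent of each other."""
    text = send(tika_request(attachment, content_type))
    send(solr_request(doc_id, text))
    return text
```

Since the two requests only share the extracted text, nothing ties the Tika and Solr endpoints to the same host.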

ah, so tika first ...

The two could even be running on separate servers.

Not sure when that's a useful use case.  I can certainly see a separate, integrated solr+tika server.

Extremely heavy loads, I guess.
Not sure when it would be useful, but that was just to underline the current integration model for Dovecot.

Yes that could be an alternative way, so instead of sending the
attachments to tika, send the attachments to solr and let it send them
to tika. It would be more than configuration in Dovecot though.

yup.  taking a look at solr cell + tika integration to see where the config makes most sense.

this is a useful 1st read

  https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

It's an approach that could be worthwhile looking into, though not using solr cell, given the following statements at that link:

"If any exceptions cause the ExtractingRequestHandler and/or Tika to crash, Solr as a whole will also crash because the request handler is running in the same JVM that Solr uses for other operations.

Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other files that have a lot of rich media embedded in them.

For these reasons, Solr Cell is not recommended for use in a production system."


Yes, I think limits on Dovecot are useful in any case, otherwise you end
up sending arbitrary sized files across the network to have them thrown
away on the server.

point taken.

afaict, fts_solr has only a batch_size limit -- but neither a total message size nor an attachment size limit.

Yes, batch_size was an attempt to introduce some configurable limit. If attachments are being sent across it may not be sufficient.
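
Just to illustrate the kind of client-side check I mean, here is a small Python sketch. The two limits and the function name are hypothetical; fts_solr has no such settings today. The idea is simply to decide before anything goes over the network.

```python
from typing import Iterable, List, Tuple

# Hypothetical limits, not existing fts_solr settings.
MAX_ATTACHMENT_BYTES = 5 * 1024 * 1024
MAX_MESSAGE_BYTES = 20 * 1024 * 1024


def select_attachments(
    attachments: Iterable[Tuple[str, int]],
    max_attachment: int = MAX_ATTACHMENT_BYTES,
    max_message: int = MAX_MESSAGE_BYTES,
) -> Tuple[List[str], List[str]]:
    """Split (name, size) pairs into those to send and those to skip.

    An attachment is skipped if it exceeds the per-attachment limit or
    if sending it would push the running total past the per-message limit.
    """
    send: List[str] = []
    skip: List[str] = []
    total = 0
    for name, size in attachments:
        if size > max_attachment or total + size > max_message:
            skip.append(name)
            continue
        send.append(name)
        total += size
    return send, skip
```

Skipped attachments never leave the mail server, so nothing is transferred only to be thrown away at the other end.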

John