[patch] enhancement for tika server protected by user/password basic auth
Hi

The existing code is designed for a tika server on localhost, or on a remote host that is either accessible to all or, for example, protected by IP restrictions via a proxy in front of it. I've configured a tika server behind an apache proxy which enforces basic auth, but sending basic auth credentials for a tika server is not currently supported by Dovecot. The following patch allows user and password to be specified in the fts_tika url in much the same way you can for fts_solr:

fts_tika = https://user:password@tika_host/tika

John

--- dovecot-2.3.11.3-orig/src/plugins/fts/fts-parser-tika.c	2020-08-12 14:20:41.000000000 +0200
+++ dovecot-2.3.11.3/src/plugins/fts/fts-parser-tika.c	2020-11-15 15:18:24.351281064 +0100
@@ -57,7 +57,7 @@
 	tuser = p_new(user->pool, struct fts_parser_tika_user, 1);
 	MODULE_CONTEXT_SET(user, fts_parser_tika_user_module, tuser);
 
-	if (http_url_parse(url, NULL, 0, user->pool,
+	if (http_url_parse(url, NULL, HTTP_URL_ALLOW_USERINFO_PART, user->pool,
 			   &tuser->http_url, &error) < 0) {
 		i_error("fts_tika: Failed to parse HTTP url %s: %s", url, error);
 		return -1;
@@ -152,6 +152,11 @@
 		http_url->host.name,
 		t_strconcat(http_url->path, http_url->enc_query, NULL),
 		fts_tika_parser_response, parser);
+	if (http_url->user != NULL) {
+		http_client_request_set_auth_simple(
+			http_req, http_url->user, http_url->password);
+	}
+
 	http_client_request_set_port(http_req, http_url->port);
 	http_client_request_set_ssl(http_req, http_url->have_ssl);
 	if (parser_context->content_type != NULL)
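For context, the relevant plugin configuration with the patch applied would sit alongside an existing fts_solr setup, roughly like this; hostnames, credentials and the solr core path are placeholders, not a tested setup:

  mail_plugins = $mail_plugins fts fts_solr

  plugin {
    fts = solr
    fts_solr = url=https://solr_host/solr/dovecot/
    # user:password in the URL is only honoured with the patch above applied
    fts_tika = https://user:password@tika_host/tika
  }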
On 11/15/20 6:33 AM, John Fawcett wrote:
I've configured a tika server behind an apache proxy which enforces basic auth, but sending basic auth credentials for a tika server is not currently supported by Dovecot.
i was _just_ setting up a tika instance behind a nginx proxy with basicauth in place.
hadn't yet gotten to the "can't pass auth creds in dovecot" bit. thx! for the patch; hopefully the premise/patch will get picked up. (ya-request for a proper @dovecot public bug/issue queue!)
have you found any other 'magic required' to get solr & tika indexing text/attachments, respectively, in Dovecot context? is it as straightforward as spec'ing the 'fts_solr' & 'fts_tika' urls, and Dovecot does the passing-around correctly?
On 15/11/2020 15:49, PGNet Dev wrote:
On 11/15/20 6:33 AM, John Fawcett wrote:
I've configured a tika server behind an apache proxy which enforces basic auth, but sending basic auth credentials for a tika server is not currently supported by Dovecot.
i was _just_ setting up a tika instance behind a nginx proxy with basicauth in place.
hadn't yet gotten to the "can't pass auth creds in dovecot" bit. thx! for the patch; hopefully the premise/patch will get picked up. (ya-request for a proper @dovecot public bug/issue queue!)
have you found any other 'magic required' to get solr & tika indexing text/attachments, respectively, in Dovecot context? is it as straightforward as spec'ing the 'fts_solr' & 'fts_tika' urls, and Dovecot does the passing-around correctly?
I've just started using tika myself, but from my tests, it's as simple as adding fts_tika to a working solr integration.
John
On 15/11/2020 18:10, John Fawcett wrote:
On 15/11/2020 15:49, PGNet Dev wrote:
On 11/15/20 6:33 AM, John Fawcett wrote:
I've configured a tika server behind an apache proxy which enforces basic auth, but sending basic auth credentials for a tika server is not currently supported by Dovecot. i was _just_ setting up a tika instance behind a nginx proxy with basicauth in place.
hadn't yet gotten to the "can't pass auth creds in dovecot" bit. thx! for the patch; hopefully the premise/patch will get picked up. (ya-request for a proper @dovecot public bug/issue queue!)
have you found any other 'magic required' to get solr & tika indexing text/attachments, respectively, in Dovecot context? is it as straightforward as spec'ing the 'fts_solr' & 'fts_tika' urls, and Dovecot does the passing-around correctly? I've just started using tika myself, but from my tests, it's as simple as adding fts_tika to a working solr integration.
John
Just a couple of updates about Tika and Solr together.

1. On mass reindexing I'm seeing panics - see below. These are present with Dovecot 2.3.10 and 2.3.11.3. They seem to go away with the fix which was previously posted on this list by Josef 'Jeff' Sipek, which I repeat below for ease of reference.

2. On mass reindexing my Tika server seems to get a bit overwhelmed. I think I'll need to look into how resources are allocated and do some tuning. This produces 502 Proxy Error responses back to Dovecot.

As far as Dovecot integration with Tika goes, I believe that some resource limits would be helpful. I think it would make sense to have a limit in Dovecot on the maximum file size it will try to send to Tika. Potentially, it could also be useful to allow configuration of the types of file to send to Tika. For example, I see lots of image files going across, but I'd probably be happy not to have them indexed. It won't be perfect, since those file types could exist inside zip files, but it would maybe cut out a bit of the load.

John

Nov 15 17:58:19 server02 dovecot: indexer-worker(user@example.com)<11132><kMrwLCpesV98KwAAAJEHgA>: Panic: file http-client-request.c: line 1235 (http_client_request_send_more): assertion failed: (req->payload_input != NULL)
Nov 15 17:58:19 server02 dovecot: indexer-worker(user@example.com)<11132><kMrwLCpesV98KwAAAJEHgA>: Error: Raw backtrace: /usr/local/lib/dovecot/libdovecot.so.0(backtrace_append+0x42) [0x7f87c271adf2] -> /usr/local/lib/dovecot/libdovecot.so.0(backtrace_get+0x1e) [0x7f87c271aefe] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xec44e) [0x7f87c272544e] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xec4f1) [0x7f87c27254f1] -> /usr/local/lib/dovecot/libdovecot.so.0(i_fatal+0) [0x7f87c267c4ea] -> /usr/local/lib/dovecot/libdovecot.so.0(http_client_request_send_more+0x3dd) [0x7f87c26c449d] -> /usr/local/lib/dovecot/libdovecot.so.0(http_client_connection_output+0xf1) [0x7f87c26c8bf1] -> /usr/local/lib/dovecot/libssl_iostream_openssl.so(+0x918f) [0x7f87bea4818f] -> /usr/local/lib/dovecot/libdovecot.so.0(+0x115710) [0x7f87c274e710] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x65) [0x7f87c273db65] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run_internal+0x12b) [0x7f87c273f4ab] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x59) [0x7f87c273dc69] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_run+0x38) [0x7f87c273dea8] -> /usr/local/lib/dovecot/libdovecot.so.0(+0x8a9c6) [0x7f87c26c39c6] -> /usr/local/lib/dovecot/libdovecot.so.0(http_client_request_send_payload+0x2c) [0x7f87c26c3c4c] -> /usr/local/lib/dovecot/lib20_fts_plugin.so(+0xdbdd) [0x7f87c1a1abdd] -> /usr/local/lib/dovecot/lib20_fts_plugin.so(fts_parser_more+0x27) [0x7f87c1a19b67] -> /usr/local/lib/dovecot/lib20_fts_plugin.so(+0xa951) [0x7f87c1a17951] -> /usr/local/lib/dovecot/lib20_fts_plugin.so(fts_build_mail+0x54) [0x7f87c1a182b4] -> /usr/local/lib/dovecot/lib20_fts_plugin.so(+0x11502) [0x7f87c1a1e502] -> /usr/local/lib/dovecot/libdovecot-storage.so.0(mail_precache+0x2e) [0x7f87c2a2519e] -> dovecot/indexer-worker [user@example.com Sent](+0x2834) [0x55bd6355f834] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x65) [0x7f87c273db65] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run_internal+0x12b) [0x7f87c273f4ab] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x59) [0x7f87c273dc69] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_run+0x38) [0x7f87c273dea8] -> /usr/local/lib/dovecot/libdovecot.so.0(master_service_run+0x13) [0x7f87c26ad383] -> dovecot/indexer-worker [user@example.com Sent](main+0xd7) [0x55bd6355f227] -> /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f87c228d555] -> dovecot/indexer-worker [user@example.com Sent](+0x22ee) [0x55bd6355f2ee]

diff --git a/src/plugins/fts-solr/solr-connection.c b/src/plugins/fts-solr/solr-connection.c
index ae720b5e2870a852c1b6c440939e3c7c0fa72b5c..9d364f93e2cd1b716b9ab61bd39656a6c5b1ea04 100644
--- a/src/plugins/fts-solr/solr-connection.c
+++ b/src/plugins/fts-solr/solr-connection.c
@@ -103,7 +103,7 @@ int solr_connection_init(const struct fts_solr_settings *solr_set,
 		http_set.ssl = ssl_client_set;
 		http_set.debug = solr_set->debug;
 		http_set.rawlog_dir = solr_set->rawlog_dir;
-		solr_http_client = http_client_init(&http_set);
+		solr_http_client = http_client_init_private(&http_set);
 	}
 
 	*conn_r = conn;
diff --git a/src/plugins/fts/fts-parser-tika.c b/src/plugins/fts/fts-parser-tika.c
index a4b8b5c3034f57e22e77caa759c090da6b62f8ba..b8b57a350b9a710d101ac7ccbcc14560d415d905 100644
--- a/src/plugins/fts/fts-parser-tika.c
+++ b/src/plugins/fts/fts-parser-tika.c
@@ -77,7 +77,7 @@ tika_get_http_client_url(struct mail_user *user, struct http_url **http_url_r)
 		http_set.request_timeout_msecs = 60*1000;
 		http_set.ssl = &ssl_set;
 		http_set.debug = user->mail_debug;
-		tika_http_client = http_client_init(&http_set);
+		tika_http_client = http_client_init_private(&http_set);
 	}
 	*http_url_r = tuser->http_url;
 	return 0;
On 11/15/20 11:13 AM, John Fawcett wrote:
Just a couple of updates about Tika and Solr together.
On mass reindexing I'm seeing panics - see below. These are present with Dovecot 2.3.10 and 2.3.11.3. They seem to go away with the fix which was previously posted on this list by Josef 'Jeff' Sipek, which I repeat below for ease of reference.
On mass reindexing my Tika server seems to get a bit overwhelmed. I think I'll need to look into how resources are allocated and do some tuning. This produces 502 Proxy Error responses back to Dovecot.
Which tika instance are you running on the backend?
The tika-app.jar, with --server? or the JAXRS tika-server.jar?
As far as Dovecot integration with Tika, I believe that some resource limits would be helpful. I think it would make sense to have a limit in Dovecot about the maximum file size it will try to send to Tika. Potentially, it could be useful also to allow configuration of the types of file to send to Tika. For example I see lots of image files going across, but I'd probably be happy not to have them indexed. It won't be perfect, since those file types could exist inside zip files, but maybe would cut out a bit of the load.
Solr itself apparently has 'tika integration' out of the box. Since the solr server instance bundles jetty _anyway_, and it _is_ already up/running ... wondering if the indexing load can be better managed there.
iiuc, limits and types can be specified in solr/tika config directly.
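for example (untested on my side, going from the tika docs), tika-server can be pointed at a config file that excludes parsers you don't care about, e.g. skipping OCR on image attachments:

  java -jar tika-server.jar -c /etc/tika/tika-config.xml

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- tika-config.xml sketch: exclude the OCR parser so image attachments yield no extracted text -->
  <properties>
    <parsers>
      <parser class="org.apache.tika.parser.DefaultParser">
        <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      </parser>
    </parsers>
  </properties>

note that dovecot would still send the images across the wire; this only stops tika from doing anything expensive with them.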
perhaps Dovecot can be configured to send all messages+attachments, and let solr/tika config 'choose' to index just the message, or the attachment as well.
that said, config in Dovecot is certainly convenient.
On 15/11/2020 20:48, PGNet Dev wrote:
On 11/15/20 11:13 AM, John Fawcett wrote:
Just a couple of updates about Tika and Solr together.
On mass reindexing I'm seeing panics - see below. These are present with Dovecot 2.3.10 and 2.3.11.3. They seem to go away with the fix which was previously posted on this list by Josef 'Jeff' Sipek, which I repeat below for ease of reference.
On mass reindexing my Tika server seems to get a bit overwhelmed. I think I'll need to look into how resources are allocated and do some tuning. This produces 502 Proxy Error responses back to Dovecot.
Which tika instance are you running on the backend?
The tika-app.jar, with --server? or the JAXRS tika-server.jar?
I'm using tika-server.jar installed as a service
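A systemd unit along these lines is enough to run it; paths, user and the listen address here are placeholders rather than my exact file:

  [Unit]
  Description=Apache Tika server
  After=network.target

  [Service]
  # bind to localhost only; the apache proxy in front of it enforces basic auth
  ExecStart=/usr/bin/java -jar /opt/tika/tika-server.jar -h 127.0.0.1 -p 9998
  User=tika
  Restart=on-failure

  [Install]
  WantedBy=multi-user.target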
As far as Dovecot integration with Tika, I believe that some resource limits would be helpful. I think it would make sense to have a limit in Dovecot about the maximum file size it will try to send to Tika. Potentially, it could be useful also to allow configuration of the types of file to send to Tika. For example I see lots of image files going across, but I'd probably be happy not to have them indexed. It won't be perfect, since those file types could exist inside zip files, but maybe would cut out a bit of the load.
Solr itself apparently has 'tika integration' out of the box. Since the solr server instance bundles jetty _anyway_, and it _is_ already up/running ... wondering if the indexing load can be better managed there.
Dovecot currently implements separate integrations, first the attachments are sent to tika, then the results are sent to solr. The two could even be running on separate servers.
iiuc, limits and types can be specified in solr/tika config directly.
perhaps Dovecot can be configured to send all messages+attachments, and let solr/tika config 'choose' to index just the message, or the attachment as well.
Yes that could be an alternative way, so instead of sending the attachments to tika, send the attachments to solr and let it send them to tika. It would be more than configuration in Dovecot though.
that said, config in Dovecot is certainly convenient.
Yes, I think limits on Dovecot are useful in any case, otherwise you end up sending arbitrary sized files across the network to have them thrown away on the server.
John
On 11/15/20 12:21 PM, John Fawcett wrote:
I'm using tika-server.jar installed as a service
yup. same here.
atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.
one at a time seems OK ...
Dovecot currently implements separate integrations, first the attachments are sent to tika, then the results are sent to solr.
ah, so tika first ...
The two could even be running on separate servers.
Not sure when that's a useful usecase. I can certainly see a separate, integrated solr+tika server.
Extremely heavy loads, I guess.
Yes that could be an alternative way, so instead of sending the attachments to tika, send the attachments to solr and let it send them to tika. It would be more than configuration in Dovecot though.
yup. taking a look at solr cell + tika integration to see where the config makes most sense.
this is a useful 1st read
https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using...
Yes, I think limits on Dovecot are useful in any case, otherwise you end up sending arbitrary sized files across the network to have them thrown away on the server.
point taken.
afaict, fts_solr has only a batch_size limit -- but no total message size or attachment size limit.
On 15/11/2020 21:54, PGNet Dev wrote:
On 11/15/20 12:21 PM, John Fawcett wrote:
I'm using tika-server.jar installed as a service
yup. same here.
atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.
one at a time seems OK ...
Dovecot currently implements separate integrations, first the attachments are sent to tika, then the results are sent to solr.
ah, so tika first ...
The two could even be running on separate servers.
Not sure when that's a useful usecase. I can certainly see a separate, integrated solr+tika server.
Extremely heavy loads, I guess. Not sure when it would be useful, but that was just to underline the current integration model for Dovecot.
Yes that could be an alternative way, so instead of sending the attachments to tika, send the attachments to solr and let it send them to tika. It would be more than configuration in Dovecot though.
yup. taking a look at solr cell + tika integration to see where the config makes most sense.
this is a useful 1st read
https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using...
It's an approach that could be worth looking into, though not using Solr Cell, given the following statements at that link:
"If any exceptions cause the |ExtractingRequestHandler| and/or Tika to crash, Solr as a whole will also crash because the request handler is running in the same JVM that Solr uses for other operations.
Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other files that have a lot of rich media embedded in them.
For these reasons, Solr Cell is not recommended for use in a production system."
Yes, I think limits on Dovecot are useful in any case, otherwise you end up sending arbitrary sized files across the network to have them thrown away on the server.
point taken.
afaict, fts_solr has only a batch_size limit -- but no total message size or attachment size limit.
Yes, batch_size was an attempt to introduce some configurable limit. If attachments are being sent across, it may not be sufficient.
John
On 11/15/20 1:29 PM, John Fawcett wrote:
atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.
frequently, like this
Nov 15 15:59:40 test.loc tika[35696]: INFO tika/ (message/rfc822)
Nov 15 15:59:41 test.loc tika[35696]: WARN tika/: Text extraction failed (null)
Nov 15 15:59:41 test.loc tika[35696]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:521)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1472)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.Server.handle(Server.java:500)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
Nov 15 15:59:41 test.loc tika[35696]: at java.base/java.lang.Thread.run(Thread.java:832)
Nov 15 15:59:41 test.loc tika[35696]: ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (message/rfc822)
Nov 15 15:59:41 test.loc tika[35696]: WARN tika/: Text extraction failed (Tried to contact you | Quote #Q4889744.eml)
Nov 15 15:59:41 test.loc tika[35696]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:521)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1472)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
Nov 15 15:59:41 test.loc tika[35696]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.Server.handle(Server.java:500)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
Nov 15 15:59:41 test.loc tika[35696]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
Nov 15 15:59:41 test.loc tika[35696]: at java.base/java.lang.Thread.run(Thread.java:832)
Nov 15 15:59:41 test.loc tika[35696]: ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (image/jpeg)
Nov 15 15:59:41 test.loc tika[35696]: INFO tika/ (image/png)
seems fts_tika isn't going to be a well-behaved black box.
pulling it out of dovecot usage for now, to setup a standalone instance and throw test attachments at it directly ...
On 16/11/2020 01:14, PGNet Dev wrote:
On 11/15/20 1:29 PM, John Fawcett wrote:
atm, listening on localhost, with Dovecot -> Tika direct, no proxy.
similarly fragile under load. throwing ~10 messages with .5-5MB attachments at it at once causes all sorts of complaints.
frequently, like this
<snip>
seems fts_tika isn't going to be a well-behaved black box.
pulling it out of dovecot usage for now, to setup a standalone instance and throw test attachments at it directly ...
I have to admit that despite all the warnings and errors in the Tika log, that was the part that gave me the least difficulty. Once Tika runs out of memory I start to see 502s returned to Dovecot, but this does not ultimately block indexing on Dovecot, since after a restart the emails that were not indexed are resubmitted. I also suppose it can be resolved by adding more resources.
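For instance, raising the JVM heap in the service's ExecStart is probably the first thing to try; the value below is just a guess and depends on mailbox contents and concurrency:

  ExecStart=/usr/bin/java -Xmx2g -jar /opt/tika/tika-server.jar -h 127.0.0.1 -p 9998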
My main issue is the following example, which blocks indexing of the relevant folder. When reindexing a specific sent folder that had a 4.3MB zip attachment containing 132MB of files, Tika passed back 139MB of output to Dovecot, which then sent 228MB of output to Solr. I got back a 502 error from the apache proxy for that and haven't worked out the reason. However, these files contain nothing worth indexing. I'd be happy to skip indexing any attachment larger than, say, 1MB (in terms of the original file, the output from Tika, or the output to send to Solr).
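Something along the lines of a new plugin setting would cover it; this is purely illustrative, no such setting exists today:

  plugin {
    # hypothetical, not implemented: skip sending attachments above this size to Tika
    fts_tika_max_attachment_size = 1M
  }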
John