This looks like zero-bytes are getting passed to Tika via dovecot. I don't know enough about dovecot to figure out what's going on.
On Sat, Jul 30, 2022 at 7:51 PM PGNet Dev pgnet.dev@gmail.com wrote:
i'm running
dovecot 2.3.19.1 + fts
tika-server-standard 2.4.1
dovecot is feeding tika backend via fts_tika
when dovecot passes data with *.eml attachments embedded, tika fails to correctly parse/extract content
not clear if the issue is with tika, or what dovecot's passing in this case.
other non-.eml attachments are fine.
here's the current failing procedure,
(1) create a simple pdf
enscript -p mime.ps /etc/mime.types
ps2pdf mime.ps mime.pdf
(2) send an email *with* mime.pdf attachment to the test user
echo "test" | mailx -s "test" -a ./mime.pdf testuser@example.com
tika processes OK
journalctl -f -u tika
...
Jul 30 19:09:24 mx-test tika[19682]: INFO [qtp2112135199-30] 19:09:24,165 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
...
save the just-received email with .pdf attachment as mime.eml
(3) send an email with NO .pdf attachment
save the just-received email, with NO .pdf attachment, as mime2.eml
(4) send an email with mime.eml attachment, containing the embedded mime.pdf
echo "test" | mailx -s "test" -a ./mime.eml testuser@example.com
tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:00 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:00,834 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:00 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:00,840 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime.eml)
Jul 30 19:28:00 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:00 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:00,845 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(5) send an email with mime2.eml attachment, WITHOUT an embedded .pdf
echo "test" | mailx -s "test" -a ./mime2.eml testuser@example.com
again, tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:33 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:33,607 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:33 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:33,616 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime2.eml)
Jul 30 19:28:33 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:33 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:33,630 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(6) submit mime.eml directly to tika
curl -T ./mime.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:08 mx-test tika[20049]: INFO [qtp2112135199-34] 19:30:08,073 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
(7) submit mime2.eml directly to tika
curl -T ./mime2.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:52 mx-test tika[20049]: INFO [qtp2112135199-30] 19:30:52,349 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
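One difference between the two paths is visible in the logs: the requests coming from dovecot are logged as "(message/rfc822)", while the curl uploads in steps (6) and (7) are logged as "(autodetecting type)". The sketch below builds a PUT request that declares the type explicitly, and refuses an empty body, which is the condition tika reports as ZeroByteFileException. It is only a rough approximation of what fts_tika might send; the helper names build_tika_request and extract_text are made up for illustration, and the Accept header is an assumption.

```python
import urllib.request

# Endpoint taken from steps (6) and (7).
TIKA_URL = "http://127.0.0.1:9998/tika"


def build_tika_request(body: bytes,
                       content_type: str = "message/rfc822",
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying `body` to tika-server."""
    if len(body) == 0:
        # An empty body is what tika rejects with ZeroByteFileException,
        # so fail early on the client side instead.
        raise ValueError("refusing to send a zero-byte body")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Declaring the type mirrors the "(message/rfc822)" log entries;
        # asking for text/plain is an assumption about the desired output.
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def extract_text(path: str) -> str:
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read())
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling extract_text("mime.eml") against a running tika-server would show whether the declared message/rfc822 type alone changes the outcome, independent of dovecot.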
(8) where,
cat mime.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 18:53:38 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: multipart/mixed;
 boundary="=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+"
Message-Id: <4LwKS35QWSzWf3Q@mx-test.example.com>

This is a multi-part message in MIME format.

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

test

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="mime.pdf"

JVBERi0xLjQKJcfsj6IKJSVJbnZvY2F0aW9uOiBwYXRoL2dzIC1QLSAtZFNBRkVSIC1kQ29t
...
Rgo=

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+--
and,
cat mime2.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 19:14:59 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <4LwKwh5brVzWf3Q@mx-test.example.com>

test
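To narrow down whether the message/rfc822 part is already empty in the stored mail or only becomes empty on the way to tika, it may help to list each MIME part with its size. The following is a minimal sketch using only Python's standard email package. The names build_sample and describe_parts are hypothetical, and the sample message is a stand-in for the mail from step (5), not the actual delivered message.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a stand-in message: a text body plus an attached .eml."""
    inner = EmailMessage()
    inner["From"] = "msmtp@pgnd.example.com"
    inner["To"] = "testuser@example.com"
    inner["Subject"] = "test"
    inner.set_content("test\n")

    outer = EmailMessage()
    outer["From"] = "msmtp@pgnd.example.com"
    outer["To"] = "testuser@example.com"
    outer["Subject"] = "test"
    outer.set_content("test\n")
    # Attaching a message object yields a message/rfc822 part.
    outer.add_attachment(inner, filename="mime2.eml")
    return outer.as_bytes()


def describe_parts(raw: bytes):
    """Return (content type, filename, size in bytes) for each part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "message/rfc822":
            # The parsed payload is a list holding the embedded message.
            size = len(part.get_payload(0).as_bytes())
        elif part.is_multipart():
            continue  # pure container, no content of its own
        else:
            size = len(part.get_payload(decode=True) or b"")
        rows.append((ctype, part.get_filename(), size))
    return rows


if __name__ == "__main__":
    for ctype, name, size in describe_parts(build_sample()):
        print(ctype, name, size)
```

To check a real message, replace build_sample() with the bytes of the delivered mail read from the mailbox. A non-zero size for the message/rfc822 part there would suggest the content is lost somewhere between the stored mail and the request tika receives.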