tika 2.4.1 'Text extraction failed' errors when dovecot+fts 2.3.19.1 passes embedded *.eml (message/rfc822) files; org.apache.tika.parser.mail.RFC822Parser or dovecot?
i'm running
dovecot 2.3.19.1 + fts
tika-server-standard 2.4.1
dovecot is feeding tika backend via fts_tika
when dovecot passes data with *.eml attachments embedded, tika fails to correctly parse/extract content
not clear if the issue is with tika, or what dovecot's passing in this case.
other non-.eml attachments are fine.
here's the current failing procedure,
(1) create a simple pdf
enscript -p mime.ps /etc/mime.types
ps2pdf mime.ps mime.pdf
(2) send an email *with* mime.pdf attachment to testuser@example.com
echo "test" | mailx -s "test" -a ./mime.pdf testuser@example.com
tika processes OK
journalctl -f -u tika
...
Jul 30 19:09:24 mx-test tika[19682]: INFO [qtp2112135199-30] 19:09:24,165 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
...
save the just-received email with .pdf attachment as mime.eml
(3) send an email with NO .pdf attachment
save the just-received email, without .pdf attachment, as mime2.eml
(4) send an email with mime.eml attachment, containing the embedded mime.pdf
echo "test" | mailx -s "test" -a ./mime.eml testuser@example.com
tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:00 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:00,834 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:00 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:00,840 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime.eml)
Jul 30 19:28:00 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:00 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:00,845 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(5) send an email with mime2.eml attachment, WITHOUT an embedded .pdf
echo "test" | mailx -s "test" -a ./mime2.eml testuser@example.com
again, tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:33 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:33,607 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:33 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:33,616 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime2.eml)
Jul 30 19:28:33 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:33 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:33,630 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
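The two traces above differ only in timestamp and file name, so the stack frames can be dropped when comparing runs. A minimal sketch of such a filter follows; the function name `tika_log_summary` is hypothetical, and it assumes stack frames appear in the journal as `tika[PID]: at ...` lines like those shown above.

```shell
# Hypothetical helper: reduce tika-server journal output to the level lines
# and exception messages, dropping Java stack frames and StatusLogger noise.
# Assumes frames look like "tika[PID]: <whitespace>at some.Class.method(...)".
tika_log_summary() {
    grep -v 'StatusLogger' |
        grep -v -E 'tika\[[0-9]+\]:[[:space:]]+at '
}

# Possible usage (not executed here):
#   journalctl -u tika --since "10 minutes ago" | tika_log_summary
```

With the traces above this should leave the INFO, WARN, ZeroByteFileException and ERROR lines for each request.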
(6) submit mime.eml directly to tika
curl -T ./mime.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:08 mx-test tika[20049]: INFO [qtp2112135199-34] 19:30:08,073 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
(7) submit mime2.eml directly to tika
curl -T ./mime2.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:52 mx-test tika[20049]: INFO [qtp2112135199-30] 19:30:52,349 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
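Note that the direct submissions in (6) and (7) are logged as "(autodetecting type)", while the dovecot-submitted parts are logged as "(message/rfc822)", so the two paths are not sending the same request. A rough sketch for narrowing that gap with curl follows; the function names and the `TIKA_URL` variable are hypothetical, and it assumes the logged type comes from a Content-Type request header.

```shell
# Hypothetical helpers; TIKA_URL defaults to the endpoint used in (6) and (7).
TIKA_URL="${TIKA_URL:-http://127.0.0.1:9998/tika}"

# PUT a file with an explicit Content-Type (assumption: this is what the
# "(message/rfc822)" log lines reflect) instead of letting tika autodetect.
tika_put_typed() {
    curl -sS -T "$1" -H "Content-Type: ${2:-message/rfc822}" "$TIKA_URL"
}

# PUT an empty body with the same Content-Type, to check whether that alone
# produces the same ZeroByteFileException seen for the dovecot-submitted parts.
tika_put_empty() {
    curl -sS -X PUT -H "Content-Type: ${1:-message/rfc822}" --data-binary '' "$TIKA_URL"
}

# Possible usage (not executed here):
#   tika_put_typed ./mime.eml
#   tika_put_empty
```

If `tika_put_typed` extracts text and `tika_put_empty` reproduces the exception, that would point at the request body dovecot sends rather than at the RFC822 parser.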
(8) where,
cat mime.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 18:53:38 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: multipart/mixed;
 boundary="=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+"
Message-Id: <4LwKS35QWSzWf3Q@mx-test.example.com>
This is a multi-part message in MIME format.
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
test
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="mime.pdf"
JVBERi0xLjQKJcfsj6IKJSVJbnZvY2F0aW9uOiBwYXRoL2dzIC1QLSAtZFNBRkVSIC1kQ29t
...
Rgo=
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+--
and,
cat mime2.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 19:14:59 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <4LwKwh5brVzWf3Q@mx-test.example.com>
test
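As a quick sanity check that the saved files themselves are non-empty before they are attached, something like the following can be run; the function name `eml_summary` is hypothetical.

```shell
# Hypothetical helper: for each saved message, print its size in bytes and
# its first Content-Type header line.
eml_summary() {
    for f in "$@"; do
        printf '%s: %s bytes; %s\n' \
            "$f" \
            "$(wc -c < "$f" | tr -d ' ')" \
            "$(grep -m1 -i '^Content-Type:' "$f")"
    done
}

# Possible usage (not executed here):
#   eml_summary ./mime.eml ./mime2.eml
```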
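This looks like zero bytes are getting passed to Tika by dovecot: the ZeroByteFileException is thrown from AutoDetectParser, and no mail parser appears in the stack trace. I don't know enough about dovecot to figure out what's going on.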
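One way to check what is actually sent would be to watch the requests on the wire. The following is only a sketch: the function name `watch_tika_requests` is hypothetical, and it assumes tika-server listens on port 9998 on the loopback interface (as in the curl commands), that traffic is plain HTTP, and that tcpdump is run with sufficient privileges.

```shell
# Hypothetical helper: print the HTTP request line and the headers that
# describe the body for requests going to tika-server.
# $1 = interface (default lo), $2 = tika port (default 9998); both assumptions.
watch_tika_requests() {
    tcpdump -l -A -s 0 -i "${1:-lo}" "tcp dst port ${2:-9998}" 2>/dev/null |
        grep -i -E 'PUT /|^content-type:|^content-length:|^transfer-encoding:'
}

# Possible usage (not executed here): start the watcher, then re-send the
# message with the .eml attachment from step (4).
#   watch_tika_requests lo 9998
```

A `Content-Length: 0` (or an empty chunked body) on the message/rfc822 request would confirm that the empty stream originates on the dovecot side.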
On Sat, Jul 30, 2022 at 7:51 PM PGNet Dev pgnet.dev@gmail.com wrote:
i'm running
dovecot 2.3.19.1 + fts tika-server-standard 2.4.1
dovecot is feeding tika backend via fts_tika
when dovecot passes data with *.eml attachments embedded, tika fails to correctly parse/extract content
not clear if the issue is with tika, or what dovecot's passing in this case.
other non-.eml attachments are fine.
here's the current failing procedure,
(1) create a simple pdf
enscript -p mime.ps /etc/mime.types ps2pdf mime.ps mime.pdf
(2) send an email *with* mime.pdf attachment to
echo "test" | mailx -s "test" -a ./mime.pdf testuser@example.com
tika processes OK
journalctl -f -u tika ... Jul 30 19:09:24 mx-test tika[19682]: INFO
[qtp2112135199-30] 19:09:24,165 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf) ...
save the just-received email with .pdf attachment as mime.eml
(3) send an email with NO .pdf attachment save the just-received email with .pdf attachment as mime2.eml
(4) send an email with mime.eml attachment, containing the embedded mime.pdf
echo "test" | mailx -s "test" -a ./mime.eml testuser@example.com
tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger ... Jul 30 19:28:00 mx-test tika[20049]: INFO
[qtp2112135199-30] 19:28:00,834 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822) Jul 30 19:28:00 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:00,840 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime.eml) Jul 30 19:28:00 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) 
~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 
30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:00 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?] Jul 30 19:28:00 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:00,845 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(5) send an email with mime2.eml attachment, WITHOUT an embedded .pdf
echo "test" | mailx -s "test" -a ./mime.eml testuser@example.com
again, tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger ... Jul 30 19:28:33 mx-test tika[20049]: INFO
[qtp2112135199-30] 19:28:33,607 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822) Jul 30 19:28:33 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:33,616 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime2.eml) Jul 30 19:28:33 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) 
~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1] Jul 30 19:28:33 mx-test tika[20049]: at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:33 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:33,630 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(6) submit mime.eml directly to tika
curl -T ./mime.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:08 mx-test tika[20049]: INFO [qtp2112135199-34] 19:30:08,073 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
(7) submit mime2.eml directly to tika
curl -T ./mime2.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:52 mx-test tika[20049]: INFO [qtp2112135199-30] 19:30:52,349 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
(8) where,
cat mime.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 18:53:38 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: multipart/mixed;
 boundary="=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+"
Message-Id: <4LwKS35QWSzWf3Q@mx-test.example.com>

This is a multi-part message in MIME format.

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

test

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="mime.pdf"

JVBERi0xLjQKJcfsj6IKJSVJbnZvY2F0aW9uOiBwYXRoL2dzIC1QLSAtZFNBRkVSIC1kQ29t ... Rgo=

--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+--
and,
cat mime2.eml
Return-Path: <msmtp@pgnd.example.com>
Delivered-To: testuser@example.com
...
From: msmtp@pgnd.example.com
Date: Sat, 30 Jul 2022 19:14:59 -0400
To: testuser@example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <4LwKwh5brVzWf3Q@mx-test.example.com>

test
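One difference between the dovecot-fed requests and the direct curl tests: the failing requests are logged as "/tika (message/rfc822)", while the curl -T submissions are logged as "/tika (autodetecting type)". That suggests (an assumption, not confirmed in this thread) that dovecot declares a Content-Type and plain curl -T does not. A minimal sketch for sending the saved .eml files with the type declared, to get closer to the dovecot case. TIKA_URL, CURL and tika_put are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: TIKA_URL, CURL and tika_put are hypothetical names, not from the thread.
TIKA_URL="${TIKA_URL:-http://127.0.0.1:9998/tika}"
CURL="${CURL:-curl}"    # overridable so the call can be dry-run with a stub

# PUT a file to tika-server, declaring its type instead of letting Tika autodetect.
# Refuses missing or empty files so an empty upload is never sent by accident.
tika_put() {
    file="$1"
    ctype="${2:-message/rfc822}"
    if [ ! -s "$file" ]; then
        echo "refusing to send missing or empty file: $file" >&2
        return 1
    fi
    "$CURL" -sS -T "$file" \
        -H "Content-Type: $ctype" \
        -H "Accept: text/plain" \
        "$TIKA_URL"
}
```

Usage would be `tika_put ./mime.eml` and `tika_put ./mime2.eml` while watching `journalctl -f -u tika`. If these parse cleanly with the type declared, the declared type alone is unlikely to be the trigger.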
On 8/1/22 9:35 AM, Tim Allison wrote:
This looks like zero-bytes are getting passed to Tika via dovecot. I don't know enough about dovecot to figure out what's going on.
ok. let's see what response comes back from the Dovecot ML.
atm, it's only in the case of submissions with attached/embedded *.eml ...
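A rough way to check the zero-bytes theory from the tika side, independent of dovecot: send an empty body labelled as message/rfc822 and compare the journal output with the failures above. This is a sketch under assumptions. TIKA_URL, CURL and tika_put_empty are hypothetical names, and the HTTP status tika returns for an empty body is not stated in this thread, so the function only prints whatever comes back.

```shell
#!/bin/sh
# Sketch only: TIKA_URL, CURL and tika_put_empty are hypothetical names.
TIKA_URL="${TIKA_URL:-http://127.0.0.1:9998/tika}"
CURL="${CURL:-curl}"    # overridable so the call can be dry-run with a stub

# PUT a zero-length body, labelled as message/rfc822, and print the HTTP status.
tika_put_empty() {
    "$CURL" -sS -X PUT --data-binary '' \
        -H "Content-Type: message/rfc822" \
        -H "Accept: text/plain" \
        -o /dev/null -w '%{http_code}\n' \
        "$TIKA_URL"
}
```

Run `tika_put_empty` with `journalctl -f -u tika | grep -v StatusLogger` in another terminal. If the same ZeroByteFileException appears, that would be consistent with dovecot handing tika an empty stream for the message/rfc822 part.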
On Sat, Jul 30, 2022 at 7:51 PM PGNet Dev <pgnet.dev@gmail.com> wrote:
[...]