tika 2.4.1 'Text extraction failed' errors when dovecot+fts 2.3.19.1 passes embedded *.eml (message/rfc822) files ; org.apache.tika.parser.mail.RFC822Parser or dovecot ?
PGNet Dev
pgnet.dev at gmail.com
Sat Jul 30 23:51:17 UTC 2022
i'm running
dovecot 2.3.19.1 + fts
tika-server-standard 2.4.1
dovecot feeds the tika backend via fts_tika.
when dovecot passes a message with an embedded *.eml (message/rfc822) attachment, tika fails to extract its content.
it's not clear whether the issue is in tika, or in what dovecot passes to it in this case.
other, non-.eml attachments are extracted fine.
here's the current failing procedure,
(1)
create a simple pdf
enscript -p mime.ps /etc/mime.types
ps2pdf mime.ps mime.pdf
(2)
send an email *with* the mime.pdf attachment to testuser
echo "test" | mailx -s "test" -a ./mime.pdf testuser at example.com
tika processes OK
journalctl -f -u tika
...
Jul 30 19:09:24 mx-test tika[19682]: INFO [qtp2112135199-30] 19:09:24,165 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
...
save the just-received email with .pdf attachment as mime.eml
(3)
send an email with NO .pdf attachment
save the just-received email, which has no .pdf attachment, as mime2.eml
(4)
send an email with mime.eml attachment, containing the embedded mime.pdf
echo "test" | mailx -s "test" -a ./mime.eml testuser at example.com
tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:00 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:00,834 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:00 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:00,840 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime.eml)
Jul 30 19:28:00 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:00 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:00 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:00,845 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
(5)
send an email with mime2.eml attachment, WITHOUT an embedded .pdf
echo "test" | mailx -s "test" -a ./mime2.eml testuser at example.com
again, tika fails to extract message/rfc822
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:28:33 mx-test tika[20049]: INFO [qtp2112135199-30] 19:28:33,607 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Jul 30 19:28:33 mx-test tika[20049]: WARN [qtp2112135199-30] 19:28:33,616 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (mime2.eml)
Jul 30 19:28:33 mx-test tika[20049]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:153) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 30 19:28:33 mx-test tika[20049]: at java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 30 19:28:33 mx-test tika[20049]: ERROR [qtp2112135199-30] 19:28:33,630 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$338/0x0000000800eb4a38, ContentType: text/plain
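to condense journal output like the above into one record per request, here's a small python sketch. the regexes are written against the log lines quoted in this post and are an assumption about the format, not something tika documents; `summarize` is a hypothetical helper name.

```python
import re

# Patterns modelled on the journal lines quoted in this post (assumed format).
REQUEST = re.compile(r"TikaResource /tika \((?P<ctype>[^)]+)\)")
FAILURE = re.compile(r"Text extraction failed \((?P<name>[^)]+)\)")
EXCEPTION = re.compile(r"tika\[\d+\]: (?P<exc>[\w.]+Exception): ")


def summarize(lines):
    """Return one dict per /tika request, with any failure that followed it."""
    events = []
    for line in lines:
        m = REQUEST.search(line)
        if m:
            # A new request starts a new record.
            events.append(
                {"content_type": m["ctype"], "failed": None, "exception": None}
            )
            continue
        if not events:
            continue  # nothing to attach a failure to yet
        m = FAILURE.search(line)
        if m:
            events[-1]["failed"] = m["name"]
            continue
        m = EXCEPTION.search(line)
        if m and events[-1]["exception"] is None:
            # Keep only the first exception seen for this request.
            events[-1]["exception"] = m["exc"]
    return events
```

fed the lines from steps (4) and (5), this gives two records, both with content type message/rfc822 and org.apache.tika.exception.ZeroByteFileException.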
(6)
submit mime.eml directly to tika; only the request is logged, with no extraction error
curl -T ./mime.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:08 mx-test tika[20049]: INFO [qtp2112135199-34] 19:30:08,073 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
(7)
submit mime2.eml directly to tika; again only the request is logged, with no extraction error
curl -T ./mime2.eml http://127.0.0.1:9998/tika
journalctl -f -u tika | grep -v StatusLogger
...
Jul 30 19:30:52 mx-test tika[20049]: INFO [qtp2112135199-30] 19:30:52,349 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
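note that the curl tests are logged as "(autodetecting type)", while the dovecot-originated requests are logged as "(message/rfc822)", so the two paths are not quite like-for-like. a minimal python sketch for sending a file with an explicit Content-Type, to get closer to what the logs suggest dovecot sends. the helper names and the Accept header are my own choices; the URL is the one used in the curl tests above.

```python
import urllib.request

TIKA_URL = "http://127.0.0.1:9998/tika"  # endpoint used in the curl tests


def build_tika_request(body: bytes,
                       content_type: str = "message/rfc822",
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request with an explicit Content-Type."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def extract_text(path: str, content_type: str = "message/rfc822") -> str:
    """PUT the file at `path` to tika and return the response body as text.

    urllib raises urllib.error.HTTPError if the server answers with an
    error status.
    """
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read(), content_type)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

`extract_text("./mime.eml")` then exercises the message/rfc822 path directly. sending `build_tika_request(b"")` with `urllib.request.urlopen` would test whether an empty body alone is enough to trigger the ZeroByteFileException; that it does is an assumption based on the exception message, not something verified here.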
(8)
where,
cat mime.eml
Return-Path: <msmtp at pgnd.example.com>
Delivered-To: testuser at example.com
...
From: msmtp at pgnd.example.com
Date: Sat, 30 Jul 2022 18:53:38 -0400
To: testuser at example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: multipart/mixed;
boundary="=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+"
Message-Id: <4LwKS35QWSzWf3Q at mx-test.example.com>
This is a multi-part message in MIME format.
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
test
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="mime.pdf"
JVBERi0xLjQKJcfsj6IKJSVJbnZvY2F0aW9uOiBwYXRoL2dzIC1QLSAtZFNBRkVSIC1kQ29t
...
Rgo=
--=_62e5b672.wAyBX+sGMbS7ZcNv8O/A1QeYuseaJ2NDRf8hfdbm/x8Vayp+--
and,
cat mime2.eml
Return-Path: <msmtp at pgnd.example.com>
Delivered-To: testuser at example.com
...
From: msmtp at pgnd.example.com
Date: Sat, 30 Jul 2022 19:14:59 -0400
To: testuser at example.com
Subject: test
User-Agent: Heirloom mailx 12.5 7/5/10
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <4LwKwh5brVzWf3Q at mx-test.example.com>
test
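since the exception says tika received zero bytes, it may help to look at what the message/rfc822 part of the outer message from steps (4) and (5) actually contains. a sketch using python's stdlib email package: `wrap_as_attachment` builds an outer message roughly like the one mailx -a produces (an approximation, not mailx's exact output), and `describe_parts` lists every MIME part with its decoded size. both names are hypothetical helpers. python models multipart/* and message/rfc822 parts as containers with no body of their own; whether dovecot's fts_tika treats the part the same way, and so hands tika an empty body, is an assumption, not something the logs prove.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def wrap_as_attachment(inner_raw: bytes, filename: str) -> bytes:
    """Build an outer message carrying inner_raw as a message/rfc822 attachment."""
    inner = message_from_bytes(inner_raw, policy=policy.default)
    outer = EmailMessage()
    outer["Subject"] = "test"
    outer.set_content("test\n")
    # Attaching a Message object yields a message/rfc822 part.
    outer.add_attachment(inner, filename=filename)
    return outer.as_bytes()


def describe_parts(raw: bytes):
    """Return (depth, content_type, filename, size) for every MIME part."""
    rows = []

    def visit(part, depth):
        if part.is_multipart():
            # Containers (multipart/*, message/rfc822): the bytes live in
            # the child parts, so the container itself is reported as 0.
            rows.append((depth, part.get_content_type(), part.get_filename(), 0))
            for child in part.get_payload():
                visit(child, depth + 1)
        else:
            payload = part.get_payload(decode=True) or b""
            rows.append(
                (depth, part.get_content_type(), part.get_filename(), len(payload))
            )

    visit(message_from_bytes(raw, policy=policy.default), 0)
    return rows
```

running `describe_parts` on the message dovecot actually stored (rather than the synthetic one) shows whether the nested content is there on disk, which would point at the hand-off to tika rather than at delivery.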