I think with the soon-to-be-next-release of Tika, you can turn off throwing zero-byte file exceptions via the config. The exceptions should be harmless and you can safely ignore them.
Just upgraded to Tika 2.9.0. Testing as below, the same error is thrown.
not certain of the correct config here :-/
I added the following to /etc/tika/tika-server-config-custom.xml:
...
<parser class="org.apache.tika.parser.AutoDetectParserConfig">
  <params>
    <param name="ThrowOnZeroBytes" type="bool">false</param>
  </params>
</parser>
...
Reading
https://downloads.apache.org/tika/2.9.0/CHANGES-2.9.0.txt
* Users may now avoid the ZeroByteFileException via a setting on the AutoDetectParserConfig (TIKA-3976).
restarted tika and re-tested; it still ends with
tika[32035]: Aug 28, 2023 7:13:15 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
tika[32035]: SEVERE: Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$388/0x00007f4fb42aa2d0, ContentType: text/plain
so, atm, dovecot's still sending zero bytes, and tika's still unhappy about it
-------- Original Message --------
From: dovecot@dovecot.org
Sent: at Friday, Aug 18, 2023, 16:06 PM EDT
To: tallison@apache.org
Cc: dovecot@dovecot.org
Subject: Re: [bug] dovecot passes zero byte input stream when passing email with .eml attachment to apache tika parser, causes 'SEVERE' error
soon-to-be-next-release of Tika,
i saw that was coming
you can turn off throwing zero-byte file exceptions via the config
can you point to the config toggle, or docs, in https://github.com/apache/tika ?
The exceptions should be harmless and you can safely ignore them.
including the SEVERE notice?
For some users, they need to know that there's a zero-byte file, hence the default behavior. It can also be useful while doing parser development to find files where embedded files are zero-byte files. Sometimes things go wrong in the container parser.
iiuc, the exception's thrown WHEN input's a zero-byte file.
in this dovecot <-> tika case, that only occurs when the attachment sent is a .eml, not with any other attachment type (so far)
is current-release tika known/verified to handle .eml (iirc, there were some issues a while ago ...), and to not mistakenly munge the input size to zero?
if it's demonstrated OK, then it's likely Dovecot mistakenly sending no input in the .eml-attachment case, no?
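one way to separate the two possibilities would be to send input to tika-server directly, bypassing dovecot. A minimal sketch in Python follows. Assumptions that are not from this thread: tika-server is listening on its default port 9998 on localhost, and the names TIKA_URL, build_tika_request and probe are made up for illustration.

```python
import urllib.error
import urllib.request

# Assumption: tika-server on its default port; adjust to the local setup.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload, url=TIKA_URL,
                       content_type="application/octet-stream"):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": content_type},
    )


def probe(payload, **kwargs):
    """Send the payload to tika-server; return (HTTP status, response body)."""
    req = build_tika_request(payload, **kwargs)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Error responses still carry a status and body worth inspecting.
        return err.code, err.read().decode("utf-8", "replace")
```

calling probe(b"") sends a deliberately empty body, which should show whether the config change has any effect on the zero-byte case. Calling probe(data, content_type="message/rfc822"), with data read from a saved .eml file, should show whether tika handles a non-empty .eml when dovecot is not involved. If the second call returns text, that would point at the input dovecot sends rather than at tika's .eml parsing.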