On 18 Sep 2018, at 13.29, Simone Lazzaris s.lazzaris@interactive.eu wrote:
Hi all, again;
I've enabled core dumps and let it run for some days, waiting for the issue to reoccur.
In the meantime I've also upgraded the poolmon script, as Sami suggested.
It seems that the upgrade has scared the issue away, because it no longer occurred.
Maybe the problem is related to the way the old poolmon talked to the director daemon? I'm not very inclined to downgrade poolmon to catch a traceback, but I can do so if necessary.
Well, maybe it's not necessary ;) I've performed some maintenance operations on the backends and that triggered the crash. It seems that something goes wrong when one backend comes back online.
It's weird how easily you can reproduce the crash. I've run all kinds of (stress) tests and I can't reproduce this crash. I was able to reproduce the original hang though.
Unfortunately, the core was not dumped... and I don't know what else to try: the director service was not chrooted, and ulimit -c is unlimited.
Do you have this set: sysctl -w fs.suid_dumpable=2
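
For reference, a minimal sketch of how the settings mentioned above could be checked in one go. It assumes Linux with /proc mounted; the helper names read_proc and core_limit_of_pid are made up for this example and are not part of Dovecot. Note that ulimit -c in an interactive shell need not match the limit of the already running director process, so the per-process value in /proc/<pid>/limits is usually the more relevant one.

```shell
#!/bin/sh
# Sketch only: print the settings that commonly decide whether a core file
# gets written. Assumes Linux with /proc mounted.

# Hypothetical helper: print the content of a /proc/sys file,
# or "unavailable" if it cannot be read.
read_proc() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "unavailable"
    fi
}

# Hypothetical helper: print the core file size limit of a running process,
# or "unavailable" if /proc/<pid>/limits cannot be read.
core_limit_of_pid() {
    grep -i 'core file size' "/proc/$1/limits" 2>/dev/null || echo "unavailable"
}

echo "fs.suid_dumpable    = $(read_proc /proc/sys/fs/suid_dumpable)"
echo "kernel.core_pattern = $(read_proc /proc/sys/kernel/core_pattern)"
echo "ulimit -c (shell)   = $(ulimit -c)"
echo "limit of this shell : $(core_limit_of_pid $$)"
```

To look at the running daemon instead of the current shell, core_limit_of_pid can be given the PID of the director process. If kernel.core_pattern is a relative path, it may also be worth checking that the process's working directory is writable for the user it runs as.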