Polling Lock Issues in 20.0.1 (With Minions)

Polling Lock Issues in 20.0.1 (With Minions)

Joshua McAdam
Hi,

We upgraded yesterday from 20.0.0 to 20.0.1 to see whether it resolved the hanging pollers described in https://issues.opennms.org/browse/NMS-9466, which Jira lists as fixed in 20.0.1.

The upgrade cleared the IllegalMonitorStateException errors we had been seeing in output.log and the poller log, but it has not fixed the overall problem. Nodes still show failed services that are actually available, or stay stuck in a down state after the outage has cleared, and between the occasional successful polls we see a constant stream of errors like this in the poller log:

2017-07-25 11:37:10,395 INFO  [Poller-Thread-234-of-500] o.o.n.p.p.PollableService: Postponing poll for PollableService[location=XXXXXX, interface=PollableInterface [PollableNode [65]:10.X.X.X], svcName=SSH]
org.opennms.netmgt.poller.pollables.LockUnavailable: Unable to obtain lock for PollableNode [65] within 500 milliseconds
        at org.opennms.netmgt.poller.pollables.PollableNode.obtainTreeLock(PollableNode.java:264) ~[opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.poller.pollables.PollableElement.obtainTreeLock(PollableElement.java:211) ~[opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.poller.pollables.PollableElement.withTreeLock(PollableElement.java:274) ~[opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.poller.pollables.PollableElement.withTreeLock(PollableElement.java:259) ~[opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.poller.pollables.PollableService.doRun(PollableService.java:404) [opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.poller.pollables.PollableService.run(PollableService.java:379) [opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.scheduler.Schedule.run(Schedule.java:142) [opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.scheduler.Schedule$ScheduleEntry.run(Schedule.java:86) [opennms-services-20.0.1.jar:?]
        at org.opennms.netmgt.scheduler.LegacyScheduler$1.run(LegacyScheduler.java:179) [opennms-services-20.0.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at org.opennms.core.concurrent.LogPreservingThreadFactory$3.run(LogPreservingThreadFactory.java:124) [opennms-util-20.0.1.jar:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]

In our environment we have 10 Minions connected, monitoring around 800 services across around 150 nodes. Most nodes are reached via the Minions, and the average RTT from the central OpenNMS to the Minions is around 140ms.

We are polling only simple services like SNMP, ICMP, HTTP and HTTPS, with minimal data collection (Net-SNMP OIDs only). So far, disabling parts of the config to try to rule them out hasn't helped either.

This is running on an AWS CentOS 7 machine with 4 CPUs and 16GB RAM (an m3.xlarge instance, for those familiar with AWS). Our heap size is currently set to 8GB.

Does anyone have any suggestions on how to troubleshoot this further? We have tried raising the poller thread pool size from our original 200 to 500, along with anything else we could think of.

After spending a day trying to get to the bottom of it, I've not managed to make much progress. The only thing I have observed is that I've been unable to find a lock error for a directly polled node; the errors appear only for nodes polled via Minion.
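
One way to check that observation across the whole log, rather than by eye, is to tally the lock messages per node ID. The following is a minimal Java sketch under assumptions: the class name `LockUnavailableTally` is hypothetical, the poller log path is passed as the first argument because its location is not stated here, and the pattern is taken from the one log line quoted above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNMS.
final class LockUnavailableTally {
    // Matches the message format seen in the quoted log line.
    private static final Pattern LOCK_FAILURE =
            Pattern.compile("Unable to obtain lock for PollableNode \\[(\\d+)\\]");

    // Counts lock failures per node ID, sorted by node ID.
    static Map<Integer, Integer> countByNode(List<String> logLines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = LOCK_FAILURE.matcher(line);
            if (m.find()) {
                counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the poller log (assumption: supplied by the caller).
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countByNode(lines).forEach((nodeId, count) ->
                System.out.println("node " + nodeId + ": " + count + " lock failures"));
    }
}
```

Comparing the node IDs it reports against the list of nodes polled via Minion would confirm or refute the pattern.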

Thanks,

Josh M




------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Please read the OpenNMS Mailing List FAQ:
http://www.opennms.org/index.php/Mailing_List_FAQ

opennms-discuss mailing list

To *unsubscribe* or change your subscription options, see the bottom of this page:
https://lists.sourceforge.net/lists/listinfo/opennms-discuss
Re: Polling Lock Issues in 20.0.1 (With Minions)

Jesse White-3
Hi Josh,

On 07/25/2017 07:55 AM, Joshua McAdam wrote:
> org.opennms.netmgt.poller.pollables.LockUnavailable: Unable to obtain lock for PollableNode [65] within 500 milliseconds

These messages may look scary but they are only informational and are not indicative of any problems:
   When executing a poll (a monitor) the poller daemon will lock the node, ensuring that only one poll is active against a node at any given time. If additional polls are scheduled while the lock is held, you will see messages like the one you shared, and the associated poll will be rescheduled at some time in the near future.
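
To illustrate the behaviour described above, here is a minimal Java sketch of a per-node lock with a timeout. It is not the OpenNMS code: the class `NodeLockSketch` and its methods are hypothetical names, and only the 500 ms timeout is taken from the log message.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; this is not the OpenNMS implementation.
final class NodeLockSketch {
    // One lock per node, so only one poll runs against the node at a time.
    private final ReentrantLock nodeLock = new ReentrantLock();

    // Runs the poll if the lock is obtained within timeoutMs.
    // Returns false when the caller should postpone (reschedule) the poll.
    boolean pollOrPostpone(Runnable poll, long timeoutMs) {
        boolean locked = false;
        try {
            locked = nodeLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (!locked) {
                return false;
            }
            poll.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (locked) {
                nodeLock.unlock();
            }
        }
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts a slow poll that holds the lock, then attempts a second poll.
    static boolean secondPollRanWhileFirstHeldLock() {
        NodeLockSketch node = new NodeLockSketch();
        CountDownLatch firstStarted = new CountDownLatch(1);
        Thread slowPoll = new Thread(() -> node.pollOrPostpone(() -> {
            firstStarted.countDown();
            sleepQuietly(1500); // holds the lock longer than the 500 ms timeout
        }, 500));
        slowPoll.setDaemon(true);
        slowPoll.start();
        try {
            firstStarted.await(); // the first poll now holds the node lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return node.pollOrPostpone(() -> { }, 500);
    }

    public static void main(String[] args) {
        System.out.println("second poll ran: " + secondPollRanWhileFirstHeldLock());
    }
}
```

In this sketch the second poll gives up after 500 ms and reports that it should be postponed, which corresponds to the "Postponing poll" message in the log. A slow poll against a node delays the other polls for that node but does not by itself mark anything as down.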

To troubleshoot your problem further, I would try executing the monitors manually via the Karaf shell and validating that these give the expected results:
   $ ssh -p 8101 admin@localhost
   admin@opennms> poller:poll -l HQ org.opennms.netmgt.poller.monitors.SshMonitor 127.0.0.1 port=8201

These polls will be triggered irrespective of any locks held by the poller and will allow you to verify that all Minion-related plumbing is working properly.

-Jesse


