Our research group has a Hadoop cluster that I use frequently and sometimes break. Figuring out through the Hadoop JobTracker which nodes (or TaskTrackers) have gone missing from the cluster is a bit of a hassle, so I needed a way to quickly identify problematic nodes for reboots or further diagnosis.
There are plenty of network monitoring tools available: Cacti, Nagios, and Ganglia are a few that I’ve used, and they’re a pain to install and maintain. I don’t need that level of granularity in monitoring: either a machine is up and I can SSH into it at the moment, or something’s wrong.
I scripted a simple tool using bash and GNU parallel that will ping and SSH into systems and compile the exit code results into a nice text file for viewing via a browser or the command line.
The tool uses network-level (ICMP) and application-level (SSH) probes for the machines we want to monitor; other application-level probes such as HTTP would be easy to add, but most of the systems I monitor aren’t running an HTTP daemon.
It’s simple and inefficient.
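To make the two probes concrete, here is a minimal sketch of how they might look in bash. This is an illustration under stated assumptions, not the code from the repository: the function names probe_ping, probe_ssh, and probe_host and the file hosts.txt are made up for this example, the ping flags assume the Linux iputils version, and the timeouts are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch only: function names, timeouts, and hosts.txt are assumptions.

# Network-level probe: send one ICMP echo request and wait up to
# 2 seconds for a reply (-c and -W as in Linux iputils ping).
probe_ping() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Application-level probe: run a no-op command over SSH.
# BatchMode=yes fails instead of prompting for a password;
# ConnectTimeout bounds the wait on unreachable hosts.
probe_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Print one line per machine: "host ping_exit_code ssh_exit_code".
probe_host() {
    ping_rc=0
    probe_ping "$1" || ping_rc=$?
    ssh_rc=0
    probe_ssh "$1" || ssh_rc=$?
    printf '%s %s %s\n' "$1" "$ping_rc" "$ssh_rc"
}

# With GNU parallel the hosts can be probed concurrently, for example
# (assuming parallel runs its jobs under bash and hosts.txt lists one
# machine per line):
#   export -f probe_ping probe_ssh probe_host
#   parallel -k probe_host :::: hosts.txt
```

An exit code of 0 in either column means the probe succeeded; anything else is passed through unchanged from ping or ssh, so the raw value can help with diagnosis.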
If this has helped you or if you have any comments, send me mail: calvin@isi.edu.
code
https://ant.isi.edu/git/~calvin/isitdown.git
example output
machine ping ssh
======= ==== ===
localhost 0 0
host.does.not.exist 2 255
Thu Jan 1 00:00:00 UTC 1970
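A report shaped like the one above could be assembled from per-host result lines with a few lines of shell. The sketch below is an assumption about the layout, not the repository's code: format_report is a hypothetical name, and it expects lines of the form "host ping_code ssh_code" on standard input. In the sample output, 0 means the probe succeeded, while 2 and 255 are the statuses ping and ssh typically return when they fail with an error such as an unresolvable hostname.

```shell
#!/usr/bin/env bash
# Sketch only: format_report is a hypothetical helper.
# Reads "host ping_code ssh_code" lines on stdin and prints a header,
# one row per machine, and a UTC timestamp as the final line.
format_report() {
    printf '%s %s %s\n' machine ping ssh
    printf '%s %s %s\n' ======= ==== ===
    while read -r host ping_rc ssh_rc; do
        printf '%s %s %s\n' "$host" "$ping_rc" "$ssh_rc"
    done
    date -u
}
```

Redirecting the output to a file under a web server's document root would make the report viewable in a browser as well as from the command line.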