simple system availability monitor

Our research group has a Hadoop cluster that I use frequently and sometimes break. Figuring out from the Hadoop JobTracker which nodes (or TaskTrackers) have gone missing from the cluster is a bit of a hassle, so I needed a way to quickly identify problematic nodes for reboots or further diagnosis.

There are plenty of network monitoring tools available: Cacti, Nagios, and Ganglia are a few that I’ve used, and they’re a pain to install and maintain. I don’t need that level of granularity in monitoring: either a machine is up and I can SSH into it at the moment, or something’s wrong.

I scripted a simple tool using bash and GNU parallel that pings and SSHes into each system and compiles the exit codes into a plain text file for viewing in a browser or on the command line.

The tool uses network-level (ICMP) and application-level (SSH) probes for the machines we want to monitor; other application-level probes like HTTP would be easy to add, but most of the systems I monitor aren’t running an HTTP daemon.
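
As a rough illustration of the two probes, here is a minimal sketch. It is not the code from the repository: the function names, the timeouts, and the SSH options are my assumptions, and it loops over hosts one at a time instead of using GNU parallel. The `-W 2` timeout assumes Linux (iputils) ping, where the value is in seconds.

```shell
#!/usr/bin/env bash
# Sketch only: names, timeouts, and options are assumptions,
# not taken from the isitdown repository.

# Network-level probe: send one ICMP echo request, print ping's exit code.
probe_ping() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
    echo $?
}

# Application-level probe: log in without prompting, run `true`,
# and print ssh's exit code.
probe_ssh() {
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
    echo $?
}

# Print one report row: name left-aligned, then the two exit codes.
report_row() {
    printf '%-20s %4s %5s\n' "$1" "$2" "$3"
}

# Probe every host given as an argument and print the whole report,
# ending with a UTC timestamp.
report() {
    report_row machine ping ssh
    report_row ======= ==== ===
    local host
    for host in "$@"; do
        report_row "$host" "$(probe_ping "$host")" "$(probe_ssh "$host")"
    done
    echo
    date -u
}
```

With these functions loaded, something like `report localhost host.does.not.exist > status.txt` (the file name is hypothetical) would write a report in the layout shown under "example output". An HTTP probe could be added as a third function in the same style.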

It’s simple and inefficient.

If this has helped you or if you have any comments, send me mail: cardi@acm.org.

code

https://ant.isi.edu/git/~calvin/isitdown.git

example output

Each column shows the exit code of that probe: 0 means the probe succeeded, anything else means it failed.

machine              ping   ssh
=======              ====   ===
localhost               0     0
host.does.not.exist     2   255

Thu Jan  1 00:00:00 UTC 1970