MoSShE: Monitoring Simple SHell Environment Copyright (C) 2003-2020 Volker Tanger 10/2020 This software is being retired and replaced by http://www.wyae.de/software/moshel/ This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. http://www.gnu.org/copyleft/gpl.html HISTORY The general idea was taken from plugins I wrote for the Nagios (formerly NetSaint) network/host monitoring system (NMS), especially the various versions of ASLrules. As most of the servers/services I want to monitor are remote systems, traditional NMS (relying on close-looped and/or unencrypted sessions) are either big, complicated to install for safe remote monitoring, ressource intense (when doing remote checks), lack a status history or a combination thereof. Thus I wrote this small, easily configured system. It primarily was intended for monitoring of a handful of typical internet systems. With the more recent versions system and grouping features monitoring of serious numbers of systems should be easily possible. For bug reports and suggestions or if you just want to talk to me please contact me at volker.tanger@wyae.de Updates will be available at http://www.wyae.de/software/mosshe/ please check there for updates prior to submitting patches! ------------------------------------------------------------------------ BUG - MAYBE ARCHITECTURAL/UNFIXABLE processing the entries is *S*L*O*W* checking 10 servers with total 270 checks takes 4.5 minutes on a RaspberryPi-B - of these 1.5 minutes are just for "graphing" (data sorting) ==> BASH probably is unsuited for monitoring more than a tiny handful of systems?? ------------------------------------------------------------------------ todo: - BUG (to fix): (re)check on Busybox-compatibility - BUG (to fix): ImportServerInfo does not overwrite old status - BUG (to fix): whenever a server(-import) fails, all prior sub-checks vanish with it (stats are kept, so are okay if resumed later, though) - want: SNMP query (general) - want: SNMP especially for Windows servers - want: localchecks, disk usage of single directory - want: localchecks, disk usage of subdirectories ("quota") - want: localchecks, number of users (via "w" command) - want: alerting: send alerts on ALERT only (escalation) - want: webviews: SLA reports, downtimes/availability v20.9.25 Volker Tanger * bugfix: NetworkLinuxTraffic yielded utter garbage due to changed output format of /proc/net/dev - corrected v20.9.9 Volker Tanger * bugfix: journalctl check in dovecot+postfix with typo * bugfix: journalctl check for postfix without -u as no working anymore v20.8.21 Volker Tanger * bugfix: limit WGET retries v20.5.18 Volker Tanger * bugfix: propoer check for journalct in postfix & dovecot v20.2.25 Volker Tanger * bugfix: remove leading 0's in HDhardwareSmartRaw v20.2.1 !!! changed HDhardwareSmart into two functions! Volker Tanger * bugfix/change: instead of HDhardwareSmart please use - HDhardwareSmartRaw to compare the RAW value - HDhardwareSmartValue to compare the translated value * deprecated: HDhardwareSmart v19.2.13 Volker Tanger * bugfix: only import successful http requests * bugfix: filtering Import* data before injection * cleanup: removed ImportServerInfo * cleanup: removed GNUplot code * simplification/speedup: ".avg" now is the full log, no longer a calculated average (the JS display can zoom in now). The "normal" view only are the currentmost ~1000 events, trimmed daily. * deprecated: PlotAvgDataFiles * deprecated: ServerInfo v19.01.05 Volker Tanger * bugfix: escape HTML tags from HTTPchecks v18.12.18 Volker Tanger * bugfix/change: default detection for established connections in NetworkConnections now is with "ss" - for the old one using "netstat" please use "NetworkConnectionNetstat" v18.10.21 Volker Tanger * bugfix: TCPing suppress output * bugfix: option syntax error in httpcontentmatch v18.10.7 Volker Tanger * bugfix: ImportAgentCurl: timeout parameter without = sign! v18.7.29 Volker Tanger * bugfix: HTTPcontentmatch now with timeout (30sec) v18.1.1 Volker Tanger * bugfix: apcaccess tool availability copy/paste error corrected v17.6.12 Volker Tanger * bugfix: apcaccess no longer defaults to "-h 127.0.0.1" added accordingly v17.3.27 Volker Tanger * bugfix: check for smartctl now to /usr/sbin/smartctl instead of which v16.10.18 Volker Tanger * bugfix: smartctl now a bit faster (-A instead of -a) v16.6.12 Jay Goldberg http://www.avianblue.net * bugfix/improvement: function declaration with improved compatibility * bugfix/improvement: for-loop with improved compatibility Volker Tanger * thanks Jay for all the issues and bugfixes! * bugfix: moved "mail" check to mail alerts (where it belongs) closing github#6 * bugfix: chmod result files to 644 so non-root can read the files when accessing them via webserver - closing github#4 * bufgix: renamed template* so it won't be deleted by cleanup script closing github#9 v16.4.24 Volker Tanger * bugfix to 16.2.4 : (crude) network rollover fix missing most cases v16.4.20 Volker Tanger * refactor: move WWW files to according directory * bugfix: replace space in sensor description with underscore so file access works (web interface, averages) v16.2.4 Volker Tanger * bugfix: (crude) fix against network traffic counter rollover v16.2.2 Volker Tanger * bugfix: dovecot+postfix checks were missing "OK" status to be set resulting in wrong status (but correct message) - fixed v16.1.20 Volker Tanger * bugfix: typo in README.md + wrong description in DovecotLoginFailed v16.1.18 Volker Tanger * bugfix: corrected TLS counting - a space on the wrong place... v16.1.17 Volker Tanger * feature: postfix and dovecot checks v15.7.20 Volker Tanger * feature: checking locally monitored APC UPS ApcUpsValueTooHigh ApcUpsValueTooLow ApcUpsStatus v15.7.20 !!! FedoraUpdatesAvailable split into pre-21 and post-22 Volker Tanger * feature: FeduraYumUpdatesAvailable - for Fedora 21 and prior FeduraDnfUpdatesAvailable - fpr Fedora 22 and later v15.6.11 Volker Tanger * bugfix: ArchlinuxUpdatesAvailable uses different tool: changed from pacupdatescheck to checkupdates in mosshe_hourly * bugfix: navigated around the annoying $WWWDIR/srv_*.txt.new not found bug when there is no server info. v15.4.25 Volker Tanger * feaure: added ArchlinuxUpdatesAvailable check (similar to DebianUpdatesCheck) v15.4.23 Volker Tanger * feaure: added FedoraUpdatesAvailable check (similar to DebianUpdatesCheck) v15.4.14 Volker Tanger !!! changed network checks naming convention * feaure/bugfix: network checks now report on server FROM which the check is done as well as TO which server, so monitoring can cope with distributed monitoring of the same servie * bugfix: copy/paste error in HDCheckGB corrected v14.12.04 Volker Tanger * feaure: added MemCheckLinux * feaure/bugfix: added HDfreeGB + HDfreeMB - using mount points instead of string matches v14.10.31 Volker Tanger * bugfix: DebianUpdatesAvailable hourly job misnamed corrected * feature: HDhardwareSmart alerts on excessive hardware fails v14.9.24 Volker Tanger * bugfix: DebianUpdatesAvailable now works - needs hourly cron job (included, must be enabled) v14.9.19 Volker Tanger * feature: DebianUpdatesAvailable ckecks whether your Debian installation is up to date * feature: Import* now log OK/ALERT - not only when errors * bugfix: ProcCheck had wrong process name v14.6.24 Volker Tanger * bugfix: ImportAgents now alert when slave unavailabe * bugfix: undefined states are now correctly counted * expire: LoadHekto (renamed to LoadCheckPercent) * feature/bugfix: import agents now respect NETWAIT value instead of fixed 30 seconds v14.5.11 Volker Tanger * feature/bugfix: it's LoadCheckPercent, not LoadHekto. Stupid me, wrong prefix (4 orders of magnitude off). * bugfix: skipped compare if old files are missing (e.g. because of restart) * bugfix: info generation wrote to wrong filename (stupid copy+paste) v14.2.20 Volker Tanger * feature/bugfix: added NetworkLinuxBandwidth checking against /proc/net/dev * feature/bugfix: added NetworkTrafficCheck checking against /proc/net/dev * bugfix: ServerInfo() did not delete old data, leading to increasing transfer and disk exhaustion (in the long rung) v14.2.3 Volker Tanger * feature: added monthly NetworkBandwidth check: GB/month * feature: added HDCheckGB - like HDCheck but in GB * feature: added HDparmState - check whether a disc is spun down * feature: added ImportServerInfo * bugfix: corrected server info placement (not just the central host) v14.1.23 Volker Tanger * new module: functions.virtual * moved VServer-checks into that module * feature: added Virtuzzo/OpenVZ VZbeancounter checks * feature: added FileLines check v13.5.28 !!! HardwareTemp and HardwareFan deprecated now, use HardwareSensors !!! *Show functions are deprecated, use ServerInfo instead Volker Tanger - feature: server-specific info now simpler & more flexible with ServerInfo which extends the server-specific page - feature: added current network connections check NetworkConnections - feature: added HardwareSensorBetween - feature: added MySQL checks (Threads, Queries) v13.5.14 !!! changed some server-relations !!! (i.e. which server the tested function is listed under) :-) no longer uses (nor needs) GNUplot Volker Tanger - bugfix: missing SERVER headline in group entry corrected - bugfix: CheckVserverUp/Down now listed under vserver name instead of host server name - bugfix: ReapPassive now listed under reaped name instead of host server name - bugfix: changed HTML section identification to be independent of time - feature: added VSERVER load check - feature: graph generation on client side using HTML5-canvas (also fixes broken AVG graph generation) (also is less CPU-intense on the server) v12.9.20 !!! change in template to automaticall configure %WWWDIR% Volker Tanger - bugfix: copy-paste error in MailQueue check - bugfix: wrong CUT parameter in ReapPassiveChecks - bugfix: more stable memory check - bugfix: HTTPcontentmatch with WGET instead of NC more stable when encountering web applications - feature: template automatically configures %WWWDIR% - feature: added VServer-related checks v11.6.27 !!! HardwareTemp and HardwareFan will be deprecated !!! in one of the next versions - use HardwareSensor instead !!! CreateDataFiles was removed as unused Volker Tanger - bugfix SAMBAcheck (removed CRON chatter, corrected server name) - improved lock message & internal logging to find lockups - fixed typo, extended example mosshe script - fixed typo in ImportAgents - added generic hardware sensor check - added IPv6 pings: Ping6Partner, Ping6Loss, Ping6Time - added average graphing/plotting - added HardwareSensor v11.5.10 David Soergel - bug report: forgotten gnuplot template Volker Tanger - added automatic reload / refresh (5 minutes) to template - added SwapCheck - page swaps per second - added ImportAgentWget - corrected/added NC parameters in functions.network - bugfix SSHcheck - bugfix HardwareTemp - bugfix ImportAgent (linewrap) - bugfix SyslogOnChange (missed group change) - bugfix table header in template v11.2.23 Volker Tanger - added function CreateDataFiles, creating basis for RRDB-alike data - added function PlotDataFiles, which creates & plots RRDB/MRTG-alike data & graphs using gnuplot v10.12.19 Volker Tanger - bugfix: removed unneccessary ping from ReapPassiveChecks - added FileTooOld check - added FileTooBig check v10.12.16 Christoph R. - found typo in SSHcheck function definition - found typo in mosshe main script Volker Tanger - removed CentralizeLog due to security concerns - added ReapPassiveChecks for incoming/passive monitoring (experimental, please help debugging) v10.12.6 Volker Tanger - added ARPing, idea by Thomas Bullinger http://consult.btoy1.net - small bugfix in server listing (when same server is in differnt groups) - added a quick-setup-script "create_mosshe.sh" which scans the network given and creates a basic monitoring - added SSH check - added TCPing for generic TCP services v10.12.5 !!! this version changes the log format !!! !!! - added GROUP column !!! - added INFO status Stefan Adams - fixed HDCheck - Showed 0MB free when partition specified not available (Ubuntu 10.04) - added INFO status in addition to OK, WARN, ALERT, UNDEF. This allows simply reporting info without considering it a check - add error checking against "mail" not available which resulted in odd error messages. Now will terminate with reasons. - added CentralizeLog - Not always possible for a central server to pull data, this will push data - changed use of lynx in ImportAgent to curl. Seems more commonly available. - added functions.localshows - functions dedicated to simply showing info as opposed to checking. Needn't be its own file, just done for organizational purposes. - modified template_head.html to include js and css for collapsable detail -- well suited for INFO showing - fixed LogTo* functions to mkdir -p `dirname $LOGFILENAME` to ensure log gets written - changed default of MYNAME to $(/bin/hostname) - added MYDOM $(/bin/hostname -d) and MYGROUP (obtained from DNS TXT) - added Group feature and Group sorting Volker Tanger - changed a few functions and file names for better compatibility - changed html indentation for better readability - added cron job for log rotation if not done via logrotate or similar system tools v10.12.1 Volker Tanger - added PingLoss & PingTime funcktions for more detailled checks - added SmartMonHealth harddisk monitoring - added Raid3ware hardware-RAID check - added LogToDaily, LogToWeekly, LogToMonthly v10.9.14 Volker Tanger - added "HardwareTemp" for checking temperatures in addition to the old mbmon-checks - added "HardwareFan" for checking fan speed in addition to the old mbmon-checks v10.6.22 Volker Tanger - nicer GUI: server-list & per-server listing (optional) - added output file suitable for passive checks in Nagios. This way MoSShE can be used as safe alternative for NRPE. v10.5.18 Volker Tanger - added SyslogOnChange to write alerts to syslog v10.3.24 Volker Tanger - added/changed load function: LoadCheck now checks 5-minutes average (instead of 15min window) LoadCheckFast checks 1-minute average LoadCheckSlow checks 15-minutes average - corrected POP3check which did not work at all since upgrade from version 1.4.7 (dumb variable renaming missed) v10.2.24 Volker Tanger - functions.network: redirected error messages by LYNX to /dev/zero to avoid error messages from CRON calls - corrected bug in self-locking (never reached timeout) - deleted temp files from SLA checks - added IDS functions: CheckFileChanges CheckConfigChanges plus helper-script GENERATE-COMPARES.SH v10.2.22 Volker Tanger - added web interface URL to alert mails - added function to selectively alert admins about their system - added full logging functions - added agglomeration function to allow central server(s), to create one (or more) central servers polling information from agents v10.2.18 Volker Tanger - more of bugfixes, especially in HTML output and SLA checks - self-locking to prevent simultaneous checks, including watchdog-restart with admin alerting v10.2.10 Volker Tanger - lots of bugfixes - generating summary SLA checks v9.11.28 Volker Tanger - complete rewrite - migrating from SSH-on-demand to local-Shell-via-cron - generation of local HTML status files ======================================================================== December 2008 v1.4.6 to v1.4.7 * Volker Tanger - removed bug: restart now clears status files ------------------------------------------------------------------------ November 2007 v1.4.5 to v1.4.6 * Volker Tanger - added check for local files (e.g. Unix sockets) - added check for running daemons/processes ------------------------------------------------------------------------ November 2007 v1.4.4 to v1.4.5 * Volker Tanger - HDD size check off by one, resulting in wrong numbers ------------------------------------------------------------------------ November 2007 v1.4.3 to v1.4.4 * Volker Tanger - corrected error messages due to missing date when programs not installed - corrected server identification which did misfire when the first part of a server matched a different one - added http_time check which returns reply time in milliseconds - cleanup of stale (read: all) status files on startup ------------------------------------------------------------------------ August 2007 v1.4.2 to v1.4.3 * Volker Tanger - console errors from mosshe_ssh are redirected into /dev/null ------------------------------------------------------------------------ May 2007 v1.4.0 to v1.4.2 * Volker Tanger - added mosshe_checkrun (no file changes, for debugging) - added SMB disk free check (sambafree) - corrected filtering RegEx in mosshe.singlerun ------------------------------------------------------------------------ January 2007 v1.3.11 to v1.4.0 * Volker Tanger - changed name of mosshe checks (mosshe_ssh instead of ssh) - added MOSSHE_HTTP check for unencrypted HTTP-based check - added Logs/statechange_YEAR.log for easier up-/downtime calculation ------------------------------------------------------------------------ February 2007 v1.4.0 to v1.4.1 * Volker Tanger - corrected (finally!) memory check in localcheck.functions ------------------------------------------------------------------------ April 2006 v1.3.10 to v1.3.11 * Volker Tanger - added "noping_" feature - added RETRYSLEEP variable - removed bug which left stale status information in server stat file ------------------------------------------------------------------------ December 2005 v1.3.9 to v1.3.10 * Volker Tanger - Changed FTP check to function with more servers (better/more robust return code separation) - cosmetic fixes in web interface - updated documents (thanks Eduardo Grosclaude ) - added "findbuggyline.py" to the Logs directory ------------------------------------------------------------------------ October 2005 v1.3.8 to v1.3.9 * Volker Tanger - Changed naming logic of log check a bit to allow multiple checks on one single log file, which was impossible before. ------------------------------------------------------------------------ October 2005 v1.3.7 to v1.3.8 * Volker Tanger - Added log file check (number of occorrencies) to SSH/localcheck.functions ------------------------------------------------------------------------ June 2005 v1.3.6 to v1.3.7 * Volker Tanger - corrected bug in DNS check * Ronny Henke - updated SNMP check for enchanced compatibility ------------------------------------------------------------------------ June 2005 v1.3.5 to v1.3.6 * Volker Tanger - added MAILQ_CHECK to control the length of the mail queue to localcheck.functions - main MoSSHe script: safer removement of stale stat files - logs completed checks only to stat files - no longer incomplete stats - showall.PY with added date of last check ------------------------------------------------------------------------ May 2005 v1.3.4 to v1.3.5 * Volker Tanger - loosened filtering RegEx to allow pathnames in disk check - changed disk size base for localcheck.embedded to KILObytes instead of MEGAbytes. - set 30 seconds timeout for HTTP/HTTPS checks ------------------------------------------------------------------------ May 2005 v1.3.3 to v1.3.4 * Volker Tanger !!! changed network traffic from byte/sec to the more usued kbit/s - added FTP service check - added HTTPS service check - added localcheck.embedded_functions for appliances using busybox shell/command replacement - added template for known appliances to download: Innominate mGuard - nicened outputs from LOCALCHECK ------------------------------------------------------------------------ May 2005 v1.3.2 to v1.3.3 * Volker Tanger - added network checks (usage, errors) - added template for known appliances for download: Astaro - webscripts ignore empty lines in servers.conf ------------------------------------------------------------------------ March 2005 v1.3.1 to v1.3.2 * Volker Tanger - corrected overzealous check filter in mosshe.singlerun ------------------------------------------------------------------------ January 2005 v1.3.0 to v1.3.1 * Eduardo Grosclaude - corrected Volker's bugs in RaidCheck() ------------------------------------------------------------------------ January 2005 v1.2.9 to v1.3.0 * Volker Tanger - added "changes only" notification/alert - stale server status files (i.e. when server removed from monitoring) will be automatically removed * Eduardo Grosclaude - added Linux-SoftRAID monitoring for LOCALCHECK ------------------------------------------------------------------------ December 2004 v1.2.8 to v1.2.9 * Volker Tanger - improved HTTP check in way easier checking for URLs which unfortunately breaks compatibility... - removed bug that disabled logging ($LOGFILE was not set) ------------------------------------------------------------------------ December 2004 v1.2.7 to v1.2.8 * Volker Tanger - removed bug introduced with 1.2.7 when checking for set environment paths - *OUCH* RegExp gone bad removed all check results. Corrected. Sorry guys, must have slipped somewhere... ------------------------------------------------------------------------ December 2004 v1.2.6 to v1.2.7 * Volker Tanger - split MOSSHE into three parts - config, loop and single check, which helps testing setups. ------------------------------------------------------------------------ November 2004 v1.2.5 to v1.2.6 * Volker Tanger - repaired quite some bugs in snmp check, now numeric values are set for all fields. Changed snmp.* parameter files accordingly - improved MBCheck in localcheck.functions ------------------------------------------------------------------------ November 2004 v1.2.4 to v1.2.5 * Volker Tanger - quite some bugs in localcheck.functions (introduced with helper binary checks - usually missing "fi"s) - nailed down the path to "localcheck.functions" - added check result sanitizing within mosshe script - corrected missing UNDEF listing in web interface(s) and mosshe - corrected SMTP check behaviour when service completely down * Nicholas Fechner - improved HTTP check - now you can define expected returns (e.g. "302 Document moved") as OK, too. ------------------------------------------------------------------------ November 2004 v1.2.2 to v1.2.4 * Volker Tanger - added safety checks for checks that need aditional/external programs (smmp, localcheck.tempcheck) * Eduardo Grosclaude - corrected bug in alert handling (line 89) - added MBMonCheck (CPU temperatur, fan speed, etc...) ------------------------------------------------------------------------ November 2004 v1.2.1 to v1.2.2 * Volker Tanger - corrected SSH / SSH1 checks, did not notice server/SSH gone completely. Will now return "UNDEF" when gone. - plit "localcheck" into a config and a funtions part to make updates much easier (to "just copy"). + Eduardo Grosclaude - updated README re. localcheck.include that was split in the last version - corrected cron_mosshe to enhance compatibility - corrected CheckMem funtion in loclcheck - added comments for the SSH tests ------------------------------------------------------------------------ November 2004 v1.2.2 to v1.2.3 * Volker Tanger - added "UNDEF" status to global_alert and mosshe ------------------------------------------------------------------------ October 2004 v1.2.0 to v1.2.1 * Volker Tanger - corrected HTTP check, did not notice a server gone completely - Added "UNDEF" alerting to alerts and web interface ------------------------------------------------------------------------ October 2004 v1.1.7 to v1.2.0 * Volker Tanger - corrected PrintCheck warning output (did not show limit) - delete status files of servers no longer in use - changed alert basics so we can add other alerts soon... ------------------------------------------------------------------------ October 2004 v1.1.6 to v1.1.7 * Yann Pilpre - increased portability of disk check in LOCALCHECK when using "long" devices like /dev/ide/host0/bus0/target0/lun0/part7 ------------------------------------------------------------------------ October 2004 v1.1.5 to v1.1.6 * Volker Tanger - corrected disk check in LOCALCHECK so hanging network mounts do not hang MOSSHE ------------------------------------------------------------------------ October 2004 v1.1.4 to v1.1.5 * Volker Tanger - added SSH V.1 remote check ------------------------------------------------------------------------ October 2004 v1.1.3 to v1.1.4 * Volker Tanger - added -n flag to PING to make independent on DNS - added DNS service test - added SAMBA service check - added printer and memory checks to LOCALCHECK ------------------------------------------------------------------------ September 2004 v1.1.2 to v1.1.3 * Volker Tanger - added a total count of checks/servers to SHOWALL overview - increased compatibility by changing shell integer handling * Chad Lepto - suggested adding Logs directory to archive - currected use of HEAD command line parameters - added compatibility documentation hints for README ------------------------------------------------------------------------ September 2004 v1.1.1 to v1.1.2 * Volker Tanger - added IMAP2 and IMAP3 checks ------------------------------------------------------------------------ September 2004 v1.1.0 to v1.1.1 * Volker Tanger - fixed small variable initialization error in tactical.oy (webinterface) ------------------------------------------------------------------------ August 2004 v1.0.4 to v1.1.0 * Volker Tanger - logfile rotated daily - added SNMP queries - trend showing min/avg/max values ------------------------------------------------------------------------ August 2004 v1.0.3 to v1.0.4 * Volker Tanger - parametric command (http.tomcat) - PING with execute-time (well, better than nothing...) - central config file(s) for webview - number of shells listed as numeric value - show mount point in addition to "Disk Free" message ------------------------------------------------------------------------ August 2004 v1.0.2 to v1.0.3 * Volker Tanger - changed PING behaviour to do retries, resulting in less false positives. A single ICMP packet is lost sometimes, after all... - corrected tactical.py - missing key in "stati" dictionary ------------------------------------------------------------------------ August 2004 v1.0.1 to v1.0.2 * Volker Tanger - removed (irrelevant but annoying) bug with VERSION setting - removed bug in shell check - removed absolute (and programming-specific) path for HTTP check - added cron handler ------------------------------------------------------------------------ July 2004 v1.0.0 to v1.0.1 * Volker Tanger - added documentation and published ------------------------------------------------------------------------ late 2003 (started the project) * Volker Tanger - wrote the "core system" and mail alerts - started into the web interface ------------------------------------------------------------------------