Monitoring is a critical part of maintaining a stable email installation. When done properly, it can provide the site operators with:
This document describes some of the basics of monitoring Sun ONE Messaging Server email systems.
While each email deployment is slightly different, the diagram below represents a basic email deployment.
[Diagram: ISP customers and the Internet at the top. A load balancer sits in front of mmp1, mmp2, mta1, and mta2. mta3 and mta4 face the Internet. All six front-end hosts share a network with the message stores Store1 and Store2 and the directory servers ldap1, ldap2, and ldap3.]
In this example, the customers of the ISP submit and retrieve email through the load balancer. The load balancer shields the email user from the underlying site architecture and helps provide a highly available email service. Email submitted to the site takes three potential routes:
load balancer -> mta[1,2] -> store[1,2]
load balancer -> mta[1,2] -> Internet
mta[3,4] -> store[1,2]
Email retrieved from the site takes one route:
load balancer <-> mmp[1,2] <-> store[1,2]
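The routes listed above imply a small set of host combinations that a monitor has to cover. The following is a minimal sketch of enumerating them, assuming the host names from the example deployment. The grouping and function names are illustrative and are not part of the supplied scripts.

```python
from itertools import product

# Host groups taken from the example deployment; adjust for a real site.
LB_MTAS = ["mta1", "mta2"]        # reached through the load balancer
INBOUND_MTAS = ["mta3", "mta4"]   # deliver mail to the stores without the load balancer
MMPS = ["mmp1", "mmp2"]
STORES = ["store1", "store2"]


def delivery_paths():
    """Return (mta, store) pairs for mail delivered to a local mailbox."""
    return list(product(LB_MTAS + INBOUND_MTAS, STORES))


def retrieval_paths():
    """Return (mmp, store) pairs for mail retrieved by a customer."""
    return list(product(MMPS, STORES))


if __name__ == "__main__":
    for mta, store in delivery_paths():
        print(f"submit   {mta} -> {store}")
    for mmp, store in retrieval_paths():
        print(f"retrieve {mmp} <-> {store}")
```

Listing the combinations explicitly makes it easy to check that every store has a test account and that every front-end host is exercised at least once.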
This document describes how to monitor this basic deployment. It also suggests additional site monitoring that is essential to maintaining a well-behaved email installation.
The basic technique presented here to monitor the health of the site is to use the site, essentially as the customers do, to send and retrieve email. To do this, at least one test email account must be set up on each of the message store machines. A system must then be put into place to exercise and measure all of the potential message flow paths through the systems. When a failure occurs at the customer-visible entry points (mta1-4 or mmp1-2), an automated attempt to isolate the failure to a particular system should take place and an alert should be sent. Ideally, the alert should not be sent using any of the email infrastructure of the site, so that a system failure cannot prevent the alert from going out.
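The send-and-retrieve technique described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by the supplied scripts: the function names (round_trip, smtp_sender, pop_fetcher), the default ports, and the timeouts are all hypothetical, and only the Python standard library smtplib and poplib modules are used.

```python
import poplib
import smtplib
import time
import uuid
from email import message_from_bytes
from email.message import EmailMessage


def round_trip(send, fetch_subjects, timeout=60.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Send a uniquely tagged message and wait until it can be retrieved.

    send(subject) submits the message; fetch_subjects() returns the
    subjects currently in the test mailbox. Returns the elapsed time
    reported by clock, or None if the message did not arrive in time.
    """
    subject = f"healthcheck-{uuid.uuid4().hex}"
    start = clock()
    send(subject)
    while clock() - start <= timeout:
        if subject in fetch_subjects():
            return clock() - start
        sleep(interval)
    return None


def smtp_sender(host, sender, recipient, port=25):
    """Build a send(subject) callable that submits through one MTA."""
    def send(subject):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = subject
        msg.set_content("round-trip probe")
        with smtplib.SMTP(host, port, timeout=30) as smtp:
            smtp.send_message(msg)
    return send


def pop_fetcher(host, user, password, port=110):
    """Build a fetch_subjects() callable that reads through one MMP."""
    def fetch_subjects():
        box = poplib.POP3(host, port, timeout=30)
        try:
            box.user(user)
            box.pass_(password)
            count = len(box.list()[1])
            subjects = []
            for number in range(1, count + 1):
                raw = b"\r\n".join(box.retr(number)[1])
                subjects.append(message_from_bytes(raw)["Subject"])
            return subjects
        finally:
            box.quit()
    return fetch_subjects
```

A probe for one path would combine the two helpers, for example round_trip(smtp_sender("mta1", "probe@example.com", "ims1@example.com"), pop_fetcher("mmp1", "ims1", "secret")), where the addresses and password are placeholders. Passing the send and fetch steps as callables keeps the timing logic testable without a live mail system. A production probe would also delete the test messages it retrieves.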
The next two sections describe a collection of scripts and programs that have been developed and deployed to monitor (Sun ONE) Messaging Server email installations. Many of the scripts are generic and only rely on the existence of the basic protocols used in all email installations (POP, SMTP, and LDAP).
These scripts and programs are based on scripts and programs from a variety of sources. Some were adapted from scripts that were used to monitor SIMS installations; others were developed on site during customer escalations. They are supplied on an as-is basis.
There are two types of scripts: top-level scripts and support scripts. This section describes the support scripts. Under normal circumstances, these scripts are not called directly from the command line or from the crontab entries set up to monitor the site; they are called from the top-level scripts. They are described here because they implement functions that may be useful if the top-level scripts are rewritten, reconfigured, or enhanced. The support scripts are found in the lib/ subdirectory or the etc/ subdirectory.
The format of the entries in this file is:
# <minute> DMSG -u <store> -s <MTA Relay> -p <MMP>
05 DMSG -u ims1 -s ssmtp01 -p smux01
10 DMSG -u ims2 -s ssmtp01 -p smux01
...
55 DMSG -u ims4 -s ssmtp04 -p smux02
#
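To make the entry format concrete, here is a minimal sketch of a parser for it. It assumes only what the sample shows: a minute, the literal keyword DMSG, and the -u, -s, and -p options. The names Probe, parse_entries, and probe_for_minute are hypothetical, and selecting an entry by the current minute is an assumption about how the entries are used.

```python
import shlex
from typing import NamedTuple


class Probe(NamedTuple):
    minute: int
    store_user: str   # value of -u
    mta: str          # value of -s
    mmp: str          # value of -p


def parse_entries(text):
    """Parse '<minute> DMSG -u <store> -s <MTA> -p <MMP>' lines."""
    probes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)
        if len(fields) != 8 or fields[1] != "DMSG":
            raise ValueError(f"unrecognized entry: {line!r}")
        options = dict(zip(fields[2::2], fields[3::2]))
        try:
            probes.append(Probe(int(fields[0]), options["-u"],
                                options["-s"], options["-p"]))
        except KeyError as missing:
            raise ValueError(f"missing option {missing} in {line!r}")
    return probes


def probe_for_minute(probes, minute):
    """Return the entry scheduled for this minute, or None (assumed usage)."""
    for probe in probes:
        if probe.minute == minute:
            return probe
    return None


SAMPLE = """\
# <minute> DMSG -u <store> -s <MTA Relay> -p <MMP>
05 DMSG -u ims1 -s ssmtp01 -p smux01
10 DMSG -u ims2 -s ssmtp01 -p smux01
55 DMSG -u ims4 -s ssmtp04 -p smux02
#
"""

if __name__ == "__main__":
    for probe in parse_entries(SAMPLE):
        print(probe)
```

Rejecting malformed lines with an error, rather than skipping them silently, helps catch a typo in the file before it causes a path to go untested.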
The following options are supported:
These scripts are the top-level scripts used to monitor the site. Some of the scripts assess the health of the system, attempt to isolate a failure to a particular machine and protocol, and send out alerts. The results are logged to a file and sent to a list of email addresses each day. Other scripts parse the SMTP logs and report message rate and size information.
The options supported are:
Below is an example of the qmonitor.log file:
DATE COLLECTED: 10 Jan 2001
instance name = msg-smtp1
+---------------------------------------------------------------------------------------------------+
|                                 System and Queue monitoring data                                  |
+-----+----------------+------------+---------+----------------------------+------------------------+
|     | Msgs in Queue  | Available  |         |            CPU             |   TCP/IP Connections   |
|Time | act   | held   |    swap    |Scanrate |  us  | sys  |  wt  |  id   |smtp|imap|pop3|ldap|http|
+-----+-------+--------+------------+---------+------+------+------+-------+----+----+----+----+----+
|00:10|     0 |      0 |   2005656k |       0 |    0 |    0 |    2 |    97 |   0|  50|   0|  55|   0|
|00:15|     1 |      0 |   2003760k |       0 |    1 |    2 |    3 |    94 |   0|  51|   0|  55|   0|
|00:20|     0 |      0 |   1992640k |       0 |    0 |    0 |    3 |    96 |   0|  52|   0|  56|   0|
|00:25|     0 |      0 |   2004280k |       0 |    0 |    0 |    6 |    94 |   0|  50|   0|  55|   0|
|00:30|     0 |      0 |   2004552k |       0 |    0 |    1 |    5 |    94 |   0|  50|   0|  55|   0|
|00:35|     1 |      0 |   1992216k |       0 |    0 |    0 |    4 |    96 |   0|  48|   0|  56|   0|
|00:40|     0 |      0 |   2018200k |       0 |    0 |    1 |    5 |    94 |   0|  48|   0|  54|   0|
|00:45|     0 |      0 |   2005648k |       0 |    1 |    1 |    5 |    93 |   0|  48|   0|  55|   0|
|00:50|     0 |      0 |   2004264k |       0 |    0 |    0 |    4 |    95 |   0|  48|   0|  55|   0|
|00:55|     1 |      0 |   2005064k |       0 |    0 |    0 |    3 |    96 |   0|  49|   0|  55|   0|
|01:00|     0 |      0 |   1993824k |       0 |    6 |    4 |   46 |    44 |   0|  50|   0|  55|   0|
|01:05|     1 |      0 |   1981784k |       0 |    1 |    3 |    2 |    95 |   0|  51|   0|  56|   0|
|01:10|     0 |      0 |   1985232k |       0 |    1 |    4 |    5 |    91 |   0|  49|   0|  55|   0|
|01:15|     0 |      0 |   1972688k |       0 |    1 |    2 |    3 |    94 |   0|  50|   0|  56|   0|
|01:20|     0 |      0 |   2003048k |       0 |    1 |    4 |   45 |    50 |   0|  50|   0|  54|   0|
|01:25|     0 |      0 |   1994032k |       0 |    1 |    2 |   49 |    48 |   0|  50|   0|  55|   0|
|01:30|     0 |      0 |   2003240k |       0 |    2 |    2 |    5 |    92 |   0|  51|   0|  54|   0|
|01:35|     0 |      0 |   2005624k |       0 |    1 |    5 |   45 |    49 |   0|  51|   0|  54|   0|
|01:40|     1 |      0 |   1979776k |       0 |    1 |    3 |   35 |    61 |   0|  51|   0|  56|   0|
|01:45|     1 |      0 |   1976952k |       0 |   28 |   20 |   20 |    32 |   0|  52|   0|  56|   0|
|01:50|     0 |      0 |   1983448k |       0 |    4 |    9 |   24 |    62 |   0|  58|   0|  55|   0|
|01:55|     0 |      0 |   1990520k |       0 |    1 |    4 |   36 |    59 |   0|  51|   0|  55|   0|
|02:00|     0 |      0 |   1987624k |       0 |    9 |    8 |   39 |    44 |   0|  59|   0|  55|   0|
|02:05|     0 |      0 |   1994216k |       0 |    3 |   12 |   12 |    72 |   1|  49|   0|  55|   0|
|02:10|     1 |      0 |   1994408k |       0 |    7 |    9 |   14 |    70 |   0|  57|   0|  55|   0|
|02:15|     0 |      0 |   2004864k |       0 |    0 |    3 |   46 |    50 |   0|  49|   0|  57|   0|
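A log in this layout is easy to post-process. As a sketch only, the following reads the data rows of a qmonitor.log excerpt and lists the sample times at which the CPU wait column (wt) or the active queue depth (act) exceeds a limit. The column names, the function names, and both thresholds are assumptions chosen for illustration, not values recommended by the scripts.

```python
COLUMNS = ["time", "act", "held", "swap_k", "scanrate",
           "us", "sys", "wt", "id",
           "smtp", "imap", "pop3", "ldap", "http"]

# Three data rows copied from the example log, plus its column header.
SAMPLE = """\
|Time | act | held | swap |Scanrate | us | sys | wt | id |smtp|imap|pop3|ldap|http|
|00:10| 0 | 0 | 2005656k | 0 | 0 | 0 | 2 | 97 | 0| 50| 0| 55| 0|
|01:00| 0 | 0 | 1993824k | 0 | 6 | 4 | 46 | 44 | 0| 50| 0| 55| 0|
|01:45| 1 | 0 | 1976952k | 0 | 28 | 20 | 20 | 32 | 0| 52| 0| 56| 0|
"""


def parse_rows(text):
    """Return one dict per data row; header and border lines are skipped."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have one cell per column and start with an HH:MM time.
        if len(cells) != len(COLUMNS) or ":" not in cells[0]:
            continue
        row = {"time": cells[0]}
        for name, value in zip(COLUMNS[1:], cells[1:]):
            row[name] = int(value.rstrip("k"))  # swap carries a 'k' suffix
        rows.append(row)
    return rows


def flag_rows(rows, max_wait=30, max_queue=100):
    """Return sample times where wt or act exceeds the given limits."""
    return [r["time"] for r in rows
            if r["wt"] > max_wait or r["act"] > max_queue]


if __name__ == "__main__":
    print(flag_rows(parse_rows(SAMPLE)))
```

With the three sample rows, only the 01:00 sample is flagged, because its wt value of 46 is above the assumed limit of 30.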
The check_rndtrip script is a convenience script that allows the round-trip monitoring to occur with a single entry in crontab. It reads etc/sequence.cfg to determine which combination of stores, MTA routers, and MMPs it should cycle through.
The runmonitor script tests the email site by sending a message to a specified MTA and retrieving it from a specified MMP. In the event of a failure, it attempts to diagnose and isolate the failure. The results of each test are logged to a file (systest.log) and mailed to the list of email addresses specified in alarms.cfg when the script is invoked with the -r option.
Here is a summary of the functionality of this script when invoked in test mode (without the "-r" option).
Runmonitor parses the stdout and stderr output of immonitor-access to determine the error string on failure.
The following options are supported:
Below is an example of the systest.log contents:
DATE COLLECTED: 10 Jan 2001
+--------------------------------------------------------+-----------------+
|                 Message Delivery time                  |                 |
+----------+-----------+------------+------------+-------+-----------------+
| Test run | Username: | Msg submit | Msg retr   | Total | Error           |
| at time: |           | SMTP:      | POP:       | Time: |                 |
+----------+-----------+------------+------------+-------+-----------------+
| 00:05:04 | ims2      | ssmtp02    | smux01     |  2083 |                 |
| 00:10:03 | ims3      | ssmtp03    | smux02     |   576 |                 |
| 00:15:03 | ims4      | ssmtp04    | smux02     |  1715 |                 |
| 00:20:04 | ims1      | ssmtp02    | smux02     |  1886 |                 |
| 00:25:03 | ims2      | ssmtp01    | smux02     |  1914 |                 |
| 00:30:03 | ims3      | ssmtp04    | smux01     |   614 |                 |
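Because each row records which MTA and MMP were used, the log can be summarized per host to see whether one system is consistently slower. The sketch below assumes the six-column layout shown in the example. The function names are hypothetical, and the unit of the Total Time column is not stated in the log, so the averages are reported in whatever unit the log uses.

```python
from collections import defaultdict
from statistics import mean

# Header and four data rows copied from the example log.
SAMPLE = """\
| Test run | Username: | Msg submit | Msg retr | Total | Error |
| at time: | | SMTP: | POP: | Time: | |
| 00:05:04 | ims2 | ssmtp02 | smux01 | 2083 | |
| 00:10:03 | ims3 | ssmtp03 | smux02 | 576 | |
| 00:15:03 | ims4 | ssmtp04 | smux02 | 1715 | |
| 00:20:04 | ims1 | ssmtp02 | smux02 | 1886 | |
"""


def parse_systest(text):
    """Return one dict per test row; header and border lines are skipped."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have six cells and start with an HH:MM:SS time.
        if len(cells) != 6 or cells[0].count(":") != 2:
            continue
        when, user, smtp, pop, total, error = cells
        rows.append({"time": when, "user": user, "smtp": smtp, "pop": pop,
                     "total": int(total) if total.isdigit() else None,
                     "error": error})
    return rows


def mean_total_by(rows, key):
    """Average the Total Time of successful tests, grouped by one column."""
    groups = defaultdict(list)
    for row in rows:
        if row["total"] is not None and not row["error"]:
            groups[row[key]].append(row["total"])
    return {name: mean(values) for name, values in groups.items()}


def failures(rows):
    """Return the rows whose Error column is not empty."""
    return [row for row in rows if row["error"]]


if __name__ == "__main__":
    rows = parse_systest(SAMPLE)
    print(mean_total_by(rows, "smtp"))
    print(failures(rows))
```

Grouping by "smtp" compares the MTAs, grouping by "pop" compares the MMPs, and grouping by "user" compares the stores that host the test accounts.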
Monitoring an email site such as the typical one presented in this document requires four steps
Only one system in the installation should run the check_rndtrip script. This system should have access to all of the systems in the site. All of the systems in the site should be configured to run the check_queues, check_files, check_ldap, and check_sys scripts.
Assuming the site architecture outlined in this doc, the check_rndtrip monitoring could be configured using the following steps:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <path to script>/check_rndtrip
4 0 * * * <path to script>/check_rndtrip -r
Assuming the typical site architecture outlined in this doc, the check_queues and other scripts could be configured using the steps below. Because these scripts parse through log files, they must be installed and run on each machine.
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_queues
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_sys
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_ldap
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_files
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_mconn
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_procs
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_rndtrip
59 23 * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_rndtrip -r
Alternatively, you could use the internal scheduler to run those probes:
setconf local.schedule.check_qm "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_queues"
setconf local.schedule.check_sys "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_sys"
setconf local.schedule.check_ldap "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_ldap"
setconf local.schedule.check_files "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_files"
setconf local.schedule.check_mconn "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_mconn"
setconf local.schedule.check_procs "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_procs"
setconf local.schedule.check_rndtrip "0,5,10,15,20,25,30,35,40,45,50,55 * * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_rndtrip"
setconf local.schedule.send_report "59 23 * * * <msg_srv_base>/examples/unsupported/healthmon/bin/check_rndtrip -r"
If you want to use a different reporting style by default, define the new reporting style in the plugins/report_formats.cfg file and update your etc/alarms.cfg file to reflect that change.
The scripts described in this paper should help a system administrator or support engineer monitor and maintain a Messaging Server email infrastructure. However, there are other tools available to monitor a Messaging Server installation. For more information on some of the tools that come with iMS, read Chapter 15 of the Messaging Server admin guide (http://docs.sun.com/source/816-6009-10/monitor.htm).