Start, Stop and Status Monitoring

A typical SHC application consists of multiple parallel activities that need to be initialized at startup and gracefully stopped at shutdown: most interfaces have an internal loop task for interacting with external systems, and each of SHC’s timers has an internal loop to wait for the next trigger time. In addition, “initializable” Reading objects, like SHC Variables, need to read their initial value during startup.

For this purpose, the shc.supervisor module implements functions for controlling startup and shutdown of SHC applications.

The main entry point of SHC applications, after all objects have been constructed and connected, should be the shc.supervisor.main() function, which we already encountered in the examples. It performs the following startup procedure:

Register a signal handler to initiate shutdown when receiving a SIGTERM (or similar)
Start all interface instances via their start() coroutine and await their successful startup
Trigger initialization of variables via read
Start timers (incl. Once triggers)

When a shutdown is initiated, all interfaces (and the SHC timers) are stopped by calling and awaiting their stop() coroutine. The SHC application only quits when all these coroutines have returned successfully.
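The startup and shutdown sequence described above can be illustrated with plain asyncio. The following is a minimal sketch, not SHC's actual implementation: DemoInterface, run_lifecycle and demo are hypothetical names introduced here, and variable initialization and timers are left out.

```python
import asyncio
import signal


class DemoInterface:
    """Hypothetical stand-in for an interface with start()/stop() coroutines."""

    def __init__(self, name: str, log: list) -> None:
        self.name = name
        self.log = log

    async def start(self) -> None:
        await asyncio.sleep(0)  # placeholder for connecting to an external system
        self.log.append(f"start {self.name}")

    async def stop(self) -> None:
        await asyncio.sleep(0)  # placeholder for closing the connection
        self.log.append(f"stop {self.name}")


async def run_lifecycle(interfaces: list, shutdown: asyncio.Event) -> None:
    """Start all interfaces, wait for a shutdown request, then stop them all."""
    loop = asyncio.get_running_loop()
    try:
        # Request a shutdown on SIGTERM; not supported on every platform/thread
        loop.add_signal_handler(signal.SIGTERM, shutdown.set)
    except (NotImplementedError, RuntimeError, ValueError):
        pass
    # An exception raised by any start() propagates and aborts the startup
    await asyncio.gather(*(iface.start() for iface in interfaces))
    await shutdown.wait()
    # Only return once every stop() coroutine has finished
    await asyncio.gather(*(iface.stop() for iface in interfaces))


async def demo() -> list:
    log: list = []
    shutdown = asyncio.Event()
    interfaces = [DemoInterface("a", log), DemoInterface("b", log)]
    task = asyncio.create_task(run_lifecycle(interfaces, shutdown))
    await asyncio.sleep(0.01)  # let the interfaces start
    shutdown.set()             # simulate a shutdown request
    await task
    return log
```

Running asyncio.run(demo()) returns the recorded log, in which both start entries precede both stop entries.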

When an interface fails to start up, it shall raise an exception from its start() coroutine, which will interrupt the SHC startup process.

When an interface encounters a critical error during operation, after successful startup, it may call shc.supervisor.interface_failure() to initiate a shutdown. In this case, SHC will wait for the remaining interfaces to stop and exit with a non-zero exit code.
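The failure path can be sketched in the same spirit. The DemoSupervisor class below is a hypothetical, simplified model of the behaviour described above; its method signature and the concrete exit code 1 are assumptions, not taken from SHC.

```python
import asyncio


class DemoSupervisor:
    """Hypothetical, simplified supervisor; not SHC's implementation."""

    def __init__(self) -> None:
        self.shutdown = asyncio.Event()
        self.exit_code = 0

    def interface_failure(self, name: str) -> None:
        # Remember the failure and request a shutdown of the application
        self.exit_code = 1  # assumption: any non-zero value would do
        self.shutdown.set()

    async def run(self, stop_callbacks: list) -> int:
        await self.shutdown.wait()
        # Wait for the remaining interfaces to stop before reporting the code
        await asyncio.gather(*(callback() for callback in stop_callbacks))
        return self.exit_code


async def demo() -> tuple:
    supervisor = DemoSupervisor()  # created inside the running event loop
    stopped = []

    async def stop_other() -> None:
        stopped.append("other")

    runner = asyncio.create_task(supervisor.run([stop_other]))
    await asyncio.sleep(0)                   # let run() begin waiting
    supervisor.interface_failure("broken")   # simulated critical error
    return await runner, stopped
```

In a real program, the returned code would typically be passed to sys.exit().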

Some interfaces, especially client interfaces inheriting from shc.interfaces._helper.SupervisedClientInterface, can be configured to automatically retry the external connection on errors, even if an error is encountered during the initial startup. As the SHC application will continue to run in these cases, it’s useful to monitor the status of individual interfaces.

Monitoring of Interface Status

For this purpose, most interfaces implement a “monitoring connector” (also called “status connector”). It is a Readable object of value type shc.supervisor.InterfaceStatus that can be retrieved via the interface’s monitoring_connector() method.

If an interface does not provide monitoring capabilities, this method will raise a NotImplementedError.

In many cases, the monitoring connector object is not only Readable but also Subscribable. This can be used to interactively react to interface status changes, e.g. set some variables to an emergency-fallback mode when the SHC client connection to a primary SHC server is lost.

The shc.supervisor.InterfaceStatus includes a basic health status, represented as a shc.supervisor.ServiceStatus (OK / WARNING / CRITICAL / UNKNOWN), based on the service state representation of the Nagios monitoring system (and its successor, the Icinga monitoring system). In addition, the interface can provide a human-readable message to communicate the cause of a problem.
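To make the status structure and the idea of reacting to status changes more concrete, here is a self-contained sketch. The two status types are local stand-ins that mirror the names used in this section; the numeric values follow the Nagios plugin return codes, which is an assumption about SHC. DemoStatusConnector and FallbackSwitch are hypothetical and use plain synchronous callbacks rather than SHC's own Readable/Subscribable interfaces.

```python
from enum import Enum
from typing import Callable, List, NamedTuple


class ServiceStatus(Enum):
    # Stand-in; values follow the Nagios plugin return codes (assumption)
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


class InterfaceStatus(NamedTuple):
    # Stand-in with the same fields and defaults as documented in this section
    status: ServiceStatus = ServiceStatus.OK
    message: str = ""


class DemoStatusConnector:
    """Hypothetical readable and subscribable status connector."""

    def __init__(self) -> None:
        self._value = InterfaceStatus()
        self._callbacks: List[Callable[[InterfaceStatus], None]] = []

    def read(self) -> InterfaceStatus:
        return self._value

    def subscribe(self, callback: Callable[[InterfaceStatus], None]) -> None:
        self._callbacks.append(callback)

    def publish(self, value: InterfaceStatus) -> None:
        # Store the new status and notify every subscriber
        self._value = value
        for callback in self._callbacks:
            callback(value)


class FallbackSwitch:
    """Enters an emergency-fallback mode while the status is not OK."""

    def __init__(self) -> None:
        self.fallback_mode = False

    def on_status(self, status: InterfaceStatus) -> None:
        self.fallback_mode = status.status is not ServiceStatus.OK


connector = DemoStatusConnector()
switch = FallbackSwitch()
connector.subscribe(switch.on_status)
# Simulate a lost connection reported by the interface
connector.publish(InterfaceStatus(ServiceStatus.CRITICAL, "connection lost"))
```

After the last line, switch.fallback_mode is True; publishing a default InterfaceStatus() afterwards resets it to False.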

SHC’s built-in WebServer allows exposing the monitoring status of any number of interfaces, along with an overall status, via an HTTP monitoring endpoint, so that external monitoring systems can check the status of the SHC application; see Monitoring via HTTP.

To include some fundamental system and application status in the monitoring, such as the health of the Python asyncio event loop running the SHC application, there are pseudo-interfaces available in the shc.interfaces.system_monitoring module.

class shc.supervisor.ServiceStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Enum of possible service statuses, derived from the Nagios/Icinga status representation.

namedtuple shc.supervisor.InterfaceStatus(status: ServiceStatus = ServiceStatus.OK, message: str = '')

Interface status information as returned by AbstractInterface.get_status().

Contains the overall interface status (status) and a human-readable message (message), which typically describes the current status, especially the error, if any.

Fields:

status (ServiceStatus) – Overall status of the interface.
message (str) – A textual description of the status, e.g. an error message if status != ServiceStatus.OK

Monitoring Helper classes

This module provides pseudo SHC interfaces that allow monitoring fundamental system functionality, such as the Python asyncio event loop.

These interfaces don’t “interface” with anything, but they provide the usual monitoring_connector() method to be included in the SHC monitoring framework and make use of the supervisor for startup and graceful shutdown.

class shc.interfaces.system_monitoring.EventLoopMonitor(interval: float = 5.0, num_aggr_samples: int = 60, lag_warning: float = 0.005, lag_error: float = 0.02, tasks_warning: int = 1000, tasks_error: int = 10000)

A special SHC interface class for monitoring the health of the asyncio Event Loop.

This interface only provides a monitoring connector, allowing external monitoring systems to monitor the health of this application’s event loop.

For this purpose, when started, it regularly checks the current number of asyncio tasks and the delay of scheduled function calls in the event loop. From these measurements, the maximum value over a number of intervals is calculated for each metric. These maximum values are reported via the tasks and lag connectors. The interface’s service status is determined by comparing these metrics to fixed threshold values.

Parameters:

interval – Interval for checking the function call delay and number of tasks in seconds
num_aggr_samples – Number of intervals over which the measurements are aggregated. For both delay and task number, the maximum of all samples is reported and compared to the threshold values. Thus, at any time, the monitoring status covers a timespan of the last num_aggr_samples * interval seconds.
lag_warning – Threshold for the scheduled function call delay in seconds to report WARNING state
lag_error – Threshold for the scheduled function call delay in seconds to report CRITICAL state
tasks_warning – Threshold for the number of active/waiting asyncio Tasks to report WARNING state
tasks_error – Threshold for the number of active/waiting asyncio Tasks to report CRITICAL state

Variables:

tasks – readable and subscribable connector, representing and publishing the current number of active/waiting asyncio Tasks (maximum within the sample interval)
lag – readable and subscribable connector, representing and publishing the current call delay (maximum within the sample interval)
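The measurement principle described for EventLoopMonitor can be sketched with the standard library alone. This is an illustration under assumptions, not the class's actual code: measure_lag, classify and sample_loop_health are hypothetical helpers, the lag is approximated as the delay of a zero-delay call_soon() callback, and a threshold counts as exceeded when the value is greater than or equal to it.

```python
import asyncio
import time
from collections import deque

LEVELS = ["OK", "WARNING", "CRITICAL"]


async def measure_lag() -> float:
    """Return the delay (in seconds) until a zero-delay callback actually ran."""
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    scheduled = time.monotonic()
    loop.call_soon(lambda: done.set_result(time.monotonic() - scheduled))
    return await done


def classify(value: float, warning: float, error: float) -> str:
    """Compare a metric to its thresholds (assumption: inclusive comparison)."""
    if value >= error:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


async def sample_loop_health(interval: float, num_aggr_samples: int, rounds: int,
                             lag_warning: float = 0.005, lag_error: float = 0.02,
                             tasks_warning: int = 1000, tasks_error: int = 10000):
    """Take `rounds` samples and report maxima over the last num_aggr_samples."""
    lag_samples = deque(maxlen=num_aggr_samples)
    task_samples = deque(maxlen=num_aggr_samples)
    for _ in range(rounds):
        lag_samples.append(await measure_lag())
        task_samples.append(len(asyncio.all_tasks()))
        await asyncio.sleep(interval)
    max_lag, max_tasks = max(lag_samples), max(task_samples)
    # The overall status is the worse of the two per-metric results
    status = max(classify(max_lag, lag_warning, lag_error),
                 classify(max_tasks, tasks_warning, tasks_error),
                 key=LEVELS.index)
    return max_lag, max_tasks, status
```

For example, asyncio.run(sample_loop_health(0.01, 60, 5)) takes five samples at 10 ms intervals and returns the maximum lag, the maximum task count and the resulting status string. The bounded deque is what limits the reported maxima to the most recent num_aggr_samples measurements.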