Building for Resilience
Today I’m going to talk about how I build resilience into my personal projects.
This is not a post about testing or writing testable code. I’ll admit that my strategy of having “fun” with my personal projects has limited my investment in unit and integration testing at home. I may write about this topic in the future; meanwhile, you can find a wealth of information online, including plenty that is specific to Python.
Concepts
Resilience Engineering is a big topic. In my projects I’ve defined resilience as having these properties:
- Takes predictable action for either application or dependency failures, and recovers to steady-state when the disruption is resolved.
- Provides adequate visibility into new scenarios that are not well handled by existing logic, without generating noise when dependencies are disrupted.
Constraints and Assumptions
I also apply an important simplifying assumption in my project design: disruptions (usually outside of main event loops and in daemon threads) trigger an application shutdown [with an expectation of a supervised restart]. This dramatically reduces the number of recovery scenarios that I need to specifically handle in code if I want the application to fully return to an unimpaired state. An example of an impaired state would be thread death.
There are many real-world scenarios where this is almost certainly undesirable behaviour from automation because it explicitly introduces correlated disruption either across a horizontally scaled application or across applications sharing a common strategy. It also assumes that application restart carries neither impact to the end-user experience nor system resource cost at startup, which is a bad assumption in real-world systems at scale. Substantial, real-time systems tend to have more exhaustive exception handling mechanisms to obviate the need for restart-on-failure and usually also employ some kind of traffic throttling mechanism to protect available resources. Exception handling is made robust by unit and regression testing that explicitly exercises recovery code paths. Because exceptions are, by definition, exceptional, this should be the absolute minimum due diligence when building resilience off the common path.
There are legitimate examples of forced-restart in the form of watchdog timers (WDT) typically used in low-level applications like micro-controllers. In these environments it is more useful to trigger a reboot of the processor with the expectation of post-reboot success than to leave a device in a stuck or impaired state. It also means that all setup and initialization code can be run during startup paths, keeping exception paths simple.
Since my personal projects carry none of these constraints and incidentally benefit from [non-redundant] message brokers, short disruptions for restarts should not actually drop any un-fetched work and may only delay event processing for a few seconds during application restart.
Desired Traits
With that, let’s discuss a few of the resilience features I wanted in all my applications.
Steady State
When an application is running in the steady state, it has:
- A uniform interface for logging activity.
- Unhandled exception capturing with the ability to define additional integrations as needed.
- Automatic detection of thread death with a means of responding appropriately.
Shutdown
When an application is asked to shut down or is in the process of shutting down:
- Unless busy in application logic or I/O, a thread must shut down immediately when it is signaled by the application to do so. All blocking behaviour, including sleeps, must allow interruption (a sketch follows this list). As far as possible, this behaviour is also followed by daemon threads to support clean shutdown of ZeroMQ, which requires that all sockets be closed to allow the main thread to terminate.
- The application logger switches to debug logging automatically if shutdown takes longer than 30 seconds. This also includes a listing of all remaining ZeroMQ sockets along with their code instantiation locations. This gives the application a chance to report on what is delaying the shutdown before the process manager later follows `TERM` with `KILL`.
- If an unhandled exception is responsible for application shutdown, the process exit code is set to non-zero (typically `1`). This seems obvious, but isn’t necessarily the behaviour for non-trivial applications.
- A process supervisor must restart the application for exits where the code is not `0`. It is of course possible to make the process manager restart the child process under all conditions, but I’ve found it useful for environment debugging to be able to send a `TERM` signal to the app and to keep the supervisor from bringing back the application in the container.
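As a concrete illustration of interruptible blocking, here is a minimal sketch (not the pylib implementation) of a worker loop that waits on a shared `threading.Event` instead of calling `time.sleep()`, so that signalling shutdown wakes it immediately:

```python
import threading

# Hypothetical shared shutdown latch; pylib keeps similar state in pylib.threads.
shutdown_event = threading.Event()

def worker_loop():
    while not shutdown_event.is_set():
        print('doing periodic work')  # placeholder for real work
        # Event.wait() returns as soon as the event is set, unlike
        # time.sleep(), which would block for the full timeout.
        shutdown_event.wait(timeout=10)

# Elsewhere (for example in a signal handler), shutdown_event.set() wakes
# the worker immediately so it can exit its loop and shut down cleanly.
```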
Monitoring and Observability
This is a big topic and worth spending some time researching online. The IBM Cloud blog draws a useful distinction between the concepts of monitoring and observability:

> Monitoring tells you when something is wrong, while observability can tell you what’s happening, why it’s happening and how to fix it.
Here is a brief list of the mechanisms I use in my projects, both self-made and borrowed from helpful tools online.
It’s useful to develop an early opinion about which tools need to be local and/or network-isolated versus which can be reached on the Internet, because this will determine your monitoring strategy and almost certainly the setup and operating cost as well.
I’ll first introduce the functional properties of these mechanisms and then, in a later section, show how they are used in code.
- cronitor.io: Using `cronitor-python` for in-process monitoring, I post periodic metrics from my thread monitor `thread_nanny` and report on thread counts and missing threads as part of the metadata. This allows me to explicitly monitor cases where threads have died unexpectedly.
- healthchecks.io: All my container projects include a cron job that runs a shell script invoking `curl` against project-specific URLs in the Healthchecks web service. Calls to the URL are tracked by Healthchecks and overdue calls trigger alerts via your chosen integrations, of which there are many. While Healthchecks does support in-process probes, I explicitly want mine sent via `cron` to verify that my container application has a properly functioning `cron` instance for other jobs like cleanup or backups.
- InfluxDB: I post a variety of time-series data from my various projects. Influx provides both a containerized project for network-local instances and alerting capabilities based on InfluxQL queries. The main purpose of this is to visualize my data in the form of metrics. Excellent alternatives include Grafana or Prometheus.
- PagerDuty: While I don’t interact with this service directly from code, the monitoring services above do include PagerDuty as an integration option and so this provides resilience in communication of the issue. Good paging tools provide both communication (push notification, phone call) and team escalations (another human).
- sentry.io: This service is the single reason that I am able to achieve resilience in my personal projects without bothering with automated testing or tailing logs for every single unexpected issue. One of the many features I use with Sentry is an in-process mechanism for capturing unhandled exceptions. When triggered, Sentry will automatically create a unique ticket for the issue, including a variety of metadata such as local context and call-tracing breadcrumbs. With Sentry, I’ve been able to rapidly identify unintuitive failure modes and add robustness to my implementation where an application restart (to fix) is either inappropriate or unnecessary.
- Telegram: While not a monitoring tool as such, Telegram provides a bot interface to easily communicate rich context about actions taken by the application or other discretionary information. The monitoring tools above also have the ability to send notifications to a Telegram group.
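As a rough sketch of the Telegram option, a notification boils down to a single HTTP call to the Bot API’s `sendMessage` method; the token, chat ID and helper below are placeholders rather than code from my projects:

```python
import requests

# Placeholders: create a bot via @BotFather and use your group's chat ID.
TELEGRAM_BOT_TOKEN = '123456:ABC-your-bot-token'
TELEGRAM_CHAT_ID = '-1001234567890'

def notify(message: str):
    # Telegram Bot API sendMessage: push rich context to a group or channel.
    response = requests.post(
        f'https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage',
        json={'chat_id': TELEGRAM_CHAT_ID, 'text': message},
        timeout=10)
    response.raise_for_status()
```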
Top Tip: Host reboots and network disruptions of your service flush out hard-to-find issues because they exercise your code together with all the library, system and network dependencies that you [can and should] take for granted. Test both controlled and uncontrolled shutdown of containers and host systems. It will teach you some valuable lessons, I guarantee it.
Putting it Together
In the same way that my pylib project contains a variety of code factored out of my projects over time, I discovered that it was useful to do the same with the project structure of my container applications. You can find an example in my base-app project, which uses pylib as a package dependency (installed as a git submodule). I designed this project to also be stand-alone in order to test the basic functionality of a working application. This gives me confidence that Docker projects that extend this project have inherited functionality that is already tested. The examples below draw from both of these projects.
Logging
With `APP_NAME` defined in `__init__.py` (for example here), a Python StreamHandler is created in pylib’s `__init__.py` to include both the application name and thread name. By default the system log is used; otherwise the console is used, which is useful when running the application interactively.
```python
import logging
import logging.handlers
import os
import sys

# APP_NAME is defined by the application package
log = logging.getLogger(APP_NAME)
log.propagate = False
log.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(name)s %(threadName)s [%(levelname)s] %(message)s')
log_handler = None
if os.path.exists('/dev/log'):
    log_handler = logging.handlers.SysLogHandler(address='/dev/log')
    log_handler.setFormatter(formatter)
    log.addHandler(log_handler)
if sys.stdout.isatty() or ('SUPERVISOR_ENABLED' in os.environ and log_handler is None):
    log.warning("Using console logging because there is a tty or under supervisord.")
    log_handler = logging.StreamHandler(stream=sys.stdout)
    log_handler.setFormatter(formatter)
    log.addHandler(log_handler)
```
It’s typically convenient to have all container logs sent to a central remote logging service. On Linux, these logs can be easily forwarded to the host system’s `rsyslog` instance by using the logging driver in your `docker-compose.yml` template.
version: "3.8"
services:
app:
logging:
driver: syslog
I happen to use SolarWinds Papertrail for off-box log persistence, but the free tier does not tolerate logs that are too chatty.
Unhandled Exceptions
Sentry makes this so easy that there’s very little to say about it in terms of code. If you want to explicitly forward an exception as a ticket to Sentry, you can use the following pattern.
```python
from sentry_sdk import capture_exception

def some_function():
    try:
        ...  # normal processing
    except NetworkError:  # placeholder for a specific, expected error
        ...  # application-level handling of the specific error
    except Exception:
        # catch-all: forward anything unexpected to Sentry
        capture_exception()
```
It is important to note that Sentry will still detect unhandled exceptions via your logger without you always having to call `capture_exception` as above. You can also install a Sentry filter for a logger namespace in order to prevent tickets being created for cases where there is application-level handling. I found that I needed this to filter some exception noise from the RabbitMQ client library (pika). Of course, filtering should be used with care to prevent masking real issues.
```python
from sentry_sdk.integrations.logging import ignore_logger

ignore_logger('pika.adapters.utils.io_services_utils')
```
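If you want finer control over what the logging integration turns into Sentry events versus breadcrumbs, the SDK’s `LoggingIntegration` can be configured at `init` time. The snippet below is a generic example with a placeholder DSN, not configuration from my projects:

```python
import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_sdk.init(
    dsn='https://examplePublicKey@o0.ingest.sentry.io/0',  # placeholder DSN
    integrations=[
        # INFO and above become breadcrumbs; ERROR and above create Sentry
        # events. These are the integration's defaults, made explicit here.
        LoggingIntegration(level=logging.INFO, event_level=logging.ERROR),
    ],
)
```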
Here is an example of using Sentry with integrations. In this example, the HTTP 500 handler for Python Flask is updated with information to enable a feedback form to post to Sentry. A user-friendly way to admit failure.
```python
import sentry_sdk
from flask import render_template
from sentry_sdk import last_event_id
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn=creds.sentry_dsn,  # creds is the application's credential store
    integrations=[FlaskIntegration()]
)


# flask_app is the application's Flask instance, created elsewhere
@flask_app.errorhandler(500)
def internal_server_error(e):
    return render_template('error.html',
                           sentry_event_id=last_event_id(),
                           sentry_dsn=creds.sentry_dsn
                           ), 500
```
Process Management
Some kind of process manager is needed to control and monitor execution of your application. I’ve had prior success with systemd but for my container applications I currently use supervisord which is loaded as part of my container entrypoint. By using this syntax below, I replace the execution context of the Docker entrypoint with supervisord as the root process.
```sh
exec env supervisord -n -c /opt/app/supervisord.conf
```
Supervisord has some helpful configuration templates and good documentation on default values and possible overrides. Since all my applications use a common pattern, they all use this stanza:
```ini
[program:app]
command=poetry run python -m app
directory=/opt/app/
user=app
autorestart=unexpected
```
The documentation for the program stanza has this to say about `autorestart`, which I’ve set to `unexpected`:
> If unexpected, the process will be restarted when the program exits with an exit code that is not one of the exit codes associated with this process’ configuration…
If, for whatever reason, the root process fails or the container exits unexpectedly due to an environment issue, Docker can also be configured with a rule regarding what to do with the container. In my example, I use `unless-stopped`, which will restart the container under any condition other than an explicit stop, including starting the container at host boot.
version: "3.8"
services:
app:
restart: unless-stopped
Helpers
I have also built a few patterns in pylib for common error handling. Python’s context manager protocol provides a convenient way to support sophisticated but scoped activity life-cycle management. I’ve applied this pattern in my exception_handler, which takes appropriate action based on the outcome of the activity block.
These work in tandem with another module, pylib.threads, which contains the thread nanny (aptly named `thread_nanny`) mentioned earlier, as well as a few thread trackers that make some abuse of a handful of globals relying on Python’s threading.Event. Any instantiation of `AppThread` automatically registers the thread for tracking by the nanny.
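The real implementation lives in pylib, but a minimal sketch of the registration idea might look like the following; the globals and the nanny loop here are illustrative rather than lifted from pylib:

```python
import threading

# Illustrative globals; pylib keeps similar state in pylib.threads.
tracked_threads = set()
shutdown_event = threading.Event()

class AppThread(threading.Thread):
    """Thread subclass that registers itself for tracking by the nanny."""

    def __init__(self, name):
        super().__init__(name=name)
        tracked_threads.add(self)

def thread_nanny():
    # Periodically check that every started, registered thread is still alive.
    while not shutdown_event.is_set():
        dead = [t.name for t in tracked_threads
                if t.ident is not None and not t.is_alive()]
        if dead:
            # The real implementation logs this, reports it to Cronitor as
            # metadata, and can trigger an application shutdown.
            print(f'threads no longer running: {dead}')
        shutdown_event.wait(timeout=5)
```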
Let’s take an end-to-end example from the base-app entrypoint, which also installs a signal handler. This lays down all the code necessary to start the application and worker threads, which continue working until the application signals a shutdown.
```python
import logging
import threading

from zmq.error import ContextTerminated

# the package logger and shared thread state are provided by pylib
from pylib import log, threads
from pylib.process import SignalHandler
from pylib.threads import thread_nanny, die, bye
from pylib.app import AppThread
from pylib.zmq import zmq_term, Closable
from pylib.handler import exception_handler


class EventProcessor(AppThread, Closable):

    def __init__(self):
        AppThread.__init__(self, name=self.__class__.__name__)
        Closable.__init__(self, connect_url='inproc://my-zeromq-in-process-socket')

    def run(self):
        with exception_handler(closable=self, and_raise=False, shutdown_on_error=True):
            while not threads.shutting_down:
                event = self.socket.recv_pyobj()
                log.debug(event)
                # other processing


def main():
    # only log at INFO level
    log.setLevel(logging.INFO)
    # ensure proper signal handling; must be main thread
    signal_handler = SignalHandler()
    # create the application worker thread
    event_processor = EventProcessor()
    # start the thread nanny with signal handler
    nanny = threading.Thread(
        name='nanny',
        target=thread_nanny,
        args=(signal_handler,),
        daemon=True)
    try:
        event_processor.start()
        # start thread nanny
        nanny.start()
        # main thread now waits on the shutdown latch
        threads.interruptable_sleep.wait()
        raise RuntimeWarning()
    except (KeyboardInterrupt, RuntimeWarning, ContextTerminated) as e:
        log.warning(str(e))
        threads.shutting_down = True
        # ensure the latch is set if we arrive here due to another issue
        threads.interruptable_sleep.set()
    finally:
        # tell ZeroMQ to shut down (blocks on any remaining open sockets)
        zmq_term()
        # exit the Python process with an exit code dependent on exceptions thrown
        bye()


if __name__ == "__main__":
    main()
```
Here’s a little more detail about how `exception_handler` does its job, particularly around its `__exit__` behaviour:
```python
from sentry_sdk import capture_exception
from zmq.error import ContextTerminated

# log is the package logger created in pylib's __init__.py
from . import log, threads
from .threads import die
from .zmq import Closable, try_close


class exception_handler(object):

    def __init__(self, closable: Closable = None, connect_url=None, socket_type=None, and_raise=True, close_on_exit=True, shutdown_on_error=False):
        self._closable = closable
        self._zmq_socket = None
        self._zmq_url = connect_url
        self._socket_type = socket_type
        self._and_raise = and_raise
        self._close_on_exit = close_on_exit
        self._shutdown_on_error = shutdown_on_error

    def __enter__(self):
        # socket creation (when connect_url and socket_type are given) is elided here
        ...

    def __exit__(self, exc_type, exc_val, tb):
        if self._close_on_exit or (exc_type and issubclass(exc_type, ContextTerminated)):
            if self._closable:
                self._closable.close()
            elif self._zmq_socket:
                try_close(self._zmq_socket)
        if exc_type is None:
            return True
        if issubclass(exc_type, ContextTerminated):
            # treat as non-critical
            return True
        elif issubclass(exc_type, ResourceWarning):
            # raised to indicate a fatal dependency error that
            # does not fill Sentry with exception regressions
            # or unhandled exceptions; used typically at startup
            log.warning(self.__class__.__name__, exc_info=True)
            if self._shutdown_on_error:
                die(exception=exc_type)
        elif issubclass(exc_type, Exception):
            if not threads.shutting_down:
                log.exception(self.__class__.__name__)
                capture_exception(error=(exc_type, exc_val, tb))
                if self._shutdown_on_error:
                    die(exception=exc_type)
            else:
                # log the exception as informational if in debug mode
                log.debug(self.__class__.__name__, exc_info=True)
        return not self._and_raise
```
When the context manager closes, the `__exit__` method is called by the Python runtime. If there is a ZeroMQ socket or `Closable` associated with the context manager, an attempt is made to close it. If the context manager has no exception context, denoted by the `exc_type` parameter, then the context manager exits with a return value of `True`, indicating that nothing will be re-raised to the calling code. If there is an exception on exit:
- A ZeroMQ `ContextTerminated` exception, which happens when a ZeroMQ socket operation is attempted after calling `zmq.Context().term()`, is treated as non-critical; handling it is pointless because the application is shutting down.
- I abuse Python’s built-in `ResourceWarning` as a placeholder for an unrecoverable error (like a dependency failure) that should trigger an application shutdown but without capturing an error in Sentry, because there is nothing to debug in the application code. Of course, the dependency needs its own monitoring; I’ve found Uptime Kuma a good option for this. A sketch of this convention follows the list.
- For any (unhandled) `Exception` type, capture the error in Sentry if the application isn’t already shutting down. The `die()` method captures the exception to use in the exit code for the process.
- Re-raise the exception if the context manager was created with the parameter `and_raise` set to `True`.
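To make the `ResourceWarning` convention concrete, here is a hypothetical startup check written against the behaviour described above; the probe and its failure condition are illustrative, not code from base-app:

```python
from pylib.handler import exception_handler

def check_message_broker():
    # Hypothetical probe; replace with a real connectivity check.
    broker_reachable = False
    if not broker_reachable:
        # Fatal dependency failure: shut down (for a supervised restart)
        # without creating a Sentry issue, per the ResourceWarning convention.
        raise ResourceWarning('message broker unavailable at startup')

# and_raise=False suppresses re-raising to the caller; shutdown_on_error=True
# asks exception_handler to call die(), setting a non-zero exit code so the
# process supervisor restarts the application and startup is retried.
with exception_handler(and_raise=False, shutdown_on_error=True):
    check_message_broker()
```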