Health monitoring
This document describes the options available for monitoring the health of a
QuestDB instance. There are options for minimal health checks via a min
server
which provides a basic 'up/down' check, or detailed metrics in Prometheus format
exposed via an HTTP endpoint.
#
Prometheus metrics endpointPrometheus is an open-source systems monitoring and alerting toolkit. Prometheus collects and stores metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
QuestDB exposes a /metrics
endpoint which provides internal system metrics in
Prometheus format. To use this functionality and get started with example
configuration, refer to the
Prometheus documentation.
#
Min health serverREST APIs will often be situated behind a load balancer that uses a monitor URL
for its configuration. Having a load balancer query the QuestDB REST endpoints
(on port 9000
by default) will cause internal logs to become excessively
noisy. Additionally, configuring per-URL logging would increase server latency.
To provide a dedicated health check feature that would have no performance knock
on other system components, we opted to decouple health checks from the REST
endpoints used for querying and ingesting data. For this purpose, a min
HTTP
server runs embedded in a QuestDB instance and has a separate log and thread
pool configuration.
The configuration section for the min
HTTP server is available in the
minimal HTTP server reference.
The min
server is enabled by default and will reply to any HTTP GET
request
to port 9003
:
The server will respond with an HTTP status code of 200
, indicating that the
system is operational:
Path segments are ignored which means that optional paths may be used in the URL and the server will respond with identical results, e.g.:
info
The /metrics
path segment is reserved for metrics exposed in Prometheus
format. For more details, see the
Prometheus documentation.
#
Unhandled error detectionWhen metrics subsystem is
enabled
on the database, the health endpoint checks the occurrences of unhandled,
critical errors since the database start and, if any of them were detected, it
returns HTTP 500 status code. The check is based on the
questdb_unhandled_errors_total
metric.
When metrics subsystem is disabled, the health check endpoint always returns HTTP 200 status code.
#
Avoiding CPU starvationOn systems with 8 Cores and less, contention for threads might increase the latency of health check service responses. If you are in a situation where a load balancer thinks QuestDB service is dead with nothing apparent in QuestDB logs, you may need to configure a dedicated thread pool for the health check service. For more reference, see the minimal HTTP server configuration.