Gateway Metrics (internal performance counters, measurements and profiling)

You will learn how to enable and query gateway metrics.

Basic

A minimal command-line for a gateway looks something like this

$ roq-deribit \
    --name "deribit" \
    --config_file $HOME/deribit.toml \
    --client_listen_address unix://$HOME/deribit.sock \
    --flagfile $CONDA_PREFIX/share/roq-deribit/flags/test/flags.cfg

There are two different ways to enable metrics:

  • The gateway is passive and will accept HTTP requests to query its metrics.

  • The gateway will actively push its metrics to an HTTP service.

Active

We want to configure the gateway to push metrics to Prometheus’ pushgateway.

In this case you would add the --metrics_push_uri flag like this

$ roq-deribit \
    --name "deribit" \
    --config_file $HOME/deribit.toml \
    --client_listen_address unix://$HOME/deribit.sock \
    --flagfile $CONDA_PREFIX/share/roq-deribit/flags/test/flags.cfg \
    --metrics_push_uri http://localhost:9091/metrics

You can control the push frequency with the --metrics_push_freq flag. The default is 5 seconds, which is normally sufficient.

Note

The active case is easiest to maintain if you deploy many gateways.

Passive

We want to configure Prometheus to query the gateway for metrics.

In this case we add the --service_listen_address flag like this

$ roq-deribit \
    --name "deribit" \
    --config_file $HOME/deribit.toml \
    --client_listen_address unix://$HOME/deribit.sock \
    --flagfile $CONDA_PREFIX/share/roq-deribit/flags/test/flags.cfg \
    --service_listen_address tcp://localhost:1234

We can now query the gateway’s metrics

$ curl http://localhost:1234/metrics

The full response is quite verbose because the gateways have many internal measurement points. The following sections show only a few examples.
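Because the response is verbose, it can help to filter it down to a single metric family. The following is a minimal Python sketch, not part of the gateway tooling: the helper names fetch_metrics and select_metric are hypothetical, and the URL is assumed to match the --service_listen_address example above.

```python
from urllib.request import urlopen


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw text exposed by the gateway's metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def select_metric(text: str, name: str) -> list[str]:
    """Keep only the sample lines whose metric name starts with `name`."""
    selected = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# TYPE ...".
        if not line or line.startswith("#"):
            continue
        if line.startswith(name):
            selected.append(line)
    return selected
```

For example, select_metric(fetch_metrics("http://localhost:1234/metrics"), "roq_counter") would return only the counter samples, assuming the gateway is listening on that address.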

Examples

The most common metrics are counters and histograms.

Counters

An example of a counter

# TYPE roq_counter counter
roq_counter{source="deribit", connection="5:md", function="disconnect"} 1 1777551390465

This shows that the connection named "5:md" has experienced 1 (one) disconnect.

Note

The name "5:md" is used to quickly identify the connection. The leading number is the stream_id. The string that follows is a short name for the type of connection. To find the exact type of the connection, you should consult your gateway log (or the gateway’s event-log) and find the stream_update events.
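As a small illustration of this naming convention, the connection label can be split into its two parts. This is only a sketch: the function name split_connection is hypothetical, and it assumes the label always has the form "<stream_id>:<short name>" as in the example above.

```python
def split_connection(name: str) -> tuple[int, str]:
    """Split a connection label such as "5:md" into (stream_id, short_name)."""
    # Assumption: the part before the first ":" is always a decimal stream_id.
    stream_id, _, short_name = name.partition(":")
    return int(stream_id), short_name
```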

The gateways typically open many connections and we can use these counters to see if there are connections which bounce (the counter will be increasing with time).
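One way to spot bouncing connections is to compare two scrapes and report the connections whose disconnect counter has grown. The sketch below assumes the sample layout shown in the counter example above; the function names disconnects and bouncing are hypothetical.

```python
import re

# Matches a labelled sample line: name{labels} value [timestamp]
SAMPLE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
# Matches one key="value" pair inside the braces.
LABEL = re.compile(r'(\w+)="([^"]*)"')


def disconnects(text: str) -> dict[str, float]:
    """Map connection name -> disconnect count for one scrape."""
    result = {}
    for line in text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match or match["name"] != "roq_counter":
            continue
        labels = dict(LABEL.findall(match["labels"]))
        if labels.get("function") == "disconnect":
            result[labels.get("connection", "")] = float(match["value"])
    return result


def bouncing(previous: str, current: str) -> dict[str, float]:
    """Return the increase per connection between two scrapes (increases only)."""
    before = disconnects(previous)
    after = disconnects(current)
    return {
        connection: count - before.get(connection, 0.0)
        for connection, count in after.items()
        if count > before.get(connection, 0.0)
    }
```

A counter that goes down between two scrapes (for example after a gateway restart) is ignored by this sketch. Prometheus handles such resets itself when you use its rate functions.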

Note

We could build a Grafana dashboard to monitor disconnects, or we could configure Prometheus’ Alertmanager to send notifications if the disconnects persist over a period of time. These are just two use cases.

Histograms

An example of a histogram

# TYPE roq_request_latency histogram
roq_request_latency_bucket{source="deribit", function="exchange", le="10000"} 0 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="100000"} 0 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="1000000"} 0 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="10000000"} 0 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="100000000"} 3 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="1000000000"} 3 1777551390465
roq_request_latency_bucket{source="deribit", function="exchange", le="+Inf"} 3 1777551390465
roq_request_latency_sum{source="deribit", function="exchange"} 95390153 1777551390465
roq_request_latency_count{source="deribit", function="exchange"} 3 1777551390465

This type of metric also has a count, which is 3 (three) in this case.

There is also information about the sum, which is the accumulated total time spent for all events. In this case we can compute an average of 95390153 / 3 ≈ 31796717.7 ns ≈ 31.8 ms.
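The same arithmetic can be written as a short Python sketch. The function name average_latency_ms is hypothetical; the inputs are the _sum and _count values from the histogram above, with the sum in nanoseconds.

```python
from typing import Optional


def average_latency_ms(total_ns: float, count: int) -> Optional[float]:
    """Average latency in milliseconds from a histogram's sum (ns) and count."""
    if count == 0:
        return None  # no events recorded yet, so no average exists
    return total_ns / count / 1e6  # 1 ms = 1,000,000 ns


# Values taken from roq_request_latency_sum and roq_request_latency_count.
print(round(average_latency_ms(95390153, 3), 1))  # prints 31.8
```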

Note

All periods are captured in nanoseconds.

Note

Metrics only capture aggregate snapshots of various counters. The histogram buckets are relatively wide and generally useful to detect tail events. If you want all the details, please study the event-log which will record every single event.

We also have information about the distribution of the latencies. In this case we see that all 3 events were in the range (10ms; 100ms].
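The bucket values are cumulative: each le bucket counts every event at or below its upper bound. To see how many events fall into each individual range, subtract the previous bucket's count. The sketch below uses the numbers from the histogram above; the function name per_bucket_counts is hypothetical, and the lower bound of the first range is assumed to be 0 because latencies are not negative.

```python
import math


def per_bucket_counts(
    cumulative: list[tuple[float, int]],
) -> list[tuple[float, float, int]]:
    """Convert cumulative (le, count) pairs into (lower, upper, count) ranges."""
    ranges = []
    lower, seen = 0.0, 0
    for upper, count in sorted(cumulative):
        # Events in (lower, upper] = cumulative count minus everything below.
        ranges.append((lower, upper, count - seen))
        lower, seen = upper, count
    return ranges


# (le in nanoseconds, cumulative count) from roq_request_latency_bucket.
buckets = [(1e4, 0), (1e5, 0), (1e6, 0), (1e7, 0), (1e8, 3), (1e9, 3), (math.inf, 3)]
for lower, upper, count in per_bucket_counts(buckets):
    if count:
        print(f"({lower / 1e6:g}ms; {upper / 1e6:g}ms]: {count}")  # prints (10ms; 100ms]: 3
```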