Latency Experiment¶
The purpose of this experiment is to demonstrate the host-specific latency profile of a reasonably realistic trading setup. By following the steps outlined in this document, you should be able to measure latencies on your own server configuration.
Summary¶
The test setup includes a Deribit gateway and two connected clients.
The gateway connects to Deribit’s testnet.
Both clients will automatically respond to ping messages sent by the gateway.
Client #1 will subscribe to all symbols from the gateway.
Client #2 is a simple trading strategy which will manage orders through the gateway. This client only needs to subscribe to a single symbol.
All components will be configured for low latency.
This document will
Describe
A typical server configuration
How to install and configure the software
How to extract latency metrics from the running gateway
Demonstrate
Function profiling
Internal ping latency
Internal round-trip latency
External latency
Preparations¶
Platform¶
This is the server configuration used for testing
AMD EPYC 3251 8-Core Processor
Hyper threading disabled in the BIOS
Ubuntu 18.04 LTS
Kernel boot command-line includes
isolcpus=1-6
Dynamic frequency scaling disabled using
tuned-adm profile network-latency
Docker CE installed
Prometheus and Grafana running on same host (as Docker containers)
Note
A true low latency configuration should use RSS (receive side scaling), IRQ balancing, have local timer interrupts disabled, etc. However, these are advanced topics and not required for most use-cases.
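To verify the basic tuning described above, a few standard commands can be used. This is only a sketch; exact output varies between distributions and kernel versions.
# confirm the isolated CPUs on the kernel command line
cat /proc/cmdline
# confirm the active tuned profile
tuned-adm active
# confirm hyper-threading is off (should report 1 thread per core)
lscpu | grep -i "thread(s) per core"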
Knowing the NUMA architecture is very important if you want to achieve the lowest inter-process latencies
$ lstopo --no-io
Machine (31GB) + Package L#0
L3 L#0 (8192KB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#4)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#5)
L3 L#1 (8192KB)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#2)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#3)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
We will be running the gateway on processor #1.
The lowest latencies can be achieved if we run the clients on processors #4 and #5, since they reside on the same NUMA node as processor #1.
We will also include an experiment to measure the cross-connect between the two nodes; this can be achieved by running one of the clients on processor #3, for example.
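Later, once the gateway and clients are running, you can confirm their CPU placement with taskset. This is a sketch, assuming a single gateway process that can be found by name:
# show the allowed CPU list of the running gateway
taskset -cp $(pgrep -f roq-deribit)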
Prerequisites¶
Download Mambaforge
wget -N https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
Install Mambaforge
bash Mambaforge-Linux-x86_64.sh -b -u -p ~/conda
Activate conda
source ~/conda/bin/activate
Note
You should repeat this step whenever you open a new terminal window and you need to access your conda environment.
Install the required packages
conda install \
--channel https://roq-trading.com/conda/stable \
roq-deribit \
roq-cpp-samples \
roq-test
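Once the installation completes, you can confirm that the packages are present in the environment, for example:
conda list | grep "^roq"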
Gateway¶
Let’s create a config file named deribit.toml.
You can start by copying the template
cp $CONDA_PREFIX/share/roq/deribit/config.toml deribit.toml
Edit the config file and update with your Deribit API key and secret
[symbols]
include = ".*"
exclude = "USDT-.*
[accounts]
[accounts.A1]
master = true
login = "YOUR_DERIBIT_LOGIN_GOES_HERE"
secret = "YOUR_DERIBIT_SECRET_GOES_HERE"
symbols = ".*"
[users]
[users.test]
password = "1234"
symbols = ".*"
[users.trader]
password = "secret"
accounts = [ "A1" ]
symbols = [ "BTC-.*" ]
monitor_period_secs = 60
ban_period_secs = 300
request_limit = 10
Note
Update with your specific details.
You can search for YOUR_DERIBIT and change accordingly.
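For example, a quick grep will show any placeholders that still need to be replaced:
grep -n "YOUR_DERIBIT" deribit.toml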
It is convenient to create a flag file named deribit.flags with the following content
--name=deribit
--metrics_listen_address=1234
--fix_uri=tcp://test.deribit.com:9881
--ws_uri=wss://test.deribit.com/ws/api/v2
--loop_sleep=0ns
--loop_timer_freq=250ns
Note
You can read more about flags and flag files here.
The gateway can now be started like this
roq-deribit \
--config_file "deribit.toml" \
--flagfile "deribit.flags" \
--loop_cpu_affinity=1 \
--client_listen_address ~/deribit.sock
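Once the gateway is running, you can check that the client listen socket and the metrics endpoint are up. This assumes the flags shown above; otherwise the socket path and port will differ.
# the UNIX domain socket the clients will connect to
ls -l ~/deribit.sock
# the metrics endpoint should respond with Prometheus-formatted text
curl -s http://localhost:1234/metrics | head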
Client #1¶
Started like this
roq-cpp-samples-example-4 \
--name "test" \
--exchange "deribit" \
--symbols ".*" \
--dispatcher_affinity 4 \
~/deribit.sock
Client #2¶
Started like this
roq-test \
--name "trader" \
--exchange "deribit" \
--symbol "BTC-PERPETUAL" \
--dispatcher_affinity 5 \
--enable_trading \
~/deribit.sock
Testing¶
Metrics¶
Gateway metrics can be retrieved from the HTTP interface
curl -s http://localhost:1234/metrics 2>&1 | less
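The output is plain Prometheus text, so standard tools can be used to narrow it down, e.g. to show only the profiling histograms:
curl -s http://localhost:1234/metrics | grep "^roq_profile"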
For example, profiling information
# TYPE roq_profile histogram
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="500"} 0
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="1000"} 0
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="2000"} 0
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="5000"} 795
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="10000"} 8471
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="20000"} 8884
roq_profile_bucket{source="deribit", connection="ws", function="parse", le="+Inf"} 8895
roq_profile_sum{source="deribit", connection="ws", function="parse"} 6.13741e+07
roq_profile_count{source="deribit", connection="ws", function="parse"} 8895
This collection represents a histogram of all measurements since the gateway started. Each bucket has a total count of observations less than or equal to the given number of nanoseconds, starting with 500 and ending with infinity. The sum is the total number of nanoseconds spent in the function. The count is the total number of times the function has been called.
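As a quick sanity check, the average time per call can be computed directly from the sample above (sum divided by count):
echo "61374100 / 8895" | bc -l
# ~6900, i.e. roughly 6.9 microseconds per call to parse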
Prometheus allows you to capture a time-series of these metrics and then compute incremental statistics.
For example, this computes the average processing time per call, based on the most recent samples within a 1 minute window
irate(roq_profile_sum[1m]) / on (source, connection, function)
irate(roq_profile_count[1m])
And this would be a conditional distribution: the fraction of events where the processing time exceeds 5 microseconds
1 - irate(roq_profile_bucket{le="5000"}[1m]) / on (source, connection, function)
irate(roq_profile_count[1m])
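The same queries can also be issued directly against the Prometheus HTTP API, which is convenient for scripting. A sketch, assuming Prometheus runs on this host at its default port 9090:
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=irate(roq_profile_sum[1m]) / on (source, connection, function) irate(roq_profile_count[1m])'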
Function Profiling¶
The following charts are lifted straight from Grafana using the Prometheus queries outlined in the previous section
First the average processing time at different measurement points
Then the conditional processing time
Internal Ping Latency¶
For this example we run two instances of Client #1.
The first instance (test) runs on processor #4, which is located on the same NUMA node as the gateway.
The second instance (trader) runs on processor #3, which is on a different NUMA node.
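As a sketch, the second instance can be started like the first, only with a different name and CPU affinity (the user name must exist in the gateway config and be permissioned for the subscribed symbols):
roq-cpp-samples-example-4 \
--name "trader" \
--exchange "deribit" \
--symbols ".*" \
--dispatcher_affinity 3 \
~/deribit.sock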
These are average 1-way heartbeat ping latencies between the gateway and the clients
As expected, inter-process latencies are worse for the second instance.
Internal Round Trip Latency¶
The roq-test program is used to test order management.
It waits, creates an order, waits again, cancels the order, and finally terminates when the order has been confirmed as canceled.
I0920 09:39:43.255401 107190 application.cpp:55] ===== START =====
I0920 09:39:43.255441 107190 application.cpp:56] Process: name="roq-test", version="0.4.3", type="", git="", date="Sep 16 2020", time="06:02:37"
I0920 09:39:43.255546 107190 service.cpp:39] The metrics service will *not* be started
I0920 09:39:43.256109 107190 controller.cpp:108] Dispatching...
I0920 09:39:43.256121 107190 controller.cpp:112] Starting event loop thread...
I0920 09:39:43.256161 107190 controller.cpp:126] Thread affinity 5
I0920 09:39:43.256253 107191 controller.cpp:148] Event loop thread is now running
I0920 09:39:44.267789 107191 session_manager.cpp:44] Connecting "unix:///var/tmp/roq-deribit.sock"
I0920 09:39:44.273765 107191 session.cpp:38] Adding name="deribit" (user_id=5)
I0920 09:39:44.273853 107190 pollster.cpp:403] Adding name="deribit" (user_id=5)
I0920 09:39:44.273870 107190 strategy.cpp:132] Connected
I0920 09:39:44.273917 107190 strategy.cpp:140] Downloading market data ...
I0920 09:39:44.273921 107190 strategy.cpp:169] Market data is READY
I0920 09:39:44.274311 107190 strategy.cpp:150] download_end={account="", max_order_id=0}
I0920 09:39:44.274314 107190 strategy.cpp:154] Download market data has COMPLETED
I0920 09:39:44.274317 107190 strategy.cpp:143] Downloading account data ...
I0920 09:39:44.274322 107190 strategy.cpp:182] Order manager is READY
I0920 09:39:44.274325 107190 strategy.cpp:150] download_end={account="A1", max_order_id=1000}
I0920 09:39:44.274327 107190 strategy.cpp:157] Download account data has COMPLETED
I0920 09:39:44.274328 107190 strategy.cpp:274] *** INSTRUMENT READY ***
I0920 09:39:44.395049 107190 strategy.cpp:261] *** READY TO TRADE ***
I0920 09:39:44.395222 107190 strategy.cpp:56] create_order={account="A1", order_id=1001, exchange="deribit", symbol="BTC-PERPETUAL", side=BUY, quantity=1.0, order_type=LIMIT, price=10959.5, time_in_force=GTC, position_effect=UNDEFINED, execution_instruction=UNDEFINED, stop_price=nan, max_show_quantity=nan, order_template=""}
I0920 09:39:44.395259 107190 strategy.cpp:225] order_ack={account="A1", order_id=1001, type=CREATE_ORDER, origin=GATEWAY, status=FORWARDED, error=UNDEFINED, text="", gateway_order_id=10000001, external_order_id="", request_id="roq-1600592255-15"}
I0920 09:39:44.425197 107190 strategy.cpp:225] order_ack={account="A1", order_id=1001, type=CREATE_ORDER, origin=EXCHANGE, status=ACCEPTED, error=UNDEFINED, text="success", gateway_order_id=10000001, external_order_id="4504419316", request_id="roq-1600592255-15"}
I0920 09:39:44.425207 107190 strategy.cpp:233] order_update={account="A1", order_id=1001, exchange="deribit", symbol="BTC-PERPETUAL", status=WORKING, side=BUY, price=10959.5, remaining_quantity=1.0, traded_quantity=0.0, position_effect=UNDEFINED, order_template="", create_time_utc=1600594784439000000ns, update_time_utc=1600594784439000000ns, gateway_order_id=10000001, external_order_id="4504419316"}
I0920 09:40:14.425761 107190 strategy.cpp:89] cancel_order={account="A1", order_id=1001}
I0920 09:40:14.425814 107190 strategy.cpp:225] order_ack={account="A1", order_id=1001, type=CANCEL_ORDER, origin=GATEWAY, status=FORWARDED, error=UNDEFINED, text="", gateway_order_id=10000001, external_order_id="4504419316", request_id="roq-1600592255-16"}
I0920 09:40:14.447198 107190 strategy.cpp:225] order_ack={account="A1", order_id=1001, type=CANCEL_ORDER, origin=EXCHANGE, status=ACCEPTED, error=UNDEFINED, text="canceled", gateway_order_id=10000001, external_order_id="4504419316", request_id="roq-1600592255-16"}
I0920 09:40:14.447211 107190 strategy.cpp:233] order_update={account="A1", order_id=1001, exchange="deribit", symbol="BTC-PERPETUAL", status=CANCELED, side=BUY, price=10959.5, remaining_quantity=1.0, traded_quantity=0.0, position_effect=UNDEFINED, order_template="", create_time_utc=1600594784439000000ns, update_time_utc=1600594784439000000ns, gateway_order_id=10000001, external_order_id="4504419316"}
I0920 09:40:14.447216 107190 strategy.cpp:104] *** FINISHED ***
W0920 09:40:14.447790 107191 controller.cpp:162] Signal 15 (Terminated)
I0920 09:40:14.447805 107191 controller.cpp:158] Event loop thread has terminated
I0920 09:40:14.447805 107190 controller.cpp:118] Waiting for event loop thread to terminate...
I0920 09:40:14.447855 107190 controller.cpp:121] Done!
I0920 09:40:14.448083 107190 application.cpp:69] ===== STOP =====
The gateway records round-trip latency metrics for each request.
For this particular setup we have an average round-trip latency of around 12 microseconds.
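The underlying numbers are also exposed through the gateway's metrics endpoint; the exact metric names may vary between versions, so a simple filter is the easiest way to locate them:
curl -s http://localhost:1234/metrics | grep -i latency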
External Ping Latency¶
The gateway also sends regular ping messages to the exchange
The main point of showing the external latency is that the software itself should be fast enough for most purposes.
In this particular case, the contributing factors are, in decreasing order of magnitude
Physical distance between server (Switzerland) and exchange (London)
Variability introduced when network packets are routed through a standard ISP
Standard network switch equipment
Standard Linux network stack (i.e. no kernel-bypass solution)
External factors will most likely dominate the total latency.
However, you may benefit from a low-latency setup if you have the option to co-locate with the exchange.
Another way to look at it: if you can rely on the software being performant enough, that is one less thing to worry about.
Conclusion¶
We have shown how to configure a server, how to install and configure the software, and how to obtain relevant metrics for host latencies. Examples were given using the command line, Prometheus, and Grafana.
Note that this test used an AMD EPYC embedded processor based on the first-generation Zen architecture. With a proper server solution you should easily be able to achieve single-digit microsecond host round-trip latencies.
If you have any comments, feel free to share or contact us using the links below.