Tuning DNS for TCP queries

This page summarizes options to tune DNS servers to handle TCP queries.

Goals

Our goal is to allow authoritative DNS servers to support many concurrent TCP connections. We support TCP for large responses and zone transfers. We are also exploring support for DNS-over-TLS using TCP port 853.

These notes represent our current understanding. We will update these pages as we learn more.

TCP performance for web servers has been studied carefully, and some of that guidance also applies to DNS-over-TCP.

Server Software

Bind

Bind-9.16 has reasonable TCP performance.

We are looking at bind-9.16 parameter tuning.

In progress: We have seen occasional problems with bind locking up with all file descriptors in use, even with the kernel tuning described below.

Some configuration is needed in named.conf to increase limits for a large number of simultaneous connections. Actual values will vary depending on the OS kernel limits used.

options {

  # processing out of order in parallel is faster
  keep-response-order { none; };

  # raise limit on max files
  files 6500000;

  # use listen queue length from OS
  tcp-listen-queue 0;

  # allow lots of clients (default is just 150)
  tcp-clients 6500000;

  # have 16k sockets pre-reserved
  reserved-sockets 16384;
}
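
When debugging the lock-ups with all file descriptors in use, it can help to watch how many descriptors named actually has open relative to the configured tcp-clients limit. A minimal sketch, assuming pidof finds a single named process and rndc is configured:

  # count file descriptors currently open by named
  ls /proc/$(pidof named)/fd | wc -l

  # rndc status reports current vs. configured TCP client counts
  rndc status | grep -i tcp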

Nsd

We have not yet looked at nsd TCP performance.

Knot

Knot-3.1.8 has reasonable TCP performance.

General Kernel Tuning

The following sections describe various studies and recommendations about how to tune linux kernel parameters to increase TCP throughput for DNS services.

Advice from Liang Zhu et al.

These values were used while testing DNS-over-TLS performance in the paper [1]:

  • Liang Zhu and John Heidemann. 2018. LDplayer: DNS Experimentation at Scale. In Proceedings of the ACM Internet Measurement Conference (Boston, Massachusetts, USA, Oct. 2018).

net.ipv4.tcp_max_syn_backlog=8388600
net.core.netdev_max_backlog=8388600

# somaxconn was limited to 16 bits in kernels before 4.1
net.core.somaxconn=65535

# expand port ranges
net.ipv4.ip_local_port_range=2048 65500

# increase system limit on open files
fs.file-max = 6500000

# increase process limit on open files
fs.nr_open  = 6500000
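
One way to apply these sysctl values persistently is to place them in a sysctl.d fragment and reload; the file name below is our choice, not from the paper:

  # values above saved in /etc/sysctl.d/90-dns-tcp.conf (name is arbitrary)
  sysctl --system

  # spot-check a couple of the results
  sysctl net.core.somaxconn fs.file-max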

Once the open file limit is increased, the ulimits also need to be increased. This can be done by creating /etc/security/limits.d/increase-nofile.conf with these lines (including the ‘*’):

* soft nofile 6500000
* hard nofile 6500000
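
It is worth confirming that the raised limit is actually in effect, both for a new login shell and for the running daemon (here assumed to be BIND's named):

  # limit seen by a new login shell
  ulimit -n

  # limit in effect for an already-running named process
  grep "open files" /proc/$(pidof named)/limits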

Note that processes started by systemd may not check or honor the security limits set in /etc/security/limits.d/. You can create a system-wide or per-service drop-in file to increase file limits. For example, for knot.service:

# mkdir /etc/systemd/system/knot.service.d/
# cat >/etc/systemd/system/knot.service.d/filelimit.conf << EOF
[Service]
# value is SOFTLIMIT[:HARDLIMIT]
LimitNOFILE=6500000:8388600
EOF
# systemctl daemon-reload
# service knot restart

To increase limits for all services:

# mkdir /etc/systemd/system.conf.d/
# cat >/etc/systemd/system.conf.d/filelimit.conf << EOF
[Manager]
# value is SOFTLIMIT[:HARDLIMIT]
DefaultLimitNOFILE=6500000:8388600
EOF

For the system-level change to take effect, the system must be rebooted (just running systemctl daemon-reload does not work).
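
After the restart or reboot, one can confirm that systemd actually applied the new limits, for example:

  # limit configured for a specific unit
  systemctl show knot.service --property=LimitNOFILE

  # manager-wide default applied to all services
  systemctl show --property=DefaultLimitNOFILE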

Advice from ESnet

10G

Quoting https://fasterdata.es.net/host-tuning/linux/:

For a host with a 10G NIC, optimized for network paths up to 100ms RTT, and for friendliness to single and parallel stream tools:

  # receive/send (rmem/wmem) buffer sizes; allow buffers up to 64MB
  net.core.rmem_max = 67108864
  net.core.wmem_max = 67108864
  net.core.rmem_default = 67108864
  net.core.wmem_default = 67108864

  # increase Linux autotuning TCP buffer limit to 32MB
  net.ipv4.tcp_rmem = 4096 87380 33554432
  net.ipv4.tcp_wmem = 4096 65536 33554432

  # recommended default congestion control is htcp
  net.ipv4.tcp_congestion_control=htcp

  # recommended for hosts with jumbo frames enabled
  #net.ipv4.tcp_mtu_probing=1

  # recommended to enable 'fair queueing'
  net.core.default_qdisc = fq

We also strongly recommend reducing the maximum flow rate to avoid bursts of packets that could overflow switch and receive host buffers.

For example for a 10G host, add this to a boot script:

  /sbin/tc qdisc add dev ethN root fq maxrate 8gbit
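
To confirm that the pacing qdisc took effect (ethN here is a placeholder for the actual interface name):

  # show the configured qdisc and its maxrate, plus statistics
  tc -s qdisc show dev ethN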

Configuration from another authoritative server operator

# Increase the maximum amount of option memory buffers
net.core.optmem_max = 10485760

# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.udp_mem = 747744 996992 1495488
net.ipv4.tcp_mem = 8388608 8388608 8388608

# Decrease the default value of tcp_fin_timeout:
# the time an orphaned (unreferenced) connection will wait before being aborted.
# Some say not lower than 30; others say 5-10 for a busy server; default is 60s.
net.ipv4.tcp_fin_timeout = 15

# prefer lower latency as opposed to higher throughput.
# removed in v4.13
net.ipv4.tcp_low_latency = 1

# Increase tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 2000000

# don't cache metrics for closed connections
net.ipv4.tcp_no_metrics_save = 1

# see tcp_retries2 for details
# default 8; less means timeout sooner
net.ipv4.tcp_orphan_retries = 1

# disable selective acknowledgement (SACK)
net.ipv4.tcp_sack = 0

# Number of times SYN-ACKs are retransmitted for a passive TCP connection
# default (CentOS 7) is 5; less means timeout sooner
net.ipv4.tcp_synack_retries = 1

Values lower/different than ESnet or Liang values

#net.core.netdev_max_backlog = 300000
#net.ipv4.tcp_max_syn_backlog = 65536
#net.core.rmem_max = 10485760
#net.core.wmem_max = 10485760
#net.core.somaxconn = 32768
#net.ipv4.ip_local_port_range = 1024 65535
#net.ipv4.tcp_wmem = 1024 4096 1048576
#net.ipv4.tcp_rmem = 1024 4096 1048576

# not recommended:
# net.ipv4.tcp_tw_reuse = 1

# in 4.12 and later kernels, timestamps are randomized and leaving this at 1 is recommended
# net.ipv4.tcp_timestamps = 0
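
Before adopting any of the values in this section, it can be useful to record the current settings so they can be compared against or rolled back; for example (the parameter list is just a sample of those above):

  # dump current values of a few of the parameters discussed above
  sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_sack net.ipv4.tcp_max_tw_buckets

  # or save all TCP-related settings for later comparison
  sysctl -a 2>/dev/null | grep '^net\.ipv4\.tcp' > tcp-sysctls.before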

Programmatic evaluation of linux tuning parameters using DIINER

In 2023, Spencer Stingley, a USC Master's student, ran a series of tests of various Linux kernel parameters using the DIINER testbed. His results, summarized below, do not agree with the importance given to some parameters in the recommendations above. At this point it is unclear whether these differences come from differences in testing methodology, in the importance of the configuration settings themselves, in the hardware and software where the tests were run, or from repeated dnsperf results varying too widely to be used for fine-grained tuning. Note that this work has not been through a peer-review process, unlike Liang's paper above. His conclusions from these tests follow.

The following tuning results were found using at least twenty 5-minute trials of dnsperf, with each parameter changed individually and measured at roughly 10% and 100% of server load capacity. The server and client were connected over a single 10G NIC on the DIINER test platform. To establish statistical significance, Dunn's and Tukey's post hoc tests were used to identify significant parameters: Dunn's test finds significant parameters with respect to the baseline, and Tukey's test finds significance within the entire group without needing a baseline.

The baseline system these results are compared against uses the default parameter values found in the standard installation of CentOS Core 7.
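
For reference, a TCP load-test run along these lines might look like the following. The server address, query file, and exact flags are illustrative assumptions (and dnsperf flag names vary between versions), not taken from the thesis work:

  # 5-minute dnsperf run over TCP against a test server (dnsperf 2.x flags)
  dnsperf -s 192.0.2.1 -p 53 -m tcp -d queries.txt -c 100 -l 300 -S 10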

Improvements from net.ipv4.tcp_abort_on_overflow = 1

A noticeable improvement at the 100% load setting was found by changing tcp_abort_on_overflow:

net.ipv4.tcp_abort_on_overflow = 1

This improved latency by ~5% and queries per second by ~8%, and reduced the number of packets lost by ~98%. This parameter had no noticeable effect under other stress conditions.
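
Whether this setting matters on a given server depends on whether the listen queue actually overflows; the kernel's cumulative counters show this (counter names can differ slightly across kernel versions):

  # listen-queue overflow and drop counters since boot
  nstat -az TcpExtListenOverflows TcpExtListenDrops

  # or the equivalent summary from netstat
  netstat -s | grep -i listen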

Improvements from net.ipv4.tcp_sack = 1

Changing tcp_sack also gave a statistically significant improvement in average latency under 100% load (a roughly 1.5% decrease):

net.ipv4.tcp_sack = 1

It also showed visible improvements in queries per second in the plots, though this improvement was not statistically validated.

Negative consequences from tcp_tso_win_divisor = 1

Setting tcp_tso_win_divisor = 1 had statistically validated and strongly negative consequences. It is NOT recommended.

Other linux kernel parameters of note

  • Setting small (e.g., 550 B) rmem/wmem buffers seems to have no impact on system performance.

  • Increasing the tcp_wmem/tcp_rmem buffers beyond the 32MB recommendation from ESnet above offered no improvement over the baseline.

  • Under all loads, changing the queuing discipline to sfq, codel, fq_codel, or fq did not have any impact on system performance.

  • In agreement with ESnet, increasing optmem_max did not seem to provide any noticeable improvement.

  • Increasing fs.file-max (past the 524288 baseline) seemed to have no noticeable effect on TCP performance.

  • Changing the frequency scaling governor of all processors to performance did not show any improvement (the sketch after this list shows how that change is typically made).
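
For completeness, the governor change mentioned in the last bullet is typically made like this; the exact commands are our illustration (tool availability and sysfs paths vary by distribution), not a record of the test setup:

  # set all CPUs to the 'performance' governor via cpupower, if installed
  cpupower frequency-set -g performance

  # or write the governor directly through sysfs
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done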

Conflicting results with the prior advice

The parameter sets from ‘Advice from Liang Zhu et al.’ and ‘Advice from ESnet’ above showed visible improvements in connection latency in the scatter plots compared to all of the other tested parameters. However, these two parameter groups and the baseline had significantly higher variance than the other parameters, so while the means of the repeated tests may appear to improve, the variance was high enough that the improvements could not be statistically validated.

The parameter sets from ‘Configuration from another authoritative server operator’ and ‘Values lower/different than ESnet or Liang values’ did show noticeable visual improvements in latency and queries per second. They did not register statistically significant improvements, but their data very closely matched the net.ipv4.tcp_sack results, which did have verifiable improvements as mentioned above.

The code used to run these trials and do the data analysis can be found at: https://github.com/BlankCanvasStudio/DNScalc

IRQ binding

There is also a lot of good information on IRQ balancing at https://fasterdata.es.net/host-tuning/linux/100g-tuning/interrupt-binding/.
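
As a rough illustration of manual interrupt binding (the interface name, IRQ number, and CPU mask below are placeholders; irqbalance usually has to be stopped for manual pinning to stick):

  # stop the automatic balancer so manual affinity settings persist
  systemctl stop irqbalance

  # list the IRQs belonging to the NIC (ethN is a placeholder)
  grep ethN /proc/interrupts

  # pin IRQ 123 to CPU 2 (bitmask 0x4); repeat for each NIC queue IRQ
  echo 4 > /proc/irq/123/smp_affinity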

Resources