Bind Misconfiguration/Bug creates too many DNSKEY queries
1 Summary of results
At least some versions of ISC's bind recursive resolver, when configured with certain options and the old DNSKEY as the trust anchor, occasionally trigger significant numbers of queries for the root zone's DNSKEY set. This is occurring today, after the old DNSKEY (KSK2010) has been revoked and removed from the active DNSKEY list. I believe it, or something similar, was also occurring during the period that the KSK2010 key was advertised in the zone as a revoked DNSKEY. It happens only sporadically, which makes reproducing the bug challenging.
To trigger the issue, I created an experiment that repeatedly starts and stops bind and sends it a consistent set of requests. See further below for detailed setup instructions.
1.1 Graphing the number of queries sent per experiment
Each bar in the plot below shows the number of queries for the root ('.') DNSKEY sent during one experiment. The quantity varies widely, with the large peaks being the problem this page documents. The graph also shows why reproducing the bug is challenging: in 20 experiments it showed up only 3 times (15% of the time).
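As a rough illustration of how per-experiment totals like these could be derived, here is a minimal sketch that counts root DNSKEY queries in tcpdump's text output and reports the fraction of runs above a threshold. The line format in the comment, the function names, and the threshold are my assumptions for illustration, not the tooling actually used for the plot.

```python
import re
from typing import Iterable, Sequence

# Assumed tcpdump text format for an outgoing DNS query, for example:
#   12:00:01.123456 IP 192.0.2.10.40000 > 198.41.0.4.53: 4660 [1au] DNSKEY? . (28)
# The pattern matches a DNSKEY question whose name is exactly the root ('.').
ROOT_DNSKEY_QUERY = re.compile(r"\bDNSKEY\? \. ")


def count_root_dnskey_queries(lines: Iterable[str]) -> int:
    """Count the lines that look like a query for the root DNSKEY set."""
    return sum(1 for line in lines if ROOT_DNSKEY_QUERY.search(line))


def fraction_of_bad_runs(counts: Sequence[int], threshold: int) -> float:
    """Fraction of experiments whose query count exceeds a (hypothetical) threshold."""
    if not counts:
        return 0.0
    return sum(1 for count in counts if count > threshold) / len(counts)
```

Feeding it one capture per experiment (for instance the output of `tcpdump -n -r run.pcap`) gives one number per bar; the threshold separating "normal" from "peak" runs has to be chosen by eye from the data.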
1.2 Graphing the requests leaving bind over time
Each point in the following graph represents the number of queries for the root ('.') DNSKEY sent in a given second. Each experiment run is shown using a different symbol/color. Note that some experiments show DNSKEY queries only at the beginning, while others generate many more DNSKEY queries over time. In some runs these queries eventually stop; in others they do not. Note also the periodicity, which I originally thought to be 60s but which actually seems shorter when measured (the vertical dotted lines mark 60s intervals).
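The per-second points could be produced along the following lines. This is only a sketch: it assumes tcpdump's default text output with an `HH:MM:SS.micro` timestamp at the start of each line, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Same assumed tcpdump text format as a line such as:
#   12:00:01.123456 IP 192.0.2.10.40000 > 198.41.0.4.53: 4660 [1au] DNSKEY? . (28)
ROOT_DNSKEY_QUERY = re.compile(r"\bDNSKEY\? \. ")
TIMESTAMP = re.compile(r"^(\d\d):(\d\d):(\d\d)\.")


def queries_per_second(lines: Iterable[str]) -> Counter:
    """Map 'seconds since the first root DNSKEY query' to the number of queries.

    Captures that cross midnight are not handled in this sketch.
    """
    seconds = []
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if stamp and ROOT_DNSKEY_QUERY.search(line):
            hours, minutes, secs = (int(part) for part in stamp.groups())
            seconds.append(hours * 3600 + minutes * 60 + secs)
    if not seconds:
        return Counter()
    first = min(seconds)
    return Counter(second - first for second in seconds)
```

Plotting one such counter per experiment, each with its own symbol, gives a graph of the kind described above.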
2 Experimental Setup
To reproduce these results, we take a freshly compiled and installed copy of bind 9.11.5-P4, and configure it with specific settings:
./configure --prefix=/usr/local/bind-9.11.5-P4
2.1 Bind config file
The bind configuration file I used, saved as /usr/local/bind-9.11.5-P4/etc/named.conf:
options {
    listen-on port 53 { 127.0.0.1; };
    listen-on-v6 port 53 { ::1; };
    directory "/usr/local/bind-9.11.5-P4/var/named";
    dump-file "/usr/local/bind-9.11.5-P4/var/named/data/cache_dump.db";
    statistics-file "/usr/local/bind-9.11.5-P4/var/named/data/named_stats.txt";
    memstatistics-file "/usr/local/bind-9.11.5-P4/var/named/data/named_mem_stats.txt";
    secroots-file "/usr/local/bind-9.11.5-P4/var/named/data/named.secroots";
    recursing-file "/usr/local/bind-9.11.5-P4/var/named/data/named.recursing";
    allow-query { localhost; };
    recursion yes;
    dnssec-enable no;
    //dnssec-validation yes;
    managed-keys-directory "/usr/local/bind-9.11.5-P4/var/named/dynamic";
    pid-file "/run/named/named.pid";
    session-keyfile "/run/named/session.key";
};

zone "." IN {
    type hint;
    file "named.ca";
};

include "/usr/local/bind-9.11.5-P4/etc/bind.keys";
2.2 Out of date bind.keys file
Packaging software on Linux and other systems often refuses to overwrite a configuration file that already exists. Assuming this may have happened on a system, we replace the newly installed bind.keys file with an old version that contains only the DNSKEY-2010 key (comments removed for brevity):
managed-keys {
    . initial-key 257 3 8 "AwEAAaz/tAm8yTn4Mfeh5eyI96WSVexTBAvkMgJzkKTOiW1vkIbzxeF3
        +/4RgWOq7HrxRixHlFlExOLAJr5emLvN7SWXgnLh4+B5xQlNVz8Og8kv
        ArMtNROxVQuCaSnIDdD5LKyWbRd2n9WGe2R8PzgCmr3EgVLrjyBxWezF
        0jLHwVN8efS3rCj/EWgvIWgb9tarpVUDK/b58Da+sqqls3eNbuv7pr+e
        oZG+SrDK6nWeL3c6H5Apxz7LjVc1uTIdsIXxuOLYA4/ilBmSVIzuDWfd
        RUfhHdY6+cn8HFRm+2hM8AnXGXws9555KrUB5qihylGa8subX2Nn6UwN
        R1AkUTV74bU=";
};
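When swapping trust-anchor files by hand it is easy to paste the wrong key, so it may be worth confirming which key a file actually holds. The sketch below computes a DNSKEY key tag using the checksum from RFC 4034, Appendix B (valid for algorithms other than 1); the result can be compared with the "key id" comment that bind writes into its managed-keys file. The function name and argument layout are my own choices for illustration.

```python
import base64
import struct


def key_tag(flags: int, protocol: int, algorithm: int, key_base64: str) -> int:
    """Key tag of a DNSKEY record, per the RFC 4034 Appendix B checksum.

    key_base64 may contain the spaces or newlines found in bind.keys;
    they are removed before decoding.
    """
    public_key = base64.b64decode("".join(key_base64.split()))
    # DNSKEY RDATA: 2-byte flags, 1-byte protocol, 1-byte algorithm, then the key.
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    accumulator = 0
    for index, byte in enumerate(rdata):
        # Even offsets are the high byte of a 16-bit word, odd offsets the low byte.
        accumulator += byte << 8 if index % 2 == 0 else byte
    accumulator += (accumulator >> 16) & 0xFFFF
    return accumulator & 0xFFFF
```

A call would look like `key_tag(257, 3, 8, key_text)`, where `key_text` is the quoted base64 string from the `initial-key` statement.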
2.3 Out of date managed keys file
Similarly, we replace the /usr/local/bind-9.11.5-P4/var/named/dynamic/managed-keys.bind file with just the old key (taken from a system that had never updated it using RFC 5011 processing):
$ORIGIN .
$TTL 0  ; 0 seconds
@       IN SOA  . . (
                4          ; serial
                0          ; refresh (0 seconds)
                0          ; retry (0 seconds)
                0          ; expire (0 seconds)
                0          ; minimum (0 seconds)
                )
        KEYDATA 20150704162113 20150703162113 19700101000000 257 3 8 (
                AwEAAagAIKlVZrpC6Ia7gEzahOR+9W29euxhJhVVLOyQ
                bSEW0O8gcCjFFVQUTf6v58fLjwBd0YI0EzrAcQqBGCzh
                /RStIoO8g0NfnfL2MTJRkxoXbfDaUeVPQuYEhg37NZWA
                JQ9VnMVDxP/VHL496M/QZxkjf5/Efucp2gaDX6RS6CXp
                oY68LsvPVjR0ZSwzz1apAzvN9dlzEheX7ICJBBtuA6G3
                LQpzW5hOA2hzCTMjJPJ8LbqF6dsV6DoBQzgul0sGIcGO
                Yl7OyQdXfZ57relSQageu+ipAdTTJ25AsRTAoub8ONGc
                LmqrAmRLKBP1dfwhYB4N7knNnulqQxA+Uk1ihz0=
                ) ; KSK; alg = RSASHA256; key id = 19036
                ; next refresh: Sat, 04 Jul 2015 16:21:13 GMT
                ; trusted since: Fri, 03 Jul 2015 16:21:13 GMT
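The three numbers at the start of the KEYDATA record are timestamps. The first two line up with the "next refresh" and "trusted since" comments in the file; reading the third as a removal timer that is unset (the Unix epoch) is my assumption about the format. A small sketch for decoding them, with hypothetical field names:

```python
from datetime import datetime, timezone
from typing import Dict


def parse_keydata_times(keydata_fields: str) -> Dict[str, datetime]:
    """Decode the three leading YYYYMMDDHHMMSS timestamps of a KEYDATA record.

    The dictionary keys are illustrative names, not bind's own terminology;
    'remove_timer' in particular is an assumed interpretation.
    """
    first, second, third = keydata_fields.split()[:3]

    def to_utc(stamp: str) -> datetime:
        return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

    return {
        "next_refresh": to_utc(first),
        "trusted_since": to_utc(second),
        "remove_timer": to_utc(third),
    }
```

For the record above, `next_refresh` is in July 2015, so a resolver started with this file today sees a refresh that is long overdue.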
3 Experimental procedure
To trigger the bug we perform the following task repeatedly:
- start bind
- start tcpdump
- Every 30 seconds:
  - dig @localhost example.com
  - sleep 1
  - dig @localhost example.org
  - sleep 1
  - … repeat 7 times total
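The procedure above could be automated roughly as follows. This is a sketch under several assumptions: that named was installed under the --prefix shown earlier (so the binary is in sbin), that the script runs with enough privilege to bind port 53 and capture packets, that capturing on the Linux "any" pseudo-interface is acceptable, and that "repeat 7 times" means seven 30-second rounds. Only the two domains named in the list are used; the function names are hypothetical.

```python
import subprocess
import time
from typing import List, Sequence

PREFIX = "/usr/local/bind-9.11.5-P4"  # install prefix from the configure step


def round_commands(domains: Sequence[str], server: str = "localhost") -> List[List[str]]:
    """The dig invocations for one round, in the order they are issued."""
    return [["dig", f"@{server}", domain] for domain in domains]


def run_experiment(pcap_path: str,
                   domains: Sequence[str] = ("example.com", "example.org"),
                   rounds: int = 7,        # assumed reading of "repeat 7 times total"
                   period: float = 30.0,   # seconds between the starts of rounds
                   pause: float = 1.0) -> None:
    """Start named and tcpdump, issue the queries, then stop both processes."""
    named = subprocess.Popen(
        [f"{PREFIX}/sbin/named", "-g", "-c", f"{PREFIX}/etc/named.conf"],
        stderr=subprocess.DEVNULL,  # -g logs to stderr; discarded in this sketch
    )
    capture = subprocess.Popen(
        ["tcpdump", "-n", "-i", "any", "-w", pcap_path, "port", "53"]
    )
    try:
        for _ in range(rounds):
            started = time.monotonic()
            for command in round_commands(domains):
                subprocess.run(command, stdout=subprocess.DEVNULL, check=False)
                time.sleep(pause)
            # Wait out the remainder of the period before the next round.
            time.sleep(max(0.0, period - (time.monotonic() - started)))
    finally:
        for process in (capture, named):
            process.terminate()
            process.wait()
```

Calling `run_experiment("run01.pcap")` once per experiment, with the two out-of-date key files restored before each run, yields one capture per bar in the first graph.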
The results of this experiment, and an analysis of the resulting measurements, are shown at the top of this page.
4 Follow-on questions
- Does the bug only happen when the bind.keys file is included, or is it enough to have just the (old) trust-anchor listed in the managed keys set?
- Does it matter what the state of the managed keys set is? Maybe it's just the bind.keys file that triggers the problem?
- Which of the above named.conf settings are actually necessary to trigger the issue?
- What other versions of bind exhibit this behavior?
- Is this the same bug that was seen during the period where the DNSKEY set was marked as revoked? If so, are we about to see an increasing trend again? If different, how do we find and fix the other bug?
- What's the periodicity of the results?
- What causes the queries to end after a while for some running instances and not others?
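For the periodicity question, one simple starting point is to measure the gaps between the starts of query bursts. The sketch below takes a mapping from second offset to query count for a single run and returns the median gap; the burst definition and the `min_queries` parameter are assumptions for illustration, not a claim about what the resolver is doing internally.

```python
from statistics import median
from typing import List, Mapping, Optional


def burst_starts(per_second: Mapping[int, int], min_queries: int = 1) -> List[int]:
    """Seconds where a burst begins: busy now, quiet in the previous second."""
    return [
        second
        for second in sorted(per_second)
        if per_second[second] >= min_queries
        and per_second.get(second - 1, 0) < min_queries
    ]


def estimate_period(per_second: Mapping[int, int], min_queries: int = 1) -> Optional[float]:
    """Median gap in seconds between consecutive burst starts, or None if under two bursts."""
    starts = burst_starts(per_second, min_queries)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return median(gaps) if gaps else None
```

Comparing the estimate across runs would show whether the interval is stable and how far it falls short of 60 seconds.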