<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

Hello,<br>

<br>

W.C.A. Wijngaards wrote:<br>

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">

  <pre wrap="">Are those replies from authority servers? That arrive just after unbound

times out and closes the socket?

  </pre>

</blockquote>

I'm not sure I understand this correctly. What we do is the following:<br>

- we have some heavily loaded (a few thousand queries per second)

recursive nameservers (behind a load balancer), for which the clients

report occasionally lost answers<br>

- we run a test program against the servers: it sends a configurable

number of queries for the same name (so no, authority servers should not

be involved: the answers should come from the cache) and waits 5 seconds for the answers<br>

- we sometimes find timeouts on the client<br>

- investigating this yields a capture (made on the nameservers), which

has the client's request (so there is no packet loss involved between

the client and the server), but no answer from the server<br>

<br>
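A minimal sketch of such a test program, for illustration only (this is not the actual tool). The names build_query and probe, the A/IN query type and the use of sequential DNS IDs are assumptions:<br>

```python
import socket
import struct
import time


def build_query(name: str, qid: int) -> bytes:
    """Build a minimal DNS query (type A, class IN) with the RD flag set."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, name: str, count: int, wait: float = 5.0, port: int = 53):
    """Send `count` queries for `name` and return the IDs left unanswered.

    IDs are simply 1..count, so count must stay below 65536.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pending = set()
    try:
        for qid in range(1, count + 1):
            sock.sendto(build_query(name, qid), (server, port))
            pending.add(qid)
        deadline = time.monotonic() + wait
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data, _ = sock.recvfrom(4096)
            except socket.timeout:
                break
            if len(data) >= 12:
                # The first two bytes of a DNS message are its ID.
                pending.discard(struct.unpack("!H", data[:2])[0])
    finally:
        sock.close()
    return sorted(pending)
```

Calling probe with a resolver address, a name and a count returns the list of query IDs that timed out, which can then be looked up in the captures.<br>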

The cache is quite big (the machines have 8 GiB of RAM) and the TTL is

high, so all answers should come from it.<br>

<br>

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">

  <pre wrap="">Some sort of selective verbose logging is an idea on my TODO.

  </pre>

</blockquote>

Great, it would ease debugging issues like this.<br>

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">

  <pre wrap="">

Unbound will indeed not respond to particular queries.  These queries

end up getting counted as 'cache hits', but really they were malformed.

 Some malformed queries unbound does not reply to - such as queries with

QR=1 flag, or shorter than 12 byte queries.  Since you are sending them

yourself, it seems unlikely they are this malformed.

  </pre>

</blockquote>

The queries should be the same. Or at least, I haven't seen any

differences between the "good" and the "bad" ones in wireshark.
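<br>
To rule this out mechanically rather than by eye, the two conditions named in the quoted reply (QR=1, or shorter than 12 bytes) can be checked on the raw DNS payload of each captured query. This is only a sketch: the function name is hypothetical, and it covers just those two examples, not every case in which unbound may stay silent:<br>

```python
import struct

DNS_HEADER_LEN = 12  # fixed size of the DNS header in bytes
QR_MASK = 0x8000     # top bit of the flags word: 0 = query, 1 = response


def malformed_reason(payload: bytes):
    """Return why a query payload looks malformed, or None if it passes."""
    if len(payload) < DNS_HEADER_LEN:
        return "shorter than 12 bytes"
    (flags,) = struct.unpack("!H", payload[2:4])
    if flags & QR_MASK:
        return "QR=1 flag set"
    return None
```

The payload here is the UDP data of the captured packet, without the IP and UDP headers.<br>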

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">

  <blockquote id="StationeryCiteGenerated_1" type="cite">

    <pre wrap="">

With tcpdump it seems that the machine gets the query, but there is no

answer from unbound.

Its statistics counters seem OK: according to them there is no full

queue and there are no drops.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Is it a query to port 53?  Queries to other ports are not answered.

  </pre>

</blockquote>

Yes.<br>

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">

  <pre wrap="">

Are there 'jostled' queries? They also create dropped replies by

replacing an existing (old) one.

  </pre>

</blockquote>

unbound-control currently tells this:<br>

thread0.requestlist.avg=757.391<br>

thread0.requestlist.max=867<br>

thread0.requestlist.overwritten=0<br>

thread0.requestlist.exceeded=0<br>

I have graphs of these counters (munin) and I haven't seen a single

overwritten or exceeded value on any of our servers.<br>
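For checking these counters outside of munin, a small sketch that parses the key=value lines shown above and reports any non-zero overwritten or exceeded counter. The function names are hypothetical and the input is assumed to be the plain text output as pasted here:<br>

```python
def parse_stats(text: str) -> dict:
    """Parse 'key=value' lines into a dict of floats, skipping other lines."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue
        try:
            stats[key] = float(value)
        except ValueError:
            continue  # ignore values that are not numeric
    return stats


def jostle_counters(stats: dict) -> dict:
    """Return the requestlist overwritten/exceeded counters that are non-zero."""
    suffixes = (".requestlist.overwritten", ".requestlist.exceeded")
    return {k: v for k, v in stats.items() if k.endswith(suffixes) and v > 0}
```

An empty result from jostle_counters matches what the graphs show: no jostled or dropped requests.<br>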

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">


  <blockquote id="StationeryCiteGenerated_2" type="cite">

    <pre wrap="">   72146984 dropped due to full socket buffers

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Could this explain the 1 in 100qps-for-3600s that are dropped?  Could

they be dropped at the query-sender (seems unlikely)?

  </pre>

</blockquote>

We monitor the traffic at three points: before and after the load

balancer, and on the DNS servers. The queries for which we got no

answer appear in all three captures, so no, they do reach the server.<br>
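The matching of queries to answers in a capture can be sketched as follows. It assumes the capture has already been decoded into records; the Packet type and its fields are hypothetical, and pairing on client address, client port and DNS ID is a simplification:<br>

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    client: str        # client IP address
    port: int          # client UDP port
    dns_id: int        # DNS message ID
    is_response: bool  # True for server-to-client packets


def unanswered(packets: Iterable[Packet]) -> List[Tuple[str, int, int]]:
    """Return the (client, port, dns_id) keys with more queries than answers."""
    outstanding = Counter()
    for pkt in packets:
        key = (pkt.client, pkt.port, pkt.dns_id)
        if pkt.is_response:
            if outstanding[key] > 0:
                outstanding[key] -= 1
        else:
            outstanding[key] += 1
    return [key for key, n in outstanding.items() if n > 0]
```

Running this on the capture taken on the DNS server lists the queries that arrived there but never got a reply.<br>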

<blockquote id="mid_4A200433_3020607_nlnetlabs_nl"

 cite="mid:4A200433.3020607@nlnetlabs.nl" type="cite">


  <blockquote id="StationeryCiteGenerated_3" type="cite">

    <pre wrap="">I've already tried to raise the related sysctls, without any effects.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

You tried to increase socket buffers already, I presume. Weird.

  </pre>

</blockquote>

Yes, I've tried increasing net.inet.udp.recvspace and kern.ipc.maxsockbuf.<br>

<br>

According to netstat -m, there is no mbuf shortage:<br>

# netstat -m<br>

5148/5097/10245 mbufs in use (current/cache/total)<br>

4080/2470/6550/25600 mbuf clusters in use (current/cache/total/max)<br>

4080/1424 mbuf+clusters out of packet secondary zone in use

(current/cache)<br>

0/0/0/12800 4k (page size) jumbo clusters in use

(current/cache/total/max)<br>

0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)<br>

0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)<br>

9447K/6214K/15661K bytes allocated to network (current/cache/total)<br>

0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)<br>

0/0/0 requests for jumbo clusters denied (4k/9k/16k)<br>

0/0/0 sfbufs in use (current/peak/max)<br>

0 requests for sfbufs denied<br>

0 requests for sfbufs delayed<br>

0 requests for I/O initiated by sendfile<br>

0 calls to protocol drain routines<br>

<br>

We have pf active, but its state table is also monitored and never

reaches the maximum, and the UDP-related timeouts are high:<br>

udp.first                    60s<br>

udp.single                   30s<br>

udp.multiple                 60s<br>

<br>

Thanks,

</body>

</html>