It's not always DNS, unless it is

  • It's amazing how often finding the obvious cause of a problem only mitigates it, and you end up having to solve it 2 or 3 more times in the following weeks.

    In this case, there were NUMEROUS suboptimal DNS configurations or outright misconfigurations, but none of them mattered until the volume reached a tipping point, and suddenly ALL of them came into play. Fixing one overflowed into the next, which overflowed into the next.

  • The real lesson of this incident is how to handle incidents effectively. The author cites practices that I use every time I manage one:

    * Centralize - 1 tracking doc that describes the issue, the timeline, what's been tested, and who owns the incident. Have 1 group chat and 1 'team' (virtual or in person). Get an incident commander to drive the group.

    * Create a list of hypotheses and work through them one at a time.

    * Use data, not assumptions, to prove or disprove your hypotheses.

    * Gather as much data as you can, but don't let one suspicious graph lead you down a rabbit hole. Keep gathering data.

    If you don't do the above, you are guaranteed to have a mess, to repeat yourself over and over, and to waste time.

  • Making it sound like a DNS problem is clickbaity. It is a CoreDNS/kube-dns problem.

    And yes, in the k8s world, DNS fails more often than you think.

  • Sounds more like a Kubernetes problem than a DNS problem.

    I hate CoreDNS. Everything running inside a Kubernetes cluster should just query the Kubernetes Endpoints API for these IPs directly and use the node's DNS servers for external hosts.
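
    Roughly what I mean, as a minimal sketch using client-go (the "payments" service and "default" namespace are placeholders, and a real client would watch the Endpoints object rather than Get it once):

      // endpoints.go: read backend IPs straight from the Kubernetes API
      // instead of resolving the service name through CoreDNS.
      package main

      import (
          "context"
          "fmt"
          "log"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      func main() {
          // In-cluster config authenticates with the pod's service account.
          cfg, err := rest.InClusterConfig()
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          // Hypothetical service name and namespace.
          ep, err := client.CoreV1().Endpoints("default").Get(context.TODO(), "payments", metav1.GetOptions{})
          if err != nil {
              log.Fatal(err)
          }
          for _, subset := range ep.Subsets {
              for _, addr := range subset.Addresses {
                  for _, port := range subset.Ports {
                      fmt.Printf("%s:%d\n", addr.IP, port.Port)
                  }
              }
          }
      }

    The trade-off is that every app then needs RBAC access to the API server plus its own watch/cache logic, which is exactly the complexity the cluster DNS add-on is supposed to hide.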

  • One red flag that stood out for me is where the blog says the team considered all apps to be the same and hadn’t looked at any of their logs, only infrastructure stuff.

    When they looked, they saw that not all apps were the same; only a few kinds of apps were affected.

    When a big incident hits, you need people drilling down, not just across, and hopefully people who know the actual apps in question.

    Maybe this was DevOps people too far into the ops side and not as much on the dev?

  • Had a similar problem at work a while ago. One service was occasionally unable to connect to another. The Splunk logs said it was a TLS connection problem. After an unsuccessful attempt at reproducing the problem locally, it eventually dawned on me that it might be Kubernetes DNS. By temporarily switching to not using DNS to connect to that host, we confirmed that it was indeed Kubernetes DNS.
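
    Roughly, the isolation step looked like this (the service name and ClusterIP below are made up): dial the same backend twice, once through DNS and once by IP, keeping TLS verification pinned to the real hostname.

      // dnsbypass.go: if the by-IP dial is reliable while the by-name dial
      // flakes, the problem is name resolution, not TLS or the service.
      package main

      import (
          "crypto/tls"
          "fmt"
      )

      func tryDial(addr, serverName string) {
          conn, err := tls.Dial("tcp", addr, &tls.Config{ServerName: serverName})
          if err != nil {
              fmt.Printf("%-50s FAILED: %v\n", addr, err)
              return
          }
          conn.Close()
          fmt.Printf("%-50s ok\n", addr)
      }

      func main() {
          const host = "payments.default.svc.cluster.local" // hypothetical

          // Path 1: resolve through cluster DNS as usual.
          tryDial(host+":443", host)

          // Path 2: skip DNS entirely and hit the service's ClusterIP.
          tryDial("10.96.23.45:443", host) // hypothetical ClusterIP
      }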

  • This post reminds me of a similar but different K8s Turned-Out-To-Be-DNS problem we had recently. We published a write-up about it but never got around to submitting it here until now: https://news.ycombinator.com/item?id=38740112

  • > To avoid ndots issues, the most straightforward solution is to have at least five dots in our hostname. Fluent-bit is one of the biggest abusers of the DNS requests. ... As it now has five dots in the domain, it doesn’t trigger local search anymore.

    But it wasn't DNS. DNS didn't break. The protocol didn't break. There weren't even issues with the CoreDNS or dnsmasq implementations.

    The culprit was ndots (why did Kubernetes arbitrarily choose five dots?) and the general way that Kubernetes (ab)uses DNS.
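
    For intuition, here is a small sketch (not the real resolver code, and the hostnames are invented) of the glibc-style rule that makes five dots special: a name with fewer than ndots dots walks the search list first, while a name with at least ndots dots, or a trailing dot, is tried as-is first.

      // ndots.go: illustrate how ndots and the search list decide the
      // order of DNS queries a stub resolver will send.
      package main

      import (
          "fmt"
          "strings"
      )

      func candidateQueries(name string, ndots int, search []string) []string {
          if strings.HasSuffix(name, ".") {
              return []string{name} // trailing dot: absolute name, no search
          }
          var withSearch []string
          for _, s := range search {
              withSearch = append(withSearch, name+"."+s+".")
          }
          if strings.Count(name, ".") >= ndots {
              // Enough dots: try the name as-is before the search suffixes.
              return append([]string{name + "."}, withSearch...)
          }
          // Too few dots: every search suffix is tried (and fails) first.
          return append(withSearch, name+".")
      }

      func main() {
          // Typical search path for a pod in the "default" namespace.
          search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}

          fmt.Println(candidateQueries("logs.example.internal.corp.com", 5, search))    // 4 dots
          fmt.Println(candidateQueries("logs.eu.example.internal.corp.com", 5, search)) // 5 dots
      }

    With ndots:5, even an ordinary external name like the first one generates extra queries that all come back NXDOMAIN, which is the multiplication the article's five-dot fix avoids.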

  • I always chuckle at the "It's not DNS...it was DNS" line because in my experience, the problem is usually actually DHCP.

    I'm struggling with a problem where a VM is supposed to get an IP address from the host, but it takes forever to do so. The host is telling me it has assigned an IP, but the VM says it hasn't. It can take anywhere from 10-60 minutes for the VM to actually get the IP that the host has assigned.

  • DNS is for friendly names, friendly to humans using web browsers. Using DNS for machine-to-machine communication is not essential complexity. Every chance I get, I eliminate DNS from internal infrastructure, and a whole lot of things get a lot better. If you naively leave forward/reverse DNS resolution on in different parts of the stack, you end up with a shitstorm of DNS lookup requests at even moderate scale. Then bad things tend to happen.

  • It's always DNS. And if it's not DNS, it's certificates.

    99.9% of the time.

  • Oh, k8s and DNS... I spent a lot of hours trying to debug a bug, and it turned out to be "k8s would eventually expose pods through DNS, but it could take 30 seconds" (or the time until the pod becomes ready plus 30 seconds, because CoreDNS caches negative DNS responses).

    I feel that caching all DNS responses for 30 seconds is not the right solution for every kind of usage pattern... Ah, generic solutions are for generic problems (which are usually not your problems).
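
    If you can't change the CoreDNS cache settings, the blunt client-side workaround is to retry the lookup with a short backoff until the record shows up. A minimal sketch (the hostname and timeout are made up):

      // waitfordns.go: keep retrying a lookup until it resolves, instead of
      // failing on the first NXDOMAIN that may be negatively cached.
      package main

      import (
          "fmt"
          "log"
          "net"
          "time"
      )

      func waitForDNS(host string, timeout time.Duration) ([]string, error) {
          deadline := time.Now().Add(timeout)
          for {
              addrs, err := net.LookupHost(host)
              if err == nil {
                  return addrs, nil
              }
              if time.Now().After(deadline) {
                  return nil, fmt.Errorf("%s did not resolve within %s: %w", host, timeout, err)
              }
              time.Sleep(2 * time.Second) // negative answers may be cached for up to 30s
          }
      }

      func main() {
          // Hypothetical pod record behind a headless service.
          addrs, err := waitForDNS("worker-3.jobs.default.svc.cluster.local", 90*time.Second)
          if err != nil {
              log.Fatal(err)
          }
          fmt.Println(addrs)
      }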

  • I'm not sure if it was a prank or a mistake, but someone recently set up a machine for me and fat-fingered the IP of the primary DNS server, so everything "worked" but was super slow because the primary lookup was silently timing out.

    I barked up the wrong tree for a while and then a more senior guy immediately found the issue. Anyways, now I grok this headline and have a new prank in my kit.
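
    In hindsight, timing the same lookup against each configured nameserver individually, rather than through the system's failover chain, would have exposed it immediately. A rough sketch with made-up server IPs:

      // resolvercheck.go: query each nameserver directly and time it; a
      // fat-fingered primary shows up as a timeout instead of a few ms.
      package main

      import (
          "context"
          "fmt"
          "net"
          "time"
      )

      func timeLookup(server, name string) {
          r := &net.Resolver{
              PreferGo: true,
              Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
                  // Ignore the resolver's choice of server; force this one.
                  d := net.Dialer{Timeout: 2 * time.Second}
                  return d.DialContext(ctx, network, server+":53")
              },
          }
          start := time.Now()
          _, err := r.LookupHost(context.Background(), name)
          fmt.Printf("%-15s %-10v err=%v\n", server, time.Since(start).Round(time.Millisecond), err)
      }

      func main() {
          for _, server := range []string{"10.0.0.2", "10.0.0.3"} { // primary, secondary (hypothetical)
              timeLookup(server, "example.com")
          }
      }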

  • The latest thing I had with DNS was a client and server communicating with EDNS packet sizes greater than 4096, but an intermediate caching server couldn't handle that, and I'd get intermittent resolution failures whenever the intermediate server landed on one particular host. Fortunately, I was just able to boost it.
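
    One way to confirm that kind of mismatch is to send the same query while advertising different EDNS0 buffer sizes and watch where answers start failing or coming back truncated. A sketch using the third-party github.com/miekg/dns library (the resolver IP and query name are made up):

      // ednsprobe.go: probe a resolver with several advertised EDNS0 UDP
      // buffer sizes to see which ones it handles cleanly.
      package main

      import (
          "fmt"

          "github.com/miekg/dns"
      )

      func probe(server, name string, bufSize uint16) {
          m := new(dns.Msg)
          m.SetQuestion(dns.Fqdn(name), dns.TypeTXT)
          m.SetEdns0(bufSize, false) // advertise our receive buffer size

          c := new(dns.Client)
          r, rtt, err := c.Exchange(m, server+":53")
          if err != nil {
              fmt.Printf("bufsize=%d  error: %v\n", bufSize, err)
              return
          }
          fmt.Printf("bufsize=%d  rcode=%s truncated=%v rtt=%v\n",
              bufSize, dns.RcodeToString[r.Rcode], r.Truncated, rtt)
      }

      func main() {
          for _, size := range []uint16{1232, 4096, 8192} {
              probe("192.0.2.53", "big-record.example.com", size)
          }
      }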

  • I’ve come to check DNS first nowadays. It’s the equivalent of checking if it’s plugged in at this point for me.

  • Do you have node-local DNS set up? https://kubernetes.io/docs/tasks/administer-cluster/nodeloca...

    Might have been a quicker, easier "fix".

  • This reminds me of an experience from two decades ago. https://0xfe.blogspot.com/2023/12/the-firewall-guy.html

    It's not always the Firewall -- unless it is :-)

  • Great writeup. Having had a lot of issues with fluentd buffer overflows over the years, it absolutely tickled me that this was the main clue that led to the discovery of the issue.

  • Anyone else have difficulty parsing that headline and making sense of it?