What are failfast errors in Linkerd and how do you debug them?

One common error message in Linkerd logs mentions “failfast” (also written fail-fast or fail fast). What is failfast, and what does it mean for you?

Let’s find out.

What is failfast?

If you’re using Linkerd, you might encounter failfast in error messages like these:

outbound:accept{client.addr=172.16.26.21:42480}:ingress{addr=172.16.108.124:4444}:http{v=1.x}:override{dst=linkerd-communication.linkerd-dev.svc.cluster.local:80}: linkerd_stack::failfast: HTTP Logical service has become unavailable

outbound:accept{client.addr=172.30.1.53:50162}:proxy{addr=172.28.70.43:8080}:http{v=1.x}:logical{dst=some-service.namespace.svc.cluster.local:8080}:concrete{addr=some-service.namespace.svc.cluster.local:8080}: linkerd_stack::failfast: HTTP Balancer in failfast

outbound:accept{client.addr=10.244.5.133:33824}:proxy{addr=10.245.112.110:3306}: linkerd_stack::failfast: TCP Server service has become unavailable

inbound:accept{client.addr=10.244.5.133:33824}:proxy{addr=10.245.112.110:3306}:tcp: linkerd_stack::failfast: TCP Logical service has become unavailable

Failfast is a state that Linkerd enters when it is unable to reach a destination that it was asked to proxy a request to. Once it’s in failfast, further requests to that destination immediately fail back to the caller until the destination actually becomes available.
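To make that behavior concrete, here is a toy model in Python. It is only a sketch of the state described above, not Linkerd’s actual implementation (the real proxy is written in Rust and is considerably more involved). The class name FailfastGate, the connect callable, and the probe method are all hypothetical names invented for this illustration.

```python
class FailfastGate:
    """Toy model of the failfast state (illustrative only, not Linkerd's code)."""

    def __init__(self, connect):
        # `connect` is a hypothetical callable that returns True when the
        # destination accepts a connection and False when it does not.
        self.connect = connect
        self.in_failfast = False

    def request(self):
        """Handle one request; return 'ok', 'timeout', or 'failfast'."""
        if self.in_failfast:
            # Fail immediately instead of waiting on another attempt.
            return "failfast"
        if self.connect():
            return "ok"
        # The attempt failed, so later requests are rejected right away.
        self.in_failfast = True
        return "timeout"

    def probe(self):
        """Periodic re-check: leave failfast once the destination is reachable."""
        if self.in_failfast and self.connect():
            self.in_failfast = False


# Usage: the destination is down for two requests, then comes back.
available = {"up": False}
gate = FailfastGate(lambda: available["up"])
results = [gate.request(), gate.request()]
available["up"] = True
gate.probe()
results.append(gate.request())
print(results)
```

The first request pays for the failed attempt, the second is rejected immediately, and after a successful re-check requests flow again.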

Failfast is an implementation detail of the proxy. However, because it often shows up in log messages when something is going wrong, we see many questions about what failfast actually is. When debugging a failfast message, the most important thing to understand is that failfast is not an error by itself. Failfast is a symptom of an underlying problem. If you see a failfast error, don’t blame Linkerd. Look at what Linkerd is trying to connect to: the problem is probably there!

Common ways that failfast happens

Failfast can happen in two ways.

First, failfast can happen on the inbound (server) side, when Linkerd is proxying an incoming request to the local app container. In this case, it means that Linkerd is unable to reach the app container on the given port. For example, if pod A tries to connect to meshed pod B on port 1234, and the application container in pod B doesn’t listen on port 1234, then the proxy in pod B would log a server-side failfast error for port 1234. It tried to connect to port 1234 on the local app container, but couldn’t, and future requests to port 1234 will be immediately failed by Linkerd. (Of course, Linkerd will periodically re-check whether port 1234 is open, and if the app container later opens that port, will leave fail-fast mode and proxy the connections as expected.)

The other way failfast can happen is on the outbound (client) side, when Linkerd is asked to proxy a request from the local app container to a destination somewhere else. For example, if meshed pod A wants to connect to port 1234 on pod B, and pod B is not listening on port 1234, then the proxy in pod A would log a client-side failfast error. More commonly, if meshed pod A is trying to connect to a service C on port 1234, and C has no endpoints available, then A will enter failfast.

Why is failfast necessary?

Failfast is really an optimization in Linkerd. Every time Linkerd tries to establish a connection, it sets a timeout on the operation (by default, 10 seconds). If Linkerd can’t connect within 10 seconds, it considers that a failure.

So failfast is simply there so that Linkerd (and your application) can avoid waiting for that timeout to occur any more than is strictly necessary.
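A rough back-of-the-envelope calculation shows the difference. The sketch below is deliberately simplified: it assumes requests arrive one after another and that the destination stays unreachable the whole time. The function name total_wait is made up for this example; only the 10-second timeout comes from the text above.

```python
CONNECT_TIMEOUT_S = 10  # default connect timeout quoted above


def total_wait(num_requests, failfast):
    """Seconds callers spend waiting when the destination never answers.

    Simplifying assumptions: requests are sequential and the destination
    is unreachable for all of them.
    """
    if num_requests == 0:
        return 0
    if failfast:
        # Only the first request waits out the timeout; the rest fail at once.
        return CONNECT_TIMEOUT_S
    # Without failfast, every request waits out the full timeout.
    return num_requests * CONNECT_TIMEOUT_S


print(total_wait(100, failfast=False))
print(total_wait(100, failfast=True))
```

Under these assumptions, 100 requests cost 1000 seconds of waiting without failfast and 10 seconds with it.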

Common situations that lead to failfast errors

  • The service you’re trying to connect to doesn’t select over any pods.
  • Every pod in the service has been evicted because they’re all failing liveness probes.
  • You’re trying to connect to a service on a port that it doesn’t use.

How do I debug failfast errors in Linkerd?

The failfast log line will typically include either inbound or outbound near the beginning of the string. This tells you whether it’s server-side (inbound) or client-side (outbound).
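If you are sifting through a lot of proxy logs, a small script can do this classification for you. The following Python sketch is based only on the shape of the sample log lines shown earlier; the function name classify_failfast and the "span.key" field naming are inventions of this example, and real log lines may carry extra prefixes or fields that it does not anticipate.

```python
import re


def classify_failfast(line):
    """Return (side, fields) for a failfast log line, or None otherwise."""
    if "linkerd_stack::failfast" not in line:
        return None
    # The first "inbound:" or "outbound:" marker tells us the direction.
    match = re.search(r"\b(inbound|outbound):", line)
    direction = match.group(1) if match else None
    side = {"inbound": "server-side", "outbound": "client-side"}.get(
        direction, "unknown"
    )
    # Collect key=value pairs inside name{...} spans, e.g. logical{dst=...}.
    fields = {}
    for span, body in re.findall(r"(\w+)\{([^}]*)\}", line):
        for pair in body.split():
            key, _, value = pair.partition("=")
            fields[f"{span}.{key}"] = value
    return side, fields


line = (
    "outbound:accept{client.addr=172.30.1.53:50162}:proxy{addr=172.28.70.43:8080}"
    ":http{v=1.x}:logical{dst=some-service.namespace.svc.cluster.local:8080}"
    ":concrete{addr=some-service.namespace.svc.cluster.local:8080}"
    ": linkerd_stack::failfast: HTTP Balancer in failfast"
)
side, fields = classify_failfast(line)
print(side, fields.get("logical.dst"))
```

For the sample line, this reports a client-side failfast and pulls out the destination that Linkerd was trying to reach, which is the thing to investigate next.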

For server-side failfast, there’s really only one problem: your application isn’t responding on the port Linkerd is trying to connect to it on. Figure out why.

For client-side failfast, you can use the linkerd diagnostics endpoints command to list out the endpoints that Linkerd is aware of for a given service. For example,

linkerd diagnostics endpoints emoji-svc.emojivoto.svc.cluster.local:8080 web-svc.emojivoto.svc.cluster.local:80

Hopefully this will give you a clue as to why Linkerd is unable to connect to the destination.

Good luck and happy debugging!
