Linkerd Destination Scaling Issues

We run Linkerd on a lot of Kubernetes clusters and it generally works fine for us. We recently upgraded from 2.10.2 to 2.14.8 (control-plane 1.16.10), which worked fine on most of our clusters.

However, on one of our more dynamic clusters we started to see seemingly random 504 errors (most noticeable in our ingresses). They occur roughly every 10 minutes with low probability (1-10 failures each time, and only at that time). The cluster topology changes a lot over the day: between 3k and 6k ReplicaSets and 15 to 100 nodes on a normal day. A lot of the ReplicaSets are scaled to 0 most of the time, and most of the others run a single replica, so we only have between 1k and 3k pods.
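
For context, the numbers above are roughly what the following return at different times of day (plain kubectl, nothing Linkerd-specific):

# Rough cluster-size figures quoted above.
kubectl get replicasets --all-namespaces --no-headers | wc -l
kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get nodes --no-headers | wc -l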

Digging deeper we found the following logs:

linkerd-proxy (sidecar in ingress):

[   366.299401s]  INFO ThreadId(01) outbound:ingress{addr=172.20.21.120:81}:service{ns= name=service port=0}: linkerd_proxy_api_resolve::resolve: No endpoints
[   369.298746s]  WARN ThreadId(01) outbound:ingress{addr=172.20.21.120:81}: linkerd_stack::failfast: Service entering failfast after 3s
[   369.298818s]  INFO ThreadId(01) outbound:ingress{addr=172.20.21.120:81}:rescue{client.addr=172.20.95.136:37560}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service xxxx.yyyy.svc.cluster.local:81: route default.http: backend default.service: service in fail-fast error.sources=[route default.http: backend default.service: service in fail-fast, backend default.service: service in fail-fast, service in fail-fast]

This happens at the time of the 504s in our ingress. However, the service xxxx.yyyy.svc.cluster.local actually exists and has not changed recently, so it must be something else.
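
For reference, a quick way to confirm this from the Kubernetes side (a sketch; xxxx/yyyy are the placeholder service and namespace from the log above):

# The Service and its EndpointSlices exist and have ready endpoints in the API,
# even while the proxy reports "No endpoints".
kubectl -n yyyy get service xxxx
kubectl -n yyyy get endpointslices -l kubernetes.io/service-name=xxxx -o wide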

destination:

time="2024-02-02T13:22:28Z" level=info msg="Trace[1398023451]: \"DeltaFIFO Pop Process\" ID:xxx/y1,Depth:12,Reason:slow event handlers blocking the queue (02-Feb-2024 13:22:28.553) (total time: 102ms):\nTrace[1398023451]: [102.144017ms] [102.144017ms] END\n"
time="2024-02-02T13:22:29Z" level=info msg="Trace[1252857032]: \"DeltaFIFO Pop Process\" ID:xxx/y2,Depth:1022,Reason:slow event handlers blocking the queue (02-Feb-2024 13:22:29.099) (total time: 104ms):\nTrace[1252857032]: [104.085804ms] [104.085804ms] END\n"
time="2024-02-02T13:22:29Z" level=info msg="Trace[1364238191]: \"DeltaFIFO Pop Process\" ID:xxx/y3,Depth:34,Reason:slow event handlers blocking the queue (02-Feb-2024 13:22:29.275) (total time: 124ms):\nTrace[1364238191]: [124.362754ms] [124.362754ms] END\n"

These lines show up in destination at the same time the errors occur. They also appear roughly every 10 minutes, but only when the cluster is experiencing some kind of load/changes (i.e. not during lunch time or during the night).
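
A rough sketch of how to see that cadence, assuming the default linkerd namespace and container name (counts the queue-blocking traces per minute for one destination replica; repeat per pod or use a label selector for all of them):

# Count "slow event handlers blocking the queue" traces per minute.
kubectl -n linkerd logs deploy/linkerd-destination -c destination \
  | grep 'slow event handlers blocking the queue' \
  | cut -d'"' -f2 | cut -c1-16 \
  | sort | uniq -c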

Occasionally (though not every time) we also see readiness probes fail:

Warning  Unhealthy  9m33s (x4 over 9m40s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Sometimes the error will also be something like this:

Warning  Unhealthy  18m   kubelet            Readiness probe failed: Get "http://172.20.46.177:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Destination does not use much CPU (about 50m, with a lot of headroom) or RAM (about 500-700MB). We tried scaling destination from 3 to 10 replicas and it did not really change anything: the absolute number of the queue-blocking log lines above stayed the same, even though we had roughly three times the destination pods. We expected either those log lines to triple (i.e. if they are caused by reading updates) or the number of errors to drop (i.e. because of less load per pod). Might this indicate that the slow handlers are blocked on some network request? We are not sure.

We looked for metrics inside destination but could not find anything obvious. We see roughly sum(ip_port_subscribers) = 1500, which is about the number of pods we had at that time, and avg(sum by (pod) (ip_port_subscribers)) = 500 with 3 replicas and 1500 with 10 replicas. We also noticed that rate(grpc_server_handling_seconds_sum{namespace="linkerd", component="destination", grpc_method="Get"}[15m]) / rate(grpc_server_handling_seconds_count{namespace="linkerd", component="destination", grpc_method="Get"}[15m]) is very high: most pods are between 15s and 800s, which looks like a lot. On other clusters this metric is about 1.5s with occasional spikes up to a few hundred seconds. We are not sure whether this is an issue or expected.
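
In case it helps others reproduce this, the raw metrics can also be pulled straight from the control-plane pods with the CLI instead of going through Prometheus (a sketch; output format may differ between versions):

# Dump controller metrics and pick out the two metrics discussed above.
linkerd diagnostics controller-metrics | grep -E 'ip_port_subscribers|grpc_server_handling_seconds'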

Any ideas what we should tune or check? The destination controller does not seem to have many configuration options. Downgrading would be an option, but we would prefer to fix forward.

Hey @jabdoa,

It looks like you’re seeing strange behavior out of the destination service. You can try upgrading to Linkerd 2.14.9 or even consider the latest edge release, 24.2.2.

You can diagnose sync issues for any given service with the linkerd diagnostics endpoints command, but you may be best off looking at later versions of the control plane, as they likely contain fixes for any bugs that have been found in destination.
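
For example, using the authority from your proxy log above:

# Show the endpoints the destination controller currently knows for that service/port.
linkerd diagnostics endpoints xxxx.yyyy.svc.cluster.local:81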

Hi! Thanks for your reply.

We updated to 2.14.9 in the meantime, which lists some destination-related fixes in its changelog. However, our issues persist even with 2.14.9. We will try the diagnostics command.

Jan