Linkerd-destination has stale data in endpoint cache

We’re running linkerd 2.13.4 and we hit a weird problem this morning. Around 15% of our traffic started failing with: failed to find transactions for account: rpc error: code = Unavailable desc = logical service transaction-server.tplat-team.svc.cluster.local:8024: service unavailable. We are fairly confident it’s related to linkerd, because restarting the linkerd-destination pods fixed the issue.

We’re having trouble nailing down the root cause. There are no log messages or metrics we can see that would indicate a problem. The only alert we got was that the linkerd-destination pods were consuming more memory (but not CPU) than usual. We did find that the destination controller had bad IPs in its cache, and that ended up causing Cilium to route things incorrectly.

Is there something we can look for in the linkerd logs or metrics that might shed more light on why this happened?

Error rate for cilium pod

Endpoint slice cache

You’ll notice that the cache size increases at very nearly the same time

Endpoint address count

Can you explain how you reached this conclusion?

It might prove useful to use linkerd diagnostics endpoints to verify how a given authority is being resolved.
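For example (a sketch; the authority is taken from the error message in the original post, and the EndpointSlice label assumes the service lives in the tplat-team namespace):

```shell
# Ask the destination controller what it currently resolves for the failing
# authority; the returned addresses should match the pods actually backing
# the service.
linkerd diagnostics endpoints transaction-server.tplat-team.svc.cluster.local:8024

# Cross-check against what Kubernetes itself thinks the endpoints are:
kubectl -n tplat-team get endpointslices \
  -l kubernetes.io/service-name=transaction-server -o wide
```

If the first command returns IPs that no longer appear in the EndpointSlices, that is direct evidence of a stale destination cache.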

Finally, please provide the logs from the destination container during those failures to see if anything pops up.
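Something along these lines should capture the relevant window (standard kubectl invocation against the control-plane namespace):

```shell
# Tail the destination container's logs around the failure window.
kubectl -n linkerd logs deploy/linkerd-destination -c destination --since=1h

# If the pods have restarted since the incident, grab the previous container's logs too:
kubectl -n linkerd logs deploy/linkerd-destination -c destination --previous
```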

We believe it is stale IPs (or, at least, IPs that should not be routed to according to the SMI TrafficSplit we have set up) for a few reasons.

For one thing, this event tends to occur right after we complete a rollout of a canaried set of pods. During the canary, two sets of pods are running: one with the new code, behind its own service (api-canary) that is a leaf of the TrafficSplit whose apex is the original service being routed to; and one with the old code behind a similar leaf service (api-stable). When the rollout finishes and traffic is stable, the selectors on api-stable are changed to mirror those of api-canary, because the changes have been promoted.
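For concreteness, the shape we're describing looks roughly like this (the apex service name, namespace, and weights here are illustrative assumptions; only api-stable and api-canary are our real leaf service names):

```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: api            # assumption: apex name for illustration
  namespace: tplat-team  # assumption: same namespace as in the error above
spec:
  service: api         # apex service that clients actually call
  backends:
  - service: api-stable  # old code; selectors are repointed here on promotion
    weight: 900
  - service: api-canary  # new code; ramped up during the canary
    weight: 100
```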

What we are seeing with linkerd is that, after the changes to the services have been made and linkerd should just be routing to the stable (newly promoted from canary) set of pods, it instead exhibits weird behavior where it will send traffic to the pods that were in the canary set (even after they are deleted).

During the TrafficSplit changes, the metrics show something odd: all traffic is routed to a single pod, and then it is routed back to a set of pods that was just deleted. That causes the proxies sending traffic to those pods to go into failfast, and they only recover once the destination control plane is restarted:

In the image above, you can see that traffic gets shifted to the canary set (though, for some reason, only to one pod), and progressively more traffic lands on that single pod until the canary set is judged good enough to become the new stable set. Then, once we have swapped over to the new stable set (our former canary pods, replicaset 778986fff9), we find that we are instead sending traffic to replicaset 85f7886bb6. This is made worse by the fact that those pods were in the process of being deleted! So while traffic succeeded for a short time, it quickly began failing, sending the client linkerd proxies into failfast, and they never picked up the new endpoints for the new stable set (778986fff9) until the destination control plane was restarted.

IMO, the 85f7886bb6 set of endpoints is stale: we swapped the selectors on the service to a different set of pods, and the 85f7886bb6 pods were being deleted. For some reason they were never pruned or updated from the client pods’ linkerd-proxy point of view, so traffic continued to be sent to them until the destination controller was restarted (which, I assume, reset any stale caches and refreshed the information the client-side proxies stream in).

I have been digging into the metrics a bit more, and one thing that stands out is that the outbound_http_balancer_endpoints metric from the upstream linkerd-proxy containers (the calling clients) can report an enormous number of pending endpoints for the downstream service (the server being called) – more than 300 of them – while the cluster has at most ~16 endpoints for that service in total.
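For reference, this is roughly the query we are looking at (the metric name and endpoint_state label are as emitted by the proxy; the grouping label depends on your Prometheus relabeling, so treat it as a placeholder):

```promql
# Pending endpoints reported per client workload. This should never exceed
# the real endpoint count for the backend (~16 here), but we see >300.
sum by (deployment) (
  outbound_http_balancer_endpoints{endpoint_state="pending"}
)
```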

With that in mind (that, perhaps, the upstream linkerd-proxy is not removing pending endpoints from the balancer), I wonder if something is causing those bad endpoints to be used. For example, in a fallback case where a 5XX is received momentarily and circuit breaking kicks in, the proxy might start retrying some of those pending endpoints. But the list is so long, and all of those endpoints are nonexistent, so the proxy ends up retrying indefinitely until the destination controller is restarted and the “ready” endpoints are forcibly updated, which snaps the proxy out of trying the dead pending endpoints.

That is just conjecture, though; I don’t have much hard evidence beyond the weird routing to stale IPs/endpoints described above, and an outbound_http_balancer_endpoints{endpoint_state="pending"} metric that is absolutely enormous.
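One way to confirm this on a live client pod, without going through Prometheus, is to read the gauge straight off the proxy's admin port (4191 is the default linkerd-proxy admin port; the pod name below is a placeholder):

```shell
# Forward the client pod's proxy admin port locally, then dump the balancer
# endpoint gauges. A huge "pending" count against a service with ~16 real
# endpoints would support the stale-cache theory.
kubectl -n tplat-team port-forward <client-pod> 4191:4191 &
curl -s http://localhost:4191/metrics | grep outbound_http_balancer_endpoints
```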