Outbound calls failure

Hi team, we recently started seeing outbound call failures for many pods when they try to connect to services within the same cluster. The proxy was OOM-killed; we fixed that and restarted the Linkerd control plane components, including the destination and identity pods.

However, even after stabilizing the control plane, the connection issues persisted for the affected application pods. The problem was only resolved after we manually restarted each impacted application pod.

We would like to better understand the underlying behavior here:

  1. Why does restarting the Linkerd control plane (destination/identity) not automatically restore connectivity for already-running pods?
  2. Is there a way to refresh or resync metadata or proxy state without restarting the application pods?
  3. Does the linkerd-proxy sidecar maintain any persistent state that requires a full pod restart after certain failure scenarios?

Our goal is to understand whether this behavior is expected and whether there is a cleaner recovery approach than restarting all affected application pods.

Here are a few logs we found in the proxy sidecar container during the outage:

worker must set a failure if it exits prematurely
thread 'main' panicked at /__w/linkerd2-proxy/linkerd2-proxy/linkerd/proxy/balance/queue/src/service.rs:73:18:
[238763.453630s]  WARN ThreadId(01) outbound: linkerd_app_core::serve: Server failed to become ready error=buffer's worker closed unexpectedly client.addr=10.2.57.34:52068

For reference, we previously raised a ticket for a similar issue, although that one was caused by certificate expiry: linkerd-identity-issuer-not-refreshing-certificates-as-expected. In that case we also had to restart all application pods to restore our services.

Hey @Darshan, sorry for the delay here! I’m just back from vacation.

Linkerd proxies currently set up their identities at startup. Notably, this includes loading the identity issuer certificate and trust anchor, which means that if the identity issuer certificate or trust anchor ever expires, you must restart the proxies to get an updated trust chain. That’s what you were seeing here. We’re actively working on making this smoother, but the data-plane restart is very important at the moment.
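
A data-plane restart like this is usually done as a rolling restart of the meshed workloads, which re-creates each pod along with its proxy. Here is a minimal sketch, assuming the workloads are Deployments and `kubectl` is on the PATH. The helper names and the namespaces (`payments`, `orders`) are hypothetical, and the script defaults to a dry run that only prints the commands:

```python
import subprocess


def rollout_restart_cmd(namespace, kind="deployment"):
    """Build the kubectl command that rolling-restarts every `kind` in `namespace`."""
    return ["kubectl", "rollout", "restart", kind, "-n", namespace]


def restart_meshed_workloads(namespaces, dry_run=True):
    """Rolling-restart the workloads in each namespace; returns the commands used.

    With dry_run=True (the default) the commands are only printed, not executed.
    """
    cmds = [rollout_restart_cmd(ns) for ns in namespaces]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if kubectl exits non-zero.
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Hypothetical namespaces; replace with the ones that hold the affected pods.
    restart_meshed_workloads(["payments", "orders"])
```

A rolling restart replaces pods gradually according to each Deployment's update strategy, so it is generally less disruptive than deleting the affected pods by hand.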

Hi @Flynn, understood that we need to restart the proxies to get an updated trust chain. Our main concern here is different, though: after the control plane components were OOM-killed and we restarted the identity and destination pods, the outbound call failures still persisted until we restarted the application pods. Is there a better way to fix this without restarting the application pods? Or can we expect the service mesh to recover automatically in such cases, without an application pod restart, in upcoming Linkerd versions?