Hi team, recently we started facing issues with outbound calls from many pods when trying to connect to services within the same cluster. The proxy was OOM-killed; after addressing that, we restarted the Linkerd control plane components, including the destination and identity pods.
However, even after the control plane stabilized, the connection issues persisted for the affected application pods. The problem was only resolved after we manually restarted each impacted application pod.
We would like to better understand the underlying behavior here:
- Why does restarting the Linkerd control plane (destination/identity) not automatically restore connectivity for already-running pods?
- Is there a way to refresh or resync endpoint metadata or proxy state without restarting the application pods?
- Does the linkerd-proxy sidecar maintain any persistent state that requires a full pod restart after certain failure scenarios?
Our goal is to understand whether this behavior is expected and whether there is a cleaner recovery approach than restarting all affected application pods.
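For context, this is a sketch of the recovery procedure we currently fall back on, assuming the standard kubectl and Linkerd CLIs; the deployment and namespace names below are placeholders, not our real workload names:

```shell
# Check control-plane and data-plane health before restarting anything.
linkerd check
linkerd check --proxy

# Current workaround: rolling-restart each affected workload so its pods
# come back up with fresh linkerd-proxy sidecars.
# "my-app" / "my-namespace" are placeholders.
kubectl rollout restart deployment/my-app -n my-namespace
kubectl rollout status deployment/my-app -n my-namespace
```

We would prefer a lighter-weight resync over this full rolling restart, which is what the questions above are getting at.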
Adding a few logs we found in the proxy sidecar container during the outage:
thread 'main' panicked at /__w/linkerd2-proxy/linkerd2-proxy/linkerd/proxy/balance/queue/src/service.rs:73:18:
worker must set a failure if it exits prematurely
[238763.453630s] WARN ThreadId(01) outbound: linkerd_app_core::serve: Server failed to become ready error=buffer's worker closed unexpectedly client.addr=10.2.57.34:52068
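For reference, the logs above were collected from the linkerd-proxy sidecar of an affected pod; a sketch of the commands used (the pod name is a placeholder):

```shell
# Logs from the current linkerd-proxy container of an affected pod.
kubectl logs my-app-pod-abc123 -n my-namespace -c linkerd-proxy

# Logs from the previous container instance, useful after a panic/restart.
kubectl logs my-app-pod-abc123 -n my-namespace -c linkerd-proxy --previous
```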
Adding a previous ticket we raised for a similar issue, though that one was caused by certificate expiry: linkerd-identity-issuer-not-refreshing-certificates-as-expected. In that case we also had to restart all application pods to restore our services.