Hello
I’ve noticed that when a Linkerd proxy pod restarts within my service mesh, there is a brief period where traffic to the associated workload is either delayed or temporarily unavailable.
While Kubernetes ensures that the pod is rescheduled, and Linkerd itself should handle retries and load balancing, there still seems to be a noticeable impact, especially for latency-sensitive applications. The issue is more prominent during rolling updates and node drain events.
I’ve checked the Linkerd readiness and liveness probes, and they seem to be functioning as expected. However, it appears that there is a short window during proxy initialization where traffic routing isn’t fully stable.
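To put a number on that window, a small probe along these lines could be run against the workload while a pod restarts. This is only a rough sketch: the URL, the polling interval, and the 0.5 s "slow" threshold are placeholders I made up, not values from my setup or from Linkerd.

```python
import http.client
import time
import urllib.request
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    timestamp: float  # seconds since the probe started
    ok: bool          # True if the request returned a 2xx status
    latency: float    # request duration in seconds


def probe_once(url: str, timeout: float = 1.0) -> Tuple[bool, float]:
    """Send one GET request and report (success, latency in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and connection resets all land here.
        ok = False
    return ok, time.monotonic() - start


def disruption_windows(
    samples: List[Sample], slow_threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Group consecutive failed or slow samples into (start, end) windows."""
    windows = []
    start = None
    last_bad = None
    for s in samples:
        bad = (not s.ok) or s.latency > slow_threshold
        if bad:
            if start is None:
                start = s.timestamp
            last_bad = s.timestamp
        elif start is not None:
            windows.append((start, last_bad))
            start = None
    if start is not None:
        windows.append((start, last_bad))
    return windows


if __name__ == "__main__":
    URL = "http://my-workload.example.svc.cluster.local/healthz"  # hypothetical
    t0 = time.monotonic()
    samples = []
    while time.monotonic() - t0 < 60:  # poll for one minute
        ok, latency = probe_once(URL)
        samples.append(Sample(time.monotonic() - t0, ok, latency))
        time.sleep(0.1)
    for begin, end in disruption_windows(samples):
        print(f"disrupted from {begin:.1f}s to {end:.1f}s")
```

Running this from another meshed pod while deleting one of the workload's pods should show how long requests actually fail or slow down, which would make it easier to compare before and after any configuration change.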
I’m wondering if this is due to how the service mesh updates its routing tables or if there are additional configurations needed to smooth out the transition when proxies restart. I checked the Linkerd Troubleshooting documentation related to this and found it quite informative.
Has anyone else experienced similar disruptions? Are there recommended best practices for ensuring zero-downtime proxy restarts, such as tweaking pod disruption budgets, fine-tuning Linkerd’s connection pooling settings, or using an alternative traffic failover strategy?
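For the pod disruption budget part, this is roughly what I have in mind, written as a short script that prints a manifest `kubectl apply -f` can read as JSON. The names `my-workload-pdb` and `my-workload`, the `app` label, and the `minAvailable` value of 2 are all hypothetical placeholders.

```python
import json


def build_pdb(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Minimum number of matching pods that must stay available
            # during voluntary disruptions such as node drains.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Hypothetical names and values, for illustration only.
    print(json.dumps(build_pdb("my-workload-pdb", "my-workload", 2), indent=2))
```

As far as I understand, a budget like this only limits evictions (for example during a node drain) and does not control how a Deployment's rolling update proceeds, so it would cover only part of what I'm seeing.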
Thank you!