Hi all, we are struggling to upgrade Linkerd in our cluster (from stable-2.14.1 to edge 24.10.5). The same upgrade procedure works fine on smaller clusters, but this cluster is relatively large (AWS, k8s 1.29, 75 nodes, 2k+ pods). The upgrade of the control plane itself finished fine, but once we started restarting the existing workloads so they would pick up the new proxy, the cluster quickly degraded into an unusable state. Luckily, rolling back to the original version helped. We found that the logs were flooded with thousands of messages like the following.
From proxies:
"level": "WARN", "fields": { "message": "Unexpected policy controller response; retrying with a backoff", "grpc.status": "Deadline expired before operation could complete", "grpc.message": "initial item not received within timeout" }, "target": "linkerd_app::dst",
From destination:
"level": "error", "msg": "failed to send profile update: rpc error: code = Canceled desc = context canceled",
The Prometheus metrics for CPU/memory consumption didn't show any bottleneck. Still, we tried significantly increasing the resources and the number of replicas for the whole Linkerd control plane, which only delayed the issue slightly. We also tried increasing resources on individual proxies and raising proxy-inbound-connect-timeout/proxy-outbound-connect-timeout, unfortunately without success.
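In case it helps to quantify the log flood described above, here is a minimal sketch of how such messages could be tallied. It assumes the logs have been saved with one complete JSON object per line, that proxy logs nest the text under "fields"/"message", and that destination logs use "msg" (as in the excerpts above). The function name `tally_messages` and the sample lines are hypothetical, reconstructed from those excerpts for illustration only.

```python
import json
from collections import Counter


def tally_messages(lines):
    """Count log lines by (level, message); lines that are not JSON objects are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line, ignore it
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        # Assumption: proxy logs keep the text in fields.message,
        # destination logs keep it in msg.
        fields = record.get("fields")
        message = None
        if isinstance(fields, dict):
            message = fields.get("message")
        if message is None:
            message = record.get("msg", "")
        counts[(level, message)] += 1
    return counts


# Hypothetical sample lines, rebuilt from the excerpts in this post.
sample = [
    '{"level": "WARN", "fields": {"message": "Unexpected policy controller response; retrying with a backoff"}, "target": "linkerd_app::dst"}',
    '{"level": "WARN", "fields": {"message": "Unexpected policy controller response; retrying with a backoff"}, "target": "linkerd_app::dst"}',
    '{"level": "error", "msg": "failed to send profile update: rpc error: code = Canceled desc = context canceled"}',
    'plain text line that is not JSON',
]

for (level, message), count in tally_messages(sample).most_common():
    print(count, level, message)
```

Feeding it the saved output of the proxy and destination containers, one line at a time, should show which message dominates and roughly when the flood begins.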
Thanks in advance for any ideas.