Destination failed to send profile update

Hi all, we are struggling to upgrade Linkerd in our cluster (from stable-2.14.1 to edge-24.10.5). The same upgrade procedure works fine on smaller clusters, but this cluster is relatively large (AWS, Kubernetes 1.29, 75 nodes, 2k+ pods). The upgrade of the control plane itself finished fine, but once we started restarting the existing workloads to pick up the new proxy, the cluster quickly degraded into an unusable state. Luckily, rolling back to the original version helped. We found that the logs were flooded with thousands of messages like these.

From proxies:

"level": "WARN", "fields": { "message": "Unexpected policy controller response; retrying with a backoff", "grpc.status": "Deadline expired before operation could complete", "grpc.message": "initial item not received within timeout" }, "target": "linkerd_app::dst",

From destination:

"level": "error", "msg": "failed to send profile update: rpc error: code = Canceled desc = context canceled",

The Prometheus metrics for CPU and memory consumption didn’t show any bottleneck. Still, we tried significantly increasing the resources and the number of replicas for the whole Linkerd control plane, which only delayed the issue slightly. We also tried increasing resources on individual proxies and raising proxy-inbound-connect-timeout/proxy-outbound-connect-timeout, unfortunately without success.

Thanks in advance for any ideas.

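The issue seems related to the control plane struggling at this cluster’s scale. The logs point to timeouts: the proxies are not receiving an initial response from the policy controller in time, and the destination controller is failing to send profile updates because the requests were canceled. Make sure these components have enough resources, and check for network latency or DNS issues between the proxies and the control plane. Try restarting workloads in smaller batches, and consider upgrading to a stable version first before moving to the edge release.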
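
The "smaller batches" suggestion could look roughly like the sketch below. It is only an illustration, not a tested procedure for this cluster: the namespace and deployment names, the batch size, and the 300s timeout are made-up placeholders, and it assumes the workloads are Deployments and that `kubectl` is already pointed at the right cluster. It restarts a few deployments, waits for each rollout to finish, and only then moves on, so the control plane never sees every proxy reconnecting at once.

```python
import subprocess
from typing import Callable, Iterator, List, Sequence, Tuple

# (namespace, deployment name); the values used below are hypothetical.
Workload = Tuple[str, str]


def batches(items: Sequence[Workload], size: int) -> Iterator[List[Workload]]:
    """Yield consecutive chunks of at most `size` workloads."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def restart_cmd(namespace: str, name: str) -> List[str]:
    """Command that triggers a rolling restart of one deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{name}", "-n", namespace]


def status_cmd(namespace: str, name: str, timeout: str = "300s") -> List[str]:
    """Command that blocks until the rollout finishes or the timeout expires."""
    return ["kubectl", "rollout", "status", f"deployment/{name}",
            "-n", namespace, f"--timeout={timeout}"]


def dry_run(cmd: List[str], check: bool = True) -> None:
    """Print the command instead of executing it."""
    print(" ".join(cmd))


def restart_in_batches(workloads: Sequence[Workload], size: int,
                       run: Callable = subprocess.run) -> None:
    """Restart one batch, wait for it to settle, then continue with the next."""
    for batch in batches(workloads, size):
        for namespace, name in batch:
            run(restart_cmd(namespace, name), check=True)
        for namespace, name in batch:
            # With subprocess.run and check=True, a failed or timed-out
            # rollout raises and stops the loop before the next batch.
            run(status_cmd(namespace, name), check=True)


if __name__ == "__main__":
    # Placeholder workloads; replace with the real namespaces and deployments.
    example = [("team-a", "api"), ("team-a", "worker"), ("team-b", "frontend")]
    # Prints the commands only; drop run=dry_run to actually execute them.
    restart_in_batches(example, size=2, run=dry_run)
```

Watching the destination and policy controller logs between batches should show whether the errors above start at a particular number of restarted proxies, which would help narrow down where the control plane falls behind.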