[Multicluster] Link failure

Hello, I was hoping to get some help setting up Linkerd for multicluster communication. I followed the instructions here and was able to link my clusters, but I'm only seeing a link-up state from one of the clusters; the others are failing.
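
For reference, this is roughly how I created the links, following the multicluster docs and using the same context names as in the outputs below (so treat these as approximate):

linkerd --context=claremont multicluster link --cluster-name claremont | kubectl --context=elyria apply -f -
linkerd --context=geneva multicluster link --cluster-name geneva | kubectl --context=elyria apply -f -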

This is what I see on the failing cluster using linkerd check:

linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
	* claremont
	* geneva
√ remote cluster access credentials are valid
	* claremont
	* geneva
√ clusters share trust anchors
	* claremont
	* geneva
√ service mirror controller has required permissions
	* claremont
	* geneva
√ service mirror controllers are running
	* claremont
	* geneva
× probe services able to communicate with all gateway mirrors
    liveness checks failed for claremont
    liveness checks failed for geneva
    see https://linkerd.io/2.13/checks/#l5d-multicluster-gateways-endpoints for hints
√ multicluster extension proxies are healthy
‼ multicluster extension proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-gateway-659f7dcd9c-9txfp (stable-2.13.2)
	* linkerd-service-mirror-claremont-75b768665b-ct2ft (stable-2.13.2)
	* linkerd-service-mirror-geneva-6c8d44b8b-j7mgv (stable-2.13.2)
    see https://linkerd.io/2.13/checks/#l5d-multicluster-proxy-cp-version for hints
√ multicluster extension proxies and cli versions match
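
To see which gateway address each link recorded, I think the Link resources on the failing cluster can be inspected like this (the resource and field names are my assumption from the Link CRD, so please correct me if they're off):

kubectl --context=elyria -n linkerd-multicluster get links.multicluster.linkerd.io
kubectl --context=elyria -n linkerd-multicluster get link claremont -o jsonpath='{.spec.gatewayAddress}{"\n"}'
kubectl --context=elyria -n linkerd-multicluster get link geneva -o jsonpath='{.spec.gatewayAddress}{"\n"}'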

The failing cluster's gateway checks:

linkerd --context=elyria multicluster gateways
CLUSTER    ALIVE    NUM_SVC      LATENCY
claremont  False          0            -
geneva     False          0            -
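
Since the liveness probes on the failing cluster go through the mirrored gateway services, their endpoints should show which address is actually being probed (I'm assuming the probe-gateway-<cluster> naming convention here):

kubectl --context=elyria -n linkerd-multicluster get svc,endpoints | grep probe-gateway
kubectl --context=elyria -n linkerd-multicluster get endpoints probe-gateway-claremont -o wide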

The working cluster's gateway checks:

linkerd --context=claremont multicluster gateways
CLUSTER  ALIVE    NUM_SVC      LATENCY
elyria   True           0          3ms
geneva   True           0          2ms

Service-mirror container logs on the non-working cluster:

time="2023-06-27T20:31:15Z" level=warning msg="Gateway returned unexpected status 503. Marking as unhealthy" probe-key=claremont
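
To reproduce that 503 outside of the service mirror, something like this from the non-working cluster should hit the gateway probe endpoint directly (this only tests raw reachability, since the real probe is sent through the meshed proxy with mTLS; the /ready path and port 4191 are my assumptions from the default probe spec, and <gateway-external-address> is a placeholder):

kubectl --context=elyria run curl-probe --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sv http://<gateway-external-address>:4191/ready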

linkerd-proxy container logs on the non-working cluster:

[  2868.313630s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2869.815798s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2870.934761s]  INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:33950}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[  2872.819979s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2874.068416s]  INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:33954}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[  2874.321869s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2875.825250s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2877.071833s]  INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:36266}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[  2877.327552s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2878.829522s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[  2880.149621s]  INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:36274}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[  2880.331180s]  WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

I’m also seeing this in the gateway pod logs on the non-working cluster:

[  2773.985830s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_core::serve: Connection closed error=TLS detection timed out client.addr=10.244.4.1:44681
[  2774.019384s]  INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=10.224.0.10:30646

I also noticed that the service mirror on the non-working cluster picked up an internal IP from the working cluster instead of the load balancer's external address. On top of that, the working cluster is on AWS, where the load balancer is exposed via a DNS hostname rather than an IP.
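
Assuming the Link really did capture an internal IP, I believe the gateway's published address on the AWS cluster can be checked and the link re-created with an explicit override (the --gateway-addresses flag is my assumption from the CLI help, so please correct me if that's not the right way to do it):

kubectl --context=claremont -n linkerd-multicluster get svc linkerd-gateway \
  -o jsonpath='{.status.loadBalancer.ingress}{"\n"}'
linkerd --context=claremont multicluster link --cluster-name claremont \
  --gateway-addresses <elb-hostname> | kubectl --context=elyria apply -f -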

I have also replied in the GitHub issue “Multi-cluster not working between EKS and AKS” (linkerd/linkerd2#11069).