Hello, I was hoping to get some help setting up Linkerd for multicluster communication. I followed the instructions here and was able to link my clusters, but I noticed that only one of the clusters reports a link-up state; the other clusters are failing.
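For context, each link was created roughly like this (the cluster names are the ones from the check output below; this is a sketch of the documented flow, not my exact invocation):

```shell
# Extract a Link manifest from one remote cluster's context
# and apply it into the local (elyria) cluster...
linkerd --context=claremont multicluster link --cluster-name claremont \
  | kubectl --context=elyria apply -f -

# ...and repeat for the other remote cluster.
linkerd --context=geneva multicluster link --cluster-name geneva \
  | kubectl --context=elyria apply -f -
```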
This is what I see on the failing cluster using linkerd check:
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
* claremont
* geneva
√ remote cluster access credentials are valid
* claremont
* geneva
√ clusters share trust anchors
* claremont
* geneva
√ service mirror controller has required permissions
* claremont
* geneva
√ service mirror controllers are running
* claremont
* geneva
× probe services able to communicate with all gateway mirrors
liveness checks failed for claremont
liveness checks failed for geneva
see https://linkerd.io/2.13/checks/#l5d-multicluster-gateways-endpoints for hints
√ multicluster extension proxies are healthy
‼ multicluster extension proxies are up-to-date
some proxies are not running the current version:
* linkerd-gateway-659f7dcd9c-9txfp (stable-2.13.2)
* linkerd-service-mirror-claremont-75b768665b-ct2ft (stable-2.13.2)
* linkerd-service-mirror-geneva-6c8d44b8b-j7mgv (stable-2.13.2)
see https://linkerd.io/2.13/checks/#l5d-multicluster-proxy-cp-version for hints
√ multicluster extension proxies and cli versions match
Gateway check on the failing cluster:
linkerd --context=elyria multicluster gateways
CLUSTER ALIVE NUM_SVC LATENCY
claremont False 0 -
geneva False 0 -
Gateway check on a working cluster:
linkerd --context=claremont multicluster gateways
CLUSTER ALIVE NUM_SVC LATENCY
elyria True 0 3ms
geneva True 0 2ms
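To see what the failing cluster's probes are actually targeting, the mirrored gateway services can be inspected; the service mirror creates a probe-gateway-&lt;link&gt; service in the linkerd-multicluster namespace (a sketch; adjust the namespace if the extension was installed elsewhere):

```shell
# On the failing cluster: what address does each gateway mirror resolve to?
kubectl --context=elyria -n linkerd-multicluster get svc \
  probe-gateway-claremont probe-gateway-geneva -o wide

# Compare with the gateway's externally published address on a working cluster.
kubectl --context=claremont -n linkerd-multicluster get svc linkerd-gateway -o wide
```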
Service-mirror container logs on the non-working cluster:
time="2023-06-27T20:31:15Z" level=warning msg="Gateway returned unexpected status 503. Marking as unhealthy" probe-key=claremont
linkerd-proxy container logs on the non-working cluster:
[ 2868.313630s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2869.815798s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2870.934761s] INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:33950}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[ 2872.819979s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2874.068416s] INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:33954}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[ 2874.321869s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2875.825250s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2877.071833s] INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:36266}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[ 2877.327552s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2878.829522s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2880.149621s] INFO ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:rescue{client.addr=10.244.4.68:36274}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.116.134:4191: service unavailable error.sources=[service unavailable]
[ 2880.331180s] WARN ThreadId(01) outbound:proxy{addr=10.0.116.134:4191}:service{ns= name=service port=0}:endpoint{addr=10.0.115.50:4191}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
I’m also seeing this in the gateway pod logs for the non-working cluster:
[ 2773.985830s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_core::serve: Connection closed error=TLS detection timed out client.addr=10.244.4.1:44681
[ 2774.019384s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=10.224.0.10:30646
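To rule out plain network reachability (firewall or load-balancer rules on the probe port), a raw probe from inside the failing cluster against the endpoint IP from the proxy logs above might look like this (curlimages/curl is just an example image; 4191/ready is the gateway probe port and path from the Link defaults):

```shell
# /ready on port 4191 is what the service mirror's liveness probe hits.
kubectl --context=elyria run gw-probe --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sv --max-time 5 http://10.0.115.50:4191/ready
```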