Stable-2.12.4 - Multicluster Headless Connection "504 Gateway Timeout"

I'm trying to set up a multicluster installation with headless service support using Linkerd stable-2.12.4, which should allow me to address statefulset pods directly from the remote cluster.

I followed the docs here. I’m pretty sure the cluster connection worked before I linked the cluster with the headless option:

linkerd multicluster link --cluster-name eu2 --set "enableHeadlessServices=true"

The linkerd and multicluster checks are all happy (:white_check_mark:), and I also see new endpoints being created in the remote cluster when I scale the statefulset in the remote cluster up or down. But when I query the service or a specific pod, I get this error, even though DNS resolves to the local Endpoint IP:

< HTTP/1.1 504 Gateway Timeout
< l5d-proxy-error: Gateway service in fail-fast
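For reference, a minimal sketch of how the per-pod address of a mirrored headless service is put together, assuming the default `cluster.local` cluster domain. The helper name `mirror_pod_host` is made up for illustration, and the `curl` call in the comment assumes the application speaks plain HTTP on the service port:

```shell
# Build the DNS name of one statefulset pod behind a mirrored headless service.
# Arguments: $1 pod hostname, $2 exported service name,
#            $3 link (cluster) name, $4 namespace.
# Assumes the default cluster domain "cluster.local".
mirror_pod_host() {
  printf '%s.%s-%s.%s.svc.cluster.local\n' "$1" "$2" "$3" "$4"
}

# Example, run from a meshed pod in the source cluster:
#   curl -v "http://$(mirror_pod_host linkedapp-0 linkedapp-svc eu2 multiregion):8765"
```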

I labeled this service to be exported, with no ClusterIP:

apiVersion: v1
kind: Service
metadata:
  labels:
    mirror.linkerd.io/exported: "true"
  name: linkedapp-svc
  namespace: multiregion
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 8765
  selector:
    app: linkedapp

Here is an example of the endpoints that can be resolved but throw the 504 on connection:

$ k get endpoints -n multiregion linkedapp-svc-eu2 -o yaml | k neat
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    mirror.linkerd.io/remote-gateway-identity: linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local
    mirror.linkerd.io/remote-svc-fq-name: linkedapp-svc.multiregion.svc.cluster.local
  labels:
    mirror.linkerd.io/cluster-name: eu2
    mirror.linkerd.io/mirrored-service: "true"
  name: linkedapp-svc-eu2
  namespace: multiregion
subsets:
- addresses:
  - hostname: linkedapp-1
  - hostname: linkedapp-0
  - hostname: linkedapp-2
  ports:
  - port: 8765
    protocol: TCP

All this is running between two AWS EKS clusters on v1.22.17 with the default AWS VPC CNI.
The classic NLBs that Linkerd creates are running and accessible; I also let them be recreated multiple times.
I tried a clean reinstall and recreated the NLB, but nothing changed :frowning:

I’ve found this discussion, but as this is a fresh installation with no traffic other than my own tests, that shouldn’t be the problem, no?

Any hints?
Maybe an AWS CNI issue?

Hi @seb

Are the requests that you’re making from the source cluster being made from inside a meshed pod? Because all multicluster traffic needs to be encrypted with mTLS, only meshed pods will be able to query the mirror services.

I’d also recommend taking a look at the linkerd-gateway logs in the target cluster. These logs may give you some clue as to why the gateway service is in fail-fast.
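As a rough sketch of both checks, assuming the default `linkerd-multicluster` namespace and `linkerd-gateway` deployment name. The helper `is_meshed` and the `<client-pod>` placeholder are made up for illustration:

```shell
# Succeeds if a space-separated list of container names (read from stdin)
# includes the Linkerd sidecar container, "linkerd-proxy".
is_meshed() {
  tr ' ' '\n' | grep -qx 'linkerd-proxy'
}

# Check the client pod in the source cluster (<client-pod> is a placeholder):
#   kubectl -n multiregion get pod <client-pod> \
#     -o jsonpath='{.spec.containers[*].name}' | is_meshed && echo meshed
#
# Read the gateway's proxy logs in the target cluster:
#   kubectl -n linkerd-multicluster logs deploy/linkerd-gateway -c linkerd-proxy
```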

Hi @Alex ,
thanks for your response! :slight_smile:
Yes, both clusters have an identical namespace (“multiregion” in this case) and inject Linkerd into all pods within that namespace.

I’ll rebuild everything from scratch again and also have a look at the gateway logs. I'll keep you posted!

Hey @Alex ,

I just spent some more time on this topic, wiped the complete installation again and reinstalled it, but I got the same issue.

Checking the linkerd-gateway logs in the target cluster I get the same issues as described here:

INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=

Any more ideas?
I tried to fix the health probes on the AWS LB, but it hasn’t helped so far…

EDIT: Talking to the headless services is now working, finally :partying_face:
In the end I had to restart all involved pods, and now it’s working.

Anyhow, I still get the errors above in the gateway. Even if they’re not blocking multicluster communication, I’d still like to get rid of them. Is there a fix for the LB health probes, ready to be applied?
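To gauge how noisy these messages are, and whether anything else is hiding behind them, a small filter over the gateway logs could look like this. It is only a sketch: the function names are made up, and the `kubectl` lines in the comments assume the default `linkerd-multicluster` namespace and `linkerd-gateway` deployment name:

```shell
# Count log lines (read from stdin) that report an unauthenticated direct
# connection. grep -c exits non-zero when the count is 0, hence "|| true".
count_unauthenticated() {
  grep -c 'direct connections must be mutually authenticated' || true
}

# Print every other log line, so unrelated errors are not drowned out.
other_gateway_lines() {
  grep -v 'direct connections must be mutually authenticated' || true
}

# Usage against the target cluster:
#   kubectl -n linkerd-multicluster logs deploy/linkerd-gateway -c linkerd-proxy \
#     | count_unauthenticated
#   kubectl -n linkerd-multicluster logs deploy/linkerd-gateway -c linkerd-proxy \
#     | other_gateway_lines
```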

Glad to hear that you got multicluster traffic working!!!

As for the errors in the logs, we have an open issue tracking this (Linkerd Gateway logs spammed with "connections must be mutually authenticated" from kube-system probes · Issue #10203 · linkerd/linkerd2 · GitHub)
