Hi all!
I’m trying to figure out the reason for a “fail-fast” error that occurred between 2 meshed services (A is the client, B is the server).
Is there a way to know at a given point which endpoints service A is currently aware of?
To the best of my understanding, “fail-fast” would happen when a client says: “I don’t know of any endpoints to send requests to, so I’m done”.
In this case, Service B had 3 instances but one of them crashed (OOM), which triggered one of the 2 instances of Service A to go into “fail-fast”.
I suspect Service A’s endpoints were not updated.
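What I would like to do, conceptually, is compare what Kubernetes itself lists as ready endpoints for Service B against whatever Service A's proxy believed at that moment. The Kubernetes half is easy to sketch (the service name, namespace, and label below are placeholders for our real Service B):

```bash
# Placeholder names: "service-b", "my-namespace", and the "app=service-b" label
# stand in for our real Service B.
# Endpoints Kubernetes currently considers ready for the service:
kubectl get endpointslices -n my-namespace -l kubernetes.io/service-name=service-b -o wide

# The backing pods, to spot the OOM-killed instance and its restarts:
kubectl get pods -n my-namespace -l app=service-b -o wide
```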
Perhaps this has to do with max-concurrent-requests too, like in this issue:
(GitHub issue, opened 20 Jul 2022, closed 2 Dec 2022; labels: wontfix, needs/repro, support)
### What is the issue?
The Linkerd proxy on a random [Cortex](https://cortexmetrics.io/) Ingester enters fail-fast mode, blocking communication with the distributors but not with other Cortex components like the Queriers.
That effectively breaks replication, as the distributors cannot see the Ingester in a healthy state, even though the communication via `memberlist` is unaffected and the Ingester appears active on the ring.
I tried restarting the Ingester, but that only solves the problem temporarily. The strange part is that, sometimes, another Ingester enters the fail-fast state after restarting the affected one, which is why I used the term random to describe the problem.
### How can it be reproduced?
The problem appears when handling a considerable amount of traffic. Currently, the distributors are receiving a constant rate of 50K samples per second in batches of 500, meaning, effectively, the distributors are receiving 100 requests per second (with 500 samples each), and according to `linkerd viz dashboard`, the Ingesters are receiving a similar number of RPS.
On my initial tests with orders of magnitude less traffic, the problem doesn't appear.
### Logs, error output, etc
The logs are very verbose due to the ingestion rate, so here are just the last 5 seconds from the affected Ingester (i.e., `ingester-0`) and the two distributors:
https://gist.github.com/agalue/5ecbbfcf37ecf8b5798bf18bbe0473b1
Here is how I got the logs:
```bash
kubectl logs -n cortex distributor-ddd56cf9-4sz4s -c linkerd-proxy --since=5s > distributor-4sz4s-proxy-logs.txt
kubectl logs -n cortex distributor-ddd56cf9-wzjdd -c linkerd-proxy --since=5s > distributor-wzjdd-proxy-logs.txt
kubectl logs -n cortex ingester-0 linkerd-proxy --since=5s > ingester-0-proxy-logs.txt
```
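A rough filter like the following over the captured files makes them easier to sift through (the keywords are just a guess at the relevant messages, not an exhaustive list):

```bash
# Rough keyword filter over the captures above; the patterns are a guess at the
# relevant fail-fast and connection errors, not an exhaustive list.
grep -i -E 'fail.fast|connection error|unavailable' \
  distributor-4sz4s-proxy-logs.txt \
  distributor-wzjdd-proxy-logs.txt \
  ingester-0-proxy-logs.txt
```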
### output of `linkerd check -o short`
```bash
➜ ~ linkerd check -o short
Status check results are √
```
### Environment
- Kubernetes Version: 1.23.3
- Cluster Environment: AKS with Kubenet and Calico
- Host OS: Ubuntu 18.04.6 LTS, 5.4.0-1083-azure, containerd://1.5.11+azure-2 (managed by AKS)
- Linkerd Version: 2.11.4
Note: the problem appears with and without Calico (tested on different clusters).
### Possible solution
_No response_
### Additional context
In Cortex, all components talk to each other via Pod IP, meaning all communication happens pod-to-pod over gRPC.
To give more context about what you will see in the proxy logs, here is the current pod layout:
```bash
➜ ~ kubectl get pod -n cortex -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
compactor-0 2/2 Running 0 47h 10.244.24.10 aks-rpsupport-13025066-vmss000004 <none> <none>
distributor-ddd56cf9-4sz4s 2/2 Running 0 46h 10.244.2.26 aks-distributors-15878912-vmss000000 <none> <none>
distributor-ddd56cf9-wzjdd 2/2 Running 0 46h 10.244.28.8 aks-distributors-15878912-vmss00000d <none> <none>
ingester-0 2/2 Running 0 24h 10.244.12.14 aks-ingesters-39599960-vmss000001 <none> <none>
ingester-1 2/2 Running 0 47h 10.244.4.13 aks-ingesters-39599960-vmss000000 <none> <none>
ingester-2 2/2 Running 0 43h 10.244.3.17 aks-ingesters-39599960-vmss000002 <none> <none>
memcached-chunks-0 3/3 Running 0 47h 10.244.8.26 aks-rpsupport-13025066-vmss000001 <none> <none>
memcached-chunks-1 3/3 Running 0 47h 10.244.10.28 aks-rpsupport-13025066-vmss000002 <none> <none>
memcached-chunks-2 3/3 Running 0 47h 10.244.6.33 aks-rpsupport-13025066-vmss000003 <none> <none>
memcached-frontend-0 3/3 Running 0 47h 10.244.10.30 aks-rpsupport-13025066-vmss000002 <none> <none>
memcached-frontend-1 3/3 Running 0 47h 10.244.8.25 aks-rpsupport-13025066-vmss000001 <none> <none>
memcached-frontend-2 3/3 Running 0 47h 10.244.6.32 aks-rpsupport-13025066-vmss000003 <none> <none>
memcached-index-0 3/3 Running 0 47h 10.244.10.29 aks-rpsupport-13025066-vmss000002 <none> <none>
memcached-index-1 3/3 Running 0 47h 10.244.8.24 aks-rpsupport-13025066-vmss000001 <none> <none>
memcached-index-2 3/3 Running 0 47h 10.244.6.31 aks-rpsupport-13025066-vmss000003 <none> <none>
memcached-metadata-0 3/3 Running 0 47h 10.244.6.34 aks-rpsupport-13025066-vmss000003 <none> <none>
querier-794978b45f-2b7z2 2/2 Running 0 47h 10.244.23.6 aks-storegws-30442145-vmss000013 <none> <none>
querier-794978b45f-h2fmf 2/2 Running 0 47h 10.244.15.13 aks-storegws-30442145-vmss00000w <none> <none>
querier-794978b45f-vbjqn 2/2 Running 0 47h 10.244.17.8 aks-storegws-30442145-vmss00000z <none> <none>
query-frontend-5b57ddb6cf-bkvpk 2/2 Running 0 47h 10.244.6.35 aks-rpsupport-13025066-vmss000003 <none> <none>
query-frontend-5b57ddb6cf-jxgq2 2/2 Running 0 47h 10.244.8.27 aks-rpsupport-13025066-vmss000001 <none> <none>
store-gateway-0 2/2 Running 0 47h 10.244.23.5 aks-storegws-30442145-vmss000013 <none> <none>
store-gateway-1 2/2 Running 0 47h 10.244.17.7 aks-storegws-30442145-vmss00000z <none> <none>
store-gateway-2 2/2 Running 0 47h 10.244.15.12 aks-storegws-30442145-vmss00000w <none> <none>
```
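Since everything is pod-to-pod, `linkerd viz edges` is one way to cross-check which of these pod pairs the mesh currently reports edges for, e.g. whether the distributor-to-Ingester edges still look healthy while a proxy is in fail-fast (only the namespace comes from this cluster; the rest is a generic invocation, output omitted):

```bash
# List the meshed pod-to-pod edges in the cortex namespace, to cross-check the
# distributor <-> ingester pairs while a proxy is in fail-fast.
linkerd viz edges po -n cortex
```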
The only error I found in the distributors' proxy logs is:
```text
[165766.325281s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.2.26:49942}: linkerd_app_core::serve: Connection closed reason=connection error: server: Transport endpoint is not connected (os error 107)
```
On the application side, the affected Ingester reports nothing in its log, as the distributor traffic is not reaching the application.
The distributors, on the other hand, are flooded with the following message, presumably because the proxy on the affected Ingester is rejecting the traffic:
```text
level=warn ts=2022-07-18T15:57:05.907166697Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=10.244.3.14:9095 reason="rpc error: code = Unavailable desc = HTTP Logical service in fail-fast"
```
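The `addr` in that warning is a pod IP, so it can be mapped back to a pod name with a wide pod listing (the IP is taken from the log line above; if the pod has since been rescheduled, nothing will match the current listing):

```bash
# Map the addr from the warning back to a pod name; the IP comes from the log
# line above and may no longer exist if the pod was rescheduled.
kubectl get pods -n cortex -o wide | grep 10.244.3.14
```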
### Would you like to work on fixing this bug?
_No response_
jmo (April 11, 2023, 2:48pm):
Hey Eli!
The `linkerd diagnostics endpoints` command will list the endpoints for a given service. A quick example using the emojivoto app would look like: `linkerd diagnostics endpoints emoji-svc.emojivoto.svc.cluster.local:8080 web-svc.emojivoto.svc.cluster.local:80`
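Applied to the scenario in the question, a quick sketch would look like this (the authority below is a placeholder for the real Service B name, namespace, and port):

```bash
# Placeholder authority: substitute Service B's actual name, namespace, and port.
linkerd diagnostics endpoints service-b.my-namespace.svc.cluster.local:8080

# Or keep polling it while an instance crashes, to watch the endpoint list change:
watch -n 2 linkerd diagnostics endpoints service-b.my-namespace.svc.cluster.local:8080
```

As far as I understand, this queries the control plane's destination service, so it shows what the control plane would hand out at that moment rather than what a given proxy instance has already cached.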