Question about the metric `outbound_http_balancer_endpoints`. My current understanding is that it reports the number of backend endpoints in the balancer that are in either a `pending` or `ready` state (Proxy Metrics | Linkerd). So the total number of endpoints (`ready` and `pending`) in the balancer for a given (`backend_name`, pod) pair should never exceed the total number of endpoints (presumably deduped across EndpointSlices) assigned to the backing object. For example, the sum of all `ready` and `pending` endpoints in the balancer for a Service should equal the number of existing endpoints for that Service; that is, `sum by (backend_name, pod) (outbound_http_balancer_endpoints{backend_name="api"})` should equal (# of service addresses) * (# of service ports) in the cluster (from `kubectl get endpoints api`).
With that in mind, we currently have some pods whose `outbound_http_balancer_endpoints` metrics show more than 300 endpoints in the `pending` state, which does not seem possible given that the number of endpoints for the `backend_name` (in this case pretty normal Services and TrafficSplits) is less than 10.
Am I understanding the metric correctly? If so, is there a bug with removing `pending` backends from the balancer under certain conditions (we are currently on Linkerd 2.13.4)? If not, what is the expected cap on `outbound_http_balancer_endpoints`?
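In case it is useful, this is roughly the query I have been using to surface the series that look impossible; a sketch, where the threshold of 10 only reflects that none of these services has more than 10 endpoints, and `PROM_URL` is the same placeholder as above:

```bash
# Every (pod, backend_name) pair whose pending-endpoint count exceeds 10, which
# should be impossible for these services if my reading of the metric is right.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (pod, backend_name) (outbound_http_balancer_endpoints{endpoint_state="pending"}) > 10'
```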
Some additional information:
- I ran `linkerd diagnostics endpoints api.team.svc.cluster.local` against the cluster and only got 4 endpoints (for the 4 existing pods), while the Linkerd proxies sending traffic to these services report hundreds of pending endpoints (though maybe diagnostics only prints ready endpoints?).
- I ran the diagnostics command against each of our destination pods, and they all agreed with each other.
- I double-checked the pods that had the high pending endpoints directly and confirmed that this is not a metrics issue (the commands I used are sketched after the output below):
```
# For pod A
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod B
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod C
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
```
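For reference, these are roughly the commands behind the checks above (a sketch; `deploy/nginx` is a placeholder for the meshed client workload):

```bash
# Control-plane view of the service's endpoints (what the destination controller sees).
linkerd diagnostics endpoints api.team.svc.cluster.local

# Raw metrics pulled straight from one of the affected client proxies, bypassing
# Prometheus, to rule out a scrape/relabeling artifact.
linkerd diagnostics proxy-metrics -n team deploy/nginx \
  | grep outbound_http_balancer_endpoints \
  | grep 'endpoint_state="pending"'
```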
Some more information about the service setup itself:
- The client linkerd-proxy is meshing an nginx proxy. The upstream service (`api-v2`) is the apex of a TrafficSplit with a `stable` and a `canary` backend. The `stable` service is the one with the highest number of pending endpoints, and it is also the one whose service selectors are switched quite frequently (as we change the set of pods that has been promoted from canary to stable, without restarting them); a rough sketch of the split is below.
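Illustratively, the split looks something like this (names, weights, and apiVersion here are approximations rather than the exact manifest we run):

```bash
# Illustrative TrafficSplit for the setup described above; real values differ.
kubectl apply -n team -f - <<'EOF'
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: api-v2
spec:
  service: api-v2             # apex service that the nginx client addresses
  backends:
  - service: api-v2-stable    # the service whose selectors get re-pointed frequently
    weight: 900m
  - service: api-v2-canary
    weight: 100m
EOF
```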
Let me know if you need more information here! Just super curious why we are seeing such a high number of pending endpoints that shouldn’t exist (given my current understanding).