Question about the metric `outbound_http_balancer_endpoints`. My current understanding is that it reports the number of backend endpoints in the balancer that are in either a `pending` or `ready` state (Proxy Metrics | Linkerd). So the total number of endpoints (`ready` and `pending`) in the balancer for a given (`backend_name`, pod) pair should never exceed the total number of endpoints (presumably deduped across EndpointSlices) assigned to the backing object. For example, the sum of all `ready` and `pending` endpoints in the balancer for a Service should equal the number of existing endpoints for that Service; that is, `sum by (backend_name, pod) (outbound_http_balancer_endpoints{backend_name="api"})` should equal (# of service addresses) * (# of service ports) in the cluster (from `kubectl get endpoints api`).
With that in mind, we currently have some pods whose `outbound_http_balancer_endpoints` metrics show more than 300 endpoints in the `pending` state, which does not seem possible given that the number of endpoints for the `backend_name` (in this case pretty normal Services and TrafficSplits) is less than 10.
Am I understanding the metric correctly? If so, is there a bug with removing `pending` backends from the balancer under certain conditions (we are currently on Linkerd 2.13.4)? If not, what is the expected cap on `outbound_http_balancer_endpoints`?
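In case it is useful, this is roughly the query I have been using to surface the series that look impossible; a sketch, where the threshold of 10 only reflects that none of these services has more than 10 endpoints, and `PROM_URL` is the same placeholder as above:

```bash
# Every (pod, backend_name) pair whose pending-endpoint count exceeds 10, which
# should be impossible for these services if my reading of the metric is right.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (pod, backend_name) (outbound_http_balancer_endpoints{endpoint_state="pending"}) > 10'
```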
Some additional information:
- I ran `linkerd diagnostics endpoints api.team.svc.cluster.local` against the cluster and only got 4 endpoints (for the 4 existing pods), while the Linkerd proxies sending traffic to these services report hundreds of pending endpoints (though maybe diagnostics only prints ready endpoints?).
- I ran the diagnostics command against each of our destination pods, and they all agreed with each other.
- I double-checked the pods that had the high pending endpoints directly and confirmed that this is not a metrics issue (the commands I used are sketched after the output below):
```
# For pod A
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod B
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod C
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
```
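For reference, these are roughly the commands behind the checks above (a sketch; `deploy/nginx` is a placeholder for the meshed client workload):

```bash
# Control-plane view of the service's endpoints (what the destination controller sees).
linkerd diagnostics endpoints api.team.svc.cluster.local

# Raw metrics pulled straight from one of the affected client proxies, bypassing
# Prometheus, to rule out a scrape/relabeling artifact.
linkerd diagnostics proxy-metrics -n team deploy/nginx \
  | grep outbound_http_balancer_endpoints \
  | grep 'endpoint_state="pending"'
```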
Some more information about the service setup itself:
- The client linkerd-proxy is meshing an nginx proxy. The upstream service (`api-v2`) is the apex of a TrafficSplit with a `stable` and a `canary` backend. The `stable` service is the one with the highest number of pending endpoints, and it is also the one whose service selectors are switched quite frequently (as we change the set of pods that has been promoted from canary to stable, without restarting them); a rough sketch of the split is below.
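Illustratively, the split looks something like this (names, weights, and apiVersion here are approximations rather than the exact manifest we run):

```bash
# Illustrative TrafficSplit for the setup described above; real values differ.
kubectl apply -n team -f - <<'EOF'
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: api-v2
spec:
  service: api-v2             # apex service that the nginx client addresses
  backends:
  - service: api-v2-stable    # the service whose selectors get re-pointed frequently
    weight: 900m
  - service: api-v2-canary
    weight: 100m
EOF
```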
Let me know if you need more information here! Just super curious why we are seeing such a high number of pending endpoints that shouldn’t exist (given my current understanding).