Clarification on high number of pending outbound http balancer endpoints and metric definition

A question about the metric outbound_http_balancer_endpoints. My current understanding is that it reports the number of backend endpoints in the balancer that are in either a pending or ready state (Proxy Metrics | Linkerd). If that is right, the total number of endpoints (ready and pending) in the balancer for a given (backend_name, pod) pair should never exceed the total number of endpoints assigned to the backing object (presumably deduplicated across EndpointSlices). For example, for a Service, sum by (backend_name, pod) (outbound_http_balancer_endpoints{backend_name="api"}) should, for each client pod, equal (# of service addresses) * (# of service ports) in the cluster (as reported by kubectl get endpoints api).
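To make the invariant I have in mind concrete, here is a minimal sketch of the check (all names and numbers are illustrative, not taken from our cluster):

```python
# Sketch of the invariant described above: for each client pod, the
# balancer's endpoint count (ready + pending) for a backend should be
# bounded by what the Service itself exposes.

def expected_total(num_addresses: int, num_ports: int) -> int:
    # What `kubectl get endpoints <svc>` implies for one client proxy:
    # one balancer endpoint per (address, port) pair.
    return num_addresses * num_ports

# Hypothetical per-client-pod totals for backend_name="api"
# (ready + pending summed per pod):
per_pod = {"pod-a": 4, "pod-b": 4, "pod-c": 4}

cap = expected_total(num_addresses=4, num_ports=1)
assert all(total <= cap for total in per_pod.values())
```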

With that in mind: we currently have some pods whose outbound_http_balancer_endpoints metrics report more than 300 endpoints in the pending state, which does not seem possible given that the number of endpoints for the backend_name in question (fairly ordinary Services and TrafficSplits) is fewer than 10.

Am I understanding the metric correctly? If so, is there a bug (we are currently on Linkerd 2.13.4) where pending backends fail to be removed from the balancer under certain conditions? If not, what is the expected upper bound of outbound_http_balancer_endpoints?

Some additional information:

  • I ran linkerd diagnostics endpoints against the cluster and got only 4 endpoints (one per existing pod), while the linkerd proxies sending traffic to these services report hundreds of pending endpoints (though perhaps diagnostics only prints ready endpoints?).
  • I ran the diagnostics command against each of our destination pods, and they all agreed with each other.
  • I double-checked the metrics directly on the pods reporting the high pending counts and confirmed that this is not a metrics-collection issue:
```
# For pod A
group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod B
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
# For pod C
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="team",parent_name="api-v2",parent_port="8080",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="team",backend_name="api-v2-stable",backend_port="8080",backend_section_name="",endpoint_state="pending"} 81
```
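(For anyone reproducing this, the pending counts can be pulled out of raw scrape output with a small parser. This is only a sketch covering the simple `name{labels} value` shape shown above, not a full Prometheus text-format parser, and the sample line is a shortened, hypothetical one:)

```python
import re

# Matches a simple exposition line: metric_name{label="v",...} value
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.]+)$')

def parse_metric(line: str):
    """Return (name, labels, value) for one exposition line, else None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    name, raw_labels, raw_value = m.groups()
    # Naive label split: fine here because no label values contain commas.
    labels = dict(kv.split("=", 1) for kv in raw_labels.split(","))
    labels = {k: v.strip('"') for k, v in labels.items()}
    return name, labels, float(raw_value)

# Shortened hypothetical line in the same shape as the output above:
line = ('outbound_http_balancer_endpoints{backend_name="api-v2-stable",'
        'endpoint_state="pending"} 81')
name, labels, value = parse_metric(line)
assert labels["endpoint_state"] == "pending" and value == 81.0
```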

Some more information about the service setup itself:

  • The client linkerd-proxy is meshing an nginx proxy. The upstream service (api-v2) is behind a TrafficSplit with a stable and a canary leg. The stable service is the one with the highest number of pending endpoints, and it is also the one whose service selectors are switched quite frequently (as we change the set of pods promoted from canary to stable, without restarting them).
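To illustrate the failure mode I suspect, here is a toy model (pure speculation on my part, not Linkerd's actual balancer code) of a balancer that registers endpoints on every selector switch but never evicts the ones that dropped out of the selection:

```python
# Toy model of the hypothesis: if endpoints are added on each selector
# switch but stale ones are never evicted, pending entries accumulate
# without bound, matching the 300+ pending counts we observe.

class LeakyBalancer:
    def __init__(self):
        self.pending = set()

    def update(self, addresses, evict_stale=True):
        self.pending |= set(addresses)       # new endpoints start pending
        if evict_stale:
            self.pending &= set(addresses)   # drop no-longer-selected ones

balancer = LeakyBalancer()
for rollout in range(100):
    # Each canary->stable promotion selects a fresh set of 4 pod IPs.
    ips = {f"10.0.{rollout}.{i}" for i in range(4)}
    balancer.update(ips, evict_stale=False)  # the hypothesized bug

print(len(balancer.pending))  # prints 400: grows with every switch
```

With `evict_stale=True` the pending set would stay capped at the current selection size (4), which is the behavior I would have expected.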

Let me know if you need more information here! I'm just really curious why we are seeing such a high number of pending endpoints that shouldn't exist (given my current understanding).