Topology aware routing doesn't appear to turn on/off under load

I’m using Linkerd to enable topology aware routing in Kubernetes. I’ve gotten my service to configure itself with EndpointSlice hints successfully, and have verified that traffic is being routed locally within the zone.
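For reference, this is roughly how I check the configuration (the service and namespace names match my test setup in the logs below; on Kubernetes 1.25 the hints are enabled via the service.kubernetes.io/topology-aware-hints annotation):

> kubectl -n infra-team get service net-tester-stable -o yaml | grep topology-aware-hints
> kubectl -n infra-team get endpointslices -l kubernetes.io/service-name=net-tester-stable -o yaml | grep -A3 hints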

So it all works, huzzah.

The problem I’m having is that topology aware routing won’t enable or disable while there is load on the system. If I’m sending even 1 RPS to my service, the feature won’t turn off even after the EndpointSlice controller has removed the hints from the EndpointSlices. Likewise, if there is any traffic at all, it won’t turn on even after the hints have been added.

The only ways I’ve found to get the state to change are:

  • Stop all traffic to the service and wait a minute or two
  • Kill all the service pods in the AZ. This works when a single pod is receiving all the traffic for an AZ but hasn’t fallen over.
  • Restart my nginx ingress pods (see the command sketch after this list)
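For the last item, a plain rollout restart is enough; this is a sketch assuming the standard ingress-nginx install (the namespace and deployment name may differ in your cluster):

> kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller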

It’s possible this is related to nginx doing weird connection pooling and never refreshing its connections. However, if I remove Linkerd from both nginx and the service, topology routing stops working, so I’m not certain that nginx is doing anything worse than usual.

My testing architecture is as follows:

K6 load generator → AWS LB → nginx ingress pods → my service pods
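For the record, the load is driven with k6; a minimal invocation looks something like this (the script name is just a placeholder):

> k6 run --vus 1 --duration 30m load-test.js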

Are there any logs I could collect to figure out what might be going wrong?

TL;DR

  • Topology aware routing works as expected in steady state
  • Switching between enabled/disabled under any load does not work

Kubernetes: 1.25.6
Linkerd: 2.13.3
nginx ingress controller: 1.5.1
service communication: gRPC

Hi @MarkRobinson

This is interesting. I would expect the routing to update dynamically when the endpoint hints change, without needing to pause traffic or restart clients.

One interesting place to collect logs is the destination controller: set its log level to debug and take a look at its output while you edit the endpoint hints.

You can set the log level by running

> linkerd upgrade --controller-log-level debug | kubectl apply -f -

and view the destination controller logs with

> kubectl logs -n linkerd deploy/linkerd-destination -c destination

The presence or absence of log messages which start with “Filtering” or “Filtered” is of particular interest.
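If there is a lot of output, a quick grep over the same logs should surface just those lines, e.g.:

> kubectl logs -n linkerd deploy/linkerd-destination -c destination --since=10m | grep -E 'Filtering|Filtered'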

Hey @Alex, thanks for the tips!

Here are the logs from when I reduced the pod count:

time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ing through addresses that should be consumed by zone us-east-1b" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ing through addresses that should be consumed by zone us-east-1b" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ed from 9 to 2 addresses" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ed from 11 to 4 addresses" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ing through addresses that should be consumed by zone us-east-1b" addr=":8086" component=endpoint-translator remote="100.96.23.166:47596" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ed from 9 to 2 addresses" addr=":8086" component=endpoint-translator remote="100.96.23.166:47596" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-18T00:03:14Z" level=debug msg="<mark>Filter</mark>ed from 11 to 4 addresses" addr=":8086" component=endpoint-translator remote="100.96.23.166:47596" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-17T23:59:45Z" level=debug msg="<mark>Filter</mark>ing through addresses that should be consumed by zone us-east-1b" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-17T23:59:45Z" level=debug msg="<mark>Filter</mark>ed from 12 to 5 addresses" addr=":8086" component=endpoint-translator remote="100.96.21.199:46440" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-17T23:59:45Z" level=debug msg="<mark>Filter</mark>ing through addresses that should be consumed by zone us-east-1b" addr=":8086" component=endpoint-translator remote="100.96.23.166:47596" service="net-tester-stable.infra-team.svc.cluster.local:8024"
time="2023-05-17T23:59:45Z" level=debug msg="<mark>Filter</mark>ed from 12 to 5 addresses" addr=":8086" component=endpoint-translator remote="100.96.23.166:47596" service="net-tester-stable.infra-team.svc.cluster.local:8024"

I reduced the pod count to 2 in one AZ, and that was enough to disable the EndpointSlice hints in Kubernetes, but traffic kept going to just those two pods.
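If it helps, I can also check what the destination controller is currently handing out for the authority that shows up in the logs above, e.g. with:

> linkerd diagnostics endpoints net-tester-stable.infra-team.svc.cluster.local:8024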