Latency between Service A and Service B

We are partway through migrating a microservice application from Azure Service Fabric to Azure Kubernetes Service. We are also changing the connectivity from traditional HTTP to gRPC. This move to gRPC means adding a service mesh to AKS, as AKS has no native capability to load balance gRPC requests (Kubernetes' built-in load balancing is per-connection, and gRPC uses long-lived HTTP/2 connections, so all requests would otherwise stick to one pod). We are using the popular Linkerd service mesh, and it is correctly load balancing our traffic.

The microservices have now all been containerised and are running on AKS.

We have now hit an issue during load testing that we are hoping you can help with.

The issue is that when making calls between services, we are consistently seeing an error rate of approximately 2%.

We have checked our logs and can see that approximately 2% of the traffic from Service A to Service B is timing out. Service A gets a DeadlineExceeded error, which is the standard gRPC status for a call exceeding its configured deadline (400ms in our setup).
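
For reference, that deadline is set on the client side and covers the whole round trip, including both Linkerd proxies and the network, not just Service B's handler. A minimal Go sketch of the pattern we mean (not our actual code; the client stub, method, and message names are placeholders standing in for generated stubs):

```go
package example

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Stand-ins for the protoc-generated stubs; placeholder names only.
type WorkRequest struct{ Id string }
type WorkResponse struct{}
type ServiceBClient interface {
	DoWork(ctx context.Context, in *WorkRequest, opts ...grpc.CallOption) (*WorkResponse, error)
}

// callServiceB shows the standard grpc-go per-call deadline pattern.
func callServiceB(client ServiceBClient, id string) error {
	// 400ms deadline, matching our current configuration. The clock runs
	// from here until the response arrives back at Service A, so proxy
	// and network time counts against it.
	ctx, cancel := context.WithTimeout(context.Background(), 400*time.Millisecond)
	defer cancel()

	_, err := client.DoWork(ctx, &WorkRequest{Id: id})
	if status.Code(err) == codes.DeadlineExceeded {
		// This is the ~2% case we are seeing in the logs.
		log.Printf("call to Service B exceeded the 400ms deadline: %v", err)
	}
	return err
}
```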

We have also validated in our logs that the actual processing time in Service B is typically only around 10ms, meaning it should respond well within the 400ms deadline without timing out.
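
To be clear about what that 10ms covers: it is handler time only, of the kind a server-side interceptor like the generic Go sketch below would report, so it excludes anything the Linkerd sidecars or the network add on either side. (Illustrative only, not our actual instrumentation.)

```go
package example

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// handlerTimer logs only the time spent inside Service B's handler,
// i.e. the ~10ms "processing time" figure. Time spent in the Linkerd
// proxies or on the wire is not included in this number.
func handlerTimer(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {

	start := time.Now()
	resp, err := handler(ctx, req)
	log.Printf("method=%s handler_time=%s", info.FullMethod, time.Since(start))
	return resp, err
}

// Registered when the server is constructed:
//   srv := grpc.NewServer(grpc.UnaryInterceptor(handlerTimer))
```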

We have tried massively increasing this timeout to 30 seconds, and all of the errors go away (as you'd expect). We still see most requests finishing in under 100ms, but with a bunch of outliers taking 1-2s to complete. We have tried adjusting the resources allocated to the Linkerd sidecar containers and adjusting the number of downstream pods available to service requests, but none of this seems to make a difference to the errors. The error rate stays consistent regardless of volume (even low volume gives a 2% error rate), which suggests this is not a performance bottleneck somewhere but some inherent issue with latency in transit.

Hi Jude. This is interesting. I don’t know off the top of my head why you would see this behavior, but K8s is a complex system overall. A couple things you could try:

  1. Mesh the client side only, and see if the behavior persists. That would help narrow down whether this is due to client-side or server-side behavior.
  2. Double-check that you don't see the behavior when Linkerd is disabled entirely. It sounds like a couple of things have changed in your environment at once. (You could even use the skip-inbound-ports and skip-outbound-ports annotations so that the proxy is injected but bypassed.)
  3. Add distributed tracing to see if you can isolate which component the latency is coming from. If you can, some debug-level logging on that component (e.g. if it is a proxy) could help us pin down the underlying issue. (There's also a minimal latency-logging sketch after this list if full tracing is more than you want to wire up right away.)
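
If full tracing is a bigger lift than you want right now, even a hand-rolled client interceptor that logs per-call wall-clock time is a useful first cut: it gives you the client's view of each RPC to line up against Service B's handler timings and the proxy metrics. A rough Go sketch (the dial target and options are placeholders, not a recommendation for your setup):

```go
package example

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// latencyLogger is a unary client interceptor that records the
// client-observed wall-clock time and gRPC status of every call.
// Comparing these numbers with Service B's handler timings and the
// proxy metrics shows which hop the extra 1-2s is being spent in.
func latencyLogger(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {

	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("rpc=%s code=%s latency=%s", method, status.Code(err), time.Since(start))
	return err
}

// Wired in when dialing Service B (target and credentials are placeholders):
//   conn, err := grpc.Dial("service-b:80",
//       grpc.WithTransportCredentials(insecure.NewCredentials()),
//       grpc.WithUnaryInterceptor(latencyLogger))
```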

We’d be happy to do a small professional services contract if that would be helpful; we tackle these integration issues all the time on behalf of customers. Alternatively, let us know if you discover any more information and we’ll do our best to guide you.