High Linkerd TCP connection count and timeouts

We use Linkerd as a service mesh in production. Last evening one of our applications started timing out. This app is called by one of our services, and the two communicate over a Linkerd URL: service A calls service B, which was down, using Linkerd’s internal URL.

While investigating, we found that the number of open connections was very high and had plateaued.

To resolve it, we restarted the app pod that had the high number of open TCP connections, and things started working again.

We are now trying to figure out what happened, as there were no changes to the system.

Questions I am grappling with:

  1. Why did the number of open TCP connections spike so suddenly? The rise looks almost vertical, like a 90-degree angle.
  2. Why did it plateau around 500? Why can’t we have more open connections?

Hi @johri21

Without more information about the system or reproduction steps, it’s hard to say why this might have happened. One thing I’d recommend is looking at the proxy metrics (you can use the linkerd diagnostics proxy-metrics command) to figure out exactly where the connection spike is. For example, is it between the client and its sidecar proxy, between the two proxies, or between the server and its sidecar proxy?
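
If it helps, here is a rough sketch of how you could break the open connection count down by side once you have the metrics text. It assumes the proxy exposes a tcp_open_connections gauge with direction and peer labels in the Prometheus text format; please check the names against your own output, since they may differ between Linkerd versions. The function name open_connections_by_side is just something I made up for illustration.

```python
import re
from collections import defaultdict

# A Prometheus text-format sample line looks like: name{label="v",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def open_connections_by_side(metrics_text, metric="tcp_open_connections"):
    """Sum one gauge per (direction, peer) pair.

    Assumption: the metric name and the `direction` / `peer` labels match
    what your proxy version actually emits. Samples that lack those labels
    are grouped under "?".
    """
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        key = (labels.get("direction", "?"), labels.get("peer", "?"))
        totals[key] += float(match.group("value"))
    return dict(totals)
```

To use it, save the output of linkerd diagnostics proxy-metrics for the affected pod to a file and pass the file contents to the function. As far as I understand the labels, outbound/src would be the app talking to its own sidecar, outbound/dst the sidecar talking to remote destinations, inbound/src remote peers connecting to the sidecar, and inbound/dst the sidecar talking to the local app, so whichever pair sits near 500 tells you which hop to look at.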