We use Linkerd as a service mesh in production. Last evening, one of our applications started timing out. The app is called by another of our services over Linkerd's internal URL: service A calls service B, and service B was the one that was down.
While investigating, we found that the number of open TCP connections had climbed very high and then plateaued.
To resolve it, we restarted the app pod that had the high number of open TCP connections, and things started working again.
We are now trying to figure out what happened, since nothing had changed in the system.
Questions I am trying to answer:
- Why did the number of open TCP connections spike so suddenly? On the graph the line goes up almost vertically, like a 90-degree angle.
- Why did it plateau around 500? Why can't we have more open connections?
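
If it happens again, one thing that might help narrow this down before restarting is a breakdown of the pod's connections by TCP state. Below is a minimal sketch, assuming a Linux container where `/proc/net/tcp` is readable (IPv6 sockets are listed separately in `/proc/net/tcp6`). The function names are my own, and this is not how the connection graph mentioned above was produced.

```python
from collections import Counter

# State codes as they appear in the "st" column of /proc/net/tcp
# (values from the Linux kernel's TCP state enum).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING", "0C": "NEW_SYN_RECV",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        # Columns: sl, local_address, rem_address, st, ...
        state_code = fields[3].upper()
        counts[TCP_STATES.get(state_code, "UNKNOWN")] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    """Read the given proc file and return the per-state socket counts."""
    with open(path) as f:
        return count_tcp_states(f.read())
```

Calling `read_tcp_states()` inside the affected pod returns a `Counter` keyed by state name. A large `CLOSE_WAIT` count would usually mean the remote side has closed the connection but the local process has not closed its socket yet, whereas a large `ESTABLISHED` count would mean the connections are still open on both ends.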