I’m cross-posting this from a discussion I started on GitHub, which unfortunately hasn’t gotten any traction there, so I’m hoping a wider forum here can assist.
I’m hoping to get some guidance/direction on how to troubleshoot something we’re experiencing on one of our services.
Environment: linkerd 2.12.4 on K8s 1.24.9
We have a fairly simple application that serves HTTP requests and is exposed behind an nginx ingress controller. Both the ingress controllers and the application Pods are meshed. Requests hitting this app usually have a small response size, but from time to time it needs to serve a larger payload (~5MB, compressed).
The application is written in Go, using net/http to serve requests. It runs on 8 Pods and is sufficiently sized (based on CPU/memory usage observations). Each Pod receives ~10 req/s and has ~70 inbound connections. However, when the app responds with a ~5MB payload (per request), we notice that the linkerd proxy sidecar's memory utilization increases quite rapidly, and if we increase the load a little further (to ~12 req/s per Pod), the linkerd proxy eventually OOMs.
Bandwidth-wise, each Pod responds at ~20MB/s (at ~10 req/s), which drives the memory usage on the proxy to ~200MB. When we increase the load slightly, each Pod sends ~25MB/s, and that’s when the linkerd proxy eventually OOMs.
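For a rough sense of scale, here is a back-of-envelope calculation on the numbers above, sketched in Go. The helper names are made up for illustration, and spreading proxy memory evenly across connections is a simplifying assumption, since not all proxy memory is necessarily per-connection.

```go
package main

import "fmt"

// avgResponseMB returns the average response size implied by a
// throughput (MB/s) and a request rate (req/s).
func avgResponseMB(throughputMBps, reqPerSec float64) float64 {
	return throughputMBps / reqPerSec
}

// memPerConnMB spreads proxy memory evenly over the open connections.
// Assumption: this ignores any fixed, non-per-connection overhead.
func memPerConnMB(proxyMemMB float64, conns int) float64 {
	return proxyMemMB / float64(conns)
}

func main() {
	// Figures taken from the observations above.
	fmt.Printf("avg response at ~10 req/s: %.1f MB\n", avgResponseMB(20, 10))
	fmt.Printf("avg response at ~12 req/s: %.1f MB\n", avgResponseMB(25, 12))
	fmt.Printf("proxy memory per connection: %.1f MB\n", memPerConnMB(200, 70))
}
```

The average response works out to about 2MB, so only a portion of the requests carry the ~5MB payload, and ~200MB across ~70 connections is a few MB per connection. That is the same order of magnitude as a single large response, though it may well be a coincidence.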
We’re trying to understand where the bottleneck here might be.
We understand we can increase the proxy memory requests & limits, but when we tried that, it just seemed to shift the issue downstream (to the linkerd proxies on the nginx ingress controller), which caused a much bigger issue, as it impacted all services behind the ingress. For now we’re on the default 250Mi setting, but will likely try increasing it to 512Mi to give us a little breathing room.
(speculation starts here)
It appears as though the proxy buffers the response data in memory while waiting for the application to respond with the payload, and releases it only after the full response has been received? The application's inbound latency increases only at p99, to around 500~700ms, when it’s serving the larger payload. Otherwise we’re not seeing anything abnormal, and we’re not quite sure how to troubleshoot this further.
I reviewed other issues/discussions I found on this topic. This one suggests that memory increases as the proxy needs to handle more connections, which correlates with what we see on the ingress controllers in general, and we have increased the linkerd proxy memory allocations there. However, this doesn’t seem to explain what we’re seeing on the application Pods: they each have a steady ~70 inbound connections, and this only happens when they are serving the larger payload (~5MB) as a response.
Any guidance/assistance would be greatly appreciated!