Hello! I’ve noticed that we have a single deployment for which the linkerd-proxy container appears to have a slow memory leak. It currently has a memory limit of 500Mi for the container, and will take over a week to hit that limit, but memory usage for it really stands out among all of our other workloads.
I know this is probably nearly impossible to diagnose with the information I’ve given, but are there any properties of a workload that might interact with linkerd to precipitate something like this? As far as I know, we’re not having any operational problems with the workload. It’s meshed, subject to authz policies, and working fine.
I’ve attached a screenshot that shows memory usage for the linkerd-proxy container for all of our deployments. You can see that only a single one (out of dozens) shows this pattern.
Which version is the proxy?
Edit: never mind, just saw that in the title. Interesting. Can you connect your deployment to Buoyant Cloud and send us a diagnostic bundle? That would be the quickest way to dig into this.
I pressed the button to send diagnostics. I was told I’d get an email, but I’m not sure that ever happened.
Neal, I just wanted to follow up on this. Did you ever get an email with the diagnostic ID in it?
Hello! I don’t think I ever got an email.
Interesting. I looked through the send logs and also don’t see anything. Can you try again? And note the timestamp?
I just tried again at 5:25pm PDT. I didn’t receive a timestamp from Buoyant or anything, in case that’s what you mean by timestamp.
Neal, I dropped the ball on replying to this. I trawled our outbound email logs and also don’t see any email sent to you at that time. Not sure why that would be. I will file an internal ticket for us to investigate. Sorry about that!
In terms of that one workload, can you describe the nature of the traffic a bit? What does it do that’s different from the other meshed workloads?
The workload takes requests that are large JSON blobs (mostly base64-encoded images) and returns JSON responses containing the scan results for those images. We have other workloads that are similar and are not experiencing a memory leak. I’m not sure what distinguishes this workload from the other meshed workloads.
At this point, we at least have a workaround: the ol’ “reboot it daily.” (The leak would probably take about 4-5 days to cause an OOM, so this is more aggressive than necessary, but fine.)
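For anyone who wants to sanity-check a "days until OOM" estimate like the one above, here is a minimal sketch that fits a straight line to a few memory samples and extrapolates to the container limit. The function name and the sample values are hypothetical, made up for illustration; they are not measurements from this workload, and the approach assumes the growth is roughly linear.

```python
def hours_until_limit(samples, limit_mib):
    """Estimate hours from the last sample until memory reaches limit_mib.

    samples: list of (hours_since_start, memory_mib) tuples, in time order.
    Returns None if the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")

    # Ordinary least-squares slope (MiB per hour) and intercept.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t

    # Time at which the fitted line crosses the limit, relative to the last sample.
    t_at_limit = (limit_mib - intercept) / slope
    return t_at_limit - samples[-1][0]


# Hypothetical samples: (hours since first reading, proxy memory in MiB).
example_samples = [(0, 100), (24, 200), (48, 300)]
remaining = hours_until_limit(example_samples, limit_mib=500)
print(f"estimated hours until limit: {remaining:.1f}")
```

Feeding it real readings for the linkerd-proxy container and the 500Mi limit would give a rough idea of how much headroom a daily restart leaves.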