I know HA mode sets the number of replicas for each of the control plane components to 3. Out of curiosity, when (if ever) would it make sense to increase that to even more replicas?
Hi Neal. We rarely recommend scaling beyond 3 replicas, since data plane memory consumption grows with the number of control plane replicas. Our experience is that 3 is a good balance: HA mode uses node anti-affinity to distribute components across distinct nodes, so the failure case with 3 replicas is the simultaneous failure or network partition of 3 distinct nodes, and that is rarely the salient failure mode in the system versus e.g. whole-cluster failure, AZ failure, or something else.
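If you ever want to sanity-check that spread yourself, here's a rough sketch using the Python Kubernetes client. It assumes the control plane is installed in the linkerd namespace (adjust if yours differs) and just groups the pods by node:

```python
# Sketch: list the Linkerd control plane pods and group them by node, to
# confirm the anti-affinity rules actually spread replicas across nodes.
# Assumes the control plane is installed in the "linkerd" namespace.
from collections import defaultdict

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods_by_node = defaultdict(list)
for pod in v1.list_namespaced_pod(namespace="linkerd").items:
    pods_by_node[pod.spec.node_name].append(pod.metadata.name)

for node, pods in sorted(pods_by_node.items()):
    print(f"{node}: {', '.join(pods)}")
```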
There are occasionally specific scenarios that call for more replicas, but unless you're doing something out of the ordinary I'd be surprised if it were warranted. Hope that helps.
Thanks for this. A while back, we increased the number of replicas from 3 to 5 in our clusters. This was when we were having some issues with the control plane components restarting. IIRC, that was mostly due to the destination component OOMing (and so the correct thing to do there was to allocate more memory, which we also did).
However, I’m noticing today that destination pods still have occasional restarts which are not due to OOM. It looks like they occasionally fail readiness and liveness checks (due to timeouts). This doesn’t happen often enough to cause us any issues with 5 pods, but I’m vaguely nervous that dropping down to 3 might cause some instability for us. Given the rate at which the restarts happen, this is very unlikely, but still.
Is it expected that destination pods will occasionally (maybe once every other day) fail liveness checks?
I would not expect to see sporadic liveness check failures unless there’s something going on in the underlay that would explain probes not getting a response back (but I would also expect that to have a wider impact and be noticeable on components outside of Linkerd as well).
Is there any correlation with nodes (i.e., is it happening only on certain nodes?), with time (does it happen around the same time each time?), etc.?
Another thing to look out for is whether they’re all restarting at the same time. As long as one instance is in a Ready state, you could realistically drop all the other instances and be fine. They’re there primarily for redundancy (for resource scalability we recommend scaling vertically where possible), and even if all destination instances go down, existing connections can keep using the proxies’ cached discovery results without needing to call back home. Only new connections would be temporarily impacted.
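In case it's useful, here's a rough sketch (Python Kubernetes client) that prints the node, restart count, and last termination time/reason for each destination pod, which should make any node or time correlation obvious. It assumes the linkerd namespace and the standard linkerd.io/control-plane-component label; adjust if your install differs:

```python
# Sketch: show node, restart count, and last termination details for the
# destination pods so node/time correlations stand out.
# Assumes the "linkerd" namespace and the standard
# linkerd.io/control-plane-component label.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="linkerd",
    label_selector="linkerd.io/control-plane-component=destination",
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        last = f"last exit: {term.reason} at {term.finished_at}" if term else "no prior exit"
        print(
            f"{pod.metadata.name} ({pod.spec.node_name}) "
            f"container={cs.name} restarts={cs.restart_count} {last}"
        )
```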
Anything in the logs when they fail the health checks?
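Since the container gets restarted, the interesting lines are usually in the previous container instance's logs rather than the current one's. A rough sketch to pull those (again with the Python Kubernetes client; it assumes the linkerd namespace and a container literally named destination, so adjust names to your install):

```python
# Sketch: pull logs from the *previous* (restarted) destination container
# instance, which is where anything emitted around the failed health check
# would be. Assumes the "linkerd" namespace and a container named "destination".
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="linkerd",
    label_selector="linkerd.io/control-plane-component=destination",
)
for pod in pods.items:
    try:
        logs = v1.read_namespaced_pod_log(
            name=pod.metadata.name,
            namespace="linkerd",
            container="destination",
            previous=True,      # the container instance that was restarted
            timestamps=True,
            tail_lines=200,
        )
        print(f"--- {pod.metadata.name} ---\n{logs}")
    except ApiException:
        # No previous instance for this container (it hasn't restarted).
        pass
```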
There are a bunch of goroutine stack traces that I’m not sure how to interpret (with lines for StreamServerInterceptor.func1, subscribeToEndpointProfile, GetProfile, and getProfileByIP). I’m not sure if this is a dump as a result of failing the liveness check, or if this is what caused the pod to fail the liveness check.
I would file a GH issue with those log lines and a pointer to this discussion. Someone who is familiar with the control plane could probably quickly rule out whether they’re related to the liveness check failures.