When -- if ever -- should the control plane have more than 3 replicas?

neal · July 15, 2023, 6:19pm

I know HA mode sets the number of replicas for each of the control plane components to 3. Out of curiosity, when (if ever) would it make sense to increase that to even more replicas?

william · July 16, 2023, 8:23pm

Hi Neal. We rarely recommend scaling beyond 3 replicas, as there is a cost to data plane memory consumption as the number of control plane replicas increases. Our experience is that 3 is a good balance. HA mode uses node anti-affinity to distribute components across distinct nodes, so the failure case with 3 replicas is simultaneous failure or network partition of 3 distinct nodes, and it’s rare for that to be the salient failure mode in the system compared with, e.g., whole-cluster failure, AZ failure, or something else.
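A rough sketch of that reasoning, under the simplifying assumption that nodes fail independently with the same probability over some time window. The function name and the 1% figure are made up for illustration, and real failures (AZ or whole-cluster outages) are correlated, which is exactly why extra replicas help less than the numbers suggest:

```python
def prob_all_replicas_down(p_node: float, replicas: int) -> float:
    """Probability that every replica's node is down at once.

    Assumes one replica per node (anti-affinity) and independent node
    failures with the same probability p_node. Both are simplifications.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_node ** replicas


# Hypothetical per-node failure probability of 1% in a given window.
for n in (1, 3, 5):
    print(n, prob_all_replicas_down(0.01, n))
```

Going from 3 to 5 replicas shrinks a number that is already tiny under this model, while correlated failures are unaffected by the replica count.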

There are occasionally specific scenarios that call for more replicas, but unless you’re doing something out of the ordinary I’d be surprised if it was warranted. Hope that helps.

neal · July 17, 2023, 2:55pm

Thanks for this. A while back, we increased the number of replicas from 3 to 5 in our clusters. This was when we were having some issues with the control plane components restarting. IIRC, that was mostly due to the destination component OOMing (and so the correct thing to do there was to allocate more memory, which we also did).

However, I’m noticing today that destination pods still have occasional restarts which are not due to OOM. It looks like they occasionally fail readiness and liveness checks (due to timeouts). This doesn’t happen often enough to cause us any issues with 5 pods, but I’m vaguely nervous that dropping down to 3 might cause some instability for us. Given the rate at which the restarts happen, this is very unlikely, but still.

Is it expected that destination pods will occasionally (maybe once every other day) fail liveness checks?

Alen · July 17, 2023, 4:53pm

I would not expect to see sporadic failing liveness checks unless there’s something going on in the underlay that would explain probes not being able to get a response back (but I would also expect that to have a wider impact and be noticeable on other components outside of Linkerd as well).

Is there any correlation between pods and nodes (i.e., is it happening on certain nodes only?), time factors (is it happening around the same time every time?), etc.?
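One way to check for that kind of correlation is to tally restarts by node and by hour of day. This is only a minimal sketch: the `summarize_restarts` helper and the sample events are hypothetical, and you would need to collect the real (pod, node, timestamp) records yourself from your cluster's events or monitoring.

```python
from collections import Counter
from datetime import datetime


def summarize_restarts(events):
    """Count restarts per node and per hour of day.

    events: iterable of (pod_name, node_name, iso_timestamp) tuples.
    """
    by_node = Counter()
    by_hour = Counter()
    for _pod, node, timestamp in events:
        by_node[node] += 1
        by_hour[datetime.fromisoformat(timestamp).hour] += 1
    return by_node, by_hour


# Hypothetical sample data, not taken from a real cluster.
events = [
    ("linkerd-destination-a", "node-1", "2023-07-15T03:02:11"),
    ("linkerd-destination-b", "node-2", "2023-07-16T14:40:05"),
    ("linkerd-destination-a", "node-1", "2023-07-17T03:05:47"),
]

by_node, by_hour = summarize_restarts(events)
print("restarts per node:", by_node.most_common())
print("restarts per hour:", by_hour.most_common())
```

If one node or one hour dominates the counts, that points at the node or at something scheduled rather than at the destination component itself.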

Another thing to look out for is whether or not they’re all restarting at the same time. As long as one instance is in a Ready state, you could realistically drop all the other instances and be fine. They’re there primarily for redundancy (for resource scalability we recommend scaling vertically where possible) and even if all destination instances go down, existing connections would be able to reuse the cache without needing to call back home. Only new connections would be temporarily impacted.

william · July 17, 2023, 10:36pm

Anything in the logs when they fail the health checks?

neal · July 18, 2023, 12:11am

There are a bunch of goroutine stack traces that I’m not sure how to interpret (with lines for StreamServerInterceptor.func1 and subscribeToEndpointProfile and GetProfile and getProfileByIP). I’m not sure if this is a dump as a result of failing the liveness check, or if this is what caused the pod to fail the liveness check.

william · July 18, 2023, 5:07pm

I would file a GH issue with those log lines and a pointer to this discussion. Someone who is familiar with the control plane could probably quickly rule out whether they’re related to the liveness check.

neal · July 20, 2023, 1:41am

Done! Occasional control plane health check failures · linkerd/linkerd2 · Discussion #11135 · GitHub
