The linkerd-enterprise-windows-cni helm chart does not support tolerations. The normal use case, as I see it, is to install it on the Windows nodes of a k8s cluster that has both Linux and Windows nodes. The standard pattern when using Windows nodes is to always taint them with:
taints:
  - effect: NoSchedule
    key: kubernetes.io/os
    value: windows
But the linkerd-enterprise-windows-cni helm chart neither ships any default tolerations nor supports adding them through values.
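For reference, what I had in mind is being able to set something like this in the chart values (a sketch only: the tolerations key is my suggestion for what the chart could pass through to the DaemonSet pod spec, not something it accepts today):

tolerations:
  - key: kubernetes.io/os
    operator: Equal
    value: windows
    effect: NoSchedule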
I manually hacked it to test while waiting. The next problem is that we have already upgraded the control plane to 2.19.1, and the proxy-win image does not exist for that version. Can you please fix that as well?
But the linkerd-network-validator init container still fails when I try to mesh a pod:
2025-11-18T13:48:18.183963Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2025-11-18T13:48:18.184243Z DEBUG linkerd_network_validator: token="mytoken\n"
2025-11-18T13:48:18.184258Z INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2025-11-18T13:48:28.183712Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure traffic redirection rules are rewriting traffic as expected. timeout=10s
stream closed: EOF for horizons/metadata-api-6fd6b666cc-8zw5h (linkerd-network-validator)
@william I'm available to help you guys get this working. It is indeed an alpha version, but I had a bit higher hopes. Nothing much is working.
Thanks for the feedback. Going from top to bottom:
We had a problem with our build infra, which has now been fixed, so the 2.19.1 Windows image should now be updated.
Currently, the DS installer of the Windows CNI has a node selector with kubernetes.io/os: windows. Is that not working for you? Why is there a need for tainting?
At the moment, this plugin has been tested on AKS only, so it might have some assumptions embedded about how CNI plugins are chained on Windows nodes.
From the logs you've shared, it seems that no error is being thrown, yet the network is not being configured. Could you please share the following (I've sketched a few commands after the list that might help collect them):
The manifest of the workload that is being injected.
The environment you are trying to run it in.
The CNI plugin should have a log named linkerd2-windows-cni.log in the bin directory of the CNI plugin. Can you verify that this is the case and share its contents?
The configuration of any other CNI plugins that are installed on the node.
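If it helps, something along these lines should gather most of that (the on-node paths are assumptions based on a typical AKS Windows node; adjust them to wherever the CNI binaries and config actually live on your nodes):

# Manifest of the workload being injected (using the pod from your validator log)
kubectl -n horizons get pod metadata-api-6fd6b666cc-8zw5h -o yaml

# On the Windows node itself, from PowerShell:
Get-Content C:\k\azurecni\bin\linkerd2-windows-cni.log      # CNI plugin log (bin path assumed)
Get-Content C:\k\azurecni\netconf\*.conflist                # chained CNI plugin configuration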
Taints are needed because most Linux workloads do not set a nodeSelector. The default node pools in a cluster are Linux and untainted, so by default all pods land there. If Windows nodes are added without taints, those same Linux pods get scheduled onto the Windows nodes and chaos ensues.
Common practice for Windows nodes is to have:
taints:
  - effect: NoSchedule
    key: kubernetes.io/os
    value: windows
The rest of the log is filled with the same errors about fluentbit.
Actually, looking more at this cluster now, I see that the linkerd-windows-cni is causing fluentbit pods to be stuck in ContainerCreating state, with this error:
Warning FailedCreatePodSandBox 3m22s (x568 over 131m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "ac794de392a18710ced67f7c411efe9b86a88fd4318c135f33e8deeeeef0de7e": plugin type="linkerd2-windows-cni" name="0.4.0" failed (add): Unauthorized
That is a serious bug: the CNI plugin is causing problems for other, non-meshed applications. There are currently no meshed Windows pods on this cluster, and fluentbit is not intended to be meshed either.
We can get back to the injected application manifest after these other things are sorted out. The deployment manifests are very standard, with the tolerations and nodeSelector needed to get the pods scheduled on Windows nodes.
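Roughly, the scheduling-related part of each Windows deployment looks like this (a representative sketch of the pattern, not the exact manifest):

spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
        - key: kubernetes.io/os
          operator: Equal
          value: windows
          effect: NoSchedule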
The only other CNI on the nodes is the standard azure-cns-win.
Thanks a lot for the detailed response. Let's get this working!
OK, let's leave taints aside for a bit and sort out the other problems first. It appears to me that this is an RBAC issue where the CNI plugin does not have Kube API permissions to perform a GET on the pod that is being considered.
We can make things more fault tolerant so that this situation does not break pod creation. However, the question still stands: why does the plugin have no permissions?
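For reference, the kind of permission the plugin relies on is roughly this (an illustrative sketch; the actual ClusterRole and binding shipped with the chart may be named and scoped differently):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: linkerd-windows-cni        # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

A quick way to confirm what the plugin's identity is allowed to do is kubectl auth can-i get pods --as=system:serviceaccount:<cni-namespace>:<cni-serviceaccount>, substituting the actual namespace and service account used by the CNI installer.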
The installer is supposed to create a kubeconfig file for the plugin at C:\k\azurecni\netconf\linkerd-windows-cni-kubeconfig. Is this file present, and what does it look like? This file carries the credentials that determine the plugin's permissions. Can we take a look at it?
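For reference, that file should have the usual kubeconfig shape, roughly like this (all names and values here are placeholders; the actual entries the installer writes may differ):

apiVersion: v1
kind: Config
clusters:
  - name: linkerd-cni
    cluster:
      server: https://<kube-apiserver-address>
      certificate-authority-data: <base64-encoded CA>
users:
  - name: linkerd-cni
    user:
      token: <service account JWT>        # this is the token whose permissions matter
contexts:
  - name: linkerd-cni-context
    context:
      cluster: linkerd-cni
      user: linkerd-cni
current-context: linkerd-cni-context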
OK, can you share the exact environment you are using for this AKS cluster? I will try to reproduce the problem myself. Can you use this kubeconfig to perform any operations on the cluster, for example from your local machine? Have you changed anything about the RBAC that was installed as part of this CNI plugin?
Not sure what to answer about the AKS setup. Pretty standard with mixed node pools.
I did not change anything related to RBAC in the Linkerd CNI installation.
So I followed the tutorial and was not able to reproduce the problem. I suspect we have a bug in handling expired service account tokens for the plugin. Can you please validate the following: in your linkerd-windows-cni-kubeconfig file, check that the JWT token populated under linkerd-cni is not expired. If it is expired, do a rollout restart of your CNI installer DaemonSet and see whether that fixes the problem. But please check the expiration of the token first.
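In case it is useful, here is one way to check the expiry, assuming the token sits in a plain token: field in that file (run wherever you have a copy of the file and standard GNU tools):

# Pull the token out of the kubeconfig (field name assumed to be "token:")
TOKEN=$(grep 'token:' linkerd-windows-cni-kubeconfig | awk '{print $2}' | tr -d '"')
# The JWT payload is the second dot-separated segment, base64url-encoded without padding
PAYLOAD=$(printf '%s' "$TOKEN" | cut -d '.' -f 2 | tr '_-' '/+')
# Re-add padding so base64 can decode it, then read the "exp" claim from the output
PAD=$(( (4 - ${#PAYLOAD} % 4) % 4 ))
printf '%s%s' "$PAYLOAD" "$(printf '%*s' "$PAD" '' | tr ' ' '=')" | base64 -d; echo
# "exp" is in seconds since the epoch; if it is smaller than the current time, the token has expired
date +%s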