Linkerd CNI in AKS fails after the Calico pod gets restarted

Hello All,

We use Linkerd stable-2.14.1 with Linkerd CNI in AKS (1.25.6) with Calico CNI. We started seeing issues whenever the Calico pod on an AKS node gets restarted: afterwards the Linkerd CNI plugin is no longer available on that node, which means no iptables rules, which in turn means pods cannot proxy connections through the Linkerd sidecar proxies. Our current fix is to restart the Linkerd CNI pod, which updates the CNI configuration on the node to include the plugin again.
I was expecting the Linkerd CNI to watch /host/etc/cni/net.d/10-calico.conflist and re-apply itself when the file is changed by another CNI (in this case Calico), but this doesn’t happen and we have to do it manually. What can we do to overcome this issue?

Linkerd CNI pod logs:

Wrote linkerd CNI binaries to /host/opt/cni/bin
Installing CNI configuration in "chained" mode for /host/etc/cni/net.d/10-calico.conflist
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://X.X.X.X:__KUBERNETES_SERVICE_PORT__",
CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "debug",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://X.X.X.X:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": true
  }
}
Created CNI config /host/etc/cni/net.d/10-calico.conflist
Setting up watches.
Watches established.

Linkerd CNI pod after a successful installation and operating as expected:

root@linkerd-cni-f9cgt:/linkerd# cat /host/etc/cni/net.d/10-calico.conflist | jq .plugins[].type
"calico"
"bandwidth"
"portmap"
"linkerd-cni"

Linkerd CNI pod after the Calico pod on that node gets restarted:

root@linkerd-cni-4adat:/linkerd# cat /host/etc/cni/net.d/10-calico.conflist | jq .plugins[].type
"calico"
"bandwidth"
"portmap"

After more investigation I found that the Linkerd CNI script only monitors specific events (CREATE and DELETE), and these are not triggered when the Calico pod is restarted.

monitor() {
  inotifywait -m "${HOST_CNI_NET}" -e create,delete |
    while read -r directory action filename; do
      if [[ "$filename" =~ .*.(conflist|conf)$ ]]; then 
        echo "Detected change in $directory: $action $filename"
        sync "$filename" "$action" "$cni_conf_sha"
        # When file exists (i.e we didn't deal with a DELETE ev)
        # then calculate its sha to be used the next turn.
        if [[ -e "$directory/$filename" && "$action" != 'DELETE' ]]; then
          cni_conf_sha="$(sha256sum "$directory/$filename" | while read -r s _; do echo "$s"; done)"
        fi
      fi
    done
}

These are the events on the CNI directory when I reproduce the issue by killing calico-node on the node where Linkerd CNI runs:

root@linkerd-cni-9h2rc:/linkerd $ inotifywait -m /host/etc/cni/net.d                        
Setting up watches.
Watches established.
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist

I guess that if the Linkerd CNI script also monitored MODIFY events, that could solve this issue.
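A minimal sketch of that idea, not the actual Linkerd installer code. The names should_sync and monitor_sketch are hypothetical, and the sync step is left as a placeholder. It also assumes reacting to CLOSE_WRITE instead of MODIFY: in the trace above each rewrite of 10-calico.conflist produces two MODIFY events followed by one CLOSE_WRITE, so CLOSE_WRITE would fire once, after the write has finished.

```shell
#!/usr/bin/env bash
# Sketch only: should_sync and monitor_sketch are hypothetical names,
# not functions from the Linkerd CNI installer.

HOST_CNI_NET="${HOST_CNI_NET:-/host/etc/cni/net.d}"

# Succeeds (exit 0) when an inotify event should trigger a re-sync.
#   $1: comma-separated event names as printed by inotifywait,
#       e.g. "CLOSE_WRITE,CLOSE"
#   $2: name of the file the event refers to
should_sync() {
  local action="$1" filename="$2"
  local re='\.(conflist|conf)$'
  # Only CNI configuration files are of interest.
  [[ "$filename" =~ $re ]] || return 1
  # Wrap in commas so each event name is matched as a whole token.
  case ",${action}," in
    *,CREATE,* | *,DELETE,* | *,CLOSE_WRITE,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Watches the CNI directory and reports the events that would need a re-sync.
monitor_sketch() {
  inotifywait -m "${HOST_CNI_NET}" -e create,delete,close_write |
    while read -r directory action filename; do
      if should_sync "$action" "$filename"; then
        echo "Detected change in $directory: $action $filename"
        # Placeholder: the real script would call its sync logic here.
      fi
    done
}
```

Running monitor_sketch inside the Linkerd CNI pod should print a line for the CLOSE_WRITE on 10-calico.conflist and ignore calico-kubeconfig. Any real change would also have to avoid reacting to the installer's own writes to the conflist, which is presumably what the cni_conf_sha comparison in the existing script is for.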

Hi @mohammed.elmaleeh, thanks for the write-up! I had a look through Calico’s code and it does indeed seem that they do a write(). Most distributions write to a temp file and then move it when modifying the configuration file; we already cover these two events in the CNI plugin code. (Note: if you looked at the code in the linkerd2 repo, we’ve actually moved the CNI installer to a different location.)

It would make sense to me to add modify. I’m surprised people haven’t run into this before. Do you mind creating an issue in the linkerd2 repo?

Hi @matei, thanks for your reply and the thorough investigation. Here you go