linkerd-identity-issuer not refreshing certificates as expected

Recently we’ve started facing an issue with outbound calls for some pods when they try to connect to services within the same cluster.

We can see the following logs in the proxy:

[9515905.481616s] WARN ThreadId(01) outbound:proxy{addr=10.247.15.20:4191}:forward{addr=10.247.15.20:4191}: linkerd_reconnect: Failed to connect error=invalid peer certificate: Expired
[9515905.500380s] WARN ThreadId(01) outbound:proxy{addr=10.247.59.3:9996}:forward{addr=10.247.59.3:9996}: linkerd_reconnect: Failed to connect error=invalid peer certificate: Expired
[9515905.500801s] WARN ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.247.59.3:8090}: linkerd_reconnect: Failed to connect error=endpoint 10.247.59.3:8090: invalid peer certificate: Expired error.sources=[invalid peer certificate: Expired]

Further investigation revealed that there had been a cert renewal for the linkerd-identity service, and a few pods had not had their certs refreshed for more than 2 months, including some linkerd-destination and proxy pods.

sum(control_identity_cert_refresh_timestamp_seconds) by (pod) < (time() - 60 * 24 * 3600)

According to the docs these certs should be refreshed every 24 hours, and for most other pods this seems to work fine, even within the same deployment.
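For a single suspect pod, the raw numbers can also be pulled straight from the proxy’s admin endpoint on port 4191. A minimal sketch (namespace and pod name illustrative, and assuming the identity cert metrics use the same naming as the query above):

# Forward the linkerd-proxy admin port of a suspect pod (names illustrative)
kubectl -n my-namespace port-forward pod/my-app-7d9f8c6b5-abcde 4191:4191 &

# Dump the identity cert metrics exposed by the proxy
curl -s http://localhost:4191/metrics | grep identity_cert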

We don’t see any errors in the linkerd-identity-issuer pods either.
To recover, we had to delete the impacted pods, as I don’t see a way to force a cert refresh via the identity service.
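Concretely, recovery was just a restart of the impacted workloads so their proxies re-bootstrap identity; a sketch of what we ran (workload names illustrative):

# Roll the impacted meshed workload (deployment name illustrative)
kubectl -n my-namespace rollout restart deploy/my-app

# Or delete the affected pods directly and let the controller recreate them
kubectl -n my-namespace delete pod my-app-7d9f8c6b5-abcde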

Can someone help with further debugging, or is this a Linkerd issue?

Setup:

We used the official Linkerd Helm chart to install Linkerd without any custom tuning.
version: edge-24.11.5
k8s version: 1.31
cert-manager for automatic cert renewal (rough Certificate sketch below)
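For reference, the issuer Certificate managed by cert-manager follows the usual pattern from the Linkerd docs; a rough sketch, not our exact manifest (issuer name and durations illustrative):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h      # issuer cert lifetime (illustrative)
  renewBefore: 25h   # renew well before expiry (illustrative)
  issuerRef:
    name: linkerd-trust-anchor   # CA issuer name (illustrative)
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  dnsNames:
  - identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
  - cert sign
  - crl sign
  - server auth
  - client auth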

Hey @Shubham — yes, restarting pods is indeed how to force a reissue.

I’m more curious about the identity issuer rotation – how was that done? Did the control plane get restarted afterward?

Hi @Flynn, thanks for the reply.

We use automatic cert rotation via cert-manager. That shouldn’t require a control plane restart unless it’s the trust anchor. During the issue we restarted the control plane in the sequence identity-issuer >> destination/proxy; otherwise linkerd-proxy was throwing a postStartHook error when trying to connect to the control plane pods.
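For the record, the restart itself was nothing special; roughly the following (deployment names as in the default Helm install):

# Restart identity first so it serves certs signed by the current issuer
kubectl -n linkerd rollout restart deploy/linkerd-identity
kubectl -n linkerd rollout status deploy/linkerd-identity

# Then the rest of the control plane, then the affected workload proxies
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout restart deploy/linkerd-proxy-injector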

Hi @Flynn, we faced another instance of the same problem in another cluster. This time there had been no recent cert renewal for any of the controller pods.

[706995.441763s] ERROR ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_proxy_identity_client::certify: Failed to obtain identity error=status: Unknown, message: "controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: endpoint 10.200.39.163:8080: connection error: received fatal alert: CertificateExpired", details: [], metadata: MetadataMap { headers: {} } error.sources=[controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: endpoint 10.200.39.163:8080: connection error: received fatal alert: CertificateExpired, endpoint 10.200.39.163:8080: connection error: received fatal alert: CertificateExpired, connection error: received fatal alert: CertificateExpired, received fatal alert: CertificateExpired]
[707002.681759s]  INFO ThreadId(01) outbound:proxy{addr=10.20.233.112:8080}:service{ns=spr-apps name=live-reporting-ms-tier1-svc port=8080}:endpoint{addr=10.200.5.23:8080}:rescue{client.addr=10.200.49.6:41100}: linkerd_app_core::errors::respond: gRPC request failed error=endpoint 10.200.5.23:8080: connection error: received fatal alert: CertificateExpired error.sources=[connection error: received fatal alert: CertificateExpired, received fatal alert: CertificateExpired]

Resolved after the same restart sequence as above.
Still not sure why it happened to these specific workloads; there was no notable resource throttling on them. Can you suggest a method or metric to detect or debug this in the future? Thanks.
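To make the question concrete: is something along these lines the right idea for detection? A sketch of a Prometheus alerting rule, reusing the same metric as the query above (max rather than sum, since it’s a timestamp; the 48h threshold is illustrative given the documented ~24h refresh):

groups:
- name: linkerd-identity
  rules:
  - alert: LinkerdIdentityCertNotRefreshed
    # Fires for pods whose identity cert refresh timestamp is older than 48h
    expr: max by (pod) (control_identity_cert_refresh_timestamp_seconds) < (time() - 48 * 3600)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} has not refreshed its Linkerd identity cert in over 48h"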