We seem to have the same problem, except that we set extremely short rotation cycles so that we could check everything early on. The trust anchor certificate has a duration of 24h, the issuer cert 6h. In the beginning, I also set a custom renewBefore in the certificates, but now I'm using the default setting of cert-manager. It worked for ~1 month, and since Dec 28th it keeps breaking. I restart everything, it works, and the next day I see that it broke again.
Our certificates are auto-renewed by cert-manager, and the linkerd-identity pods use a ConfigMap controlled by a Bundle (using trust-manager) as the trust-roots volume.
When it breaks, linkerd check looks like this:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs
√ cluster networks contains all pods
√ cluster networks contains all services
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
‼ trust anchors are valid for at least 60 days
Anchors expiring soon:
* 26318400229796391699423318841621534888 root.linkerd.cluster.local will expire on 2024-01-09T20:22:44Z
see https://linkerd.io/2.14/checks/#l5d-identity-trustAnchors-not-expiring-soon for hints
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2024-01-09T14:22:45Z
see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
‼ proxy-injector cert is valid for at least 60 days
certificate will expire on 2024-01-09T14:22:45Z
see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
√ sp-validator webhook has valid cert
‼ sp-validator cert is valid for at least 60 days
certificate will expire on 2024-01-09T14:22:45Z
see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
√ policy-validator webhook has valid cert
‼ policy-validator cert is valid for at least 60 days
certificate will expire on 2024-01-09T14:22:52Z
see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
linkerd-jaeger
--------------
√ linkerd-jaeger extension Namespace exists
√ jaeger extension pods are injected
√ jaeger injector pods are running
‼ jaeger extension proxies are healthy
Some pods do not have the current trust bundle and must be restarted:
* jaeger-injector-57fc7b6c56-76f74
see https://linkerd.io/2.14/checks/#l5d-jaeger-proxy-healthy for hints
√ jaeger extension proxies are up-to-date
√ jaeger extension proxies and cli versions match
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
‼ tap API server cert is valid for at least 60 days
certificate will expire on 2024-01-09T14:22:50Z
see https://linkerd.io/2.14/checks/#l5d-tap-cert-not-expiring-soon for hints
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
‼ viz extension proxies are healthy
Some pods do not have the current trust bundle and must be restarted:
* metrics-api-787845b69d-k7fsf
* tap-6f5d8b9448-jsh9z
* tap-injector-687fcdff96-tkrkh
* web-647c9c5855-9shtm
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-healthy for hints
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ viz extension self-check
Status check results are √
I run it in HA mode. An example log from one of the identity pods:
...
...
2024-01-08T16:43:54.880894096Z time="2024-01-08T16:43:54Z" level=info msg="issued certificate for tempo.tracing.serviceaccount.identity.linkerd.cluster.local until 2024-01-08 22:22:45 +0000 UTC: 2aa8e1fcfd4adc285b50ece886a7e057da6126a07c71efca539f49d102befb03"
2024-01-08T16:43:56.835814041Z time="2024-01-08T16:43:56Z" level=info msg="issued certificate for default.opentelemetry-operator-system.serviceaccount.identity.linkerd.cluster.local until 2024-01-08 22:22:45 +0000 UTC: 14b054e98c1a4e3ccbf1eae6510391fb35dd962c3d9930ace6789c497af75baf"
2024-01-08T20:22:54.115974712Z time="2024-01-08T20:22:54Z" level=info msg="Updated identity issuer"
2024-01-08T20:41:04.306370574Z time="2024-01-08T20:41:04Z" level=info msg="issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 02:22:45 +0000 UTC: 7cc39a37029fc8e0372cd002b9bcdea3871fbcdf06564d40552613f6e8ac4e13"
2024-01-08T20:41:04.407563899Z time="2024-01-08T20:41:04Z" level=info msg="issued certificate for linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 02:22:45 +0000 UTC: 6a935eaeffedd2cab9f71c91aaab93eae6b9dcdc57983872d87880b7b409e9a2"
20...
2024-01-08T20:41:06.571006602Z time="2024-01-08T20:41:06Z" level=info msg="issued certificate for default.opentelemetry-operator-system.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 02:22:45 +0000 UTC: 588c673ae9b5c03d818179bf96ba705ab0a69daf12594daf9dd5ddc9da6f0461"
2024-01-09T00:23:36.152488605Z time="2024-01-09T00:23:36Z" level=info msg="Updated identity issuer"
2024-01-09T00:40:14.819432718Z time="2024-01-09T00:40:14Z" level=info msg="issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 06:22:45 +0000 UTC: dfe09fe91401cc81a620454be3ce8dec3194a06e509576699df86a9283f00cde"
2024-01-09T00:40:14.841552376Z time="2024-01-09T00:40:14Z" level=info msg="issued certificate for linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 06:22:45 +0000 UTC: 0f4df6b33dc748b525d6ac2708465495e573913d72ba7c113d581cb29922e6c2"
...
2024-01-09T00:40:15.484241474Z time="2024-01-09T00:40:15Z" level=info msg="issued certificate for default.opentelemetry-operator-system.serviceaccount.identity.linkerd.cluster.local until 2024-01-09 06:22:45 +0000 UTC: 6a1b57377bc9fa447f798bac314626be1d3cdd56543363870cad44e0b5c4f992"
2024-01-09T04:23:06.164455375Z time="2024-01-09T04:23:06Z" level=warning msg="Skipping issuer update as certs could not be read from disk: failed to verify issuer credentials for 'identity.linkerd.cluster.local' with trust anchors: x509: certificate has expired or is not yet valid: current time 2024-01-09T04:23:06Z is after 2024-01-09T04:22:44Z - Current Time : 2024-01-09 04:23:06.163801424 +0000 UTC m=+76664.387505192 - Invalid before 2024-01-09 04:22:45 +0000 UTC - Invalid After 2024-01-09 10:22:45 +0000 UTC"
2024-01-09T04:39:59.952187568Z time="2024-01-09T04:39:59Z" level=error msg="could not process CSR because of CA cert validation failure: x509: certificate has expired or is not yet valid: current time 2024-01-09T04:39:59Z is after 2024-01-09T04:22:44Z - Current Time : 2024-01-09 04:39:59.951553053 +0000 UTC m=+77678.175256790 - Invalid before 2024-01-09 00:22:45 +0000 UTC - Invalid After 2024-01-09 06:22:45 +0000 UTC - CSR Identity : linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"
2024-01-09T04:40:00.072902481Z time="2024-01-09T04:40:00Z" level=error msg="could not process CSR because of CA cert validation failure: x509: certificate has expired or is not yet valid: current time 2024-01-09T04:40:00Z is after 2024-01-09T04:22:44Z - Current Time : 2024-01-09 04:40:00.072337696 +0000 UTC m=+77678.296041434 - Invalid before 2024-01-09 00:22:45 +0000 UTC - Invalid After 2024-01-09 06:22:45 +0000 UTC - CSR Identity : default.component-sia-platform.serviceaccount.identity.linkerd.cluster.local"
...
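To make the timing in the warning easier to read, here is a small Python sketch that pulls the timestamps out of the message and compares them. The message text is copied (shortened) from the log above; the helper names `parse_utc` and `summarize` are made up for illustration and are not part of Linkerd.

```python
import re
from datetime import datetime, timezone

# Shortened copy of the warning message from the identity log above.
MSG = (
    "current time 2024-01-09T04:23:06Z is after 2024-01-09T04:22:44Z"
    " - Invalid before 2024-01-09 04:22:45 +0000 UTC"
    " - Invalid After 2024-01-09 10:22:45 +0000 UTC"
)


def parse_utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' or 'YYYY-MM-DDTHH:MM:SSZ' as a UTC datetime."""
    cleaned = text.replace("T", " ").rstrip("Z")
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def summarize(msg):
    """Extract the four timestamps carried by the warning message."""
    now, expired = re.search(r"current time (\S+) is after (\S+)", msg).groups()
    not_before = re.search(r"Invalid before (\S+ \S+)", msg).group(1)
    not_after = re.search(r"Invalid After (\S+ \S+)", msg).group(1)
    return {
        "now": parse_utc(now),
        "expired_at": parse_utc(expired),
        "issuer_not_before": parse_utc(not_before),
        "issuer_not_after": parse_utc(not_after),
    }


info = summarize(MSG)
# How long ago the certificate reported as expired ran out.
print("expired for:", info["now"] - info["expired_at"])
# Whether the issuer window printed in the message covers the current time.
print("issuer window covers now:",
      info["issuer_not_before"] <= info["now"] <= info["issuer_not_after"])
```

The issuer window printed in the message covers the current time, while the certificate reported as expired ran out one second before that window starts. My reading, which is only an assumption, is that the expired certificate is not the new issuer itself but something else in the chain, for example an older trust anchor still loaded in the pod.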
The logs of the linkerd-proxy containers look like this:
[ 93518.182957s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.245.203.6:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.245.203.6:8080: invalid peer certificate: Expired error.sources=[invalid peer certificate: Expired]
[ 93518.390518s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_stack::failfast: Service entering failfast after 3s
[ 93518.390623s] ERROR ThreadId(02) identity: linkerd_proxy_identity_client::certify: Failed to obtain identity error=status: Unknown, message: "controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast", details: [], metadata: MetadataMap { headers: {} } error.sources=[controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast, service in fail-fast]
[ 93528.395376s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.245.141.170:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.245.141.170:8080: invalid peer certificate: Expired error.sources=[invalid peer certificate: Expired]
Our issuer cert looks like this:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      ...
  creationTimestamp: "2023-10-13T12:08:46Z"
  generation: 2
  managedFields:
  - ...
    manager: cert-manager-certificates-readiness
    operation: Update
    subresource: status
    time: "2024-01-09T08:22:48Z"
  name: linkerd-identity-issuer
  namespace: linkerd
  resourceVersion: "366505900"
  uid: 7b9373c2-a172-4cc5-be1d-660e49629bd2
spec:
  commonName: identity.linkerd.cluster.local
  duration: 6h0m0s
  isCA: true
  issuerRef:
    kind: ClusterIssuer
    name: linkerd-trust-anchor
  privateKey:
    algorithm: ECDSA
  secretName: linkerd-identity-issuer
  usages:
  - cert sign
  - crl sign
  - server auth
  - client auth
status:
  conditions:
  - lastTransitionTime: "2024-01-05T12:22:48Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 2
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2024-01-09T14:22:45Z"
  notBefore: "2024-01-09T08:22:45Z"
  renewalTime: "2024-01-09T12:22:45Z"
  revision: 1033
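As a sanity check on the renewal timing, here is a short Python sketch that recomputes the renewal time from the notBefore and notAfter values in the status. It assumes that, with no renewBefore set, cert-manager schedules renewal at 2/3 of the certificate lifetime; the function names are only illustrative.

```python
from datetime import datetime, timezone


def parse_rfc3339(ts):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as a UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def default_renewal_time(not_before, not_after):
    """Renewal time under the assumed default: 2/3 of the way through the lifetime."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


# Values copied from the Certificate status above.
not_before = parse_rfc3339("2024-01-09T08:22:45Z")
not_after = parse_rfc3339("2024-01-09T14:22:45Z")
reported_renewal = parse_rfc3339("2024-01-09T12:22:45Z")

computed = default_renewal_time(not_before, not_after)
print("computed renewal time:", computed.isoformat())
print("matches status.renewalTime:", computed == reported_renewal)
print("time left after renewal:", not_after - computed)
```

With a 6h duration this leaves a 2h window between renewal and expiry, which fits the roughly 4h spacing of the "Updated identity issuer" lines in the identity log.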
I would be very grateful for any help, and am happy to give more information.