Stat output for my meshed services:
linkerd viz stat deployments -n test
Stat output for linkerd namespace:
linkerd viz stat deployments -n linkerd
So based on what you wrote, it seems to be an issue with the metrics-api component.
I checked the logs of the metrics-api pod, and I see some strange errors that I am not sure how to interpret:
[109361.766619s] WARN ThreadId(01) outbound:proxy{addr=10.100.0.1:443}:balance{addr=kubernetes.default.svc.cluster.local:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.100.152.66:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.100.152.66:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[109361.785740s] WARN ThreadId(01) watch{port=8085}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.100.152.66:8090}: linkerd_reconnect: Service failed error=endpoint 172.100.152.66:8090: channel closed error.sources=[channel closed]
[109446.499670s] INFO ThreadId(01) outbound:proxy{addr=172.100.132.234:80}:forward{addr=172.100.132.234:80}:rescue{client.addr=172.100.147.48:36406}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.132.234:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[109522.039108s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:35808
[109528.056392s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:37952
[109546.050998s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:40446
[109564.061668s] INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}:forward{addr=172.100.125.241:80}:rescue{client.addr=172.100.147.48:44052}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.125.241:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[109570.127442s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:53298
[109610.133121s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:42806
[109684.030966s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:40124
[109690.129976s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:44146
[109717.413548s] INFO ThreadId(01) outbound:proxy{addr=172.100.149.145:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s
[148243.331630s] INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s
[148455.437941s] INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s
Any ideas what those are about?
UPDATE: While trying various things to troubleshoot this, I noticed the following in my logs (it comes from the web component of the viz dashboard):
time="2023-05-04T08:49:04Z" level=error msg="rpc error: code = Unknown desc = Query failed: \"histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction=\\\"inbound\\\", namespace=\\\"test\\\"}[1m])) by (le, namespace, replicationcontroller))\": server_error: server error: 504"
From the linkerd-proxy container running in the same pod as the metrics API:
[153146.204358s] INFO ThreadId(01) outbound:proxy{addr=172.100.132.234:80}:forward{addr=172.100.132.234:80}:rescue{client.addr=172.100.147.48:58496}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.132.234:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
And from the metrics API logs (probably related):
time="2023-05-04T08:49:04Z" level=error msg="queryProm failed with: Query failed: \"histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction=\\\"inbound\\\", namespace=\\\"test\\\"}[1m])) by (le, namespace, replicationcontroller))\": server_error: server error: 504"
All three errors happened at the same time, so they may be related. Is there a setting I can use to increase this 1-second timeout? Although 1 second is already a lot, so something else is probably wrong in the setup.
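For reference, the 1 s limit looks like the proxy's outbound connect timeout. Assuming the proxy connect-timeout annotations are available in my Linkerd version (the annotation name below is from memory, so treat this as a sketch and verify it against the proxy configuration reference), I could raise it on the metrics-api pod with something like:

```
# Sketch: raise the proxy's outbound TCP connect timeout (default 1s) on the
# metrics-api pod. The annotation name is an assumption -- check the proxy
# configuration docs for your Linkerd version before applying.
kubectl -n linkerd-viz patch deploy/metrics-api --type merge -p '
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-outbound-connect-timeout: "5000ms"'
```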
Please note, though, that the Prometheus errors shown above do not happen regularly; they seem to occur at random. When I refresh the viz dashboard to get a new view, I only occasionally see these errors in the log. So I am not sure whether this is purely an issue between viz and Prometheus.
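One way I have been trying to narrow this down is to bypass viz entirely and run the failing query against Prometheus directly (the service and namespace names assume a default linkerd-viz install):

```
# Port-forward to the viz-bundled Prometheus (default install names assumed)
kubectl -n linkerd-viz port-forward svc/prometheus 9090:9090 &

# Re-run the exact query that failed with a 504, via the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction="inbound", namespace="test"}[1m])) by (le, namespace, replicationcontroller))'
```

If the query is also slow or times out when run this way, the problem is on the Prometheus side rather than between viz and the proxy.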
Another abnormal thing is that I notice TLS errors for the tap pod ONLY.
Initially I did not configure any mTLS for the webhooks myself and let Helm manage it. Then, while trying to resolve this, I configured mTLS for the webhooks manually, following this guide: Automatically Rotating Webhook TLS Credentials | Linkerd
But no matter what I try (I also tried `--set tapInjector.injectCaFrom=linkerd-viz/linkerd-tap-injector`, pointing at the custom Certificate resource, instead of `--set-file tap.caBundle=ca.crt`), I still get this error in the logs, for the tap pod ONLY:
http: TLS handshake error from 172.100.157.9:40638: EOF
This is super weird, and I am not sure whether it affects anything or not!
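In case it is relevant, this is roughly how I have been comparing the CA bundle registered for the tap APIService with the certificate the tap pod actually serves (the APIService and secret names assume a default linkerd-viz install, so double-check them in your cluster):

```
# CA bundle the kube-apiserver uses to verify the tap API service
kubectl get apiservice v1alpha1.tap.linkerd.io \
  -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -subject -dates

# Certificate the tap pod is actually serving (secret name from a default viz install)
kubectl -n linkerd-viz get secret tap-k8s-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -issuer -subject -dates
```

If the issuer of the serving certificate does not chain up to that CA bundle, a handshake failure like the EOF above is what I would expect to see.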