Cannot see HTTP stats in viz dashboard

Hi guys. I have installed Linkerd in my k8s cluster via Helm, along with linkerd CNI and viz (also via Helm). I have an external Prometheus instance, which I have configured to scrape metrics from Linkerd (including the linkerd-proxy). Everything seems to work: I run linkerd viz dashboard and I can see some stats in the dashboard, but the HTTP stats are missing. The stats are in my Prometheus instance (e.g. request_total is there, along with other stats that I do see in the Linkerd dashboard, like sum(tcp_open_connections{direction="inbound", namespace="test"}) by (namespace, cronjob)).
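For reference, an HTTP-level query also returns data when run directly against Prometheus (the namespace here is just an example):

sum(rate(request_total{direction="inbound", namespace="test"}[1m])) by (namespace)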

Here is what I see (even for the linkerd namespace I do not see those HTTP stats). I can only see TCP stats.

However, after importing the Linkerd Grafana dashboards from grafana.com into my external Grafana, I can see some of the ‘missing’ stats, like:

So it seems that my Prometheus scrapes the linkerd-proxy fine, and Grafana can even display the stats, but I cannot see most of them in the viz dashboard.

Any ideas what I may have missed? I have gone through the docs many times, and through some GitHub issues as well, but no luck.

PS: I run my cluster on EKS. I also exposed viz dashboard via ingress (although I didn’t expect this to affect anything).

Many thanks for reading!


hi @babis

Are you setting the prometheusUrl value to point to your prometheus instance when installing Linkerd viz?
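For example, something along these lines (the Prometheus URL is a placeholder for your own instance):

helm install linkerd-viz linkerd/linkerd-viz \
  -n linkerd-viz --create-namespace \
  --set prometheus.enabled=false \
  --set prometheusUrl=http://prometheus.monitoring.svc.cluster.local:9090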

Yes, we have set this:

I am not sure if this relates (and if so, how?), but I see this when I run linkerd check:

‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2.13/checks/#l5d-viz-prometheus for hints

but the checks succeed overall:

Status check results are √

Have you got the CRDs installed? They can be installed as a Helm chart from the same repository with the chart name linkerd-crds.

Yes, I installed both the CRDs and the CNI before the control plane.

Firstly, I did:

helm install linkerd-cni -n linkerd-cni --create-namespace linkerd/linkerd2-cni

then:

helm install linkerd-crds linkerd/linkerd-crds -n linkerd --create-namespace --set cniEnabled=true

and then:

helm install linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=mtls/current/ca.crt \
  --set identity.issuer.scheme=kubernetes.io/tls \
  --set cniEnabled=true \
  -f values-custom.yaml \
  -f values-ha-custom.yaml \
  linkerd/linkerd-control-plane

My values-custom.yaml:

cniEnabled: true

# I have modified this to include the IPs of our VPC, because the check "cluster networks contains all pods"
# was failing: one pod was not within the default CIDR ranges. This was weird, as all other pods are on the
# same subnet. STRANGE!
clusterNetworks: "10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16,172.100.0.0/16"

controllerLogLevel: info
#controllerLogLevel: debug

proxy:
#  logLevel: warn,linkerd=info,trust_dns=error
  logLevel: info,linkerd=info,trust_dns=info

nodeSelector:
  kubernetes.io/os: linux
  usage: xxxx-linkerd-nodes
  subnet_type: yyyyy
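(Regarding the clusterNetworks comment above: one quick way to list every pod IP in the cluster and spot any outside the default ranges, as a sketch:)

kubectl get pods -A -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | sort -u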

and my values-ha-custom.yaml:

enablePodDisruptionBudget: true

deploymentStrategy:
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 25%

enablePodAntiAffinity: true

proxy:
  resources:
    cpu:
      request: 100m
    memory:
      limit: 250Mi
      request: 20Mi

controllerReplicas: 3
controllerResources: &controller_resources
  cpu: &controller_resources_cpu
    limit: ""
    request: 100m
  memory:
    limit: 250Mi
    request: 50Mi
destinationResources: *controller_resources

identityResources:
  cpu: *controller_resources_cpu
  memory:
    limit: 250Mi
    request: 10Mi

heartbeatResources: *controller_resources

proxyInjectorResources: *controller_resources
webhookFailurePolicy: Fail

spValidatorResources: *controller_resources

The strange thing, as I see it, is that I can see TCP stats but not the HTTP ones.

At a glance I can’t see anything untoward. Do you get any other errors in the linkerd check output?

Here are a few things I’d do to narrow the problem down:

  • Try looking at HTTP stats on the command line with the linkerd viz stat command. If this command yields stats, we know it’s some issue with the dashboard. If not, it’s likely an issue with the viz metrics-api deployment.
  • Take a look at the logs of the metrics-api deployment to see if it has any errors querying prometheus.
  • If all else fails, run the metrics-api with its log level set to debug to see the actual prometheus queries it’s making, then try making those exact same queries manually against prometheus and compare the results (a rough sketch of all three steps follows below).
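For example (the namespace and deployment names assume a default viz install, and the -log-level flag is an assumption to verify against the container’s args):

# 1. HTTP stats straight from the CLI
linkerd viz stat deployments -n test

# 2. metrics-api logs, looking for errors when querying prometheus
kubectl logs -n linkerd-viz deploy/metrics-api -c metrics-api

# 3. raise the metrics-api log level to debug to see the raw prometheus queries it sends
kubectl -n linkerd-viz edit deploy/metrics-api   # then set -log-level=debug on the metrics-api container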

Those are the only warnings I get from linkerd check:

linkerd-identity
----------------
...
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-05-06T06:05:29Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

This one is OK; the certificate is renewed automatically by cert-manager.

‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2.13/checks/#l5d-viz-prometheus for hints

I am not sure about this one. Should I declare some ClusterRoles for Prometheus myself? I have the internal Prometheus disabled, so my guess is that it complains because those roles were never installed. Or is it something else?
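One way to confirm, for what it’s worth:

kubectl get clusterrole linkerd-linkerd-viz-prometheus
# expected to report NotFound here, matching the check output above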

And finally, this one, which is optional (I think):

buoyant-cloud
-------------
‼ Linkerd health ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-health for hints
‼ Linkerd vulnerability report ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-vulnerability-report for hints
‼ Linkerd data plane upgrade assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-data-plane-upgrade for hints
‼ Linkerd trust anchor rotation assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-trust-anchor-rotation for hints

Overall, the command succeeds:

Status check results are √

Stat output for my meshed services:

linkerd viz stat deployments -n test

Stat output for linkerd namespace:

linkerd viz stat deployments -n linkerd

So based on what you wrote, it seems to be an issue with the metrics-api component.

I am checking the logs of the metrics-api pod, and I see some weird errors that I am not sure how to interpret:

[109361.766619s]  WARN ThreadId(01) outbound:proxy{addr=10.100.0.1:443}:balance{addr=kubernetes.default.svc.cluster.local:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.100.152.66:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.100.152.66:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[109361.785740s]  WARN ThreadId(01) watch{port=8085}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.100.152.66:8090}: linkerd_reconnect: Service failed error=endpoint 172.100.152.66:8090: channel closed error.sources=[channel closed]
[109446.499670s]  INFO ThreadId(01) outbound:proxy{addr=172.100.132.234:80}:forward{addr=172.100.132.234:80}:rescue{client.addr=172.100.147.48:36406}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.132.234:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[109522.039108s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:35808
[109528.056392s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:37952
[109546.050998s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:40446
[109564.061668s]  INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}:forward{addr=172.100.125.241:80}:rescue{client.addr=172.100.147.48:44052}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.125.241:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[109570.127442s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:53298
[109610.133121s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:42806
[109684.030966s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:40124
[109690.129976s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connect timed out after 1s client.addr=172.100.147.48:44146
[109717.413548s]  INFO ThreadId(01) outbound:proxy{addr=172.100.149.145:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s
[148243.331630s]  INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s
[148455.437941s]  INFO ThreadId(01) outbound:proxy{addr=172.100.125.241:80}: linkerd_detect: Continuing after timeout: linkerd_proxy_http::version::Version protocol detection timed out after 10s

Any ideas what those are about?
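(In case it helps, this is how I can map the endpoint IPs from those errors back to pods; the IP below is taken from the first log line:)

kubectl get pods -A -o wide | grep 172.100.152.66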

UPDATE: While trying various things to troubleshoot this, I noticed this in my logs (it comes from the web component of the viz dashboard):

time="2023-05-04T08:49:04Z" level=error msg="rpc error: code = Unknown desc = Query failed: \"histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction=\\\"inbound\\\", namespace=\\\"test\\\"}[1m])) by (le, namespace, replicationcontroller))\": server_error: server error: 504"

from linkerd-proxy (running on the same pod as the metrics api, see error below):

[153146.204358s]  INFO ThreadId(01) outbound:proxy{addr=172.100.132.234:80}:forward{addr=172.100.132.234:80}:rescue{client.addr=172.100.147.48:58496}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 172.100.132.234:80: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]

and from the metrics api logs (probably related):

time="2023-05-04T08:49:04Z" level=error msg="queryProm failed with: Query failed: \"histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction=\\\"inbound\\\", namespace=\\\"test\\\"}[1m])) by (le, namespace, replicationcontroller))\": server_error: server error: 504"

All three errors happened at the same time, so they may be related. Is there a configuration setting I can use to increase this 1 second connect timeout? Although 1 second is already a lot, so probably something else is wrong in the setup.

Please note, though, that those Prometheus errors shown above do not happen regularly; they seem to happen randomly. When I refresh the viz dashboard to get a new view, I only occasionally see those errors in the log. So I am not sure if this is purely an issue between viz and Prometheus.
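One thing worth trying (as suggested earlier) is replaying the exact same query directly against Prometheus with its HTTP API, to see whether the 504 comes from Prometheus itself; the URL below is a placeholder for my external instance:

curl -sG 'http://prometheus.example.internal:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{direction="inbound", namespace="test"}[1m])) by (le, namespace, replicationcontroller))'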

Another abnormal thing is that I notice TLS errors for the tap pod ONLY.

Initially, I did not set up any TLS for the webhooks myself and let Helm manage it. Then, while trying to resolve this, I configured the webhook TLS manually, using this guide: Automatically Rotating Webhook TLS Credentials | Linkerd

But no matter what I try (I also tried --set tapInjector.injectCaFrom=linkerd-viz/linkerd-tap-injector, pointing at the custom Certificate resource, instead of --set-file tap.caBundle=ca.crt), I still get this error in the logs, for the tap pod ONLY:

http: TLS handshake error from 172.100.157.9:40638: EOF

This is super weird. I am not sure whether this affects anything or not!
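(For reference, the tap extension registers an aggregated APIService; I believe the default name is v1alpha1.tap.linkerd.io, so its registration and status can be inspected with:)

kubectl describe apiservice v1alpha1.tap.linkerd.io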

It looks like your metrics-api pod isn’t able to connect to your Prometheus instance for some reason. I’d double-check that your Prometheus is configured to listen on port 80 and that it’s not expecting TLS. Another thing to possibly check is that you don’t have any firewall rules or anything similar which would prevent metrics-api from connecting to Prometheus. A quick in-cluster connectivity check is sketched below.
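For example, a throwaway curl pod inside the cluster can confirm basic reachability and whether the endpoint speaks plain HTTP (the Prometheus URL is a placeholder):

kubectl run prom-check -n linkerd-viz --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sv http://prometheus.monitoring.svc.cluster.local:9090/api/v1/status/buildinfo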

My Prometheus works over plain HTTP, as I use it from my other Grafana instance. I don’t think there is a firewall issue: everything is within the same VPC and the rules allow access between components. Also, if it were a firewall issue I wouldn’t receive a 504 response from Prometheus, so the metrics-api pod is able to talk to Prometheus.

I have also noticed that some stats are not accurate in the Grafana dashboard I downloaded from grafana.com. For example, the success rate is always shown as 100%, but I ran some requests that returned error 500 and the rate did not change. After digging into the query the dashboard runs and the stats in my Prometheus, it turns out the query was using the wrong metrics.

I think I will give up on it. I have spent tons of time; some things work, others do not, the resources on the web are not always accurate, and I have not seen any actual benefit from using Linkerd yet. If I cannot get those metrics, I cannot observe and evaluate my setup.

Anyway, thanks for your help!

Sorry to hear that you weren’t able to get this working! If you do come back to this in the future and are able to reproduce an issue, we’d be happy to continue to investigate.

I am also not able to see the stats for deployments, inbound, and edges.

It worked in another environment with a normal external Prometheus, instead of this Google Managed Prometheus. I’m not sure whether Prometheus is not scraping the data, or whether the metrics-api cannot pull the data because Prometheus does not relabel it properly.

SETUP:
linkerd version: stable-2.13.5
GMP with self-deployed data collection
HA mode

Hey,
how did you configure the scrape configs for linkerd?
I just figured out that, for example, the ScrapeConfig CRD doesn’t work because of a wrong value in the job label: Linkerd only looks for linkerd-proxy, linkerd-controller and linkerd-service-mirror, not for scrapeConfig/namespace/linkerd-proxy…
If you check under that other job name, you might realize Prometheus is scraping correctly but Linkerd can’t find the values…
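In plain Prometheus configuration terms, the idea is to force the job label back to the value the viz metrics-api looks for; with GMP or the ScrapeConfig CRD the field names differ, so treat this as a sketch (the linkerd-admin port name is the default and an assumption here):

- job_name: linkerd-proxy
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only containers named linkerd-proxy exposing the admin port
  - source_labels: [__meta_kubernetes_pod_container_name, __meta_kubernetes_pod_container_port_name]
    action: keep
    regex: linkerd-proxy;linkerd-admin
  # force job=linkerd-proxy, the value the viz queries expect
  - action: replace
    target_label: job
    replacement: linkerd-proxy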