Hello, we’re seeing an issue where a backend server’s successful replies are blocked by Linkerd after it has been running properly for a short time; it works at first, and then replies stop getting through. It seems to coincide with the GRPC Status never resolving when watching live calls in Linkerd Viz (see screenshot below). The requests are getting through and the server is replying properly, but the data never makes it back to the client.
We’re not seeing any errors from the linkerd-proxy, though it’s running at a limited logging level at the moment. Either way, when watching live calls, we noticed that if the GRPC status never resolves to ‘–’, things break since the client never gets the server’s reply.
The issue doesn’t seem to be a protocol detection delay since it never resolves. It just keeps spinning and eventually the client times out.
Wondering what might be causing this behavior? Restarting the pod for the server that is replying fixes the issue temporarily, but since the server is replying, there’s no easy way to tell whether the replies are making it all the way back to the client.
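Since the proxy is only at a limited log level right now, we’ll probably bump it to debug on the backend pods to capture more detail. A rough sketch of how we plan to do that (note this triggers a rollout of the deployment):

kubectl -n app-prod patch deployment app-deployment-backend \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'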
Another update with something we noticed while running tap.
Command [redacted]
linkerd viz tap -n app-prod -o wide deployment/app-deployment-backend
The connections that work show the "req", then a "rsp", and finally the "end" (like Syn - Ack - Fin; all three are there). For the connections that stopped working, the "end" wasn’t showing, so that’s probably why the GRPC status never resolves: the connection is still technically open.
Still not sure why the request and response would be sent without a final end.
Example when working:

req id: 10:2
proxy: in
src: 10.174.6.5:37206
dst: 10.174.6.10:3000
tls: true
:method: POST
:authority: app-prod-backend:3000
:path: /api/cases/filter
… rsp id: 10:2
proxy: in
src: 10.174.6.5:37206
dst: 10.174.6.10:3000
tls: true
:status: 200
latency: 152471µs
src_client_id: prod-gateway.app-prod.serviceaccount.identity.linkerd.cluster.local
src_control_plane_ns: linkerd
src_deployment: prod-frontend
… end id: 10:2
proxy: in
src: 10.174.6.5:37206
dst: 10.174.6.10:3000
tls: true
duration: 725µs
response-length: 127015B
src_client_id: prod-gateway.app-prod.serviceaccount.identity.linkerd.cluster.local
When the service stops working, that last piece “end” doesn’t show up.
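To make it easier to spot which streams never get an "end", here’s a rough sketch that captures tap output for a minute and lists the stream ids with no end event (it assumes the plain-text layout where each line starts with req/rsp/end followed by an id= field; adjust the field handling if your output differs):

linkerd viz tap -n app-prod -o wide deployment/app-deployment-backend > tap.log &
TAP_PID=$!
sleep 60   # capture roughly a minute of traffic
kill "$TAP_PID"

# Group events by stream id and print the ids that never saw an "end".
awk '{ id = $2; sub(/^id=/, "", id); seen[id] = seen[id] " " $1 }
     END { for (id in seen) if (seen[id] !~ /end/) print id ":" seen[id] }' tap.log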
It’s looking like this may be an issue with protocol detection. The backend works until logs saying the protocol was detected as HTTP/2 start showing up, but the protocol for this connection should always be HTTP/1.1.
Protocol Detection
2023-09-13T22:45:55.075864812Z [518242.614422s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}: linkerd_tls::server: Peeked bytes from TCP stream sz=0
2023-09-13T22:45:55.075879389Z [518242.614431s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}: linkerd_tls::server: Attempting to buffer TLS ClientHello after incomplete peek
2023-09-13T22:45:55.075882900Z [518242.614433s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}: linkerd_tls::server: Reading bytes from TCP stream buf.capacity=8192
2023-09-13T22:45:55.075885379Z [518242.614443s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}: linkerd_proxy_http::server: Creating HTTP service version=HTTP/2
2023-09-13T22:45:55.075888097Z [518242.614469s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}: linkerd_proxy_http::server: Handling as HTTP version=HTTP/2
2023-09-13T22:45:55.075925343Z [518242.614634s] DEBUG ThreadId(01) inbound:accept{client.addr=10.174.5.218:41150}:server{port=3000}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Err(hyper::Error(Io, Custom { kind: UnexpectedEof, error: "connection closed before reading preface" }))
Seems like it’s failing to read any bytes during the TCP stream "peek" and then falls back to HTTP/2. So the client is closing the connection after running into trouble with HTTP/2 when it was expecting HTTP/1.1.
Opaque ports aren’t a good option here since we leverage service profiles that enforce some actions based on the HTTP method. Is there a way to enforce a specific HTTP version?
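The closest thing we’ve found so far is the policy Server resource, which has a proxyProtocol field that (as we understand it) pins the protocol for a port and skips detection. Would something like this be the right approach? A rough sketch (the pod label is illustrative, and the apiVersion may differ by Linkerd version):

kubectl apply -f - <<'EOF'
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: app-prod
  name: backend-http1
spec:
  podSelector:
    matchLabels:
      app: app-deployment-backend   # illustrative; use whatever selects the backend pods
  port: 3000
  proxyProtocol: "HTTP/1"
EOF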
It seems unlikely to me that this would have anything to do with protocol detection. Those DEBUG log messages seem benign.
Can you share the output of linkerd diagnostics proxy-metrics ... on a pod? There may have been a regression in the way that grpc statuses are exposed from pods. Or, since it looks like the UI is still loading the status, it may be that the underlying Prometheus query isn’t returning results in a timely fashion.
I’d probably try to isolate whether the grpc status information is missing in the proxy’s metrics export or whether this is a UI issue.
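Something along these lines would show whether the grpc status information is there at all (the pod name is a placeholder; I’d expect the statuses to appear as grpc_status labels on the response counters):

linkerd diagnostics proxy-metrics -n app-prod pod/app-deployment-backend-<pod> | grep grpc_status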