503 Service Unavailable from proxy with large number of connections

After upgrading linkerd-control-plane from version 1.9.3 (stable-2.12.1) to version 1.12.3 (stable-2.13.3), I started getting “503 Service Unavailable” responses when a large number of connections is in use (~100 connections or higher).
When I remove the linkerd-proxy sidecar, everything works without any errors.

The network topology is simple pod-to-pod HTTP communication; both the client and the server run as a single pod instance each.
I am using fortio as a load-generation tool.
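
For reference, the fortio invocation looks roughly like the following (the target URL and exact numbers are illustrative placeholders; the relevant part is the high number of concurrent connections combined with a modest request rate):

fortio load -c 100 -qps 50 -t 60s http://<server-service>:8080/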

Please note that this issue started only after upgrading the linkerd-control-plane chart.

I configured the log level of the linkerd-proxy sidecar to debug and found the following logs that I think may be relevant:
server-side:

DEBUG ThreadId(01) inbound:accept{client.addr=10.129.0.64:37346}:server{port=8080}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Err(hyper::Error(Io, Custom { kind: NotConnected, error: "server: Transport endpoint is not connected (os error 107)" }))
DEBUG ThreadId(01) inbound:accept{client.addr=10.129.0.64:37346}: linkerd_app_core::serve: Connection closed reason=connection error: server: Transport endpoint is not connected (os error 107)
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_tls::server: Peeked bytes from TCP stream sz=0
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_tls::server: Attempting to buffer TLS ClientHello after incomplete peek
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_tls::server: Reading bytes from TCP stream buf.capacity=8192
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_tls::server: Read bytes from TCP stream buf.len=108
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_detect: Detected protocol protocol=Some(HTTP/1) elapsed=3.17µs
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_proxy_http::server: Creating HTTP service version=HTTP/1
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_proxy_http::server: Handling as HTTP version=HTTP/1
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}:http: linkerd_app_inbound::policy::http: Request authorized server.group= server.kind=default server.name=all-unauthenticated route.group= route.kind=default route.name=probe authz.group= authz.kind=default authz.name=probe client.tls=None(NoClientHello) client.ip=10.129.0.2
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.129.0.2:44476}: linkerd_app_core::serve: Connection closed

client-side:

DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}: linkerd_detect: Detected protocol protocol=Some(HTTP/1) elapsed=4.51µs
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}: linkerd_proxy_http::server: Creating HTTP service version=HTTP/1
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}: linkerd_app_outbound::sidecar: Using ClientPolicy routes
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}: linkerd_proxy_http::server: Handling as HTTP version=HTTP/1
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http: linkerd_app_outbound::http::logical::policy::router: Selected route meta=RouteRef(Default { name: "http" })
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http: linkerd_stack::loadshed: Service has become unavailable
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http: linkerd_stack::loadshed: Service shedding load
INFO ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http:rescue{client.addr=10.129.0.64:41544}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 172.30.159.54:8080: service unavailable error.sources=[service unavailable]
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http: linkerd_app_core::errors::respond: Handling error on HTTP connection status=503 Service Unavailable version=HTTP/1.1 close=true
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}:proxy{addr=172.30.159.54:8080}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
DEBUG ThreadId(01) outbound:accept{client.addr=10.129.0.64:41544}: linkerd_app_core::serve: Connection closed

I’d appreciate your assistance with this issue.

Can you check what the CPU consumption of the proxy containers looks like? If that’s the bottleneck, you can play with the proxy.resources.cpu and proxy.cores settings.
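
For example (the numbers here are purely illustrative, not a recommendation), in your linkerd-control-plane values you could set something like:

proxy:
  cores: 2
  resources:
    cpu:
      request: 500m
      limit: "2"

proxy.cores caps the number of worker threads the proxy runs, so it generally makes sense to keep it in line with the CPU limit.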

Is there anything in particular about the server’s main container we should know about? Are you able to reproduce this with a simple image like nginx?
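
Something along these lines would be enough to rule the application out (just a sketch; the deployment name and port are placeholders), running the same fortio load from the meshed client pod against it:

kubectl create deployment nginx-test --image=nginx --dry-run=client -o yaml | linkerd inject - | kubectl apply -f -
kubectl expose deployment nginx-test --port=80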

Hi, thanks for your reply.
I don’t think it’s a resource (CPU/Memory) consumption issue. The containers (application and the proxies) don’t have any limits defined, and they are hosted on a Node with enough CPU (8 cores).

The server is just a simple HTTP server written in Go for testing purposes. If you are interested, its implementation can be found here.
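
It’s a simple server along the lines of this sketch (a simplified illustration, not the actual file; the real implementation is in the link above):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Plain handler that returns 200 OK; no custom timeouts,
	// keep-alive settings, or connection limits are configured.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Port 8080, matching the server port shown in the proxy logs.
	log.Fatal(http.ListenAndServe(":8080", nil))
}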

I suspect the number of open connections between the client and the server is what causes this issue. Even with a low number of requests per second but a large number of open connections, the client receives 503 responses.
Is there a configuration that limits the number of open connections? At first I suspected the new Circuit Breaking feature introduced in version 2.13, but I didn’t explicitly enable it.
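
As far as I can tell from the 2.13 docs, circuit breaking is opt-in via a failure-accrual annotation on the destination Service, something like the command below (the service name is a placeholder), and I don’t have that annotation set anywhere:

kubectl annotate service <server-service> balancer.linkerd.io/failure-accrual=consecutive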

The exact same setup works perfectly with linkerd-control-plane version 1.9.3 (stable-2.12.1) but fails with version 1.12.3 (stable-2.13.3).

Hi,
I’m running into the same problem here.
I was using Locust to evaluate performance with Linkerd 2.13.2 and got lots of 503s with 128 Locust users (users=128).
It didn’t happen without Linkerd, and Istio handles 128 users without any problem.

[   265.902625s]  INFO ThreadId(01) outbound:proxy{addr=10.244.2.64:7712}:http:rescue{client.addr=10.244.2.65:49176}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.244.2.64:7712: service unavailable error.sources=[service unavailable]