What we observed
During AKS node initialization, an image pull stalled when the registry server became temporarily unresponsive: the HTTP client waited approximately 2 minutes 47 seconds before timing out, with 0 bytes received. An immediate retry after the failure succeeded in just 1.6 seconds.
This wait time is too long. The client should be able to detect unresponsive connections much sooner.
Where the issue is
The CRI image pull HTTP request flows through this chain:
PullImage
└─ pullRequestReporterRoundTripper.RoundTrip()   // counts active requests & bytes
   └─ http.Transport.RoundTrip()                 // actual HTTP request goes out here
      └─ newTransport() (image_pull.go L569)     // DialContext.Timeout=30s, but NO ResponseHeaderTimeout
newTransport() in internal/cri/server/images/image_pull.go#L569-L581 creates the http.Transport that actually sends the request. The 30s dialer Timeout and the 10s TLSHandshakeTimeout only cover connection setup. Once the TCP connection and TLS handshake succeed, there is no timeout for waiting on response headers, so the client hangs until the OS TCP stack gives up (~2-3 minutes).
What we propose
Add ResponseHeaderTimeout to newTransport():
func newTransport() *http.Transport {
	return &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:       30 * time.Second,
			KeepAlive:     30 * time.Second,
			FallbackDelay: 300 * time.Millisecond,
		}).DialContext,
		MaxIdleConns:          10,
		IdleConnTimeout:       30 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 5 * time.Second,
		ResponseHeaderTimeout: 30 * time.Second, // <-- add this
	}
}
This allows the client to fail fast and retry sooner, instead of waiting minutes on a stalled connection.