We are currently using Npgsql with a multi-host configuration for PostgreSQL failover (no load balancing). While testing failure scenarios we observed behavior which can introduce significant latency for user requests.
It appears that the current host recheck mechanism may cause user requests to block on connection attempts to a previously failed host, even though a healthy host is available.
Configuration
Example connection string:
Host=server1,server2;
load_balance_hosts=false;
Host Recheck Seconds=5;
Timeout=2;
target_session_attrs=primary;
Maximum Pool Size=5;
Scenario assumptions:
- server1 = primary host
- server2 = failover host
- no load balancing
- application uses the default connection pooling
Observed behavior
- Application initially connects to server1.
- server1 becomes unavailable.
- A connection attempt fails and server1 is marked as down.
- The driver starts using server2 successfully.
- After HostRecheckSeconds (e.g. 5 seconds), the driver tries server1 again.
If server1 is still unavailable, multiple concurrent requests attempt to connect to it again.
Because the connection attempt waits for the configured Timeout, these requests block for the full timeout duration before falling back to server2.
Example timeline:
Hosts: server1, server2
HostRecheckSeconds = 5
Timeout = 2
If server1 is still down:
- every 5 seconds the driver retries server1
- concurrent connection attempts wait up to 2 seconds
- only afterwards do they fall back to server2
This results in user requests experiencing ~2 seconds of latency, even though server2 is fully healthy and could respond immediately.
With higher concurrency (e.g. as many simultaneous connection attempts as Maximum Pool Size allows), many requests may be delayed at the same time.
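To make the timeline concrete, here is a small Python model of the behavior described above. It is only a sketch of what we observed, not Npgsql's actual implementation: the function name `simulate_latencies` and the assumption that a failed attempt re-marks server1 as down for another HostRecheckSeconds are ours.

```python
def simulate_latencies(arrivals, recheck_seconds=5.0, timeout=2.0):
    """Model connect latency per request when rechecks happen inline.

    Assumptions (ours, not taken from Npgsql's source):
    - server1 was marked down at t=0 and stays unreachable,
    - a request arriving once the recheck interval has elapsed tries
      server1 first and blocks for the full timeout,
    - server2 answers instantly (latency 0),
    - when an attempt to server1 times out, server1 is marked down
      again for another recheck_seconds.
    Latencies are returned in order of arrival time.
    """
    retry_eligible_at = recheck_seconds
    pending_failures = []  # times at which in-flight attempts will time out
    latencies = []
    for t in sorted(arrivals):
        # Apply every timeout that has completed by the time this request arrives.
        still_pending = []
        for failed_at in pending_failures:
            if failed_at <= t:
                retry_eligible_at = max(retry_eligible_at, failed_at + recheck_seconds)
            else:
                still_pending.append(failed_at)
        pending_failures = still_pending

        if t >= retry_eligible_at:
            # Recheck is due: this request blocks on server1 before using server2.
            pending_failures.append(t + timeout)
            latencies.append(timeout)
        else:
            # server1 is still considered down: go straight to server2.
            latencies.append(0.0)
    return latencies


arrivals = [0.0, 1.0, 5.0, 5.5, 6.0, 8.0, 14.0]
print(simulate_latencies(arrivals))
```

In this model the three requests arriving at 5.0, 5.5 and 6.0 seconds all pay the 2 second timeout, because none of them sees the failure of the others before starting its own attempt.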
Expected behavior
From an application perspective, a preferable strategy would be:
- once a host is marked as down, continue using the healthy host
- periodically probe the failed host in the background
- avoid blocking user requests on retry attempts to the failed host
This would prevent unnecessary latency spikes when one host remains unavailable.
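The strategy above could look roughly like the following Python sketch. All names here (`FailoverSelector`, `pick_host`, `probe_down_hosts`, the injected `probe` callable) are hypothetical and only illustrate the idea; this is not an existing Npgsql API, and it ignores the target_session_attrs check for simplicity.

```python
import threading


class FailoverSelector:
    """Picks a host without blocking; failed hosts are probed separately."""

    def __init__(self, hosts, probe):
        self._hosts = list(hosts)   # in priority order, e.g. ["server1", "server2"]
        self._probe = probe         # callable(host) -> bool; may block up to the connect timeout
        self._down = set()
        self._lock = threading.Lock()

    def mark_down(self, host):
        with self._lock:
            self._down.add(host)

    def pick_host(self):
        """Return the first host not marked down, with no network I/O."""
        with self._lock:
            for host in self._hosts:
                if host not in self._down:
                    return host
        return None  # every host is currently marked down

    def probe_down_hosts(self):
        """Probe failed hosts; any blocking happens here, off the request path."""
        with self._lock:
            candidates = list(self._down)
        for host in candidates:
            if self._probe(host):
                with self._lock:
                    self._down.discard(host)

    def start_background_probing(self, interval_seconds, stop_event):
        """Run probe_down_hosts every interval_seconds until stop_event is set."""
        def loop():
            # Event.wait returns False on timeout and True once the event is set.
            while not stop_event.wait(interval_seconds):
                self.probe_down_hosts()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

With this shape, user requests only ever call `pick_host`, so they keep going to server2 while server1 is down, and server1 returns to rotation only after a background probe succeeds.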
Question
Is the current behavior intentional?
If so, is there a recommended way to avoid user-facing delays when using multi-host failover without load balancing?
Alternatively, would it make sense to introduce an option to probe failed hosts in the background rather than during user connection attempts?
Environment
- Npgsql version: 8.0.6
- .NET SDK version: 8.0.418