Skip to content

Add DNS SRV service discovery support (RFC 2782)#6554

Draft
x4m wants to merge 2 commits into
npgsql:mainfrom
x4m:dnssrv
Draft

Add DNS SRV service discovery support (RFC 2782)#6554
x4m wants to merge 2 commits into
npgsql:mainfrom
x4m:dnssrv

Conversation

@x4m

@x4m x4m commented Apr 22, 2026

Copy link
Copy Markdown

DNS SRV Service Discovery for PostgreSQL HA Clusters

Problem

High-availability PostgreSQL deployments (Patroni, Stolon, etc.) expose several
nodes that change over time as primaries fail over or replicas are added.
Today, every client connection string must hard-code the full list of hosts:

Host=pg1.example.com,pg2.example.com,pg3.example.com;TargetSessionAttributes=read-write

Updating that list when topology changes requires redeploying every service that
connects to the database.

Solution

RFC 2782 DNS SRV records were
designed to solve exactly this problem. The operator publishes a single DNS name:

_postgresql._tcp.cluster.example.com  SRV  10 1 5432 pg1.example.com.
_postgresql._tcp.cluster.example.com  SRV  20 1 5432 pg2.example.com.
_postgresql._tcp.cluster.example.com  SRV  20 1 5432 pg3.example.com.

Clients look up that name once and receive an ordered list of hosts. Topology
changes become a DNS update—no application restart required.

This PR adds a SrvHost connection string property. When set, Npgsql queries
_postgresql._tcp.<SrvHost> at NpgsqlDataSource build time, sorts the
returned records by priority ascending, weight descending (RFC 2782 §3),
and passes the resulting host:port,... list into the existing
NpgsqlMultiHostDataSource infrastructure. All existing features—
TargetSessionAttributes, load balancing, health checks—work unchanged.

API

New connection string keyword

Keyword Type Default Description
SrvHost string null Cluster DNS domain for SRV discovery

SrvHost and Host are mutually exclusive. Specifying both throws
ArgumentException at build time.

New builder property

// NpgsqlDataSourceBuilder and NpgsqlSlimDataSourceBuilder
public DnsClient.ILookupClient? SrvLookupClient { get; set; }

Leave as null to use the OS resolver. Inject a custom ILookupClient in
tests to return deterministic records without a real DNS server.

Usage

// Connection string
await using var ds = NpgsqlDataSource.Create(
    "SrvHost=cluster.example.com;Database=app;Username=appuser");

// Builder API (same result)
var builder = new NpgsqlDataSourceBuilder
{
    ConnectionStringBuilder =
    {
        SrvHost   = "cluster.example.com",
        Database  = "app",
        Username  = "appuser",
        TargetSessionAttributes = TargetSessionAttributes.ReadWrite,
    }
};
await using var ds = builder.Build();

The resolved hosts behave identically to a hand-written
Host=pg1:5432,pg2:5432,pg3:5432 list.

Implementation Details

New files

File Role
src/Npgsql/SrvLookup.cs Core DNS logic via DnsClient NuGet
test/Npgsql.SrvTests/SrvLookupTests.cs Unit + live integration tests
test/Npgsql.SrvTests/Npgsql.SrvTests.csproj Isolated test project

Modified files

File Change
NpgsqlConnectionStringBuilder.cs New SrvHost property + mutual exclusivity check
NpgsqlSlimDataSourceBuilder.cs ResolveSrvIfNeeded() called in Build() / BuildMultiHost()
NpgsqlDataSourceBuilder.cs Forward SrvLookupClient to the internal builder
Npgsql.csproj DnsClient 1.8.0 dependency
Directory.Packages.props Central version pin for DnsClient
PublicAPI.Unshipped.txt New public symbols declared
Properties/AssemblyInfo.cs InternalsVisibleTo for Npgsql.SrvTests

Why DnsClient?

.NET's built-in System.Net.Dns only resolves A/AAAA records. DnsClient is
the canonical .NET library for SRV lookups. It is MIT-licensed, has no
transitive dependencies beyond .NET itself, and targets netstandard2.0.

Resolution timing

SRV records are resolved once at NpgsqlDataSource build time, not per
connection. This mirrors the model used for static multi-host connection
strings. Applications that need periodic re-discovery can rebuild the data
source (e.g. on a timer or after a connection error).

Testing

Unit tests (no database, no DNS required)

Npgsql.SrvTests is an isolated project that does not inherit the
assembly-level PostgreSQL [OneTimeSetUp] from Npgsql.Tests, so all unit
tests run on any developer machine.

Tests cover:

  • SrvHost connection-string roundtrip and keyword parsing.
  • SrvHost / Host mutual exclusivity throws ArgumentException.
  • RFC 2782 sort order: lower priority number wins; higher weight wins on tie.
  • Trailing-dot stripping from FQDNs (pg.example.com.pg.example.com).
  • Priority/weight ordering verified against the four real records at mmatvei.ru.
  • Empty SRV result set raises NpgsqlException.

Live integration test

ResolveSrvLive queries _postgresql._tcp.mmatvei.ru, a set of real public
SRV records maintained for this purpose:

_postgresql._tcp.mmatvei.ru  SRV  96  1 5432 pg4.mmatvei.ru.
_postgresql._tcp.mmatvei.ru  SRV  97  1 5432 pg3.mmatvei.ru.
_postgresql._tcp.mmatvei.ru  SRV  99  1 5432 pg2.mmatvei.ru.
_postgresql._tcp.mmatvei.ru  SRV 100  1 5432 pg.mmatvei.ru.

The test skips automatically when DNS is unavailable so it never fails an
offline build. Override the nameserver via environment variable if the system
resolver has a stale negative cache:

NPGSQL_TEST_SRV_DNS=88.212.208.183 dotnet test test/Npgsql.SrvTests/

Run tests

# Unit tests only (no DNS needed)
dotnet test test/Npgsql.SrvTests/

# With live DNS test against specific nameserver
NPGSQL_TEST_SRV_DNS=88.212.208.183 dotnet test test/Npgsql.SrvTests/

Prior Art

  • pgx (Go): PR #2538
    postgres+srv:// URI scheme and LookupSRVFunc hook.
  • pgjdbc (Java): PR #4036
    jdbc:postgresql+srv:// URI scheme using JNDI DNS.
  • This Npgsql implementation follows the same design goals and uses the
    SrvHost= keyword approach, consistent with Npgsql's key-value connection
    string convention.

Introduce a new `SrvHost` connection string property that enables a single
DNS name to represent an entire high-availability PostgreSQL cluster.  When
set, Npgsql resolves `_postgresql._tcp.<SrvHost>` SRV records at
`NpgsqlDataSource` build time and uses the returned host/port pairs as the
multi-host list, sorted by priority (ascending) then weight (descending) per
RFC 2782.

### Changes

**`NpgsqlConnectionStringBuilder`**
- New `SrvHost` string property, exposed as the `SrvHost` keyword in
  connection strings (e.g. `SrvHost=cluster.example.com`).
- `PostProcessAndValidate` updated to allow a null/empty `Host` when `SrvHost`
  is set, and to enforce mutual exclusivity: supplying both throws
  `ArgumentException`.

**`SrvLookup.cs`** (new)
- Static helper class that queries `_postgresql._tcp.<srvHost>` via the
  `DnsClient` NuGet package, sorts SRV records by priority / weight, strips
  trailing FQDN dots, and returns a comma-separated `host:port,host:port,...`
  string ready for use as `NpgsqlConnectionStringBuilder.Host`.
- `SortAndBuild(IEnumerable<SrvRecord>)` is `internal` so unit tests can
  exercise sorting logic without a live DNS server.

**`NpgsqlSlimDataSourceBuilder` / `NpgsqlDataSourceBuilder`**
- New `SrvLookupClient` property (`DnsClient.ILookupClient?`).  When `null`
  (the default), the OS resolver is used.  Inject a custom client in tests to
  return deterministic SRV records without hitting DNS.
- `Build()` and `BuildMultiHost()` call `ResolveSrvIfNeeded()` before
  `PostProcessAndValidate()`, expanding SRV results into the `Host` property
  so the existing multi-host path handles all subsequent connection logic.

**Dependency**
- `DnsClient 1.8.0` added to `Directory.Packages.props` and `Npgsql.csproj`.

### New test project — `Npgsql.SrvTests`

Isolated from the main `Npgsql.Tests` assembly (which requires a live
PostgreSQL server) so SRV unit tests can run on any machine without a
database.

Unit tests cover:
- Connection-string roundtrip and keyword parsing.
- `SrvHost` / `Host` mutual exclusivity.
- RFC 2782 sort order: priority ascending, weight descending.
- Trailing-dot stripping from FQDNs returned by DnsClient.
- Priority/weight ordering mirroring the real records at `mmatvei.ru`.
- Empty result set throws `NpgsqlException`.

`ResolveSrvLive` performs an end-to-end DNS lookup against
`_postgresql._tcp.mmatvei.ru` (four real SRV records, priorities 96–100)
and verifies ordering.  The test skips automatically if the records are
unreachable.  Set `NPGSQL_TEST_SRV_DNS=<ip>` to force a specific nameserver
(useful when the system resolver has a stale negative cache).

### Usage

```csharp
// Connection string keyword
var ds = NpgsqlDataSource.Create(
    "SrvHost=cluster.example.com;Database=app;Username=app_user");

// Builder API
var builder = new NpgsqlDataSourceBuilder();
builder.ConnectionStringBuilder.SrvHost = "cluster.example.com";
builder.ConnectionStringBuilder.Database = "app";
var ds = builder.Build();
```

### Connection string format
```
SrvHost=cluster.example.com;Database=mydb;Username=myuser;Password=...
```

### Notes
- SRV resolution happens once at `Build()` time.  Re-build the data source to
  re-query DNS (matches how `NpgsqlMultiHostDataSource` works today).
- `TargetSessionAttributes` (e.g. `read-write`, `primary`, `standby`) work
  unchanged with the resolved host list.
- `SrvHost` and `Host` are mutually exclusive; mixing them throws
  `ArgumentException` at build time.

Made-with: Cursor
@x4m x4m requested review from roji and vonzshik as code owners April 22, 2026 05:37
@vonzshik

Copy link
Copy Markdown
Contributor

Neither pgjdbc or pgx has this feature. In fact, there are only pr's submitted by you to them just yesterday. BTW the links to them are completely wrong, but what else can I expect from AI.
Oh, and no, we're definitely not going to reference another library.

@x4m

x4m commented Apr 22, 2026

Copy link
Copy Markdown
Author

Yeah, PR links are wrong, my bad. Unlike RFC link, which is correct. And, of course, AI was used.
Removing extra dependency makes sense, I'll look what can be done about it.

Implement DNS SRV resolution directly using UdpClient and manual
DNS wire-format parsing, eliminating the DnsClient NuGet package.

System DNS servers are obtained via NetworkInterface; the packet
parser handles compressed names per RFC 1035.  Public API surface
is unchanged: only SrvHost is exposed; the internal SrvLookupClient
override (used in tests) is replaced by calling SortAndBuild()
directly with the now-internal SrvRecord type.

Made-with: Cursor
@x4m x4m marked this pull request as draft April 22, 2026 10:58
@x4m

x4m commented Apr 22, 2026

Copy link
Copy Markdown
Author

@vonzshik I've pushed a draft, no more DnsClient. Though I'm curious if 200 lines of DNS wire-format parsing is a net win over one library? I'm happy to keep it either way, just asking.
In libpq the request was an opposite - delegate code maintenance if possible.

@roji

roji commented Apr 22, 2026

Copy link
Copy Markdown
Member

In libpq the request was an opposite - delegate code maintenance if possible.

Can you point us to the libpq discussion around this? We generally tend to follow libpq in terms of functionality like this - if this is accepted in the standard PostgreSQL client library, that provides good motivation for considering it here.

Otherwise, some general thoughts:

  • As @vonzshik wrote above, we're generally reluctant to add new dependencies, especially for functionality which would only be needed by a fraction of our users. I also wouldn't love the idea of maintaining DNS wire-parsing code in a database driver such as Npgsql. This point would be easier if .NET contained built-in client support for DNS SRV, but the lack thereof may tell us something about the overall importance/popularity of SRV records...
  • Overall, the same goals addressed by SRV are frequently already addressed by having a single IP with a load balancer/proxy in front of the actual services (I'm thinking of the cloud service offerings).
  • Single-IP based approaches have the advantage of being able to dynamically move IPs in and out of the list without a DNS deployment change. For example, DNS relies heavily on caching, so topology changes would first have to propagate.
  • Taking a wider look, I'm not sure it makes sense/is scalable to go over all the possible services out there - why not add SRV support to MySQL, SQL Server, and all other database clients? For that matter, why restrict to database clients - should all services out there support DNS SRV?

Overall, I'd prefer for the feature to be actually requested by real-world users in order to address their actual problems, rather than proposed as additions into all drivers like this; in the history of Npgsql I don't recall a single user asking for it.

@x4m

x4m commented Apr 22, 2026

Copy link
Copy Markdown
Author

@roji Many thanks for your thoughtful reply!

In libpq the request was an opposite - delegate code maintenance if possible.

Can you point us to the libpq discussion around this? We generally tend to follow libpq in terms of functionality like this - if this is accepted in the standard PostgreSQL client library, that provides good motivation for considering it here.

There are several relevant discussions on pgsql-hackers:

Most recent attempt: resolving DNS A record into multiple IPs https://www.postgresql.org/message-id/flat/AM9PR09MB49008B02CDF003054D5D4E00977DA%40AM9PR09MB4900.eurprd09.prod.outlook.com
Older discussion with a more complete patch https://www.postgresql.org/message-id/flat/CAKK5BkGK8gZRH48cHD7Di8WXfjdG3_1QAFD1O1FPCbt76Wq_zQ%40mail.gmail.com#a09343495bd00b0ae5d3eba487abeca3
Thread specifically about DNS SRV https://www.postgresql.org/message-id/flat/8398C22D-429A-4980-9028-4F941F2B7483%40yandex-team.ru#b220f9a55de09af1719365a6aa9bffdc
I work for a cloud provider. When clients use a non-Go driver today, they essentially have to buy a proxy from us. pgx already helps by trying all IPs returned by DNS. After reviewing the multi-IP DNS-A patch it became clear that the RFC-compliant direction is DNS SRV.

It totally makes sense to wait for a libpq-approved design - I needed a prototype, and interested users can build Npgsql from this branch in the meantime. Happy to keep this PR open as a reference once the libpq thread lands, or close it if you prefer.

Otherwise, some general thoughts:

* As @vonzshik wrote above, we're generally reluctant to add new dependencies, especially for functionality which would only be needed by a fraction of our users. I also wouldn't love the idea of maintaining DNS wire-parsing code in a database driver such as Npgsql. This point would be easier if .NET contained built-in client support for DNS SRV, but the lack thereof may tell us something about the overall importance/popularity of SRV records...

On Windows, .NET does have built-in SRV support via DnsQuery. The gap is on non-Windows, which is why the wire-parsing code exists. That said, I agree it's not ideal to maintain in a database driver - this was the original motivation for using DnsClient.

* Overall, the same goals addressed by SRV are frequently already addressed by having a single IP with a load balancer/proxy in front of the actual services (I'm thinking of the cloud service offerings).

Agreed - and we do offer that. But it adds a latency hop and a cost that some users would rather avoid, especially for internal deployments.

* Single-IP based approaches have the advantage of being able to dynamically move IPs in and out of the list without a DNS deployment change. For example, DNS relies heavily on caching, so topology changes would first have to propagate.

Drivers using target_session_attrs already handle dead host removal gracefully. The harder problem is adding a new host to the cluster - and for that, eventual propagation via DNS is perfectly acceptable.

* Taking a wider look, I'm not sure it makes sense/is scalable to go over all the possible services out there - why not add SRV support to MySQL, SQL Server, and all other database clients? For that matter, why restrict to database clients - should **all** services out there support DNS SRV?

They already have it. AFAICT MySQL, SQL Server, MongoDB, and Valkey all support DNS SRV in their official clients.

Overall, I'd prefer for the feature to be actually requested by real-world users in order to address their actual problems, rather than proposed as additions into all drivers like this; in the history of Npgsql I don't recall a single user asking for it.

I'm submitting this PR on behalf of our users who asked for this.

@roji

roji commented Apr 22, 2026

Copy link
Copy Markdown
Member

They already have it. AFAICT MySQL, SQL Server, MongoDB, and Valkey all support DNS SRV in their official clients.

Thank you, I did not know that. And thanks for the rest of the context and the conversation as well, that's all useful.

Let's see how the PostgreSQL folks react to this. If they decide that this is worth doing in libpq, that's definitely good motivation for us to at least consider it too; at that point we can work out the details of how to do SRV lookup etc.

@x4m

x4m commented Apr 23, 2026

Copy link
Copy Markdown
Author

Thanks!

For the sake of correctness: MS SQL support for DNS SRV is not documented, and for Valkey only Java driver supports DNS SRV.

FWIW there's alternative approach, when a single hostname resolves to multiple A record IPs, Npgsql iterates through them at the TCP level but not at the target_session_attrs level - NpgsqlMultiHostDataSource is only activated when Host contains a comma.

pgx handles this with try_all_addrs: each IP from DNS is treated as an independent candidate, so failover and read/write routing work without listing hosts explicitly.

SRV is essentially the same idea taken further - instead of multiple IPs behind one name, you get multiple hostnames with ports and priorities. Both are about representing a whole cluster as a single DNS entry.

Would you be willing to have the feature for A-record instead\along with SRV record?

@roji

roji commented Apr 23, 2026

Copy link
Copy Markdown
Member

FWIW there's alternative approach, when a single hostname resolves to multiple A record IPs, Npgsql iterates through them at the TCP level

Unless I'm mistaken, Npgsql already does that; note the for loop just below over the different resolved addresses. This is old logic and was definitely not meant for round-robbin or anything like that - but it does go through the resolved addresses in order, trying later ones if connections to earlier ones fail.

@x4m

x4m commented Apr 23, 2026

Copy link
Copy Markdown
Author

You're right that the loop is there. But it only continues to the next IP on a socket exception; on TCP success it returns immediately, and target_session_attrs is checked later. So my understanding is that role discovery doesn't benefit from the multi-IP iteration. But I may be missing something — happy to be corrected.

If pg.example.com resolves to three IPs - one primary and two standbys - and the primary happens to be first, everything works. But if a standby responds first, Npgsql connects to it, exits the loop, and then fails the target_session_attrs check without ever trying the other IPs. The iteration is purely for TCP reachability, not for role discovery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants