-
Notifications
You must be signed in to change notification settings - Fork 877
Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationToken #5976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vonzshik
wants to merge
3
commits into
main
Choose a base branch
from
5975-wait-cancellation-token-cancel-deadlock-fix
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 1 commit
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
Fix deadlock while cancelling NpgsqlConnection.Wait via CancellationT…
…oken
- Loading branch information
commit c59210fb6d5d45c4b1ba907f3a6cfda153765776
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is guaranteed to help.
cancellationToken.ThrowIfCancellationRequested()only provides a read acquire barrier, and I don't see any other barriers as a part of settingReadBuffer.Timeout, so the runtime is free to move the call toThrowIfCancellationRequested()before the write to the timeout.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't according to memory model acquire barrier guarantees that any reads and writes can only happen after that read?
And from my understanding that's relevant if and only if
ThrowIfCancellationRequestedis inlined, as otherwise the runtime can't guarantee that it doesn't have side effects regardingReadBuffer, so it can't move it afterReadBuffer.Timeout =.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You won't see an x64 cpu do so but on arm I suppose it could happen...
The JIT likely won't move the code as the write to the timeout and the read of the cts reference (until the
if canceledbranch) should be part of the same basic block, but that's mostly an implementation detail.When I set these multi threaded timeout values in Slon I have an
Interlocked.MemoryBarrier()after it, mostly to make sure it propagates to timer threads in a timely fashion. It would be enough here too (assuming this solves the deadlock, I haven't looked too closely).I don't see the volatile read @WizardBrony referred to in ThrowIfCancellationRequested. Writing an object reference forces a barrier and is guaranteed to be atomic but reading an object reference has no barriers in .NET (unlike many low latency GCs that require them).
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think I follow. My point was that this change depends on the write to
ReadBuffer.Timeouthappening before the token is checked, and according to the C# memory model, it is valid for the runtime to reorder the write after the read. Please let me know if I am missing something.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/dotnet/runtime/blob/040cbe276907174316e2cc07b35814b3069874a6/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs#L73
And
_stateis marked asvolatile.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one (keepalive) is performed whenever we do not have an active query, so we're not racing there (lock over
SyncObjshould guarantee that).The second one is a bit more tricky. We call that method just before we start executing a query/start an action, so it's supposed to be the only one being able to change things around. But there is still a miniscule possibility of concurrent write due to
StartCancellableOperation, which registers cancellation token callback. Overall a chance of something like that ever happening is extremely miniscule since cancellation callback has to do a lot of work before even being able to try and cancel a query, but just in case I did move it after changing the timeout.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with the change in this PR, I still see a possible deadlock:
Say the call to
NpgsqlConnection.Waitmakes it all the way to this line inNpgsqlReadBuffer.EnsureLongwithout cancellation:https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L295-L297
That line will see
buffer.Timeout == TimeSpan.Zero.Just before the call to
buffer.Cts.Reset(), let's say cancellation occurs:https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlConnector.cs#L1824-L1825
Now
buffer.Cts.Reset()runs. Because the cancellation callback calledReadBuffer.Cts.Cancel(),Reset()will see_cts.IsCancellationRequested == true:https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Util/ResettableCancellationTokenSource.cs#L119-L123
That will create a new, uncanceled
CancellationTokenSource, and the cancellation callback will already be completed: Deadlock.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code there is indeed problematic, gonna have to think for a bit whether we can work around without adding a
lockthere (NpgsqlReadBuffer.Ensureis the most hot method we have which actually does IO).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @vonzshik, just a side-note. Right before you replied to me I had deleted my comment asking about the writes to
ReadBuffer.Timeoutbecause I had come to the same conclusions you did, but also for the write inDoStartUserActionI realized that even if there was a race with the cancellation callback it wouldn’t really cause an issue forWait, so I figured it was off-topic for this PR. But I appreciate you being thorough and fixing it for the non-Waitcode paths anyway. Thanks for all your work on this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vonzshik What I would recommend doing is simply taking the
CancelLockduring each concurrent access toNpgsqlReadBuffer'sCtsand/orTimeout. For example, changing:https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L295-L297
to:
Then we can remove
ResettableCancellationTokenSource.lockObjectsince there will no longer be concurrent access toNpgsqlReadBuffer.Cts. That way we're not adding alocktoNpgsqlReadBuffer.Ensure, and accesses toCtsandTimeoutare still thread-safe.Additionally, async logic like this:
https://github.com/npgsql/npgsql/blob/5975-wait-cancellation-token-cancel-deadlock-fix/src/Npgsql/Internal/NpgsqlReadBuffer.cs#L360-L364
needs to be:
Otherwise if cancellation races with timeout and cancellation wins, the
_cancelImmediatelyTimeoutwill get stomped on.I also have a question about that logic. The documentation states that a
Cancellation Timeoutof "0 means infinite wait." But ifbuffer.Timeoutis only updated whenCancellationTimeout > 0, how does it ever infinitely wait ifbuffer.Timeout > TimeSpan.Zero?