Vhost-guest (was virtio-vhost-user) device #7695
DemiMarie wants to merge 16 commits into cloud-hypervisor:main
Conversation
Force-pushed from db7e861 to 187e228.
I’m now running into a serious bottleneck. The version of the virtio vhost-user device backend spec I am implementing requires support for additional interrupts beyond what virtio itself needs. Specifically, a vhost-user device backend (VDB) device will only have one queue, but implementing a device with N queues needs N MSI-X interrupts, N externally-provided irqfds, and 2N + 1 externally-provided ioeventfds. This is unique to the VDB device and is a result of VDB being optimized for performance: it aims to keep the userspace VMM out of interrupt handling at runtime. I don’t see a way to avoid this without a substantial performance loss. For the desktop workloads @alyssais and I care about, that might be acceptable, but I doubt cloud workloads would tolerate it.

How should these be wired up to the rest of Cloud Hypervisor? Should I resort to slow userspace emulation? I could handle interrupts in-band via control messages on the admin queue, which would work but would not be performant. That’s fine for a proof of concept, but fixing it later is a backwards-incompatible change (from the backend VM’s perspective), so I want to fix it before writing and submitting the kernel driver.

Once that is done, the rest of the protocol should hopefully be straightforward to implement. Most messages require no involvement from Cloud Hypervisor beyond checking that their message numbers are known and the framing is correct. Protocol feature negotiation does need Cloud Hypervisor involvement, but only to ensure that neither the frontend nor the backend generates messages with unknown types.
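The per-queue resource cost described above can be summarized in a small helper. This is purely illustrative (the function name is hypothetical, not part of any real API):

```rust
/// Hypothetical helper summarizing the resource cost described above:
/// a VDB device implementing an inner device with `n` queues needs
/// n MSI-X vectors, n externally-provided irqfds, and 2n + 1
/// externally-provided ioeventfds.
fn vdb_resource_cost(n: usize) -> (usize, usize, usize) {
    let msix_vectors = n;       // one MSI-X interrupt per inner-device queue
    let irqfds = n;             // externally provided, one per queue
    let ioeventfds = 2 * n + 1; // two per queue, plus one extra
    (msix_vectors, irqfds, ioeventfds)
}

fn main() {
    // A backend implementing a 4-queue device:
    assert_eq!(vdb_resource_cost(4), (4, 4, 9));
}
```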
Force-pushed from 187e228 to 291fa79.
Force-pushed from cf2d063 to 63d6c83.
Force-pushed from 63d6c83 to bc2de9d.
What's the status here? The PR description is empty.
Force-pushed from 995d721 to 7195efe.
This device is still incomplete. It has not been tested at all, which is not great; it's not even in a state where it can be tested. For instance, reading configuration space hits a

What is largely implemented is processing of incoming and outgoing vhost-user messages. The existing vhost-user crate makes too many assumptions that don't hold for this code, so this device reimplements parsing itself. The result is that it is quite complex. It also uses core code that is currently unused and which I implemented specifically for this device; some of it (#7841) is not even merged yet.
Force-pushed from 7195efe to e3a5270.
If there is no intention to land this code in this PR, it would be better to just reference a branch from an issue. Having too many open PRs creates noise and distraction.
I do intend to land this code. It’s not ready yet, though, which is why it is marked as draft. At this point, I expect that most work will be fixing bugs, adding the device to the rest of the system, and testing it. In particular, I’d like feedback on the code structure, as I don’t intend to change that unless required. |
Force-pushed from 22e3be5 to 7e4c22a.
Force-pushed from 668df93 to a864563.
It might help if the PR had a proper title and description! |
virtio-vhost-user requires PCI doorbells, used to notify the frontend of events. Add support for them. Each doorbell requires an MSI-X interrupt, so account for this. The specification for this capability is <https://stefanha.github.io/virtio/vhost-user-slave.html>. No functional change intended, as this code is not yet used. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
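For orientation, a sketch of what such a capability layout looks like. The generic header follows the virtio PCI capability layout, and the doorbell structure mirrors virtio's notification capability with an offset multiplier; field names here are approximations and may not match the draft spec or the PR's code exactly:

```rust
// Sketch of a virtio PCI capability plus a doorbell capability.
// Approximate layout, for illustration only.
#[repr(C, packed)]
struct VirtioPciCap {
    cap_vndr: u8,      // PCI vendor-specific capability ID
    cap_next: u8,      // offset of the next capability
    cap_len: u8,       // length of this capability
    cfg_type: u8,      // identifies the structure (doorbell, notification, ...)
    bar: u8,           // which BAR holds the structure
    padding: [u8; 3],
    offset: u32,       // offset within the BAR (little-endian)
    length: u32,       // length of the structure (little-endian)
}

#[repr(C, packed)]
struct VirtioPciDoorbellCap {
    cap: VirtioPciCap,
    // Doorbell for queue i lives at cap.offset + i * doorbell_off_multiplier.
    doorbell_off_multiplier: u32,
}

fn main() {
    assert_eq!(core::mem::size_of::<VirtioPciCap>(), 16);
    assert_eq!(core::mem::size_of::<VirtioPciDoorbellCap>(), 20);
}
```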
Virtio-vhost-user devices need one interrupt per queue, plus one for configuration changes, plus one interrupt per queue of the *device they are implementing*. Allow devices to override the number of interrupts they need instead of assuming it is always one more than the number of queues. No functional change for existing devices, which use a default implementation that returns the number of queues. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
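The override described above might look like the following trait sketch. All names here are hypothetical, not Cloud Hypervisor's actual API:

```rust
// Illustrative sketch: devices report their own interrupt count instead of
// the transport assuming it from the queue count.
trait DeviceInterrupts {
    fn queue_count(&self) -> usize;

    /// Default used by existing devices: one interrupt per queue
    /// (the config-change vector is accounted for by the transport).
    fn num_interrupts(&self) -> usize {
        self.queue_count()
    }
}

struct VhostGuest {
    own_queues: usize,   // queues of the vhost-guest device itself
    inner_queues: usize, // queues of the device it is implementing
}

impl DeviceInterrupts for VhostGuest {
    fn queue_count(&self) -> usize {
        self.own_queues
    }

    // One per own queue, one for config changes, one per inner-device queue.
    fn num_interrupts(&self) -> usize {
        self.own_queues + 1 + self.inner_queues
    }
}

fn main() {
    let dev = VhostGuest { own_queues: 4, inner_queues: 2 };
    assert_eq!(dev.num_interrupts(), 7);
}
```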
It will always be 0. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
No functional change intended. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Force-pushed from 787c6c2 to 95ee5be.
The code will soon need to iterate over ioeventfds provided by the inner virtio device. Accessing that device requires a lock, which works fine with internal iteration but breaks with external iteration. While an Arc could be used to avoid the need for a callback, implementations of this method will need to grab a device-internal lock and invoke the callback with the lock held. Returning a MutexGuard would constrain the device implementation unnecessarily.

This also changes VirtioTransport::ioeventfds() to take *two* addresses, not one, and the callback likewise takes two addresses. This is because relocating a BAR must happen atomically with respect to the device registering and unregistering ioeventfds. If the device manager used one call to unregister ioeventfds and a second call to reregister them, a device implementation could register ioeventfds between the two calls; such ioeventfds would not be unregistered, potentially causing errors during reregistration. Using a single call for both unregistration and reregistration allows the device implementation to hold a lock across both operations.

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
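A minimal sketch of the internal-iteration shape described above (the struct and method names are invented for illustration): a single callback receives both the old and the new address of each ioeventfd, so unregistration and reregistration happen within one invocation while the device holds its internal lock.

```rust
use std::sync::Mutex;

// Hypothetical device holding the BAR-relative offsets of its ioeventfds.
struct Device {
    ioeventfd_offsets: Mutex<Vec<u64>>,
}

impl Device {
    // Internal iteration: the lock is held across the whole loop, so no
    // ioeventfd can be registered between "unregister at old address" and
    // "register at new address".
    fn for_each_ioeventfd(&self, old_base: u64, new_base: u64, f: &mut dyn FnMut(u64, u64)) {
        let offsets = self.ioeventfd_offsets.lock().unwrap();
        for off in offsets.iter() {
            f(old_base + off, new_base + off);
        }
    }
}

fn main() {
    let dev = Device {
        ioeventfd_offsets: Mutex::new(vec![0x0, 0x4]),
    };
    let mut moves = Vec::new();
    dev.for_each_ioeventfd(0x1000, 0x2000, &mut |old, new| moves.push((old, new)));
    assert_eq!(moves, vec![(0x1000, 0x2000), (0x1004, 0x2004)]);
}
```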
The device manager calls VirtioTransport::ioeventfds whenever it needs to unregister or register ioeventfds. This function invokes a callback with each virtqueue ioeventfd, its old base address, and its new base address. Also invoke the callback with ioeventfds provided by the device itself. No functional change as no current device provides any. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Devices that register their own ioeventfds need to know the address at which to register them. This cannot be a callback on PciDevice, because PciDevice contains the VirtioDevice. Calls to the callback after activation are ignored; such calls are always preceded by a call to VirtioPciDevice::ioeventfds(), which already provides the new address. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Map memory from externally-provided FDs into the guest address space. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Virtio-vhost-user uses four virtqueues, grouped into two pairs. In each pair, one queue is used for requests and the other for replies. Each pair is also associated with an AF_UNIX stream socket used for communication with the frontend. Implement a queue pair abstraction that handles message framing and communication with the VM. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
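For context, a hedged sketch of the message framing such a queue pair must handle. The 12-byte header layout (request, flags, size) follows the vhost-user protocol specification; the parsing helper itself is illustrative, not the PR's actual code:

```rust
// Minimal vhost-user message header per the protocol spec:
// three little-endian u32 fields, followed by `size` payload bytes.
#[derive(Debug, PartialEq)]
struct Header {
    request: u32, // message type number
    flags: u32,   // version and reply bits
    size: u32,    // payload bytes following the header
}

fn parse_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < 12 {
        return None; // incomplete frame
    }
    let le32 = |b: &[u8]| u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some(Header {
        request: le32(&buf[0..4]),
        flags: le32(&buf[4..8]),
        size: le32(&buf[8..12]),
    })
}

fn main() {
    let mut frame = Vec::new();
    frame.extend_from_slice(&1u32.to_le_bytes());   // request
    frame.extend_from_slice(&0x1u32.to_le_bytes()); // flags: version 1
    frame.extend_from_slice(&8u32.to_le_bytes());   // payload size
    let hdr = parse_header(&frame).unwrap();
    assert_eq!(hdr.request, 1);
    assert_eq!(hdr.size, 8);
}
```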
These are significantly simpler than frontend->backend requests. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Implement frontend-to-backend requests and the corresponding backend-to-frontend replies. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This is a partial implementation of a virtio vhost-user device backend. There is no plan to implement migration for this device: in most use cases, the VM using it is tied to physical hardware, and migrating it makes no sense. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This removes dependencies on the rest of Cloud Hypervisor in preparation for making it a standalone crate. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This allows it to be used outside of Cloud Hypervisor. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This code is now used. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
The virtio-vhost-user crate already validates eventfds, so the device model in Cloud Hypervisor doesn't need to unsafely assume that using a non-eventfd as an eventfd is safe. Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Force-pushed from 95ee5be to be092a6.
Implement a vhost-guest device, formerly known as virtio-vhost-user. A vhost-guest device is a vhost-user server that forwards all messages to the guest.
Vhost-guest requires additional features not used by other virtio devices; therefore, this PR is based on top of #7841. It is also very complex, as shown by its rather large size. To mitigate the complexity, most of the code is in a separate crate that does not depend on Cloud Hypervisor internals. This will allow it to be used by other VMMs (such as crosvm) and, more importantly, will allow it to be unit tested without having to create a VM. I expect the number of unit tests needed to be significant.
A full-featured vhost-guest device also needs a fix to #7839. Securely hot-unplugging a vhost-guest device requires a fix to #7955. While the current code is not yet testable, I hope to have a testable version fairly soon after fixing #7955. It will require changes to Cloud Hypervisor core code to implement notifications.
Right now, the only vhost-guest driver I am aware of is an old version of crosvm. Crosvm uses VFIO to drive the device, which requires allow_unsafe_interrupts. I plan to upstream a Linux kernel driver that will provide a vhost-msg compatible API. There is corresponding work in QEMU to implement support for using such a driver, which will allow any QEMU-supported device to be used.

This implementation is optimized for performance at the expense of simplicity. Frontend-provided vring call and error eventfds are registered with the ioeventfd mechanism, and frontend-provided vring kick eventfds are registered as irqfds. The goal is to keep Cloud Hypervisor out of all hot paths. For instance, this might allow a future version of DPDK and/or VPP to provide high-performance network services to other VMs.
This implementation is designed to be secure against both malicious guests and malicious vhost-user clients. I assumed that vhost-user clients are likely in separate sandboxes from Cloud Hypervisor, and therefore cannot compromise it directly without a separate vulnerability. I consider it to be a security vulnerability in this code if a malicious guest or malicious vhost-user client can compromise it, even if the two must coordinate with each other to perform the attack.
This implementation uses a substantial amount of unsafe code. Some of it is taken from the vhost-user crate, which is not designed for non-AF_UNIX transports. I have encapsulated this code to the extent possible, but very careful review would be greatly appreciated.
The protocol library (under virtio-vhost-user/) is ready to review. I expect the main future change to be the addition of lots of unit tests. The device implementation (in virtio-devices/src/virtio_vhost_user.rs) is incomplete and not hooked up to the rest of Cloud Hypervisor. While review would be appreciated, it is less important at this time.