Why do we need ioeventfd?

Use Case 1: KVM/QEMU VirtIO Front-end/Back-end Communication

This is an interface provided by KVM that enables efficient communication with QEMU.

Let's walk through an example of how ioeventfd is used.

When the guest performs IO (PIO/MMIO) on an address, it triggers a VM exit; KVM then exits to QEMU, which needs some time to handle the request before re-entering the guest via ioctl(KVM_RUN). During this whole round trip the vCPU cannot do anything else.

On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution.

A question: why should the vCPU do anything else at all? If the MMIO has not completed, isn't it only natural for the vCPU to block and wait?

Because the value written by this IO is meaningless in itself: the IO is only a notification, used to trigger another IO (for example, to kick off an out-of-band DMA request). In that case there is no need to wait for the data to be fully written as with ordinary IO; it suffices to fire the notification and then wait for the actual IO to complete. As an example, the code below triggers a DMA; see the inline comments:

int dad_transfer(struct dad_dev *dev, bool write, void *buffer, size_t count)
{
    dma_addr_t bus_addr;

    // Map the buffer for DMA
    bus_addr = pci_map_single(dev->pci_dev, buffer, count, dev->dma_dir);
    //...

    // Each writeb/writel is one MMIO access, i.e. a single mov instruction.
    // Set up the device: program the command register and registers such as
    // addr and len.
    writeb(DAD_CMD_DISABLEDMA, dev->registers.command);
    writeb(write ? DAD_CMD_WR : DAD_CMD_RD, dev->registers.command);
    writel(cpu_to_le32(bus_addr), dev->registers.addr);
    writel(cpu_to_le32(count), dev->registers.len);

    // Writing the command register tells the device to start the DMA.
    // This write is only a trigger; in a guest this is exactly the part we
    // can accelerate: skip the exit to userspace and return early so the
    // code after it keeps running.
    // Since a guest driver is modified for virtio anyway, programming addr
    // and len for the DMA corresponds to filling in the virtqueue, and the
    // trigger write corresponds to the ioeventfd; the pattern is the same.
    writeb(DAD_CMD_ENABLEDMA, dev->registers.command);
    return 0;
}

ioeventfd is handled by a QEMU thread other than the vCPU thread (the iothread). The vCPU thread still exits, but the userspace side that gets notified through the fd is the iothread: the vCPU thread only takes an exit from non-root mode to root kernel mode (KVM), never a further exit from root kernel mode to root userspace mode, so the impact on vCPU performance is much smaller. Once the notification has been delivered, the iothread handles the rest. In short: ioeventfd exists to save the vCPU thread the cost of the KVM exit out to QEMU.

qemu-kvm的ioeventfd机制 - EwanHai - 博客园

ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd signal when written to by a guest. QEMU can register any arbitrary IO address with a corresponding eventfd and then pass the eventfd to a specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause side-effects in the emulated model or may return data to the caller. Therefore, an IO traps from the guest to the host, causes a VMX/SVM "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for other IO (such as to kick off an out-of-band DMA request, etc). For these patterns, the synchronous call is particularly expensive since we really only want to simply get our notification transmitted asynchronously and return as quickly as possible. All the synchronous infrastructure to ensure proper data-dependencies are met in the normal IO case is just unnecessary overhead for signalling. This adds additional computational load on the system, as well as latency to the signalling path.

The purpose of this mechanism is to let the guest notify the host in a lightweight way. It is lightweight because it does not cause a VM exit all the way back to userspace (the key point!) to be serviced by QEMU before control returns to the guest. That heavyweight synchronous IO path is unnecessary for these triggers: they only want to transmit a notification asynchronously and return as quickly as possible, and going through normal IO is too expensive for them. For virtio, for example, QEMU has an iothread that can poll the fd; the iothread itself runs in root mode, but the vCPU thread is spared the kernel->user context switch, so each guest IO returns as fast as possible. Once QEMU has finished the IO that the trigger was meant to kick off, it notifies the guest, instead of the vCPU thread sitting idle until the trigger returns.

QEMU can associate a specific guest address (a GPA for MMIO, or a port for PIO) with an eventfd, poll that eventfd, and register the address range with KVM via ioctl(KVM_IOEVENTFD). When the guest performs IO there and exits to KVM, KVM checks whether the exit falls inside a registered range; if so, it directly calls eventfd_signal on the corresponding eventfd, which wakes QEMU's polling loop and triggers the handler that performs the actual IO.
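As a minimal sketch of that flow (a hypothetical standalone VMM, not QEMU code; vm_fd, gpa, the 4-byte access size, and the function names are assumptions), registering and then polling an ioeventfd looks roughly like this:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

// Register an eventfd for a 4-byte MMIO range at gpa; any guest write to it
// signals the eventfd inside the kernel instead of exiting to userspace.
int register_notify_eventfd(int vm_fd, uint64_t gpa)
{
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    struct kvm_ioeventfd ioefd = {
        .addr  = gpa,   // start GPA of the watched range
        .len   = 4,     // access size
        .fd    = efd,
        .flags = 0,     // no datamatch: any written value signals
    };
    if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &ioefd) < 0)
        return -1;
    return efd;
}

// An iothread-style loop: poll the eventfd and handle each kick.
void poll_notify_eventfd(int efd)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t cnt;
    while (poll(&pfd, 1, -1) > 0) {
        read(efd, &cnt, sizeof(cnt)); // consume; cnt = number of signals
        // ...handle the virtqueue kick here...
    }
}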

ioeventfd is often mentioned together with irqfd.

KVM ioeventfd support patch: http://git.kernel.org/linus/d34e6b175e61821026893ec5298cc8e7558df43a


qemu中的eventfd——ioeventfd-CSDN博客

It breaks down into the following steps:

  • First, obtain an eventfd ("Create the ioeventfd" below).

  • Register it with KVM via ioctl(KVM_IOEVENTFD) ("Register ioeventfd to KVM").

  • When the guest writes the registered address, KVM signals the eventfd and the QEMU side listening on it handles the request ("Ioeventfd notify to QEMU").

Why do we need irqfd?

Couldn't QEMU simply inject interrupts into KVM with ioctl(KVM_INTERRUPT)? Why design a brand-new fd mechanism just for injecting interrupts from userspace into KVM?

The irqfd mechanism ties an eventfd to a global interrupt number (GSI); signalling that eventfd causes the corresponding interrupt to be injected into the guest.

The drawback of the old way: All must be injected to the guest via the KVM infrastructure. This patch adds a new mechanism to inject a specific interrupt to a guest using a decoupled eventfd mechanism: Any legal signal on the irqfd (using eventfd semantics from either userspace or kernel) will translate into an injected interrupt in the guest at the next available interrupt window.

So this was presumably also designed to decouple interrupt injection from the vCPU thread, so that the iothread can take over injecting interrupts.
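A minimal sketch of that decoupling (again hypothetical standalone VMM code; vm_fd, gsi, and the function names are assumptions): bind an eventfd to a GSI with KVM_IRQFD, after which any thread can inject the interrupt with a plain write:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <unistd.h>

// Bind an eventfd to a guest GSI; signalling the eventfd injects the interrupt.
int register_irqfd(int vm_fd, uint32_t gsi)
{
    int efd = eventfd(0, EFD_CLOEXEC);
    struct kvm_irqfd irqfd = {
        .fd  = (uint32_t)efd,
        .gsi = gsi,           // global interrupt number to inject
    };
    if (efd < 0 || ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
        return -1;
    return efd;
}

// Any thread (e.g. an iothread) can now inject without touching the vCPU thread:
void inject_irq(int efd)
{
    uint64_t one = 1;
    write(efd, &one, sizeof(one)); // KVM injects the GSI at the next interrupt window
}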

kvm: add support for irqfd [LWN.net]

struct EventNotifier QEMU

event_notifier_init

Represents an eventfd.

struct EventNotifier {
    //...
    int rfd;           // read end; on Linux with eventfd support, rfd == wfd
    int wfd;           // write end; differs from rfd only with the pipe fallback
    bool initialized;
};
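A simplified sketch of what event_notifier_init boils down to on a Linux host with eventfd support (error paths and the pipe fallback are omitted; this is not the verbatim QEMU code):

#include <stdbool.h>
#include <sys/eventfd.h>

// Both ends point at the same eventfd; hosts without eventfd fall back to a
// pipe, in which case rfd and wfd are two different descriptors.
static int event_notifier_init_sketch(struct EventNotifier *e)
{
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd < 0)
        return -1;
    e->rfd = e->wfd = fd;
    e->initialized = true;
    return 0;
}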

Create the ioeventfd

Register ioeventfd to KVM

[Qemu-devel] [PATCH v6 0/4] virtio: Use ioeventfd for virtqueue notify - Stefan Hajnoczi

We take MMIO as the example (PIO works similarly).

virtio_pci_common_write
    virtio_pci_start_ioeventfd

Create an event fd:

kvm_set_ioeventfd_mmio() QEMU

address_space_update_ioeventfds
    address_space_add_del_ioeventfds
        kvm_mem_ioeventfd_add
            fd = event_notifier_get_fd(e);
            kvm_set_ioeventfd_mmio
static int kvm_set_ioeventfd_mmio(int fd, hwaddr addr, uint32_t val,
                                  bool assign, uint32_t size, bool datamatch)
{
    struct kvm_ioeventfd iofd = {
        .datamatch = datamatch ? adjust_ioeventfd_endianness(val, size) : 0,
        // start GPA of the watched range
        .addr = addr,
        // length of the watched range in bytes
        .len = size,
        .flags = 0,
        .fd = fd,
    };

    //...
    if (datamatch) {
        iofd.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
    }
    if (!assign) {
        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
    }

    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
    //...
}

kvm_mem_ioeventfd_add() QEMU

static void kvm_mem_ioeventfd_add(MemoryListener *listener,
                                  MemoryRegionSection *section,
                                  bool match_data, uint64_t data,
                                  EventNotifier *e)
{
    int fd = event_notifier_get_fd(e);
    int r;

    r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
                               data, true, int128_get64(section->size),
                               match_data);
    //...
}
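Note that kvm_mem_ioeventfd_add is a MemoryListener callback: a device does not call it directly, it calls memory_region_add_eventfd, and KVM's memory listener propagates the change down to kvm_set_ioeventfd_mmio/_pio. A sketch of the device side (mr, nvqs and vq[i].notifier are assumed names; legacy virtio-pci really does share one notify register across queues, which is what datamatch is for):

// Register one eventfd per virtqueue on the shared legacy notify register;
// datamatch makes a 16-bit write of value i signal only queue i's eventfd.
for (int i = 0; i < nvqs; i++) {
    memory_region_add_eventfd(mr,
                              VIRTIO_PCI_QUEUE_NOTIFY, // offset within the BAR
                              2,                       // the notify write is 16-bit
                              true,                    // match_data
                              i,                       // data to match: queue index
                              &vq[i].notifier);
}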

Ioeventfd notify to QEMU

Let's use virtio-blk to show how ioeventfd is actually used.

Features/VirtioIoeventfd - QEMU

Notification is done per virtqueue.

pci_host_config_write_common
    virtio_write_config
        virtio_address_space_write
            memory_region_dispatch_write
                memory_region_write_accessor
                    mr->ops->write(mr->opaque, addr, tmp, size);

virtio_pci_config_write
virtio_ioport_write
    case VIRTIO_PCI_QUEUE_NOTIFY:
        if (val < VIRTIO_QUEUE_MAX) {
            virtio_queue_notify(vdev, val);
        }
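virtio_queue_notify then decides whether the kick goes through the host notifier (ioeventfd) or is emulated synchronously. A simplified version of the logic (condensed from QEMU's hw/virtio/virtio.c, not verbatim):

void virtio_queue_notify(VirtIODevice *vdev, int n)
{
    VirtQueue *vq = &vdev->vq[n];
    //...
    if (vq->host_notifier_enabled) {
        // ioeventfd path: just signal the host notifier; the iothread
        // (or main loop) picks it up and runs the handler
        event_notifier_set(&vq->host_notifier);
    } else if (vq->handle_output) {
        // fallback: emulate synchronously in the vCPU thread
        vq->handle_output(vdev, vq);
    }
}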

Driver notifies the device it is ready, device starts ioeventfd

Take virtio-blk as an example.

// driver in guest kernel
.probe()
    virtblk_probe
        virtio_device_ready
            // the driver tells the device that we are ready
            dev->config->set_status(dev, status | VIRTIO_CONFIG_S_DRIVER_OK);

// device in QEMU
VIRTIO_PCI_COMMON_STATUS / VIRTIO_PCI_STATUS // modern and legacy virtio spec
    if (val & VIRTIO_CONFIG_S_DRIVER_OK)
        virtio_device_start_ioeventfd / virtio_pci_start_ioeventfd / virtio_mmio_start_ioeventfd
            virtio_bus_start_ioeventfd
                vdc->start_ioeventfd(vdev);
                    virtio_blk_data_plane_start
                        // assign is passed in as true
                        // this is where the eventfds get created
                        k->set_guest_notifiers(qbus->parent, nvqs, true);
                            virtio_pci_set_guest_notifiers
                                virtio_pci_set_guest_notifier
                                    event_notifier_init
                                        e->rfd = fds[0];
                                        e->wfd = fds[1];
                        virtio_queue_aio_attach_host_notifier
                            // this handler is called whenever a new event shows up on the ioeventfd
                            aio_set_event_notifier(ctx, &vq->host_notifier, true, virtio_queue_host_notifier_read,
                                virtio_queue_host_notifier_aio_poll,
                                virtio_queue_host_notifier_aio_poll_ready);
                                virtio_queue_host_notifier_aio_poll_ready
                                    virtio_queue_notify_vq
                                        vq->handle_output()

Guest driver sends a request:
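A sketch of the guest side (standard Linux virtio driver API; the scatterlist setup and names like queue_and_kick are assumptions): queue the request on the virtqueue, then kick. The kick is exactly the notify write that the ioeventfd registered above turns into a lightweight exit.

#include <linux/virtio.h>
#include <linux/scatterlist.h>

// Guest driver, simplified: add buffers, then notify the device.
// sgs/out_num/in_num describe the request and are set up by the caller.
int queue_and_kick(struct virtqueue *vq, struct scatterlist **sgs,
                   unsigned int out_num, unsigned int in_num, void *data)
{
    int err = virtqueue_add_sgs(vq, sgs, out_num, in_num, data, GFP_ATOMIC);
    if (err)
        return err;
    if (virtqueue_kick_prepare(vq))  // skipped if the device suppressed notifications
        virtqueue_notify(vq);        // MMIO/PIO write -> ioeventfd signal
    return 0;
}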