VFIO
VFIO depends on support from the underlying hardware, such as an IOMMU.
VFIO (Virtual Function I/O) is a Linux kernel framework.
VFIO decomposes a physical device into a set of userspace APIs, and QEMU recomposes the device's resources into a virtual device.
The VFIO driver is an IOMMU/device agnostic framework for exposing direct device access to userspace, in a secure, IOMMU protected environment. In other words, this allows safe, non-privileged, userspace drivers.
Why is it designed this way? Virtual machines often make use of direct device access ("device assignment", also known as passthrough) when configured for the highest possible I/O performance. From a device and host perspective, this simply turns the VM into a userspace driver, with the benefits of significantly reduced latency, higher bandwidth, and direct use of bare-metal device drivers. (Running in non-root mode does not by itself make the guest "userspace"; the VM counts as a userspace driver because the VFIO API is called from QEMU, which is an ordinary userspace process.)
So VMs are not the only beneficiaries of VFIO; other fields, such as high-performance computing, benefit from it too.
How to use
To use VFIO on an Intel platform, add the boot parameter intel_iommu=on to the Linux kernel command line, because VFIO relies on the IOMMU.
Load the vfio-pci module:
sudo modprobe vfio-pci
Find the IOMMU group a device belongs to:
readlink /sys/bus/pci/devices/0000:06:00.0/iommu_group
List the other devices in the same group:
# List every device in the group that contains BDF 0000:06:00.0.
ls /sys/bus/pci/devices/0000:06:00.0/iommu_group/devices/
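The group lookup above can also be done from a program. The following is a minimal sketch, assuming the iommu_group symlink points at a path whose last component is the group number (for example ../../../../kernel/iommu_groups/26). The helper names iommu_group_from_target and iommu_group_of are illustrative and not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse the group number from a symlink target such as
 * "../../../../kernel/iommu_groups/26". Returns -1 on malformed input. */
int iommu_group_from_target(const char *target)
{
    assert(target != NULL);

    const char *base = strrchr(target, '/');
    base = base ? base + 1 : target;
    if (!isdigit((unsigned char)*base))
        return -1;

    char *end;
    long n = strtol(base, &end, 10);
    if (*end != '\0' || n > INT_MAX)
        return -1;
    return (int)n;
}

/* Look up the IOMMU group of a PCI device given its BDF,
 * e.g. "0000:06:00.0". Returns -1 if the link cannot be read. */
int iommu_group_of(const char *bdf)
{
    char path[PATH_MAX], target[PATH_MAX];

    int n = snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/iommu_group", bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    ssize_t len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0'; /* readlink() does not NUL-terminate */

    return iommu_group_from_target(target);
}
```

Calling iommu_group_of("0000:06:00.0") on a host where that device exists should return the same number that readlink prints as the last path component.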
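To pass a device through to a VM, first unbind it from its current driver so that the VFIO driver can claim it. Note that unbinding only the device you want to pass through is not enough: every device in the same iommu_group must be unbound as well, or passthrough will fail.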
$ echo 0000:06:00.0 | sudo tee /sys/bus/pci/devices/0000:06:00.0/driver/unbind
0000:06:00.0
$ echo 0000:00:05.0 | sudo tee /sys/bus/pci/devices/0000:00:05.0/driver/unbind
0000:00:05.0
$ echo 0000:00:05.1 | sudo tee /sys/bus/pci/devices/0000:00:05.1/driver/unbind
0000:00:05.1
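The unbind step can be sketched in C as well. This is an illustration under stated assumptions, not a reference implementation: bdf_is_valid and pci_unbind are hypothetical helper names, the address must be given in the full domain:bus:device.function form, and the process needs enough privileges to write to sysfs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* True if s looks like a full PCI address "DDDD:BB:DD.F"
 * (hex digits, function number 0-7). */
bool bdf_is_valid(const char *s)
{
    static const char pattern[] = "xxxx:xx:xx.o";

    assert(s != NULL);
    if (strlen(s) != sizeof(pattern) - 1)
        return false;

    for (size_t i = 0; pattern[i] != '\0'; i++) {
        char p = pattern[i], c = s[i];
        bool ok = (p == 'x') ? isxdigit((unsigned char)c) != 0
                : (p == 'o') ? (c >= '0' && c <= '7')
                : (c == p);
        if (!ok)
            return false;
    }
    return true;
}

/* Write the BDF to the device's driver/unbind file.
 * Returns 0 on success, -1 on any error. */
int pci_unbind(const char *bdf)
{
    char path[128];

    if (!bdf_is_valid(bdf))
        return -1;
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver/unbind", bdf);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1; /* no driver bound, or insufficient privileges */

    ssize_t len = (ssize_t)strlen(bdf);
    int rc = (write(fd, bdf, (size_t)len) == len) ? 0 : -1;
    close(fd);
    return rc;
}
```

Validating the address before building the path keeps malformed input from being written into an unrelated sysfs file.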
Look up the device's vendor ID and device ID:
$ lspci -n -s 06:00.0
06:00.0 0200: 10ec:8168 (rev 15)
Bind the device to the vfio-pci module:
echo 10ec 8168 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
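The value written to new_id is the vendor:device pair from lspci -n with the colon replaced by a space. A small sketch of that conversion follows, assuming the line has the same layout as the sample output above (slot, class, then vendor:device). The function name lspci_line_to_new_id is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Convert one line of `lspci -n` output, e.g.
 * "06:00.0 0200: 10ec:8168 (rev 15)", into the "vendor device" form
 * written to new_id ("10ec 8168").
 * Returns 0 on success, -1 on parse failure or if out is too small. */
int lspci_line_to_new_id(const char *line, char *out, size_t out_len)
{
    unsigned int vendor, device;

    assert(line != NULL && out != NULL);

    /* Skip the slot and class fields, then read vendor:device as hex. */
    if (sscanf(line, "%*s %*s %4x:%4x", &vendor, &device) != 2)
        return -1;

    int n = snprintf(out, out_len, "%04x %04x", vendor, device);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

For the sample line above, the output buffer ends up holding the same string that the echo command passes to tee.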
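Run ls /dev/vfio to check whether the bind succeeded: if it did, the number of the iommu_group that the device belongs to appears under /dev/vfio.
Give your user access to the group node:
sudo chown user:user /dev/vfio/26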
How userspace uses the device through the VFIO API:
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container, group, device;
    unsigned int i;
    void *buf;
    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* Create a new container */
    container = open("/dev/vfio/vfio", O_RDWR);
    if (container < 0)
        return 1;

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        return 1; /* Unknown API version */

    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
        return 1; /* Doesn't support the IOMMU driver we want */

    /* Open the group */
    group = open("/dev/vfio/26", O_RDWR);
    if (group < 0)
        return 1;

    /* Test the group is viable and available */
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
        return 1; /* Group is not viable (i.e., not all devices bound to vfio) */

    /* Add the group to the container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the IOMMU model we want */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get additional IOMMU info */
    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

    /* Allocate some space and set up a DMA mapping */
    buf = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    dma_map.vaddr = (uintptr_t)buf;
    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a file descriptor for the device bound to vfio-pci above */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:00.0");
    if (device < 0)
        return 1;

    /* Test and set up the device */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    for (i = 0; i < device_info.num_regions; i++) {
        struct vfio_region_info reg = { .argsz = sizeof(reg) };

        reg.index = i;
        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
        /* Set up mappings... read/write offsets, mmaps.
         * For PCI devices, config space is a region. */
    }

    for (i = 0; i < device_info.num_irqs; i++) {
        struct vfio_irq_info irq = { .argsz = sizeof(irq) };

        irq.index = i;
        ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
        /* Set up IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
    }

    /* Gratuitous device reset and go... */
    ioctl(device, VFIO_DEVICE_RESET);
    return 0;
}
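One detail the walkthrough skips is what VFIO_IOMMU_GET_INFO is useful for. As far as I understand the Type1 interface, iommu_info.iova_pgsizes is a bitmap of supported IOMMU page sizes (meaningful when VFIO_IOMMU_INFO_PGSIZES is set in iommu_info.flags), and a DMA mapping whose vaddr, iova, or size is not aligned to the smallest supported size is rejected. The sketch below checks that condition up front; min_page_size and dma_range_aligned are hypothetical helpers, not VFIO API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest page size in a bitmap where a set bit N means that pages of
 * 2^N bytes are supported. Returns 0 for an empty bitmap. */
uint64_t min_page_size(uint64_t pgsizes)
{
    return pgsizes & (~pgsizes + 1); /* isolate the lowest set bit */
}

/* True if vaddr, iova, and size are all multiples of the smallest
 * supported page size and the range is not empty. */
bool dma_range_aligned(uint64_t vaddr, uint64_t iova, uint64_t size,
                       uint64_t pgsizes)
{
    uint64_t min = min_page_size(pgsizes);

    assert(min == 0 || (min & (min - 1)) == 0); /* power of two */
    if (min == 0 || size == 0)
        return false;
    return ((vaddr | iova | size) & (min - 1)) == 0;
}
```

In the example above, the 1 MB buffer returned by mmap at IOVA 0 passes this check for a 4 KB minimum page size, because mmap returns page-aligned addresses.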
VFIO Device
In the context of virtualization, a VFIO device is a hardware device that has been assigned to a virtual machine (VM) through the VFIO framework. The framework gives the VM direct, isolated access to the device, bypassing the hypervisor's device emulation.
To assign a device with VFIO, add the following option to the QEMU command line:
-device vfio-pci,host=00:12.0,id=net0
Group
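Group: a group is the smallest hardware unit that the IOMMU can isolate for DMA. A group may contain a single device or several, depending on the IOMMU topology of the physical platform. For passthrough, all devices in a group must be assigned to the same VM. The devices in one group cannot be split between two different VMs, nor can some of them stay on the host while the rest are assigned to a guest, because a device in one guest could then use a DMA attack to read data that belongs to another guest, and physical DMA isolation would be lost. A group is therefore the minimum granularity that can be assigned to a VM.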
Once a host device is unbound from its driver and bound to the VFIO driver, a group node appears under /dev/vfio/. The node's name is the IOMMU group number. To use VFIO on that group, every device in the group must be unbound from its host driver.
Why is the group concept needed? For instance, an individual device may be part of a larger multifunction enclosure. While the IOMMU may be able to distinguish between devices within the enclosure, the enclosure may not require transactions between devices to reach the IOMMU. Examples of this could be anything from a multi-function PCI device with backdoors between functions to a non-PCI-ACS (Access Control Services) capable bridge allowing redirection without reaching the IOMMU.
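The "no split groups" rule can be stated as a simple invariant over a planned assignment. The following is a toy model only: the struct, the owner IDs (0 for the host, N for VM N), and the function name are all invented for illustration, and nothing here talks to the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct assignment {
    const char *bdf; /* PCI address of the device */
    int group;       /* IOMMU group number */
    int owner;       /* hypothetical owner ID: 0 = host, N > 0 = VM N */
};

/* True if every IOMMU group has a single owner, i.e. no group is split
 * between two VMs or between the host and a VM. */
bool groups_not_split(const struct assignment *a, size_t n)
{
    assert(a != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (a[i].group == a[j].group && a[i].owner != a[j].owner)
                return false;
    return true;
}
```

With the three devices from the how-to section all in group 26, the check passes only if all three carry the same owner.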
Container
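A container is a concept from the VFIO software design, not a concept of the IOMMU hardware.
Container: for a VM, a container can be thought of roughly as the physical memory address space of one VM domain. For a userspace driver, a container can be a collection of groups. A container is a set of groups.
When different IOMMU groups should share the same TLB and page tables (the tables used for address translation), those groups are placed in the same container, which is why a container can be viewed as a set of IOMMU groups.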
On its own, the container provides little functionality, with all but a couple version and extension query interfaces locked away. The user needs to add a group into the container for the next level of functionality.
Once the group is ready, it may be added to the container by opening the VFIO group character device (/dev/vfio/$GROUP) and using the VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the previously opened container file.
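Since a container is a set of groups, the same ioctl can be repeated for several groups. The sketch below assumes the container fd and the group fds were already opened (from /dev/vfio/vfio and /dev/vfio/$GROUP respectively). attach_groups is an illustrative name, not a VFIO function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Attach each already-opened group fd to the container fd.
 * Returns how many groups were attached; stops at the first failure. */
size_t attach_groups(int container, const int *groups, size_t n)
{
    size_t i;

    assert(groups != NULL || n == 0);

    for (i = 0; i < n; i++) {
        /* The ioctl takes a pointer to the container fd. */
        if (ioctl(groups[i], VFIO_GROUP_SET_CONTAINER, &container) < 0)
            break;
    }
    return i;
}
```

If the return value is smaller than n, the group at that index could not join. According to the kernel documentation, a group that fails to attach to a container with existing groups needs a new, empty container instead.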