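VT-d is Intel's IOMMU implementation; AMD's counterpart is called AMD-Vi. The IOMMU is the key component of device passthrough: the whole passthrough stack is built on top of it.

DMA remapping solves memory virtualization from the device's point of view, and interrupt remapping solves CPU virtualization from the device's point of view.

Modern x86 CPUs usually include an IOMMU. In modern systems the IOMMU is commonly integrated with the PCIe root complex, so it is fair to say, loosely, that each CPU has an IOMMU.

The Intel IOMMU driver lives under drivers/iommu/intel/. This directory is also the Intel IOMMU subsystem, maintained by Lu Baolu.

The IOMMU translates device virtual addresses into physical memory addresses;
the MMU translates CPU virtual addresses into physical memory addresses.

Without an IOMMU, a device does DMA directly with physical addresses. With an IOMMU, the notion of a device virtual address (IOVA) is introduced.

This article is good, but only Lenovo employees can download it: Lenovo Press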

《Linux Kernel IOMMU》翻译 - 知乎

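Can a physical device without SR-IOV enabled be passed through to a guest?

VFIO device passthrough does not depend on SR-IOV at all. They are two independent technologies that can work together. VFIO's core function is to pass an entire physical device (PF) through to a single VM; this is its most basic and most common use case.

Userspace talks to the VFIO subsystem in the kernel through ioctl.

So yes, it can. After a device without SR-IOV is passed through to a guest, can the host still manage the device?

In that case, can both the host and the guest see the device in lspci?

Does passthrough place requirements on the device? Can legacy devices be passed through? Is the IOMMU non-intrusive to the device?

There are no special requirements on the device itself; in principle any device can be passed through, subject to the IOMMU group constraints described under VFIO Group.

How does a passthrough device handle the control plane?

How is a passthrough device presented to the guest as a PCIe device?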

`iommu=pt`

pt = pass-thru.

When passing devices through with KVM, the kernel parameters intel_iommu=on iommu=pt are usually set; intel_iommu=on enables the Intel IOMMU.

iommu=pt enables IOMMU translation only for devices used in pass-through and does not enable it for devices used by the host, which improves performance for host PCIe devices (those not passed through to a VM).

In other words, DMA remapping is applied only to devices passed through to a guest. Devices used by the host itself need no DMA remapping, so they can DMA anywhere, including into a VM's memory.

This setting assumes that host devices and the VMM are benign. After all, the VMM on the host can already access any guest's memory, so why shouldn't a host device?

With iommu=pt, identity mapping is set up in one of two ways:

If the hardware supports the pass-through translation type (an IOMMU hardware capability), configuring that translation type gives identity mapping, and no IOMMU page table is needed;
If the hardware does not support the pass-through translation type, IOMMU page tables must be set up so that IOVA maps 1:1 to HPA. With hw_pass_through=0 every access still walks the IOMMU page table, so performance is lower than with hw_pass_through=1.
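
The two identity-mapping cases can be sketched as a toy model. This is only an illustration: the enum and function names are made up for this note and are not the kernel's. The point is that both modes return the same address and differ only in whether a table walk is paid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; this models the two cases, not the kernel's code. */
enum identity_mode {
    IDENTITY_HW_PASS_THROUGH,   /* hardware skips translation entirely */
    IDENTITY_PAGE_TABLE_1TO1    /* page table maps every IOVA to itself */
};

static enum identity_mode pick_identity_mode(bool hw_pass_through)
{
    return hw_pass_through ? IDENTITY_HW_PASS_THROUGH
                           : IDENTITY_PAGE_TABLE_1TO1;
}

/* In both modes the result is the same address; only the cost differs. */
static uint64_t identity_translate(enum identity_mode mode, uint64_t iova,
                                   unsigned *table_walks)
{
    assert(table_walks != NULL);
    if (mode == IDENTITY_PAGE_TABLE_1TO1)
        (*table_walks)++;       /* the 1:1 table still has to be walked */
    return iova;
}
```

Calling identity_translate() with each mode and comparing the walk counter shows why hw_pass_through=1 is expected to be the cheaper option.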

In fact this parameter accepts many values: The kernel's command-line parameters (The Linux Kernel documentation)

Notes about iommu=pt kernel parameter - L

iommu_setup
    if (!strncmp(p, "pt", 2))
        iommu_set_default_passthrough(true);
            iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
            iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;

`DMAR`

DMA remapping.

VT-d / IOMMU Concepts

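Some concepts defined by VT-d that need to be introduced up front:

IOMMU Domain

Abstract isolated environments in the platform to which a subset of host physical memory is allocated. Every device can be assigned to a domain, which is implemented by giving each domain its own set of paging structures. When the device attempts to access system memory, the DMA-remapping hardware intercepts the access and utilizes the page tables.

Like the TLB, these paging structures can also be cached, and not only in the IOTLB; see 6.2 Address Translation Caches.

The DMA remapping architecture facilitates flexible assignment of I/O devices to an arbitrary number of domains. So a domain can contain many devices. As far as I understand, a device is attached to one domain at a time for requests-without-PASID, while requests-with-PASID let one device target several address spaces, which is where the relationship becomes many-to-many.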

DMA Remapping

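To let a guest directly access a device, the VMM must support isolation of DMA requests (otherwise a device could reach any memory through DMA).

A passthrough device also uses DMA to access the VM's main memory to improve I/O performance. Since the device is assigned directly to one specific VM, its DMA must be made safe: a VM's passthrough device must not reach another VM's memory through DMA, nor access host memory directly, otherwise the consequences are severe. The device's DMA therefore needs both "DMA isolation" and "DMA address translation":

Isolation confines the device's DMA accesses to the physical address space of its own VM, so that no access goes out of bounds;
address translation ensures that the device's DMA is correctly redirected to the real physical addresses backing the VM's physical address space.

Why does a passthrough device have a DMA safety problem at all? The reason is simple: the guest driver programs DMA with GPAs. Without isolation and translation, each GPA would be interpreted as an HPA, and the HPA with the same numeric value can be anywhere, so the access would hit another VM's memory or corrupt host memory. A mechanism that translates each GPA to the corresponding HPA is therefore required for the device's DMA to complete correctly.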
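
Isolation and translation can be shown together in a minimal sketch. It assumes a made-up single-level table indexed by GPA page number; real VT-d paging structures are multi-level, and toy_domain, toy_map and toy_translate are names invented for this note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_PAGES   16            /* tiny guest: 16 pages of GPA space */

/* Hypothetical single-level table: index = GPA page number. */
struct toy_domain {
    bool     present[NR_PAGES];
    uint64_t hpa_page[NR_PAGES]; /* host page frame backing each GPA page */
};

static void toy_map(struct toy_domain *d, uint64_t gpa, uint64_t hpa)
{
    uint64_t idx = gpa >> PAGE_SHIFT;

    assert(idx < NR_PAGES);
    d->present[idx] = true;
    d->hpa_page[idx] = hpa >> PAGE_SHIFT;
}

/* Returns true and fills *hpa on success; false models a blocked DMA. */
static bool toy_translate(const struct toy_domain *d, uint64_t gpa,
                          uint64_t *hpa)
{
    uint64_t idx = gpa >> PAGE_SHIFT;

    if (idx >= NR_PAGES || !d->present[idx])
        return false;            /* isolation: outside this VM's memory */
    *hpa = (d->hpa_page[idx] << PAGE_SHIFT) |
           (gpa & ((1u << PAGE_SHIFT) - 1));
    return true;
}
```

A device attached to one toy_domain can only ever produce HPAs that were mapped into that domain; anything else is rejected rather than being interpreted as a raw host address.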

Keeping the DMA model from this ASCII diagram in mind helps a lot in understanding DMA remapping.

DMA Remapping type

Remapping hardware treats inbound memory requests from root-complex integrated devices and PCI Express attached discrete devices into two categories (plain DMA, and DMA that carries extra information):

Requests without address-space-identifier: These are the normal memory requests from endpoint devices. These requests typically specify the type of access (read/write/atomics), targeted DMA address/size, and source-id of the device originating the request (e.g., BDF). (Requests-without-PASID)
Requests with address-space-identifier: These are memory requests with additional information identifying the targeted address space from endpoint devices. Beyond attributes in normal requests, these requests specify the targeted Process Address Space Identifier (PASID), and Privileged-mode-Requested (PR) flag (to distinguish user versus supervisor access). For details, refer to the PASID Extended Capability Structure in the PCI Express specification. (Requests-with-PASID)
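
The source-id mentioned above is the 16-bit PCI requester ID, i.e. the BDF packed as bus[15:8], device[7:3], function[2:0]. A small sketch of the packing (the helper names are made up for this note):

```c
#include <assert.h>
#include <stdint.h>

/* Source-id layout used by PCI requester IDs: bus[15:8] dev[7:3] fn[2:0]. */
static uint16_t bdf_to_source_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

static uint8_t source_id_bus(uint16_t sid) { return (uint8_t)(sid >> 8); }
static uint8_t source_id_dev(uint16_t sid) { return (sid >> 3) & 0x1f; }
static uint8_t source_id_fn(uint16_t sid)  { return sid & 0x7; }
```

For example, device 0000:06:00.0 packs to 0x0600, and the low byte alone is what the code further down calls devfn.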

DMA Remapping use case

Depending on the software usage model, the DMA address space of a device may be

the Guest-Physical Address (GPA) space of a virtual machine to which it is assigned
Virtual Address (VA) space of host application on whose behalf it is performing DMA requests
Guest Virtual Address (GVA) space of a client application executing within a virtual machine
I/O virtual address (IOVA) space managed by host software
Guest I/O virtual address (GIOVA) space managed by guest software

In all cases, DMA remapping transforms the address in a DMA request issued by an I/O device to its corresponding Host-Physical Address (HPA).

IOMMU Page Fault

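Can the IOMMU take a page fault?

Apparently not in the common case, unless a feature such as Nested Translation is configured?

IOMMUFD: Deliver IO page faults to user space [LWN.net]

On a non-virtualized platform that has an IOMMU, is the IOMMU used when a device does DMA?

Three possibilities:

Not used at all;
Used, and the IOMMU has the hardware identity-mapping capability, so that is what is used;
Used, with the IOMMU page tables configured as a 1:1 mapping.

Probably the second one (when the cmdline has iommu=pt).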

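IOMMU Page Table

It lives in memory. The relevant IOMMU registers must be programmed to point to it.

Why does a device accessing CPU memory need an IOMMU page table for IOVA -> PA, while the CPU accessing device memory needs no PA -> IOVA mapping?

Because the CPU (that is, the driver) holds the dma_handle (see the DMA note), it already has the translated result. The real question is how that result gets used:

The CPU could access the dma_handle directly: that would require introducing a new page table, and to avoid introducing…
The CPU writes the dma_handle to the device through MMIO (control path; the MMIO mapping from PA to bus/DMA address is generally linear and already programmed into the host bridge, so no complex translation logic such as a page table is needed), and the device then goes through the I/O page table to write memory (data path). This roundabout route avoids designing a second set of page tables.

IOMMU Page Table Address Translation

In a non-virtualized environment, if the IOMMU is used for DMA, the IOMMU page table stores IOVA -> HPA translations.
In a virtualized environment with a passthrough device, the guest driver uses GPAs as DMA addresses, so the IOMMU page table stores IOVA (= GPA) -> HPA translations.

GVA -> GPA is handled by the guest's own MMU page tables.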

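Updating the IOMMU Page Table on the Host

This covers the case where a device driver on the host tells the device to DMA through the IOMMU.

When is the IOMMU page table updated? There are two cases:

A direct mapping (identity mapping) is defined in the ACPI tables. When probing/initializing the IOMMU hardware, the IOMMU hardware-specific layer parses the direct-mapping information stored in the ACPI tables and configures the I/O page table accordingly.
A driver issues a DMA request, which is forwarded and handled across the driver, the DMA subsystem and the IOMMU subsystem:
1. The driver calls an API such as dma_map_page() or dma_map_single() to obtain a DMA address for a buffer whose PA is known. Note that inside a guest this physical address is a GPA;
2. If the DMA request is direct-mapped, the DMA subsystem returns the computed PA to the driver directly. Here the PA is the DMA address, no page-table translation is needed, and this is the most efficient path;
3. If the DMA request is not direct-mapped, the DMA subsystem calls iommu_dma_map_page() to ask the IOMMU subsystem to generate an IOVA for the PA.
4. The IOMMU subsystem calls into the vendor-specific layer, such as the Intel IOMMU subsystem, which maps the IOVA to the PA and writes the corresponding IOMMU page-table entries so that the IOMMU hardware can translate correctly.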

i40e_tx_map
    // Takes a VA, returns a DMA address
    dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE)
        // virt_to_page() converts the VA to its struct page
        dma_map_page_attrs(dev, virt_to_page(skb->data)……)
            // Direct mapping: taken when iommu=pt; probably also with amd_iommu=off (to be verified)
            // If swiotlb is in use, it goes through swiotlb.
            // Otherwise it is phys_to_dma()
            dma_direct_map_page(dev, page……)
                // May return the PA as is, i.e. the PA and the DMA address handle are the same value.
                phys_to_dma()
            // Map an IOVA (DMA address) to the PA and create the IOMMU I/O page table entry.
            // This branch first gets the PA, then allocates an IOVA for the device to use as the DMA address instead.
            // It then installs IOVA -> PA in the IOMMU page table (this step is vendor specific, i.e. intel_iommu_map_pages()).
            else if (use_dma_iommu(dev))
                iommu_dma_map_page(dev, page….)
                    // Convert the page to a physical address PA, then map an IOVA (DMA address) to it
                    iova = __iommu_dma_map(dev, phys…)
                        // Generate an IOVA for the PA to serve as the DMA handle
                        iova = iommu_dma_alloc_iova(domain, size,..)
                            // Try a 32-bit IOVA first; DMA_BIT_MASK(32) >> shift gives the highest page frame number
                            alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift, false)
                            // If the 32-bit allocation fails, allocate from the full range
                            alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true)
                        // Update the IOVA -> PA mapping in the IOMMU page table (IOMMU vendor specific)
                        iommu_map
                            iommu_map_nosync
                                ops->map_pages()
                                    intel_iommu_map_pages
                                        intel_iommu_map
                                            __domain_mapping
                                                pfn_to_dma_pte
                                                    // The page table base address is here
                                                    parent = domain->pgd
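
The "32-bit first, then the full range" step in the trace above can be illustrated with a toy allocator. This is a sketch under stated assumptions: it is a simple top-down bump allocator over page frame numbers, not the kernel's rbtree-based IOVA allocator, and every name in it is hypothetical. PFN 0 is reserved as the failure value.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PFN_4G 0x100000ULL   /* first PFN at or above 4 GiB (4 KiB pages) */

/* Hypothetical top-down bump allocator over page frame numbers (PFNs). */
struct toy_iovad {
    uint64_t next_free_32;   /* highest free PFN below the 32-bit limit */
    uint64_t floor_32;       /* lowest PFN the 32-bit window may use    */
    uint64_t next_free_64;   /* highest free PFN below dma_limit        */
};

/* Returns the first PFN of the range, or 0 if the window is exhausted. */
static uint64_t toy_alloc(uint64_t *next_free, uint64_t floor, uint64_t npages)
{
    assert(floor > 0 && npages > 0);
    if (*next_free + 1 < floor + npages)
        return 0;
    *next_free -= npages;
    return *next_free + 1;
}

static uint64_t toy_alloc_iova(struct toy_iovad *iovad, uint64_t npages)
{
    /* Prefer addresses below 4 GiB, as the trace above does. */
    uint64_t pfn = toy_alloc(&iovad->next_free_32, iovad->floor_32, npages);

    if (!pfn)   /* fall back to the device's full DMA range */
        pfn = toy_alloc(&iovad->next_free_64, TOY_PFN_4G, npages);
    return pfn;
}
```

With a 32-bit window of only two pages, the third single-page allocation falls through to the high window, which mirrors the two alloc_iova_fast() calls.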

Interrupt Remapping

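First, why do interrupts need remapping at all? An interrupt is sent by a device to a specific CPU. If a device is assigned only to particular vCPUs, it should not be able to send interrupts to other CPUs.

The VMM may utilize the interrupt-remapping hardware to distinguish interrupt requests from specific devices and route them to the appropriate VMs to which the respective devices are assigned.

An interrupt request is first intercepted by the interrupt-remapping hardware (the IOMMU), which looks it up in the interrupt remapping table and then delivers it to the target CPU.

Interrupt remapping can also be used to distinguish IPIs inside a VM from external interrupts, by setting different attributes.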
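
The table lookup can be sketched as follows. The entry layout here is heavily simplified and hypothetical (a real remapping entry carries many more fields); it only shows that the device supplies an index, and the table, not the device, decides the destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IRT_SIZE 8

/* Hypothetical, heavily simplified remapping entry. */
struct toy_irte {
    bool     present;
    uint16_t source_id;   /* only this requester may use the entry */
    uint32_t dest_cpu;
    uint8_t  vector;
};

/* Returns true and fills dest/vector if the request is allowed. */
static bool toy_remap_irq(const struct toy_irte *irt, uint16_t handle,
                          uint16_t requester, uint32_t *dest, uint8_t *vector)
{
    assert(irt != NULL);
    if (handle >= IRT_SIZE || !irt[handle].present)
        return false;                 /* no entry: interrupt is blocked */
    if (irt[handle].source_id != requester)
        return false;                 /* entry belongs to another device */
    *dest = irt[handle].dest_cpu;
    *vector = irt[handle].vector;
    return true;
}
```

In this model, retargeting an interrupt means rewriting dest_cpu in one table entry, without touching the device.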

Interrupt Migration

The interrupt-remapping architecture may be used to support dynamic re-direction of interrupts when the target for an interrupt request is migrated from one logical processor to another logical processor. Without interrupt-remapping hardware support, re-balancing of interrupts require software to reprogram the interrupt sources. However re-programming of these resources are non-atomic (requires multiple registers to be re-programmed), often complex (may require temporary masking of interrupt source), and dependent on interrupt source characteristics (e.g. no masking capability for some interrupt sources; edge interrupts may be lost when masked on some sources, etc.)

Interrupt-remapping enables software to efficiently re-direct interrupts without re-programming the interrupt configuration at the sources. Interrupt migration may be used by OS software for balancing load across processors (such as when running I/O intensive workloads), or by the VMM when it migrates virtual CPUs of a partition with assigned devices across physical processors to improve CPU utilization.

VT-d Posted Interrupt / Interrupt Posting

VT-x also has a notion of interrupt posting, which is not covered here.

SR-IOV and SIOV introduce many assignable functions such as VFs, which greatly increases the system's demand for interrupt vectors. One of the main problems VT-d Posted Interrupt solves is running out of interrupt vectors in virtualized environments.

Hardware support for interrupt posting addresses this problem by allowing interrupt requests from device functions/resources assigned to virtual machines to operate in virtual vector space, thereby scaling naturally with the number of virtual machines or virtual processors.

In other words, one physical CPU may run many vCPUs, and each vCPU has its own virtual vector space, so they are isolated from one another. As the number of vCPUs grows, the number of usable vectors grows with it. Each interrupt effectively carries an extra field saying which vCPU it targets, instead of a vector meaning the same thing on every CPU. That is how the vector space is extended.

The second problem it solves is efficiency: Specifically, whenever an external interrupt destined for a virtual machine is received by the CPU, control is transferred to the VMM, requiring the VMM to process and inject corresponding virtual interrupt to the virtual machine. The control transfers associated with such VMM processing of external interrupts incurs both hardware and software overheads.

If the target virtual processor is running on any logical processor, hardware can directly deliver external interrupts to the virtual processor without any VMM intervention. Interrupts received while the target virtual processor is preempted (waiting for its turn to run) can be accumulated in memory by hardware for delivery when the virtual processor is later scheduled. This avoids disrupting execution of currently running virtual processors on external interrupts for non-running virtual machines. If the target virtual processor is halted (idle) at the time of interrupt arrival or if the interrupt is qualified as requiring real-time processing, hardware can transfer control to VMM, enabling VMM to schedule the virtual processor and have hardware directly deliver pending interrupts to that virtual processor.
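
The "accumulate in memory, notify only when useful" behaviour can be modelled with a small descriptor. This is a sketch from my reading of the text above: the field names pir, on and sn are my shorthand for the pending bitmap, the outstanding-notification flag and the suppress-notification flag, and the urgent/real-time case is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a posted-interrupt descriptor. */
struct toy_pi_desc {
    uint64_t pir[4];   /* 256 pending bits, one per virtual vector    */
    bool     on;       /* a notification is already outstanding       */
    bool     sn;       /* suppress notifications (vCPU is preempted)  */
};

/* Post one virtual vector; returns true if a notification must be sent. */
static bool toy_post_interrupt(struct toy_pi_desc *pd, uint8_t vvec)
{
    assert(pd != NULL);
    pd->pir[vvec / 64] |= 1ULL << (vvec % 64);   /* accumulate in memory */
    if (pd->on || pd->sn)
        return false;      /* already notified, or vCPU not running */
    pd->on = true;
    return true;
}
```

Only the first interrupt for a running vCPU produces a notification; later ones, and all interrupts for a preempted vCPU, just set bits that are picked up when the vCPU next runs.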

Even after the vCPU migrates to another pCPU, virtual interrupts targeting it are moved along automatically and still get handled: Hardware support for interrupt posting enables VMM software to atomically comigrate all interrupts targeting a virtual processor when the virtual processor is scheduled to another logical processor.

IOMMU Endpoint

Presumably represents one device?

typedef struct VirtIOIOMMUEndpoint {
    // Has its own id
    uint32_t id;
    // The domain it belongs to (probably a VM)
    VirtIOIOMMUDomain *domain;
    // Has its own IOMMU memory region.
    IOMMUMemoryRegion *iommu_mr;
    // Can be an entry in a list
    QLIST_ENTRY(VirtIOIOMMUEndpoint) next;
} VirtIOIOMMUEndpoint;

vIOMMU / VirtIO-IOMMU

VT-d emulation (guest vIOMMU).

Here is the simplest example, booting a Q35 machine with an e1000 card and a guest vIOMMU:

qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -m 2G \
                   -device intel-iommu,intremap=on \
                   -netdev user,id=net0 \
                   -device e1000,netdev=net0 \
                   $IMAGE_PATH

Features/VT-d - QEMU

There is a paper here:

Amit.pdf

`struct IOMMUDevice` QEMU

This probably represents a device that sits behind an IOMMU, i.e. a device managed by the IOMMU, rather than the IOMMU device itself.

virtio_iommu_device_bypassed
    // The sid maps to an endpoint, so this represents a device managed by the IOMMU.
    ep = g_tree_lookup(s->endpoints, GUINT_TO_POINTER(sid));

typedef struct IOMMUDevice {
    // The vIOMMU managing this device; actually of type `VirtIOIOMMU`
    void         *viommu;
    // This device's BDF, see `virtio_iommu_get_bdf()`
    PCIBus       *bus;
    int           devfn;
    IOMMUMemoryRegion  iommu_mr;
    AddressSpace  as;
    MemoryRegion root;          /* The root container of the device */
    // An alias of the shared memory region, i.e. the memory region
    // shared by all devices that are in bypass mode
    MemoryRegion bypass_mr;
} IOMMUDevice;

`virtio_iommu_find_add_as()` QEMU

A BDF should be enough to identify a device; this function does some initialization work.

Its main purpose is to return the AddressSpace.

pci_device_iommu_address_space
static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque, int devfn)
{
    VirtIOIOMMU *s = opaque;
    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
    static uint32_t mr_index;
    IOMMUDevice *sdev;

    if (!sbus) {
        sbus = g_malloc0(sizeof(IOMMUPciBus) + sizeof(IOMMUDevice *) * PCI_DEVFN_MAX);
        sbus->bus = bus;
        g_hash_table_insert(s->as_by_busptr, bus, sbus);
    }

    sdev = sbus->pbdev[devfn];
    if (!sdev) {
        char *name = g_strdup_printf("%s-%d-%d", TYPE_VIRTIO_IOMMU_MEMORY_REGION, mr_index++, devfn);

        // Create the IOMMUDevice
        sdev = sbus->pbdev[devfn] = g_new0(IOMMUDevice, 1);
        sdev->viommu = s;
        sdev->bus = bus;
        sdev->devfn = devfn;

        //...
        // Initialize the root memory region
        memory_region_init(&sdev->root, OBJECT(s), name, UINT64_MAX);
        // Initialize the root address space
        address_space_init(&sdev->as, &sdev->root, TYPE_VIRTIO_IOMMU);

        /*
         * Build the IOMMU disabled container with aliases to the
         * shared MRs.  Note that aliasing to a shared memory region
         * could help the memory API to detect same FlatViews so we
         * can have devices to share the same FlatView when in bypass
         * mode. (either by not configuring virtio-iommu driver or with
         * "iommu=pt").  It will greatly reduce the total number of
         * FlatViews of the system hence VM runs faster.
         */
        memory_region_init_alias(&sdev->bypass_mr, OBJECT(s),
                                 "system", get_system_memory(), 0,
                                 memory_region_size(get_system_memory()));

        memory_region_init_iommu(&sdev->iommu_mr, sizeof(sdev->iommu_mr),
                                 TYPE_VIRTIO_IOMMU_MEMORY_REGION,
                                 OBJECT(s), name,
                                 UINT64_MAX);

        /*
         * Hook both the containers under the root container, we
         * switch between iommu & bypass MRs by enable/disable
         * corresponding sub-containers
         */
        memory_region_add_subregion_overlap(&sdev->root, 0,
                                            MEMORY_REGION(&sdev->iommu_mr),
                                            0);
        memory_region_add_subregion_overlap(&sdev->root, 0,
                                            &sdev->bypass_mr, 0);

        virtio_iommu_switch_address_space(sdev);
        g_free(name);
    }
    return &sdev->as;
}

`struct VirtIOIOMMU` QEMU

This represents a VirtIO-IOMMU device.

struct VirtIOIOMMU {
    // It is itself a VirtIO device
    VirtIODevice parent_obj;
    VirtQueue *req_vq;
    VirtQueue *event_vq;
    struct virtio_iommu_config config;
    uint64_t features;
    GHashTable *as_by_busptr;
    IOMMUPciBus *iommu_pcibus_by_bus_num[PCI_BUS_MAX];
    // A VirtIOIOMMU must be on a PCIBus.
    PCIBus *primary_bus;
    ReservedRegion *reserved_regions;
    uint32_t nb_reserved_regions;
    // All domains it manages.
    GTree *domains;
    QemuRecMutex mutex;
    // All endpoints it manages
    GTree *endpoints;
    bool boot_bypass;
    Notifier machine_done;
    bool granule_frozen;
};

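The IOMMU and the PCI bus appear to be in a one-to-one relationship: an IOMMU must be bound to a PCI bus, and a PCI bus can have at most one IOMMU device. The device must be attached to a PCI bus, otherwise realize fails:

virtio_iommu_device_realize
    VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
    //...
    if (s->primary_bus) {
        // bus->iommu_ops = ops;
        // bus->iommu_opaque = opaque;
        // The IOMMU already references the bus; make the bus reference the IOMMU too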
        pci_setup_iommu(s->primary_bus, &virtio_iommu_ops, s);
    } else {
        error_setg(errp, "VIRTIO-IOMMU is not attached to any PCI bus!");
    }

`virtio_iommu_get_endpoint()` QEMU

Whether this endpoint_id has an IOMMU_MR is what matters: if the endpoint is not yet in s->endpoints and an MR exists for it, a new endpoint is created and inserted.

static VirtIOIOMMUEndpoint *virtio_iommu_get_endpoint(VirtIOIOMMU *s, uint32_t ep_id)
{
    VirtIOIOMMUEndpoint *ep;
    IOMMUMemoryRegion *mr;

    ep = g_tree_lookup(s->endpoints, GUINT_TO_POINTER(ep_id));
    if (ep) {
        return ep;
    }
    mr = virtio_iommu_mr(s, ep_id);
    if (!mr) {
        return NULL;
    }
    ep = g_malloc0(sizeof(*ep));
    ep->id = ep_id;
    ep->iommu_mr = mr;
    trace_virtio_iommu_get_endpoint(ep_id);
    g_tree_insert(s->endpoints, GUINT_TO_POINTER(ep_id), ep);
    return ep;
}

`virtio_iommu_switch_address_space()` QEMU

Selects a different address space depending on whether the device is bypassed.

// Return whether the device is using IOMMU translation.
static bool virtio_iommu_switch_address_space(IOMMUDevice *sdev)
{
    bool use_remapping;

    //...
    // bypass means we do not use the IOMMU to translate
    use_remapping = !virtio_iommu_device_bypassed(sdev);
    //...
    // Turn off first then on the other
    // Disable the MR used for bypass,
    // enable the MR used for translation
    if (use_remapping) {
        memory_region_set_enabled(&sdev->bypass_mr, false);
        memory_region_set_enabled(MEMORY_REGION(&sdev->iommu_mr), true);
    // and the other way round
    } else {
        memory_region_set_enabled(MEMORY_REGION(&sdev->iommu_mr), false);
        memory_region_set_enabled(&sdev->bypass_mr, true);
    }

    return use_remapping;
}

`struct IOMMUPciBus` QEMU

Compared with PCIBus, it wraps a PCIBus pointer and adds an array of IOMMUDevice pointers that records the devices already probed.

typedef struct IOMMUPciBus {
    PCIBus       *bus;
    IOMMUDevice  *pbdev[]; /* Parent array is sparse, so dynamically alloc */
} IOMMUPciBus;

`pci_device_get_iommu_bus_devfn()` QEMU

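This function tells us something: for PCIe, many ordinary PCI buses hang below one IOMMU bus. The devices managed under the IOMMU bus's domain may sit directly on the IOMMU bus or on a PCI bus below it. Only the IOMMU bus can call get_address_space() to obtain the address space of the whole domain (one address space per domain is perhaps what lets the IOMMU isolate DMA?). The function returns the PCI bus the device is on as well as the IOMMU bus.

Given a PCIDevice, it returns the (possibly aliased) PCI bus, the IOMMU bus, and the devfn part of the BDF.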

static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **aliased_pbus, PCIBus **piommu_bus, uint8_t *aliased_pdevfn)
{
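    // The PCIBus this device sits on is easy to get
    PCIBus *bus = pci_get_bus(dev);
    // Start with the iommu bus set to this bus
    PCIBus *iommu_bus = bus;
    // The devfn is also stored in the device
    uint8_t devfn = dev->devfn;
    // Since all three are available, why not just return them?
    // Why is the code below needed?

    // get_address_space() callback is mandatory, so needs to ensure its
    // presence in the iommu_bus search.
    // Buses are nested, so keep walking up to the parent bus
    // while
    //  - iommu_bus is not NULL: if it is NULL there is no iommu_bus to use
    //  - and it has no ops or no get_address_space(): if it has one, we can call it directly
    //  - and this bus has a parent bus: without a parent there is nothing to walk up to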
    while (iommu_bus && (!iommu_bus->iommu_ops || !iommu_bus->iommu_ops->get_address_space) && iommu_bus->parent_dev) {
        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);

        /*
         * The requester ID of the provided device may be aliased, as seen from
         * the IOMMU, due to topology limitations.  The IOMMU relies on a
         * requester ID to provide a unique AddressSpace for devices, but
         * conventional PCI buses pre-date such concepts.  Instead, the PCIe-
         * to-PCI bridge creates and accepts transactions on behalf of down-
         * stream devices.  When doing so, all downstream devices are masked
         * (aliased) behind a single requester ID.  The requester ID used
         * depends on the format of the bridge devices.  Proper PCIe-to-PCI
         * bridges, with a PCIe capability indicating such, follow the
         * guidelines of chapter 2.3 of the PCIe-to-PCI/X bridge specification,
         * where the bridge uses the secondary bus as the bridge portion of the
         * requester ID and devfn of 00.0.  For other bridges, typically those
         * found on the root complex such as the dmi-to-pci-bridge, we follow
         * the convention of typical bare-metal hardware, which uses the
         * requester ID of the bridge itself.  There are device specific
         * exceptions to these rules, but these are the defaults that the
         * Linux kernel uses when determining DMA aliases itself and believed
         * to be true for the bare metal equivalents of the devices emulated
         * in QEMU.
         */
        // Instead, the PCIe-to-PCI bridge creates and accepts transactions on behalf of down-
        // stream devices.
        if (!pci_bus_is_express(iommu_bus)) {
            PCIDevice *parent = iommu_bus->parent_dev;

            // The parent device is a PCIe-to-PCI bridge
            if (pci_is_express(parent) && pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
                // devfn (0, 0)
                devfn = PCI_DEVFN(0, 0);
                // Becomes this round's iommu_bus
                bus = iommu_bus;
            // Otherwise
            } else {
                // Becomes the parent device's devfn.
                devfn = parent->devfn;
                // Becomes the parent bus of this round's iommu_bus
                bus = parent_bus;
            }
        }

        // In every round, iommu_bus moves up to parent_bus
        iommu_bus = parent_bus;
    }
    *aliased_pbus = bus;
    *piommu_bus = iommu_bus;
    *aliased_pdevfn = devfn;
}

`pci_bus_bypass_iommu()` QEMU

Whether to bypass the IOMMU is decided from the root bus that this bus belongs to, in other words from the bypass_iommu setting of the host bridge above it.

bool pci_bus_bypass_iommu(PCIBus *bus)
{
    PCIBus *rootbus = bus;
    PCIHostState *host_bridge;

    if (!pci_bus_is_root(bus))
        rootbus = pci_device_root_bus(bus->parent_dev);

    host_bridge = PCI_HOST_BRIDGE(rootbus->qbus.parent);

    //...
    return host_bridge->bypass_iommu;
}

`pci_device_root_bus()` QEMU

Returns the root bus of this PCI device.

PCIBus *pci_device_root_bus(const PCIDevice *d)
{
    PCIBus *bus = pci_get_bus(d);

    // Walk up until the root bus is reached
    while (!pci_bus_is_root(bus)) {
        d = bus->parent_dev;
        //...
        bus = pci_get_bus(d);
    }

    return bus;
}

Device Passthrough

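In KVM, PCI device passthrough is done through the VFIO-PCI interface. VFIO-PCI is the Linux kernel's API abstraction over the low-level IOMMU and PCI logic. It lets QEMU, running in user space, configure the I/O mappings of the virtual device, so that the guest kernel driver can access hardware resources directly and achieve high I/O efficiency.

Direct I/O Live Migration in QEMU

In virtualization, live migration and HA are both affected by certain devices. Device implementations described as passthrough or direct assignment largely restrict the VM's ability to migrate.

[kvm][qemu]影响虚拟化热迁移的设备-腾讯云开发者社区-腾讯云

Compatibility with live migration is a major difficulty for PCI passthrough devices, because live migration depends on extracting, saving and transferring VM state, while the state of a PCI passthrough device is opaque to the hypervisor.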

Migration of VFIO devices currently consists of a single stop-and-copy phase. During the stop-and-copy phase the guest is stopped and the entire VFIO device data is transferred to the destination. The pre-copy phase of migration is currently not supported for VFIO devices. Support for VFIO pre-copy will be added later on.

Postcopy migration is currently not supported for VFIO devices.

vfio_device_mig_state.

Read the following carefully:

VFIO device Migration — QEMU 8.0.50 documentation

QEMU Commit: 31bcbbb5be04c7036223ce680a12927f5e51dc77 vfio/migration: Implement VFIO migration protocol v2

A new VFIO device handler is registered here.

static const SaveVMHandlers savevm_vfio_handlers = {
    .save_prepare = vfio_save_prepare,
    .save_setup = vfio_save_setup,
    .save_cleanup = vfio_save_cleanup,
    .state_pending_estimate = vfio_state_pending_estimate,
    .state_pending_exact = vfio_state_pending_exact,
    .is_active_iterate = vfio_is_active_iterate,
    .save_live_iterate = vfio_save_iterate,
    .save_live_complete_precopy = vfio_save_complete_precopy,
    .save_state = vfio_save_state,
    .load_setup = vfio_load_setup,
    .load_cleanup = vfio_load_cleanup,
    .load_state = vfio_load_state,
    .switchover_ack_needed = vfio_switchover_ack_needed,
};

VFIO migration finite state machine

enum vfio_device_mig_state {
	VFIO_DEVICE_STATE_ERROR = 0,
	VFIO_DEVICE_STATE_STOP = 1,
	VFIO_DEVICE_STATE_RUNNING = 2,
	VFIO_DEVICE_STATE_STOP_COPY = 3,
	VFIO_DEVICE_STATE_RESUMING = 4,
	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
	VFIO_DEVICE_STATE_PRE_COPY = 6,
	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
};

VFIO device migration feature bit

A VFIO device has some migration-related feature bits:

struct vfio_device_feature_migration {
	__aligned_u64 flags;
#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
#define VFIO_MIGRATION_P2P		(1 << 1)
#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
};
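
How the feature bits relate to the states of the state machine can be sketched as below. This follows my reading of the uapi comments (STOP_COPY is the base; P2P and PRE_COPY each add states, and both together add PRE_COPY_P2P). The constants are local copies of the values listed above so the sketch builds without <linux/vfio.h>, and the enum and function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag values listed above. */
#define MIG_STOP_COPY (1u << 0)
#define MIG_P2P       (1u << 1)
#define MIG_PRE_COPY  (1u << 2)

/* Same numbering as enum vfio_device_mig_state. */
enum mig_state { ST_ERROR, ST_STOP, ST_RUNNING, ST_STOP_COPY, ST_RESUMING,
                 ST_RUNNING_P2P, ST_PRE_COPY, ST_PRE_COPY_P2P };

/* Which states may a device with these feature flags be asked to enter? */
static bool mig_state_supported(uint64_t flags, enum mig_state st)
{
    assert(st <= ST_PRE_COPY_P2P);
    if (!(flags & MIG_STOP_COPY))
        return false;               /* treated here as "no migration support" */
    switch (st) {
    case ST_RUNNING_P2P:  return (flags & MIG_P2P) != 0;
    case ST_PRE_COPY:     return (flags & MIG_PRE_COPY) != 0;
    case ST_PRE_COPY_P2P: return (flags & MIG_P2P) && (flags & MIG_PRE_COPY);
    default:              return true;
    }
}
```

In real code the flags would come from the VFIO_DEVICE_FEATURE ioctl on the device fd; the sketch only shows how to interpret them.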

VFIO

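VFIO depends on the underlying hardware, for example the Intel IOMMU. VFIO is a software framework in the Linux kernel, and it is Linux's userspace driver framework. Broadly, VFIO provides two functions:

Passing devices through to a VM using the IOMMU;
Userspace drivers.

The following quote lays out the reasoning behind VFIO very clearly:

Normally a device's resources can only be accessed in kernel mode, which is why drivers live in the kernel. To implement a userspace driver, the hardware resources that used to be visible only in kernel mode must be lifted up so that user space can see and access them. For a PCI device, those hardware resources are the device's config space and BAR space. vfio_pci.ko exists to expose device resources to user space.
But most PCIe devices are capable of DMA. If the userspace driver of such a device issued a malicious write to an address that does not belong to its process, the threat to the system would be unpredictable. It is therefore necessary to restrict a userspace driver to DMA reads and writes within the address space assigned to it, which is what Intel's VT-d / IOMMU technology provides. VFIO has the matching software support, namely vfio_iommu_type1.ko (taking type1 as the example).
In virtualization, several VFIO devices can be passed through to the same VM, and the IOVAs of these devices are all GPAs, so giving each device its own page table is unnecessary. Hence the concept of a container, which lets multiple devices share one set of page tables; the container concept is implemented in vfio.ko.

As IOMMUFD features are merged upstream one after another, the latest VFIO code has started to change shape, and the old VFIO is now referred to as legacy VFIO.

So VMs are not the only beneficiaries of VFIO; other fields such as high-performance computing benefit too.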

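VFIO and Userspace Drivers

UIO does not support DMA, so I/O devices that move large amounts of data by DMA, such as NICs and GPUs, cannot use the UIO framework. VFIO, as the successor to UIO, mainly solves this problem. By letting user space configure the IOMMU, it confines the DMA address space mapping to the process's virtual address space. (So the IOMMU is not only useful for virtualization, since what gets mapped is process memory.)

From the device's point of view, a VM is just vCPU threads running the device driver. So passing a device through to a VM is no different from passing it to a userspace driver process: both are passthrough. Both VMs and userspace drivers use VFIO through the userspace ioctl interface; a VM also has to exit first before a mapping can be set up.

Why is it designed this way? Virtual machines often make use of direct device access ("device assignment", also called passthrough) when configured for the highest possible I/O performance. From a device and host perspective, this simply turns the VM into a userspace driver (why say so, given that non-root mode does not imply userspace? Because the VFIO ioctls are still called from within QEMU), with the benefits of significantly reduced latency, higher bandwidth, and direct use of bare-metal device drivers.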

Updating the IOMMU Page Table under VFIO

// Kernel VFIO subsystem layer
vfio_iommu_type1_ioctl
    // Calls to this ioctl can be found in several places in QEMU
	case VFIO_IOMMU_MAP_DMA:
        vfio_iommu_type1_map_dma
            vfio_dma_do_map
                vfio_pin_map_dma
                    // HVA -> HPA
                    vfio_pin_pages_remote
                    // IOVA (GPA) -> HPA
                    vfio_iommu_map
                        // See how mapping works on the host (non-virtualized) path
                        iommu_map
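The flow I imagined:

The guest driver wants the device to DMA and passes in the GPA;
The guest exits to KVM;
KVM exits to QEMU;
QEMU calls the fd ioctl to set up the mapping GPA (IOVA) -> HPA.

The actual flow:

For guest DMA the dma handle is necessarily a GPA, so the IOMMU page table has to handle GPA -> HPA.
When a RAMBlock is added (region_add), all GPA -> HPA mappings are written into the IOMMU page table up front, rather than being added and removed dynamically:

// NVMe device (QEMU's userspace NVMe driver)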
qemu_vfio_open_pci
    qemu_vfio_open_common
        s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
            qemu_vfio_dma_map
                qemu_vfio_do_mapping
                    ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)

// Initialization path in QEMU
vfio_pci_realize
    vfio_device_attach
        vfio_legacy_attach_device
            vfio_group_get
                vfio_container_connect
                    vfio_listener_register
                        // region_add is what triggers the IOMMU page table update
                        vfio_memory_listener.region_add = vfio_listener_region_add,

// Map path
vfio_listener_region_add
    vfio_container_region_add
        vfio_container_dma_map
            vioc->dma_map = vfio_legacy_dma_map;
                vfio_legacy_dma_map
                    ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)

How to use VFIO

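To use VFIO, the boot parameter intel_iommu=on must be added when Linux starts, because VFIO depends on the IOMMU underneath.

Load the VFIO-PCI module:

sudo modprobe vfio-pci

Knowing a device's BDF is enough to look up the group it belongs to:

readlink /sys/bus/pci/devices/0000:06:00.0/iommu_group

List the other devices in the same group:

# List all devices in the group that contains BDF 0000:06:00.0.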
ls /sys/bus/pci/devices/0000:06:00.0/iommu_group/devices/
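
A userspace driver usually needs the group number programmatically in order to open /dev/vfio/$GROUP. A minimal sketch, assuming the symlink target ends in the group number (for example "../../../../kernel/iommu_groups/26"); the helper name is made up, and obtaining the target with readlink(2) is left to the caller:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the group number from a link target such as
 * "../../../../kernel/iommu_groups/26", or -1 if it cannot be parsed. */
static long iommu_group_from_link(const char *target)
{
    const char *slash;
    char *end;
    long group;

    assert(target != NULL);
    slash = strrchr(target, '/');
    if (slash == NULL || slash[1] == '\0')
        return -1;
    group = strtol(slash + 1, &end, 10);
    return (*end == '\0' && group >= 0) ? group : -1;
}
```

The returned number is the one that appears as /dev/vfio/<number> once every device in the group is bound to vfio-pci.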

To pass a device through to a VM, it must be unbound from its current driver so that it can use the VFIO driver. Note that not only the device to be passed through but every device in the same iommu_group must be unbound for passthrough to succeed, because a whole group is passed to the VM at once.

$ echo 0000:06:00.0 | sudo tee /sys/bus/pci/devices/0000:06:00.0/driver/unbind
0000:06:00.0
$ echo 0000:00:05.0 | sudo tee /sys/bus/pci/devices/0000:00:05.0/driver/unbind
0000:00:05.0
$ echo 0000:00:05.1 | sudo tee /sys/bus/pci/devices/0000:00:05.1/driver/unbind
0000:00:05.1

Look up the device's vendor ID and device ID:

$ lspci -n -s 06:00.0
06:00.0 0200: 10ec:8168 (rev 15)

Bind the device to the vfio-pci module:

echo 10ec 8168 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id

Use ls /dev/vfio to check whether the bind succeeded; if it did, the number of the device's iommu_group shows up under /dev/vfio.

Give the user access to it:

chown user:user /dev/vfio/26

How a userspace driver uses the device through the API:

int container, group, device, i;
struct vfio_group_status group_status =
                                { .argsz = sizeof(group_status) };
struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

// Create a new container
container = open("/dev/vfio/vfio", O_RDWR);

// Open the group
group = open("/dev/vfio/26", O_RDWR);

// Test the group is viable and available
ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
        /* Group is not viable (ie, not all devices bound for vfio);
         * bail out here instead of continuing */
}

/* Add the group to the container */
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

/* Enable the IOMMU model we want */
ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

/* Get addition IOMMU info */
ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

/* Allocate some space and setup a DMA mapping */
dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
dma_map.size = 1024 * 1024;
dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

// Get a fd for the device
device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

// Test and setup the device
ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

for (i = 0; i < device_info.num_regions; i++) {
        struct vfio_region_info reg = { .argsz = sizeof(reg) };
        reg.index = i;
        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

        /* Setup mappings... read/write offsets, mmaps
         * For PCI devices, config space is a region */
}

for (i = 0; i < device_info.num_irqs; i++) {
        struct vfio_irq_info irq = { .argsz = sizeof(irq) };
        irq.index = i;
        ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

        /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
}

/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);

VFIO 概述 - EwanHai - 博客园

VFIO Concepts

One or more devices belong to a group, and one or more groups belong to a container. To pass a device through to a VM, first find the iommu group the device belongs to, then add the whole group to a container.

VFIO Device

In the context of virtualization, a VFIO device refers to a hardware device that has been assigned to a virtual machine (VM) using the VFIO framework. This framework allows for direct and isolated access to the device from the VM, bypassing the traditional virtualization layer.

When using vfio, add the following option to the qemu command line:

-device vfio-pci,host=00:12.0,id=net0

VFIO Group

Group: a group is the smallest hardware unit for which the IOMMU can isolate DMA. A group may contain one device or several, depending on the IOMMU topology of the physical platform. In passthrough, all devices in a group must go to the same VM. Devices in one group cannot be split across two VMs, nor can some stay on the host while others are assigned to a guest, because then a device in one guest could use a DMA attack to read another guest's data, and physical DMA isolation would be lost. A group is the minimum granularity that can be assigned to a VM.

The functions of one multi-function device generally end up in the same group (unless the device advertises ACS isolation between its functions).

After unbinding a host device from its driver and binding it to the VFIO driver, a group appears under /dev/vfio/, named by the IOMMU_GROUP number. To use VFIO on that group, all devices in the group must be unbound from their drivers.

Why have the concept of a group? For instance, an individual device may be part of a larger multifunction enclosure. While the IOMMU may be able to distinguish between devices within the enclosure, the enclosure may not require transactions between devices to reach the IOMMU. Examples of this could be anything from a multi-function PCI device with backdoors between functions to a non-PCI-ACS (Access Control Services) capable bridge allowing redirection without reaching the IOMMU. In other words: functions (meaning the function in BDF, not a VF; see BDF) are physically part of the same hardware, so they may be able to attack one another. That is why all functions of one device are put in the same group.

VFIO 概述 - EwanHai - 博客园

VFIO Container

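A container corresponds to one VM. Its main purpose is to share IOMMU page tables among devices.

A container is a VFIO software design concept, not a concept of the IOMMU hardware.

Container: for a VM, a container can be roughly understood as the physical memory space of one VM domain. For a userspace driver, a container can be a set of several groups. A container is a set of groups.

When we want different IOMMU_GROUPs to share the TLB and IOMMU page tables, we put those groups into the same container; so a container can be seen as a set of IOMMU_GROUPs.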

On its own, the container provides little functionality, with all but a couple version and extension query interfaces locked away. The user needs to add a group into the container for the next level of functionality.

Once the group is ready, it may be added to the container by opening the VFIO group character device (/dev/vfio/$GROUP) and using the VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the previously opened container file.

VFIO 概述 - EwanHai - 博客园

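VFIO is the software framework the kernel provides on top of the IOMMU. It supports DMA remapping and interrupt remapping; only DMA remapping is discussed here. Using the IOMMU, VFIO hides physical addresses from the upper layers, which can be used both to develop userspace drivers and to implement device passthrough.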

VFIO 概述_hx_op 的博客 - CSDN 博客_vfio
