Intel VT-d is Intel's implementation of an IOMMU; AMD's equivalent is called AMD-Vi.

DMA remapping addresses memory virtualization from the device's point of view; interrupt remapping addresses CPU virtualization from the device's point of view.

Modern x86 CPUs usually contain an IOMMU. Because the IOMMU is normally integrated into the PCIe root complex, one can loosely say that each CPU has one IOMMU.

iommu=pt

pt stands for pass-through.

When passing a device through with KVM, the kernel parameters intel_iommu=on iommu=pt are usually set together; intel_iommu=on simply enables the Intel IOMMU.

This enables IOMMU translation only for devices passed through to guests, and does not enable IOMMU translation for devices used by the host, which improves performance for host PCIe devices (those not passed through to a VM).

In other words, DMA remapping is applied only to devices passed through to a guest. Devices used by the host itself do not go through DMA remapping, which means a host-owned device can DMA anywhere, including into a VM's memory.

This setting makes a benign assumption about host devices and the VMM: the VMM on the host can already access any guest's memory, so why shouldn't a host device be allowed to as well?

Setting iommu=pt results in an identity mapping:

  • If the hardware supports the pass-through translation type, configuring that translation type is enough to get an identity mapping, and no IOMMU page tables need to be set up;
  • If the hardware does not support the pass-through translation type, IOMMU page tables must be configured so that IOVA maps 1:1 to HPA.

With hw_pass_through=0, DMA still walks the IOMMU page tables, so performance is worse than with hw_pass_through=1; a conceptual sketch of the two cases follows.
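A minimal conceptual sketch of what iommu=pt means, assuming the hardware probe result is in hw_pass_through; the helper names here (set_context_translation_type(), identity_map(), host_memory_end()) are made up for illustration and are not the real intel-iommu driver API:

static void configure_identity_mapping(struct dev_ctx *dev, bool hw_pass_through)
{
    if (hw_pass_through) {
        /* Hardware supports the pass-through translation type: mark the
         * device's context entry as pass-through; DMA addresses are used
         * as HPAs directly and no IOMMU page table is walked. */
        set_context_translation_type(dev, TRANSLATION_PASS_THROUGH);
    } else {
        /* No hardware pass-through: build IOMMU page tables that map every
         * IOVA to the identical HPA, so translation still happens but the
         * result is an identity mapping (slower than hw_pass_through=1). */
        for (uint64_t iova = 0; iova < host_memory_end(); iova += PAGE_SIZE) {
            identity_map(dev->domain, iova /* IOVA == HPA */);
        }
    }
}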

In fact this parameter supports many more values: The kernel’s command-line parameters — The Linux Kernel documentation

Notes about iommu=pt kernel parameter - L

VT-d Concepts

Some concepts defined by VT-d that are worth explaining up front:

Domain:

Abstract isolated environments in the platform to which a subset of host physical memory is allocated. Each device can be assigned to a domain; this is implemented by giving each domain its own set of paging structures. When the device attempts to access system memory, the DMA-remapping hardware intercepts the access and uses the page tables to determine

  • whether the access can be permitted;
  • the actual location to access.

Like the TLB, these paging structures can also be cached, and not only in the IOTLB; see 6.2 Address Translation Caches.

The DMA remapping architecture facilitates flexible assignment of I/O devices to an arbitrary number of domains. So the relationship between domains and devices should be many-to-many.

DMA Remapping

For the VMM to give a guest direct access to a device, it must be able to isolate that device's DMA requests (otherwise the device could DMA anywhere).

A passthrough device still uses DMA to access the VM's main memory in order to get good I/O performance. Here is the problem: because the device is assigned directly to one particular VM, its DMA must be made safe. A VM's passthrough device must not be able to DMA into another VM's memory, nor directly into the host's memory; otherwise the consequences are severe. Passthrough devices therefore need both "DMA isolation" and "DMA address translation":

  • Isolation confines the passthrough device's DMA accesses to the physical address space of the VM it belongs to, guaranteeing it cannot access memory out of bounds;
  • Address translation ensures the device's DMA is correctly redirected to the real physical addresses that back the VM's physical address space.

Why do passthrough devices have a DMA safety problem in the first place? The reason is simple: when the guest driver programs a DMA, it uses GPAs. Without isolation and translation those addresses would be treated as HPAs, and the HPA range numerically equal to the GPA could be anywhere, so the device would end up touching other VMs' memory or corrupting host memory. There must therefore be a mechanism that translates each GPA to its corresponding HPA so that the passthrough device's DMA completes correctly. A simplified sketch of that lookup is shown below.
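A simplified sketch of what the remapping hardware does for a request-without-PASID (legacy mode); the structure and helper names are illustrative, not the exact encodings from the VT-d spec:

/* Translate one DMA request: source-id (BDF) + IOVA/GPA -> HPA. */
static uint64_t dma_remap(uint16_t source_id, uint64_t iova)
{
    uint8_t bus   = source_id >> 8;
    uint8_t devfn = source_id & 0xff;

    /* The root table is indexed by bus number ... */
    root_entry_t    *re = &root_table[bus];
    /* ... and points to a context table indexed by devfn.  The context
     * entry identifies the domain the device is assigned to and that
     * domain's second-level paging structures. */
    context_entry_t *ce = &context_table(re)[devfn];

    if (!ce->present)
        return dma_fault(source_id, iova);   /* no domain: DMA is blocked */

    /* Walk the domain's page tables (conceptually like EPT), checking
     * read/write permissions, to obtain the host physical address. */
    return page_walk(ce->second_level_page_table, iova);
}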

DMA Remapping type

Remapping hardware sorts inbound memory requests from root-complex integrated devices and PCI Express*-attached discrete devices into two categories (plain DMA, and DMA carrying extra information):

  • Requests without address-space-identifier: These are the normal memory requests from endpoint devices. These requests typically specify the type of access (read/write/atomics), targeted DMA address/size, and source-id of the device originating the request (e.g., BDF). (Requests-without-PASID)
  • Requests with address-space-identifier: These are memory requests with additional information identifying the targeted address space from endpoint devices. Beyond attributes in normal requests, these requests specify the targeted Process Address Space Identifier (PASID), and Privileged-mode-Requested (PR) flag (to distinguish user versus supervisor access). For details, refer to the PASID Extended Capability Structure in the PCI Express specification. (Requests-with-PASID)

DMA Remapping use case

Depending on the software usage model, the DMA address space of a device may be

  • the Guest-Physical Address (GPA) space of a virtual machine to which it is assigned
  • Virtual Address (VA) space of host application on whose behalf it is performing DMA requests
  • Guest Virtual Address (GVA) space of a client application executing within a virtual machine
  • I/O virtual address (IOVA) space managed by host software
  • Guest I/O virtual address (GIOVA) space managed by guest software

In all cases, DMA remapping transforms the address in a DMA request issued by an I/O device to its corresponding Host-Physical Address (HPA).

Interrupt Remapping

First, understand why interrupts need remapping at all: an interrupt is sent by a device to one particular CPU. If a device is assigned only to specific vCPUs, it should not have the ability to send interrupts to other CPUs.

The VMM may utilize the interrupt-remapping hardware to distinguish interrupt requests from specific devices and route them to the appropriate VMs to which the respective devices are assigned.

An interrupt request is first intercepted by the interrupt-remapping hardware (the IOMMU), which looks it up in the interrupt remapping table and then delivers it to the target CPU.

Interrupt remapping can also distinguish a VM's internal IPIs from external interrupts by giving them different attributes. A rough sketch of the table lookup follows.
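A rough sketch of what happens to a remappable-format interrupt request; the IRTE fields below are heavily simplified compared with the real layout in the VT-d spec:

struct irte {                     /* Interrupt Remapping Table Entry (simplified) */
    bool     present;
    uint8_t  vector;              /* vector to deliver on the target CPU */
    uint32_t destination;         /* target (logical) processor */
    /* ... plus trigger/delivery mode and source-id validation fields ... */
};

static void remap_interrupt(uint16_t source_id, uint32_t handle)
{
    struct irte *e = &interrupt_remapping_table[handle];

    /* The device is only allowed to raise interrupts its IRTE permits. */
    if (!e->present || !source_id_matches(e, source_id)) {
        report_fault(source_id, handle);
        return;
    }
    deliver(e->destination, e->vector);
}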

Interrupt Migration

The interrupt-remapping architecture may be used to support dynamic re-direction of interrupts when the target for an interrupt request is migrated from one logical processor to another logical processor. Without interrupt-remapping hardware support, re-balancing of interrupts requires software to reprogram the interrupt sources. However, re-programming of these resources is non-atomic (requires multiple registers to be re-programmed), often complex (may require temporary masking of the interrupt source), and dependent on interrupt source characteristics (e.g. no masking capability for some interrupt sources; edge interrupts may be lost when masked on some sources, etc.)

Interrupt-remapping enables software to efficiently re-direct interrupts without re-programming the interrupt configuration at the sources. Interrupt migration may be used by OS software for balancing load across processors (such as when running I/O intensive workloads), or by the VMM when it migrates virtual CPUs of a partition with assigned devices across physical processors to improve CPU utilization.

VT-d Posted Interrupt / Interrupt Posting

VT-x also has a notion of interrupt posting; it is not repeated here.

SR-IOV and SIOV introduced VFs, which greatly increases the system's demand for interrupt vectors. One of the main problems VT-d posted interrupts solve is running out of interrupt vector numbers in virtualized environments.

Hardware support for interrupt posting addresses this problem by allowing interrupt requests from device functions/resources assigned to virtual machines to operate in virtual vector space, thereby scaling naturally with the number of virtual machines or virtual processors.

In other words, a physical CPU may run multiple vCPUs, and each vCPU has its own virtual vector space, so the vCPUs are isolated from one another. The usable vectors therefore scale naturally with the number of vCPUs. Each interrupt carries an extra field saying which vCPU it targets, instead of meaning the same thing on every CPU. This is what expands the vector space.

The second problem it solves is efficiency: whenever an external interrupt destined for a virtual machine is received by the CPU, control is transferred to the VMM, requiring the VMM to process and inject the corresponding virtual interrupt into the virtual machine. The control transfers associated with such VMM processing of external interrupts incur both hardware and software overheads.

If the target virtual processor is running on any logical processor, hardware can directly deliver external interrupts to the virtual processor without any VMM intervention. Interrupts received while the target virtual processor is preempted (waiting for its turn to run) can be accumulated in memory by hardware for delivery when the virtual processor is later scheduled. This avoids disrupting execution of currently running virtual processors on external interrupts for non-running virtual machines. If the target virtual processor is halted (idle) at the time of interrupt arrival or if the interrupt is qualified as requiring real-time processing, hardware can transfer control to VMM, enabling VMM to schedule the virtual processor and have hardware directly deliver pending interrupts to that virtual processor.

Even if the vCPU migrates to another pCPU, the virtual interrupts targeting it can be transferred and handled automatically: hardware support for interrupt posting enables VMM software to atomically co-migrate all interrupts targeting a virtual processor when the virtual processor is scheduled to another logical processor. A simplified view of the posted-interrupt descriptor is sketched below.
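A simplified view of the per-virtual-processor posted-interrupt descriptor that the hardware posts into (field widths abbreviated; see the VT-d spec for the authoritative layout):

struct posted_interrupt_descriptor {            /* 64-byte, memory resident */
    uint64_t pir[4];        /* Posted Interrupt Requests: one bit per virtual vector */
    uint64_t on   : 1;      /* Outstanding Notification: a notification is pending   */
    uint64_t sn   : 1;      /* Suppress Notification, e.g. while the vCPU is not running */
    uint64_t rsvd : 14;
    uint64_t nv   : 8;      /* Notification Vector sent to the physical CPU           */
    uint64_t rsvd2: 8;
    uint64_t ndst : 32;     /* Notification Destination: APIC ID of that physical CPU */
};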

IOMMU Endpoint

Presumably this represents a single device (an endpoint behind the IOMMU)?

typedef struct VirtIOIOMMUEndpoint {
    // has its own id
    uint32_t id;
    // belongs to a domain (presumably the VM it is assigned to)
    VirtIOIOMMUDomain *domain;
    // has its own memory region
    IOMMUMemoryRegion *iommu_mr;
    // can be linked into a list
    QLIST_ENTRY(VirtIOIOMMUEndpoint) next;
} VirtIOIOMMUEndpoint;

vIOMMU / VirtIO-IOMMU

VT-d emulation (guest vIOMMU).

Here is the simplest example to boot a Q35 machine with an e1000 card and a guest vIOMMU:

qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -m 2G \
                   -device intel-iommu,intremap=on \
                   -netdev user,id=net0 \
                   -device e1000,netdev=net0 \
                   $IMAGE_PATH

Features/VT-d - QEMU

There is a paper on this topic:

Amit.pdf

struct IOMMUDevice QEMU

This likely represents a device that sits behind an IOMMU (a device managed by the IOMMU), not the IOMMU device itself.

virtio_iommu_device_bypassed
    // The sid corresponds to an endpoint, so this struct represents a device behind the IOMMU.
    ep = g_tree_lookup(s->endpoints, GUINT_TO_POINTER(sid));
typedef struct IOMMUDevice {
    // the vIOMMU managing this device; actually of type `VirtIOIOMMU`
    void         *viommu;
    // bus + devfn give the device's BDF, see `virtio_iommu_get_bdf()`
    PCIBus       *bus;
    int           devfn;
    IOMMUMemoryRegion  iommu_mr;
    AddressSpace  as;
    MemoryRegion root;          /* The root container of the device */
    // alias to the shared memory region, i.e. the memory region
    // shared by all devices that are in bypass mode
    MemoryRegion bypass_mr;
} IOMMUDevice;
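The `virtio_iommu_get_bdf()` mentioned in the comment simply packs the bus number and devfn into a source id (quoted from memory, so double-check against the tree):

static uint32_t virtio_iommu_get_bdf(IOMMUDevice *dev)
{
    // PCI_BUILD_BDF(bus, devfn) == (bus << 8) | devfn
    return PCI_BUILD_BDF(pci_bus_num(dev->bus), dev->devfn);
}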

virtio_iommu_find_add_as() QEMU

A BDF should be able to identify a device; this function does some initialization work.

Its main purpose is to return the device's AddressSpace.

Called from pci_device_iommu_address_space():
static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque, int devfn)
{
    VirtIOIOMMU *s = opaque;
    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
    static uint32_t mr_index;
    IOMMUDevice *sdev;

    if (!sbus) {
        sbus = g_malloc0(sizeof(IOMMUPciBus) + sizeof(IOMMUDevice *) * PCI_DEVFN_MAX);
        sbus->bus = bus;
        g_hash_table_insert(s->as_by_busptr, bus, sbus);
    }

    sdev = sbus->pbdev[devfn];
    if (!sdev) {
        char *name = g_strdup_printf("%s-%d-%d", TYPE_VIRTIO_IOMMU_MEMORY_REGION, mr_index++, devfn);

        // create a device
        sdev = sbus->pbdev[devfn] = g_new0(IOMMUDevice, 1);
        // back-pointer to the owning vIOMMU
        sdev->viommu = s;
        sdev->bus = bus;
        sdev->devfn = devfn;

        //...
        // initialize the root memory region
        memory_region_init(&sdev->root, OBJECT(s), name, UINT64_MAX);
        // initialize the root address space
        address_space_init(&sdev->as, &sdev->root, TYPE_VIRTIO_IOMMU);

        /*
         * Build the IOMMU disabled container with aliases to the
         * shared MRs.  Note that aliasing to a shared memory region
         * could help the memory API to detect same FlatViews so we
         * can have devices to share the same FlatView when in bypass
         * mode. (either by not configuring virtio-iommu driver or with
         * "iommu=pt").  It will greatly reduce the total number of
         * FlatViews of the system hence VM runs faster.
         */
        memory_region_init_alias(&sdev->bypass_mr, OBJECT(s),
                                 "system", get_system_memory(), 0,
                                 memory_region_size(get_system_memory()));

        memory_region_init_iommu(&sdev->iommu_mr, sizeof(sdev->iommu_mr),
                                 TYPE_VIRTIO_IOMMU_MEMORY_REGION,
                                 OBJECT(s), name,
                                 UINT64_MAX);

        /*
         * Hook both the containers under the root container, we
         * switch between iommu & bypass MRs by enable/disable
         * corresponding sub-containers
         */
        memory_region_add_subregion_overlap(&sdev->root, 0,
                                            MEMORY_REGION(&sdev->iommu_mr),
                                            0);
        memory_region_add_subregion_overlap(&sdev->root, 0,
                                            &sdev->bypass_mr, 0);

        virtio_iommu_switch_address_space(sdev);
        g_free(name);
    }
    return &sdev->as;
}

struct VirtIOIOMMU QEMU

This represents a VirtIO-IOMMU device.

struct VirtIOIOMMU {
    // it is itself a VirtIO device
    VirtIODevice parent_obj;
    VirtQueue *req_vq;
    VirtQueue *event_vq;
    struct virtio_iommu_config config;
    uint64_t features;
    GHashTable *as_by_busptr;
    IOMMUPciBus *iommu_pcibus_by_bus_num[PCI_BUS_MAX];
    // a VirtIOIOMMU must sit on a PCIBus.
    PCIBus *primary_bus;
    ReservedRegion *reserved_regions;
    uint32_t nb_reserved_regions;
    // all domains managed by this vIOMMU.
    GTree *domains;
    QemuRecMutex mutex;
    // all endpoints managed by this vIOMMU
    GTree *endpoints;
    bool boot_bypass;
    Notifier machine_done;
    bool granule_frozen;
};

The IOMMU and the PCI bus appear to be one-to-one: an IOMMU must be bound to a PCI bus, and a PCI bus has at most one IOMMU device. The device must be attached to a PCI bus, otherwise:

virtio_iommu_device_realize
    VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
    //...
    if (s->primary_bus) {
        // bus->iommu_ops = ops;
        // bus->iommu_opaque = opaque;
        // the IOMMU already knows its bus; now make the bus point back at the IOMMU
        pci_setup_iommu(s->primary_bus, &virtio_iommu_ops, s);
    } else {
        error_setg(errp, "VIRTIO-IOMMU is not attached to any PCI bus!");
    }
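The `virtio_iommu_ops` passed to pci_setup_iommu() is what ties the bus to virtio_iommu_find_add_as() above; roughly:

static const PCIIOMMUOps virtio_iommu_ops = {
    .get_address_space = virtio_iommu_find_add_as,
};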

virtio_iommu_get_endpoint() QEMU

The key question is whether this endpoint_id has an IOMMU memory region; if the endpoint is not yet in s->endpoints, it also gets inserted there.

static VirtIOIOMMUEndpoint *virtio_iommu_get_endpoint(VirtIOIOMMU *s, uint32_t ep_id)
{
    VirtIOIOMMUEndpoint *ep;
    IOMMUMemoryRegion *mr;

    ep = g_tree_lookup(s->endpoints, GUINT_TO_POINTER(ep_id));
    if (ep) {
        return ep;
    }
    mr = virtio_iommu_mr(s, ep_id);
    if (!mr) {
        return NULL;
    }
    ep = g_malloc0(sizeof(*ep));
    ep->id = ep_id;
    ep->iommu_mr = mr;
    trace_virtio_iommu_get_endpoint(ep_id);
    g_tree_insert(s->endpoints, GUINT_TO_POINTER(ep_id), ep);
    return ep;
}

virtio_iommu_switch_address_space() QEMU

Chooses a different address space depending on whether the device bypasses the IOMMU.

// Return whether the device is using IOMMU translation.
static bool virtio_iommu_switch_address_space(IOMMUDevice *sdev)
{
    bool use_remapping;

    //...
    // bypass means we do not use the IOMMU to translate
    use_remapping = !virtio_iommu_device_bypassed(sdev);
    //...
    // Turn off first then on the other
    // disable the MR currently used for bypass,
    // enable the MR used for translation
    if (use_remapping) {
        memory_region_set_enabled(&sdev->bypass_mr, false);
        memory_region_set_enabled(MEMORY_REGION(&sdev->iommu_mr), true);
    // and the reverse
    } else {
        memory_region_set_enabled(MEMORY_REGION(&sdev->iommu_mr), false);
        memory_region_set_enabled(&sdev->bypass_mr, true);
    }

    return use_remapping;
}
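For reference, the bypass decision itself looks roughly like this (abridged from the logic visible at the call sites above; locking and the reset corner case omitted, so treat it as a sketch rather than the verbatim QEMU code):

static bool virtio_iommu_device_bypassed(IOMMUDevice *sdev)
{
    VirtIOIOMMU *s = sdev->viommu;
    uint32_t sid = virtio_iommu_get_bdf(sdev);
    VirtIOIOMMUEndpoint *ep = g_tree_lookup(s->endpoints, GUINT_TO_POINTER(sid));

    if (!ep || !ep->domain) {
        /* Not attached to any domain yet: fall back to the global bypass config. */
        return s->config.bypass;
    }
    /* Otherwise the domain the endpoint is attached to decides. */
    return ep->domain->bypass;
}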

struct IOMMUPciBus QEMU

The only difference from PCIBus is the extra array of IOMMUDevice pointers, which tracks the devices that have already been probed.

typedef struct IOMMUPciBus {
    PCIBus       *bus;
    IOMMUDevice  *pbdev[]; /* Parent array is sparse, so dynamically alloc */
} IOMMUPciBus;

pci_device_get_iommu_bus_devfn() QEMU

This function tells us something: with PCIe, an IOMMU bus may have several plain PCI buses below it. The devices managed under that IOMMU bus's domain may sit directly on the IOMMU bus or on one of the PCI buses below it. Only the IOMMU bus can call get_address_space() to obtain the address space of the whole domain (one address space per domain, presumably so the IOMMU can enforce DMA isolation). The job of this function is to return both the PCI bus the device effectively sits on and the IOMMU bus above it.

Given a PCIDevice, it returns the (aliased) PCI bus, the IOMMU bus, and the devfn part of the BDF.

static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **aliased_pbus, PCIBus **piommu_bus, uint8_t *aliased_pdevfn)
{
    // The PCIBus this device sits on is easy to get
    PCIBus *bus = pci_get_bus(dev);
    // start by assuming the iommu bus is this bus
    PCIBus *iommu_bus = bus;
    // the devfn is also available in the device
    uint8_t devfn = dev->devfn;
    // Since we already have all of these, why can't we just return them?
    // Why is the code below needed at all?

    // get_address_space() callback is mandatory, so needs to ensure its
    // presence in the iommu_bus search.
    // Buses are nested, so keep walking up the hierarchy while:
    //  - iommu_bus is non-NULL: if it is NULL there is no iommu_bus to use
    //  - and it has no ops or no get_address_space(): if the callback exists we can just call it
    //  - and the bus has a parent bus: without a parent there is nothing left to walk up to
    while (iommu_bus && (!iommu_bus->iommu_ops || !iommu_bus->iommu_ops->get_address_space) && iommu_bus->parent_dev) {
        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);

        /*
         * The requester ID of the provided device may be aliased, as seen from
         * the IOMMU, due to topology limitations.  The IOMMU relies on a
         * requester ID to provide a unique AddressSpace for devices, but
         * conventional PCI buses pre-date such concepts.  Instead, the PCIe-
         * to-PCI bridge creates and accepts transactions on behalf of down-
         * stream devices.  When doing so, all downstream devices are masked
         * (aliased) behind a single requester ID.  The requester ID used
         * depends on the format of the bridge devices.  Proper PCIe-to-PCI
         * bridges, with a PCIe capability indicating such, follow the
         * guidelines of chapter 2.3 of the PCIe-to-PCI/X bridge specification,
         * where the bridge uses the secondary bus as the bridge portion of the
         * requester ID and devfn of 00.0.  For other bridges, typically those
         * found on the root complex such as the dmi-to-pci-bridge, we follow
         * the convention of typical bare-metal hardware, which uses the
         * requester ID of the bridge itself.  There are device specific
         * exceptions to these rules, but these are the defaults that the
         * Linux kernel uses when determining DMA aliases itself and believed
         * to be true for the bare metal equivalents of the devices emulated
         * in QEMU.
         */
        // Instead, the PCIe-to-PCI bridge creates and accepts transactions on behalf of down-
        // stream devices.
        if (!pci_bus_is_express(iommu_bus)) {
            PCIDevice *parent = iommu_bus->parent_dev;

            // the parent device is a PCIe-to-PCI bridge
            if (pci_is_express(parent) && pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
                // alias to devfn (0, 0)
                devfn = PCI_DEVFN(0, 0);
                // the aliased bus becomes this round's iommu_bus
                bus = iommu_bus;
            // otherwise (e.g. a bridge on the root complex)
            } else {
                // alias to the parent device's devfn
                devfn = parent->devfn;
                // the aliased bus becomes this round's iommu_bus's parent bus
                bus = parent_bus;
            }
        }

        // In either case, iommu_bus moves up to parent_bus on every iteration
        iommu_bus = parent_bus;
    }
    *aliased_pbus = bus;
    *piommu_bus = iommu_bus;
    *aliased_pdevfn = devfn;
}
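The caller mentioned earlier, pci_device_iommu_address_space(), then roughly does the following (abridged sketch; see hw/pci/pci.c for the exact code):

AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
{
    PCIBus *bus, *iommu_bus;
    uint8_t devfn;

    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
        iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_address_space) {
        // e.g. virtio_iommu_find_add_as() for a virtio-iommu
        return iommu_bus->iommu_ops->get_address_space(bus, iommu_bus->iommu_opaque, devfn);
    }
    // no IOMMU (or IOMMU bypassed): the device DMAs into plain system memory
    return &address_space_memory;
}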

pci_bus_bypass_iommu() QEMU

Whether to bypass the IOMMU is decided from the root bus this bus belongs to; in other words, it reflects whether the platform (host bridge) as a whole bypasses the IOMMU.

bool pci_bus_bypass_iommu(PCIBus *bus)
{
    PCIBus *rootbus = bus;
    PCIHostState *host_bridge;

    if (!pci_bus_is_root(bus))
        rootbus = pci_device_root_bus(bus->parent_dev);

    host_bridge = PCI_HOST_BRIDGE(rootbus->qbus.parent);

    //...
    return host_bridge->bypass_iommu;
}

pci_device_root_bus() QEMU

Gets the root bus of this PCI device.

PCIBus *pci_device_root_bus(const PCIDevice *d)
{
    PCIBus *bus = pci_get_bus(d);

    // walk up until we reach the root bus
    while (!pci_bus_is_root(bus)) {
        d = bus->parent_dev;
        //...
        bus = pci_get_bus(d);
    }

    return bus;
}

Device Passthrough

In KVM, PCI device passthrough goes through the VFIO-PCI interface. VFIO-PCI is the Linux kernel's abstraction over the IOMMU and the low-level PCI logic; it gives user-space QEMU an API to configure the I/O mappings of the virtual device, so that the guest kernel driver can access the hardware directly and achieve high I/O efficiency. An abbreviated sketch of the classic VFIO flow follows.
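An abbreviated sketch of the classic VFIO type1 flow for a passthrough device (error handling omitted; the group number and BDF are placeholders; see Documentation/driver-api/vfio.rst for the canonical example):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_passthrough_setup(void *guest_ram, uint64_t ram_size)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group     = open("/dev/vfio/26", O_RDWR);         /* "26": placeholder IOMMU group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);    /* put the group in the container */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);    /* choose the type1 IOMMU backend */

    /* Map guest RAM into the device's IOVA space: this is what installs the
     * GPA/IOVA -> HPA mappings in the IOMMU page tables for this VM. */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)guest_ram,   /* userspace VA backing guest RAM */
        .iova  = 0,                      /* address the device will use for DMA */
        .size  = ram_size,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* Finally get a device fd for config space, BARs, interrupts, ... */
    return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");  /* placeholder BDF */
}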

Direct I/O Live Migration in QEMU

In virtualization, live migration and HA are both constrained by certain devices: any device implemented as passthrough (also called direct assignment) essentially limits the VM's ability to migrate.

[kvm][qemu]影响虚拟化热迁移的设备-腾讯云开发者社区-腾讯云

Live-migration compatibility is one of the hard problems for PCI passthrough devices: live migration depends on extracting, saving, and transferring the VM's state, and the state of a passthrough PCI device is opaque to the hypervisor.

Migration of VFIO devices currently consists of a single stop-and-copy phase. During the stop-and-copy phase the guest is stopped and the entire VFIO device data is transferred to the destination. The pre-copy phase of migration is currently not supported for VFIO devices. Support for VFIO pre-copy will be added later on.

Postcopy migration is currently not supported for VFIO devices.

vfio_device_mig_state.

The following is worth a careful read:

VFIO device Migration — QEMU 8.0.50 documentation

QEMU Commit: 31bcbbb5be04c7036223ce680a12927f5e51dc77 vfio/migration: Implement VFIO migration protocol v2

A new set of VFIO device SaveVMHandlers is registered here.

static const SaveVMHandlers savevm_vfio_handlers = {
    .save_prepare = vfio_save_prepare,
    .save_setup = vfio_save_setup,
    .save_cleanup = vfio_save_cleanup,
    .state_pending_estimate = vfio_state_pending_estimate,
    .state_pending_exact = vfio_state_pending_exact,
    .is_active_iterate = vfio_is_active_iterate,
    .save_live_iterate = vfio_save_iterate,
    .save_live_complete_precopy = vfio_save_complete_precopy,
    .save_state = vfio_save_state,
    .load_setup = vfio_load_setup,
    .load_cleanup = vfio_load_cleanup,
    .load_state = vfio_load_state,
    .switchover_ack_needed = vfio_switchover_ack_needed,
};

VFIO migration finite state machine

enum vfio_device_mig_state {
	VFIO_DEVICE_STATE_ERROR = 0,
	VFIO_DEVICE_STATE_STOP = 1,
	VFIO_DEVICE_STATE_RUNNING = 2,
	VFIO_DEVICE_STATE_STOP_COPY = 3,
	VFIO_DEVICE_STATE_RESUMING = 4,
	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
	VFIO_DEVICE_STATE_PRE_COPY = 6,
	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
};
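Userspace drives this state machine through the VFIO_DEVICE_FEATURE ioctl with a vfio_device_feature_mig_state payload. A hedged sketch (roughly what QEMU's vfio migration code does; error handling omitted):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_set_mig_state(int device_fd, enum vfio_device_mig_state new_state)
{
    /* The feature-specific payload follows the generic vfio_device_feature header. */
    char buf[sizeof(struct vfio_device_feature) +
             sizeof(struct vfio_device_feature_mig_state)] = { 0 };
    struct vfio_device_feature *feature = (void *)buf;
    struct vfio_device_feature_mig_state *mig = (void *)feature->data;

    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
    mig->device_state = new_state;

    if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
        return -1;

    /* For data-transfer states the kernel returns a data_fd: the stream the
     * device state is read from (source) or written to (destination). */
    return mig->data_fd;
}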

VFIO device migration feature bit

VFIO devices expose some migration-related feature bits:

struct vfio_device_feature_migration {
	__aligned_u64 flags;
#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
#define VFIO_MIGRATION_P2P		(1 << 1)
#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
};