Live migration refers to the process of moving a running virtual machine or application between different physical machines without disconnecting the client or application. That is presumably why the downtime should be kept below the TCP timeout.

Shared pages are handled sequentially alongside private pages. Shared pages are migrated faster than private pages, and most pages are private pages.

TDX does not migrate 2MB or 1GB pages; it migrates only 4KB pages. A 2MB page can always be broken into 512 4KB pages using TDH.MEM.PAGE.DEMOTE and then migrated.
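As a sanity check on the arithmetic, a tiny helper (illustrative only; `pages_4k_after_demote` is not a real KVM/TDX function) computes how many 4KB pages a demotion yields:

```c
#include <assert.h>
#include <stdint.h>

/* How many 4KB pages result from demoting a large page of the given
 * size: 2MB -> 512, 1GB -> 262144. Illustrative helper, not KVM code. */
static inline uint64_t pages_4k_after_demote(uint64_t large_page_bytes)
{
    return large_page_bytes / 4096;
}
```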

The TD may be assigned a different HKID (and will always be assigned a different ephemeral key) on the destination platform chosen for the migration.

All migration-related metadata, such as the MIGRATABLE bit, is stored in the TDCS.

On the source platform, a private page may be mapped to be non-writable (a.k.a. blocked for writing) to allow for the page contents to be exported.

For 1GB and 2MB pages, Secure EPT mapping demotion (to a 4KB page size) is required as a pre-condition to exporting the contents of a page (note: the page described by the PTE, not the page the PTE itself resides in) for migration.

However, there is no intrinsic guarantee of ordering across migration streams.

All dirty pages must be re-exported by the host VMM for the in-order migration phase to be completed.

Verification of the start token MBMD's TOTAL_MB field enforces that all exported state has been imported (in-order) on the destination. (TOTAL_MB: the total number of migration bundles, including the current one, that have been exported since the beginning of the migration session.)
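A minimal sketch of that completion check, assuming the importer counts bundles as they arrive (illustrative names, not the TDX Module's code):

```c
#include <assert.h>
#include <stdint.h>

/* The in-order phase can complete only when the number of bundles
 * imported so far, plus the start-token bundle itself, equals the
 * TOTAL_MB value carried in the start token's MBMD. Illustrative. */
static int start_token_total_mb_ok(uint64_t bundles_imported,
                                   uint64_t total_mb)
{
    return bundles_imported + 1 == total_mb;
}
```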

Attestation between 2 MigTDs on 2 different platforms

Previously all testing was done as local live migration on a single platform: the src MigTD and dst MigTD were both started on the same platform, then the src TD and dst TD were started, both TDs were instructed to perform pre-migration, and then both sides were set up to trigger the migration.

The two MigTDs should perform attestation with each other; in this (same-platform) case it should presumably be local attestation.

interface_functions_completion_status

https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/documentation.html

Download the Intel TDX Module v1.5 ABI Definitions from there; it contains many JSON files.

first_time_import_bitmap QEMU

Each stream has one instance of this structure. It is an array of 8 u64s, so it can cover 512 pages at a time.

It is cleared on every import.

For each GFN to be imported, we look up its corresponding SPTE. If the old SPTE does not exist, this is a first-time import, so we set the corresponding bit.

A first-time import is done in-place; a non-first-time import is not.

See kvm_tdp_mmu_import_private_pages()
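A sketch of such a 512-bit bitmap (8 × u64) with the clear/set/test operations described above; the names are illustrative, not the actual KVM identifiers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FTI_BITMAP_WORDS 8              /* 8 x u64 = 512 bits */

struct fti_bitmap {
    uint64_t bits[FTI_BITMAP_WORDS];    /* one bit per page in the batch */
};

/* Cleared at the start of every import batch. */
static void fti_clear(struct fti_bitmap *bm)
{
    memset(bm->bits, 0, sizeof(bm->bits));
}

/* Mark entry 'idx' as a first-time import (old SPTE absent). */
static void fti_set(struct fti_bitmap *bm, unsigned idx)
{
    bm->bits[idx / 64] |= 1ULL << (idx % 64);
}

static int fti_test(const struct fti_bitmap *bm, unsigned idx)
{
    return (bm->bits[idx / 64] >> (idx % 64)) & 1;
}
```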

SERVTD_INFO_HASH

TDX 1.5 Spec: SERVTD_ATTR: Service TD Binding Attributes.

Initialization considerations for TD live migration

A VCPU can be entered only when its logical TDVPS control structure, composed of TDVPR and TDCX pages,

  • is available in memory and has been initialized by TDH.VP.INIT, or
  • has been successfully imported by TDH.IMPORT.STATE.VP.

So these two SEAMCALLs are mutually exclusive: once one has been executed, the other is not needed.

Migration differs from the conventional flow in when td_initialized is set. The relevant pieces are:

  • tdx_td_vcpu_setup(): TDH.VP.CREATE, TDH.VP.ADDCX,
  • TDH.VP.INIT,
  • some of the work in tdx_td_vcpu_post_init().
// ------------------------ Normal flow ------------------------
// If POST_INIT is requested (meaning a migration follows), the two functions above are not executed.
// This is the vCPU thread in QEMU, the path that would normally set td_initialized to true.
qemu_thread_create(cpu->thread, thread_name, kvm_vcpu_thread_fn, cpu, QEMU_THREAD_JOINABLE);
    kvm_vcpu_thread_fn
        kvm_init_vcpu
            kvm_arch_pre_create_vcpu
                tdx_pre_create_vcpu
                    if (runstate_check(RUN_STATE_INMIGRATE))
                        flags = KVM_TDX_INIT_VM_F_POST_INIT;
                    tdx_vm_ioctl
                        // Why is INIT VM called for every vcpu? Shouldn't the VM
                        // only be initialized once? Does the TD become initialized
                        // as early as when the first vCPU is initialized?
                        case KVM_TDX_INIT_VM:
                            tdx_td_init
                                __tdx_td_init(..., cmd->flags & KVM_TDX_INIT_VM_F_POST_INIT)
                                    if (!post_init)
                                        tdx_td_post_init
                                            kvm_tdx->td_initialized = true;

tdx_finalize_vm
    tdx_post_init_vcpus
        tdx_vcpu_ioctl
            kvm_vcpu_ioctl(KVM_MEMORY_ENCRYPT_OP)
                default:
                    kvm_arch_vcpu_ioctl
                    	case KVM_MEMORY_ENCRYPT_OP:
                            vt_vcpu_mem_enc_ioctl
                                tdx_vcpu_ioctl
                                    tdx_td_vcpu_init
                                        if (!kvm_tdx->td_initialized)
                                    		return 0;
                                    	tdx_td_vcpu_setup
                                            TDH.VP.CREATE
                                            TDH.VP.ADDCX
                                        TDH.VP.INIT // for migration, this can be replaced by TDH.IMPORT.STATE.VP
                                    tdx_td_vcpu_post_init
                                        if (!kvm_tdx->td_initialized)
                                    		return;

// ------------------------ Migration flow ------------------------
// If POST_INIT is requested (meaning a migration follows), the two functions above are executed here instead;
tdx_mig_savevm_state_end
    // This is where QEMU's migration path sets td_initialized to true
    tdx_mig_save_td
        KVM_TDX_MIG_EXPORT_STATE_TD
            tdx_mig_stream_ioctl
                KVM_TDX_MIG_IMPORT_STATE_TD
                    tdx_mig_import_state_td
                        tdx_td_post_init
                            kvm_tdx->td_initialized = true;
    tdx_mig_save_vcpus
        tdx_mig_save_one_vcpu
            KVM_TDX_MIG_IMPORT_STATE_VP
                tdx_mig_import_state_vp
                    tdx_td_vcpu_setup
                        TDH.VP.CREATE
                        TDH.VP.ADDCX
                    TDH.IMPORT.STATE.VP // the migration replacement for TDH.VP.INIT
                    tdx_td_vcpu_post_init // this function also appears in the normal flow above
                        if (!kvm_tdx->td_initialized)
                    		return;

CANCEL

Be sure to distinguish CANCEL from REMIGRATE:

  • after a CANCEL, the re-import may use a new pfn;
  • a REMIGRATE must reuse the previous pfn, otherwise the TDX Module reports an error (according to Wei's comments).

If a page has been exported before but needs to be removed, promoted, or demoted, cancel its migration by marking the list entry as CANCEL (TDH.EXPORT.MEM(CANCEL)).

Once a page has been exported during the current export session (note: even pages exported in a previous epoch count), it can't be blocked, removed, promoted, demoted, or relocated. This prevents the destination platform from using a stale copy of that page.

Why must a page be re-imported after a CANCEL? For post-copy, the cancel is meant to make a later access trigger a page fault, so re-importing is the normal path. But what if the page is simply going to be converted to shared? Then it would seem no re-import is needed.

In fact it will always be re-imported; in that case what gets re-imported is the shared page.

In order to perform such memory management operations on an exported page, the host VMM must first execute TDH.EXPORT.MEM indicating a CANCEL operation for the page. No migration buffer is required for this GPA list entry.

An SEPT page can only be removed if all its entries are FREE; specifically, it can't be removed if any entry state is REMOVED. In other words, if any page under an SEPT page has been CANCELed, that SEPT page cannot be removed. So what do we do? Does a CANCELed page need a TDH.MEM.PAGE.REMOVE call to remove it once more? No. This mechanism exists to guarantee that a cancel must be followed by a re-import, not a remove. On the re-import, both the GFN and the PFN are passed in as inputs, which lets the TDX Module update the pfn recorded in the previously REMOVED entry to the new PFN and change the entry state from REMOVED back to MAPPED.
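The cancel / re-import mechanics above can be sketched as a tiny state machine (illustrative names, not TDX Module code):

```c
#include <assert.h>
#include <stdint.h>

enum septe_state { SEPTE_FREE, SEPTE_MAPPED, SEPTE_REMOVED };

struct septe {
    enum septe_state state;
    uint64_t pfn;
};

/* TDH.EXPORT/IMPORT.MEM(CANCEL): the physical page is removed, but the
 * entry is parked in REMOVED (old pfn still recorded) instead of FREE. */
static void septe_cancel(struct septe *e)
{
    e->state = SEPTE_REMOVED;
}

/* Re-import after a cancel: a new pfn may be supplied; the entry's pfn
 * is updated and its state flips back to MAPPED. */
static void septe_reimport(struct septe *e, uint64_t new_pfn)
{
    e->pfn = new_pfn;
    e->state = SEPTE_MAPPED;
}
```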

  • What if we want to convert a page from private to shared? That case does not actually need a cancel: simply loading the shared page performs the conversion and removes the original private page from the TDX Module.
  • What if we want to convert an already-CANCELed private page to shared, e.g. in post-copy when we cancel the imports of all previously dirty private pages (prepare postcopy)? This cannot happen, because by prepare-postcopy time the source has stopped running and no new dirty pages can be produced, so:
    • if a private page became dirty because it was converted to shared, then no cancel request for it is sent at prepare-postcopy time at all, since it is already a shared page; the destination just loads the shared page;
    • if a private page became dirty because the guest wrote to it, then, with the guest now stopped, that private page can never be converted to shared, so sending a CANCEL is perfectly safe. We never face the problem of a canceled page later needing to become shared.

First-time import of a page during the current import session, or following a previous import cancellation, may be done in-place. So for the cancel case we should indeed also zero the SPTE in the KVM MMU, so that the next import can be done in-place. At that point the pfn is not necessarily the old one anymore.

  • Because our code does not get_pfn() for a CANCEL request, the page cannot be pinned a second time.
  • Because the old_spte is present, we do not set the first_time_import_bitmap; instead we put_page(), releasing the page being canceled. The cancel itself changes the state to REMOVED, which effectively frees the page inside the TDX Module as well; the difference from FREE is that we must import again later to change the state back from REMOVED.
  • On the subsequent new import we do get_pfn(), and that pfn is not necessarily the same as the pre-CANCEL pfn.

When handling a CANCEL, we should not convert the page from private to shared at load time, nor do we need TDH.MEM.PAGE.REMOVE to remove the original page from the TDX Module. That is because the CANCEL SEAMCALL puts the corresponding SPTE in the TDX-Module-maintained EPT into the REMOVED state (effectively taking it out of the TDX Module; as stated: "When a page import is cancelled during the in-order phase, the physical page is removed but its SEPT entry is put into a REMOVED state"), whereas TDH.MEM.PAGE.REMOVE sets the state to FREE. The two are essentially the same, both meaning the page is gone; the difference is that REMOVED requires a subsequent re-import.

For a page that is re-imported after a cancel, we do set the first-time-import bitmap, since the previous page has already been released (the old pfn is still recorded in the TDX Module's entry, but that entry is in the REMOVED state, and the re-import updates it to the new pfn).

After TDH.IMPORT.MEM(Cancel), the SPTE state inside TDX becomes REMOVED (the PFN should still be preserved); a new IMPORT is required to change it back. The REMOVED state means the page is no longer in the TDX Module, so we should unaccount the page at that point, treating it as already removed rather than counting it as removed only after a TDH.MEM.PAGE.REMOVE call.

Should the cache for the page be flushed?

After any private pages have been removed by a CANCEL operation, the host VMM should flush the physical pages’ cache lines and initialize their content before they are reused.

This part does not appear in the implementation code.

See ABI TDH.IMPORT.MEM Leaf

What will happen when guest access a REMOVED SPTE?

Will a page fault occur? An EPT violation?

Where is the dst's dirty bitmap updated?

The dst only needs to update the receive bitmap, ensuring the bit in the receive bitmap is cleared. Because the TDX SPTE is already in the REMOVED state, an access will trigger a page fault.

QEMU TD live migration overall stream structure overview

Post-copy is not covered.

# master thread
# This is sent first, marking this as a migration stream
QEMU_VM_FILE_MAGIC # 4 bytes
# migration stream version, used to check compatibility
QEMU_VM_FILE_VERSION # 4 bytes
? QEMU_VM_CONFIGURATION # optional: whether to send some configuration over
QEMU_VM_SECTION_CGS_START
    ############## setup phase (indentation: applies to every device, not just RAM; RAM is used as the example) ###############
    QEMU_VM_SECTION_START # 1 byte
    # The following are all SaveStateEntry-related; each SaveStateEntry can be viewed as one device instance, e.g. RAM
    # All of these must be sent on the first transmission; RAM is used as the example below
    se->section_id # 4 bytes, id of the section that follows
    length of the idstr # length of the idstr of the se that follows
    se->idstr
    se->instance_id # 4 bytes
    se->version_id # 4 bytes
    # This is the RAM-specific wire format
    # RAM_SAVE_FLAG_MEM_SIZE marks this value as the total memory size
    total_mem_size | RAM_SAVE_FLAG_MEM_SIZE
        ########## indentation: for each RAMBlock, send the block's metadata ##########
        len(block->idstr) # each block also has an idstr, of course
        block->idstr
        block->used_length # size of the block; the total_mem_size sent above is computed from the per-block sizes
    RAM_SAVE_FLAG_MULTIFD_FLUSH # make the peer run multifd_recv_sync_main()
    # Marks the end of this section, so that ram_load_precopy exits and starts receiving the footer
    # This is sent at the end of every section
    RAM_SAVE_FLAG_EOS
    QEMU_VM_SECTION_FOOTER
    se->section_id # the section id is sent again
    ######################################## End of setup phase #######################################
    ####################################### Start of PART phase #######################################
    QEMU_VM_SECTION_PART
    se->section_id # 4 bytes, id of the section that follows
    # Sent if the previous section finished a full pass over all the RAMBlocks.
    # TDX live migration must execute an EXPORT.TRACK SEAMCALL in every epoch.
    # Note that one epoch does not necessarily map to one section: a section may
    # exit mid-send due to rate limiting or a timeout (50ms), before the epoch
    # (one pass over all RAMBlock dirty pages) has finished.
    # For example, with a large amount of memory, one epoch may span hundreds of sections.
    ?RAM_SAVE_FLAG_CGS_EPOCH and TDX Header and MBMD
        # indentation: pages are sent repeatedly
        ############# TD single stream #############
        # send the page's offset combined with flags, as the page header
        # by default the page is in the previous block, so the block info need not be resent
        if this block:
            # or RAM_SAVE_FLAG_CGS_STATE_CANCEL
            offset | RAM_SAVE_FLAG_CGS_STATE | RAM_SAVE_FLAG_CONTINUE
        else: # another block
            # or RAM_SAVE_FLAG_CGS_STATE_CANCEL
            offset | RAM_SAVE_FLAG_CGS_STATE
            len(block->idstr)
            block->idstr
        TDX Header # mainly gives the lengths of the data items that follow (all equal; the current implementation uses 1, i.e. one page per send)
        MBMD
        buffer list
        gpa list
        mac list
        ######### End of TD single stream ##########
        ############## TD multi stream ##############
        # when a multi-fd thread starts for the first time, it sends an initial packet (only once)
        # MULTIFD_MAGIC, MULTIFD_VERSION and p->id are all part of the initial packet.
        MULTIFD_MAGIC
        MULTIFD_VERSION
        p->id # which stream this is
            ###### each multi-stream packet ######
            MultiFDPacket_t | MULTIFD_FLAG_PRIVATE # metadata; the flags mark these pages as private
            # note that multi-stream does not send a TDX Header
            MBMD
            GPA list
            MAC list
            buffer list # page contents
            ########## end of each packet ##########
        ########### End of TD multi stream ###########
    RAM_SAVE_FLAG_EOS # marks the end of this section
    QEMU_VM_SECTION_FOOTER
    se->section_id # the section id is sent again, presumably as a consistency check
    ######################################## End of PART phase ########################################
    ####################################### Start of END phase #######################################
    QEMU_VM_SECTION_END
    RAM_SAVE_FLAG_CGS_EPOCH
    TDX Header
    MBMD
    se->section_id # the section id is sent again
        # this part is identical to what PART sends, so it is omitted
        # indentation: multiple pages are sent
    RAM_SAVE_FLAG_EOS # marks the end of this section
    QEMU_VM_SECTION_FOOTER
    se->section_id # the section id is sent again
    ######################################## END of END phase ########################################
QEMU_VM_SECTION_CGS_END
# TD information
TDX Header | TDX_MIG_F_CONTINUE
MBMD
buffer list
# Each vCPU information
    # indentation: executed for every vcpu
    TDX Header | TDX_MIG_F_CONTINUE
    MBMD
    buffer list
# Epoch track
TDX Header
MBMD
QEMU_VM_EOF # marks the end of the whole migration
# done!

For everything else, see the QEMU live migration stream structure.

Note:

  1. In traditional VM migration, multifd_send_sync_main is called once at the end of each section or once per full pass over RAM (for why I phrase it this way, see multifd_flush_after_each_section). In TD live migration, however, the TDH.EXPORT.TRACK SEAMCALL must run after each full pass over RAM to advance to the next epoch, so multifd_send_sync_main is always executed once after each full RAM pass.

cgs->ready QEMU, migration flow is done KVM, td_initialized KVM, finalized KVM

In chronological order:

  • td_initialized is set during tdx_mig_save_td(), i.e. when tdx_td_post_init() is called;
  • finalized is set during tdx_mig_save_epoch(), when the start token is specified; at this point the dst's KVM_RUN ioctl will actually run the TD;
  • cgs->ready (QEMU) becomes ready only after the TD/VCPU mutable state has been received;
  • "migration flow is done" is emitted only when QEMU executes tdx_mig_loadvm_state_cleanup(), meaning all state has been migrated; with post-copy, the dst can already run before this message appears.
static int tdx_mig_savevm_state_end(QEMUFile *f)
{
    TdxMigStream *stream = &tdx_mig.streams[0];
    int ret;

    // This moves the dst into the initialized state,
    // i.e. kvm_tdx->td_initialized = true
    ret = tdx_mig_save_td(f, stream);
    //...

    ret = tdx_mig_save_vcpus(f, stream);
    //...

    // This lets the dst start running:
    // kvm_tdx->finalized = true;
    ret = tdx_mig_save_epoch(f, true);
    //...
}

cgs_mig_loadvm_state() / tdx_mig_loadvm_state() / tdx_mig_savevm_state_ram() / tdx_mig_savevm_state_ram_cancel() / tdx_mig_save_ram() QEMU

On the send side, after the page-header metadata (the address) has been sent, this function sends the concrete data that follows: the MBMD, buffer list, etc. It also executes the corresponding TDX SEAMCALL to read the data out of the TDX Module (e.g. KVM_TDX_MIG_EXPORT_MEM).

On the receive side, after the page-header metadata (the address) has been received, this function receives the concrete data that follows: the MBMD, buffer list, etc. It also executes the corresponding TDX SEAMCALL to write the data into the TDX Module (e.g. KVM_TDX_MIG_IMPORT_MEM).



tdx_mig_stream_create() TDX KVM

static int tdx_mig_stream_create(struct kvm_device *dev, u32 type)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(dev->kvm);
	struct tdx_mig_state *mig_state = kvm_tdx->mig_state;
	struct tdx_mig_stream *stream;
	int ret;

	stream = kzalloc(sizeof(struct tdx_mig_stream), GFP_KERNEL_ACCOUNT);
    // error checking...
	dev->private = stream;
	stream->idx = atomic_inc_return(&mig_state->streams_created) - 1;

    // If this is the first stream, we need to create a session first
	if (!stream->idx) {
		ret = tdx_mig_session_init(kvm_tdx);
        // stream 0 is the default stream
        // note the default stream differs from the backward stream; when
        // creating stream 0 (the default stream), we also create the
        // backward stream
		mig_state->default_stream = stream;
	}

	ret = tdx_mig_do_stream_create(kvm_tdx, stream, &mig_state->migsc_paddrs[stream->idx]);
	return 0;
    //...
}

tdx_mig_session_init() TDX KVM

static int tdx_mig_session_init(struct kvm_tdx *kvm_tdx)
{
	struct tdx_mig_state *mig_state = kvm_tdx->mig_state;
	struct tdx_mig_gpa_list *blockw_gpa_list = &mig_state->blockw_gpa_list;
    //...
    // Call a SEAMCALL to create the backward stream, which is really just one page
	tdx_mig_do_stream_create(kvm_tdx, &mig_state->backward_stream, &mig_state->backward_migsc_paddr)
	if (tdx_is_migration_source(kvm_tdx))
		ret = tdx_mig_stream_gpa_list_setup(blockw_gpa_list);
}

tdx_mig_do_stream_create() TDX KVM

static int tdx_mig_do_stream_create(struct kvm_tdx *kvm_tdx, hpa_t *migsc_addr)
{
	struct tdx_module_args out;
	hpa_t migsc_va, migsc_pa;
	uint64_t err;

	/*
	 * This migration stream has been created, e.g. the previous migration
	 * session is aborted and the migration stream is retained during the
	 * TD guest lifecycle (required by the TDX migration architecture for
	 * later re-migration). No need to proceed to the creation in this
	 * case.
	 */
	if (*migsc_addr)
		return 0;

	migsc_va = __get_free_page(GFP_KERNEL_ACCOUNT);
	migsc_pa = __pa(migsc_va);
	tdh_mig_stream_create(kvm_tdx->tdr_pa, migsc_pa);
	*migsc_addr = migsc_pa;
}

TDH.MIG.STREAM.CREATE

The following must be passed in:

  • migsc: the MIGSC is a 4KB page, which we must provide rather than the module allocating it itself.
  • tdr: identifies the TD on which to create the stream.

TDP Live Migration

Live migration support for partitioned TD guests.

This essentially combines the TD partitioning and TD live migration features.

Per session vs per stream calls

Check KVM function: tdx_mig_stream_ioctl.

  • Per session (these take a stream_id parameter, but it must be 0): TDH.EXPORT.TRACK, TDH.EXPORT.STATE.IMMUTABLE, TDH.EXPORT.STATE.TD, TDH.EXPORT.STATE.VP
  • Per stream: TDH.EXPORT.MEM, TDH.EXPORT.RESTORE, TDH.EXPORT.UNBLOCKW

Why strict ordering in in-order phase?

What is multi-fd, multi-channel and multi-page?

multi-page means exporting multiple pages at a time rather than just one.

multi-fd / multi-channel should essentially correspond to multi-stream.

multi-stream is the TDX term; a stream represents one thread dumping pages out of the SEAM Module and does not involve network transport.

multi-fd and multi-channel are the terms used by QEMU's live-migration framework.

The different fds are not equal peers: Features/Migration-Multiple-fds - QEMU

Within each migration stream, proper ordering is maintained by the migration bundle counter

Why we need to maintain such ordering in each migration stream?

What is the epoch counter / TLB epoch?

The epoch counter and the TLB epoch are the same thing; each counter value identifies one epoch.

Two SEAMCALLs advance the epoch counter:

  • TDH.EXPORT.PAUSE
  • TDH.MEM.TRACK

Advancing this counter forces a TLB flush; that is why it is called the TLB epoch.

Why the TLB needs to be flushed on each migration epoch?

I don't know why yet.

If the TD has not been paused by TDH.EXPORT.PAUSE, TLB shootdown must be ensured.

How does KVM know TDX module supports TD migration?

KVM calls the TDH.SYS.RD or TDH.SYS.RDALL interface function to enumerate Intel TDX Module functionality, and learns from TDX_FEATURES whether the Intel TDX Module supports TD migration. The host VMM learns the details of the TD migration capabilities and service TD capabilities from the other fields.

To be migratable, the TD must be initialized, using the TDH.MNG.INIT function, with the ATTRIBUTES.MIGRATABLE bit set to 1.

What stuffs will be migrated?

TDX LM spec:

2 TD Migration Overview

2.4 Migrated Assets

Table 2.1: Migrated TD Assets

As there is no guarantee of allocating the same addresses on the dst TD, Secure EPT is not migrated.

Does the MigTD need to be on the same platform as the target TD?

Yes.

Details on MigTD connect to TD?

When launching a TD, QEMU specifies a migtd-pid option with the MigTD's process id; KVM then finds the right td struct for the MigTD and associates it with the TD (binding).

Firstly:

  • There is a KVM ioctl: KVM_TDX_SERVTD_BIND, which needs a parameter to specify the MigTD pid it want to bind.
  • Then KVM uses the SEAMCALL TDH.SERVTD.BIND or TDH.SERVTD.PREBIND, which binds a service TD to a target TD.

Secondly:

  • The port is set later by ioctl KVM_TDX_SET_MIGRATION_INFO.

Should the MigTD be created before the TD?

Not necessarily; pre-binding allows a user TD to launch before a MigTD is launched.

Migration Session Key (MSK) and Migration Transport Key (MTK)

MTK is to help protect the transport of the MSK.

The MTK is an authenticated Diffie-Hellman negotiated symmetric key, generated after mutual attestation of the MigTDs.

Migration protocol version

TD-scope metadata field MIG_VERSION is writable by the MigTD using TDH.SERVTD.WR. At the start of the migration session, the TDX module copies MIG_VERSION to an internal WORKING_MIG_VERSION that is used throughout the session.

What is an TD live migration operation / re-migrate / re-import / re-export / REMIGRATE?

Why is re-export needed? Because an exported page may become dirty again:

  • on the first export, we have already BLOCKW'ed the page (we BLOCKW the whole slot);
  • when the guest TD later wants to access the page, an EPT violation is triggered;
  • the EPT violation handler executes the unblockw SEAMCALL, which marks the page dirty, so the page needs to be blocked and re-exported.

Even a page that became dirty and needs a REMIGRATE cannot be migrated within the same epoch; it must wait for the next epoch:

case GpaListEntry::Operation::REMIGRATE:
    /* Check that the page has not been imported in the current migration epoch. */

For TDH.EXPORT.MEM, the MIGRATE operation covers both the export and re-export cases: the input is MIGRATE, and the output can be either MIGRATE or REMIGRATE. The update is done in-place, as the TDX Module code below shows:

// here we check whether operation_ is CANCEL
if(gpa_list_entry.operation_ == static_cast<uint64_t>(GpaListEntry::Operation::CANCEL))
    //...
else
{
    if(septe.IsFirstTimeExportAllowed())
        // change to MIGRATE
        gpa_list_entry.operation_ = static_cast<uint64_t>(GpaListEntry::Operation::MIGRATE);
    else if (septe.IsAnyExportedAndDirty())
        // change to REMIGRATE
        gpa_list_entry.operation_ = static_cast<uint64_t>(GpaListEntry::Operation::REMIGRATE);
//...

As you can see, only when a page has already been exported and has also become dirty does the export output REMIGRATE (IsAnyExportedAndDirty()).

A REMIGRATE is never a first-time import. It is still unclear what happens if the PFN supplied for a REMIGRATE differs from the value tracked by the TDX Module.

When re-migrating a page, if the TDX Module finds the page was already exported but is not dirty, it reports an error. In other words, do not re-export a page that has already been exported unless it has become dirty again; a page can only be exported once, so never export a page without reason.

When importing a page on the destination side, see:

ABI: GPA List Details: OPERATION

Migration
9. TD Private Memory Migration
9.4. TD Private Memory Export
9.4.5. Unblocking for Write, Tracking Dirty Pages and Re-Exporting

If the page on which the EPT violation happens has been exported, TDH.EXPORT.UNBLOCKW updates its SEPT state to EXPORTED_DIRTY, indicating that the page is dirty and needs to be re-exported.

TDH.IMPORT.MEM operations:

  • MIGRATE
  • CANCEL
  • REMIGRATE
  • NOP

REMIGRATE operation means we want to re-import the page. Re-import is only allowed during the in-order import phase. The imported pages replace an older version of the same pages.

Migration buffer

To export a multi-page migration bundle, KVM prepares a set of migration buffer pages and a buffer for an MBMD in shared memory.

Fill each entry in tdx buffer list with hpa / tdx_mig_stream_buf_list_setup / KVM

This function is executed only once, when the stream is created, because the buffer only needs to be set up once; filling each page with data is done by the TDX Module.

So the number of pages exported or imported at a time is fixed, determined by buf_list_pages, which is likewise set up at the very beginning.

static int tdx_mig_stream_buf_list_setup(struct tdx_mig_buf_list *buf_list, uint32_t npages)
{
    //...
	for (i = 0; i < npages; i++) {
        // alloc a new page
        page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
        //...
		buf_list->entries[i].pfn = page_to_pfn(page);
	}

	/* Mark unused entries as invalid */
	for (i = npages; i < 512; i++)
		buf_list->entries[i].invalid = true;
    //...
}

struct tdx_mig_buf_list KVM

entries and hpa refer to the same page: hpa is the page's host physical address, convenient to pass into calls, and entries points to the page's contents (e.g. 512 entries, each pointing to a page).

struct tdx_mig_buf_list {
	union tdx_mig_buf_list_entry *entries;
	hpa_t hpa;
};

Buffer list (import_mem_buf_list/mem_buf_list) / buf_list_pages

TDH.IMPORT.MEM has a detailed description of this, worth reading.

A buffer list (defined in the TDX module ABI spec) specifies a list of 4KB pages used as buffers to export and import TD private memory data. In the code it corresponds to mem_buf_list. This mem_buf_list is an area shared with userspace: QEMU writes received pages into it, thereby passing them to KVM.

The list itself is a 4KB page and can hold up to 512 64-bit entries with each entry pointing to a page.

buf_list_pages (a count, not the start address of the page list; the latter would be mem_buf_list) is currently the only parameter userspace sets on the kvm_device. buf_list is a list of pages used to export or import TD memory data or non-memory (e.g. vcpu) state from the TDX module. Userspace tells KVM the list size (i.e. number of pages) it will use; for example, userspace could export a batch of 128 TD private pages in one request.

buf_list_pages can be regarded as a capability: a fixed value set up at the very beginning. The npages passed in from userspace is checked against buf_list_pages. For the initial value, see tdx_mig_stream_setup() in QEMU; for the pre-copy case it is set to 1.

The maximum number of pages supported by the TDX architecture is currently 512, so userspace is expected to request a number no larger than 512. As the buf_list is also used to export/import TD non-memory states, the size (i.e. number of pages in the list) should be larger than or equal to the size required to export/import the non-memory states.
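The validation described above might look like the sketch below (`npages_valid` and the exact checks are illustrative, not the actual KVM code):

```c
#include <assert.h>
#include <stdint.h>

#define TDX_MIG_BUF_LIST_MAX 512   /* architectural maximum entries */

/* A per-request page count from userspace must be non-zero, must not
 * exceed the configured buf_list_pages capability, and that capability
 * itself must not exceed the architectural maximum. Illustrative. */
static int npages_valid(uint32_t npages, uint32_t buf_list_pages)
{
    return npages > 0 &&
           npages <= buf_list_pages &&
           buf_list_pages <= TDX_MIG_BUF_LIST_MAX;
}
```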

The difference between import_mem_buf_list and mem_buf_list is that the former is what is actually passed to the TDH.IMPORT.MEM SEAMCALL.

[RFC PATCH v2 042/107] KVM: TDX_MIG: allow userspace to set parameters to kvm_device

Page list for non-memory data (struct tdx_mig_page_list) / tdx_mig_stream_page_list_setup() KVM

A list of buffers used to export/import TD non-memory data (such as vCPU data). It is also called the Non-Memory State Migration Buffers List, so it too is a kind of buffer list.

The page list mirrors the buffer list design: one page holding 512 entries, each entry a PFN, with the pages those entries describe serving as buffers for non-memory state.

Because a single migration stream cannot process memory and non-memory data at the same time, there is no need to allocate another set of pages to back the page list; the pages already allocated for the buffer list are reused directly (the two lists also have the same length):

static int tdx_mig_stream_page_list_setup(struct tdx_mig_page_list *page_list,
			       struct tdx_mig_buf_list *buf_list,
			       uint32_t npages)
{
	struct page *page;
	uint32_t i;
	page = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
    //...
	page_list->entries = page_address(page);
	page_list->info.pfn = page_to_pfn(page);

    // reuse the pages allocated for the buffer list; the lists are even the same length
	for (i = 0; i < npages; i++)
		page_list->entries[i] = PFN_PHYS(buf_list->entries[i].pfn);

	page_list->info.last_entry = npages - 1;
	return 0;
}

Therefore only TDH.EXPORT.STATE.* and TDH.IMPORT.STATE.* take this parameter; TDH.EXPORT.MEM and TDH.IMPORT.MEM do not.

GPA list

Migration Spec: 9.3.1. GPA List

ABI 4.14.2. GPA List

The GPA list design closely resembles the buffer list and the page list: every entry is 64 bits, and there are 4K/64bit = 512 entries in total, but

  • a buffer/page list entry holds the physical address of a host buffer page, used purely to stage exported/imported data;
  • a GPA list entry holds a GPA. Since these are guest pages, no page allocation is needed, and there is certainly no sharing of buffer pages with the buffer list or the page list.

The GPA list is never used together with the page list for non-memory data (e.g. passed into the same SEAMCALL), because it is only used in memory-related scenarios.

The code implementation actually has two GPA lists:

  • struct tdx_mig_gpa_list blockw_gpa_list in the per-session struct tdx_mig_state;
  • struct tdx_mig_gpa_list gpa_list in the per-stream struct tdx_mig_stream.

The reasons:

  • The per-stream GPA list mainly supports SEAMCALLs such as TDH.EXPORT.MEM, which are themselves per-stream. That part is easy to understand.
  • The per-session GPA list mainly supports TDH.EXPORT.BLOCKW.
    • Why an extra GPA list? Because TDH.EXPORT.BLOCKW can happen while a TDH.EXPORT.MEM SEAMCALL is in progress.
    • Why per-session rather than per-stream? Couldn't each stream block on its own and then export?

GPA_LIST_INFO (Arch) is a 64-bit structure used as input and output of multiple functions. It provides the HPA of the GPA list page, plus other information such as:

  • FIRST_ENTRY: updated to the index of the next entry to be processed; if all entries have been processed, FIRST_ENTRY is updated to (LAST_ENTRY + 1) modulo 512.
  • LAST_ENTRY: index of the last entry in the GPA list.

2 use cases:

  • Used as part of the private memory migration bundle (scattered as entries).
  • Used as input and output of the following functions: TDH.EXPORT.BLOCKW, TDH.(EX/IM)PORT.MEM, TDH.EXPORT.RESTORE (passed as a whole).

A GPA list as a whole contains up to 512 entries; the most important fields of a GPA list entry are:

  • GPA and Level: The address (GPA bits 51:12)
  • State: MAPPED, PENDING
  • Operation: NOP, MIGRATE, REMIGRATE, CANCEL

Each entry is 8 bytes (64 bits): bits 51:12 hold the GPA, bit 2 is PENDING, bits 53:52 are OPERATION, and so on.
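A sketch of packing and unpacking such an entry, assuming the bit positions quoted above; the OPERATION encoding values (NOP=0, MIGRATE=1, REMIGRATE=2, CANCEL=3) are illustrative assumptions, not taken from the ABI:

```c
#include <assert.h>
#include <stdint.h>

enum gpa_op { GPA_OP_NOP = 0, GPA_OP_MIGRATE = 1,
              GPA_OP_REMIGRATE = 2, GPA_OP_CANCEL = 3 };

#define GPA_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */

static uint64_t gpa_entry_pack(uint64_t gpa, int pending, enum gpa_op op)
{
    uint64_t e = gpa & GPA_MASK;          /* GPA in bits 51:12 */
    e |= (uint64_t)(pending & 1) << 2;    /* PENDING in bit 2 */
    e |= ((uint64_t)op & 3) << 52;        /* OPERATION in bits 53:52 */
    return e;
}

static uint64_t gpa_entry_gpa(uint64_t e)    { return e & GPA_MASK; }
static int gpa_entry_pending(uint64_t e)     { return (e >> 2) & 1; }
static enum gpa_op gpa_entry_op(uint64_t e)  { return (enum gpa_op)((e >> 52) & 3); }
```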

MAC List

MAC stands for Message Authentication Code.

MAC is to confirm that the message came from the stated sender (its authenticity) and has not been changed.

Each MAC list entry is 128 bits (an AES-GMAC-256 tag), so holding 512 entries requires two 4KB pages.
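The sizing arithmetic can be checked with a one-liner (an illustrative helper, not code from KVM):

```c
#include <assert.h>
#include <stdint.h>

/* Each MAC list entry is a 128-bit (16-byte) AES-GMAC-256 tag, so the
 * number of 4KB pages needed for n entries is ceil(n * 16 / 4096). */
static uint32_t mac_list_pages(uint32_t entries)
{
    return (entries * 16 + 4095) / 4096;
}
```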

Like the GPA list, it does not need a set of backing buffer pages, unlike the buffer list and the page list.

The MAC list is also mainly used by memory-related SEAMCALLs such as TDH.EXPORT.MEM, not the non-memory ones; this shows it is used together with the buffer list/GPA list rather than the page list.

A single GPA list entry and a separate MAC list entry together compose the page metadata (not to be confused with the MBMD).

We already have the MBMD's verification; why do we also need a MAC list?

The MBMD guarantees integrity.

The Diffie-Hellman exchange between the two ends guarantees confidentiality.

The MAC list guarantees authenticity.

Actually a MAC list also provides some integrity, so it is not entirely clear why the MBMD is needed as well.

How is it generated?

The GPA list entry format is designed so that the output of TDH.EXPORT.BLOCKW can be used directly with TDH.EXPORT.MEM, and the output of TDH.EXPORT.MEM can be used directly with TDH.IMPORT.MEM.

Live migration whens

When the in-order phase starts?

TDH.EXPORT.STATE.IMMUTABLE.

When a migration session starts?

TDH.EXPORT.STATE.IMMUTABLE.

When the destination TD can run?

When TDH.IMPORT.COMMIT is invoked on the destination's KVM.

TDH.IMPORT.COMMIT also ensures that the TD will never run on the source platform again, because after it no abort token will be generated on the destination side.

When the blackout is end?

The blackout phase really has a broad sense and a narrow sense, just like pre-copy.

  • In the broad sense, blackout is the whole no-service window from src TD stop to dst TD start, and it is ended by TDH.IMPORT.COMMIT;
  • in the narrow sense, blackout is the window between TDH.EXPORT.PAUSE and TDH.EXPORT.TRACK.

When the pre-copy stage is ended?

TDH.EXPORT.TRACK with the start token.

TD live migration session

From the migration streams and migration queues perspective, a migration session is divided into in-order and out-of-order phase.

What does order mean? it means a newer export of the same memory page must be imported after an older export of the same page.

The start tokens, generated by TDH.EXPORT.TRACK and verified by TDH.IMPORT.TRACK, serve as markers to indicate the end of the in-order phase and start of the out-of-order phase. They are used to enforce all the in-order state (across all streams) to have been imported before the out-of-order phase starts.

(I suppose) there is usually one session to migrate a TD, but it may use multiple streams.

In-order memory export phase (i.e., the pre-copy phase)

This phase includes part of the source-TD-running phase and part of the blackout phase.

The source TD may run, and its memory and non-memory state may change.

The most up-to-date version must be migrated before the in-order phase ends.

Why? In the in-order phase, one or more migration streams are mapped to each migration queue.

Divided into multiple migration epochs. A specific page can only be migrated once per migration epoch. TDH.EXPORT.TRACK starts a new export epoch and creates an epoch-token migration bundle that is transmitted by the host VMM to the destination platform, where TDH.IMPORT.TRACK is invoked to start a new import epoch. The last invocations of TDH.EXPORT.TRACK and TDH.IMPORT.TRACK, with a parameter indicating that the in-order phase is done, start the out-of-order phase.

Out-of-order memory export phase (i.e., the post-copy phase)

Started by TDH.EXPORT.TRACK with a start token.

The source TD does NOT run, so its memory and non-memory state may NOT change.

Why the name? KVM may assign exported pages (even multiple copies of the same exported page) to different priority queues. This is used, e.g., for on-demand migration after the destination TD starts running.

Migration stream

A migration stream is a TDX concept. Multiple streams allow multi-threaded, concurrent export and import.

Within each stream, state is migrated in-order. This is enforced by the MB_COUNTER field of MBMD.
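A minimal sketch of that per-stream rule (field names are mine, not the real MBMD layout): the import side tracks the next expected MB_COUNTER for the stream and rejects any bundle that is not exactly next in line:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of per-stream in-order enforcement via MB_COUNTER. */
struct stream_order {
    uint64_t expected_mb_counter; /* counter the next bundle must carry */
};

static bool stream_accept(struct stream_order *st, uint64_t mb_counter)
{
    if (mb_counter != st->expected_mb_counter)
        return false;             /* out of order within this stream */
    st->expected_mb_counter++;
    return true;
}
```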

Each forward stream has a corresponding backward stream.

The host VMM should use the same stream index to import memory on the destination TD. This is enforced by TDH.IMPORT.MEM.

Non-memory state can only be migrated once; there is no override of older migrated non-memory state with a newer one.

TdxMigStream (QEMU)

In the kernel, a migration stream is implemented as a KVM device (kvm_tdx_mig_stream_ops).

// QEMU
cgs_mig->loadvm_state_setup
    tdx_mig_stream_setup
        tdx_mig_do_stream_setup
            tdx_mig_stream_create
                kvm_create_device
                    // KVM
                    kvm_ioctl_create_device
                        tdx_mig_stream_create
                            tdx_mig_do_stream_create
                                TDH.MIG.STREAM.CREATE
typedef struct TdxMigStream {
    int fd; // The KVM device fd to ioctl on
    void *mbmd;
    // After the Hdr and the MBMD have been read from the QEMUFile, the
    // remaining data is read into this buffer.
    // Distinguish this from the buffer list: this is a real buffer holding
    // the contents of all the 4K pages, while the buffer list is itself a
    // single 4K page containing 512 addresses, each pointing to a 4K page.
    // Corresponds to stream->mem_buf_list in KVM.
    // See tdx_mig_do_stream_setup in QEMU and tdx_mig_stream_fault in KVM.
    void *buf_list;
    void *mac_list;
    void *gpa_list;
} TdxMigStream;
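The 512-entry figure in the comments follows from the buffer list itself being one 4KB page of 8-byte addresses; a quick arithmetic sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A buffer list occupies a single 4KB page; each entry is an 8-byte HPA,
 * so one list can reference up to 512 4KB data pages (2MB of payload). */
#define TDX_MIG_PAGE_SIZE 4096

static size_t buf_list_capacity(void)
{
    return TDX_MIG_PAGE_SIZE / sizeof(uint64_t);
}
```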

Migration queue

KVM concept

Multiple queues allow QoS and prioritization.

Migration bundle

The generic migration bundle structure. Private memory migration uses an enhanced format.

2 parts:

  • migration data: may span one or more 4KB pages or one 2MB page.
  • migration bundle metadata (MBMD).

Every TDH.EXPORT.* function generates a migration bundle, no matter whether it is:

  • TDH.EXPORT.MEM to export private pages;
  • TDH.EXPORT.STATE.IMMUTABLE to export immutable state;
  • TDH.EXPORT.STATE.TD to export TD-scope mutable state;
  • TDH.EXPORT.STATE.VP to export VCPU-scope mutable state.

Export and import functions operate on a single migration bundle at a time, which belongs to a specific migration stream.

Migration data

It is

  • Confidentiality-protected using AES-GCM with the TD migration key and a running migration session counter
  • Integrity-protected by its associated MBMD.

Migration bundle metadata (MBMD)

*ABI: 3.12.1.1 Generic MBMD Structure; MBMD: Migration Bundle Metadata; TD Migration: Migration Types Data Types*

An MBMD is not confidentiality protected, but it provides integrity protection for the entire bundle. In KVM we can therefore read every MBMD field; for example, mb_type tells whether the bundle carries memory data or vCPU data. An MBMD is usually one 4KB page in size.

Obviously, the MBMD data differs between migration bundles, but the same buffer can be used to hold it.

A migration bundle always contains a single MBMD; optional migration data is stored in zero or more 4KB migration buffer pages.

How to export and import a migration bundle?

Export: KVM provides the MBMD’s HPA and a list of HPA pointers to the migration pages as input to the TDH.EXPORT.* function.

Import: the same.

Private memory migration bundle

Composed of multiple MAC-protected components:

  • MAC-protected MBMD
  • For each 4KB page, encrypted 4KB migration buffer
  • For each 4KB page, MAC-protected page metadata:
    • GPA list entry
    • page MAC list entry

Migration Stream Context (MigSC)

An opaque control structure that holds migration stream context.

MigSC occupies a single 4KB physical page, and is created using the TDH.MIG.STREAM.CREATE function.

MigSC is to a migration stream roughly what TDR is to a TD.

Migration tokens

A migration token is formatted as a migration bundle, with only an MBMD.

Abort token

Only the destination-side TDH.IMPORT.ABORT generates this token; the source-side TDH.EXPORT.ABORT does not.

The abort token is generated by TDH.IMPORT.ABORT on the destination platform if import fails for any reason. It helps ensure that the TD will not run on the destination platform, and therefore may be restored on the source platform.

Epoch token / Start token

An epoch token serves as an epoch separator. It provides the total number of migration bundles exported so far, which helps TDH.IMPORT.TRACK, which imports the epoch token, check that all migration bundles of the previous epoch have been received. No migration bundle of an older epoch may be imported.

A start token (epoch number 0xFFFFFFFF) is a special version of an epoch token which starts the out-of-order phase. It ensures that no newer version of any page exported prior to the start token exists on the source platform.

Which parts are delayed on the destination TD side?

The TDH.MNG.INIT SEAMCALL is delayed, because the information it needs has not been fully migrated over yet.

The TDH.VP.CREATE SEAMCALL is delayed, because the dst TD has not been initialized yet. (But the pages can be allocated in advance.)

Loading the shared EPT pointer into the VMCS is also delayed, because “loading of the root page to shared EPT pointer in VMCS needs to be done after the TD has been fully initialized.”

struct tdx_mig_state KVM

A per-TD scope struct, to manage its migration related info, e.g. the number of migration streams that have been created in the TDX module.

struct tdx_mig_state {
	atomic_t streams_added;
	/*
	 * Array to store physical addresses of the migration stream context
	 * pages that have been added to the TDX module. The pages can be
	 * reclaimed from TDX when TD is torn down.
	 */
	hpa_t *migsc_paddrs;
	struct tdx_mig_gpa_list blockw_gpa_list;
	struct tdx_mig_stream *default_stream;
	/* Backward stream used on migration abort during post-copy */
	struct tdx_mig_stream backward_stream;
	hpa_t backward_migsc_addr;
	bool bugged;
	/* Index of the next vCPU to export the state */
	// Mainly used as state during export_vp.
	uint32_t vcpu_export_next_idx;
};

Life cycle of live migration

*TDX LM Spec:

  • 4.TD Migration Software Flows
  • 4.1.Typical TD Migration Flow Overview*

TDX LM Spec:

  • Figure 7.1: Migration Session Control Overview (Success Case)

Migrate global immutable non-memory state (part of the broadly-defined pre-copy phase)

Related SEAMCALL: TDH.EXPORT.STATE.IMMUTABLE, TDH.IMPORT.STATE.IMMUTABLE.

Immutable metadata is the set of TD state variables that are set by TDH.MNG.INIT; they may be modified during TD build but are never modified after the TD’s measurement is finalized using TDH.MR.FINALIZE.

Some of these state variables control how the TD and its memory is migrated. Therefore, the immutable TD control state is migrated before any of the TD memory state is migrated.

Prior to invoking TDH.EXPORT.STATE.IMMUTABLE, KVM should create enough migration stream contexts using TDH.MIG.STREAM.CREATE. That means the session is created after the streams are created by TDH.MIG.STREAM.CREATE.

TD private page pre-copy (part of the broadly-defined pre-copy phase)

# This is for each stream
for e in epochs:
    while (some condition):
        TDH.EXPORT.BLOCKW # multiple times, blocks a set of pages for writing
    TDH.MEM.TRACK # once, increments the TD's epoch
    # starts epoch and creates epoch token migration bundle;
    # **a page can be exported once per epoch**. It can also be
    # placed at the start (before TDH.EXPORT.BLOCKW)
    TDH.EXPORT.TRACK(epoch token) 
    while (some condition):
        TDH.EXPORT.MEM # exports, re-exports or cancels the export of TD private pages and creates a memory migration bundle.
    while (some condition):
        TDH.EXPORT.UNBLOCKW # unblocks for write; if the page was already exported, it needs to be re-blocked and re-exported.

Migrate final (mutable) non-memory state (blackout period)

Migration Spec: 2.5.4.2 TD-Scope and VCPU-Scope Mutable Non-Memory State migration

Started by: TDH.EXPORT.PAUSE

Ended by: TDH.EXPORT.TRACK(start token)

Related:

  • TDH.EXPORT.STATE.TD
  • TDH.EXPORT.STATE.VP

It is part of the pre-copy period.

Mutable non-memory state is the set of source TD state variables that might have changed since the TD was finalized via TDH.MR.FINALIZE. It exists for the

  • TD scope (as part of the TDR and TDCS control structures) and the
  • VCPU scope (as part of the TDVPS control structure).

This is the cause of the blackout: the source must not run, because otherwise this state would keep changing, and the destination cannot run without it. Although the source must not run, TDH.EXPORT.MEM may still be used during this blackout time.

KVM must pause the source TD for a brief period so that it can export the final control state (for all VCPUs and for the TD overall). This is initiated via TDH.EXPORT.PAUSE, which prevents TD VCPUs from executing any further. It then allows export of the final (mutable) TD non-memory state (TDH.EXPORT.STATE.TD and TDH.EXPORT.STATE.VP).

On the source platform, TDH.EXPORT.PAUSE starts the blackout phase and TDH.EXPORT.TRACK ends it. The latter also marks:

  • The end of pre-copy
  • The end of the mutable TD VP state
  • The end of the mutable TD global control state
  • The start of the out-of-order phase.

TDH.EXPORT.TRACK generates a start token to allow the destination TD to become runnable. On the destination platform, TDH.IMPORT.TRACK, which consumes it, allows the destination TD to be un-paused.

Life cycle of migtd

The MigTD lifecycle does not have to coincide with the target TD’s; the MigTD may be instantiated when required for Live Migration, but

  • it must be bound to the target TD before Live Migration can begin
  • and must be operational until the MSK has been successfully programmed for the target TD being migrated.

Questions

5.1. Example Migration Session Establishment: MigTD-s requests the host VMM to be bound to TD-s, using TDG.VP.VMCALL.

What I see is the TD requesting to bind to a MigTD, not the MigTD binding to a TD.

Epochs (note the pronunciation)

Epochs in TDCS

MIG_EPOCH

The migration epoch starts from 0 on migration session start and is incremented by 1 on each epoch token. A value of 0xFFFFFFFF indicates the out-of-order phase.

So this is what TDH.EXPORT.TRACK actually increments.

TD_EPOCH

Incremented by the host VMM using the TDH.MEM.TRACK function.

BW_EPOCH

Migration related; it is also related to TD_EPOCH.

Blocking-for-write epoch.

Holds the value of TD_EPOCH at the last time TDH.EXPORT.BLOCKW blocked a page for writing.
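A hedged sketch of how BW_EPOCH feeds the TLB-tracking condition (this is my reading, not the module's actual algorithm): a page blocked at BW_EPOCH is only safely exportable once TDH.MEM.TRACK has advanced TD_EPOCH past it and every vCPU has re-entered since, i.e. every VCPU_EPOCH is newer than BW_EPOCH:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical TLB-tracking check: the page was blocked for writing when
 * TD_EPOCH had the value bw_epoch. It is considered tracked only after
 * the epoch counter moved on AND no vCPU is still running in an epoch
 * old enough to hold a stale writable translation. */
static bool tlb_tracked(uint64_t bw_epoch, uint64_t td_epoch,
                        const uint64_t *vcpu_epochs, int nr_vcpus)
{
    if (bw_epoch >= td_epoch)
        return false;            /* no TDH.MEM.TRACK since the block */
    for (int i = 0; i < nr_vcpus; i++)
        if (vcpu_epochs[i] <= bw_epoch)
            return false;        /* this vCPU may still hold a stale TLB entry */
    return true;
}
```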

Epochs in TDVPS

VCPU_EPOCH

Related to TD_EPOCH. For example, TDH.VP.ENTER copies TDCS.TD_EPOCH into TDVPS.VCPU_EPOCH.

Abort

*Migration: 7.1.2. Aborted Migration Session Overview (Migration Session Control and State Machines)*

               In-Order    Out-Of-Order
Abort on SRC   Yes         -
Abort on DST   -           Yes

Whether the abort happens in the in-order or the out-of-order phase, on the src or on the dst, the source should resume running and the destination must not run.

ram_save_abort() / cgs_mig_savevm_state_ram_abort() / tdx_mig_savevm_state_ram_abort() QEMU

This function was also introduced by the TDX 1.5 patches.

void ram_save_abort(void)
{
    uint64_t cgs_epochs = stat64_get(&mig_stats.cgs_epochs);
    int ret;

    if (!cgs_epochs) {
        return;
    }

    ret = cgs_mig_savevm_state_ram_abort();
    // error handling...
}

int cgs_mig_savevm_state_ram_abort(void)
{
    //...
    //tdx_mig_savevm_state_ram_abort
    ret = cgs_mig.savevm_state_ram_abort();
    //...
}

static int tdx_mig_savevm_state_ram_abort(void)
{
    TdxMigStream *stream = &tdx_mig.streams[0];
    tdx_mig_stream_ioctl(stream, KVM_TDX_MIG_EXPORT_ABORT, 0, 0);
    //...
}

KVM_TDX_MIG_EXPORT_ABORT / tdx_mig_export_abort() KVM

From the SEAMCALL perspective:

  • TDH.EXPORT.ABORT
  • TDH.EXPORT.RESTORE
  • TDH.EXPORT.UNBLOCKW
static int tdx_mig_export_abort(struct kvm_tdx *kvm_tdx, struct tdx_mig_stream *stream, uint64_t __user *data)
{
    //...
    // Invoke this SEAMCALL
	tdh_export_abort(kvm_tdx->tdr_pa, 0, 0);
    //...
	return kvm_tdp_mmu_restore_private_pages(&kvm_tdx->kvm);
}

TDH.EXPORT.ABORT

Abort an export session.

TDH.EXPORT.ABORT aborts an export session and allows the source TD to resume normal operation, depending on export state and an abort token received from the destination platform.

tdp_mmu_restore_private_pages() / tdp_mmu_restore_private_page() KVM

static int tdp_mmu_restore_private_page(struct kvm *kvm, gfn_t gfn,
					u64 *sptep, u64 old_spte,
					u64 new_spte, int level)
{
	int ret;

	kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
	ret = static_call(kvm_x86_restore_private_page)(kvm, gfn);
	if (ret == -EPERM && is_shadow_present_pte(old_spte) && !is_writable_pte(old_spte)) {
		kvm_write_unblock_private_page(kvm, gfn, level);
		ret = 0;
	}

	return 0;
}

TDH.EXPORT.RESTORE

Restores the Secure EPT entry states of a list of TD private 4KB pages. (Many pages can be restored in one call.)

Reverts each Secure EPT entry to its original non-exported state.

A page that was blocked for writing and exported becomes MAPPED after restore; no further unblockw is needed:

  • Check that the SEPT entry state is one of the EXPORTED_* or PENDING_EXPORTED_* states (EXPORTED_BLOCKW is of course included).
  • If the SEPT state is one of the PENDING_* states, update it to PENDING. Else, update it to MAPPED.

Possible reasons for failure:

tdx_restore_private_page() / x86_ops->restore_private_page KVM

static int tdx_restore_private_page(struct kvm *kvm, gfn_t gfn)
{
	struct tdx_module_args out;
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_mig_stream *stream = kvm_tdx->mig_state->default_stream;
	struct tdx_mig_gpa_list *gpa_list = &stream->gpa_list;

	tdx_mig_gpa_list_init(gpa_list, &gfn, 1);
    //...
	tdh_export_restore(kvm_tdx->tdr_pa, gpa_list->info.val, &out);
    //...
    // This page was never exported, so it needs no restore; the restore naturally fails
	if (gpa_list->entries[0].status != GPA_LIST_S_SUCCESS)
		return -EPERM;
	return 0;
}

kvm_write_unblock_private_page() / tdx_write_unblock_private_page() KVM

void kvm_write_unblock_private_page(struct kvm *kvm, gfn_t gfn, int level)
{
    //...
	kvm_x86_ops.write_unblock_private_page(kvm, gfn, level);
}

static void tdx_write_unblock_private_page(struct kvm *kvm, gfn_t gfn, int level)
{
	struct tdx_module_args out;
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	union tdx_mig_ept_info ept_info = {
		// TDX treats level 0 as the leaf level,
		// while Linux treats level 1 (PG_LEVEL_4K) as the leaf level.
		.level = pg_level_to_tdx_sept_level(level),
		.gfn = gfn,
        //...
	};
    // Ensure the TLB holds no cached translations,
    // because we are about to modify the real EPT entry.
	tdx_track(kvm);

	err = tdh_export_unblockw(kvm_tdx->tdr_pa, ept_info.val, &out);
    //...
}

Abort during the In-Order Phase

Blackout is part of the in-order phase, so aborting during blackout is the same as aborting during the immutable-state and memory export phases.

Abort by source

TDH.EXPORT.ABORT terminates the export session and enables the TD to resume running on the source platform.

By design, the TD should not be able to run on the destination platform.

Abort by destination

Abort during the Out-Of-Order Phase

Abort by source

Abort by destination

TDH.IMPORT.ABORT is invoked on the destination platform and creates an abort token, which is transmitted back to the source and enables the source TD to resume running.

Will the memory modified on the DST also be transmitted back to the SRC?

Patch learning

Overview

The QEMU cmdline specifies the MigTD’s pid, so a bind call is made right at startup to bind them. If the MigTD has already started at that point, it does not keep polling with the WaitForRequest TDCALL: on the first call, KVM sees that the vsockport information has not been set yet and emulates a halt instruction to block it.

Next the vsockport is set; once it is set, KVM resumes the MigTD vcpu, so its next WaitForRequest sees the vsockport information we set. When it is done, it calls ReportStatus.

mig_stream is defined in both KVM and QEMU, but the structures differ. A QEMU stream corresponds to a KVM stream through a kvm_device fd: the QEMU stream holds a kvm_device in the form of an fd, and that kvm_device’s private field points to the stream in KVM.

Because the mbmd is not encrypted, QEMU itself can also read the mbmd. On export, the mbmd is passed to the module; its migsc_index says which stream to use, and currently it must be 0. The caps returned by the TDX module currently report support for up to 512 streams.

For the dst TD, there is no need to run TDH.MNG.INIT to initialize the TD, because TDH.IMPORT.STATE.IMMUTABLE takes care of the initialization. In KVM, the dst TD’s post-init is scheduled after the TDH.IMPORT.STATE.TD SEAMCALL.

Questions

How does the MigTD learn the vsockport?

Potential areas where I have not found the patch yet

KVM: TDX: retry seamcall on recoverable errors

Lets some recoverable SEAMCALLs be retried, 1000 times by default. There are three exceptions; if the status is:

  • TDX_VCPU_ASSOCIATED
  • TDX_VCPU_NOT_ASSOCIATED
  • TDX_INTERRUPTED_RESUMABLE

then return directly without retrying.

KVM: TDX: remove TDX_ERROR_SEPT_BUSY

SEPT-related SEAMCALLs need to retry on TDX_OPERAND_BUSY, but since we designed a new retry-capable wrapper function, the original SEPT-specific retry function can be removed and ours used instead.

!!! [RFC PATCH v2 032/107] KVM: TDX: allow userspace to finish td initialization at a later stage

example

imported

Code Architecture

MigStream is defined as a KVM device.

The migration flow is driven by QEMU. QEMU can issue SEAMCALL ioctl to KVM to perform the real SEAMCALL.

TdxMigState has a list of TdxMigStream.

CGS_PRIVATE_GPA_INVALID QEMU

Mainly used to tell whether a dirty page is a private page or a shared page. If it is a private page, pss->cgs_private_gpa is the page’s GPA; if it is a shared page, pss->cgs_private_gpa is CGS_PRIVATE_GPA_INVALID and the GPA to send is stored in pss->page. The different values also select different paths: shared pages go through the legacy migration path, while private pages go through the TDX migration path (TDH.EXPORT.MEM).

ram_load_update_cgs_bmap() QEMU

cgs_bmap records whether each page is private or shared. Based on the cgs_bmap information, this function converts the page at the given address from shared to private, or the reverse.

int ram_load_update_cgs_bmap(RAMBlock *block, ram_addr_t offset,
                             bool is_private)
{
    unsigned long bit = offset >> TARGET_PAGE_BITS;
    bool was_private;
    hwaddr gpa;
    //...
    was_private = test_bit(bit, block->cgs_bmap);
    if (was_private == is_private) {
        return 0;
    }

    /* Unaliased GPA is the same for both private pages and shared pages */
    ret = kvm_physical_memory_addr_from_host(kvm_state, block->host + offset, &gpa);
    ret = kvm_convert_memory(gpa, TARGET_PAGE_SIZE, is_private, INT_MAX);
    return ret;
}

Call path:

// Take the most common case, precopy, as an example (multifd and postcopy actually use it too)
ram_load_precopy
    ram_load_update_cgs_bmap

ram_get_private_gpa() QEMU

Given an rb and the index of a page within that rb, returns the GPA.

The RAMBlock data structure carries no information about the guest, so we must compute the GPA from somewhere else. But why compute it through KVMSlot rather than through the MemoryRegion data structure? Because a MemoryRegion does not directly store the GPA either, only the offset within its parent MR, which makes the computation inconvenient as well.

static hwaddr ram_get_private_gpa(RAMBlock *rb, unsigned long page)
{
    int ret;
    // page is the index within this rb, not a gfn; it must be converted to a gpa
    ram_addr_t offset = ((ram_addr_t)page) << TARGET_PAGE_BITS;
    hwaddr gpa;

    // In some conditions: return CGS_PRIVATE_GPA_INVALID
    // ...
    // Walk the KVMSlots to get the GPA corresponding to the HVA; the HVA is computed as host + offset
    kvm_physical_memory_addr_from_host(kvm_state, rb->host + offset, &gpa);
    return gpa;
}

TdxMigHdr (QEMU)

This struct is not defined by the TDX spec; it is a self-defined helper data structure that precedes the MBMD.

It is not used when sending RAM over multiple streams, because the length information is already in MultiFDPacket_t.

It is used when sending irregular data, such as the epoch, TD, and VCPU state, and when sending RAM during pre-copy. RAM uses it because sending multiple pages at once was originally supported, though the count is currently hard-coded to 1.

typedef struct TdxMigHdr {
    uint16_t flags;
    uint16_t buf_list_num; // how many pages of data does the MB carry? Multiply by the page size (4KB) to get the byte count
} TdxMigHdr;
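A small sketch of the size accounting implied by buf_list_num (struct copied from above; the helper function is mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TdxMigHdr {
    uint16_t flags;
    uint16_t buf_list_num; /* number of 4KB data pages following the MBMD */
} TdxMigHdr;

/* Bytes of migration data announced by the header. */
static size_t tdx_mig_data_bytes(const TdxMigHdr *hdr)
{
    return (size_t)hdr->buf_list_num * 4096;
}
```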

struct tdvmcall_service_migtd KVM

It can serve both as the parameter of a TDVMCALL issued by the MigTD and as the response to that TDVMCALL; in other words, this struct is bidirectional. For example, in the function tdx_handle_service_migtd:

tdx_handle_service_migtd
    //...
	struct tdvmcall_service_migtd *cmd_migtd = (struct tdvmcall_service_migtd *)cmd_hdr->data;
	struct tdvmcall_service_migtd *resp_migtd = (struct tdvmcall_service_migtd *)resp_hdr->data;
    //...

As can be seen, it is used both as the cmd and as the response.

struct tdvmcall_service_migtd {
	uint8_t version;
	uint8_t cmd;
	uint8_t operation;
	uint8_t status;
	uint8_t data[0];
};

tdx_mig_gpa_list_setup() QEMU

static void tdx_mig_gpa_list_setup(union GpaListEntry *gpa_list, hwaddr *gpa,
                                   uint64_t gpa_num, int operation)
{
    int i;

    for (i = 0; i < gpa_num; i++) {
        gpa_list[i].val = 0;
        gpa_list[i].gfn = gpa[i] >> TARGET_PAGE_BITS;
        gpa_list[i].mig_type = GPA_LIST_ENTRY_MIG_TYPE_4KB;
        gpa_list[i].operation = operation;
    }
}

struct tdx_mig_stream

struct tdx_mig_stream {
	uint16_t idx;
	struct tdx_mig_mbmd mbmd;
    // The length of mem_buf_list: only the first buf_list_pages
    // entries of mem_buf_list have real pages allocated.
	uint32_t buf_list_pages;
    // The list itself is a 4KB page, so it can hold up to 512 entries.
    // memory data, used for non-in-place import, so both the source and the destination have it
	struct tdx_mig_buf_list mem_buf_list;
    // memory data, used for in-place import, so only the destination has it
	struct tdx_mig_buf_list td_buf_list;
	// List of buffers grabbed either from the private_fd allocated pages
	// for in place import or from mem_buf_list for non-in-place import.
    // In short: either td_buf_list or mem_buf_list
	struct tdx_mig_buf_list import_mem_buf_list;
	// import/export TD non-memory state data
	struct tdx_mig_page_list page_list;
	/* List of GPA entries used when export/import the TD private memory */
	struct tdx_mig_gpa_list gpa_list;
	/* List of MACs used when export/import the TD private memory */
	struct tdx_mig_mac_list mac_list[2];
	/*
	 * Bitmap to get if a gpa in the gpa_list to import needs first-time
	 * import, i.e. the sept entry has not been setup on the TDX side.
	 * 512 bits which supports 512 pages in a batch.
	 */
	uint64_t first_time_import_bitmap[8];
	/* GFNs of the pages to import */
	gfn_t gfns[TDX_MIG_GPA_LIST_MAX_ENTRIES];
	uint64_t sptes[TDX_MIG_GPA_LIST_MAX_ENTRIES];
};

CgsMig

This struct defines a set of function hooks. Currently it has only one instance, the global cgs_mig.

tdx_mig_init()

This function mainly initializes cgs_mig by assigning its functions.

qemu_init
    migration_object_init
        tdx_mig_init

TDH.EXPORT.MEM

// QEMU
ram_find_and_save_block
    ram_save_host_page
        ram_save_target_page
            ram_save_cgs_private_page
                cgs_mig_savevm_state_ram
                    cgs_mig.savevm_state_ram
                        tdx_mig_save_ram
                            tdx_mig_stream_ioctl(KVM_TDX_MIG_EXPORT_MEM)

// KVM
tdx_mig_stream_export_mem
    tdh_export_mem

Executing TDH.EXPORT.BLOCKW before this SEAMCALL is not always required:

  • If the TD is running, the exported pages MUST be blocked and TLB tracked.
  • Else (e.g., the TD has been paused for export), no blocking and tracking is required.

But most of the time (while running) it is required; the reason is presumably to prevent the page from being updated during the export.

But most of the time the TD is surely still running. So the question is: how do we guarantee the page has already been BLOCKW'ed at EXPORT time? See BLOCKW.

If a page that was already exported in the current epoch becomes dirty and needs to be re-exported (remigrated), is that allowed?

Short answer: neither re-import nor cancel is allowed.

Migration
7. Migration Session Control and State Machines
7.1. Overview
7.1.3. Migration Epochs

A page can only be migrated once per migration epoch.

Host VMM starts migration epoch with an epoch token migration bundle received from the source platform; a page can be imported once per epoch.

Migration
9. TD Private Memory Migration
9.5. TD Private Memory Import
9.5.1. TD Private Memory Import In-Order Import Phase
9.5.1.3. Enforcing a Single Import Operation per Migration Epoch

When a page is imported during the in-order phase, the current migration epoch is recorded in the page’s PAMT.BEPOCH field. Page re-import and import-cancel operations compare against the recorded migration epoch: for the import to succeed, it must be older than the current migration epoch. (That is, the epoch at re-import or cancel time must be newer than the epoch of the first import, so within a single epoch a page cannot be exported and then canceled or re-exported.)
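The epoch comparison above can be condensed into a sketch (hypothetical helper, not module code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A re-import or import-cancel succeeds only if the epoch recorded at the
 * page's previous import (PAMT.BEPOCH) is strictly older than the current
 * migration epoch, hence at most one import operation per page per epoch. */
static bool reimport_or_cancel_ok(uint32_t pamt_bepoch, uint32_t mig_epoch)
{
    return pamt_bepoch < mig_epoch;
}
```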

The following code also shows that cancel and export cannot be placed in the same epoch:

ram_prepare_postcopy
    cgs_ram_save_start_epoch(f);
        ram_save_target_page_private

After an export, then a cancel, then another export, will the output operation be MIGRATE or REMIGRATE?

tdx_mig_save_ram() QEMU / How does TDH.EXPORT.MEM export the needed structures to userspace?

static long tdx_mig_save_ram(QEMUFile *f, TdxMigStream *stream)
{
    int ret;
    uint64_t num = 1;
    uint64_t hdr_bytes, mbmd_bytes, gpa_list_bytes,
             buf_list_bytes, mac_list_bytes;

    /* Export mbmd, buf list, mac list and gpa list */
    // Note that this ioctl deposits the data directly into the stream; how does it do that?
    ret = tdx_mig_stream_ioctl(stream, KVM_TDX_MIG_EXPORT_MEM, 0, &num);

    mbmd_bytes = tdx_mig_stream_get_mbmd_bytes(stream);
    buf_list_bytes = TARGET_PAGE_SIZE;
    mac_list_bytes = sizeof(Int128);
    gpa_list_bytes = sizeof(GpaListEntry);

    hdr_bytes = tdx_mig_put_mig_hdr(f, 1, 0);
    qemu_put_buffer(f, (uint8_t *)stream->mbmd, mbmd_bytes);
    qemu_put_buffer(f, (uint8_t *)stream->buf_list, buf_list_bytes);
    qemu_put_buffer(f, (uint8_t *)stream->gpa_list, gpa_list_bytes);
    qemu_put_buffer(f, (uint8_t *)stream->mac_list, mac_list_bytes);

    return hdr_bytes + mbmd_bytes + gpa_list_bytes +
           buf_list_bytes + mac_list_bytes;
}

How is the GPA to export passed from userspace to KVM? That line seems to pass only the page count?

tdx_mig_gpa_list_setup() was run beforehand to set up the operation, gfn, etc. into stream->gpa_list.

tdx_mig_stream_export_mem() KVM

static int64_t tdx_mig_stream_export_mem(struct kvm_tdx *kvm_tdx,
					 struct tdx_mig_stream *stream,
					 uint64_t __user *data)
{
	struct tdx_mig_state *mig_state = kvm_tdx->mig_state;
	/* Userspace is expected to fill the gpa_list.buf[i] fields */
	struct tdx_mig_gpa_list *gpa_list = &stream->gpa_list;
	struct tdx_mig_buf_list *mem_buf_list = &stream->mem_buf_list;
	union tdx_mig_stream_info stream_info = {.val = 0};
	struct tdx_module_output out;
	uint64_t npages, err;
	int idx;

	if (mig_state->bugged)
		return -EBADF;

	if (copy_from_user(&npages, (void __user *)data, sizeof(uint64_t)))
		return -EFAULT;

	if (npages > stream->buf_list_pages)
		return -EINVAL;

	/*
	 * The gpa list page is shared to userspace to fill GPAs directly.
	 * Only need to update the gpa_list info fields here.
	 */
	gpa_list->info.first_entry = 0;
	gpa_list->info.last_entry = npages - 1;
	tdx_mig_buf_list_set_valid(&stream->mem_buf_list, npages);

	stream_info.index = stream->idx;
	do {
		err = tdh_export_mem(kvm_tdx->tdr_pa,
				     stream->mbmd.addr_and_size,
				     gpa_list->info.val,
				     mem_buf_list->hpa,
				     stream->mac_list[0].hpa,
				     stream->mac_list[1].hpa,
				     stream_info.val,
				     &out);
		if (seamcall_masked_status(err) == TDX_INTERRUPTED_RESUMABLE) {
			stream_info.resume = 1;
			/* Update the gpa_list_info (mainly first_entry) */
			gpa_list->info.val = out.rcx;
		}
	} while (seamcall_masked_status(err) == TDX_INTERRUPTED_RESUMABLE);

	/*
	 * It is possible that TDX module returns a general success,
	 * with some pages failed to be exported. For example, a page
	 * was write enabled before the TDH_EXPORT_MEM seamcall. The failed
	 * page hsa been marked dirty in the dirty page log and will be
	 * re-exported later in the next round. So no special handling here
	 * and just ignore such error.
	 *
	 * The number of failed pages is put in the operand id field, so
	 * we mask that part to indicate a general success of the call.
	 */
	if (seamcall_masked_status(err) == TDX_SUCCESS) {
		if (err != TDX_SUCCESS) {
			idx = srcu_read_lock(&kvm_tdx->kvm.srcu);
			tdx_mig_handle_export_mem_error(&kvm_tdx->kvm,
							gpa_list, npages);
			srcu_read_unlock(&kvm_tdx->kvm.srcu, idx);
		}

		/*
		 * 1 for GPA list and 1 for MAC list
		 * TODO: Improve by checking GPA list entries
		 */
		out.rdx = out.rdx - 2;
		if (copy_to_user(data, &out.rdx, sizeof(uint64_t)))
			return -EFAULT;
	} else {
		pr_err("%s: err=%llx, gfn=%llx\n",
			__func__, err, (uint64_t)gpa_list->entries[0].gfn);
		return -EIO;
	}

	return 0;
}

tdx_mig_stream_import_mem() KVM

case KVM_TDX_MIG_IMPORT_MEM:
    tdx_mig_stream_import_mem
static int tdx_mig_stream_import_mem(struct kvm_tdx *kvm_tdx,
				     struct tdx_mig_stream *stream,
				     uint64_t __user *data)
{
	int idx, ret;
	uint64_t i, npages;
	gfn_t gfn;
	kvm_pfn_t pfn;
	struct kvm *kvm = &kvm_tdx->kvm;
	struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, 0);
	union tdx_mig_gpa_list_entry *gpa_list_entries =
						stream->gpa_list.entries;

    // npages = data
	if (copy_from_user(&npages, (void __user *)data, sizeof(uint64_t)))
		return -EFAULT;

    // stream->gfns: just buffer, doesn't contain any data
    // stream->sptes: just buffer, doesn't contain any data
    // stream->gpa_list.entries: already has value
	memset(stream->gfns, 0, npages * sizeof(gfn_t));
	memset(stream->sptes, 0, npages * sizeof(uint64_t));

	for (i = 0; i < npages; i++) {
        //...
		gfn = (gfn_t)(gpa_list_entries[i].gfn);
        // When the operation is not cancel
		if (!gpa_cancel_import(&gpa_list_entries[i])) {
			ret = kvm_restricted_mem_get_pfn(gfn_to_memslot(kvm, gfn), gfn, &pfn, NULL);
            // error handling...
			stream->sptes[i] = (u64)pfn << PAGE_SHIFT | VMX_EPT_RWX_MASK |
					(MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
					VMX_EPT_IPAT_BIT | VMX_EPT_SUPPRESS_VE_BIT;
		}

		stream->gfns[i] = gfn;
	}

	return kvm_mmu_import_private_pages(vcpu, stream->gfns,
					    stream->sptes, npages,
					    stream->first_time_import_bitmap,
					    stream);
}

TDH.MEM.TRACK

Increment the TD’s TLB tracking counter (epoch counter, TDCS.TD_EPOCH), execute once per migration epoch.

Its only input is a TDR, which shows this is a per-TD SEAMCALL.

Why increment it? What is it used for?

How does its role differ from that of TDH.EXPORT.TRACK?

TDH.EXPORT.TRACK

It can optionally generate a start token. If it does not, this is just an ordinary epoch; if it does, the out-of-order export phase is being started.

On the source platform,

  • TDH.EXPORT.PAUSE starts the blackout phase and,
  • TDH.EXPORT.TRACK ends the blackout phase (and marks the end of the transfer of TD memory pre-copy, mutable TD VP and mutable TD global control state). TDH.EXPORT.TRACK generates a start token to allow the destination TD to become runnable. On the destination platform, TDH.IMPORT.TRACK – which consumes the cryptographic start token, allows the destination TD to be un-paused.

In the code, when the dst receives this token it also sets kvm_tdx->finalized = true;. At that point the dst can already start running (rather than waiting until TDH.IMPORT.END runs and the migration flow is seen to be done).

End the current epoch and start the new epoch. Generate an epoch token to be exported to the destination platform.

Invoked on the master thread on the source platform during the in-order export phase; verifies proper in-order export.

migration_thread // search "migrate -d tcpu:localhost:6666", here, we have finished the connect
    migration_iteration_run
        qemu_savevm_state_iterate
            ram_save_iterate // savevm_ram_handlers->save_live_iterate
                ram_find_and_save_block
                    ram_save_host_page
                        ram_save_cgs_start_epoch
                            cgs_ram_save_start_epoch
                                cgs_mig.savevm_state_ram_start_epoch
                                    //...
                                    tdx_mig_stream_ioctl(KVM_TDX_MIG_EXPORT_TRACK)

How does QEMU know all the streams' current epochs are done and we can TDH.EXPORT.TRACK?

The question becomes: when is cgs_ram_save_start_epoch called?

Which in turn becomes: when does rs->cgs_start_epoch become true? It is set to true when find_dirty_block cannot find a dirty page in any RAMBlock.

TDH.IMPORT.TRACK

process_incoming_migration_co
qemu_loadvm_state // Notice: not just this
qemu_loadvm_state_main
qemu_loadvm_section_start_full
vmstate_load
ram_load // savevm_ram_handlers.load_state
ram_load_precopy
    cgs_mig_loadvm_state
        tdx_mig_loadvm_state // cgs_mig->loadvm_state, this function only operate on the first stream
                

TDH.IMPORT.MEM

Private TD pages and Secure EPT entries are initialized in a single operation (via TDH.IMPORT.MEM) for pages migrated using TDH.EXPORT.MEM.

See how TDH.IMPORT.MEM drives the SEPT state transitions in the migration spec:

  • Figure 9.8: Page In-Order Import Phase Partial SEPT Entry State Diagram
  • Figure 9.9: Page Out-of-Order Import Phase Partial SEPT Entry State Diagram

Process:

  • Walk the SEPT based on the GPA and find the leaf entry for the page. (Since the walk is possible, the non-leaf mappings must already exist; this should have been done by the pre-allocation when importing the immutable state.)

multifd_recv_thread // search "-incoming tcp:localhost:6666"
nocomp_recv_pages // multifd_nocomp_ops->recv_pages
nocomp_recv_private_pages
cgs_mig_multifd_recv_pages
tdx_mig_multifd_recv_pages // cgs_mig->multifd_recv_pages 
tdx_mig_stream_ioctl(KVM_TDX_MIG_IMPORT_MEM)

The parameters of this SEAMCALL are worth a careful look:

err = tdh_import_mem(kvm_tdx->tdr_pa,
             stream->mbmd.addr_and_size,
             gpa_list->info.val,
             stream->import_mem_buf_list.hpa,
             stream->mac_list[0].hpa,
             stream->mac_list[1].hpa,
             stream->td_buf_list.hpa,
             stream_info.val,
             &out);

As shown, we pass in two buffer lists: import_mem_buf_list and td_buf_list. Why two?

  • import_mem_buf_list corresponds to the MIG_BUFF_LIST parameter defined in the ABI
  • td_buf_list corresponds to the PAGE_LIST parameter defined in the ABI

PAGE_LIST: If in-place import is requested for all pages imported for the first-time, this should be set to NULL_PA (all 1’s). Otherwise, if some pages are to be imported in a non-in-place mode, this should be set to the HPA of a destination page list in shared memory. The page list allows selecting in-place or non-in-place import for each page imported for the first-time in the current import session:

  • To select in-place import, the page list entry’s INVALID bit should be set to 1 (it is possible to set the whole entry to NULL_PA).
  • To select non-in-place import, the page list entry should be set to the HPA of the page to become a new TD private page.

In short, we must distinguish the concepts of "first import" and "in-place import". An import is in-place only when it is a first import, and vice versa.

If td_buf_entries specifies a PFN, this is not a first-time import: the PFN is that of the page the TDX module currently holds, and the data is copied onto it. If no PFN is specified, it is a first-time import, and the PFN in import_mem_buf_entries is the page to be taken in.

  • Not a first import; this case is simple:
// This is where the data passed in from userspace lives; it is the copy source
import_mem_buf_entries[i].pfn = mem_buf_entries[i].pfn;
// pfn is the real PFN already in use by the TD guest; it is the copy destination
td_buf_entries[i].pfn = pfn;
// indicates this is not an in-place import
td_buf_entries[i].invalid = false;
  • A first import:
// Copy the userspace data (mem_buf_entries[i].pfn) to where pfn points.
// pfn is the PFN the TD guest will use and the TDX module will take in.
// We can copy like this because it is a first-time import: this pfn has
// not yet been handed to the TDX module.
tdx_mig_mem_buf_copy(pfn, (kvm_pfn_t)mem_buf_entries[i].pfn);
import_mem_buf_entries[i].pfn = pfn;
// indicates an in-place import
td_buf_entries[i].invalid = true;

The logic is easy to see: td_buf_entries only carries information related to in-place import, so:

  • In most cases, every entry's invalid bit should be false, meaning the PFN in the entry is meaningful and is the destination page of the import (the source page is in import_mem_buf_entries).
  • In the minority of cases, i.e. on a first import, we want an in-place import, so every td_buf_entries entry's invalid bit is set to true, meaning td_buf_entries carries no information and import_mem_buf_entries is simply imported in place.

During import, KVM updates the SEPT it maintains directly via try_cmpxchg64(sptep, &old_spte, new_spte); this does not go through handle_changed_spte().

Data flow (IMPORT):

TdxMigStream->buf_list // QEMU
stream->mem_buf_list.entries[i].pfn // the PFN of each page (mmap'ed by KVM)
import_mem_buf_entries
td_buf_entries // optional

Buffer list for in-place import (td_buf_list)

TDH.IMPORT.MEM^ above describes this in detail.

// also of type struct tdx_mig_buf_list
struct tdx_mig_buf_list td_buf_list;

First, this structure is not defined in the ABI.

Second, it is only needed on the destination side, as can be seen from tdx_mig_stream_setup():

...
	/* The lists used by the destination TD only */
	if (!is_src) {
		ret = tdx_mig_stream_buf_list_alloc(&stream->td_buf_list);
		ret = tdx_mig_stream_buf_list_alloc(&stream->import_mem_buf_list);
	}
...

If no new page list entry is provided, and a migration buffer is provided, this indicates in-place import.

Only meaningful and used in SEAMCALL TDH.IMPORT.MEM.

In-Place Import: First-time import of a page during the current import session, or following a previous import cancellation, may be done in-place;

  • the same physical pages that are provided as input are converted to TD private pages. (migration buffer)
  • Alternatively, a list of 4KB pages to be used as the destination TD new private pages may be provided.

In any case, either a migration buffer or a new page must be provided, even if the imported page is PENDING and no content is imported. Re-import of a page is always done over the TD private page that holds the previously imported version.

State transition when IMPORT

TDH.EXPORT.BLOCKW & TDH.MEM.RANGE.BLOCK

What is the difference between these two kinds of block? At first glance:

  • TDH.EXPORT.BLOCKW is a block in the context of TD live migration;
  • TDH.MEM.RANGE.BLOCK is a SEAMCALL of TDX proper.

The actions being blocked also look different:

  • TDH.EXPORT.BLOCKW blocks writes to pages, and the GFNs need not be contiguous;
  • TDH.MEM.RANGE.BLOCK blocks both reads and writes to a range of pages.

Note that BLOCKW is not triggered by a userspace call; it is triggered when an SPTE's attributes change (the attributes recorded for the SPTE, not the SPTE itself). The current implementation still blocks page by page, without batching.

Why not trigger it from userspace? Triggering it from KVM is sufficient: when issuing this SEAMCALL, KVM does not check whether the current TD is migratable; it decides whether TDH.EXPORT.BLOCKW is needed purely from the change between the was_writable and is_writable flags. Where do these flags change? Presumably when QEMU enables dirty page logging: clearing each dirty page bit puts every page into the blockw state.

Then, if the guest accesses a BLOCKW'ed page, the resulting exit must unblockw it, and the page is marked dirty. When is it re-blocked? (It is re-blocked; for the dirty-ring implementation, see the KVM_RESET_DIRTY_RINGS ioctl.)

The blockw flow:

ram_save_setup
    ram_init_all
        ram_init_bitmaps
            ram_list_init_bitmaps
                block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
            migration_bitmap_sync_precopy
                migration_bitmap_sync
                    ramblock_sync_dirty_bitmap
                        cpu_physical_memory_sync_dirty_bitmap
                            if (rb->clear_bmap)
                                // Postpone the dirty bitmap clear to the point before we really send the pages...
                                // Here the bits for all the pages are set.
                                clear_bmap_set(rb, start >> TARGET_PAGE_BITS, length >> TARGET_PAGE_BITS);
                            else 
                                /* Slow path - still do that in a huge chunk */
                                memory_region_clear_dirty_bitmap(rb->mr, start, length);
ram_save_host_page
    migration_bitmap_clear_dirty
        migration_clear_memory_region_dirty_bitmap
            // if there is no clear_bmap, go no further
            if (!rb->clear_bmap || !clear_bmap_test_and_clear(rb, page))
                return;
            memory_region_clear_dirty_bitmap
                kvm_log_clear
                    kvm_physical_log_clear
                        kvm_log_clear_one_slot
                            kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d)
                                kvm_vm_ioctl_clear_dirty_log
                                    kvm_clear_dirty_log_protect
                                        kvm_arch_mmu_enable_log_dirty_pt_masked
                                            // clear MMU Dirty bit for PT level pages, or write
                                            // protect the page if the Dirty bit isn't supported.
                                            kvm_mmu_write_protect_pt_masked
                                                kvm_tdp_mmu_clear_dirty_pt_masked
                                                    clear_dirty_pt_masked
                                                        tdx_write_block_private_pages
                                                            // where the SEAMCALL is issued
                                                            tdh_export_blockw
                                        __rmap_clear_dirty
                                            spte_wrprot_for_clear_dirty
                                                // clear the writable bit so that writes can be blocked
                                                bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
                                                // If it was writable before, we must assume the worst: the
                                                // page may have been modified, so mark it dirty.
                                                // If it was not writable, nothing could have changed it between
                                                // the two blocks, so it should not be dirty.
                                                if (was_writable && !spte_ad_enabled(*sptep))
                                            		kvm_set_pfn_dirty(spte_to_pfn(*sptep));

                                                

        // clear this bit in the dirty bitmap
        ret = test_and_clear_bit(page, rb->bmap);


// 1 path (not all) to reblockw the TDX pages
kvm_dirty_ring_reap_locked
    case KVM_RESET_DIRTY_RINGS:
        kvm_vm_ioctl_reset_dirty_pages
            kvm_dirty_ring_reset
                kvm_reset_dirty_gfn
                    kvm_arch_mmu_enable_log_dirty_pt_masked
                        kvm_mmu_clear_dirty_pt_masked / kvm_mmu_write_protect_pt_masked
                            kvm_tdp_mmu_clear_dirty_pt_masked
                                clear_dirty_pt_masked
                                    static_call(kvm_x86_write_block_private_pages)

The unblockw flow (fast page fault handling):

fast_pf_fix_direct_spte
    kvm_write_unblock_private_page
        tdx_write_unblock_private_page
            tdh_export_unblockw
    // after unblockw, mark the page dirty
    mark_page_dirty_in_slot(vcpu->kvm, fault->slot, gfn);

What does TDH.MEM.RANGE.BLOCK block? It takes a starting GPA (indicating where the block begins; the low bits of the GPA must be suitably aligned) and a size (indicating how much to block: 4KB, 2MB, 1GB, 512GB, or 256TB).

So the main inputs of this SEAMCALL are a Level and a GPA.




TDH.MEM.RANGE.BLOCK / TDH.MEM.RANGE.UNBLOCK

It can be undone via TDH.MEM.RANGE.UNBLOCK. TDH.MEM.RANGE.BLOCK is a required precondition for the following SEAMCALLs:

  • TDH.MEM.TRACK: because a TLB flush is needed, reads and writes to the given range are blocked for the duration of the flush, so the TLB entries cannot become valid again.
  • TDH.MEM.RANGE.UNBLOCK: self-explanatory.
// Before static_call(kvm_x86_remove_private_spte),
// before static_call(kvm_x86_merge_private_spt),
// and before static_call(kvm_x86_split_private_spt),
// static_call(kvm_x86_zap_private_spte) must be called to range-block.
// Besides those, there is one more function,
// which looks somewhat related to rmap:
kvm_mmu_zap_private_spte
    __kvm_mmu_zap_private_spte
        static_call(kvm_x86_zap_private_spte)
            tdx_sept_zap_private_spte
                tdh_mem_range_block

// When freeing a page table, the corresponding PT page (note: not a TD page)
// needs to be REMOVEd, so the whole GFN range must be blocked from reads and writes.
tdx_sept_free_private_spt
    tdh_mem_range_block
    tdh_mem_sept_remove

tdx_write_block_private_pages() KVM TDX

TD live migration data process flow

On the destination side, tdx_mig_multifd_recv_pages() sets up iovs to receive the pages. QEMU sends and receives the data in this order:

  • mbmd (1 page)
  • buf list (N pages, the actual page data)
  • gpa list (1 page)
  • mac list (2 pages)

These are kept in QEMU's TdxMigStream structure:

typedef struct TdxMigStream {
    int fd;
    void *mbmd;
    void *buf_list;
    void *mac_list;
    void *gpa_list;
} TdxMigStream;

QEMU then issues the KVM_TDX_MIG_IMPORT_MEM ioctl:

tdx_mig_stream_ioctl(stream, KVM_TDX_MIG_IMPORT_MEM, 0, &gfn_num);
static struct kvm_device_ops kvm_tdx_mig_stream_ops = {
	.name = "kvm-tdx-mig",
	.get_attr = tdx_mig_stream_get_attr,
	.set_attr = tdx_mig_stream_set_attr,
	.mmap = tdx_mig_stream_mmap,
	.ioctl = tdx_mig_stream_ioctl,
	.create = tdx_mig_stream_create,
	.release = tdx_mig_stream_release,
};


tdx_mig_stream_setup
    tdx_mig_do_stream_setup
        tdx_mig_stream_create
            kvm_create_device(kvm_state, KVM_DEV_TYPE_TDX_MIG_STREAM, false)
                // .create (tdx_mig_stream_create)
            kvm_device_ioctl(stream->fd, KVM_SET_DEVICE_ATTR, &attr)
                // .set_attr (tdx_mig_stream_set_attr)
            mmap()
                // .mmap (tdx_mig_stream_mmap)

How is the stream handed to KVM? First, QEMU creates the mig stream, which lands in the KVM device's .create callback. That callback mainly calls TDH.MIG.STREAM.CREATE to create a MigSC.

Then it sets the attribute, passing buf_list_pages (the number of pages) as early setup, which lands in the KVM device's .set_attr callback. That callback:

  • records the stream's buf_list_pages in KVM;
  • allocates the space (pages) for mbmd, mem_buf_list, page_list, gpa_list, mac_list, and so on.

Finally, mmap is called so that QEMU and KVM share the memory for mbmd, gpa_list, mac_list, buf_list, etc. (all already allocated in KVM). For more detail, see tdx_mig_do_stream_setup in QEMU.

The KVM device's mmap handler, tdx_mig_stream_mmap, is worth a closer look:

tdx_mig_stream_mmap, tdx_mig_stream_ops, tdx_mig_stream_fault

static int tdx_mig_stream_mmap(struct kvm_device *dev, struct vm_area_struct *vma)
{
	vma->vm_ops = &tdx_mig_stream_ops;
	return 0;
}

static const struct vm_operations_struct tdx_mig_stream_ops = {
	.fault = tdx_mig_stream_fault,
};

// mmap has already shared these buffers between QEMU and KVM, but
// when a page fault happens we still need to find the backing page.
static vm_fault_t tdx_mig_stream_fault(struct vm_fault *vmf)
{
	struct kvm_device *dev = vmf->vma->vm_file->private_data;
	struct tdx_mig_stream *stream = dev->private;
	struct page *page;
	kvm_pfn_t pfn;
	uint32_t i;

    // Because the order of the buffers was fixed during setup in .set_attr
    // (tdx_mig_stream_set_attr), we can locate the corresponding page by
    // the pgoff and install it here.
	if (vmf->pgoff == TDX_MIG_STREAM_MBMD_MAP_OFFSET) {
		page = virt_to_page(stream->mbmd.data);
	} else if (vmf->pgoff == TDX_MIG_STREAM_GPA_LIST_MAP_OFFSET) {
		page = virt_to_page(stream->gpa_list.entries);
	} else if (vmf->pgoff == TDX_MIG_STREAM_MAC_LIST_MAP_OFFSET ||
		   vmf->pgoff == TDX_MIG_STREAM_MAC_LIST_MAP_OFFSET + 1) {
		i = vmf->pgoff - TDX_MIG_STREAM_MAC_LIST_MAP_OFFSET;
		if (stream->mac_list[i].entries) {
			page = virt_to_page(stream->mac_list[i].entries);
		} else {
			pr_err("%s: mac list page %d not allocated\n",
				__func__, i);
			return VM_FAULT_SIGBUS;
		}
    // For example, for mem_buf_list we never expose the PFNs to QEMU;
    // instead, QEMU's accesses during import/export trigger page faults,
    // KVM resolves the PFN and finds the backing page, and on return QEMU
    // assumes the page data is in place. The page's PFN is kept in
    // stream->mem_buf_list.entries[i].pfn and never changes.
	} else if (tdx_mig_stream_in_mig_buf_list(vmf->pgoff, stream->buf_list_pages)) {
		i = vmf->pgoff - TDX_MIG_STREAM_BUF_LIST_MAP_OFFSET;
		pfn = stream->mem_buf_list.entries[i].pfn;
		page = pfn_to_page(pfn);
	} else {
		pr_err("%s: VM_FAULT_SIGBUS\n", __func__);
		return VM_FAULT_SIGBUS;
	}

	get_page(page);
	vmf->page = page;
	return 0;
}

At this point, the shared members of the stream structs in QEMU and KVM correspond as follows:

  • TdxMigStream->mbmd corresponds to tdx_mig_stream->mbmd.data
  • TdxMigStream->gpa_list corresponds to tdx_mig_stream->gpa_list.entries
  • TdxMigStream->mac_list corresponds to tdx_mig_stream->mac_list[i].entries (mac_list occupies two pages, so i selects the page); in other words, to tdx_mig_stream->mac_list
  • TdxMigStream->buf_list corresponds to tdx_mig_stream->mem_buf_list.entries[i].pfn (buf_list spans multiple pages, so i again selects the page); in other words, to tdx_mig_stream->mem_buf_list.entries

The other stream members in KVM, such as gfns and sptes, are private rather than shared with QEMU; they are kept in the struct merely for convenience.

TD live migration across different product family

The TDX modules on the two sides may support different feature sets, which must be kept in sync; for example, the no-rbp mode.

// pin the CPU model
-cpu SapphireRapids
// pick the lowest machine type
-machine pc-q35-7.2

TD live migration across different GNR steppings

GNR A0 has some vulnerabilities that are only fixed in GNR B0. The current GNR CPU model is written for GNR B0, so a warning is reported on A0; to migrate, the GraniteRapids CPU model should be used as follows:

-cpu GraniteRapids,-sbdr-ssdp-no,-fbsdp-no,-psdp-no,-pbrsb-no,-mcdt-no

Pre-migration in TD Live Migration

Pre-binding / TDH.SERVTD.PREBIND

If the VMM chooses to launch the MigTD only when it is required (possibly after launching the target TD), the VMM can prebind the target TD. The VMM shall compute the MigTD's TDINFO_STRUCT and then the SERVTD_INFO_HASH; the SERVTD_INFO_HASH is passed as a parameter during prebinding.

Must pre-bind before the target TD measurement is finalized.

Binding / TDH.SERVTD.BIND

TD measurement is extended for the MigTD bound to the TD being migrated.

This SEAMCALL binds the user TD to the service TD using the parameters passed in:

  • RCX: user TD's TDR page;
  • RDX: service TD's TDR page.

MigTD policy

MigTD Spec 5.2 MigTD Migration Policy

There are three types of policy:

  • TDX Module Policy
  • MigTD Default Policy
  • MigTD Extension Policy (policies added by the CSP itself)

An example policy: https://github.com/intel/MigTD/blob/main/config/policy.json

Pre-migration status error code

GHCI 3.13.2.3 TDG.VP.VMCALL

  • 0: SUCCESS
  • 1: INVALID_PARAMETER
  • 2: UNSUPPORTED
  • 3: OUT_OF_RESOURCE
  • 4: TDX_MODULE_ERROR
  • 5: NETWORK_ERROR
  • 6: SECURE_SESSION_ERROR
  • 7: MUTUAL_ATTESTATION_ERROR
  • 8: MIGPOLICY_ERROR
  • 0xFF: MIGTD_INTERNAL_ERROR
  • 0x0A~0xFE: Reserved

What will happen when pre-migration is triggered?

First, the script we use to trigger the migration looks like this:

SRC_MIGTD_PID=$(pgrep migtd-src)

# Bind the source migtd to the source user TD
echo "qom-set /objects/tdx0/ migtd-pid ${SRC_MIGTD_PID}" | nc -U /tmp/qmp-sock-src

# Asking migtd-src to connect to the src socat
echo "qom-set /objects/tdx0/ vsockport 1234" | nc -U /tmp/qmp-sock-src
/******* Bind phase, corresponding to: echo "qom-set /objects/tdx0/ migtd-pid ${SRC_MIGTD_PID}" | nc -U /tmp/qmp-sock-src *******/
// Ask to bind, i.e. the command:
// echo "qom-set /objects/tdx0/ migtd-pid ${SRC_MIGTD_PID}" | nc -U /tmp/qmp-sock-src
tdx_migtd_set_pid
    tdx_binding_with_migtd_pid
        // ioctl to bind
    	case KVM_TDX_SERVTD_BIND:
            tdx_servtd_bind
                tdx_servtd_do_bind
                    // inputs include both the servtd's TDR and the user TD's TDR, binding the two together
                    tdh_servtd_bind
                // update the servtd's usertd_binding_slots
                tdx_servtd_add_binding_slot

/******* Pre-migration phase, corresponding to: echo "qom-set /objects/tdx0/ vsockport 1234" | nc -U /tmp/qmp-sock-src ********/
// target TD process
tdx_migtd_set_vsockport
    info.is_src = !runstate_check(RUN_STATE_INMIGRATE);
    // tells KVM whether this is the source or the destination side
    tdx_vm_ioctl(KVM_TDX_SET_MIGRATION_INFO, 0, &info);
        // KVM
        tdx_set_migration_info
            // kick the servtd
            tdx_notify_servtd

// Service TD
// First issue a TDCALL WaitForRequest (not SEAMCALL)
// This TDCALL asks the VMM whether there is a new pre-migration request (TDVMCALL_SERVICE_MIGTD_CMD_WAIT)
handle_tdvmcall
    case TDG_VP_VMCALL_SERVICE:
        tdx_handle_service
            case TDVMCALL_SERVICE_ID_MIGTD:
                need_block = tdx_handle_service_migtd(tdx, cmd_buf, resp_buf);
                    case TDVMCALL_SERVICE_MIGTD_CMD_WAIT:
                        migtd_wait_for_request
                            migtd_start_migration
                                // check its own usertd_binding_slots: a non-empty entry means
                                // a target TD is waiting for it to perform pre-migration
                            	resp_migtd->operation = TDVMCALL_SERVICE_MIGTD_OP_START_MIG;
                            	return len;

// The MigTD now knows pre-migration can start, so it performs it.
// Once done successfully, it reports the result to the VMM (TDVMCALL_SERVICE_MIGTD_CMD_REPORT)
handle_tdvmcall
    case TDG_VP_VMCALL_SERVICE:
        tdx_handle_service
            case TDVMCALL_SERVICE_ID_MIGTD:
                need_block = tdx_handle_service_migtd(tdx, cmd_buf, resp_buf);
                    case TDVMCALL_SERVICE_MIGTD_CMD_REPORT:
                        migtd_report_status
                            case TDVMCALL_SERVICE_MIGTD_OP_START_MIG:
                                migtd_report_status_for_start_mig
                                    tdx_binding_slot_set_state

Binding slot / Binding table

First, a few concepts:

  • Binding table: an array the TDX module maintains in the target TD's TDCS, recording which service TDs this TD is bound to.
  • Binding slot: an entry of that array.
  • binding_slots: the structure KVM uses to mirror the binding table.
  • usertd_binding_slots: the structure KVM uses to record which user TDs are bound to a given service TD.

A binding slot in the table contains the following fields:

  • SERVTD_BINDING_STATE
  • SERVTD_INFO_HASH
  • SERVTD_TYPE
  • SERVTD_ATTR
  • SERVTD_UUID

slot_idx is passed as a parameter to TDH.SERVTD.PREBIND to indicate which slot is used for this service TD. Both binding (TDH.SERVTD.BIND) and pre-binding (TDH.SERVTD.PREBIND) take this parameter. So a service TD's position in the target TD's binding table is not assigned automatically by the TDX module; the caller must choose it.

Note that the type is also an input parameter, indicating the type of service TD being bound; in principle it could be unrelated to slot_idx. However, the current design allows a user TD to be bound to only one type of service TD, so the service TD type number is reused as the index into the binding slots maintained in the TDX module.

The KVM code also maintains a binding_slots array mirroring the state of that array in the TDX module. In addition, a custom array, usertd_binding_slots, valid only when this TD is a MigTD, tracks all the TDs bound to this MigTD.

struct kvm_tdx {
    //...
    // Used when this is a user TD. A user TD can be bound to multiple
    // service TDs, so the array length is KVM_TDX_SERVTD_TYPE_MAX
	struct tdx_binding_slot binding_slots[KVM_TDX_SERVTD_TYPE_MAX];

    // Used when this is a service TD. A servtd can be bound to multiple
    // TDs, so each entry corresponds to one user TD; that is, each entry
    // is a pointer to the binding_slot (for this service TD) held by
    // that user TD.
    struct tdx_binding_slot *usertd_binding_slots[SERVTD_SLOTS_MAX];
    //...
}

struct tdx_binding_slot KVM

struct tdx_binding_slot {
	enum tdx_binding_slot_state state;
	/* Identify the user TD and the binding slot */
	uint64_t handle;
	/* UUID of the user TD */
	uint8_t  uuid[32];
	// this user TD's position (index) in the bound service TD's array
	uint16_t req_id;
	/* The servtd that the slot is bound to */
	struct kvm_tdx *servtd_tdx;
	/*
	 * Data specific to MigTD.
	 * Futher type specific data can be added with union.
	 */
	struct tdx_binding_slot_migtd migtd_data;
};

Binding slot cleanup tdx_binding_slots_cleanup() KVM

static void tdx_binding_slots_cleanup(struct kvm_tdx *kvm_tdx)
{
	struct tdx_binding_slot *slot;
	struct kvm_tdx *servtd_tdx;
	uint16_t req_id;
	int i;

    // For a user TD there is nothing special: just NULL out its own
    // entry in the service TD's array. Since this TD is being torn
    // down anyway, both arrays it maintains go away with it.
	for (i = 0; i < KVM_TDX_SERVTD_TYPE_MAX; i++) {
		slot = &kvm_tdx->binding_slots[i];
		servtd_tdx = slot->servtd_tdx;
		if (!servtd_tdx)
			continue;
		req_id = slot->req_id;
        //...
		servtd_tdx->usertd_binding_slots[req_id] = NULL;
	}

    // For a service TD, we must not NULL out our entry in the bound user
    // TD's array: shutting down the MigTD after pre-migration completes
    // is a normal operation, and it must not affect the user TD's view
    // of whether pre-migration has finished.
	for (i = 0; i < SERVTD_SLOTS_MAX; i++) {
		slot = kvm_tdx->usertd_binding_slots[i];
		if (!slot)
			continue;
		slot->servtd_tdx = NULL;
	}
}

Binding slot state transition

This is a concept the KVM code introduces to track binding state:

Note that a bind is still needed after a prebind. Prebind happens at TD finalize time. A bind without a prior prebind also happens at TD finalize time. A bind that follows a prebind happens when our script triggers pre-migration.