TDX Misc

There are two SEPTs: one that KVM maintains for its own state tracking, and the SEPT inside the TDX module, which is kept confidential from the outside and can only be set up and torn down through interface functions (the TD-related callbacks in x86_ops).

pc.bios is not private memory. TDX requires OVMF to act as private memory. These two statements seem to conflict.

TDX does not yet support swapping guest memory out.

Currently TDX only supports huge pages up to 2M.

A correctly written TD should either not use UC locks or split locks, or be ready to handle any #AC and #GP(0) faults raised when such locks are used. When userspace in a TD guest runs a program that can trigger a split lock, the split lock does not cause a VM exit. But if the host kernel runs with sld=off, it does not configure the MSR, so the hardware never raises #AC either. To keep the TDX guest from handling #AC at all, check whether we are a TDX guest and, if so, ignore it; that way the TD guest produces no warning regardless of the host's sld setting.
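A minimal sketch of that guest-side check (assuming the handler has the shape of the kernel's split-lock #AC path; the exact patch may differ):

/* Sketch: a TD never receives a host-configured #AC for split locks,
 * so skip the handling/warning entirely when running as a TDX guest. */
static bool handle_ac_split_lock(unsigned long ip)
{
    if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
        return true;    /* ignore: no warning regardless of host sld */
    //...
}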

A normal, non-debug TDX module is not supposed to set up its own IDT, so any exception inside the TDX module is fatal. When the TDX module and the kernel context-switch, the corresponding exception handlers switch too: the handler is determined by what the IDT is set to, and the TDX module's differs from the kernel's. The current debug build adds some logging to the exception handlers.

There are two ways for the guest to ACCEPT private memory and propagate the private/shared bitmap information to the VMM:

  • Directly TDG.MEM.PAGE.ACCEPT a range of memory. If that range has not been AUGed/ADDed by the VMM yet, this triggers an EPT violation and exits to the VMM, which can record the private/shared bitmap this way. This is the approach TDVF takes.
  • Call TDVMCALL MapGPA to make the request to the VMM. KVM exits to QEMU to notify it; QEMU sets cgs_bmap (see ram_block_convert_range()) and issues SET_ATTRIBUTE down to KVM, returning success after the VMM has AUGed the pages, so the guest can then ACCEPT them. This is the approach the guest kernel takes. (Note: if the MapGPA is private->shared, no subsequent ACCEPT is needed.) See the guest kernel function tdx_enc_status_changed(); a sketch of this flow follows below.
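A rough sketch of the guest-kernel side of the second flow (function names taken from elsewhere in this document; treat the exact shape as an assumption):

/* Sketch: shared->private conversion as done by the TD guest kernel. */
static bool tdx_convert_to_private(phys_addr_t start, phys_addr_t len)
{
    /* 1. TDVMCALL MapGPA: ask the VMM to AUG the range.  The shared bit
     *    being clear in `start` requests a private mapping. */
    if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, len, 0, 0))
        return false;

    /* 2. The VMM returned success, so the pages are AUGed (pending);
     *    TDG.MEM.PAGE.ACCEPT them to make them usable. */
    tdx_accept_memory(start, start + len);
    return true;
}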

The xarray in the kernel records whether, from the kernel's point of view, a page is supposed to be private or shared; cgs_bmap in QEMU records the same thing.

In TDX, the TSC multiplier can't be changed.

TDX is supported using the legacy MMU too, not just the TDP MMU; see commit dbbdd4e8ab6f370fe8c232006a1433c1d947c718.

TDX is NOT enabled for all SPR CPUs; some steppings, such as E3, are not supported.

TDX doesn't support APICV.

TDX is PV (software within the guest TD can use the TDCALL(TDG.MR.REPORT) function to request the Intel TDX module to generate an integrity-protected TDREPORT structure), so a TD OS is considered enlightened if it is aware that it is running as a TD.

The TDX module code is open source; see "Intel® Trust Domain Extensions".

The TDX module can be loaded in 2 ways: one from the BIOS, and one from the OS via GETSEC.

Secure EPT is intended to be managed indirectly by the host VMM using Intel TDX functions.

Because TDVPS includes the VMCS, and the TDX module also uses VMLAUNCH/VMRESUME to start a VM:

  • on each TD exit, it only needs to save non-VMCS CPU state into TDVPS, and,
  • on each TD entry, it only needs to restore non-VMCS CPU state from TDVPS.

To confirm it is a TD from inside the guest, just run lscpu | grep tdx_guest.

TDX doesn't trust the BIOS.

The TDX module is expected to be loaded by the BIOS when it enables TDX. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX.

TDX requires x2APIC. The TDX guest only supports x2APIC.

TEE-IO Provisioning Agent: TPA.

cgs: Confidential Guest Support.

MapGPA call points in TD guest kernel / TDVMCALL_MAP_GPA

TDVMCALL_MAP_GPA is called from the following two places:

platform_device_add / pci_device_add / acpi_device_add
    arch_dev_authorized
        authorized_node_match
            tdx_guest_dev_attest
                tdxio_devif_accept
                    tdxio_devif_accept_mmios
                        tdx_map_private_mmio
                            __tdx_map_gpa
                                _tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
set_memory_encrypted / set_memory_decrypted / set_memory_decrypted_noflush
    __set_memory_enc_dec
        __set_memory_enc_pgtable
            x86_platform.guest.enc_status_change_prepare / x86_platform.guest.enc_status_change_finish
                tdx_enc_status_change_prepare / tdx_enc_status_change_finish
                    tdx_enc_status_changed
                        tdx_map_gpa
                            .r11 = TDVMCALL_MAP_GPA,

Page's refcount during TDX lifetime

TDX:

On a page fault, the page fault handler calls kvm_gmem_get_pfn(), which leaves the refcount initialized to 3 (alloc 1, page cache 1, lru 1). After a while the lru reference is removed asynchronously, dropping it by one, so we can treat the starting value as 2.

tdx_unpin() drops it by one more, releasing the page.
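A sketch of what tdx_unpin() boils down to (an assumption inferred from the snippets later in this document, which pass gfn and level as well):

static void tdx_unpin(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                      enum pg_level level)
{
    int i;

    /* Drop the reference taken when the page was mapped private, so the
     * refcount can fall and gmem can actually free the page(s). */
    for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
        put_page(pfn_to_page(pfn + i));
}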

TDX Live Migration:

Page fault in TDX

Guest memory accesses are not all alike: there are private accesses and shared accesses.

Based on the value of a new SHARED bit in the Guest Physical Address (GPA).

Guest Physical Address (GPA) space is divided into private and shared sub-spaces, determined by the SHARED bit of GPA.

A page's mapping falls into one of the following cases:

               Shared EPT   SEPT
  Not mapped   -            -
  Private      -            Yes
  Shared       Yes          -

Three different concepts need to be distinguished:

  • Whether the access is private or shared: look at gfn_shared_mask (the shared bit of the faulting GPA)
  • Whether the page actually is private or shared: look at whether the SEPT mapping has been established
  • Whether the page is supposed to be private or shared: look at mem_attr_array

So the possible causes of a page fault are:

  • The TD guest's access doesn't match the page's actual state, e.g. the TD guest makes a shared access but the page is actually private.
  • The page's mapping simply hasn't been established yet.

The handling of these two cases is a little convoluted, but both start by comparing how the TD guest accessed the page (i.e. whether the page fault is private) with what is stored in the xarray (whether the page is supposed to be private or shared):

  • If they match, we have reason to believe the fault was caused by the page's mapping not having been established yet; there is no need to tell userspace, we simply grab the page and establish the mapping ourselves.
  • If they differ, the guest accessing the page this way means it wants the page converted, so we exit to userspace to handle it. A sketch of this comparison follows the list.
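A sketch of that comparison in the fault path (kvm_is_private_gpa() and kvm_mem_is_private() appear elsewhere in this document; the other two helper names are illustrative):

/* Sketch: decide between "just map it" and "exit to userspace". */
bool fault_private = kvm_is_private_gpa(kvm, fault->addr); /* how the guest accessed */
bool attr_private  = kvm_mem_is_private(kvm, fault->gfn);  /* what the xarray says */

if (fault_private == attr_private)
    return kvm_faultin_pfn(vcpu, fault);   /* missing mapping: map it in KVM */
/* mismatch: the guest wants a conversion; exit to userspace (QEMU) */
return kvm_mem_fault_exit_to_user(vcpu, fault);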

tdx_track() KVM

This amounts to TLB flushing: on each entry, if a request is found pending, tdx_track() is called to flush.

vcpu_enter_guest
    if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
        kvm_vcpu_flush_tlb_all(vcpu);
            vt_flush_tlb_all
                tdx_flush_tlb
    kvm_service_local_tlb_flush_requests
            if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
        		kvm_vcpu_flush_tlb_current(vcpu);
                    vt_flush_tlb_current
                        tdx_flush_tlb_current
                            tdx_track(vcpu->kvm);


Its main purpose is to issue the SEAMCALL TDH.MEM.TRACK.

/*
 * TLB shoot down procedure:
 * There is a global epoch counter and each vcpu has local epoch counter.
 * - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
 *   This blocks the subsequent creation of TLB translations on that range.
 *   This corresponds to clearing the present bit (all RWX) in the EPT entry.
 * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
 * - IPI to remote vcpus
 * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
 * - On re-entry, TDX module compares the local epoch counter with the global
 *   epoch counter.  If the local epoch counter is older than the global epoch
 *   counter, update the local epoch counter and flushes TLB.
 */
static void tdx_track(struct kvm *kvm)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct kvm_vcpu *vcpu;
	unsigned long i;
	u64 err;

	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
	/* If TD isn't finalized, it's before any vcpu running. */
	if (unlikely(!is_td_finalized(kvm_tdx)))
		return;

	/*
	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
	 * the counter.  The counter is used instead of bool because multiple
	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
	 */
	atomic_inc(&kvm_tdx->doing_track);

	while (atomic_cmpxchg(&kvm_tdx->tdh_mem_track, 0, 1)) {
		cpu_relax();
	}

	smp_store_release(&kvm_tdx->has_range_blocked, false);

	/*
	 * Don't wait for other vcpus with the empty IPI handler.  Instead,
	 * Synchronize after tdh_mem_track() to reduce synchronization time.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH & ~KVM_REQUEST_WAIT);

	/*
	 * kvm_flush_remote_tlbs() doesn't allow to return error and
	 * retry.
	 */
	err = tdh_mem_track(kvm_tdx->tdr_pa);
	if (!err)
		tdx_tdi_iq_inv_iotlb(kvm_tdx);

	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
	atomic_set(&kvm_tdx->tdh_mem_track, 0);

	/*
	 * Avoid TDX_TLB_TRACKING_NOT_DONE on the following Secure-EPT operation
	 * by waiting here for all other vcpus to go through TDExit once or not
	 * running TD guest.  The alternative is loop on
	 * TDX_TLB_TRACKING_NOT_DONE with Secure-EPT operation.  But if we hit
	 * problem with tlb shoot down, debug will be very difficult.  So we
	 * don't choose the loop option.
	 */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		int mode;

		/* If vcpu == current vcpu, vcpu->mode == OUTSIDE_GUEST_MODE */
		mode = smp_load_acquire(&vcpu->mode);
		while ((mode == IN_GUEST_MODE || mode == EXITING_GUEST_MODE) &&
		       kvm_test_request(KVM_REQ_TLB_FLUSH, vcpu)) {
			cpu_relax();
			mode = smp_load_acquire(&vcpu->mode);
		}
	}
	atomic_dec(&kvm_tdx->doing_track);

	if (KVM_BUG_ON(err, kvm))
		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
}

tdx_td_init() KVM

static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params, u64 *seamcall_err, bool post_init)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_module_args out;
	cpumask_var_t packages;
	unsigned long *tdcs_pa = NULL;
	unsigned long tdr_pa = 0;
	unsigned long va;
	int ret, i;
	u64 err;

    //...
    // alloc HKID
	ret = tdx_guest_keyid_alloc();
    //...
	kvm_tdx->hkid = ret;

    // cgroup accounting
	kvm_tdx->misc_cg = get_current_misc_cg();
	ret = misc_cg_try_charge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1);
    //...

    // allocate the TDR page
	va = __get_free_page(GFP_KERNEL_ACCOUNT);
    //...
	tdr_pa = __pa(va);

    // allocate the TDCS array
	tdcs_pa = kcalloc(tdx_info.nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
    //...
    // allocate the page each TDCS array entry points to
	for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
		va = __get_free_page(GFP_KERNEL_ACCOUNT);
        //...
		tdcs_pa[i] = __pa(va);
	}

    // allocate a cpumask, stored in packages
	zalloc_cpumask_var(&packages, GFP_KERNEL)
    //...

    // check if tdx module is available or not

	// Need at least one CPU of the package to be online in order to
	// program all packages for host key id.  Check it.
	for_each_present_cpu(i)
		cpumask_set_cpu(topology_physical_package_id(i), packages);
	for_each_online_cpu(i)
		cpumask_clear_cpu(topology_physical_package_id(i), packages);
    //...

	/*
	 * Acquire global lock to avoid TDX_OPERAND_BUSY:
	 * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
	 * Table (KOT) to track the assigned TDX private HKID.  It doesn't spin
	 * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
	 * caller to handle the contention.  This is because of time limitation
	 * usable inside the TDX module and OS/VMM knows better about process
	 * scheduling.
	 *
	 * APIs to acquire the lock of KOT:
	 * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
	 * TDH.PHYMEM.CACHE.WB.
	 */
	mutex_lock(&tdx_lock);
	err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
	mutex_unlock(&tdx_lock);
    // error checking...
	kvm_tdx->tdr_pa = tdr_pa;
	tdx_account_ctl_page(kvm);

	for_each_online_cpu(i) {
		int pkg = topology_physical_package_id(i);

		if (cpumask_test_and_set_cpu(pkg, packages))
			continue;

		/*
		 * Program the memory controller in the package with an
		 * encryption key associated to a TDX private host key id
		 * assigned to this TDR.  Concurrent operations on same memory
		 * controller results in TDX_OPERAND_BUSY.  Avoid this race by
		 * mutex.
		 */
		mutex_lock(&tdx_mng_key_config_lock[pkg]);
		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
				      &kvm_tdx->tdr_pa, true);
		mutex_unlock(&tdx_mng_key_config_lock[pkg]);
		if (ret)
			break;
	}
	if (ret)
		atomic_dec(&nr_configured_hkid);
	cpus_read_unlock();
	free_cpumask_var(packages);
    // error checking...

	kvm_tdx->tdcs_pa = tdcs_pa;
	for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
		err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
        // error checking...
		tdx_account_ctl_page(kvm);
	}

	if (!post_init) {
		err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
        // error handling...
		tdx_td_post_init(kvm_tdx);
	}

	kvm_tdx->attributes = td_params->attributes;
	kvm_tdx->xfam = td_params->xfam;
	kvm_tdx->eptp_controls = td_params->eptp_controls;

	if (td_params->attributes & TDX_TD_ATTRIBUTE_MIG)
		tdx_mig_state_create(to_kvm_tdx(kvm));
    // error handling...

	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
	else
		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);

	return 0;
// a bunch of error handling tags...
}

tdx_map_gpa() KVM

When KVM receives a MapGPA request, there are two ways to handle it:

  • (Common) If a memory slot covering the GPA can be found, exit to userspace to notify it; userspace will then issue the SET_ATTRIBUTE ioctl.
  • If not, handle it in KVM itself.
static int tdx_map_gpa(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	gpa_t gpa = tdvmcall_a0_read(vcpu);
	gpa_t size = tdvmcall_a1_read(vcpu);
	gpa_t end = gpa + size;
    // s == start
	gfn_t s = gpa_to_gfn(gpa) & ~kvm_gfn_shared_mask(kvm);
    // e == end
	gfn_t e = gpa_to_gfn(end) & ~kvm_gfn_shared_mask(kvm);
    // Check whether the GPA the TD guest passed in has the shared bit set,
    // indicating whether the guest wants it mapped private or shared.
	bool map_private = kvm_is_private_gpa(kvm, gpa);
	int ret;
	int i;

    // sanity checks...
    //...
    // there are only a couple of address spaces anyway
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
		struct kvm_memslot_iter iter;

		kvm_for_each_memslot_in_gfn_range(&iter, slots, s, e) {
			struct kvm_memory_slot *slot = iter.slot;
			gfn_t slot_s = slot->base_gfn;
			gfn_t slot_e = slot->base_gfn + slot->npages;

			// our range doesn't overlap this slot's range at all,
			// so continue
			if (e < slot_s || s >= slot_e)
				continue;

			// our range is fully contained in this slot
			if (slot_s <= s && e <= slot_e) {
                // if this slot's flags allow it to be private
				if (kvm_slot_can_be_private(slot))
                    // let userspace (QEMU) handle it;
                    // userspace will call KVM_SET_MEMORY_ATTRIBUTES
					return tdx_vp_vmcall_to_user(vcpu);
				continue;
			}
			break;
		}
	}

    // No overlapping slot that qualifies was found; only the kernel can handle this
    // case, without notifying userspace. Perhaps the MMIO case? Probably uncommon.
	ret = kvm_mmu_map_private(vcpu, &s, e, map_private);
    //error handling...
}

kvm_mmu_map_private() / __kvm_mmu_map_private() KVM

int kvm_mmu_map_private(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end, bool map_private)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
    // checking...
	return __kvm_mmu_map_private(vcpu->kvm, startp, end, map_private);
}

static int __kvm_mmu_map_private(struct kvm *kvm, gfn_t *startp, gfn_t end, bool map_private)
{
	gfn_t start = *startp;
	u64 attrs;
	int ret;

    //...
	attrs = map_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0;
	start = start & ~kvm_gfn_shared_mask(kvm);
	end = end & ~kvm_gfn_shared_mask(kvm);

    //...
	kvm_vm_reserve_mem_attr_array(kvm, start, end);
    //...
	kvm_mmu_invalidate_begin(kvm);
	kvm_mmu_invalidate_range_add(kvm, start, end);

	if (is_tdp_mmu_enabled(kvm)) {
        //...
		ret = kvm_tdp_mmu_map_private(kvm, start, end, map_private);
        // set the xarray to the desired attribute, whether shared or private
		kvm_vm_set_memory_attributes(kvm, attrs, start, end);
        //...
	} else {
		gfn_t gfn;

		for (gfn = start; gfn < end; gfn++) {
			/* mmu_map_private() handles only 1 gfn. */
			ret = mmu_map_private(kvm, gfn, map_private);
			if (ret) {
				if (gfn > start) {
					ret = -EAGAIN;
					start = gfn;
				}
				break;
			}

			KVM_BUG_ON(kvm_vm_set_memory_attributes(kvm, attrs, gfn, gfn + 1), kvm);
			if (need_resched()) {
				ret = -EAGAIN;
				start = gfn + 1;
				break;
			}
		}
	}

	kvm_mmu_invalidate_end(kvm);
    //...
}

TDX MCE handling

Registration:

tdx_hardware_setup
    mce_register_decode_chain(&tdx_mce_nb);
        mce_register_decode_chain

At runtime:

.notifier_call = tdx_mce_notifier
    tdx_mce_notifier
/* Clear poisoned bit to avoid further #MC */
static int tdx_mce_notifier(struct notifier_block *nb, unsigned long val, void *data)
{
	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
	struct mce *m = (struct mce *)data;
	unsigned long kaddr;
	unsigned long addr;
	struct page *page;
	u16 hkid;

    // we need this feature to clear the poison bit
	if (!boot_cpu_has(X86_FEATURE_MOVDIR64B))
		return NOTIFY_DONE;

    // TDX currently only cares about memory errors; return directly for other types.
	if (!m)
		return NOTIFY_DONE;
	if (!mce_is_memory_error(m))
		return NOTIFY_DONE;

	addr = m->addr & ((1ULL << boot_cpu_data.x86_phys_bits) - 1);
	hkid = m->addr >> boot_cpu_data.x86_phys_bits;

	/* Is hkid used for TDX? */
	if (hkid < tdx_global_keyid)
		return NOTIFY_DONE;

    // Everything below clears the cache-line-sized region at the address kaddr points to.
	/*
	 * MCE handler may make the page non-present in direct map. Map the page
	 * to access.  Use VM_FLUSH_RESET_PERMS flag to tlb flush at vunmap()
	 * and reset direct mapping region.
	 */
    // find the corresponding struct page
	page = pfn_to_page(addr >> PAGE_SHIFT);
	kaddr = (unsigned long)vmap(&page, 1, VM_FLUSH_RESET_PERMS, PAGE_KERNEL);
	if (!kaddr)
		return NOTIFY_DONE;

	/* Adjust page offset. */
	kaddr |= addr & ~PAGE_MASK;
	/* Align to cache line. */
	kaddr = ALIGN_DOWN(kaddr, 64);
	/* Direct write to clear poison bit. */
	movdir64b((void *)kaddr, zero_page);
	__mb();

	vunmap((void *)(kaddr & PAGE_MASK));

	pr_err("cleared poisoned cache hkid 0x%x pa 0x%lx\n", hkid, addr);
	return NOTIFY_DONE;
}

TD Preserving

TD Preserving is not a TDX-module feature written in the spec; it is a feature implemented purely in the kernel/KVM.

Partitioned TD / TD Partitioning

A TDX 1.5 feature.

Designed to provide a minimal environment for supporting the Microsoft VSM and similar architectures.

TDX_OPERAND_BUSY_HOST_PRIORITY

For guest-side functions: The operand is busy (e.g., it is locked in Exclusive mode) due to host priority.

For host-side functions: The operand is busy; the host VMM should retry the operation until it succeeds, to avoid the guest being stuck due to host priority.

TSC in TDX

TDX protects TDX guest TSC state from VMM (VMM cannot access guest's TSC value). The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM owns TSC virtualization for VMs, but the TDX module does for TDs.

Guest TDs are not allowed to modify the TSC. WRMSR attempts to IA32_TIME_STAMP_COUNTER result in a #VE.

Guest TDs are not allowed to access IA32_TSC_ADJUST because its value is meaningless to them.

TDX leaves

arch/x86/virt/vmx/tdx/tdx.h
arch/x86/kvm/vmx/tdx_arch.h
arch/x86/virt/vmx/tdx/tdx_module_loader_old/tdx_arch.h

handle_removed_private_spte() KVM

It does two things: zap (RANGE.BLOCK) and remove (PAGE.REMOVE).

  • If the new SPTE is zapped, only do the first: zap (block).
  • If the new SPTE is removed, do both.
static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
					u64 old_spte, u64 new_spte,
					int level)
{
    //...
    // can't go from zapped to zapped.
	KVM_BUG_ON(was_private_zapped && is_private_zapped, kvm);
    // it's either zapped or removed, so it can't still be present
	WARN_ON_ONCE(is_present);
    // nothing to do if it wasn't a leaf
	if (!was_leaf)
		return;

    // zap
	ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
    // if only a zap is needed, return.
	if (is_private_zapped) {
		/* page migration isn't supported yet. */
        // and the pfn must be the same before and after the zap.
		KVM_BUG_ON(new_pfn != old_pfn, kvm);
		return;
	}
    //...
	/* non-present -> non-present doesn't make sense. */
	KVM_BUG_ON(!was_present, kvm);
    // it is being removed, so new_pfn must not have a value
	KVM_BUG_ON(new_pfn, kvm);
    // PAGE.REMOVE it
	ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
}

TDX private SPTE state transition / state diagram (public state)

A public state is a number that corresponds one-to-one with an SPTE state maintained internally by the TDX module. For this internal state, some SEAMCALLs will, on error:

  • return the SPTE in RCX (the program linked below can decode it);
  • return, in bits 7-15 of RDX, the public state at the time of the error.

https://github.com/tristone13th/code-snippets/blob/main/c/tdx/spte_state.c
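For example, a minimal decode following the note above (see the linked spte_state.c for the full decoder):

/* rcx/rdx as returned by the failing SEAMCALL */
uint64_t sept_entry   = rcx;                 /* the raw SEPT entry */
uint32_t public_state = (rdx >> 7) & 0x1ff;  /* bits 7-15 of RDX */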

All the states and their descriptions:

TDX Base SPEC
9. TD Private Memory Management
9.2. Secure EPT Entry
9.2.1. UPDATED: Overview
Table 9.1: UPDATED: Secure EPT Entry State High Level Description

// this one is more complete
ABI
Table 4.32: Secure L1 EPT Entry TDX State Returned by TDX Interface Functions

Base state diagrams (note that SEPT non-leaf entries and leaf entries have different states):

TDX Base SPEC
9. TD Private Memory Management
9.2. Secure EPT Entry
9.2.2. UPDATED: SEPT Entry State Diagrams

State diagrams with migration added:

TD Migration Spec
Figure 9.3: Partial SEPT Leaf Entry State Diagram for Mapped Page Export
Figure 9.6: Partial SEPT Leaf Entry State Diagram for Pending Page Export
Figure 9.8: Page In-Order Import Phase Partial SEPT Entry State Diagram
Figure 9.9: Page Out-of-Order Import Phase Partial SEPT Entry State Diagram

Each state is represented by these seven bits:

  • SEPT_ENTRY_D_BIT_POSITION (0): dirty bit (bit 9)
  • SEPT_ENTRY_TDEX_BIT_POSITION (1 to 4):
    • tdex, // bit 53 - Exported
    • tdbw, // bit 54 - Blocked for Writing
    • tdb, // bit 55 - Blocked
    • tdp, // bit 56 - Pending
  • SEPT_ENTRY_IPAT_TDMEM_BIT_POSITION (5 to 6):
    • bit 6: // always 1
    • bit 7 // Non-Leaf(0) / Leaf(1)

MAPPED: leaf is 1

BLOCKED: leaf, tdb are 1

BLOCKEDW: leaf, tdbw are 1 (only migration can reach this state)

EXPORTED_BLOCKEDW: leaf, tdbw, tdex are 1 (only migration can reach this state)
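Putting the encoding above into code (a sketch mirroring the linked spte_state.c; the macro names are illustrative):

#define SEPT_STATE_D     (1u << 0)  /* SPTE bit 9:  dirty */
#define SEPT_STATE_TDEX  (1u << 1)  /* SPTE bit 53: exported */
#define SEPT_STATE_TDBW  (1u << 2)  /* SPTE bit 54: blocked for writing */
#define SEPT_STATE_TDB   (1u << 3)  /* SPTE bit 55: blocked */
#define SEPT_STATE_TDP   (1u << 4)  /* SPTE bit 56: pending */
#define SEPT_STATE_ONE   (1u << 5)  /* SPTE bit 6:  always 1 */
#define SEPT_STATE_LEAF  (1u << 6)  /* SPTE bit 7:  non-leaf(0)/leaf(1) */

#define SEPT_STATE_MAPPED             (SEPT_STATE_ONE | SEPT_STATE_LEAF)
#define SEPT_STATE_BLOCKED            (SEPT_STATE_MAPPED | SEPT_STATE_TDB)
#define SEPT_STATE_BLOCKEDW           (SEPT_STATE_MAPPED | SEPT_STATE_TDBW)
#define SEPT_STATE_EXPORTED_BLOCKEDW  (SEPT_STATE_BLOCKEDW | SEPT_STATE_TDEX)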

Why are there 2 different states, FREE and REMOVED?

TDH.MEM.PAGE.REMOVE will mark the state to FREE.

TDH.IMPORT.MEM(Cancel) will mark the state to REMOVED.

BLOCKED + TDH.EXPORT.BLOCKW

In theory, the spec shows no such state transition.

Reading the TDX module code, this transition is not allowed either.

BLOCKED + TDH.EXPORT.MEM

In theory, the spec shows no such state transition.

Reading the TDX module code, this transition is not allowed either.

BLOCKEDW + TDH.MEM.RANGE.BLOCK

In theory, per the spec, this transitions to the BLOCKED state.

leaf 7, index = 6

TD Teardown Process

TD Teardown Process from QEMU


When is the hkid freed?

KVM:

kvm_vcpu_release
kvm_vm_release
    kvm_put_kvm(kvm); // if the refcount drops to 0
        if (refcount_dec_and_test(&kvm->users_count))
            kvm_destroy_vm
                mmu_notifier_unregister
                    subscription->ops->release
                        kvm_mmu_notifier_release
                            kvm_flush_shadow_all
                                kvm_arch_flush_shadow_all
                                    vt_flush_shadow_all_private // vt_x86_ops.flush_shadow_all_private()
                                        tdx_mmu_release_hkid
                                            tdx_hkid_free // free the hkid
                kvm_arch_destroy_vm
                    static_call_cond(kvm_x86_vm_free)(kvm);
                        vt_vm_free
                            tdx_vm_free

Zap in TDX

enum tdp_zap_private {
	ZAP_PRIVATE_SKIP = 0,
	ZAP_PRIVATE_BLOCK,
	ZAP_PRIVATE_REMOVE,
};

These three actions are set in different places, but they are all checked and consumed in the same place: the function tdp_mmu_zap_leafs().

As can be seen from the code of kvm_tdp_mmu_unmap_gfn_range(), they represent three cases:

  • ZAP_PRIVATE_REMOVE: the PAGE.REMOVE SEAMCALL must be issued to remove the page from the TDX module. We use it for gmem invalidation (PUNCH_HOLE), which means we no longer want this gmem range and the kernel may reclaim these pages; if they were not removed from the TDX module first, the kernel would hit errors when handing them out to other processes. Deleting a memory slot also triggers this case.
  • ZAP_PRIVATE_BLOCK: converting from private to shared. Since we can't be sure whether it will be converted back later, we don't remove the page for now, which improves performance.
  • ZAP_PRIVATE_SKIP: the MMU notifier.

ZAP_PRIVATE_SKIP KVM

The default; ZAP_PRIVATE_SKIP is also used for shared mappings. It is only consumed in tdp_mmu_zap_leafs() below (the function that zaps the leaves of a range).

  • For private SPs, do nothing;
  • For shared SPs, set the SPTE leaves directly to SHADOW_NONPRESENT_VALUE, with no intermediate step like private SPTEs have.
// This function can zap the leaves of a private SPT as well as a shared SPT.
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool flush,
			      enum tdp_zap_private zap_private)
{
	bool is_private = is_private_sp(root);
    //...
    // As we can see, SKIP is meaningful for both shared and private;
    // the other two (BLOCK, REMOVE) are only meaningful for private.
    WARN_ON_ONCE(zap_private != ZAP_PRIVATE_SKIP && !is_private);
    //...
    // as the name SKIP says: if this is private and we are skipping, don't zap
	if (zap_private == ZAP_PRIVATE_SKIP && is_private)
		return flush;
    //...
    for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
        // With SKIP this must be a shared SPTE,
        // so this condition can never be met here.
        if ((zap_private == ZAP_PRIVATE_SKIP ||
             zap_private == ZAP_PRIVATE_BLOCK) &&
            is_private_zapped_spte(iter.old_spte))
            continue;
		if (zap_private == ZAP_PRIVATE_REMOVE)
			new_spte = SHADOW_NONPRESENT_VALUE;
        // Although the function is named private_zapped_spte, it checks internally:
        // for a shared SPTE it returns SHADOW_NONPRESENT_VALUE
		else
			new_spte = private_zapped_spte(kvm, &iter);
}

ZAP_PRIVATE_BLOCK KVM

It is only used in the following function:

static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool flush,
			      enum tdp_zap_private zap_private)
    //...
    for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
        if ((zap_private == ZAP_PRIVATE_SKIP || zap_private == ZAP_PRIVATE_BLOCK) &&
            is_private_zapped_spte(iter.old_spte))
            continue;
        //...
        if (zap_private == ZAP_PRIVATE_REMOVE)
			new_spte = SHADOW_NONPRESENT_VALUE;
        // this hints at what BLOCK actually does:
		else
			new_spte = private_zapped_spte(kvm, &iter);
        //...

It's not hard to see that this action sets the SPTE to the private_zapped_spte state, i.e. it sets bit 62 (SPTE_PRIVATE_ZAPPED).

ZAP_PRIVATE_REMOVE KVM

REMOVE goes all the way in one step: the SPTE directly becomes SHADOW_NONPRESENT_VALUE.

static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool flush,
			      enum tdp_zap_private zap_private)
    //...
    for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
        //...
        if (zap_private == ZAP_PRIVATE_REMOVE)
			new_spte = SHADOW_NONPRESENT_VALUE;
		else
			new_spte = private_zapped_spte(kvm, &iter);
        //...
    }

.free_private_spt(), .remove_private_spte(), .drop_private_spte(), .zap_private_spte()

Because these are private-memory related, VMX has no corresponding callbacks; they only have TDX implementations.

Note: .free_private_spt() takes an SPT (a Secure EPT page-table page); all the others take an SPTE.

Paths that set an SPTE to zapped

// When the kernel changes a PTE on its own, KVM is notified via this path
mmu_notifier_change_pte
    __mmu_notifier_change_pte
        kvm_mmu_notifier_ops->change_pte()
            kvm_mmu_notifier_change_pte
                kvm_change_spte_gfn
                    kvm_set_spte_gfn
                        kvm_tdp_mmu_set_spte_gfn
                            set_spte_gfn
                            	tdp_mmu_iter_set_spte(kvm, iter, private_zapped_spte(kvm, iter));


zap_collapsible_spte_range
    tdp_mmu_zap_spte_atomic
        __kvm_tdp_mmu_write_spte(iter->sptep, private_zapped_spte(kvm, iter));


// this path is the gmem gfn-invalidation path
kvm_mmu_unmap_gfn_range
    kvm_unmap_gfn_range
        kvm_tdp_mmu_unmap_gfn_range
            tdp_mmu_zap_leafs
                new_spte = private_zapped_spte(kvm, &iter);

// this path invalidates the whole memslot when a memslot is set
kvm_set_memslot
    kvm_invalidate_memslot
        kvm_arch_flush_shadow_memslot
            kvm_mmu_zap_memslot
                kvm_tdp_mmu_unmap_gfn_range
                    tdp_mmu_zap_leafs
                        new_spte = private_zapped_spte(kvm, &iter);

// seemingly some less common corner cases
__kvm_set_or_clear_apicv_inhibit
kvm_post_set_cr0
update_mtrr
    kvm_zap_gfn_range
        kvm_tdp_mmu_zap_leafs
            tdp_mmu_zap_leafs
            	new_spte = private_zapped_spte(kvm, &iter);


tdx_sept_zap_private_spte() & .zap_private_spte / SPTE_PRIVATE_ZAPPED KVM

This only blocks the GPA range, nothing more.

The TDX patchset actually changed the semantics of zap: before TDX, zapping simply removed the SPTE; with TDX it denotes an intermediate range-blocked state (SPTE_PRIVATE_ZAPPED). Probably for performance: when a page attribute conversion turns a page from private to shared, we are in no hurry to reclaim the memory, so we only zap it, keep the other metadata such as the PFN, and do not call TDH.MEM.PAGE.REMOVE. If the page later needs to be converted back, we save a lot of overhead (e.g. TDH.MEM.PAGE.ADD).

KVM: x86/mmu: add SPTE_PRIVATE_ZAPPED

KVM: x86/tdp_mmu: optimize remote tlb flush

This is preparation to optimize TLB shootdown. The existing code to zap EPT entries always issues the TLB shootdown for each EPT entry and doesn't batch TLB shootdowns when zapping multiple EPT entries. The original procedure is:

  1. clear the EPT entry (in the KVM maintained shadow table).
  2. TDX SEAMCALL TDH.MEM.RANGE.BLOCK with GFN. This corresponds to clearing the present bit.
  3. TDH.MEM.TRACK. corresponds to local tlb flush
  4. send IPI to remote vcpu. This corresponds to remote tlb flush.
  5. When destructing TD, TDH.MEM.PAGE.REMOVE with PFN. There is no corresponding to the VMX EPT operation.

At the last step, the PFN is needed to unlink the private memory from the Secure EPT. Because this procedure is done synchronously, the PFN is saved on the stack.

If we'd like to batch the TLB shootdown (TLB shootdown when entering the guest?), the PFNs need to be saved somewhere, because the stack can't be used as the array of PFNs can be large.

  1. clear multiple EPT entries
  2. TDX SEAMCALL TDH.MEM.RANGE.BLOCK with GFNs. This corresponds to clearing the present bit. Steps 1) and 2) are repeated for multiple GFNs, and then
  3. TDH.MEM.TRACK. corresponds to local tlb flush
  4. send IPI to remote vcpus. This corresponds to remote tlb flush. 3) and 4) are a batched TLB shootdown.
  5. When destructing the VM, TDH.MEM.PAGE.REMOVE with PFNs. There is no corresponding VMX EPT operation.

For step 5), the PFNs need to be remembered somewhere. One option is to use the zapped EPT entry itself, by setting the special flag SPTE_PRIVATE_ZAPPED. This shows why SPTE_PRIVATE_ZAPPED was introduced: when a page is marked zapped it is not removed immediately, but only when the SPTE is about to transition to some other (non-zapped) state. When the whole TD is finally shut down, every SPTE is set to SHADOW_NONPRESENT_VALUE, triggering the removal of the pages previously marked private-zapped:

kvm_mmu_notifier_release
    kvm_flush_shadow_all
        kvm_arch_flush_shadow_all
            kvm_mmu_zap_all
                kvm_tdp_mmu_zap_all
                    tdp_mmu_zap_root
                        tdp_mmu_set_spte_atomic(SHADOW_NONPRESENT_VALUE)
                            handle_changed_spte
                                // this condition is important: it is the key to the delayed removing,
                                // going from the zapped state to the non-present state
                            	if (was_private_zapped && !is_present) {
                                    handle_private_zapped_spte
                                        // TDH.MEM.PAGE.REMOVE
                                    	ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
{
    //...
	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
    //...
	WRITE_ONCE(kvm_tdx->has_range_blocked, true);
}

tdx_sept_drop_private_spte() & .drop_private_spte KVM

Although it is named drop_private_spte, its main job is to reclaim the one page the SPTE points to.

It removes, inside the TDX module, the private page the SPTE maps to, and clears the corresponding SPTE value (not needed when we are tearing down the TD).

The reason tdx_unpin() is called here, I think, is that when the gmem/restricted fd is closed, the corresponding memory is not reclaimed by the kernel; so tdx_unpin() must be called here to reclaim the private memory page by page.

It is called from two places:

  • tdx_sept_remove_private_spte
  • rmap_remove
static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
                        			  enum pg_level level, kvm_pfn_t pfn)
{
    int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_module_output out;
	gpa_t gpa = gfn_to_gpa(gfn);
	hpa_t hpa = pfn_to_hpa(pfn);
	hpa_t hpa_with_hkid;
	int r = 0;
	u64 err;
	int i;

    // This means we are destroying the TD, because we don't need
    // to set the spte value to 0, so we use reclaim SEAMCALL
	if (!is_hkid_assigned(kvm_tdx)) {
		/*
		 * The HKID assigned to this TD was already freed and cache
		 * was already flushed. We don't have to flush again.
		 */
		tdx_reclaim_page(hpa, level, false, 0);
        // actually free the page
        tdx_unpin(kvm, gfn, pfn, level);
		return 0;
	}

    //...
    // remove the private page inside the TDX module and clear the corresponding SPTE value.
	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
    //...

    // handles the case where the page is a huge page: every 4KB page must be reclaimed.
	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
        // associate the hpa with the hkid
		hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
        // ...
        // Write back and invalidate all cache lines with the specified page
		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
        tdx_set_page_present(hpa);
        // tdx_unpin supports reclaiming a whole huge page, but the argument here is
        // PG_LEVEL_4K, meaning we reclaim one page at a time. Room for optimization?
        tdx_unpin(kvm, gfn + i, pfn + i, PG_LEVEL_4K);
		hpa += PAGE_SIZE;
	}
	return r;
}

tdx_sept_remove_private_spte() & .remove_private_spte() KVM

The SPTE here is the same as in handle_changed_spte(); it is not necessarily a last-level SPTE.

Remove = TLB flush + Drop

static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
					 enum pg_level level, kvm_pfn_t pfn)
{
	if (is_hkid_assigned(to_kvm_tdx(kvm)))
		kvm_flush_remote_tlbs(kvm);

	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
}

It is called from two places: handle_changed_private_spte() and handle_private_zapped_spte():


__handle_changed_spte
    handle_private_zapped_spte
    handle_changed_private_spte

tdx_sept_free_private_spt() & .free_private_spt KVM

free_private_spt() is (obviously) called when a shadow page is being zapped.

It is called from both kvm_mmu_free_shadow_page and handle_removed_pt. The gfn passed in is sp->gfn, i.e. the gfn mapped by the first entry of this MMU page. The function does not free the page table pointed to by the entry at this gfn; it frees the whole MMU page containing that entry, i.e. the sp.

The main thing this function does is issue the tdh_mem_sept_remove SEAMCALL so that the TDX module stops tracking this EPT page.

static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level, void *private_spt)
{
    // The page's gfn identifies an SPTE; the gfn range (key) covered by this SPTE's
    // parent can be derived by masking off some bits of the gfn.
	gpa_t parent_gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level + 1);
	int parent_tdx_level = pg_level_to_tdx_sept_level(level + 1);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_module_output out;
	u64 err;

    // If we are in the teardown process, reclaim this page-table page from the TDX module.
    // As for reclaiming it in the kernel: kvm_mmu_free_private_spt will free_page it, so nothing to do here.
	if (!is_hkid_assigned(kvm_tdx))
		return tdx_reclaim_page(__pa(private_spt), PG_LEVEL_4K, false, 0);

    // first block the whole range represented by the parent page
	if (kvm_tdx->td_initialized)
		err = tdh_mem_range_block(kvm_tdx->tdr_pa, parent_gpa, parent_tdx_level, &out);

    // Flush TLB on all vcpus
	tdx_track(kvm_tdx);

    // remove the shadow page containing the parent entry
    // TDH.MEM.SEPT.REMOVE removes an empty Secure EPT page or pages, with all 512 entries marked as FREE
	err = tdh_mem_sept_remove(kvm_tdx->tdr_pa, parent_gpa, parent_tdx_level, &out);
    //...

    // this EPT page may still live in cache lines; write back and invalidate them
	err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt), kvm_tdx->hkid));
    //...
	return 0;
}

TDX Basic

Documentation

The GHCI is mainly about TDG.VP.VMCALL, one leaf of TDCALL.

Why, on a privileged instruction such as CPUID, doesn't the VMM just emulate it and return? Instead a #VE is delivered and the guest then issues TDG.VP.VMCALL<Instruction.CPUID>?

Scripts

guest kernel cmdline should add idle=poll.

How to see TDX module version of a running system?

dmesg | grep "TDX module"

TD_PARAMS Structure

ABI 3.4.4. UPDATED: TD_PARAMS

KVM calls TDH.MNG.INIT (passing the TD_PARAMS structure) to initialize the TD. Its size is 1KB.

  • From KVM to TDX module: pass TD_PARAMS;
  • From QEMU to KVM: pass init_vm.
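A sketch of the two hops listed above (KVM_MEMORY_ENCRYPT_OP / KVM_TDX_INIT_VM and setup_tdparams() are from the TDX patch series as I understand it; treat the exact names as assumptions):

/* QEMU -> KVM: the vm ioctl carrying struct kvm_tdx_init_vm */
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);  /* cmd.id = KVM_TDX_INIT_VM */

/* KVM -> TDX module: translate init_vm into the 1KB TD_PARAMS and
 * pass it to TDH.MNG.INIT (see __tdx_td_init() earlier) */
setup_tdparams(kvm, td_params, init_vm);
err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);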

Persistent SEAMLDR and Non-Persistent SEAMLDR

NP-SEAMLDR: Load P-SEAMLDR. This one is the binary /boot/efi/EFI/TDX/TDX-SEAM_SEAMLDR.bin.

P-SEAMLDR: Load TDX Module.

How does host know the TDX module can be trusted?

TD Measurement and Attestation.

Intel SGX-Based Attestation.

All TD measurements are reflected in TD attestations.

run-time measurement registers can be used by the guest TD software, e.g., to measure a boot process.

What is XMM register?

Finalize the TD measurement?

  • Its measurement cannot be modified anymore (except the run-time measurement registers).
  • TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER).

When?

After the initial set of pages is added and extended, the VMM can finalize the TD measurement using the TDH.MR.FINALIZE function.

TD Exit logic

TD guest code ->

TD Exit reasons

TD Exit qualification

TDX ABI
3. Data Types
3.7. TD Entry and Exit Types
3.7.1. Extended Exit Qualification

TDX SPTE bits illustrated

typedef union ia32e_sept_u {
    struct {
        uint64_t
            r          :   1,  // 0
            w          :   1,  // 1
            x          :   1,  // 2
            mt         :   3,  // 3-5 - Set to 110 (WB)
            ipat_tdmem :   1,  // 6 - Set to 1
            leaf       :   1,  // 7 - Non-Leaf(0) / Leaf(1), always 1 for 4KB (level 0)
            a          :   1,  // 8 - Accessed
            d          :   1,  // 9 - Dirty, set and cleared by the TDX module in all the *EXPORTED_* states
            tdel       :   1,  // 10 - Entry Lock
            reserved_0 :   1,  // 11 - Reserved for IOMMU SNP
            base       :   40, // 12-51
            hp         :   1,  // 52 - Host Priority, used together with TDEL
            tdex       :   1,  // 53 - Exported
            tdbw       :   1,  // 54 - Blocked for Writing
            tdb        :   1,  // 55 - Blocked
            tdp        :   1,  // 56 - Pending
            tdpin      :   1,  // 57 - 1: Page is pinned in memory
            pw         :   1,  // 58 - Paging-Write
            ignored_0  :   1,  // 59
            sss_tdsa   :   1,  // 60 - Supervisor Shadow Stack / SEPT Alias (Link)
            reserved_1 :   1,  // 61 - Reserved for IOMMU
            reserved_2 :   1,  // 62 - Reserved for IOMMU (BlockDMA)
            supp_ve    :   1;  // 63
    };
    uint64_t raw;
} ia32e_sept_t;

What is Asynchronous TD Exit and TD Resumption?

The notions of synchronous and asynchronous here are relative to the guest TD's code flow.

A TD Exit might be asynchronous, triggered by some external event (e.g., external interrupt or SMI) or an exception, or it might be synchronous, triggered by a TDCALL(TDG.VP.VMCALL) function.

Asynchronous exits, e.g. an interrupt or an EPT violation, originate outside the guest TD's code flow, so they count as asynchronous.

  • Guest TD memory access to a non-present private GPA causes an asynchronous TD exit with an EPT Violation exit reason.

A VMCALL, by contrast, is invoked by the TD itself, so it counts as synchronous.

What is Synchronous TD Exit and Subsequent TD Entry?

It is relative to the TD: the TD can exit by invoking TDG.VP.VMCALL, so that exit is synchronous.

What's the difference between TDX modes and SEAM modes?

TDX modes are logical concepts that don't physically exist; the SEAM modes are the physical ones.

TDX root mode contains:

  • Non-SEAM root mode: KVM is running under this mode.
  • half of SEAM root mode: the TDX module serving host-side functions runs under this mode.

TDX non-root mode contains:

  • SEAM non-root mode: TD is running under this mode.
  • half of SEAM root mode: the TDX module serving guest-side functions runs under this mode.

You can refer to figure 2.2 in TDX spec for more information.

SEAM is Secure Arbitration Mode

Like SMM, it is a new mode.

You can imagine a cube with x, y and z axes; they are:

  • x: Modes, such as SMM, SEAM
  • y: Out of VMX, VMX Root, VMX Non-Root
  • z: Ring 0-3

Shared EPT and Secure EPT

                    Shared EPT            Secure EPT
  For               Shared GPAs           Private GPAs
  Managed by        VMM                   VMM, indirectly
  GPA encrypted by  Key shared with VMM   TD private key
  Encrypted?        Not                   Not

Shared key and TD private key

Shared/Private is a bit in physical address.

  • Shared accesses are intended to behave as legacy memory accesses and use the upper bits of the host physical address as an HKID, which must be from the range allocated to legacy MKTME.
  • Private accesses use the guest TD’s private HKID. (Which means it won't use the key in the upper bits!)
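In KVM terms, the position of this bit comes from the TD's GPAW (bit 51 or 47, see __tdx_td_init() earlier), and a GPA is switched between its private and shared aliases by masking that bit; a small sketch:

/* gfn_shared_mask was set from the GPAW: BIT_ULL(51) or BIT_ULL(47) */
gpa_t shared_gpa  = gpa |  gfn_to_gpa(kvm_gfn_shared_mask(kvm));
gpa_t private_gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));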

How to understand "Accessible" and "Addressable"?

Accessible (Memory): Memory whose content is readable and/or writeable (e.g., TD private memory is accessible to the guest TD).

Addressable (Memory): Memory that can be referred to by its address. The content of addressable memory might not necessarily be accessible (e.g., TDCS is not accessible to the host VMM, you need to use SEAMCALL to invoke the TDX Module).

TDH submodules?

MNG: Management

MR: Measurement Register

MEM: Memory

VP: Virtual Processor

All the interfaces can be seen in TDX Spec: Sec 2.9.

tdsysinfo_struct

The tdsysinfo_struct is fairly large (1024 bytes) and contains a lot of info about the TDX module.

tdsysinfo_struct is also a structure defined in KVM.

TDH.SYS.RD/RDALL and TDH.SYS.INFO can both enumerate TDX module capabilities.

TDH.SYS.RD and TDH.SYS.RDALL are added in TDX 1.5, and are the recommended enumeration methods.

tdx_capabilities (KVM)

Used by KVM to store, after parsing and transformation, the fields of tdsysinfo_struct it cares about.

struct tdx_capabilities {
    //...
	u8 sys_rd; // whether the TDH.SYS.RD SEAMCALL is supported
	u32 max_servtds; // max number of service TDs that can be bound at the same time
    //...
	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
};

What is initialization?

Who initializes? The host VMM.

Who is initialized? The TDX module.

When? After the module is loaded.

What exactly is the TDX module?

TDX module is in SEAM VMX Root.

TD is in SEAM VMX non-root.

So the TDX module and a TD are like a VMM and a VM.

The TDX module also uses VMRESUME/VMLAUNCH to start a VM. (This is wrapped in the TDX SEAMCALL TDH.VP.ENTER, so actually, KVM can just call this SEAMCALL.)

What is CMR (Convertible Memory Ranges)?

Memory regions that can hold TD-private memory pages.

The meaning of "Convertible": Can be converted from a Shared page to a Private page.

TDX Spec: 13.1.4.1

A 4KB memory page is defined as convertible if it can be used to hold an Intel TDX private memory page or any Intel TDX control structure pages while helping guarantee Intel TDX security properties (i.e., if it can be converted from a Shared page to a Private page).

  • CMR configuration is checked by MCHECK and cannot be modified afterwards.

How does KVM get the CMR information?

The host VMM should then call the TDH.SYS.RD/RDALL or TDH.SYS.INFO function to enumerate the Intel TDX module functionality and parameters, and retrieve the trusted platform topology and CMR information as previously checked by MCHECK.

What's the relationship between TDMR and CMR?

TDMR is very similar with CMR, it also has many characteristics:

                       CMRs              TDMRs
  Size                 multiple of 4KB   multiple of 1GB
  Power of 2           not required      not required
  Can overlap?         No                No
  Scope                Platform          Platform
  Soft configuration   Yes               Yes
  Physical or Virtual  Physical          Physical

Every TDMR page must reside within a CMR. There is no requirement for TDMRs to cover all CMRs.

  • During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees.
  • KVM should decide on a set of TDMRs based on the CMR information.
  • KVM should then call the TDH.SYS.CONFIG function and pass TDMR information with other configuration information.
  • KVM should then use the TDH.SYS.TDMR.INIT function to initialize the TDMRs and their associated control structures.

TDX reports a list of CMR to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime.

SEAM module / TDX module

Identical.

What's the difference between SEAMRR and CMR?

                 SEAMRR                                       CMR
  Intention      Loading and executing the Intel TDX module   Holding TDX memory pages encrypted with a private HKID
  Configured by  BIOS                                         BIOS
  It is a        Register                                     Table
  P or V         Physical Range                               Physical Range

SEAMRR is for memory range for loading and executing the Intel TDX module.

MCHECK stores the CMR table in a pre-defined location in SEAMRR’s SEAMCFG region so it can be read later and trusted by the Intel TDX module.

Does TDX module know the content of KET(Key Encryption Table)?

What is a service TD?

// currently only migtd is supported
enum kvm_tdx_servtd_type {
	KVM_TDX_SERVTD_TYPE_MIGTD = 0,
	KVM_TDX_SERVTD_TYPE_MAX,
};

Service TD to target TD binding relationship is many-to-many

  • Multiple service TDs of different types may be bound to a single target TD. (One target TD can have no more than 1 service TD of the same type.)
  • Multiple target TDs may be bound to a single service TD.

What is ephemeral key?

It is just another name for the key in MKTME to encrypt TD pages.

TDVMCALL

TDVMCALL: Guest Call a host VMM service.

  • It is a TDCALL, the function name is TDG.VP.VMCALL
  • The call is forwarded by TDX module to the host VMM (e.g., KVM), so it is a hypercall implemented in the TDX context.
  • EXIT_REASON_TDCALL
static int vt_handle_exit(struct kvm_vcpu *vcpu,
			     enum exit_fastpath_completion fastpath)
{
	if (is_td_vcpu(vcpu))
		return tdx_handle_exit(vcpu, fastpath);

	return vmx_handle_exit(vcpu, fastpath);
}

Actually, every TDVMCALL is triggered by a TDCALL(TDG.VP.VMCALL); when processing moves from the TDX module to the hypervisor, it is called a TDVMCALL.

TDVPS.LAST_TD_EXIT has a value named TDVMCALL, which denotes that the last TD exit was due to a TDG.VP.VMCALL. On the next TD entry, most GPR and all XMM state will be forwarded to the guest TD from the host VMM.

For more, pls. refer to TDX Spec: Sec 2.9.

All the interfaces can be seen in TDX Spec: Sec 2.9.1.

Relationship between PAMT and Secure EPT

Control Structures

TDVPS, TDVPR, TDVPX?

             TDVPS   TDVPR        TDVPX
  Name       State   Root         Extension
  Pages      Multi   1            Multi
  Page type  Mixed   TDVPR Page   TDCX Pages

TDVPR is the root (first) page of a TDVPS.

TDVPX are the non-root pages of a TDVPS.

TDCX: 4KB physical pages that are intended to hold parts of a multi-page control structure. (It is a page type)

TDVPS includes the VMCS of the TD, so I think TDVPS can be seen as the superset to the VMCS.

What's the difference between TD Root (TDR), TDCS, TDVPS?

It seems TDVPS is like the VMCS: both control a vCPU.

The Intel TDX module is designed to load CPU state from the TDVPS structure and perform VM entry to go into TDX non-root mode. When a TD exit is triggered, the Intel TDX module is designed to save CPU state into the TDVPS structure and load the CPU state that was saved on TD entry.

                TDR                              TDCS                              TDVPS
  Meaning       Trust Domain Root                Trust Domain Control Structure    Trust Domain Virtual Processor State
  Scope         TD                               TD                                VCPU
  Controls      key management, build/teardown   the operation of a guest TD       operation/state of a guest TD vCPU
  Pages         1                                Multi                             Multi
  Encrypted by  global private HKID              guest's private HKID (TDR/TDCX)   guest's private HKID (TDVPR/TDCX)

For more, pls. refer to TDX Spec: Table 2.2: TDX-Managed Control Structures Overview.

TDX host kernel code learning

Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption. It is assumed that the TDX host patch series implements the necessary functions.

TDX_MODULE_CALL / seamcall() / kvm_seamcall()

tdcall and seamcall are two different instructions, but the ABIs for using them are similar, so TDX_MODULE_CALL is defined to handle both cases:

  • TDX_MODULE_CALL host=1: use seamcall
  • TDX_MODULE_CALL host=0: use tdcall

As we can see, __seamcall() is implemented like this, and __tdx_module_call_asm is the TDCALL flavor:

SYM_FUNC_START(__seamcall)
	FRAME_BEGIN
	TDX_MODULE_CALL host=1
	FRAME_END
	RET
SYM_FUNC_END(__seamcall)

SYM_FUNC_START(__tdx_module_call_asm)
	FRAME_BEGIN
	TDX_MODULE_CALL host=0
	FRAME_END
	RET
SYM_FUNC_END(__tdx_module_call_asm)

__seamcall() is called from both seamcall() and kvm_seamcall():

// this function is mainly used by the kernel
seamcall()
    // adds some struct wrapping
    __seamcall()
        TDX_MODULE_CALL host=1

// this function is mainly used by KVM
kvm_seamcall()
    __seamcall()
        //...

Why were two wrapper functions designed?


KVM TDX basic feature support (by Isaku)

Note that MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the same value for all VCPUs of the same TD.

The CPU translates shared GPAs using the usual EPT or "Shared EPT" (in this document), which resides in KVM memory. The Shared EPT is directly managed by the host VMM - the same as with the current VMX.

This part I haven't understood:

Since execution of such interface functions takes much longer than accessing memory directly, in KVM we use the existing TDP code to mirror the Secure EPT for the TD. This way, we can effectively walk the Secure EPT without using the TDX interface functions.

One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if the guest physical address is private (the bit is cleared) or shared (the bit is set). The bits are called stolen bits.

Because it's costly to access secure EPT during walking EPTs with SEAMCALLs for the private guest physical address, another private EPT is used as a shadow of Secure-EPT with the existing logic at the cost of extra memory.

Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx.

Dependency:

The assumed APIs the TDX host patch series provides are
- int seamrr_enabled()
  Check if required cpu feature (SEAM mode) is available. This only check CPU
  feature availability.  At this point, the TDX module may not be ready for KVM
  to use.
- int init_tdx(void);
  Initialization of TDX module so that the TDX module is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
  Return the system wide information about the TDX module.  NULL if the TDX
  isn't initialized.
- u32 tdx_get_global_keyid(void);
  Return global key id that is used for the TDX module itself.
- int tdx_keyid_alloc(void);
  Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
  Free HKID for guest TD.

Tear Down

As part of the TD teardown process, the VMM needs to put the TD into a TD_TEARDOWN state, as described in 6.3. This is a non-recoverable state.

as long as the TDR page is the last one to be reclaimed.

For TDR page, the intention is for the host VMM to call TDH.PHYMEM.PAGE.WBINVD after calling TDH.PHYMEM.PAGE.RECLAIM.

Functions such as TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE are designed to remove TD private pages and Secure EPT pages, respectively.

MKTME

Why use multiple keys? Is 1 key not enough?

Tenants can set up their own keys to encrypt their VMs.

Config

Misc

-no-hpet is a required option to boot a TD.

q35 is the required machine type to boot a TD.

q35 affects the virtio network card, making the Ubuntu guest (Ubuntu 16 and Ubuntu 22 were tested) unable to access the network.

Segmentation fault, libc.so.6 on Ubuntu 22

Add "noccfilter" to the guest command line.

Apply this guest kernel patch: x86/tdx: Virtualize CPUID leaf 0x2 · intel-innersource/os.linux.cloud.mvp.kernel-dev@a2bc4b6

Ubuntu 22 doesn't enable the network card

ip a

# do not open in tmux
sudo vim /etc/netplan/01-netcfg.yaml
sudo netplan apply
network:
  version: 2
  ethernets:
    enp0s1:
      dhcp4: true
      dhcp6: false

Ubuntu 22.04 LTS : Configure DHCP Client : Server World

This only needs to be configured once; it persists across reboots.

TDCALL

TDCALL is an instruction; RAX selects the leaf.

TDCALL(guest interface): used by the guest TD software (in TDX non-root mode) to invoke guest-side TDX functions.

  • From: Guest TD
  • To: TDX Module
To find the leaf functions, see ABI: 5.4.1. TDCALL Instruction (Common), Guest-Side (TDCALL) Interface Functions; there is a table that lists all the leaves.
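A minimal sketch of issuing a TDCALL (the .byte sequence is the instruction's real encoding; the wrapper itself is illustrative — the kernel's actual one is the TDX_MODULE_CALL asm discussed later):

/* Sketch: invoke a guest-side TDX function; RAX selects the leaf. */
static inline u64 tdcall(u64 leaf, u64 rcx, u64 rdx)
{
    u64 ret;

    asm volatile(".byte 0x66,0x0f,0x01,0xcc"   /* TDCALL */
                 : "=a"(ret), "+c"(rcx), "+d"(rdx)
                 : "a"(leaf)
                 : "memory");
    return ret; /* completion status: 0 means success */
}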

TDG.VP.VMCALL (TDCALL Leaf (RAX) = 0)

ABI: 7.5.29. TDG.VP.VMCALL Leaf: the GHCI is entirely about this leaf.

R11 indicates the sub-function.

TDG.VP.VMCALL <SetupEventNotifyInterrupt> (sub-function (R11) = 0x10004)

The guest TD may request that the VMM use a specific interrupt vector as an event-notify vector.

Example of an operation that can use the event notify is the VMM signaling a device removal to the TD, in response to which a TD may unload a device driver.

The VMM should use SEAMCALL[TDWRVPS] leaf to inject an interrupt at the requested interrupt vector into the TD VCPU that executed TDG.VP.VMCALL<SetupEventNotifyInterrupt> via the posted-interrupt descriptor.

TDG.VP.VMCALL <MapGPA> (sub-function (R11) = 0x10001)

Please search in GHCI 3.2.

Request the host VMM to map a GPA range as private- or shared-memory mappings. The guest indicates whether a shared or private mapping is desired via the Shared bit in the start GPA of the address range.

Note that this converts in both directions between private and shared; i.e., the guest can request that a private mapping be converted to shared.

The aim is for the VMM to use TDH.MEM.PAGE.AUG to add the GPA(s) to the TD as pending, private mapping(s) in the Secure EPT. When the VMM responds to this TDG.VP.VMCALL with success, the goal is for the TD to execute TDG.MEM.PAGE.ACCEPT to complete the process to make the page(s) usable.

So this is not equivalent to TDG.MEM.PAGE.ACCEPT: its purpose is to let the guest notify the VMM to AUG first, so that the guest itself can then ACCEPT.

Let's look at the code in the TDX guest kernel:

#define TDVMCALL_MAP_GPA		0x10001


tdx_enc_status_changed
tdx_enc_status_changed_phys
_tdx_hypercall(TDVMCALL_MAP_GPA, ...)



tdx_accept_memory
tdx_enc_status_changed_phys
_tdx_hypercall(TDVMCALL_MAP_GPA, ...)

TDG.VP.VMCALL <Service> (sub-function (R11) = 0x10005) / tdx_handle_service() / struct tdvmcall_service KVM

The <Service> means this TDCALL is from a service TD, not a normal TD.

Service is identified by the GUID in the command buffer.

The service command is identified by the command field in the data field of the command buffer.

Command/Response Buffer (CRB)

This buffer is allocated by the TD and shared with KVM:

  • The command buffer is filled by the service TD with commands for KVM to handle, and
  • the response buffer is filled by the KVM to respond to the service TD.
  • These 2 buffers shouldn't be private. (Because we need KVM to get the information!)

When receiving the TDG.VP.VMCALL, KVM allocates 2 host buffers of the same size as the command buffer and response buffer, and copies the commands into the host-side buffer. When the command handling is done, the response data in the KVM-allocated response buffer is copied to the service TD's shared response buffer. This avoids the inconvenience of directly accessing userspace memory. (You might ask: we TDVMCALL-exit directly from the guest TD into KVM, with QEMU not involved, so why would we copy data from userspace? Because the shared buffer the guest TD passes in is still, at bottom, a userspace memory region managed by QEMU; that is exactly why this copy works — the guest's memory is userspace memory.)

Can be async, which means KVM can return immediately and interrupt guest when response is ready. This is controlled by the R14 register of the command buffer.

The memory layout of the CRB is as follows (as defined by the GHCI):

struct tdvmcall_service {
	guid_t   guid;
    // length of the whole CRB (i.e. the size of this structure)
	uint32_t length;
	uint32_t status;
	uint8_t  data[0];
};

TDG.VP.VMCALL <Service.Query>

The Query service currently only has a query command.

Allows the service TD to query if a service handling is supported by KVM.

TDG.VP.VMCALL <Service.MigTD>

The MigTD service currently has a bunch of commands supported.

  • WaitForRequest: check with KVM whether there is an operation that needs to be performed on the MigTD side.
  • ReportStatus: after doing the operation, report its status back to KVM.

This is used to allow MigTD to get the migration information from VMM.

handle_tdvmcall
    TDG_VP_VMCALL_SERVICE
        tdx_handle_service
static int tdx_handle_service(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_tdx *tdx = to_kvm_tdx(kvm);
    // the CRB addresses are passed in via registers
	gpa_t cmd_gpa = tdvmcall_a0_read(vcpu) & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));
	gpa_t resp_gpa = tdvmcall_a1_read(vcpu) & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));
	uint64_t nvector = tdvmcall_a2_read(vcpu);
	struct tdvmcall_service *cmd_buf, *resp_buf;
	enum tdvmcall_service_id service_id;
	bool need_block = false;
	int ret = 1;
	unsigned long tdvmcall_ret = TDG_VP_VMCALL_INVALID_OPERAND;

    // the CRB must not be private memory
	if (kvm_mem_is_private(kvm, gpa_to_gfn(cmd_gpa)) ||
	    kvm_mem_is_private(kvm, gpa_to_gfn(resp_gpa))) {
		pr_warn("%s: cmd or resp buffer is private\n", __func__);
		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
		goto err_cmd;
	}

    // as noted above, allocate 2 host buffers
	cmd_buf = tdvmcall_servbuf_alloc(vcpu, cmd_gpa);
	resp_buf = tdvmcall_servbuf_alloc(vcpu, resp_gpa);
	resp_buf->length = sizeof(struct tdvmcall_service);

	service_id = tdvmcall_get_service_id(cmd_buf->guid);
	switch (service_id) {
	case TDVMCALL_SERVICE_ID_QUERY:
		tdx_handle_service_query(cmd_buf, resp_buf);
		break;
	case TDVMCALL_SERVICE_ID_MIGTD:
		if (nvector) {
			pr_warn("%s: interrupt not supported, nvector %lld\n",
				__func__, nvector);
			nvector = 0;
			break;
		}
		need_block = tdx_handle_service_migtd(tdx, cmd_buf, resp_buf);
		break;
	case TDVMCALL_SERVICE_ID_VTPM:
	case TDVMCALL_SERVICE_ID_VTPMTD:
	case TDVMCALL_SERVICE_ID_TDCM:
	case TDVMCALL_SERVICE_ID_TPA:
	case TDVMCALL_SERVICE_ID_SPDM:
		ret = 0;
		break;
	default:
		resp_buf->status = TDVMCALL_SERVICE_S_UNSUPP;
		pr_warn("%s: unsupported service type\n", __func__);
	}

	if (ret == 0) {
		/* user handles the service and update the guest status buf */
		ret = tdx_vp_vmcall_to_user(vcpu);
		kfree(resp_buf);
	} else {
		/* Update the guest status buf and free the host buf */
		tdvmcall_status_copy_and_free(resp_buf, vcpu, resp_gpa);
		tdvmcall_ret = TDG_VP_VMCALL_SUCCESS;
	}

err_status:
	kfree(cmd_buf);
	if (need_block && !nvector)
		return kvm_emulate_halt_noskip(vcpu);
err_cmd:
	if (ret) {
		tdvmcall_set_return_code(vcpu, tdvmcall_ret);

		if (nvector)
			tdx_inject_notification(vcpu, nvector);
	}

	return ret;
}

SEAMCALL

Used by the host VMM to invoke host-side TDX interface functions.

  • From: Host VMM (e.g., KVM)
  • To: TDX Module

All SEAMCALL leaves

To find the leaf functions, see ABI: 5.3.1. SEAMCALL Instruction (Common); there is a table:

Table 5.4: SEAMCALL Instruction Leaf Numbers Definition

lists all the leaves.

SEAMCALL Completion Status Codes

64-bit, returned in RAX.

ABI 3.1.3. Function Completion Status Codes

TDH.MNG.KEY.CONFIG (SEAMCALL Leaf 8)

Configure the TD private key on a single package.

Input is the TDR physical address with HKID set to 0.

A CPU-generated random key is used. The operation may fail due to lack of entropy.

A KET entry in the private HKIDs range is configured per package by KVM using this function.

Why? If the TDR address is passed with HKID set to 0, how does MKTME build the connection between the HKID and the key?

The HKID is written to the TDR during the TDH.MNG.CREATE function, so the HKID lives in the TDR's content, not in the physical address.

TD-scope key management fields are held in TDR. They include the key state, ephemeral private HKID and key information, and a bitmap for tracking key configuration.

TDH.MEM.SEPT.ADD

Add and map 4KB SEPT pages to a TD. This is not used to map a GPA to an HPA; it adds pages that will hold the contents of SEPT page-table pages.

This is the non-leaf-entry counterpart of TDH.MEM.PAGE.ADD, which handles leaf entries:

  • TDH.MEM.PAGE.ADD adds the actual TD guest page that a leaf entry maps to;
  • TDH.MEM.SEPT.ADD adds the page that a non-leaf entry maps to, used by the SEPT itself.

TDH.MEM.SEPT.ADD adds a set of 4KB Secure EPT pages to a TD and maps them to the provided GPA.

TDH.MEM.SEPT.ADD initializes the SEPT pages to hold 512 free entries using the TD’s ephemeral private key.

In the code this is mainly invoked from tdx_sept_link_private_spt(), and that function is very simple: it just issues this SEAMCALL, adding the page-table page to the TDX Module and mapping the given GPA to it (see the sketch below).

The main inputs are a GPA and an HPA:

  • HPA: the SEPT page to be added.
  • GPA: the address to be mapped.

The effect is that the SEPT entry at the given GPA's position points to the HPA of the page being added.
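
A sketch of tdx_sept_link_private_spt() along the lines of the TDX KVM patches (error handling trimmed; exact names may differ between patch revisions):

static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, void *private_spt)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	gpa_t gpa = gfn_to_gpa(gfn);
	hpa_t hpa = __pa(private_spt);	/* the page that will hold SEPT entries */
	struct tdx_module_output out;
	u64 err;

	/* Ask the TDX module to wire this page in as a non-leaf SEPT page. */
	err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa,
			       pg_level_to_tdx_sept_level(level), hpa, &out);
	if (err)
		return -EIO;

	return 0;
}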

The benefit of designing the API this way is flexibility; it can:

  • add an intermediate page-table mapping, pointing to the HPA of the next-level page-table page;
  • or add a last-level mapping, pointing to the HPA the GPA should ultimately map to. (My guess, since TDH.MEM.PAGE.ADD and TDH.MEM.PAGE.AUG can already handle last-level mappings: once the TD has been finalized, TDH.MEM.PAGE.AUG is used to add pages; before that, TDH.MEM.PAGE.ADD is.)

The main flow of this SEAMCALL is:

  • Walk the SEPT based on the GPA and level and find the SEPT entry
  • Initialize the new SEPT page, indicating 512 entries in the FREE state
  • Update the parent SEPT entry with the new SEPT page HPA.
  • Increment TDR.CHLDCNT.

TDH.PHYMEM.PAGE.WBINVD

Write back and invalidate all cache lines associated with the specified memory page.

TDH.MEM.SEPT.REMOVE

TDH.MEM.SEPT.REMOVE removes an empty Secure EPT page or pages, with all 512 entries marked as FREE, from the TD’s Secure EPT trees.

  • Walk the L1 Secure EPT based on the GPA operand and find the non-leaf SEPT entry of the SEPT page to be removed.
  • Scan the L1 Secure EPT page content and check all 512 entries are FREE. If passed, set the parent L1 Secure EPT entry to FREE.
  • Atomically decrement TDR.CHLDCNT.

The main input is just the GPA.

TDH.MEM.PAGE.ADD

Add a 4KB private page to a TD, mapped to the specified GPA, filled with the given page image.

Input:

  • GPA: the guest address the page will be mapped at
  • HPA: HPA of the target page to be added to the TD

Both AUG and ADD increment TDR.CHLDCNT. Note that ADD adds pages at build time, whereas AUG adds them dynamically.

This SEAMCALL updates the SEPT entry to point to the given HPA.

The VMM presumably also keeps its own copy of this GPA-to-HPA mapping.

TDH.MEM.PAGE.AUG

Dynamically add a 4KB or a 2MB private page to an initialized TD, mapped to the specified GPAs.

Its alias is "shared to private conversion". It merely adds the page to the TDX Module, turning it into a private page that can be encrypted.

Both AUG and ADD increment TDR.CHLDCNT.

Input:

  • GPA: the guest address the page will be mapped at
  • HPA: HPA of the target page to be added to the TD
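
How the KVM patches pick between ADD and AUG, roughly (a sketch of tdx_sept_set_private_spte(); large-page and error handling omitted):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	/* Pin the host page; tdx_unpin() drops this when the page is removed. */
	get_page(pfn_to_page(pfn));

	/* After TDH.MR.FINALIZE, only dynamic AUG is allowed ... */
	if (likely(is_td_finalized(kvm_tdx)))
		return tdx_mem_page_aug(kvm, gfn, level, pfn);

	/* ... before finalization, pages are ADDed (and measured). */
	return tdx_mem_page_add(kvm, gfn, level, pfn);
}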

TDH.MEM.PAGE.REMOVE

Remove a GPA-mapped 4KB, 2MB or 1GB private page from a TD.

Its alias is likewise "private to shared conversion"; as the name suggests, it merely removes the page from the TDX Module, so it no longer needs to be encrypted.

Input:

  • GPA: GPA of the page to remove from the guest

Process:

  • Walk the Secure EPT based on the GPA, and find the leaf entry of the page to be removed.
  • Set the SEPT entry state to FREE.
  • Atomically decrement TDR.CHLDCNT by 1, 512 or 512^2 (for a 4KB, 2MB or 1GB page, respectively).
  • Free the physical page: Set the PAMT entry of the removed TD private page to PT_NDA.

How does this differ from RECLAIM?

  • RECLAIM must run during TD teardown, whereas REMOVE runs while the TD is running (a bit like the relationship between ADD and AUG).
  • TDH.MEM.PAGE.REMOVE sets the page's SEPT entry to FREE, i.e. it clears the mapping.

In the end, both set the corresponding page's state in the PAMT to PT_NDA.

This SEAMCALL is called from tdx_sept_drop_private_spte().
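
A condensed sketch of that function for the running-TD case (following the KVM patches; if the TD is already in teardown, the real code takes the tdx_reclaim_page() path instead):

static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
				      enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	gpa_t gpa = gfn_to_gpa(gfn);
	hpa_t hpa = pfn_to_hpa(pfn);
	struct tdx_module_output out;

	/* While the TD is running (HKID still assigned), use REMOVE. */
	if (tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa,
				pg_level_to_tdx_sept_level(level), &out))
		return -EIO;

	/* Flush cache lines keyed with the TD's private HKID. */
	tdh_phymem_page_wbinvd(set_hkid_to_hpa(hpa, kvm_tdx->hkid));

	/* Drop the pin taken when the page was ADDed/AUGed. */
	tdx_unpin(kvm, gfn, pfn, level);
	return 0;
}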

TDH.PHYMEM.PAGE.RECLAIM / tdh_reclaim_page() / KVM

Can reclaim pages only if the owner TD is in the TD_TEARDOWN state. The difference from PAGE.REMOVE is that PAGE.REMOVE is called while the TD is running, whereas this is called during teardown. So unmapping a page does not mean running both SEAMCALLs; depending on the situation, one or the other is used.

Reclaim a physical 4KB, 2MB or 1GB TD-owned page from a TD.

Note that reclaiming only tells the TDX module that we no longer use this page; after the reclaim we still need to free the page in the kernel.

Input:

  • HPA: HPA of the page to reclaim.

Process:

  • Check that the target page's metadata in the PAMT is correct (PT must NOT be PT_NDA nor PT_RSVD). This implies you cannot REMOVE first and then RECLAIM.
  • Update the PAMT entry of the reclaimed page to PT_NDA.

Why isn't the TDX Module designed to automatically reclaim all of a TD's pages when its TDR is freed, instead of making the user call this manually during teardown? I don't know; perhaps there is a hidden reason.

Compared with PAGE.REMOVE, PAGE.RECLAIM does not touch the SEPT. Perhaps because this happens during the teardown process, touching the SEPT would be pointless; it will be cleared sooner or later anyway.

TDH.PHYMEM.PAGE.RECLAIM can reclaim pages only if the owner TD is in the TD_TEARDOWN state.

static int tdx_reclaim_page(hpa_t pa, enum pg_level level,
			    bool do_wb, u16 hkid)
{
	struct tdx_module_output out;
	u64 err;

	do {
		err = tdh_phymem_page_reclaim(pa, &out);
		/*
		 * TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in
		 * shutdown state, i.e. the TD is being destructed.
		 * TDH.PHYMEM.PAGE.RECLAIM requires the TDR and the target page.
		 * Because we're destructing the TD, contention on the TDR is rare.
		 */
	} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));

	if (err & TDX_SEAMCALL_STATUS_MASK)
		return -EIO;

	/* out.r8 == tdx sept page level */
	WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level));

	if (do_wb && level == PG_LEVEL_4K) {
		/*
		 * Only TDR page gets into this path.  No contention is expected
		 * because of the last page of TD.
		 */
		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
		if (WARN_ON_ONCE(err)) {
			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
			return -EIO;
		}
	}

	tdx_set_page_present_level(pa, level);
	tdx_clear_page(pa, KVM_HPAGE_SIZE(level));
	return 0;
}

Process

TDX module loading process

Software Use Cases

Intel TDX Module Lifecycle

Intel TDX Module Platform-Scope Initialization

Table Typical Intel TDX Module Platform-Scope Initialization Sequence

When building the kernel, set CONFIG_INTEL_TDX_HOST=y.

When booting the kernel, add the kernel cmdline parameter tdx_host=on.

After loading KVM, make sure sudo cat /sys/module/kvm_intel/parameters/tdx prints Y.

TDX module has 2 possible names:

  • libtdx.so: loaded from the initrd
  • TDX-SEAM.so: loaded from the IFWI

You can rename them to switch between the loading methods.

If "UEFI SEAM Load" is not enabled:

  • SEAMLdr and libtdx.so are both in initrd.

else:

  • If SEAMLdr and TDX-SEAM.so are both in ESP, SEAMLdr will load the TDX module from ESP.
  • If not, SEAMLdr and TDX-SEAM.so are built into the IFWI. (This method is called FV.)

With an old IFWI, the Linux kernel will still help load the TDX SEAM module from /lib/firmware/intel-seam/.

When KVM is loaded (kernel boot also loads KVM), if /lib/firmware/intel-seam/ contains a module, KVM loads it and it overrides the existing one (this has the highest priority). The corresponding file names are: libtdx.bin, libtdx.bin.sigstruct, np-seamldr.acm.

TD build / create process

Table 3.3: Typical TD Build Sequence.

To use a TD, we should first build it.

KVM can create a new guest TD by allocating and initializing a TDR control structure using the TDH.MNG.CREATE function. As an input, the host VMM assigns the TD an HKID. A TD is identified by bits 51:12 of the physical address of its TDR page.

static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
{
    // ...
    ret = tdx_guest_keyid_alloc();
    // ...
    va = __get_free_page(GFP_KERNEL_ACCOUNT);
    tdr_pa = __pa(va);
    // ...
    err = tdh_mng_create(tdr_pa, kvm_tdx->hkid); // Create the TDR and generate the TD’s random ephemeral key.
    // ...
}

KVM then programs the HKID and encryption key into the MKTME encryption engines using the TDH.MNG.KEY.CONFIG function on each package.

Build the TD Control Structure (TDCS) by adding control-structure pages using the TDH.MNG.ADDCX function, and initialize it using the TDH.MNG.INIT function.

It can then build the Secure EPT tree using the TDH.MEM.SEPT.ADD function and add the initial set of TD-private pages using the TDH.MEM.PAGE.ADD function.
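
Putting the sequence together as one schematic (illustrative only; the tdh_* wrappers match the naming used elsewhere in this document, field names such as nr_tdcs_pages are assumed, and real KVM spreads these steps across KVM_TDX_INIT_VM, KVM_TDX_INIT_MEM_REGION and KVM_TDX_FINALIZE_VM):

/* Illustrative TD-build ordering only, not the real KVM code. */
static int td_build_sketch(struct kvm_tdx *kvm_tdx, hpa_t td_params_pa)
{
	int i;

	/* 1. Create the TDR; the module generates the ephemeral private key. */
	tdh_mng_create(kvm_tdx->tdr_pa, kvm_tdx->hkid);

	/* 2. Program HKID + key into MKTME on each package, e.g. via
	 *    smp_call_on_cpu() running tdh_mng_key_config(tdr_pa). */

	/* 3. Add the TDCS pages, then initialize the TD. */
	for (i = 0; i < kvm_tdx->nr_tdcs_pages; i++)
		tdh_mng_addcx(kvm_tdx->tdr_pa, kvm_tdx->tdcs_pa[i]);
	tdh_mng_init(kvm_tdx->tdr_pa, td_params_pa);

	/* 4. Build the SEPT (tdh_mem_sept_add) and add + measure the initial
	 *    private pages (tdh_mem_page_add, tdh_mr_extend). */

	/* 5. Finalize the measurement; afterwards only AUG can add pages. */
	tdh_mr_finalize(kvm_tdx->tdr_pa);
	return 0;
}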

//...
kvm_init
    kvm_arch_init
        kvm_confidential_guest_init
            tdx_kvm_init
                get_tdx_capabilities
                    r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); // cmd is KVM_TDX_CAPABILITIES
                    r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); // only if the above line returned -EINVAL

// KVM, the system-wide path
kvm_dev_ioctl // the system-wide ioctl
    kvm_arch_dev_ioctl
        // tdx_dev_ioctl(), pass the TDX system-wide information to user because currently only KVM_TDX_CAPABILITIES is supported
        r = static_call(kvm_x86_dev_mem_enc_ioctl)(argp);

// KVM, the VM-wide path
vt_mem_enc_ioctl
    tdx_vm_ioctl
        tdx_td_init
            __tdx_td_init

TD memory allocation

  1. The guest uses TDG.VP.VMCALL to request a GPA range allocation.
  2. KVM builds the SEPT with TDH.MEM.SEPT.ADD.
  3. KVM adds the pages with TDH.MEM.PAGE.AUG.
  4. KVM re-enters the guest with TDH.VP.ENTER.
  5. The guest accepts the pages with TDG.MEM.PAGE.ACCEPT.

TD memory removal

A bit more complicated than allocation.

Interrupt Handling in TDX

TDX Base Spec: Interrupt Handling and APIC Virtualization

TDX supports only posted interrupt. No LAPIC emulation.

Guest TDs must use virtualized x2APIC mode. xAPIC mode (using memory mapped APIC access) is not allowed. The guest TD cannot disable the APIC.

Guest TDs are allowed access to a subset of the virtual APIC registers, which are virtualized by the CPU. Access to other registers can cause a #VE. The guest TD is expected to use a software protocol over TDG.VP.VMCALL (GHCI) to request such operations from the VMM. For which accesses raise a #VE, see TDX Module Base: Figure 11.3: Virtual APIC Access by Guest TD.

Non-NMI interrupt injection into the guest TD by the host VMM or the IOMMU can be done through the posted-interrupt mechanism. If there are pending interrupts in the PID, the VMM can post a self IPI with the notify vector prior to TD entry.

The PID resides in a shared page. If needed, the guest TD may use a software protocol over TDCALL(TDG.VP.VMCALL) to ask the VMM to stop interrupt delivery through the PID.

The TD VMCS posted interrupt execution controls are reset to their initial values when the TD is migrated. The host VMM on the destination platform must set them in order to use posted interrupts.

tdx_mig_import_state_vp
    tdx_td_vcpu_post_init
        // Write to TD VMCS's posted-interrupt notification vector
        td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
        // Write to TD VMCS's posted-interrupt descriptor address
        td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
        // Enable processing posted-interrupt
        // If this control is 1, the processor treats interrupts with the posted-interrupt notification vector
        // specially, updating the virtual-APIC page with posted-interrupt requests.
    	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);

Similar to VMX's vmx->pi_desc, TDX defines its own PI member tdx->pi_desc. Moreover, pi_desc sits at the same offset in vcpu_vmx and vcpu_tdx:

static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_vmx, pi_desc));
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_vmx, pi_wakeup_list));
#ifdef CONFIG_INTEL_TDX_HOST
static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_tdx, pi_desc));
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_tdx, pi_wakeup_list));
#endif

tdx_deliver_interrupt() KVM

vt_deliver_interrupt
    tdx_deliver_interrupt
    	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
            kvm_vcpu_trigger_posted_interrupt
                // send the posted-interrupt notification vector IPI to self
    			__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);

__vmx_deliver_posted_interrupt() KVM

// 'vector' here is not the PI notification vector; it is the interrupt vector we want to inject.
static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, struct pi_desc *pi_desc, int vector)
{
    // Set the vector's bit in the PIR.
	pi_test_and_set_pir(vector, pi_desc);

	// If a previous notification has sent the IPI, nothing to do.
    // Set PID.ON.
	pi_test_and_set_on(pi_desc);

    //...
    // Send the notification IPI to self.
	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
}

kvm_vcpu_trigger_posted_interrupt() KVM

As the posted-interrupt spec describes, triggering posted-interrupt processing requires sending an IPI whose vector is the PI notification vector to the target CPU; this function does exactly that.

static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, int pi_vec)
{
    //...
	if (vcpu->mode == IN_GUEST_MODE) {
		/*
		 * The vector of the virtual interrupt has already been set in the PIR.
		 * Send a notification event to deliver the virtual interrupt
		 * unless the vCPU is the currently running vCPU, i.e. the
		 * event is being sent from a fastpath VM-Exit handler, in
		 * which case the PIR will be synced to the vIRR before
		 * re-entering the guest.
		 *
		 * When the target is not the running vCPU, the following
		 * possibilities emerge:
		 *
		 * Case 1: vCPU stays in non-root mode. Sending a notification
		 * event posts the interrupt to the vCPU.
		 *
		 * Case 2: vCPU exits to root mode and is still runnable. The
		 * PIR will be synced to the vIRR before re-entering the guest.
		 * Sending a notification event is ok as the host IRQ handler
		 * will ignore the spurious event.
		 *
		 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
		 * has already synced PIR to vIRR and never blocks the vCPU if
		 * the vIRR is not empty. Therefore, a blocked vCPU here does
		 * not wait for any requested interrupts in PIR, and sending a
		 * notification event also results in a benign, spurious event.
		 */

		if (vcpu != kvm_get_running_vcpu())
			__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
		return;
	}
	/*
	 * Wake the vCPU in case it is blocking, otherwise do nothing as KVM will grab the highest priority pending
	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
	 */
	kvm_vcpu_wake_up(vcpu);
}

tdx_protected_apic_has_interrupt() KVM

TDX uses x2APIC by default. This function's main job is to check whether there is currently a pending interrupt.

kvm_arch_vcpu_runnable
    kvm_vcpu_has_events
        if (kvm_cpu_has_interrupt())
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
    // Is there an unhandled interrupt in the posted-interrupt PIR?
	bool ret = pi_has_pending_interrupt(vcpu);
	union tdx_vcpu_state_details details;
	struct vcpu_tdx *tdx = to_tdx(vcpu);

    // If there is a pending PI interrupt, or the vCPU is not in the HALTED
    // state (the other states, e.g. INIT and SIPI, are interrupt states),
    // return true: there is an interrupt.
	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
		return true;

	if (tdx->interrupt_disabled_hlt)
		return false;

	/*
	 * This is for the case where the virtual interrupt is recognized,
	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
	 * access to RVI and the interrupt is no longer in the PID (because it
	 * was "recognized".  It doesn't get delivered in the guest because the
	 * TDCALL completes before interrupts are enabled.
	 *
	 * The TDX module sets RVI while in an STI interrupt shadow.
	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
	 *   The interrupt shadow at this point is gone.
	 * - It knows that there is an interrupt that can be delivered
	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
	 *    matter)
	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
	 *   has no way to glean either RVI or PPR.
	 */
	if (xchg(&tdx->buggy_hlt_workaround, 0))
		return true;

	/*
	 * This is needed for device assignment. Interrupts can arrive from
	 * the assigned devices.  Because tdx.buggy_hlt_workaround can't be set
	 * by VMM, use TDX SEAMCALL to query pending interrupts.
	 */
	details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
	return !!details.vmxip;
}

Memory

Before using a range of private memory, the guest must first TDG.MEM.PAGE.ACCEPT it.

See Documentation/virt/kvm/tdx-tdp-mmu.rst.

Since the execution of such interface functions takes much longer than accessing memory directly, in KVM we use the existing TDP code to mirror the Secure EPT for the TD. There are at least two options today for when to execute these SEAMCALLs:

  1. synchronous, i.e. while walking the TDP page tables, or
  2. post-walk, i.e. record what needs to be done to the real Secure EPT during the walk, and execute SEAMCALLs later.

tdx_unpin() KVM

static void tdx_unpin(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, enum pg_level level)
{
    //...
	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
		put_page(pfn_to_page(pfn + i));
}

Memory mapping in TDX

To map a range of memory through the TDX Module, the host kernel first needs to reserve it, so that the host knows each page's HPA and will not hand the range out to another process or TD. KVM's AUG takes a (GPA, HPA) pair to establish the mapping. Such a page therefore has two mappings:

  • the host page table's HVA-to-HPA mapping;
  • the SEPT's GPA-to-HPA mapping.

So during teardown, to free a page that is a control page such as the TDR or a TDVPR, we need:

  • tdx_reclaim_page(td_page_pa)
  • free_page((unsigned long)__va(td_page_pa));

For a page used inside the TD:

  • tdx_reclaim_page(td_page_pa)
  • tdx_unpin()

to remove it from both mappings; only then is the page truly reclaimed. tdx_reclaim_td_page() does exactly that:

void tdx_reclaim_td_page(unsigned long td_page_pa)
{
    //..
	tdx_reclaim_page(td_page_pa, PG_LEVEL_4K, false, 0)
	free_page((unsigned long)__va(td_page_pa));
    //..
}

If we only want to unmap the range from the TDX Module while the host keeps it reserved (i.e., the host does not want to free it too):


Page promotion (Why?)

Page size promotion is intended to be used by the host VMM to merge 512 pages mapped as 4KB or 2MB into a single page mapped as 2MB or 1GB.

Page demotion (Why?)

Page size demotion is intended to be used by the host VMM to split a page mapped as 1GB or 2MB into 512 pages mapped as 2MB or 4KB, respectively.

TDX MMU TDCALLs

TDG.MEM.PAGE.ACCEPT

TDG.MEM.PAGE.ACCEPT accepts a PENDING private page, previously added by TDH.MEM.PAGE.AUG, into the TD.

The guest can directly TDG.MEM.PAGE.ACCEPT a range of memory; if that range has not yet been AUGed/ADDed by the VMM, this triggers an EPT violation and exits to the VMM, which can use it to record the private/shared bitmap. This is the approach TDVF takes.

Although MapGPA can convert private-to-shared as well as shared-to-private, ACCEPT can only accept private pages.

The guest TD can accept a dynamically added 4KB or 2MB page using TDG.MEM.PAGE.ACCEPT (so only pages added by TDH.MEM.PAGE.AUG?).

The guest TD must accept the page using TDG.MEM.PAGE.ACCEPT before it can access it. A guest TD attempt to access a page that has been dynamically added by TDH.MEM.PAGE.AUG but has not yet been accepted by TDH.MEM.PAGE.ACCEPT results in a #VE exception.

Let's see how this TDCALL is actually used in the guest kernel.

It is defined in the guest kernel file arch/x86/include/asm/shared/tdx.h:

#define TDX_ACCEPT_PAGE			6

set_memory_encrypted
    __set_memory_enc_dec
        __set_memory_enc_pgtable // used for the hypervisors that get informed about "encryption" status via page tables.
            x86_platform.guest.enc_status_change_finish
                tdx_enc_status_changed
                    tdx_enc_status_changed_phys
                        try_accept_one
                            __tdx_module_call(TDX_ACCEPT_PAGE, ...)

tdx_accept_memory
    tdx_enc_status_changed_phys
        try_accept_one
            __tdx_module_call(TDX_ACCEPT_PAGE, ...)
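
A minimal sketch of the accept step at the bottom of both chains, assuming the __tdx_module_call() convention shown above (the real try_accept_one() also tries 2MB and, where supported, 1GB accepts):

/* Accept a single PENDING 4KB page at 'gpa'.  RCX carries the GPA with the
 * low bits encoding the accept level (0 = 4KB). */
static bool try_accept_one_4k(phys_addr_t gpa)
{
	return __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL) == 0;
}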

TDX MMU SEAMCALLs

TDH.MEM.SEPT.ADD

TDH.MEM.SEPT.REMOVE

TDH.MEM.SEPT.RD

Isn't the SEPT private? Why can it be read? (The TDX module reads the entry on KVM's behalf and returns its information; KVM never touches the SEPT page directly.)

TDH.MEM.PAGE.ADD

Builds the SEPT entry for a private page at build time.

TDH.MEM.PAGE.AUG

TDH.MEM.PAGE.ADD adds pages at build time, whereas TDH.MEM.PAGE.AUG adds them dynamically.

TDH.MEM.PAGE.REMOVE

TDH.MEM.TRACK / TLB tracking in TDX

The goal of TLB tracking is to be able to prove (when needed) that no logical processor holds any cached Secure EPT address translations to a given TD private GPA range. (Do not confuse this with cache invalidation; what is guaranteed here is that no translations remain.)

TLB tracking is required when:

  • removing a mapped TD private page(TDH.MEM.PAGE.REMOVE) or
  • changing the page mapping size (TDH.MEM.PAGE.PROMOTE)

The sequence typically includes five steps:

  1. Execute TDH.MEM.RANGE.BLOCK on each GPA range, blocking creation of TLB translation to that range. Note that cached translations may still exist at this stage.
  2. Execute TDH.MEM.TRACK, advancing the TD’s epoch counter.
  3. Send an IPI to each RLP on which any of the TD’s VCPUs is currently scheduled.
  4. Upon receiving the IPI, each RLP will TD exit to the VMM. At this point the target GPA ranges are considered tracked. Even though some LPs may still hold TLB entries to the target GPA ranges, the following TD entry is designed to flush them.
  5. Normally, the host VMM on each RLP will treat the TD exit as spurious and will immediately re-enter the TD.
See TDX Base, 9. TD Private Memory Management, 9.7. Introduction to TLB Tracking.
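
Schematically, from the host VMM's point of view (a sketch only; tdh_mem_range_block()/tdh_mem_track() follow the wrapper naming used elsewhere here, and the kick uses KVM's generic request mechanism):

/* Sketch of the five-step tracking sequence from the host VMM's view. */
static void tlb_track_sketch(struct kvm *kvm, u64 tdr_pa, gpa_t gpa, int level)
{
	struct tdx_module_output out;

	/* 1. Block creation of new TLB translations for the GPA range. */
	tdh_mem_range_block(tdr_pa, gpa, level, &out);

	/* 2. Advance the TD's epoch counter (TDCS.TD_EPOCH). */
	tdh_mem_track(tdr_pa);

	/* 3+4. IPI every pCPU running one of this TD's vCPUs; each TD-exits,
	 * and the next TD entry flushes its stale entries. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

	/* 5. The VMM treats those exits as spurious and re-enters the TD. */
}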

TDCS.TD_EPOCH, PAMT.BEPOCH, TDCS.BW_EPOCH, TDVPS.VCPU_EPOCH

The TD's TLB epoch counter is TDCS.TD_EPOCH.

TD_EPOCH is advanced by TDH.MEM.TRACK and TDH.EXPORT.PAUSE.

BEPOCH, BW_EPOCH and VCPU_EPOCH are all samples of TD_EPOCH taken at specific moments:

  • TDH.MEM.RANGE.BLOCKW samples TD_EPOCH into TDCS.BW_EPOCH;
  • TDH.MEM.RANGE.BLOCK samples TD_EPOCH into PAMT.BEPOCH;
  • TDH.VP.ENTER samples TD_EPOCH into TDVPS.VCPU_EPOCH.

For every page to be exported, TDH.EXPORT.MEM checks TDCS.BW_EPOCH (the is_tlb_tracked function inside the TDX Module); it must ensure that BW_EPOCH at that point is less than or equal to TDCS.TD_EPOCH.

In short, these fields are mainly used for checks; no need to dig deeper.

TDX shared bit of GPA

TDX repurposes one GPA bit (bit 51 or bit 47, based on configuration) to indicate whether the GPA is private (if cleared) or shared (if set) with the VMM. If GPA.shared is set, the GPA is covered by the existing conventional EPT pointed to by the EPTP. If GPA.shared is cleared, the GPA is covered by the TDX module, and the VMM has to issue SEAMCALLs to operate on it.

Add a member to remember the GPA shared bit for each guest TD, add address-conversion functions between private GPA and shared GPA, and a test for whether a GPA is private.

Because struct kvm_arch (or struct kvm, which includes struct kvm_arch; see kvm_arch_alloc_vm(), which passes __GFP_ZERO) is zero-cleared when allocated, the new member remembering the GPA shared bit is guaranteed to be zero unless it is explicitly initialized.

fault.is_private means the host page should be obtained from guest_memfd; is_private_gpa() means the KVM MMU should invoke the private MMU hooks.
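
The conversion helpers look roughly like this in the TDX KVM patches (member and helper names may differ across patch revisions):

static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
{
	/* Bit 51 or bit 47 of the GPA, expressed as a GFN mask; 0 for non-TDs. */
	return kvm->arch.gfn_shared_mask;
}

static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
{
	return gfn | kvm_gfn_shared_mask(kvm);
}

static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
{
	return gfn & ~kvm_gfn_shared_mask(kvm);
}

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
	gfn_t mask = kvm_gfn_shared_mask(kvm);

	/* Private iff the shared bit exists and is clear in this GPA. */
	return mask && !(gpa_to_gfn(gpa) & mask);
}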

Shared EPT

Why the shared EPT can reside in KVM's memory

Q: A shared page is also encrypted using the MKTME machinery, which means its content cannot be accessed by KVM, so why can the shared EPT reside in KVM's memory?

A: The EPT only handles page address translation; MKTME only comes into play when the actual page is accessed.

Secure EPT

The Secure EPT pages are encrypted and integrity-protected with the TD’s ephemeral private key. The Secure EPT is not intended to be directly accessible by any software other than the Intel TDX module.

Secure EPT entry is opaque; KVM may not access it directly. KVM may read a Secure EPT entry information using the TDH.MEM.SEPT.RD interface function.

From the CPU perspective, Secure EPT has the same structure as a legacy VMX EPT.

Can I say that Secure EPT pages and private pages are both encrypted using the same private HKID?

Are the control structures encrypted with another private key?

The control structures are encrypted and integrity-protected with a private key, and managed by Intel TDX functions.

The control structures are encrypted with private keys and HKIDs.

Why use a Secure EPT? How does it benefit over the shared EPT?

"stolen" Bit from HPA and GPA

The "stolen" bits in HPA denotes the HKID set for this physical page (1 bit denotes shared or private).

The "stolen" bit in GPA denotes the Shared/Private bit of this page.

Measurement and Attestation

Attestation: proving to a challenger that you (the TD) are trustworthy.

TDX uses SGX-Based Attestation, which means SGX should be supported.

Software within the guest TD can use TDG.MR.REPORT, specifying a REPORTDATA value, to generate an integrity-protected TDREPORT_STRUCT, which includes:

  • the TD’s measurements,
  • the Intel TDX module’s measurements,
  • REPORTDATA. This will typically be an asymmetric key that the attestation verifier can use to establish a secure channel or protect sensitive data to be sent to the TD software.

TDREPORT_STRUCT can ONLY be verified on the local platform via the SGX ENCLU(EVERIFYREPORT2) instruction.

By design, TDREPORT_STRUCT CANNOT be verified off platform; it first must be converted into signed Quotes.

What is TDINFO_STRUCT?

TDINFO_STRUCT is part of TDREPORT_STRUCT, you can see the figure in Base: Figure 12.1: UPDATED: TD Measurement Reporting.

For more, see ABI: 3.9.5. UPDATED: TDINFO_STRUCT

Who does the attestation?

Attestation is driven by software in TD.

TD attestation is initiated from inside the TD by calling TDG.MR.REPORT and specifying a REPORTDATA value.

What is mutual TD attestation?

The migration TDs use a TD-quote-based mutual authentication protocol to create a session between them.

MRTD: Build-Time Measurement Register

Helps provide static measurement of the TD build process and the initial contents of the TD.

The process is:

  1. TDH.MNG.INIT begins the process by initializing the digest.
  2. TDH.MEM.PAGE.ADD adds a private page to the TD and inserts GPA into the MRTD digest calculation.
  3. Control structure pages (TDR, TDCX and TDVPR) and Secure EPT pages are NOT measured.
  4. For pages whose data contribute to the TD, that data should be included in the TD measurement via TDH.MR.EXTEND. TDH.MR.EXTEND inserts the data contained in those pages and its GPA into the digest calculation. If a page will be wiped and initialized by TD code, the loader may opt not to measure the initial contents.
  5. The measurement is then completed by TDH.MR.FINALIZE. Once completed, further TDH.MEM.PAGE.ADD or TDH.MR.EXTEND calls will fail.

From 2 and 3 we can see that attestation only cares about the data rather than the meta information.

This state is migrated as part of the global immutable state of the TD.

When will we call TDH.MR.EXTEND and when will we call TDH.MEM.PAGE.ADD?

RTMR: Run-Time Measurement Registers

An array of general-purpose measurement registers made available to the TD software to enable measuring additional logic and data loaded into the TD at run-time.

The RTMR array is initialized to zero on build, and it can be extended at run-time by the guest TD using the TDCALL(TDG.MR.RTMR.EXTEND) leaf. (Note: TDH.MR.EXTEND is to extend MRTD).

Migrated as TD’s mutable state.

What is measurement quoting?

To create a remotely verifiable attestation, the TDREPORT_STRUCT should be converted into a Quote signed by a certified Quote signing key.

TDMR

What is TDMR?

TDX Base: 8. Physical Memory Management, 8.2. TDMR Details

A range of memory, configured by the host VMM, that is covered by PAMT and is intended to hold TD private memory and TD control structures.

  • TDMR configuration is "soft" – no hardware range registers are used.
  • Each TDMR defines a single physical address range.
  • TDMRs cannot overlap with each other.
  • TDMRs are configured at platform scope (no separate configuration per package).

TDR and TDVPR are in TDMR, because they are control structures.

TDMRs may contain reserved areas.

Once the PAMT structure of each 1GB block of a TDMR has been initialized by TDH.SYS.TDMR.INIT, the block can be used to hold TD private pages.

Why TDMR needs to be multiple of 1GB?

Do TD guest pages reside in TDMRs?

Yes.

Once each 1GB block of TDMR has been initialized by TDH.SYS.TDMR.INIT, it can be used to hold TD private pages.

What is TDMR reserved areas?

13.1.4.2.1. Background: Reserved Areas within TDMRs

Reserved areas are still covered by PAMT. Pages in reserved areas are not used by the Intel TDX module for allocating privately encrypted memory pages.

The physical page is reserved for non-TDX usage. The Intel TDX module will not allow converting this page to any other page type. The page can be used by the host VMM for any purpose.

PAMT (Physical Address Metadata Table)

8.3. PAMT Details

The PAMT is designed to hold metadata of each page (includes page type, page size, assignment to a TD, and other attributes.) in a TDMR. It controls assignment of physical pages to guest TDs, etc. The PAMT is intended not to be directly accessible to software. It resides in memory allocated by the host VMM on TDX initialization.

Each TDMR is defined as controlled by a (logically) single PAMT.

Encrypted by TDX global private key.

PAMT Entry: A PAMT entry is designed to hold metadata for a single physical page. The page size may be 4KB, 2MB or 1GB.

PAMT Array: Physically, for each TDMR the design includes three arrays of PAMT entries, one for each PAMT level.

PAMT Block: For each 1GB of TDMR physical memory, there is a corresponding PAMT Block. A PAMT Block is logically arranged in a three-level tree structure of PAMT Entries like page table.

So, a Block includes Arrays, which include Entries.

PAMT Page type / page state

The PAMT page type / page state is not the same as the SPTE state; do not confuse the two.

Specifies the corresponding TD private page type (Assigned? Reserved? holding TDR? holding TDCX?, etc…).

ABI: 4. Data Types, 4.7. Physical Memory Management Types, 4.7.4. PAMT Page Type (PT) Values, Table 4.25: PAMT Page Type Values.

  • PT_NDA: indicates that the page has not yet been added to the TDX Module.

HKID

To avoid conflicts with the keyIDs MKTME itself uses, TDX HKIDs are allocated starting after MKTME's last keyID.

IOCTLs

TDX adds a new parameter to the existing ioctl KVM_CHECK_EXTENSION (based on their initialization, different VMs may have different capabilities; it is thus encouraged to use the VM ioctl to query for capabilities):

  • For system-wide: KVM_CAP_VM_TYPES
  • For VM-wide: KVM_CAP_VM_TYPES

They return the same thing.

TDX reuses one ioctl, KVM_MEMORY_ENCRYPT_OP (introduced by AMD when upstreaming SEV), which suits different scopes:

  • system-wide, use wrapper function tdx_platform_ioctl()
  • VM-wide, use wrapper function tdx_vm_ioctl()
  • vcpu-wide, use wrapper function tdx_vcpu_ioctl()

Parameters

struct kvm_tdx_cmd {
	/* enum kvm_tdx_cmd_id, see following */
	__u32 id;
	/* flags for sub-command. If sub-command doesn't use this, set zero. */
	__u32 flags;
	__u64 data;
    // ...
};

// command id
enum kvm_tdx_cmd_id {
	KVM_TDX_CAPABILITIES = 0, // only system-wide is supported
	KVM_TDX_INIT_VM, // only vm wide, invoked in a lazy style (when initialize VCPU)
	KVM_TDX_INIT_VCPU, // only vcpu wide
	KVM_TDX_INIT_MEM_REGION, // only vm wide
	KVM_TDX_FINALIZE_VM, // only vm wide

	KVM_TDX_CMD_NR_MAX,
};

// sub-command
#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
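
A minimal userspace sketch of issuing one of these sub-commands (assuming the TDX uapi definitions above are available in the headers; fd setup and error checking elided):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Query TDX capabilities via KVM_MEMORY_ENCRYPT_OP.
 * 'caps' points to a struct kvm_tdx_capabilities buffer. */
static int tdx_get_capabilities(int vm_fd, void *caps)
{
	struct kvm_tdx_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = KVM_TDX_CAPABILITIES;
	cmd.data = (uint64_t)(unsigned long)caps;

	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}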

Configs

CONFIG_KVM_PROTECTED_VM

Enable support for KVM-protected VMs. Currently 'protected' means the VM can be backed with restricted/private memory.

CONFIG_KVM_PRIVATE_MEM

CONFIG_RESTRICTEDMEM

How does the TDX track dirty pages?

KVM unblocks a page lazily after it has been blocked: when an EPT violation occurs, KVM issues TDH.EXPORT.UNBLOCKW, and the TDX module marks the page as dirty if it has already been exported.

// The place that enables page dirty logging
case KVM_SET_USER_MEMORY_REGION2:
case KVM_SET_USER_MEMORY_REGION: {
    kvm_vm_ioctl_set_memory_region
        kvm_set_memory_region
            __kvm_set_memory_region
                kvm_set_memslot
                    kvm_commit_memory_region
                        kvm_arch_commit_memory_region
                            kvm_mmu_slot_apply_flags
                                kvm_mmu_slot_remove_write_access
                                    kvm_tdp_mmu_wrprot_slot
                                        wrprot_gfn_range
                                            new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
                                                tdp_mmu_set_spte_atomic
                                                    set_private_spte_present
                                                        __set_private_spte_present
                                                            private_spte_change_flags
                                                            	if (was_writable && !is_writable) {
                                                                    tdx_write_block_private_pages
                                                                        tdh_export_blockw


kvm_arch_mmu_enable_log_dirty_pt_masked
kvm_mmu_write_protect_pt_masked / kvm_mmu_clear_dirty_pt_masked
kvm_tdp_mmu_clear_dirty_pt_masked
clear_dirty_pt_masked
tdx_write_block_private_pages
tdh_export_blockw



// The place that marks the page as dirty
// A page fault means the guest OS wants to modify this page.
fast_pf_fix_direct_spte
    kvm_write_unblock_private_page
        tdx_write_unblock_private_page
        	err = tdh_export_unblockw(kvm_tdx->tdr_pa, ept_info.val, &out);
    mark_page_dirty_in_slot(vcpu->kvm, fault->slot, gfn);

So, not all blocked pages by TDH.EXPORT.BLOCKW need to be unblocked.

TDCS.DIRTY_COUNT is TD-scope dirty page counter.

  • It is cleared when a new migration session begins.
  • It is incremented when a page that has previously been exported is unblocked.
  • It is decremented when a dirty page is exported by TDH.EXPORT.MEM.

For TDH.EXPORT.TRACK to successfully generate a start token, DIRTY_COUNT must be 0, indicating that all pages exported so far have had their latest content exported.

CPUID Virtualization in TDX

According to the "CPUID Virtualization" chapter in the TDX module spec, the CPUID bits of a TD can be classified into 6 types:

------------------------------------------------------------------------
1 | As configured   | configurable by VMM, independent of native value
------------------------------------------------------------------------
2 | As configured   | configurable by VMM if the bit is supported
  |   (if native)   | natively; otherwise it equals native (0)
------------------------------------------------------------------------
3 | Fixed           | fixed to 0/1
------------------------------------------------------------------------
4 | Native          | reflects the native value
------------------------------------------------------------------------
5 | Calculated      | calculated by the TDX module
------------------------------------------------------------------------
6 | Inducing #VE    | gets a #VE exception
------------------------------------------------------------------------

As Configured: the VMM configures the CPUID value via TDH.MNG.INIT, and the configured value is what the guest TD eventually sees. The TDX Module does not interfere with the configured value.

As Configured (If Native): the VMM configures the value via TDH.MNG.INIT. The bit is set to 1 and exposed to the guest only if the VMM sets it and the native CPUID bit is also 1.

Fixed: self-explanatory (fixed to this value regardless of the native CPU's CPUID).

Native: same as the native value.

Calculated: computed by the TDX Module (from what inputs?).

Inducing #VE: when the TD guest queries this CPUID, a #VE is injected into the guest.

  1. All the configurable XFAM related features and TD attributes related features fall into type #2. And fixed0/1 bits of XFAM and TD attributes fall into type #3.
  2. For CPUID leaves not listed in the "CPUID Virtualization Overview" table in the TDX module spec, the TDX module injects a #VE into the TD when they are queried. In this case, the TD can request CPUID emulation from the VMM via TDVMCALL, and the values are fully controlled by the VMM (see the sketch below).
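
On the guest side, the #VE handler forwards such CPUID leaves to the VMM over TDG.VP.VMCALL; a sketch modeled on handle_cpuid() in arch/x86/coco/tdx/tdx.c (struct and helper names vary across kernel versions):

static int emulate_cpuid_via_vmm(struct pt_regs *regs)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = hcall_func(EXIT_REASON_CPUID),	/* Instruction.CPUID */
		.r12 = regs->ax,			/* leaf */
		.r13 = regs->cx,			/* sub-leaf */
	};

	/* TDG.VP.VMCALL out to the VMM; the VMM fully controls the result. */
	if (__tdx_hypercall_ret(&args))
		return -EIO;

	regs->ax = args.r12;
	regs->bx = args.r13;
	regs->cx = args.r14;
	regs->dx = args.r15;
	return 0;
}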

TDX enables and disables

TDX disables APICv

For the lapic, it's a safeguard, because TDX KVM disables APICv with APICV_INHIBIT_REASON_TDX.

APICv is disabled because TDX doesn't support it