TDX Basic
TDX Misc
There are two SEPTs: the SEPT KVM maintains to track state, and the SEPT inside the TDX module, which is kept secret externally and can only be set up and torn down through interface functions (the TD-related callbacks in x86_ops).
pc.bios is not private memory, yet TDX requires OVMF to act as private memory; these two statements seem to contradict each other.
TDX does not support swapping memory out yet.
Currently TDX only supports huge pages up to 2M.
A correctly written TD should either not use UC locks or split locks, or be ready to handle any #AC and #GP(0) faults raised when such locks are used. When a TD guest runs a userspace program that can trigger a split lock, the split lock does not cause a VM exit. But if the host kernel runs with sld=off, it does not configure the MSR, so the hardware never generates #AC. To spare the TDX guest from handling #AC, check whether it is a TDX guest and, if so, ignore it; this way the TD guest produces no warning regardless of the host sld setting.
A normal, non-debug TDX module does not set up its own IDT, so any exception that occurs inside the TDX module is fatal. When the TDX module and the kernel context-switch, the corresponding handlers switch too; the handler is determined by what is installed in the IDT, which differs between the TDX module and the kernel. The debug build of the TDX module adds some logging in its exception handlers.
There are two ways for the guest to ACCEPT private memory and propagate private/shared bitmap information to the VMM:
- Directly TDG.MEM.PAGE.ACCEPT a range of memory. If the range has not yet been AUG/ADDed by the VMM, this triggers an EPT violation and exits to the VMM, which can use it to record the private/shared bitmap. This is the approach TDVF takes.
- Call the TDVMCALL MapGPA to ask the VMM. KVM exits to QEMU to notify it; QEMU sets cgs_bmap (see ram_block_convert_range()) and issues SET_ATTRIBUTE so the VMM can AUG the pages and return success, after which the guest can ACCEPT. This is the approach the guest kernel takes. (Note: if the MapGPA is private->shared, no subsequent ACCEPT is needed.) See tdx_enc_status_changed() in the guest kernel.
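The guest-side flow can be summarized in a short sketch. This is illustrative only; the helper name convert_range_sketch and the error handling are mine, while _tdx_hypercall() and tdx_accept_memory() are the guest kernel helpers referenced elsewhere in these notes:

/* Hedged sketch of the guest-kernel conversion flow described above. */
static int convert_range_sketch(phys_addr_t start, phys_addr_t end, bool to_private)
{
	/* TDVMCALL<MapGPA>: the shared bit encoded in the GPA tells the
	 * VMM which direction we convert; KVM/QEMU update cgs_bmap and
	 * the memory attributes, then AUG the pages if needed. */
	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
		return -EIO;

	/* private -> shared needs no ACCEPT; shared -> private does. */
	if (to_private)
		tdx_accept_memory(start, end);	/* TDG.MEM.PAGE.ACCEPT loop */

	return 0;
}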
The xarray in the kernel records whether, from the kernel's point of view, a page should be private or shared; cgs_bmap in QEMU plays the same role.
In TDX, the TSC multiplier can't be changed.
TDX is supported using the legacy MMU, not only EPT; see commit dbbdd4e8ab6f370fe8c232006a1433c1d947c718.
TDX is NOT enabled for all SPR CPUs. Some steppings are not supported, such as E3.
TDX doesn't support APICV.
TDX is PV to some degree (software within the guest TD can use the TDCALL(TDG.MR.REPORT) function to request the Intel TDX module to generate an integrity-protected TDREPORT structure), so a TD OS is considered enlightened if it is aware that it is running as a TD.
TDX module code is open source. Intel® Trust Domain Extensions
The TDX module can be loaded in 2 ways: from the BIOS, or by the OS via GETSEC.
Secure EPT is intended to be managed indirectly by the host VMM using Intel TDX functions.
Because TDVPS includes the VMCS, and the TDX module also uses VMLAUNCH/VMRESUME to start a VM:
- on each TD exit, it only needs to save non-VMCS CPU state into TDVPS, and
- on each TD entry, it only needs to restore non-VMCS CPU state from TDVPS.
To verify that a guest is a TD, just run lscpu | grep tdx_guest.
TDX doesn't trust the BIOS.
The TDX module is expected to be loaded by the BIOS when it enables TDX. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX.
TDX requires x2APIC; the TDX guest only supports x2APIC.
TEE-IO Provisioning Agent: TPA.
cgs: Confidential Guest Support.
MapGPA call points in TD guest kernel / TDVMCALL_MAP_GPA
The following two places call TDVMCALL_MAP_GPA.
platform_device_add / pci_device_add / acpi_device_add
arch_dev_authorized
authorized_node_match
tdx_guest_dev_attest
tdxio_devif_accept
tdxio_devif_accept_mmios
tdx_map_private_mmio
__tdx_map_gpa
_tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
set_memory_encrypted / set_memory_decrypted / set_memory_decrypted_noflush
__set_memory_enc_dec
__set_memory_enc_pgtable
x86_platform.guest.enc_status_change_prepare / x86_platform.guest.enc_status_change_finish
tdx_enc_status_change_prepare / tdx_enc_status_change_finish
tdx_enc_status_changed
tdx_map_gpa
.r11 = TDVMCALL_MAP_GPA,
Page's refcount during TDX lifetime
TDX:
When a page fault happens, kvm_gmem_get_pfn() called from the page fault handler initializes the refcount to 3 (alloc 1, page cache 1, LRU 1); the LRU reference is dropped asynchronously a little later, so we can treat the starting value as 2.
tdx_unpin() drops one reference, which leads to the page being freed.
TDX Live Migration:
Page fault in TDX
Guest memory accesses are not all alike: there are private accesses and shared accesses, based on the value of a new SHARED bit in the Guest Physical Address (GPA). The GPA space is divided into private and shared sub-spaces, determined by that SHARED bit.
A page's mapping falls into one of the following cases:

| | Shared EPT | SEPT |
|---|---|---|
| Not mapped | | |
| Private | | Yes |
| Shared | Yes | |
Three distinct concepts need clarifying:
- whether an access is private or shared: look at gfn_shared_mask;
- whether the page is actually private or shared: look at whether the SEPT mapping has been established;
- whether the page is expected to be private or shared: look at mem_attr_array.
So a page fault can happen for these reasons:
- The TD guest's access doesn't match the page's actual state, e.g. the TD guest makes a shared access but the page is actually private.
- The page's mapping simply hasn't been established yet.
Handling these two cases is a bit roundabout, but both are based on comparing how the TD guest accessed the page (i.e. whether the page fault is private) with what's stored in the xarray (whether the page should be private or shared):
- If they match, we have reason to believe the fault is just a missing mapping. There is no need to tell userspace; we grab the page and establish the mapping ourselves.
- If they don't match, the guest accessing the page this way means it wants to convert the page, so we exit to userspace to handle it.
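A minimal sketch of this dispatch decision; direct_map() and exit_to_userspace() are placeholder names, while kvm_is_private_gpa() and kvm_mem_is_private() appear in the code quoted later in these notes:

/* Sketch: compare how the guest faulted with the recorded attribute. */
bool fault_is_private = kvm_is_private_gpa(kvm, fault_gpa);   /* access type */
bool attr_is_private  = kvm_mem_is_private(kvm, fault_gfn);   /* xarray view */

if (fault_is_private == attr_is_private)
	return direct_map(vcpu, fault_gpa);      /* mapping just missing */
else
	return exit_to_userspace(vcpu);          /* guest wants a conversion */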
tdx_track()
KVM
This is effectively TLB flushing: on each guest entry, if a flush request is pending, tdx_track() is called to do the flush.
vcpu_enter_guest
if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
kvm_vcpu_flush_tlb_all(vcpu);
vt_flush_tlb_all
tdx_flush_tlb
kvm_service_local_tlb_flush_requests
kvm_service_local_tlb_flush_requests
if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
kvm_vcpu_flush_tlb_current(vcpu);
vt_flush_tlb_current
tdx_flush_tlb_current
tdx_track(vcpu->kvm);
Its main job is to issue the SEAMCALL TDH.MEM.TRACK.
/*
* TLB shoot down procedure:
* There is a global epoch counter and each vcpu has local epoch counter.
* - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
 * This blocks the subsequent creation of TLB translation on that range.
* This corresponds to clear the present bit(all RXW) in EPT entry
* - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
* - IPI to remote vcpus
* - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
* - On re-entry, TDX module compares the local epoch counter with the global
* epoch counter. If the local epoch counter is older than the global epoch
* counter, update the local epoch counter and flushes TLB.
*/
static void tdx_track(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct kvm_vcpu *vcpu;
unsigned long i;
u64 err;
KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
/* If TD isn't finalized, it's before any vcpu running. */
if (unlikely(!is_td_finalized(kvm_tdx)))
return;
/*
* tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
* the counter. The counter is used instead of bool because multiple
* TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
*/
atomic_inc(&kvm_tdx->doing_track);
while (atomic_cmpxchg(&kvm_tdx->tdh_mem_track, 0, 1)) {
cpu_relax();
}
smp_store_release(&kvm_tdx->has_range_blocked, false);
/*
* Don't wait for other vcpus with the empty IPI handler. Instead,
* Synchronize after tdh_mem_track() to reduce synchronization time.
*/
kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH & ~KVM_REQUEST_WAIT);
/*
* kvm_flush_remote_tlbs() doesn't allow to return error and
* retry.
*/
err = tdh_mem_track(kvm_tdx->tdr_pa);
if (!err)
tdx_tdi_iq_inv_iotlb(kvm_tdx);
/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
atomic_set(&kvm_tdx->tdh_mem_track, 0);
/*
* Avoid TDX_TLB_TRACKING_NOT_DONE on the following Secure-EPT operation
* by waiting here for all other vcpus to go through TDExit once or not
* running TD guest. The alternative is loop on
* TDX_TLB_TRACKING_NOT_DONE with Secure-EPT operation. But if we hit
* problem with tlb shoot down, debug will be very difficult. So we
* don't choose the loop option.
*/
kvm_for_each_vcpu(i, vcpu, kvm) {
int mode;
/* If vcpu == current vcpu, vcpu->mode == OUTSIDE_GUEST_MODE */
mode = smp_load_acquire(&vcpu->mode);
while ((mode == IN_GUEST_MODE || mode == EXITING_GUEST_MODE) &&
kvm_test_request(KVM_REQ_TLB_FLUSH, vcpu)) {
cpu_relax();
mode = smp_load_acquire(&vcpu->mode);
}
}
atomic_dec(&kvm_tdx->doing_track);
if (KVM_BUG_ON(err, kvm))
pr_tdx_error(TDH_MEM_TRACK, err, NULL);
}
tdx_td_init()
KVM
static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params, u64 *seamcall_err, bool post_init)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct tdx_module_args out;
cpumask_var_t packages;
unsigned long *tdcs_pa = NULL;
unsigned long tdr_pa = 0;
unsigned long va;
int ret, i;
u64 err;
//...
// alloc HKID
ret = tdx_guest_keyid_alloc();
//...
kvm_tdx->hkid = ret;
// cgroup accounting
kvm_tdx->misc_cg = get_current_misc_cg();
ret = misc_cg_try_charge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1);
//...
// allocate the TDR page
va = __get_free_page(GFP_KERNEL_ACCOUNT);
//...
tdr_pa = __pa(va);
// allocate the TDCS array
tdcs_pa = kcalloc(tdx_info.nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
GFP_KERNEL_ACCOUNT | __GFP_ZERO);
//...
// allocate the page each TDCS array entry points to
for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
va = __get_free_page(GFP_KERNEL_ACCOUNT);
//...
tdcs_pa[i] = __pa(va);
}
// allocate a cpumask and store it in 'packages'
zalloc_cpumask_var(&packages, GFP_KERNEL)
//...
// check if tdx module is available or not
// Need at least one CPU of the package to be online in order to
// program all packages for host key id. Check it.
for_each_present_cpu(i)
cpumask_set_cpu(topology_physical_package_id(i), packages);
for_each_online_cpu(i)
cpumask_clear_cpu(topology_physical_package_id(i), packages);
//...
/*
* Acquire global lock to avoid TDX_OPERAND_BUSY:
* TDH.MNG.CREATE and other APIs try to lock the global Key Owner
* Table (KOT) to track the assigned TDX private HKID. It doesn't spin
* to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
* caller to handle the contention. This is because of time limitation
* usable inside the TDX module and OS/VMM knows better about process
* scheduling.
*
* APIs to acquire the lock of KOT:
* TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
* TDH.PHYMEM.CACHE.WB.
*/
mutex_lock(&tdx_lock);
err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
mutex_unlock(&tdx_lock);
// error checkings...
kvm_tdx->tdr_pa = tdr_pa;
tdx_account_ctl_page(kvm);
for_each_online_cpu(i) {
int pkg = topology_physical_package_id(i);
if (cpumask_test_and_set_cpu(pkg, packages))
continue;
/*
* Program the memory controller in the package with an
* encryption key associated to a TDX private host key id
* assigned to this TDR. Concurrent operations on same memory
* controller results in TDX_OPERAND_BUSY. Avoid this race by
* mutex.
*/
mutex_lock(&tdx_mng_key_config_lock[pkg]);
ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
&kvm_tdx->tdr_pa, true);
mutex_unlock(&tdx_mng_key_config_lock[pkg]);
if (ret)
break;
}
if (ret)
atomic_dec(&nr_configured_hkid);
cpus_read_unlock();
free_cpumask_var(packages);
// error checking...
kvm_tdx->tdcs_pa = tdcs_pa;
for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
// error checking...
tdx_account_ctl_page(kvm);
}
if (!post_init) {
err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
// error handling...
tdx_td_post_init(kvm_tdx);
}
kvm_tdx->attributes = td_params->attributes;
kvm_tdx->xfam = td_params->xfam;
kvm_tdx->eptp_controls = td_params->eptp_controls;
// create migration state if the TD is migratable
if ((td_params->attributes & TDX_TD_ATTRIBUTE_MIG) &&
    tdx_mig_state_create(to_kvm_tdx(kvm)))
	// error handling...
if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
else
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
return 0;
// a bunch of error handling tags...
}
tdx_map_gpa()
KVM
When KVM receives a MapGPA request, there are two ways to handle it:
- (Common) If a memory slot covering the GPA can be found, exit to userspace to notify it; userspace then issues the SET_ATTRIBUTE ioctl.
- If no such slot is found, handle it in the kernel.
static int tdx_map_gpa(struct kvm_vcpu *vcpu)
{
struct kvm *kvm = vcpu->kvm;
gpa_t gpa = tdvmcall_a0_read(vcpu);
gpa_t size = tdvmcall_a1_read(vcpu);
gpa_t end = gpa + size;
// s == start
gfn_t s = gpa_to_gfn(gpa) & ~kvm_gfn_shared_mask(kvm);
// e == end
gfn_t e = gpa_to_gfn(end) & ~kvm_gfn_shared_mask(kvm);
// Check whether the GPA passed in by the TD guest has the shared bit set,
// i.e. whether the guest wants it mapped private or shared.
bool map_private = kvm_is_private_gpa(kvm, gpa);
int ret;
int i;
// sanity checks...
//...
// there are only a few address spaces anyway
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
struct kvm_memslot_iter iter;
kvm_for_each_memslot_in_gfn_range(&iter, slots, s, e) {
struct kvm_memory_slot *slot = iter.slot;
gfn_t slot_s = slot->base_gfn;
gfn_t slot_e = slot->base_gfn + slot->npages;
// our range doesn't overlap this slot's range at all,
// so continue
if (e < slot_s || s >= slot_e)
continue;
// we are entirely contained within this slot
if (slot_s <= s && e <= slot_e) {
// if this slot's flags allow private memory
if (kvm_slot_can_be_private(slot))
// let userspace (QEMU) handle it;
// userspace will call KVM_SET_MEMORY_ATTRIBUTES
return tdx_vp_vmcall_to_user(vcpu);
continue;
}
break;
}
}
// No overlapping slot that qualifies was found; the kernel handles this
// itself without notifying userspace. This seems to be the MMIO case?
// Probably uncommon.
ret = kvm_mmu_map_private(vcpu, &s, e, map_private);
//error handling...
}
kvm_mmu_map_private()
/ __kvm_mmu_map_private()
KVM
int kvm_mmu_map_private(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end, bool map_private)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
// checking...
return __kvm_mmu_map_private(vcpu->kvm, startp, end, map_private);
}
static int __kvm_mmu_map_private(struct kvm *kvm, gfn_t *startp, gfn_t end, bool map_private)
{
gfn_t start = *startp;
u64 attrs;
int ret;
//...
attrs = map_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0;
start = start & ~kvm_gfn_shared_mask(kvm);
end = end & ~kvm_gfn_shared_mask(kvm);
//...
kvm_vm_reserve_mem_attr_array(kvm, start, end);
//...
kvm_mmu_invalidate_begin(kvm);
kvm_mmu_invalidate_range_add(kvm, start, end);
if (is_tdp_mmu_enabled(kvm)) {
//...
ret = kvm_tdp_mmu_map_private(kvm, start, end, map_private);
// set the xarray to the desired attribute, whether shared or private
kvm_vm_set_memory_attributes(kvm, attrs, start, end);
//...
} else {
gfn_t gfn;
for (gfn = start; gfn < end; gfn++) {
/* mmu_map_private() handles only 1 gfn. */
ret = mmu_map_private(kvm, gfn, map_private);
if (ret) {
if (gfn > start) {
ret = -EAGAIN;
start = gfn;
}
break;
}
KVM_BUG_ON(kvm_vm_set_memory_attributes(kvm, attrs, gfn, gfn + 1), kvm);
if (need_resched()) {
ret = -EAGAIN;
start = gfn + 1;
break;
}
}
}
kvm_mmu_invalidate_end(kvm);
//...
}
TDX MCE handling
Registration:
tdx_hardware_setup
mce_register_decode_chain(&tdx_mce_nb);
mce_register_decode_chain
At runtime:
.notifier_call = tdx_mce_notifier
tdx_mce_notifier
/* Clear poisoned bit to avoid further #MC */
static int tdx_mce_notifier(struct notifier_block *nb, unsigned long val, void *data)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
struct mce *m = (struct mce *)data;
unsigned long kaddr;
unsigned long addr;
struct page *page;
u16 hkid;
// we need this feature to clear the poison bit
if (!boot_cpu_has(X86_FEATURE_MOVDIR64B))
return NOTIFY_DONE;
// for now TDX only cares about memory errors; return for any other error type
if (!m)
return NOTIFY_DONE;
if (!mce_is_memory_error(m))
return NOTIFY_DONE;
addr = m->addr & ((1ULL << boot_cpu_data.x86_phys_bits) - 1);
hkid = m->addr >> boot_cpu_data.x86_phys_bits;
/* Is hkid used for TDX? */
if (hkid < tdx_global_keyid)
return NOTIFY_DONE;
// everything below zeroes the cache-line-sized region at the address kaddr points to
/*
* MCE handler may make the page non-present in direct map. Map the page
* to access. Use VM_FLUSH_RESET_PERMS flag to tlb flush at vunmap()
* and reset direct mapping region.
*/
// find the corresponding struct page
page = pfn_to_page(addr >> PAGE_SHIFT);
kaddr = (unsigned long)vmap(&page, 1, VM_FLUSH_RESET_PERMS, PAGE_KERNEL);
if (!kaddr)
return NOTIFY_DONE;
/* Adjust page offset. */
kaddr |= addr & ~PAGE_MASK;
/* Align to cache line. */
kaddr = ALIGN_DOWN(kaddr, 64);
/* Direct write to clear poison bit. */
movdir64b((void *)kaddr, zero_page);
__mb();
vunmap((void *)(kaddr & PAGE_MASK));
pr_err("cleared poisoned cache hkid 0x%x pa 0x%lx\n", hkid, addr);
return NOTIFY_DONE;
}
TD Preserving
TD Preserving is not a TDX module feature written in the spec; it is a feature implemented purely in the kernel/KVM.
Partitioned TD / TD Partitioning
A TDX 1.5 feature.
Designed to provide a minimal environment for supporting the Microsoft VSM and similar architectures.
TDX_OPERAND_BUSY_HOST_PRIORITY
For guest-side functions: The operand is busy (e.g., it is locked in Exclusive mode) due to host priority.
For host-side functions: the operand is busy; the host VMM should retry the operation until it succeeds, to avoid the guest getting stuck on host priority.
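For host-side functions this usually ends up as a simple retry loop; the pattern below mirrors the loop in tdx_reclaim_page() quoted later in these notes, and the specific SEAMCALL and operand ID here are just examples:

u64 err;
struct tdx_module_output out;

do {
	err = tdh_phymem_page_reclaim(pa, &out);
	/* the TDX module doesn't spin internally; the VMM owns the retry */
} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));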
TSC in TDX
TDX protects TDX guest TSC state from VMM (VMM cannot access guest's TSC value). The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM owns TSC virtualization for VMs, but the TDX module does for TDs.
Guest TDs are not allowed to modify the TSC. Attempts to WRMSR IA32_TIME_STAMP_COUNTER result in a #VE.
Guest TDs are not allowed to access IA32_TSC_ADJUST because its value is meaningless to them.
TDX leaves
arch/x86/virt/vmx/tdx/tdx.h
arch/x86/kvm/vmx/tdx_arch.h
arch/x86/virt/vmx/tdx/tdx_module_loader_old/tdx_arch.h
handle_removed_private_spte()
KVM
It does two things: zap (RANGE.BLOCK) and remove (PAGE.REMOVE).
- If the new SPTE is zapped, only the first is done, i.e. zap (block).
- If the new SPTE is removed, both are done.
static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
u64 old_spte, u64 new_spte,
int level)
{
//...
// can't transition from zapped to zapped
KVM_BUG_ON(was_private_zapped && is_private_zapped, kvm);
// it's either zapped or removed, so it can't still be present
WARN_ON_ONCE(is_present);
// nothing to do for a non-leaf
if (!was_leaf)
return;
// zap
ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
// if only a zap is needed, return
if (is_private_zapped) {
/* page migration isn't supported yet. */
// and the pfn must stay the same across the zap
KVM_BUG_ON(new_pfn != old_pfn, kvm);
return;
}
//...
/* non-present -> non-present doesn't make sense. */
KVM_BUG_ON(!was_present, kvm);
// since it's removed, new_pfn must not hold a value
KVM_BUG_ON(new_pfn, kvm);
// PAGE.REMOVE it
ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
}
TDX private SPTE state transition / state diagram (public state)
A public state is a number that corresponds one-to-one to an SPTE state maintained inside the TDX module. On error, some SEAMCALLs:
- return the SPTE in RCX;
- return the public state at the time of the error in the 8-bit field at bits 7-15 of RDX.
We can analyze these with this program: https://github.com/tristone13th/code-snippets/blob/main/c/tdx/spte_state.c
All the states and their descriptions:
TDX Base SPEC
9. TD Private Memory Management
9.2. Secure EPT Entry
9.2.1. UPDATED: Overview
Table 9.1: UPDATED: Secure EPT Entry State High Level Description
// this one is more complete
ABI
Table 4.32: Secure L1 EPT Entry TDX State Returned by TDX Interface Functions
Basic state diagrams (note that SEPT non-leaf entries and leaf entries have different states):
TDX Base SPEC
9. TD Private Memory Management
9.2. Secure EPT Entry
9.2.2. UPDATED: SEPT Entry State Diagrams
State diagrams after migration is added:
TD Migration Spec
Figure 9.3: Partial SEPT Leaf Entry State Diagram for Mapped Page Export
Figure 9.6: Partial SEPT Leaf Entry State Diagram for Pending Page Export
Figure 9.8: Page In-Order Import Phase Partial SEPT Entry State Diagram
Figure 9.9: Page Out-of-Order Import Phase Partial SEPT Entry State Diagram
Each state is represented by these seven bits:
- SEPT_ENTRY_D_BIT_POSITION (0): dirty bit (bit 9)
- SEPT_ENTRY_TDEX_BIT_POSITION (1 to 4):
  - tdex, // bit 53 - Exported
  - tdbw, // bit 54 - Blocked for Writing
  - tdb, // bit 55 - Blocked
  - tdp, // bit 56 - Pending
- SEPT_ENTRY_IPAT_TDMEM_BIT_POSITION (5 to 6):
  - bit 6: // always 1
  - bit 7: // Non-Leaf(0) / Leaf(1)
MAPPED: leaf is 1
BLOCKED: leaf and tdb are 1
BLOCKEDW: leaf and tdbw are 1 (only migration can transition into this state)
EXPORTED_BLOCKEDW: leaf, tdbw and tdex are 1 (only migration can transition into this state)
Why are there two different states, FREE and REMOVED?
TDH.MEM.PAGE.REMOVE will mark the state as FREE.
TDH.IMPORT.MEM(Cancel) will mark the state as REMOVED.
BLOCKED + TDH.EXPORT.BLOCKW
In theory, the spec shows no such state transition.
Reading the TDX module code, this transition is not allowed either.
BLOCKED + TDH.EXPORT.MEM
In theory, the spec shows no such state transition.
Reading the TDX module code, this transition is not allowed either.
BLOCKEDW + TDH.MEM.RANGE.BLOCK
In theory, per the spec, this transitions to the BLOCKED state.
leaf 7, index = 6
TD Teardown Process
TD Teardown Process from QEMU
When is hkid freed?
KVM:
kvm_vcpu_release
kvm_vm_release
kvm_put_kvm(kvm); // if users_count drops to 0
if (refcount_dec_and_test(&kvm->users_count))
kvm_destroy_vm
mmu_notifier_unregister
subscription->ops->release
kvm_mmu_notifier_release
kvm_flush_shadow_all
kvm_arch_flush_shadow_all
vt_flush_shadow_all_private // vt_x86_ops.flush_shadow_all_private()
tdx_mmu_release_hkid
tdx_hkid_free // free the hkid
kvm_arch_destroy_vm
static_call_cond(kvm_x86_vm_free)(kvm);
vt_vm_free
tdx_vm_free
Zap in TDX
enum tdp_zap_private {
ZAP_PRIVATE_SKIP = 0,
ZAP_PRIVATE_BLOCK,
ZAP_PRIVATE_REMOVE,
};
These three actions are set in different places, but they are all checked and consumed in the same place: the function tdp_mmu_zap_leafs().
From the code of kvm_tdp_mmu_unmap_gfn_range() we can see that these represent three situations:
- ZAP_PRIVATE_REMOVE: the PAGE.REMOVE SEAMCALL must be called to take the page out of the TDX module, because we are doing a gmem invalidation (PUNCH_HOLE). That means we no longer want this gmem range and the kernel may reclaim these pages; if they were not removed from the TDX module, the kernel handing them to another process would go wrong. Deleting a memory slot also triggers this case.
- ZAP_PRIVATE_BLOCK: converting from private to shared. Since we can't be sure whether the page will be converted back later, we don't remove it for now, which improves performance.
- ZAP_PRIVATE_SKIP: the MMU notifier.
ZAP_PRIVATE_SKIP
KVM
By default, ZAP_PRIVATE_SKIP is also used for shared mappings. It is only used in the function tdp_mmu_zap_leafs() below (which zaps the leaves of a range).
- For a private sp, do nothing;
- For a shared sp, set the SPTE leaves directly to SHADOW_NONPRESENT_VALUE, without the intermediate step private SPTEs have.
// This function can zap the leaves of both the private SPT and the shared SPT.
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush,
enum tdp_zap_private zap_private)
{
bool is_private = is_private_sp(root);
//...
// SKIP is meaningful for both shared and private;
// the other two (BLOCK, REMOVE) only make sense for private.
WARN_ON_ONCE(zap_private != ZAP_PRIVATE_SKIP && !is_private);
//...
// As the name SKIP suggests: if we are private and skipping, don't zap
if (zap_private == ZAP_PRIVATE_SKIP && is_private)
return flush;
//...
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
// If SKIP, it must be a shared SPTE,
// so this condition can't be met here.
if ((zap_private == ZAP_PRIVATE_SKIP ||
zap_private == ZAP_PRIVATE_BLOCK) &&
is_private_zapped_spte(iter.old_spte))
continue;
if (zap_private == ZAP_PRIVATE_REMOVE)
new_spte = SHADOW_NONPRESENT_VALUE;
// Although the function is named private_zapped_spte(), it checks inside:
// for shared it returns SHADOW_NONPRESENT_VALUE
else
new_spte = private_zapped_spte(kvm, &iter);
}
ZAP_PRIVATE_BLOCK
KVM
It is only used in the following function:
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush,
enum tdp_zap_private zap_private)
//...
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
if ((zap_private == ZAP_PRIVATE_SKIP || zap_private == ZAP_PRIVATE_BLOCK) &&
is_private_zapped_spte(iter.old_spte))
continue;
//...
if (zap_private == ZAP_PRIVATE_REMOVE)
new_spte = SHADOW_NONPRESENT_VALUE;
// this hints at what BLOCK actually does:
else
new_spte = private_zapped_spte(kvm, &iter);
//...
It is easy to see that this action sets the SPTE to the private_zapped_spte state, i.e. it sets bit 62 (SPTE_PRIVATE_ZAPPED).
ZAP_PRIVATE_REMOVE
KVM
REMOVE goes all the way in one step: the SPTE becomes SHADOW_NONPRESENT_VALUE directly.
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush,
enum tdp_zap_private zap_private)
//...
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
//...
if (zap_private == ZAP_PRIVATE_REMOVE)
new_spte = SHADOW_NONPRESENT_VALUE;
else
new_spte = private_zapped_spte(kvm, &iter);
//...
}
.free_private_spt()
, .remove_private_spte()
, .drop_private_spte()
, .zap_private_spte()
Because these are private-specific, VMX has no corresponding callbacks; they exist only for TDX.
Note that free takes an spt, while all the others take an spte.
The three paths that set an SPTE to zapped
// When the kernel changes a PTE on its own, this path notifies KVM
mmu_notifier_change_pte
__mmu_notifier_change_pte
kvm_mmu_notifier_ops->change_pte()
kvm_mmu_notifier_change_pte
kvm_change_spte_gfn
kvm_set_spte_gfn
kvm_tdp_mmu_set_spte_gfn
set_spte_gfn
tdp_mmu_iter_set_spte(kvm, iter, private_zapped_spte(kvm, iter));
zap_collapsible_spte_range
tdp_mmu_zap_spte_atomic
__kvm_tdp_mmu_write_spte(iter->sptep, private_zapped_spte(kvm, iter));
// this is the gmem gfn-invalidation path
kvm_mmu_unmap_gfn_range
kvm_unmap_gfn_range
kvm_tdp_mmu_unmap_gfn_range
tdp_mmu_zap_leafs
new_spte = private_zapped_spte(kvm, &iter);
// this is the path where setting a memslot invalidates the entire memslot
kvm_set_memslot
kvm_invalidate_memslot
kvm_arch_flush_shadow_memslot
kvm_mmu_zap_memslot
kvm_tdp_mmu_unmap_gfn_range
tdp_mmu_zap_leafs
new_spte = private_zapped_spte(kvm, &iter);
// seemingly some rarely hit corner cases
__kvm_set_or_clear_apicv_inhibit
kvm_post_set_cr0
update_mtrr
kvm_zap_gfn_range
kvm_tdp_mmu_zap_leafs
tdp_mmu_zap_leafs
new_spte = private_zapped_spte(kvm, &iter);
tdx_sept_zap_private_spte()
& .zap_private_spte
/ SPTE_PRIVATE_ZAPPED
KVM
It only blocks this GPA range.
The semantics of "zap" were changed by the TDX patchset. Originally, without TDX, zapping simply removed the SPTE; with TDX it denotes an intermediate range-blocked state (SPTE_PRIVATE_ZAPPED). This is probably for performance: during a page attribute conversion from private to shared, since we are in no hurry to reclaim the memory, we only zap the entry and keep the other metadata such as the PFN, without calling TDH.MEM.PAGE.REMOVE. If we later need to convert it back, we save a lot of overhead (e.g. TDH.MEM.PAGE.ADD).
KVM: x86/mmu: add SPTE_PRIVATE_ZAPPED
KVM: x86/tdp_mmu: optimize remote tlb flush
This is preparation to optimize TLB shootdown. The existing code to zap EPT entries always issues a TLB shootdown for each EPT entry and doesn't batch TLB shootdowns when zapping multiple EPT entries. The original procedure is:
- Clear the EPT entry (in the KVM-maintained shadow table).
- TDX SEAMCALL TDH.MEM.RANGE.BLOCK with the GFN. This corresponds to clearing the present bit.
- TDH.MEM.TRACK. This corresponds to a local TLB flush.
- Send an IPI to remote vcpus. This corresponds to a remote TLB flush.
- When destructing the TD, TDH.MEM.PAGE.REMOVE with the PFN. There is no corresponding VMX EPT operation.
At the last step, the PFN is needed to unlink the private memory from the Secure EPT. Because this procedure runs synchronously, the PFN is saved on the stack.
If we'd like to batch the TLB shootdown (TLB shootdown when entering the guest?), the PFNs need to be saved somewhere else, because the stack can't be used: the array of PFNs can be large.
- Multiple EPT entries.
- TDX SEAMCALL TDH.MEM.RANGE.BLOCK with the GFNs. This corresponds to clearing the present bit. Steps 1) and 2) are repeated for multiple GFNs, and then:
- TDH.MEM.TRACK. This corresponds to a local TLB flush.
- Send an IPI to remote vcpus. This corresponds to a remote TLB flush. Steps 3) and 4) form a batched TLB shootdown.
- When destructing the VM, TDH.MEM.PAGE.REMOVE with the PFNs. There is no corresponding VMX EPT operation.
For step 5), the PFNs need to be remembered somewhere. One option is to use the zapped EPT entry, by setting the special flag SPTE_PRIVATE_ZAPPED. This shows the purpose of SPTE_PRIVATE_ZAPPED: a page marked as zapped is not removed immediately; it is removed only when the SPTE transitions to some other (non-zapped) state. When the whole TD is finally shut down, every SPTE is set to SHADOW_NONPRESENT_VALUE, which triggers the removal of the pages previously marked private-zapped:
kvm_mmu_notifier_release
kvm_flush_shadow_all
kvm_arch_flush_shadow_all
kvm_mmu_zap_all
kvm_tdp_mmu_zap_all
tdp_mmu_zap_root
tdp_mmu_set_spte_atomic(SHADOW_NONPRESENT_VALUE)
handle_changed_spte
// this condition is important: it is the key to the delayed removal,
// going from the zapped state to the non-present state
if (was_private_zapped && !is_present) {
handle_private_zapped_spte
// TDH.MEM.PAGE.REMOVE
ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
{
//...
err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
//...
WRITE_ONCE(kvm_tdx->has_range_blocked, true);
}
tdx_sept_drop_private_spte()
& .drop_private_spte
KVM
Despite the name drop_private_spte, its main job is to reclaim the one page in question.
It removes, inside the TDX module, the private page the SPTE maps to, and clears the corresponding SPTE value (not needed when we are tearing down the TD).
The reason tdx_unpin() is called here, I think, is that when the gmem/restricted fd is closed, the corresponding memory is not reclaimed by the kernel, so tdx_unpin() must be called here to reclaim the private memory page by page.
It is called from two places:
tdx_sept_remove_private_spte
rmap_remove
static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct tdx_module_output out;
gpa_t gpa = gfn_to_gpa(gfn);
hpa_t hpa = pfn_to_hpa(pfn);
hpa_t hpa_with_hkid;
int r = 0;
u64 err;
int i;
// This means we are destroying the TD, because we don't need
// to set the spte value to 0, so we use reclaim SEAMCALL
if (!is_hkid_assigned(kvm_tdx)) {
/*
* The HKID assigned to this TD was already freed and cache
* was already flushed. We don't have to flush again.
*/
tdx_reclaim_page(hpa, level, false, 0);
// actually free the page
tdx_unpin(kvm, gfn, pfn, level);
return 0;
}
//...
// remove that private page in the TDX module and clear the corresponding SPTE value
err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
//...
// handles the case where this page is a huge page: every 4KB page must be reclaimed
for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
// associate the hpa with the hkid
hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
// ...
// Write back and invalidate all cache lines with the specified page
err = tdh_phymem_page_wbinvd(hpa_with_hkid);
tdx_set_page_present(hpa);
// tdx_unpin supports reclaiming a whole huge page, but the argument here is
// PG_LEVEL_4K, so we reclaim one page at a time; room for optimization?
tdx_unpin(kvm, gfn + i, pfn + i, PG_LEVEL_4K);
hpa += PAGE_SIZE;
}
return r;
}
tdx_sept_remove_private_spte()
& .remove_private_spte()
KVM
The SPTE here is the same as in handle_changed_spte(): it is not necessarily a last-level SPTE.
Remove = TLB flush + Drop
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
if (is_hkid_assigned(to_kvm_tdx(kvm)))
kvm_flush_remote_tlbs(kvm);
return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
}
It is called from two places: handle_changed_private_spte() and handle_private_zapped_spte().
__handle_changed_spte
handle_private_zapped_spte
handle_changed_private_spte
tdx_sept_free_private_spt()
& .free_private_spt
KVM
free_private_spt()
is (obviously) called when a shadow page is being zapped.
It is called from both kvm_mmu_free_shadow_page and handle_removed_pt. The gfn passed in is sp->gfn, i.e. the gfn mapped by the first entry of this MMU page. The function does not free the page table referenced by the entry at this gfn; it frees the whole MMU page that contains the entry, i.e. the sp.
Its main job is to issue the tdh_mem_sept_remove SEAMCALL so that the TDX module stops tracking this EPT page.
static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level, void *private_spt)
{
// The page's gfn identifies an SPTE; the gfn range (key) covered by its
// parent can be derived by masking off some bits of the gfn.
gpa_t parent_gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level + 1);
int parent_tdx_level = pg_level_to_tdx_sept_level(level + 1);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct tdx_module_output out;
u64 err;
// If we are in the teardown process, reclaim this page-table page from the
// TDX module. Freeing it in the kernel is not needed here:
// kvm_mmu_free_private_spt will free_page it.
if (!is_hkid_assigned(kvm_tdx))
return tdx_reclaim_page(__pa(private_spt), PG_LEVEL_4K, false, 0);
// first block the entire range represented by the parent page
if (kvm_tdx->td_initialized)
err = tdh_mem_range_block(kvm_tdx->tdr_pa, parent_gpa, parent_tdx_level, &out);
// Flush TLB on all vcpus
tdx_track(kvm_tdx);
// remove this SEPT page (located via its parent entry)
// TDH.MEM.SEPT.REMOVE removes an empty Secure EPT page or pages, with all 512 entries marked as FREE
err = tdh_mem_sept_remove(kvm_tdx->tdr_pa, parent_gpa, parent_tdx_level, &out);
//...
// this EPT page may still be present in cache lines; write back and invalidate them
err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt), kvm_tdx->hkid));
//...
return 0;
}
TDX Basic
Documentation
The GHCI mainly covers TDG.VP.VMCALL, one leaf of TDCALL.
Why, on a privileged instruction such as CPUID, doesn't the VMM emulate it and return directly? Why instead deliver a #VE, after which the guest issues TDG.VP.VMCALL<Instruction.CPUID>?
Scripts
The guest kernel cmdline should add idle=poll.
How to see TDX module version of a running system?
dmesg | grep "TDX module"
TD_PARAMS
Structure
ABI 3.4.4. UPDATED: TD_PARAMS
KVM calls TDH.MNG.INIT (passing the TD_PARAMS structure) to initialize the TD. TD_PARAMS is 1KB in size.
- From KVM to the TDX module: pass TD_PARAMS;
- From QEMU to KVM: pass init_vm.
Persistent SEAMLDR and Non-Persistent SEAMLDR
NP-SEAMLDR: loads the P-SEAMLDR. It is the binary /boot/efi/EFI/TDX/TDX-SEAM_SEAMLDR.bin.
P-SEAMLDR: loads the TDX module.
How does host know the TDX module can be trusted?
TD Measurement and Attestation.
Intel SGX-Based Attestation.
All TD measurements are reflected in TD attestations.
Run-time measurement registers can be used by the guest TD software, e.g., to measure a boot process.
What is XMM register?
Finalize the TD measurement?
- Its measurement cannot be modified anymore (except the run-time measurement registers).
- TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER).
When?
After the initial set of pages is added and extended, the VMM can finalize the TD measurement using the TDH.MR.FINALIZE function.
TD Exit logic
TD guest code ->
TD Exit reasons
TD Exit qualification
TDX ABI
3. Data Types
3.7. TD Entry and Exit Types
3.7.1. Extended Exit Qualification
TDX SPTE bits illustrated
typedef union ia32e_sept_u {
struct {
uint64_t
r : 1, // 0
w : 1, // 1
x : 1, // 2
mt : 3, // 3-5 - Set to 110 (WB)
ipat_tdmem : 1, // 6 - Set to 1
leaf : 1, // 7 - Non-Leaf(0) / Leaf(1), always 1 for 4KB (level 0)
a : 1, // 8 - Accessed
d : 1, // 9 - Dirty, set and cleared by the TDX module in all the *EXPORTED_* states
tdel : 1, // 10 - Entry Lock
reserved_0 : 1, // 11 - Reserved for IOMMU SNP
base : 40, // 12-51
hp : 1, // 52 - Host Priority, used together with TDEL
tdex : 1, // 53 - Exported
tdbw : 1, // 54 - Blocked for Writing
tdb : 1, // 55 - Blocked
tdp : 1, // 56 - Pending
tdpin : 1, // 57 - 1: Page is pinned in memory
pw : 1, // 58 - Paging-Write
ignored_0 : 1, // 59
sss_tdsa : 1, // 60 - Supervisor Shadow Stack / SEPT Alias (Link)
reserved_1 : 1, // 61 - Reserved for IOMMU
reserved_2 : 1, // 62 - Reserved for IOMMU (BlockDMA)
supp_ve : 1; // 63
};
uint64_t raw;
} ia32e_sept_t;
What is Asynchronous TD Exit and TD Resumption?
"Synchronous" and "asynchronous" here are relative to the guest TD's code flow.
A TD Exit might be asynchronous, triggered by some external event (e.g., external interrupt or SMI) or an exception, or it might be synchronous, triggered by a TDCALL(TDG.VP.VMCALL) function.
Asynchronous events such as interrupts / EPT violations are outside the guest TD's code flow, so they count as asynchronous.
- Guest TD memory access to a non-present private GPA causes an asynchronous TD exit with an EPT Violation exit reason.
A VMCALL, by contrast, is invoked by the TD itself, so it counts as synchronous.
What is Synchronous TD Exit and Subsequent TD Entry?
It is relative to the TD: the TD can exit by invoking TDG.VP.VMCALL, so that exit is synchronous.
What's the difference between TDX modes and SEAM modes?
TDX modes are logical concepts that don't physically exist; SEAM root mode is the physical one.
TDX root mode contains:
- Non-SEAM root mode: KVM is running under this mode.
- half of SEAM root mode: the TDX module, when serving host-side functions, runs in this mode.
TDX non-root mode contains:
- SEAM non-root mode: TD is running under this mode.
- half of SEAM root mode: the TDX module, when serving guest-side functions, runs in this mode.
You can refer to figure 2.2 in TDX spec for more information.
SEAM is Secure Arbitration Mode
Like SMM, it is a new mode.
You can imagine a cube with x, y and z axes:
- x: modes, such as SMM, SEAM
- y: outside VMX, VMX root, VMX non-root
- z: Ring 0-3
Shared EPT and Secure EPT
| | Shared EPT | Secure EPT |
|---|---|---|
| For | Shared GPAs | Private GPAs |
| Managed by | VMM | VMM indirectly |
| GPA encrypted by | Key shared with VMM | TD private key |
| Encrypted? | Not | Not |
Shared key and TD private key
Shared/Private is a bit in physical address.
- Shared accesses are intended to behave as legacy memory accesses and use the upper bits of the host physical address as an HKID, which must be from the range allocated to legacy MKTME.
- Private accesses use the guest TD’s private HKID. (Which means it won't use the key in the upper bits!)
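A sketch of how an HKID rides in the upper physical-address bits, mirroring set_hkid_to_hpa() used in the KVM code above and the decoding seen in tdx_mce_notifier(); these helper names are illustrative:

/* Compose/decompose an HPA carrying an HKID (illustrative helpers). */
static inline u64 hpa_with_hkid(u64 hpa, u16 hkid)
{
	return hpa | ((u64)hkid << boot_cpu_data.x86_phys_bits);
}

static inline u16 hkid_of_hpa(u64 addr)
{
	return addr >> boot_cpu_data.x86_phys_bits;
}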
How to understand "Accessible" and "Addressable"?
Accessible (Memory): Memory whose content is readable and/or writeable (e.g., TD private memory is accessible to the guest TD).
Addressable (Memory): Memory that can be referred to by its address. The content of addressable memory might not necessarily be accessible (e.g., TDCS is not accessible to the host VMM, you need to use SEAMCALL to invoke the TDX Module).
TDH submodules?
MNG: Management
MR: Measurement Register
MEM: Memory
VP: Virtual Processor
All the interfaces can be seen TDX Spec: Sec 2.9.
tdsysinfo_struct
The tdsysinfo_struct
is fairly large (1024 bytes) and contains a lot of info about the TDX module.
tdsysinfo_struct
is also a structure defined in KVM.
TDH.SYS.RD
/RDALL
and TDH.SYS.INFO
Both can enumerate TDX module capabilities.
TDH.SYS.RD
and TDH.SYS.RDALL
are added in TDX 1.5, and are the recommended enumeration methods.
tdx_capabilities
(KVM)
Used by KVM to parse and transform the fields of tdsysinfo_struct that it cares about, and to store the results.
struct tdx_capabilities {
//...
u8 sys_rd; // whether the TDH.SYS.RD SEAMCALL is supported
u32 max_servtds; // max number of service TDs that can be bound at once
//...
struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
};
What is initialization?
Who initializes? The host VMM.
Initializes what? The TDX module.
When? After the module is loaded.
What the TDX-module exactly is?
TDX module is in SEAM VMX Root.
TD is in SEAM VMX non-root.
So the TDX module and a TD are like a VMM and a VM.
The TDX module also uses VMLAUNCH/VMRESUME to start a VM. (This is wrapped in the TDX SEAMCALL TDH.VP.ENTER, so KVM can actually just call that SEAMCALL.)
What is CMR (Convertible Memory Ranges)?
Memory regions that can hold TD-private memory pages.
The meaning of "Convertible": Can be converted from a Shared page to a Private page.
TDX Spec: 13.1.4.1
A 4KB memory page is defined as convertible if it can be used to hold an Intel TDX private memory page or any Intel TDX control structure pages while helping guarantee Intel TDX security properties (i.e., if it can be converted from a Shared page to a Private page).
- CMR configuration is checked by MCHECK and cannot be modified afterwards.
How does KVM get the CMR information?
The host VMM should then call the TDH.SYS.RD/RDALL
or TDH.SYS.INFO
function to enumerate the Intel TDX module functionality and parameters, and retrieve the trusted platform topology and CMR information as previously checked by MCHECK.
What's the relationship between TDMR and CMR?
TDMRs are very similar to CMRs and share many characteristics:

| | CMRs | TDMRs |
|---|---|---|
| Size | Multiple of 4KB | Multiple of 1GB |
| Power of 2 | Not required | Not required |
| Can overlap? | No | No |
| Scope | Platform | Platform |
| Soft configuration | Yes | Yes |
| Physical or Virtual | Physical | Physical |
Every TDMR page must reside within a CMR. There is no requirement for TDMRs to cover all CMRs.
- During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees.
- The KVM should decide on a set of TDMRs based on the CMR information.
- The KVM should then call the TDH.SYS.CONFIG function and pass TDMR information with other configuration information.
- The KVM should then use the TDH.SYS.TDMR.INIT function to initialize the TDMRs and their associated control structures.
TDX reports a list of CMR to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime.
SEAM module / TDX module
Identical.
What's the difference between SEAMRR and CMR?
| | SEAMRR | CMR |
|---|---|---|
| Intention | Loading and executing the Intel TDX module | Holding TDX memory pages encrypted with a private HKID |
| Configured by | BIOS | BIOS |
| It is a | Register | Table |
| P or V | Physical range | Physical range |
SEAMRR is for memory range for loading and executing the Intel TDX module.
MCHECK stores the CMR table in a pre-defined location in SEAMRR’s SEAMCFG region so it can be read later and trusted by the Intel TDX module.
Does TDX module know the content of KET(Key Encryption Table)?
What is a service TD?
// currently only migtd is supported
enum kvm_tdx_servtd_type {
KVM_TDX_SERVTD_TYPE_MIGTD = 0,
KVM_TDX_SERVTD_TYPE_MAX,
};
Service TD to target TD binding relationship is many-to-many
- Multiple service TDs of different types may be bound to a single target TD. (One target TD can have no more than one service TD of the same type.)
- Multiple target TDs may be bound to a single service TD.
What is ephemeral key?
It is just another name for the key in MKTME to encrypt TD pages.
TDVMCALL
TDVMCALL: the guest calls a host VMM service.
- It is a TDCALL, the function name is TDG.VP.VMCALL
- The call is forwarded by TDX module to the host VMM (e.g., KVM), so it is a hypercall implemented in the TDX context.
- EXIT_REASON_TDCALL
static int vt_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath)
{
if (is_td_vcpu(vcpu))
return tdx_handle_exit(vcpu, fastpath);
return vmx_handle_exit(vcpu, fastpath);
}
Actually, every TDVMCALL is triggered by a TDCALL(TDG.VP.VMCALL); when the flow passes from the TDX module to the hypervisor, it is called a TDVMCALL.
TDVPS.LAST_TD_EXIT has a name TDVMCALL which denotes last TD exit was due to a TDG.VP.VMCALL. On the next TD entry, most GPR and all XMM state will be forwarded to the guest TD from the host VMM.
For more, pls. refer to TDX Spec: Sec 2.9.
All the interfaces can be seen TDX Spec: Sec 2.9.1.
Relationship between PAMT and Secure EPT
Control Structures
TDVPS, TDVPR, TDVPX?
| | TDVPS | TDVPR | TDVPX |
|---|---|---|---|
| Name | State | Root | Extension |
| Pages | Multi | 1 | Multi |
| Page type | Mixed | TDVPR page | TDCX pages |
TDVPR is the root (first) page of a TDVPS.
TDVPX are the non-root pages of a TDVPS.
TDCX: 4KB physical pages that are intended to hold parts of a multi-page control structure. (It is a page type)
TDVPS includes the VMCS of the TD, so I think TDVPS can be seen as the superset to the VMCS.
What's the difference between TD Root (TDR), TDCS, TDVPS?
It seems TDVPS is like VMCS, they both control a vcpu.
The Intel TDX module is designed to load CPU state from the TDVPS structure and perform VM entry to go into TDX non-root mode. When a TD exit is triggered, the Intel TDX module is designed to save CPU state into the TDVPS structure and load the CPU state that was saved on TD entry.
| | TDR | TDCS | TDVPS |
|---|---|---|---|
| Meaning | Trust Domain Root | Trust Domain Control Structure | Trust Domain Virtual Processor State |
| Scope | TD | TD | VCPU |
| Controls | Key management and build/teardown process | The operation of a guest TD | The operation and state of a guest TD virtual processor |
| Pages | 1 | Multi | Multi |
| Encrypted by | Global private HKID | Guest's private HKID (TDR/TDCX) | Guest's private HKID (TDVPR/TDCX) |
For more, pls. refer to TDX Spec: Table 2.2: TDX-Managed Control Structures Overview.
TDX host kernel code learning
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption. It is assumed that the TDX host patch series implements the necessary functions.
TDX_MODULE_CALL
/ seamcall()
/ kvm_seamcall()
tdcall and seamcall are two different instructions, but the ABIs used with them are similar, so TDX_MODULE_CALL is defined to handle both cases:
- with TDX_MODULE_CALL host=1, it emits seamcall;
- with TDX_MODULE_CALL host=0, it emits tdcall.
As we can see, the __seamcall() function is implemented as follows, and __tdx_module_call_asm is the TDCALL:
SYM_FUNC_START(__seamcall)
FRAME_BEGIN
TDX_MODULE_CALL host=1
FRAME_END
RET
SYM_FUNC_END(__seamcall)
SYM_FUNC_START(__tdx_module_call_asm)
FRAME_BEGIN
TDX_MODULE_CALL host=0
FRAME_END
RET
SYM_FUNC_END(__tdx_module_call_asm)
__seamcall() is called from both seamcall() and kvm_seamcall():
// used mainly by the kernel
seamcall()
// adds some struct wrapping
__seamcall()
TDX_MODULE_CALL host=1
// used mainly by KVM
kvm_seamcall()
__seamcall()
//...
Why design two wrapper functions?
KVM TDX basic feature support (by Isaku)
Note that MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the same value for all VCPUs of the same TD.
The CPU translates shared GPAs using the usual EPT or "Shared EPT" (in this document), which resides in KVM memory. The Shared EPT is directly managed by the host VMM - the same as with the current VMX.
This part I haven't fully understood:
Since execution of such interface functions takes much longer than accessing memory directly, in KVM we use the existing TDP code to mirror the Secure EPT for the TD. This way, we can effectively walk the Secure EPT without using the TDX interface functions.
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if the guest physical address is private (the bit is cleared) or shared (the bit is set). The bits are called stolen bits.
Because it's costly to access secure EPT during walking EPTs with SEAMCALLs for the private guest physical address, another private EPT is used as a shadow of Secure-EPT with the existing logic at the cost of extra memory.
Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx.
Dependency:
The assumed APIs the TDX host patch series provides are
- int seamrr_enabled()
Check if the required CPU feature (SEAM mode) is available. This only checks CPU feature availability; at this point, the TDX module may not be ready for KVM to use.
- int init_tdx(void);
Initialization of TDX module so that the TDX module is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
Return the system wide information about the TDX module. NULL if the TDX
isn't initialized.
- u32 tdx_get_global_keyid(void);
Return global key id that is used for the TDX module itself.
- int tdx_keyid_alloc(void);
Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
Free HKID for guest TD.
Tear Down
As part of the TD teardown process, the VMM needs to put the TD into a TD_TEARDOWN state, as described in 6.3. This is a non-recoverable state.
TD pages can then be reclaimed, as long as the TDR page is the last one to be reclaimed.
For the TDR page, the intention is for the host VMM to call TDH.PHYMEM.PAGE.WBINVD after calling TDH.PHYMEM.PAGE.RECLAIM.
Functions such as TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE are designed to remove TD private pages and Secure EPT pages, respectively.
MKTME
Why use multiple keys; is one key not enough?
Tenants can set up their own keys to encrypt their VMs.
Config
Misc
-no-hpet is a required option to boot a TD.
q35 is the required machine type to boot a TD.
q35 affects the virtio network card and can leave an Ubuntu guest (Ubuntu 16 and Ubuntu 22 were tested) without network access.
Segment fault, lib.so.6 on ubuntu22
Add "noccfilter" to the guest command line.
Apply the guest kernel patch: x86/tdx: Virtualize CPUID leaf 0x2 · intel-innersource/os.linux.cloud.mvp.kernel-dev@a2bc4b6
Ubuntu 22 doesn't enable the network card
ip a
# do not open in tmux
sudo vim /etc/netplan/01-netcfg.yaml
sudo netplan apply
network:
version: 2
ethernets:
enp0s1:
dhcp4: true
dhcp6: false
Ubuntu 22.04 LTS : Configure DHCP Client : Server World
This only needs to be configured once; it persists across reboots.
TDCALL
TDCALL is an instruction. RAX to select the leaf.
TDCALL(guest interface): used by the guest TD software (in TDX non-root mode) to invoke guest-side TDX functions.
- From: Guest TD
- To: TDX Module
To find the leaf functions: *ABI: 5.4.1. TDCALL Instruction (Common) | Guest-Side (TDCALL) Interface Functions | Interface Functions*; there is a table that lists all the leaves.
TDG.VP.VMCALL (TDCALL Leaf (RAX) = 0)
ABI: 7.5.29. TDG.VP.VMCALL Leaf: GHCI is totally for this leaf.
R11 indicates the sub function.
TDG.VP.VMCALL <SetupEventNotifyInterrupt> (sub-function (R11) = 0x10004)
The guest TD may request that the VMM specify which interrupt vector to use as an event-notify vector.
Example of an operation that can use the event notify is the VMM signaling a device removal to the TD, in response to which a TD may unload a device driver.
The VMM should use the SEAMCALL[TDWRVPS] leaf to inject an interrupt at the requested interrupt vector into the TD VCPU that executed TDG.VP.VMCALL<SetupEventNotifyInterrupt>, via the posted-interrupt descriptor.
TDG.VP.VMCALL <MapGPA> (sub-function (R11) = 0x10001)
Please search in GHCI 3.2.
Request the host VMM to map a GPA range as private- or shared-memory mappings. The guest indicates whether a shared- or private-page mapping is desired via the Shared bit in the start GPA of the address range.
Note that this can convert in both directions: the guest can request that a private mapping be converted to shared.
The aim is for the VMM to use TDH.MEM.PAGE.AUG to add the GPA(s) to the TD as pending, private mapping(s) in the Secure EPT. When the VMM responds to this TDG.VP.VMCALL with success, the TD is expected to execute TDG.MEM.PAGE.ACCEPT to complete the process and make the page(s) usable.
So this is not equivalent to TDG.MEM.PAGE.ACCEPT: its purpose is for the guest to tell the VMM to AUG first, so that the guest itself can then ACCEPT.
Let's look at the code in the TDX guest kernel:
#define TDVMCALL_MAP_GPA 0x10001
tdx_enc_status_changed
tdx_enc_status_changed_phys
_tdx_hypercall(TDVMCALL_MAP_GPA, ...)
tdx_accept_memory
tdx_enc_status_changed_phys
_tdx_hypercall(TDVMCALL_MAP_GPA, ...)
TDG.VP.VMCALL <Service> (sub-function (R11) = 0x10005) / tdx_handle_service()
/ struct tdvmcall_service
KVM
The <Service>
means this TDCALL is from a service TD, not a normal TD.
Service is identified by the GUID
in the command buffer.
A service command is identified by the command field in the data field of the command buffer.
Command/Response Buffer (CRB)
This buffer is allocated by TD and shared with KVM:
- The command buffer is filled by the service TD with commands for KVM to handle, and
- the response buffer is filled by the KVM to respond to the service TD.
- These 2 buffers shouldn't be private. (Because we need KVM to get the information!)
When receiving the TDG.VP.VMCALL
, KVM allocates 2 host buffers of the same size as the command buffer and response buffer, and copies the commands into the host side buffer. When the command handling is done, the response data in the KVM allocated response buffer are copied to the service TD shared response buffer. This avoids the inconvenience of direct accessing to userspace memory. (你可能会问,我们明明是从 guest TD 直接 TDVMCALL exit 到 KVM 里的,和 QEMU 没什么关系,为什么要从 userspace copy 数据呢?这是因为 guest TD 传进来的这个 shared buffer 本质不还是被 QEMU 所 handle 的处于 userspace 的内存区域吗,所以我们才可以这么进行 copy,因为 guest 的 memory 就是位于 userspace 的 memory)。
Can be async, which means KVM can return immediately and interrupt the guest when the response is ready. This is controlled by the R14 register of the command buffer.
The memory layout of the CRB (as defined by the GHCI) is:
struct tdvmcall_service {
guid_t guid;
// length of the entire CRB (i.e. the size of this structure)
uint32_t length;
uint32_t status;
uint8_t data[0];
};
TDG.VP.VMCALL <Service.Query>
The Query service currently only has a query command.
Allows the service TD to query if a service handling is supported by KVM.
TDG.VP.VMCALL <Service.MigTD>
The MigTD service currently has a bunch of commands supported.
- WaitForRequest: check with KVM whether there is an operation to perform on the MigTD side.
- ReportStatus: perform the operation and report its status back to KVM.
This is used to allow MigTD to get the migration information from VMM.
handle_tdvmcall
TDG_VP_VMCALL_SERVICE
tdx_handle_service
static int tdx_handle_service(struct kvm_vcpu *vcpu)
{
struct kvm *kvm = vcpu->kvm;
struct kvm_tdx *tdx = to_kvm_tdx(kvm);
// the CRB addresses are passed in through registers
gpa_t cmd_gpa = tdvmcall_a0_read(vcpu) & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));
gpa_t resp_gpa = tdvmcall_a1_read(vcpu) & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));
uint64_t nvector = tdvmcall_a2_read(vcpu);
struct tdvmcall_service *cmd_buf, *resp_buf;
enum tdvmcall_service_id service_id;
bool need_block = false;
int ret = 1;
unsigned long tdvmcall_ret = TDG_VP_VMCALL_INVALID_OPERAND;
// the CRB must not be private memory
if (kvm_mem_is_private(kvm, gpa_to_gfn(cmd_gpa)) ||
kvm_mem_is_private(kvm, gpa_to_gfn(resp_gpa))) {
pr_warn("%s: cmd or resp buffer is private\n", __func__);
tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
goto err_cmd;
}
// as described above, allocate 2 host buffers
cmd_buf = tdvmcall_servbuf_alloc(vcpu, cmd_gpa);
resp_buf = tdvmcall_servbuf_alloc(vcpu, resp_gpa);
resp_buf->length = sizeof(struct tdvmcall_service);
service_id = tdvmcall_get_service_id(cmd_buf->guid);
switch (service_id) {
case TDVMCALL_SERVICE_ID_QUERY:
tdx_handle_service_query(cmd_buf, resp_buf);
break;
case TDVMCALL_SERVICE_ID_MIGTD:
if (nvector) {
pr_warn("%s: interrupt not supported, nvector %lld\n",
__func__, nvector);
nvector = 0;
break;
}
need_block = tdx_handle_service_migtd(tdx, cmd_buf, resp_buf);
break;
case TDVMCALL_SERVICE_ID_VTPM:
case TDVMCALL_SERVICE_ID_VTPMTD:
case TDVMCALL_SERVICE_ID_TDCM:
case TDVMCALL_SERVICE_ID_TPA:
case TDVMCALL_SERVICE_ID_SPDM:
ret = 0;
break;
default:
resp_buf->status = TDVMCALL_SERVICE_S_UNSUPP;
pr_warn("%s: unsupported service type\n", __func__);
}
if (ret == 0) {
/* user handles the service and update the guest status buf */
ret = tdx_vp_vmcall_to_user(vcpu);
kfree(resp_buf);
} else {
/* Update the guest status buf and free the host buf */
tdvmcall_status_copy_and_free(resp_buf, vcpu, resp_gpa);
tdvmcall_ret = TDG_VP_VMCALL_SUCCESS;
}
err_status:
kfree(cmd_buf);
if (need_block && !nvector)
return kvm_emulate_halt_noskip(vcpu);
err_cmd:
if (ret) {
tdvmcall_set_return_code(vcpu, tdvmcall_ret);
if (nvector)
tdx_inject_notification(vcpu, nvector);
}
return ret;
}
SEAMCALL
Used by the host VMM to invoke host-side TDX interface functions.
- From: Host VMM (e.g., KVM)
- To: TDX Module
All SEAMCALL leaves
To find the leaf functions: ABI: 5.3.1. SEAMCALL Instruction (Common); the table there (Table 5.4: SEAMCALL Instruction Leaf Numbers Definition) lists all the leaves.
SEAMCALL Completion Status Codes
64-bit, returned in RAX.
ABI 3.1.3. Function Completion Status Codes
TDH.MNG.KEY.CONFIG (SEAMCALL Leaf 8)
Configure the TD private key on a single package.
Input is the TDR physical address with HKID set to 0.
A CPU-generated random key is used. The operation may fail due to lack of entropy.
A KET entry in the private HKIDs range is configured per package by KVM using this function.
Why? If it is set to 0, how does MKTME build the connection between the HKID and the key?
The HKID is written into the TDR during the TDH.MNG.CREATE function. So the HKID is in the TDR's content, not in the physical address.
TD-scope key management fields are held in TDR. They include the key state, ephemeral private HKID and key information, and a bitmap for tracking key configuration.
TDH.MEM.SEPT.ADD
Add and map 4KB SEPT pages to a TD. This is not for mapping a GPA to an HPA; it adds pages that hold the contents of SEPT page tables.
This is the non-leaf-entry counterpart of TDH.MEM.PAGE.ADD, which handles leaf entries:
- TDH.MEM.PAGE.ADD adds the actual TD guest page that a leaf entry maps to;
- TDH.MEM.SEPT.ADD adds the page, to be used by the SEPT, that a non-leaf entry maps to.
TDH.MEM.SEPT.ADD
adds a set of 4KB Secure EPT pages to a TD and maps them to the provided GPA.
TDH.MEM.SEPT.ADD
initializes the SEPT pages to hold 512 free entries using the TD’s ephemeral private key.
In the code this is mainly invoked from tdx_sept_link_private_spt(), which is very simple: it just issues this SEAMCALL to add the page-table page to the TDX module and map the given GPA to it.
The main inputs are a GPA and an HPA:
- the HPA is the SEPT page to ADD;
- the GPA is the address to map.
The effect is to make the SPTE at the given GPA point to the HPA of the added page.
The benefit of designing the API this way is flexibility. It can:
- add an intermediate page-table mapping, pointing to the HPA of the next-level page-table page;
- also add a last-level mapping, pointing to the HPA the GPA should ultimately map to. (My guess; in practice TDH.MEM.PAGE.ADD and TDH.MEM.PAGE.AUG handle last-level mappings: once the TD is finalized, pages are added with TDH.MEM.PAGE.AUG; before that, with TDH.MEM.PAGE.ADD.)
The flow of this SEAMCALL is roughly:
- Walk the SEPT based on the GPA and level and find the SEPT entry
- …
- Initialize the new SEPT page, indicating 512 entries in the FREE state
- Update the parent SEPT entry with the new SEPT page HPA.
- Increment TDR.CHLDCNT.
- …
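A minimal sketch of the caller mentioned above, modeled on the KVM TDX patches; error codes and helper names may differ between patch revisions:

static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, void *private_spt)
{
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_module_output out;
	u64 err;

	/* TDH.MEM.SEPT.ADD: hand the page-table page to the TDX module
	 * and hook it into the SEPT at this GPA/level. */
	err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gfn_to_gpa(gfn), tdx_level,
			       __pa(private_spt), &out);
	if (err == TDX_ERROR_SEPT_BUSY)
		return -EAGAIN;
	if (KVM_BUG_ON(err, kvm))
		return -EIO;
	return 0;
}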
TDH.PHYMEM.PAGE.WBINVD
Write back and invalidate all cache lines associated with the specified memory page.
TDH.MEM.SEPT.REMOVE
TDH.MEM.SEPT.REMOVE removes an empty Secure EPT page or pages, with all 512 entries marked as FREE, from the TD’s Secure EPT trees.
- Walk the L1 Secure EPT based on the GPA operand and find the non-leaf SEPT entry of the SEPT page to be removed.
- Scan the L1 Secure EPT page content and check all 512 entries are FREE. If passed, set the parent L1 Secure EPT entry to FREE.
- Atomically decrement TDR.CHLDCNT.
The main input is just the GPA.
TDH.MEM.PAGE.ADD
Add a 4KB private page to a TD, mapped to the specified GPA, filled with the given page image.
Input:
- GPA: the address to map into the guest
- HPA: HPA of the target page to be added to the TD
Both AUG and ADD increment TDR.CHLDCNT. Note that ADD adds pages at build time, whereas AUG adds them dynamically.
This SEAMCALL updates the SPTE to point to the given HPA.
The VMM presumably also keeps its own copy of this GPA->HPA mapping.
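Hedged sketch of issuing this SEAMCALL, modeled on tdh_mem_page_add() in the KVM TDX patches; note that ADD also takes a source page whose image is copied into the new private page (measurement needs a separate TDH.MR.EXTEND pass):

/* Sketch (modeled on the KVM TDX patches): add one 4KB page at build time. */
static u64 td_page_add_sketch(struct kvm_tdx *kvm_tdx, gpa_t gpa,
			      hpa_t hpa, hpa_t source)
{
	struct tdx_module_output out;

	/* TDH.MEM.PAGE.ADD copies the source image into the new private
	 * page and points the leaf SPTE at hpa. */
	return tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source, &out);
}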
TDH.MEM.PAGE.AUG
Dynamically add a 4KB or a 2MB private page to an initialized TD, mapped to the specified GPAs.
This is also known as shared-to-private conversion: it merely adds the page to the TDX module, making the page private and encryptable.
Both AUG and ADD increment TDR.CHLDCNT.
Input:
- GPA: the address to map into the guest
- HPA: HPA of the target page to be added to the TD
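Hedged sketch, again modeled on the KVM TDX patches; unlike ADD, AUG takes no source image:

/* Sketch: dynamically add a page; it stays PENDING until the guest
 * runs TDG.MEM.PAGE.ACCEPT on it. */
static u64 td_page_aug_sketch(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa)
{
	struct tdx_module_output out;

	return tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
}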
TDH.MEM.PAGE.REMOVE
Remove a GPA-mapped 4KB, 2MB or 1GB private page from a TD.
This is likewise known as private-to-shared conversion; as the name suggests, it merely removes the page from the TDX module, so it no longer needs to be encrypted.
Input:
- GPA: the GPA of the page to remove from the guest
Process:
- Walk the Secure EPT based on the GPA, and find the leaf entry of the page to be removed.
- Set the SEPT entry state to FREE.
- Atomically decrement TDR.CHLDCNT by 1, 512 or 512*512 (for a 4KB, 2MB or 1GB page respectively).
- Free the physical page: Set the PAMT entry of the removed TD private page to PT_NDA.
How does it differ from RECLAIM?
- RECLAIM is executed during TD teardown, while REMOVE is executed while the TD is running; a bit like the relationship between ADD and AUG.
- TDH.MEM.PAGE.REMOVE sets the page's SEPT entry to FREE, i.e. it clears the mapping.
In the end, both set the page's state in the PAMT to PT_NDA.
This SEAMCALL is called from tdx_sept_drop_private_spte().
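A condensed sketch of that caller (again from the out-of-tree patches): it picks REMOVE while the TD is live (HKID still assigned) and falls back to RECLAIM during teardown, matching the REMOVE-vs-RECLAIM distinction above:

static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
                                       enum pg_level level, kvm_pfn_t pfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn);
        struct tdx_module_output out;
        u64 err;

        if (is_hkid_assigned(kvm_tdx)) {
                /* TD is running: REMOVE (requires prior TLB tracking)... */
                err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa,
                                          pg_level_to_tdx_sept_level(level), &out);
                /* ...then TDH.PHYMEM.PAGE.WBINVD flushes lines under the TD's HKID. */
        } else {
                /* TD is in teardown: RECLAIM instead. */
                tdx_reclaim_page(pfn_to_hpa(pfn), level, false, 0);
        }
}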
TDH.PHYMEM.PAGE.RECLAIM
/ tdh_reclaim_page()
/ KVM
Can reclaim pages only if the owner TD is in the TD_TEARDOWN state. The difference from PAGE.REMOVE is that PAGE.REMOVE is called while the TD is running, whereas this one is called during teardown. So unmapping a page does not require running both SEAMCALLs; one of the two is chosen depending on the situation.
Reclaim a physical 4KB, 2MB or 1GB TD-owned page from a TD.
Note: reclaiming only tells the TDX module that we no longer use the page; after reclaiming we still need to free it in the kernel.
Input:
- HPA: HPA of the page to reclaim.
Process:
- Check that the target page metadata in the PAMT are correct (PT must NOT be PT_NDA nor PT_RSVD). This means we cannot REMOVE a page first and then RECLAIM it, since REMOVE already sets PT_NDA.
- Update the PAMT entry of the reclaimed page to PT_NDA.
Why isn't the TDX module designed to automatically reclaim all of a TD's pages when its TDR is freed, instead of requiring the host to call this manually during teardown? Unknown; there may be a reason.
Compared with PAGE.REMOVE, PAGE.RECLAIM does not touch the SEPT. Perhaps because it runs during the teardown process, touching the SEPT would be pointless; it is going to be cleared sooner or later anyway.
TDH.PHYMEM.PAGE.RECLAIM
can reclaim pages only if the owner TD is in the TD_TEARDOWN state.
static int tdx_reclaim_page(hpa_t pa, enum pg_level level,
bool do_wb, u16 hkid)
{
struct tdx_module_output out;
u64 err;
do {
err = tdh_phymem_page_reclaim(pa, &out);
/*
* TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in
* shutdown state, i.e. while destructing the TD.
* TDH.PHYMEM.PAGE.RECLAIM requires the TDR and the target page.
* Because we're destructing the TD, contention on the TDR is rare.
*/
} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
if (err & TDX_SEAMCALL_STATUS_MASK)
return -EIO;
/* out.r8 == tdx sept page level */
WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level));
if (do_wb && level == PG_LEVEL_4K) {
/*
* Only the TDR page gets into this path. No contention is expected
* because it is the last page of the TD.
*/
err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
if (WARN_ON_ONCE(err)) {
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
return -EIO;
}
}
tdx_set_page_present_level(pa, level);
tdx_clear_page(pa, KVM_HPAGE_SIZE(level));
return 0;
}
Process
TDX module loading process
Software Use Cases
Intel TDX Module Lifecycle
Intel TDX Module Platform-Scope Initialization
Table Typical Intel TDX Module Platform-Scope Initialization Sequence
When building the kernel, set CONFIG_INTEL_TDX_HOST=y.
When booting the kernel, add the kernel command-line parameter tdx_host=on.
When loading KVM, make sure sudo cat /sys/module/kvm_intel/parameters/tdx prints Y.
The TDX module has 2 possible names:
- libtdx.so: loaded from the initrd
- TDX-SEAM.so: loaded from the IFWI
You can rename them to choose a different loading method.
If "UEFI SEAM Load" is not enabled:
- SEAMLdr and libtdx.so are both in the initrd.
else:
- If SEAMLdr and TDX-SEAM.so are both in the ESP, SEAMLdr will load the TDX module from the ESP.
- If not, SEAMLdr and TDX-SEAM.so are built into the IFWI. (This method is called FV.)
With the old IFWI, the Linux kernel will still help load the TDX SEAM module from /lib/firmware/intel-seam/.
When loading KVM (booting the kernel also loads KVM), if /lib/firmware/intel-seam/ contains a module, it will be loaded and will override the existing one (this has the highest priority). The corresponding file names are: libtdx.bin, libtdx.bin.sigstruct, np-seamldr.acm.
TD build / create process
Table 3.3: Typical TD Build Sequence.
To use a TD, we should first build it.
KVM can create a new guest TD by allocating and initializing a TDR control structure using the TDH.MNG.CREATE
function. As an input, the host VMM assigns the TD an HKID. A TD is identified by bits 51:12 of the physical address of its TDR page.
static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
{
// ...
ret = tdx_guest_keyid_alloc();
// ...
va = __get_free_page(GFP_KERNEL_ACCOUNT);
tdr_pa = __pa(va);
// ...
err = tdh_mng_create(tdr_pa, kvm_tdx->hkid); // Create the TDR and generate the TD’s random ephemeral key.
// ...
}
KVM then programs the HKID and encryption key into the MKTME encryption engines using the TDH.MNG.KEY.CONFIG
function on each package.
Build the TD Control Structure (TDCS) by adding control structure pages, using the TDH.MNG.ADDCX function, and initialize using the TDH.MNG.INIT function.
It can then build the Secure EPT tree using the TDH.MEM.SEPT.ADD function and add the initial set of TD-private pages using the TDH.MEM.PAGE.ADD function.
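Continuing the __tdx_td_init() sketch above, the ADDCX/INIT steps look roughly like this (nr_tdcs_pages and tdcs_pa are patch-series names):

static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
{
    // ... continued from the snippet above ...
    for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
        err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]); // add one TDCS page
        // ...
    }
    // ...
    err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out); // set the TD's immutable properties
    // ...
}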
//...
kvm_init
kvm_arch_init
kvm_confidential_guest_init
tdx_kvm_init
get_tdx_capabilities
r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); // cmd is KVM_TDX_CAPABILITIES
r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); // only if the above line returned -EINVAL
// KVM
kvm_dev_ioctl // the system-wide ioctl
kvm_arch_dev_ioctl
// tdx_dev_ioctl(), pass the TDX system-wide information to user because currently only KVM_TDX_CAPABILITIES is supported
r = static_call(kvm_x86_dev_mem_enc_ioctl)(argp);
vt_mem_enc_ioctl
tdx_vm_ioctl
tdx_td_init
__tdx_td_init
TD memory allocation
Guest uses TDG.VP.VMCALL to request GPA range allocation.
KVM builds the SEPT with TDH.MEM.SEPT.ADD.
KVM adds pages with TDH.MEM.PAGE.AUG.
KVM re-enters the TD with TDH.VP.ENTER.
Guest accepts the pages with TDG.MEM.PAGE.ACCEPT.
TD memory removal
A bit more complicated than allocation: the VMM must first block the GPA range (TDH.MEM.RANGE.BLOCK) and perform TLB tracking (TDH.MEM.TRACK plus IPIs) before it can remove the pages (TDH.MEM.PAGE.REMOVE); see the TLB tracking section below.
Interrupt Handling in TDX
TDX Base Spec: Interrupt Handling and APIC Virtualization
TDX supports only posted interrupt. No LAPIC emulation.
Guest TDs must use virtualized x2APIC mode. xAPIC mode (using memory mapped APIC access) is not allowed. The guest TD cannot disable the APIC.
Guest TDs are allowed access to a subset of the virtual APIC registers, which are virtualized by the CPU. Access to other registers can cause a #VE. The guest TD is expected to use a software protocol over TDG.VP.VMCALL (GHCI) to request such operations from the VMM. For which accesses raise a #VE, see TDX Module Base: Figure 11.3: Virtual APIC Access by Guest TD.
Non-NMI interrupt injection into the guest TD by the host VMM or the IOMMU can be done through the posted-interrupt mechanism. If there are pending interrupts in the PID, the VMM can post a self IPI with the notify vector prior to TD entry.
The PID resides in a shared page. If needed, the guest TD may use a software protocol over TDCALL(TDG.VP.VMCALL)
to ask the VMM to stop interrupt delivery through the PID.
The TD VMCS posted interrupt execution controls are reset to their initial values when the TD is migrated. The host VMM on the destination platform must set them in order to use posted interrupts.
tdx_mig_import_state_vp
tdx_td_vcpu_post_init
// Write to TD VMCS's posted-interrupt notification vector
td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
// Write to TD VMCS's posted-interrupt descriptor address
td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
// Enable processing posted-interrupt
// If this control is 1, the processor treats interrupts with the posted-interrupt notification vector
// specially, updating the virtual-APIC page with posted-interrupt requests.
td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
Similar to VMX's vmx->pi_desc, TDX defines its own PI member, tdx->pi_desc. Moreover, pi_desc sits at the same offset in vcpu_vmx and vcpu_tdx:
static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_vmx, pi_desc));
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_vmx, pi_wakeup_list));
#ifdef CONFIG_INTEL_TDX_HOST
static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_tdx, pi_desc));
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_tdx, pi_wakeup_list));
#endif
tdx_deliver_interrupt()
KVM
vt_deliver_interrupt
tdx_deliver_interrupt
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
kvm_vcpu_trigger_posted_interrupt
// send a posted-interrupt notification vector IPI to self
__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
__vmx_deliver_posted_interrupt()
KVM
// vector is not the PI notification vector; it is the interrupt vector we want to inject.
static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, struct pi_desc *pi_desc, int vector)
{
    // Set the vector's bit in the PIR.
    pi_test_and_set_pir(vector, pi_desc);
// If a previous notification has sent the IPI, nothing to do.
    // Set PID.ON.
pi_test_and_set_on(pi_desc);
//...
    // Send the notification IPI to self.
kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
}
kvm_vcpu_trigger_posted_interrupt()
KVM
As the posted-interrupt part of the spec describes, to trigger a posted interrupt we need to send an IPI with the PI notification vector to the target CPU; this function does exactly that.
static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, int pi_vec)
{
//...
if (vcpu->mode == IN_GUEST_MODE) {
/*
* The vector of the virtual interrupt has already been set in the PIR.
* Send a notification event to deliver the virtual interrupt
* unless the vCPU is the currently running vCPU, i.e. the
* event is being sent from a fastpath VM-Exit handler, in
* which case the PIR will be synced to the vIRR before
* re-entering the guest.
*
* When the target is not the running vCPU, the following
* possibilities emerge:
*
* Case 1: vCPU stays in non-root mode. Sending a notification
* event posts the interrupt to the vCPU.
*
* Case 2: vCPU exits to root mode and is still runnable. The
* PIR will be synced to the vIRR before re-entering the guest.
* Sending a notification event is ok as the host IRQ handler
* will ignore the spurious event.
*
* Case 3: vCPU exits to root mode and is blocked. vcpu_block()
* has already synced PIR to vIRR and never blocks the vCPU if
* the vIRR is not empty. Therefore, a blocked vCPU here does
* not wait for any requested interrupts in PIR, and sending a
* notification event also results in a benign, spurious event.
*/
if (vcpu != kvm_get_running_vcpu())
__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
return;
}
/*
* Wake the vCPU in case it is blocking, otherwise do nothing as KVM will grab the highest priority pending
* IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
*/
kvm_vcpu_wake_up(vcpu);
}
tdx_protected_apic_has_interrupt()
KVM
TDX uses x2APIC by default. This function mainly checks whether there is currently a pending interrupt.
kvm_arch_vcpu_runnable
kvm_vcpu_has_events
if (kvm_cpu_has_interrupt())
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
    // Are there unhandled interrupts in the posted-interrupt PIR?
bool ret = pi_has_pending_interrupt(vcpu);
union tdx_vcpu_state_details details;
struct vcpu_tdx *tdx = to_tdx(vcpu);
    // If there is a pending PI interrupt, or the vcpu is not halted (all other states are interruptible states such as INIT, SIPI, etc.), return true: there is an interrupt.
if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
return true;
if (tdx->interrupt_disabled_hlt)
return false;
/*
* This is for the case where the virtual interrupt is recognized,
* i.e. set in vmcs.RVI, between the STI and "HLT". KVM doesn't have
* access to RVI and the interrupt is no longer in the PID (because it
* was "recognized". It doesn't get delivered in the guest because the
* TDCALL completes before interrupts are enabled.
*
* The TDX module sets RVI while in an STI interrupt shadow.
* - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
* The interrupt shadow at this point is gone.
* - It knows that there is an interrupt that can be delivered
* (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
* matter)
* - It forwards the TDExit nevertheless, to a clueless hypervisor that
* has no way to glean either RVI or PPR.
*/
if (xchg(&tdx->buggy_hlt_workaround, 0))
return true;
/*
* This is needed for device assignment. Interrupts can arrive from
* the assigned devices. Because tdx.buggy_hlt_workaround can't be set
* by VMM, use TDX SEAMCALL to query pending interrupts.
*/
details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
return !!details.vmxip;
}
Memory
A guest must TDG.MEM.PAGE.ACCEPT a range of private memory before using it.
See Documentation/virt/kvm/tdx-tdp-mmu.rst
.
Since the execution of such interface functions takes much longer time than accessing memory directly, in KVM we use the existing TDP code to mirror the Secure EPT for the TD. And we think there are at least two options today in terms of the timing for executing such SEAMCALLs:
- synchronous, i.e. while walking the TDP page tables, or
- post-walk, i.e. record what needs to be done to the real Secure EPT during the walk, and execute SEAMCALLs later.
tdx_unpin()
KVM
static void tdx_unpin(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, enum pg_level level)
{
//...
for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
put_page(pfn_to_page(pfn + i));
}
Memory mapping in TDX
To map a range of memory into the TDX module, the host kernel first needs to reserve that memory, so that the host knows the HPA of every page and will not hand the range to another process or TD. KVM's AUG can then take (GPA, HPA) to establish the mapping. Such a page therefore has two mappings:
- the host page-table mapping from HVA to HPA;
- the SEPT mapping from GPA to HPA.
So during teardown, when freeing a page such as the TDR or TDVPR, we need:
tdx_reclaim_page(td_page_pa)
free_page((unsigned long)__va(td_page_pa));
For a page used inside the TD:
tdx_reclaim_page(td_page_pa)
tdx_unpin()
to remove it from both mappings; only then is the page truly reclaimed. tdx_reclaim_td_page is implemented exactly this way:
void tdx_reclaim_td_page(unsigned long td_page_pa)
{
//..
tdx_reclaim_page(td_page_pa, PG_LEVEL_4K, false, 0)
free_page((unsigned long)__va(td_page_pa));
//..
}
If we only want to unmap the range from the TDX module, the host still keeps the memory region reserved; the host does not necessarily want to free it as well:
Page promotion (Why?)
Page size promotion is intended to be used by the host VMM to merge 512 pages mapped as 4KB or 2MB into a single page mapped as 2MB or 1GB.
Page demotion (Why?)
Page size demotion is intended to be used by the host VMM to split a page mapped as 1GB or 2MB into 512 pages mapped as 2MB or 4KB, respectively.
TDX MMU TDCALLs
TDG.MEM.PAGE.ACCEPT
TDG.MEM.PAGE.ACCEPT accepts a PENDING private page, previously added by TDH.MEM.PAGE.AUG
, into the TD.
The guest can directly TDG.MEM.PAGE.ACCEPT a memory range; if the range has not yet been AUG/ADDed by the VMM, this triggers an EPT violation and an exit to the VMM, which can use it to record the private/shared bitmap. This is the approach TDVF takes.
Although MapGPA can request either private-to-shared or shared-to-private conversion, ACCEPT can only accept private pages.
The guest TD can accept a dynamically added 4KB or 2MB page (so only pages added by TDH.MEM.PAGE.AUG?) using TDG.MEM.PAGE.ACCEPT.
The guest TD must accept the page using TDG.MEM.PAGE.ACCEPT
before it can access it. A guest TD attempt to access a page that has been dynamically added by TDH.MEM.PAGE.AUG
but has not yet been accepted by TDH.MEM.PAGE.ACCEPT
results in a #VE exception.
Let's look at how this TDCALL is actually used in the guest kernel. It is defined in the guest kernel file arch/x86/include/asm/shared/tdx.h.
#define TDX_ACCEPT_PAGE 6
set_memory_encrypted
__set_memory_enc_dec
__set_memory_enc_pgtable // used for the hypervisors that get informed about "encryption" status via page tables.
x86_platform.guest.enc_status_change_finish
tdx_enc_status_changed
tdx_enc_status_changed_phys
try_accept_one
__tdx_module_call(TDX_ACCEPT_PAGE
tdx_accept_memory
tdx_enc_status_changed_phys
try_accept_one
__tdx_module_call(TDX_ACCEPT_PAGE
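A condensed sketch of try_accept_one() from the upstream guest kernel (arch/x86/coco/tdx/tdx.c; details vary by kernel version). RCX carries the GPA with the accept level encoded in the low bits:

static bool try_accept_one(phys_addr_t *start, unsigned long len,
                           enum pg_level pg_level)
{
        unsigned long accept_size = page_level_size(pg_level);
        u8 page_size;

        if (!IS_ALIGNED(*start, accept_size) || len < accept_size)
                return false;

        /* TDX encodes the accept level in the GPA's low bits: 0=4K, 1=2M, 2=1G. */
        switch (pg_level) {
        case PG_LEVEL_4K: page_size = 0; break;
        case PG_LEVEL_2M: page_size = 1; break;
        case PG_LEVEL_1G: page_size = 2; break;
        default: return false;
        }

        /* TDG.MEM.PAGE.ACCEPT; fails if no mapping exists at this level,
         * so the caller retries with a smaller size. */
        if (__tdx_module_call(TDX_ACCEPT_PAGE, *start | page_size, 0, 0, 0, NULL))
                return false;

        *start += accept_size;
        return true;
}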
TDX MMU SEAMCALLs
TDH.MEM.SEPT.ADD
TDH.MEM.SEPT.REMOVE
TDH.MEM.SEPT.RD
Isn't the SEPT private? Why can it be read?
TDH.MEM.PAGE.ADD
Builds the SEPT entry for a private page at build time.
TDH.MEM.PAGE.AUG
TDH.MEM.PAGE.ADD adds pages at build time, whereas TDH.MEM.PAGE.AUG adds them dynamically.
TDH.MEM.PAGE.REMOVE
TDH.MEM.TRACK
/ TLB tracking in TDX
The goal of TLB tracking is to be able to prove (when needed) that no logical processor holds any cached Secure EPT address translations to a given TD private GPA range. (Don't confuse this with cache invalidation; what is guaranteed here is that no translations remain.)
TLB tracking is required when:
- removing a mapped TD private page (TDH.MEM.PAGE.REMOVE), or
- changing the page mapping size (TDH.MEM.PAGE.PROMOTE)
The sequence typically includes five steps:
- Execute
TDH.MEM.RANGE.BLOCK
on each GPA range, blocking creation of TLB translation to that range. Note that cached translations may still exist at this stage. - Execute
TDH.MEM.TRACK
, advancing the TD’s epoch counter. - Send an IPI to each RLP on which any of the TD’s VCPUs is currently scheduled.
- Upon receiving the IPI, each RLP will TD exit to the VMM. At this point the target GPA ranges are considered tracked. Even though some LPs may still hold TLB entries to the target GPA ranges, the following TD entry is designed to flush them.
- Normally, the host VMM on each RLP will treat the TD exit as spurious and will immediately re-enter the TD.
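Steps 2 and 3 map onto KVM's tdx_track() in the out-of-tree patches; a sketch (the busy-retry and the request flag follow the patch series and may differ by revision):

static void tdx_track(struct kvm *kvm)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        u64 err;

        /* Step 2: TDH.MEM.TRACK advances TDCS.TD_EPOCH. */
        do {
                err = tdh_mem_track(kvm_tdx->tdr_pa);
        } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));

        if (KVM_BUG_ON(err, kvm))
                pr_tdx_error(TDH_MEM_TRACK, err, NULL);

        /*
         * Steps 3/4: kick every vCPU out of guest mode (an IPI). The next
         * TDH.VP.ENTER samples the new epoch and flushes stale translations.
         */
        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}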
TDX Base
9. TD Private Memory Management
9.7. Introduction to TLB Tracking
TDCS.TD_EPOCH, PAMT.BEPOCH, TDCS.BW_EPOCH, TDVPS.VCPU_EPOCH
The TD's TLB epoch counter is TDCS.TD_EPOCH.
TD_EPOCH is advanced by TDH.MEM.TRACK and TDH.EXPORT.PAUSE.
BEPOCH, BW_EPOCH and VCPU_EPOCH are all samples of TD_EPOCH taken at specific moments:
- TDH.MEM.RANGE.BLOCKW samples TD_EPOCH into TDCS.BW_EPOCH;
- TDH.MEM.RANGE.BLOCK samples TD_EPOCH into PAMT.BEPOCH;
- TDH.VP.ENTER samples TD_EPOCH into TDVPS.VCPU_EPOCH.
For every page to be exported, TDH.EXPORT.MEM checks TDCS.BW_EPOCH (the is_tlb_tracked function in the TDX module); it needs to ensure that BW_EPOCH at that point is less than or equal to TDCS.TD_EPOCH.
In short, these fields are mainly used for checking; they probably don't need deeper digging.
TDX shared bit of GPA
TDX repurposes one GPA bit (bit 51 or bit 47, depending on configuration) to indicate whether the GPA is private (if cleared) or shared (if set) with the VMM. If GPA.shared is set, the GPA is covered by the existing conventional EPT pointed to by the EPTP. If GPA.shared is cleared, the GPA is covered by the TDX module, and the VMM has to issue SEAMCALLs to operate on it.
Add a member to remember GPA shared bit for each guest TDs, add address conversion functions between private GPA and shared GPA and test if GPA is private.
Because struct kvm_arch (or struct kvm, which includes struct kvm_arch; see kvm_arch_alloc_vm(), which passes __GFP_ZERO) is zero-cleared when allocated, the new member that remembers the GPA shared bit is guaranteed to be zero with this patch unless it's initialized explicitly.
fault.is_private means that the host page should be obtained from guest_memfd; is_private_gpa() means that the KVM MMU should invoke the private MMU hooks.
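The conversion helpers mentioned in that commit message look roughly like this in the patch series (a sketch; kvm->arch.gfn_shared_mask is the new member):

static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
{
        /* Bit 51 or bit 47 of the GPA, as a gfn mask; 0 for non-TD VMs. */
        return kvm->arch.gfn_shared_mask;
}

static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
{
        return gfn | kvm_gfn_shared_mask(kvm);
}

static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
{
        return gfn & ~kvm_gfn_shared_mask(kvm);
}

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
        gfn_t mask = kvm_gfn_shared_mask(kvm);

        /* Private iff the VM has a shared bit and this GPA has it cleared. */
        return mask && !(gpa_to_gfn(gpa) & mask);
}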
Shared EPT
Why the shared EPT can reside in KVM's memory
Q: A shared page is also encrypted by the MKTME machinery, which means its content cannot be accessed by KVM, so why can the shared EPT reside in KVM's memory?
A: The EPT only handles address translation; MKTME only comes into play when the actual page is accessed.
Secure EPT
The Secure EPT pages are encrypted and integrity-protected with the TD’s ephemeral private key. The Secure EPT is not intended to be directly accessible by any software other than the Intel TDX module.
Secure EPT entry is opaque; KVM may not access it directly. KVM may read a Secure EPT entry information using the TDH.MEM.SEPT.RD interface function.
From the CPU perspective, Secure EPT has the same structure as a legacy VMX EPT.
Can I say that Secure EPT pages and private pages are both encrypted with the same private HKID?
Are the control structures encrypted with another private key?
The control structures are encrypted and integrity-protected with a private key, and managed by Intel TDX functions.
The control structures are encrypted with private keys and HKIDs.
Why use a Secure EPT? What benefit does it have over the shared EPT?
"stolen" Bit from HPA and GPA
The "stolen" bits in the HPA denote the HKID set for this physical page (1 bit denotes shared or private).
The "stolen" bit in the GPA denotes the shared/private status of the page.
Measurement and Attestation
Attestation: can be understood as proving to a challenger that you (the TD) are trustworthy.
TDX uses SGX-Based Attestation, which means SGX should be supported.
Software within the guest TD can use TDG.MR.REPORT, specifying a REPORTDATA value, to generate an integrity-protected TDREPORT_STRUCT, which includes:
- the TD’s measurements,
- the Intel TDX module’s measurements,
- REPORTDATA. This will typically be an asymmetric key that the attestation verifier can use to establish a secure channel or protect sensitive data to be sent to the TD software.
TDREPORT_STRUCT
can ONLY be verified on the local platform via the SGX ENCLU(EVERIFYREPORT2)
instruction.
By design, TDREPORT_STRUCT
CANNOT be verified off platform; it first must be converted into signed Quotes.
What is TDINFO_STRUCT
?
TDINFO_STRUCT
is part of TDREPORT_STRUCT
, you can see the figure in Base: Figure 12.1: UPDATED: TD Measurement Reporting.
For more, see ABI: 3.9.5. UPDATED: TDINFO_STRUCT
Who does the attestation?
Attestation is driven by software inside the TD.
TD attestation is initiated from inside the TD by calling TDG.MR.REPORT
and specifying a REPORTDATA value.
What is mutual TD attestation?
The migration TDs use a TD-quote-based mutual authentication protocol to create a session between them.
MRTD: Build-Time Measurement Register
Helps provide a static measurement of the TD build process and the initial contents of the TD.
The process is:
- TDH.MNG.INIT begins the process by initializing the digest.
- TDH.MEM.PAGE.ADD adds a private page to the TD and inserts its GPA into the MRTD digest calculation.
- Control structure pages (TDR, TDCX and TDVPR) and Secure EPT pages are NOT measured.
- For pages whose data contribute to the TD, that data should be included in the TD measurement via TDH.MR.EXTEND, which inserts the data contained in those pages and their GPAs into the digest calculation. If a page will be wiped and initialized by TD code, the loader may opt not to measure the initial contents.
- The measurement is then completed by TDH.MR.FINALIZE. Once completed, further TDH.MEM.PAGE.ADDs or TDEXTENDs will fail.
From the second and third items we can see that attestation only cares about the data rather than the meta information.
This state is migrated as part of the global immutable state of the TD.
When will we call TDH.MR.EXTEND and when will we call TDH.MEM.PAGE.ADD?
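In the KVM TDX patches, both are driven by the KVM_TDX_INIT_MEM_REGION ioctl: every initial page goes in via TDH.MEM.PAGE.ADD, and TDH.MR.EXTEND is additionally issued only when userspace passes the KVM_TDX_MEASURE_MEMORY_REGION flag (see the IOCTLs section below). A sketch of the measuring helper; TDH.MR.EXTEND hashes one 256-byte chunk per call:

#define TDX_EXTENDMR_CHUNKSIZE 256

static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
{
        struct tdx_module_output out;
        u64 err;
        int i;

        /* Extend MRTD with the page contents, 256 bytes at a time. */
        for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
                err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out);
                if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
                        pr_tdx_error(TDH_MR_EXTEND, err, &out);
                        break;
                }
        }
}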
RTMR: Run-Time Measurement Registers
An array of general-purpose measurement registers made available to the TD software to enable measuring additional logic and data loaded into the TD at run-time.
The RTMR array is initialized to zero on build, and it can be extended at run-time by the guest TD using the TDCALL(TDG.MR.RTMR.EXTEND
) leaf. (Note: TDH.MR.EXTEND
is to extend MRTD).
Migrated as TD’s mutable state.
What is measurement quoting?
To create a remotely verifiable attestation, the TDREPORT_STRUCT should be converted into a Quote signed by a certified Quote signing key.
TDMR
What is TDMR?
8. Physical Memory Management
8.2. TDMR Details
A range of memory, configured by the host VMM, that is covered by PAMT and is intended to hold TD private memory and TD control structures.
- TDMR configuration is "soft" – no hardware range registers are used.
- Each TDMR defines a single physical address range.
- TDMRs cannot overlap with each other.
- TDMRs are configured at platform scope (no separate configuration per package).
TDR and TDVPR are in TDMR, because they are control structures.
TDMRs may contain reserved areas.
Once the PAMT structure for each 1GB block of a TDMR has been initialized by TDH.SYS.TDMR.INIT, the block can be used to hold TD private pages.
Why does a TDMR need to be a multiple of 1GB?
Do TD guest pages reside in TDMRs?
Yes.
Once each 1GB block of TDMR has been initialized by TDH.SYS.TDMR.INIT, it can be used to hold TD private pages.
What is TDMR reserved areas?
13.1.4.2.1. Background: Reserved Areas within TDMRs
Reserved areas are still covered by PAMT. Pages in reserved areas are not used by the Intel TDX module for allocating privately encrypted memory pages.
The physical page is reserved for non-TDX usage. The Intel TDX module will not allow converting this page to any other page type. The page can be used by the host VMM for any purpose.
PAMT (Physical Address Metadata Table)
8.3. PAMT Details
The PAMT is designed to hold the metadata of each page in a TDMR (including page type, page size, assignment to a TD, and other attributes). It controls assignment of physical pages to guest TDs, etc. The PAMT is intended not to be directly accessible to software. It resides in memory allocated by the host VMM on TDX initialization.
Each TDMR is defined as controlled by a (logically) single PAMT.
Encrypted by TDX global private key.
PAMT Entry: A PAMT entry is designed to hold metadata for a single physical page. The page size may be 4KB, 2MB or 1GB.
PAMT Array: Physically, for each TDMR the design includes three arrays of PAMT entries, one for each PAMT level.
PAMT Block: For each 1GB of TDMR physical memory, there is a corresponding PAMT Block. A PAMT Block is logically arranged in a three-level tree structure of PAMT Entries, like a page table: one 1GB entry, 512 2MB entries, and 512 × 512 = 262,144 4KB entries.
So, a Block includes Arrays, which include Entries.
PAMT Page type / page state
The PAMT page type / page state is not the same as the SPTE state; don't confuse the two.
Specifies the corresponding TD private page type (Assigned? Reserved? holding TDR? holding TDCX?, etc…).
ABI
Table 4.25: PAMT Page Type Values
4.7.4. PAMT Page Type (PT) Values
4.7. Physical Memory Management Types
4. Data Types
- PT_NDA: indicates that we have not yet added this page to the TDX module.
HKID
To avoid conflicts with the keyIDs that MKTME uses for itself, TDX HKIDs are allocated starting after MKTME's last keyID.
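A sketch of the allocator KVM uses (tdx_guest_keyid_alloc() was referenced in __tdx_td_init() above; tdx_guest_keyid_start and tdx_nr_guest_keyids come from the BIOS keyID partitioning, and the names follow the patches):

static DEFINE_IDA(tdx_guest_keyid_pool);

int tdx_guest_keyid_alloc(void)
{
        /* HKIDs for TDs start right after the keyIDs reserved for MKTME. */
        return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
                               tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
                               GFP_KERNEL);
}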
IOCTLs
TDX adds a new parameter to the existing ioctl KVM_CHECK_EXTENSION
(based on their initialization, different VMs may have different capabilities; it is thus encouraged to use the VM ioctl to query for capabilities):
- For system-wide: KVM_CAP_VM_TYPES
- For VM-wide: KVM_CAP_VM_TYPES
They return the same thing.
TDX reuses one ioctl, KVM_MEMORY_ENCRYPT_OP
(this ioctl was introduced by AMD when upstreaming SEV), which can suit different scopes:
- system-wide, use wrapper function
tdx_platform_ioctl()
- VM-wide, use wrapper function
tdx_vm_ioctl()
- vcpu-wide, use wrapper function
tdx_vcpu_ioctl()
Parameters
struct kvm_tdx_cmd {
/* enum kvm_tdx_cmd_id, see following */
__u32 id;
/* flags for sub-command. If sub-command doesn't use this, set zero. */
__u32 flags;
__u64 data;
// ...
};
// command id
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0, // only system-wide is supported
KVM_TDX_INIT_VM, // only vm wide, invoked in a lazy style (when initialize VCPU)
KVM_TDX_INIT_VCPU, // only vcpu wide
KVM_TDX_INIT_MEM_REGION, // only vm wide
KVM_TDX_FINALIZE_VM, // only vm wide
KVM_TDX_CMD_NR_MAX,
};
// sub-command
#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
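A hypothetical userspace call, modeled on QEMU's get_tdx_capabilities() shown earlier (the struct layouts come from the TDX patches; error handling elided):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int tdx_get_capabilities(int fd, struct kvm_tdx_capabilities *caps)
{
        struct kvm_tdx_cmd cmd = {
                .id = KVM_TDX_CAPABILITIES,            /* sub-command */
                .flags = 0,                            /* unused by this sub-command */
                .data = (__u64)(unsigned long)caps,    /* out buffer */
        };

        /* fd is /dev/kvm for system-wide, or the VM fd for VM-wide. */
        return ioctl(fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}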
Configs
CONFIG_KVM_PROTECTED_VM
Enable support for KVM-protected VMs. Currently 'protected' means the VM can be backed with restricted/private memory.
CONFIG_KVM_PRIVATE_MEM
CONFIG_RESTRICTEDMEM
How does TDX track dirty pages?
KVM unblocks a page lazily after it has been blocked. When an EPT violation occurs, KVM issues TDH.EXPORT.UNBLOCKW, and the TDX module marks the page as dirty if it has already been exported.
// The place enable the page dirty logging
case KVM_SET_USER_MEMORY_REGION2:
case KVM_SET_USER_MEMORY_REGION: {
kvm_vm_ioctl_set_memory_region
kvm_set_memory_region
__kvm_set_memory_region
kvm_set_memslot
kvm_commit_memory_region
kvm_arch_commit_memory_region
kvm_mmu_slot_apply_flags
kvm_mmu_slot_remove_write_access
kvm_tdp_mmu_wrprot_slot
wrprot_gfn_range
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
tdp_mmu_set_spte_atomic
set_private_spte_present
__set_private_spte_present
private_spte_change_flags
if (was_writable && !is_writable) {
tdx_write_block_private_pages
tdh_export_blockw
kvm_arch_mmu_enable_log_dirty_pt_masked
kvm_mmu_write_protect_pt_masked / kvm_mmu_clear_dirty_pt_masked
kvm_tdp_mmu_clear_dirty_pt_masked
clear_dirty_pt_masked
tdx_write_block_private_pages
tdh_export_blockw
// The place mark the page as dirty
// A page fault occurred, meaning the guest OS wants to modify this page.
fast_pf_fix_direct_spte
kvm_write_unblock_private_page
tdx_write_unblock_private_page
err = tdh_export_unblockw(kvm_tdx->tdr_pa, ept_info.val, &out);
mark_page_dirty_in_slot(vcpu->kvm, fault->slot, gfn);
So not all pages blocked by TDH.EXPORT.BLOCKW need to be unblocked.
TDCS.DIRTY_COUNT is the TD-scope dirty page counter.
- It is cleared when a new migration session begins.
- It is incremented when a page that has previously been exported is unblocked.
- It is decremented when a dirty page is exported by
TDH.EXPORT.ME
.
For successful start token generation by TDH.EXPORT.TRACK, DIRTY_COUNT must be 0, indicating that the newest version of every page exported so far has been exported.
CPUID Virtualization in TDX
According to the "CPUID Virtualization" chapter in the TDX module spec, the CPUID bits of a TD can be classified into 6 types:
------------------------------------------------------------------------
1 | As configured             | configurable by VMM, independent of the native value
------------------------------------------------------------------------
2 | As configured (if native) | configurable by VMM if the bit is supported natively; otherwise it equals the native value (0)
------------------------------------------------------------------------
3 | Fixed                     | fixed to 0/1
------------------------------------------------------------------------
4 | Native                    | reflects the native value
------------------------------------------------------------------------
5 | Calculated                | calculated by the TDX module
------------------------------------------------------------------------
6 | Inducing #VE              | raises a #VE exception
------------------------------------------------------------------------
As Configured: the VMM configures the CPUID value via TDH.MNG.INIT, and the guest TD sees exactly the configured value; the TDX module does not interfere with it.
As Configured (If Native): the VMM configures the value via TDH.MNG.INIT; a bit is exposed to the guest as 1 only if the VMM sets it and the native CPUID bit is also 1.
Fixed: self-explanatory (fixed to the given value no matter what the native CPU's CPUID says).
Native: same as the native value.
Calculated: computed by the TDX module (from what inputs?).
Inducing #VE: when the TD guest queries this CPUID, a #VE is injected into the guest.
- All the configurable XFAM related features and TD attributes related features fall into type #2. And fixed0/1 bits of XFAM and TD attributes fall into type #3.
- For CPUID leaves not listed in "CPUID virtualization Overview" table in TDX module spec, TDX module injects #VE to TDs when those are queried. For this case, TDs can request CPUID emulation from VMM via TDVMCALL and the values are fully controlled by VMM.
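The type-6 (#VE) path is visible in the upstream guest kernel's #VE handler; a condensed sketch of handle_cpuid() from arch/x86/coco/tdx/tdx.c (details vary by kernel version):

static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
{
        struct tdx_hypercall_args args = {
                .r10 = TDX_HYPERCALL_STANDARD,
                .r11 = hcall_func(EXIT_REASON_CPUID),   /* TDVMCALL<Instruction.CPUID> */
                .r12 = regs->ax,                        /* leaf */
                .r13 = regs->cx,                        /* sub-leaf */
        };

        /* Only let the VMM control the hypervisor-communication leaf range. */
        if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
                regs->ax = regs->bx = regs->cx = regs->dx = 0;
                return ve_instr_len(ve);
        }

        /* Ask the VMM to emulate CPUID over TDVMCALL. */
        if (__tdx_hypercall_ret(&args))
                return -EIO;

        /* The returned values are fully controlled by the VMM. */
        regs->ax = args.r12;
        regs->bx = args.r13;
        regs->cx = args.r14;
        regs->dx = args.r15;

        return ve_instr_len(ve);
}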
TDX enables and disables
TDX disables APICv
For the LAPIC, it is a safeguard, because TDX KVM disables APICv with APICV_INHIBIT_REASON_TDX.
APICv is disabled because TDX doesn't support it.