Intel EPT (Extended Page Tables)
SDM 28.3 THE EXTENDED PAGE TABLE MECHANISM (EPT) CHAPTER 28 VMX SUPPORT FOR ADDRESS TRANSLATION VOLUME 3
From GPA to HPA.
当 non-root 下的 CPU 操作 GVA 时,首先使用客户机页表进行地址转换,得到 GPA,然后 CPU 根据 GPA 查询 EPT,在 VMCS 结构中有一个 EPTP 的指针,其中的 12-51 位指向 EPT 页表的一级目录即 PML4 Table 的物理地址。这样根据客户机物理地址的首个 9 位就可以定位一个 PML4 entry。
EPT is used when the “enable EPT” VM-execution control is 1.
Shared EPT pointer / Secure EPT pointer / Secure EPTP
这是一个 VMCS field。这是 TDX 引入的。
A 64-bit execution control field to specify the Shared-EPT pointer. In SEAM VMX non-root operation, the plan dictates two EPTs be active:
- the private EPT specified using the EPTP field of the VMCS and,
- a Shared-EPT specified using the Shared-EPTP field of the VMCS.
private EPT 需要 EPTP field 吗?压根不需要,因为 private EPT 是 TDX Module 里面 manage 的,硬件做 traverse 的时候是用的 TDX Module 里的页表,KVM 里仅仅是做了 cache 而已,EPTP 存在的意义就是 KVM 告诉硬件 EPT 翻译的页表在这里,现在 TDX Module 里面有自己的页表了,所以不需要 KVM 来指定。TDX Module 里面的页表空间是我们手动 SEPT ADD 进去的。也就无需传递一个 pointer 进去了。可能仅仅需要把第一个 SEPT 页表的地址传进去就行了。
就像注释里说的:Adding private page: The procedure of populating the private page looks as follows.
- TDH.MEM.SEPT.ADD(512G level)
- TDH.MEM.SEPT.ADD(1G level)
- TDH.MEM.SEPT.ADD(2M level)
- TDH.MEM.PAGE.AUG(4K level)
r = tdx_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, 0, out);
EPTP (EPT Pointer)
enum vmcs_field {
//...
EPT_POINTER = 0x0000201a,
EPT_POINTER_HIGH = 0x0000201b,
//...
}
kvm_mmu_load_pgd
static_call(kvm_x86_load_mmu_pgd)
vt_load_mmu_pgd
if (is_td_vcpu(vcpu))
tdx_load_mmu_pgd
vmx_load_mmu_pgd
if (enable_ept) {
vmcs_write64(EPT_POINTER, eptp);
EPT PTE bits illustrated / EPT format
Table 28-1. Format of an EPT PML4 Entry (PML4E) that References an EPT Page-Directory-Pointer Table
Table 28-2. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page(可以看作 1G page 的 PTE)
Table 28-3. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that References an EPT Page Directory
Table 28-4. Format of an EPT Page-Directory Entry (PDE) that Maps a 2-MByte Page (可以看作 2M page 的 PTE)
Table 28-5. Format of an EPT Page-Directory Entry (PDE) that References an EPT Page Table
Table 28-6. Format of an EPT Page-Table Entry that Maps a 4-KByte Page(4K page 的 PTE)
How to build the page table
kvm_tdp_page_fault
direct_page_fault
__direct_map
Page table vs. EPT
Page Table (CR3) | EPT | |
---|---|---|
Per | Process | VM |
Mapping in each level | To GPA of the next level table | to HPA of the next level table |
Ept page walk
- (gCR3 (L4 GPA), EPTP) -> L4 HPA
- (L4 HPA, GVA) -> L3 GPA
- (L3 GPA, EPTP) -> L3 HPA
- (L3 HPA, GVA) -> L2 GPA
- (L2 GPA, EPTP) -> L2 HPA
- (L2 HPA, GVA) -> L1 GPA
- (L1 GPA, EPTP) -> L1 HPA
- (L1 HPA, GVA) -> GPA
- (GPA, EPTP) -> HPA
if all the structure is correctly settled, then there is no vmexit.
EPT violation / EPT misconfiguration
EPT 页表项里并没有 present bit,那么什么叫做一个 present 的 EPT PTE(SDM 里会用这个词)?
From SDM: An EPT paging-structure entry is present if any of bits 2:0 is 1; otherwise, the entry is not present.
EPT violation 和 EPT misconfiguration 都属于是一种 VM-exit,其本身的概念用来描述 VM-exit 发生的原因。当然,少部分情况:If the “EPT-violation #VE” VM-execution control is 1, certain EPT violations may cause virtualization exceptions instead of VM exits.
Two conditions:
- EPT violation,EPT violation 是一个同步的事件, VM-exit,这种情况典型场景就是页结构不存在,类似于主机上的缺页故障(#PF Page Fault),发生这种故障后 CPU 会退回到内核态,同时会在 VM-exit qualification 字段记录 EPT violation 故障的详细信息。EPT 页结构和主机上普通的页结构类似,页表都是故障触发,逐渐填充页结构的过程。在客户态模式下,这是 EPT violation 故障;在主机上,这就是缺页故障。怎样判断页结构不存在呢?通过查看每个描述页结构条目的低 3 位(read/write/execute),如果低 3 位全都为 0,就表示指向的下一级页结构不存在(not-present)。这是非页表的情况,如果是页表(Page Table),它的条目的最低位是 Present 位,可以单独用来表示指向的页是否存在。
- EPT misconfiguration,另一种会导致页故障的情况,它是在页结构存在的情况下(页结构低 3 位的任何一位非 0),发现页结构的内容设置不恰当而产生的故障(包含 0-2 bits 设置不合理的情况),这种情况是 EPT 的页结构真的出了配置上的问题而产生的故障,对于 EPT misconfiguration,并不会记录详细信息。VM-exit qualification 字段为未定义值。
注意,故障不同于异常,硬件故障后会跳转到相应的处理例程,在处理完之后会重新运行出故障时的指令而非下一条,EPT violation 和主机页表中的 Page Fault 一样,都属于故障。对于使用 EPT 页表或者主机页表的用户程序来说,当访问到不存在的页结构触发了故障后,硬件会处理然后重新执行,用户程序是感觉不到的。
内存虚拟化硬件基础——EPT_享乐主的博客-CSDN博客_ept
Why if a guest page walk has n levels and a nested page walk has m levels, a 2d walk requires nm + n + m page entry references.
?
You can imagine a rectangle with length n + 1 and width m + 1, but without the final item
Who maintains the EPT and how does it be maintained?
One page table is maintained by guest OS, which is used to generate the guest’s physical address (GPA). The other page table is maintained by VMM, which maps the guest’s physical address to the host’s physical address.
EPT invalidation / INVEPT
INVEPT
instruction.
Invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches that were derived from extended page tables (EPT).
This instruction will only invalidate current CPU's EPT, if we want all the TLB's are invalidated, we should send IPIs.
How to generate a lot of EPT invalidations / EPT violations?
- Memory ballooning^
-
migratepages
^ - In older kernel version,
mmu_shrink_scan()
这个函数会对 EPT 页表本身所占的空间(注意不是 Guest VM 内存)进行回收,在需要的时候再从 KVM 或者 Userspace 根据映射来取。这本身会造成大量的 EPT Violations,因为 EPT 页表不存在了。