PML (Page Modification Logging)
Intel PML is Page Modification Logging, don't mix it with Page Map Level. Intel PML 是基于 A/D bits support 的。
Patchset: [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support - Kai Huang
Not supported in TDX:
__init int vt_hardware_setup(void)
{
//...
/* Need to be set before vmx_hardware setup */
if (enable_tdx)
enable_pml = false;
//...
}
Intel Page Modification Logging (PML) is an enhancement of EPT.
A new 64-bit VM-execution control field called PML address is introduced. The PML address points to a 4KB aligned physical memory page called PML logging buffer. The buffer is organized in 512 64-bit entries which store logged GPAs. A new16-bit guest-state field called PML index is also introduced. The PML index is the logical index of the next entry in the logging buffer. The PML index is typically a value in the range 0-511 (starting from 511).
When PML is enabled, each write instruction which sets a dirty flag in the EPT during a page walk triggers the logging of the GPA. The PML index is decremented after each logging operation. Whenever the PML logging buffer is full, the processor raises a VMExit and the hypervisor comes into play. The logging process restarts once the PML index is reset.
问题一:只 log dirty bit,不 log accessed bit 吗?
是的,只是做 dirty logging 用的,不涉及到 accessed logging。主要是用来记录哪些 GPA page 已经更改过了,变 dirty 了。
问题二:没有 PML 之前的世界是什么样的?
即使没有 PML,也有 A/D bits 可以用吧,我们知道 A/D bits 相比于 write protection 的方式性能上会有很大提升,那么 PML 相比于 A/D bits 的优势是什么?即使没有 PML,某一时刻我们如果想知道一个 page 是否 dirty 的话,直接读它的 dirty bit 不就好了吗?
在每次 GET_DIRTY_LOG
的时候,需要把整个 EPT table 都遍历一遍来看谁的 dirty 被置上了,有了 PML 之后就不需要了,可能这是一个优势。
Host page table 有 PML 类似的东西吗?应该没有,可能是因为在 host 上不需要一次性拿到所有的 dirty 的 pages,可能只是需要周期性地把 dirty 的 page 写回磁盘所以定期扫一下所有的 page cache 发现如果是 dirty 的话说明写过了就写回磁盘就行了,use case 不一样,所以就没给 host page table 设计 PML 对应的技术。
一言以蔽之,在虚拟化中,我们关注的是 dirty 的 page 有哪些;在 Bare-metal 环境中,我们关注的是一个 page 是不是 dirty 的。
Currently, dirty logging is done by write protection, which write protects guest memory, and mark dirty GFN to dirty_bitmap
in subsequent write fault. This works fine, except with overhead of additional write fault for logging each dirty GFN. The overhead can be large if the write operations from geust is intensive.
在没有 PML 前,VMM 要监控 Guest 对于 Guest page 的修改时,需要将 EPT 的页面结构设置为 not-present 或者 read-only,这样对于这些 page 的写会触发许多 EPT violations,开销非常大。有了 PML 之后,写就不会触发 VMExit 了。这提升了性能(极端情况下可能是 512 倍?)。
问题三:为什么 TDX 不支持 PML?
Introduction to VT-x Page-Modification Logging - L
vmx_flush_pml_buffer()
KVM
把硬件 PML buffer 里的 dirty pfn 拿出来,存到 KVM 对应的结构体中去。
__vmx_handle_exit
// 本来我们每次因为 PML full exit 出来时进行 flush 就行了,现在的实现是
// 每次 exit 出来都 flush,这样的好处一方面是能够让 dirty bitmap 更加准确。
// 另一个好处是 kick vCPUs 出来就能保证他们 dirty bitmap 是最新的。
/*
* Flush logged GPAs PML buffer, this will make dirty_bitmap more
* updated. Another good is, in kvm_vm_ioctl_get_dirty_log, before
* querying dirty_bitmap, we only need to kick all vcpus out of guest
* mode as if vcpus is in root mode, the PML buffer must has been
* flushed already. Note, PML is never enabled in hardware while
* running L2.
*/
if (enable_pml && !is_guest_mode(vcpu))
vmx_flush_pml_buffer(vcpu);
static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u64 *pml_buf;
u16 pml_idx;
// pml_idx VMCS guest state
pml_idx = vmcs_read16(GUEST_PML_INDEX);
// 因为是从上往下减的,所以为 511 时可以看作是 empty。
// Do nothing if PML buffer is empty,因为没什么好 flush 的
if (pml_idx == (PML_ENTITY_NUM - 1))
return;
// PML index always points to next available PML buffer entity
if (pml_idx >= PML_ENTITY_NUM)
pml_idx = 0;
else
pml_idx++;
pml_buf = page_address(vmx->pml_pg);
// 从当前 pml index 一直加到上面 PML_ENTITY_NUM
// 并且一路标为 dirty
for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
gpa = pml_buf[pml_idx];
WARN_ON(gpa & (PAGE_SIZE - 1));
// 这个地方是重点,mark 这个 page 为 dirty
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
}
// 写回 PML index
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}