Two-Dimensional Paging (TDP)

TDP is known as NPT on AMD and EPT on Intel. For vendor neutrality, TDP is used to refer to the hardware-assisted paging mechanism.[^5]

hardware_setup
    kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(), ept_caps_to_lpage_level(vmx_capability.ept));
        kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, int tdp_max_root_level, int tdp_huge_page_level)

As you can see, enable_ept is just enable_tdp. But the TDP concept is much older: the enable_tdp global variable was introduced as early as 2008, in the patchset where AMD added it to enable NPT:

KVM: add support for SVM Nested Paging - Joerg Roedel

Why the name TDP? Because of the fourth patch in that series:

The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Two Dimensional Paging (TDP) to avoid confusion with (future) TDP implementations of other vendors. This patch exports the availability of TDP to the generic x86 code.

So by the time Google's TDP MMU patches were posted, TDP itself had long been usable. Why, then, did Google propose these patches?

This patch set introduces a new implementation of much of the KVM MMU, optimized for running guests with TDP. We have re-implemented many of the MMU functions to take advantage of the relative simplicity of TDP and eliminate the need for an rmap.

As we can see, the work serves two main purposes: 1. refactoring, and 2. optimization.

What exactly was optimized?

  • Support for handling page faults in parallel for very large VMs. When VMs have hundreds of vCPUs and terabytes of memory, KVM's MMU lock suffers extreme contention, resulting in soft-lockups and long latency on guest page faults.

Google's TDP MMU work actually consists of two patch series, which are covered in the two sections further below.

tdp_mmu_init_sp() KVM

sp is a page table; gfn is the base gfn that this page table maps. For the root page, for example, this gfn is 0.

sptep points to the SPTE that corresponds to this sp, i.e. the parent entry that maps it (it is saved into sp->ptep).


static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep, gfn_t gfn)
{
	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);

	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
    //...
	sp->gfn = gfn;
	sp->ptep = sptep;
	sp->tdp_mmu_page = true;
    //...
}
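
A rough sketch of how this is typically used, loosely modeled on the upstream helper tdp_mmu_init_child_sp() (the sketch name init_child_sp_sketch() is hypothetical; the three-argument signature follows this tree): when a non-leaf SPTE needs to point to a new lower-level page table, the iterator's current position supplies both sptep and gfn.

/*
 * Sketch only: link a freshly allocated child page table at the iterator's
 * current non-leaf entry.  iter->sptep is the SPTE that will point to the
 * child, and iter->gfn is the base GFN that entry covers, which the child
 * inherits as sp->gfn.
 */
static void init_child_sp_sketch(struct kvm_mmu_page *child_sp,
				 struct tdp_iter *iter)
{
	tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn);
}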

__tdp_mmu_set_spte() / tdp_mmu_set_spte_atomic() KVM

__tdp_mmu_set_spte() replaces an old SPTE with a new one and returns the old value.

tdp_mmu_set_spte_atomic() first writes an intermediate value into the SPTE (freezing it), then writes the new SPTE, and returns whether the update succeeded.

/*
 * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
 * ...
 * @sptep:	      Pointer to the SPTE
 * @old_spte:	      The current value of the SPTE
 * @new_spte:	      The new value that will be set for the SPTE
 * @gfn:	      The base GFN that was (or will be) mapped by the SPTE
 * @level:	      The level _containing_ the SPTE (its parent PT's level)
 * ...
 *
 * Returns the old SPTE value, which _may_ be different than @old_spte if the
 * SPTE had volatile bits.
 */
static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
			      bool record_acc_track, bool record_dirty_log)
{
	union kvm_mmu_page_role role;

    // ...
    // The core step: write new_spte at sptep, replacing old_spte, and return the old value
	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);

	role = sptep_to_sp(sptep)->role;
	role.level = level;
	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
	/* Because write spin lock is held, no race.  It should succeed. */
	WARN_ON_ONCE(ret);

    //...
	return old_spte;
}
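
For contrast with the atomic variant below, here is a minimal sketch of how the non-atomic path is typically driven (the wrapper name iter_set_spte_sketch() is hypothetical; upstream has similar thin wrappers): the caller holds mmu_lock for write, walks the paging structure with a tdp_iter, and refreshes the iterator's snapshot with the value that __tdp_mmu_set_spte() hands back.

/* Hypothetical wrapper: update one SPTE while mmu_lock is held for write. */
static void iter_set_spte_sketch(struct kvm *kvm, struct tdp_iter *iter,
				 u64 new_spte)
{
	lockdep_assert_held_write(&kvm->mmu_lock);

	/* Record accessed/dirty bookkeeping for the displaced SPTE. */
	iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
					    iter->old_spte, new_spte,
					    iter->gfn, iter->level,
					    true, true);
}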


/*
 * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
 * and handle the associated bookkeeping.  Do not mark the page dirty
 * in KVM's dirty bitmaps.
 *
 * If setting the SPTE fails because it has changed, iter->old_spte will be
 * refreshed to the current value of the spte.
 */
static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
						       struct tdp_iter *iter,
						       u64 new_spte)
{
	/*
	 * For conventional page table, the update flow is
	 * - update SPTE with atomic operation
	 * - handle changed SPTE. __handle_changed_spte()
     * ...
	 *
	 * For private page table, callbacks are needed to propagate SPTE
	 * change into the protected page table.  In order to atomically update
	 * both the SPTE and the protected page tables with callbacks, utilize
	 * freezing SPTE.
	 * - Freeze the SPTE. Set entry to REMOVED_SPTE.
	 * - Trigger callbacks for protected page tables. __handle_changed_spte()
	 * - Unfreeze the SPTE.  Set the entry to new_spte.
	 */
    // For a private SPTE, unless we are removing it, freeze it first
	bool freeze_spte = is_private_sptep(iter->sptep) && !is_removed_spte(new_spte);
	u64 tmp_spte = freeze_spte ? REMOVED_SPTE : new_spte;
    //...

    // The core step: atomically replace old_spte with tmp_spte; if the SPTE
    // changed under us, iter->old_spte is refreshed and we bail out.
	if (!try_cmpxchg64(sptep, &iter->old_spte, tmp_spte))
		return -EBUSY;

	ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
				    new_spte, sptep_to_sp(sptep)->role, true);
    //...
    // If the SPTE was frozen, now write the real new_spte (unfreeze)
	if (freeze_spte)
		__kvm_tdp_mmu_write_spte(sptep, new_spte);
	return ret;
}
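
A minimal sketch of a caller on the parallel page-fault path, loosely modeled on tdp_mmu_map_handle_target_level() (the helper name and the surrounding logic are simplified assumptions): mmu_lock is held only for read, so the installation has to go through the atomic helper, and a lost cmpxchg simply means another vCPU got there first and the fault should be retried.

/* Sketch: install a leaf SPTE under the read lock, retrying on races. */
static int install_leaf_spte_sketch(struct kvm_vcpu *vcpu,
				    struct tdp_iter *iter, u64 new_spte)
{
	/* Nothing changed: the fault was spurious. */
	if (new_spte == iter->old_spte)
		return RET_PF_SPURIOUS;

	/* Lost the cmpxchg: iter->old_spte was refreshed, retry the fault. */
	if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
		return RET_PF_RETRY;

	return RET_PF_FIXED;
}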

for_each_tdp_mmu_root_yield_safe() KVM

#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)			\
	for (_root = tdp_mmu_next_root(_kvm, NULL, false);		\
	     ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root;	\
	     _root = tdp_mmu_next_root(_kvm, _root, false))
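
A hedged sketch of a typical user, modeled on kvm_tdp_mmu_zap_all() (exact helper signatures vary across kernel versions): every root is visited while a reference is held on it, so a root can be released between iterations without breaking the walk.

/* Simplified from kvm_tdp_mmu_zap_all(); details vary by kernel version. */
void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
	struct kvm_mmu_page *root;

	lockdep_assert_held_write(&kvm->mmu_lock);

	/* Each iteration holds a reference on root, see tdp_mmu_next_root(). */
	for_each_tdp_mmu_root_yield_safe(kvm, root)
		tdp_mmu_zap_root(kvm, root, false);
}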

tdp_mmu_next_root() KVM

/*
 * Returns the next root after @prev_root (or the first root if @prev_root is
 * NULL).  A reference to the returned root is acquired, and the reference to
 * @prev_root is released (the caller obviously must hold a reference to
 * @prev_root if it's non-NULL).
 *
 * If @only_valid is true, invalid roots are skipped.
 *
 * Returns NULL if the end of tdp_mmu_roots was reached.
 */
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
					      struct kvm_mmu_page *prev_root,
					      bool only_valid)
{
	struct kvm_mmu_page *next_root;

	/*
	 * While the roots themselves are RCU-protected, fields such as
	 * role.invalid are protected by mmu_lock.
	 */
	lockdep_assert_held(&kvm->mmu_lock);

	rcu_read_lock();

	if (prev_root)
		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						  &prev_root->link,
						  typeof(*prev_root), link);
	else
		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						   typeof(*next_root), link);

	while (next_root) {
		if ((!only_valid || !next_root->role.invalid) &&
		    kvm_tdp_mmu_get_root(next_root))
			break;

		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
				&next_root->link, typeof(*next_root), link);
	}

	rcu_read_unlock();

	if (prev_root)
		kvm_tdp_mmu_put_root(kvm, prev_root);

	return next_root;
}

The first patch series of TDP MMU: basic framework

[PATCH V2 01/20] kvm: x86/mmu: Separate making SPTEs from set_spte

The original set_spte() did two things:

  • Generates leaf page table entries.
  • Inserts them into the paging structure.

It still does both, except that generating the SPTE is now delegated to a new function, make_spte(); set_spte() simply calls it.

No functional change expected.

For comparison, the pre-change mmu_set_spte() is shown first (from partway through its body; it calls set_spte()), followed by the current mmu_set_spte(), which calls make_spte() directly.

	bool flush = false;

	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
		 *sptep, write_fault, gfn);

	if (is_shadow_present_pte(*sptep)) {
		/*
		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
		 * the parent of the now unreachable PTE.
		 */
		if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
			struct kvm_mmu_page *child;
			u64 pte = *sptep;

			child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
			drop_parent_pte(child, sptep);
			flush = true;
		} else if (pfn != spte_to_pfn(*sptep)) {
			pgprintk("hfn old %llx new %llx\n",
				 spte_to_pfn(*sptep), pfn);
			drop_spte(vcpu->kvm, sptep);
			flush = true;
		} else
			was_rmapped = 1;
	}

	set_spte_ret = set_spte(vcpu, sptep, pte_access, level, gfn, pfn,
				speculative, true, host_writable);
	if (set_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) {
		if (write_fault)
			ret = RET_PF_EMULATE;
		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
	}

	if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH || flush)
		kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn,
				KVM_PAGES_PER_HPAGE(level));

	if (unlikely(is_mmio_spte(*sptep)))
		ret = RET_PF_EMULATE;

	/*
	 * The fault is fully spurious if and only if the new SPTE and old SPTE
	 * are identical, and emulation is not required.
	 */
	if ((set_spte_ret & SET_SPTE_SPURIOUS) && ret == RET_PF_FIXED) {
		WARN_ON_ONCE(!was_rmapped);
		return RET_PF_SPURIOUS;
	}

	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
	trace_kvm_mmu_set_spte(level, gfn, sptep);
	if (!was_rmapped && is_large_pte(*sptep))
		++vcpu->kvm->stat.lpages;

	if (is_shadow_present_pte(*sptep)) {
		if (!was_rmapped) {
			rmap_count = rmap_add(vcpu, sptep, gfn);
			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
				rmap_recycle(vcpu, sptep, gfn);
		}
	}

	return ret;
}

static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
			u64 *sptep, unsigned int pte_access, gfn_t gfn,
			kvm_pfn_t pfn, struct kvm_page_fault *fault)
{
	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
	int level = sp->role.level;
	int was_rmapped = 0;
	int ret = RET_PF_FIXED;
	bool flush = false;
	bool wrprot;
	u64 spte;

	/* Prefetching always gets a writable pfn.  */
	bool host_writable = !fault || fault->map_writable;
	bool prefetch = !fault || fault->prefetch;
	bool write_fault = fault && fault->write;

	printk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
		 *sptep, write_fault, gfn);

	if (unlikely(is_noslot_pfn(pfn))) {
		vcpu->stat.pf_mmio_spte_created++;
		mark_mmio_spte(vcpu, sptep, gfn, pte_access);
		return RET_PF_EMULATE;
	}

	if (is_shadow_present_pte(*sptep)) {
		/*
		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
		 * the parent of the now unreachable PTE.
		 */
		if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
			struct kvm_mmu_page *child;
			u64 pte = *sptep;

			child = spte_to_child_sp(pte);
			drop_parent_pte(child, sptep);
			flush = true;
		} else if (pfn != spte_to_pfn(*sptep)) {
			pgprintk("hfn old %llx new %llx\n",
				 spte_to_pfn(*sptep), pfn);
			drop_spte(vcpu->kvm, sptep);
			flush = true;
		} else
			was_rmapped = 1;
	}

	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
			   true, host_writable, &spte);

	if (*sptep == spte) {
		ret = RET_PF_SPURIOUS;
	} else {
		flush |= mmu_spte_update(sptep, spte);
		trace_kvm_mmu_set_spte(level, gfn, sptep);
	}

	if (wrprot) {
		if (write_fault)
			ret = RET_PF_EMULATE;
	}

	if (flush)
		kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);

	pgprintk("%s: setting spte %llx\n", __func__, *sptep);

	if (!was_rmapped) {
		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
		rmap_add(vcpu, slot, sptep, gfn, pte_access);
	} else {
		/* Already rmapped but the pte_access bits may have changed. */
		kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access);
	}

	return ret;
}

[PATCH V2 02/20] kvm: x86/mmu: Introduce tdp_iter

The TDP iterator implements a pre-order traversal of a TDP paging structure.
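
A hedged sketch of how the iterator is typically driven (macro and field names follow later kernels and should be treated as assumptions; the original series spelled some of them differently): the walk visits a non-leaf SPTE before descending into the page table it points to, which is what pre-order means for a paging structure.

/* Sketch: count the present leaf SPTEs covering [start, end) in one root. */
static int count_leaf_sptes_sketch(struct kvm_mmu_page *root,
				   gfn_t start, gfn_t end)
{
	struct tdp_iter iter;
	int count = 0;

	for_each_tdp_pte(iter, root, start, end) {
		/* Parents are visited first; only count leaf mappings. */
		if (is_shadow_present_pte(iter.old_spte) &&
		    is_last_spte(iter.old_spte, iter.level))
			count++;
	}

	return count;
}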

[PATCH V2 03/20] kvm: x86/mmu: Init / Uninit the TDP MMU

The TDP MMU will require new fields that need to be initialized and torn down. Add hooks into the existing KVM MMU initialization process to do that initialization / cleanup. Currently the initialization and cleanup functions do not do very much, however more operations will be added in future patches.
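
As a rough sketch of what these hooks amounted to in the original series (simplified; later kernels add more state here, and the exact bodies may differ), the init hook mainly records that the TDP MMU is in use and prepares the list of roots, while the uninit hook sanity-checks that everything has been torn down:

void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
	if (!READ_ONCE(tdp_mmu_enabled))
		return;

	/* This should not be changed for the lifetime of the VM. */
	kvm->arch.tdp_mmu_enabled = true;

	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
	if (!kvm->arch.tdp_mmu_enabled)
		return;

	/* All TDP MMU roots should have been freed by now. */
	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
}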

[PATCH V2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

Implement a similar, but separate system for root page allocation to that of the x86 shadow paging implementation. When future patches add synchronization model changes to allow for parallel page faults, these pages will need to be handled differently from the x86 shadow paging based MMU's root pages.
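
A hedged sketch of the idea, loosely modeled on how kvm_tdp_mmu_get_vcpu_root_hpa() looks in later kernels (the original patch differs in detail, so treat the names, fields, and locking below as assumptions; note that later kernels also pass the role as a fourth argument to tdp_mmu_init_sp(), unlike the three-argument variant shown earlier): reuse an existing root whose role matches the vCPU, otherwise allocate a fresh root page, take a reference on it, and publish it on tdp_mmu_roots.

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
{
	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
	struct kvm *kvm = vcpu->kvm;
	struct kvm_mmu_page *root;

	lockdep_assert_held_write(&kvm->mmu_lock);

	/* Reuse a live root with a matching role if one already exists. */
	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
		if (root->role.word == role.word &&
		    kvm_tdp_mmu_get_root(root))
			goto out;
	}

	/* Otherwise create a new root page and publish it on the list. */
	root = tdp_mmu_alloc_sp(vcpu);
	tdp_mmu_init_sp(root, NULL, 0, role);

	refcount_set(&root->tdp_mmu_root_count, 1);

	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
	list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

out:
	return __pa(root->spt);
}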

The second patch series of TDP MMU: allow parallel MMU operations