The virtualization of TLB flush and TLB shootdown relies on the KVM steal time mechanism.

What is a TLB shootdown / TLB flush

The action of one processor causing the TLBs to be flushed on other processors is called a TLB shootdown.

A quick example:

  • You have some memory shared by all of the processors in your system.
  • One of your processors restricts access to a page of that shared memory.
  • Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
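A minimal userspace sketch (hypothetical, only to illustrate the pattern): several threads touch a page so that multiple CPUs cache its translation, then mprotect() restricts access, which forces the kernel to shoot down the stale TLB entries on those CPUs:

// Build with: gcc -pthread shootdown.c
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;

static void *toucher(void *arg)
{
	/* Read the page so this CPU caches its translation in the TLB. */
	volatile char c = page[0];
	(void)c;
	(void)arg;
	sleep(1);	/* keep the thread alive across the mprotect() */
	return NULL;
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	pthread_t t[4];
	int i;

	page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(page, 0, psz);

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, toucher, NULL);
	usleep(100000);	/* let the touchers run and populate their TLBs */

	/*
	 * Restrict access: the kernel must invalidate the stale TLB
	 * entry on every CPU the toucher threads ran on (via IPIs on
	 * x86 when PV TLB flush is not available).
	 */
	mprotect(page, psz, PROT_NONE);

	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}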

TLB shootdown virtualization performance issue

A remote TLB flush does a busy wait, which is fine on bare metal. But within a guest, the target vCPUs might be preempted or blocked. In that case the initiator vCPU ends up busy-waiting for a long time, and CPU time is also wasted waking up the target of the shootdown just to service the flush.
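The busy wait happens in kernel/smp.c: the initiator spins until the target CPU has executed the flush callback. Roughly the 5.x form (field names are kernel-version dependent):

static __always_inline void csd_lock_wait(struct __call_single_data *csd)
{
	/*
	 * Spin until the target CPU runs the callback and clears
	 * CSD_FLAG_LOCK. If the target vCPU is preempted, this can
	 * spin for a whole scheduling quantum or longer.
	 */
	smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
}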

Without PV, a remote TLB flush is delivered via IPI:

flush_tlb_mm_range()
    flush_tlb_multi()
        __flush_tlb_multi()
            // if using kvm-pv-tlb-flush, it will be kvm_flush_tlb_multi()
            native_flush_tlb_multi()
                on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func, (void *)info, 1, cpumask);
                    smp_call_function_many_cond()
                        arch_send_call_function_ipi_mask()
                            smp_ops.send_call_func_ipi()
                                apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);

How to solve this using a paravirtualization idea

With PV TLB shootdown, the initiator vCPU does not wait for a preempted vCPU; instead it just sets a flag in the guest-VMM shared area, and KVM checks this flag and performs the TLB flush when the preempted vCPU is scheduled back in.

kvm-pv-tlb-flush is built on top of kvm-steal-time.

Guest PV feature detection

The guest checks the feature bit KVM_FEATURE_PV_TLB_FLUSH before enabling para-virtualized TLB flush.

kvm_guest_init
    // Detected via KVM_FEATURE_PV_TLB_FLUSH, among other checks
    if pv_tlb_flush_supported
        // Install the two PV callbacks
        pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
        pv_ops.mmu.tlb_remove_table = tlb_remove_table;
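For reference, the guard helper in arch/x86/kernel/kvm.c looks roughly like this (exact conditions vary across kernel versions); note the dependency on KVM_FEATURE_STEAL_TIME, since the flag lives in the steal-time area:

static bool pv_tlb_flush_supported(void)
{
	/* Needs the feature bit, no realtime hint, and steal time. */
	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
		!kvm_para_has_hint(KVM_HINT_REALTIME) &&
		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}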

KVM_VCPU_FLUSH_TLB Bit

This bit lives in the following per-vCPU shared structure:

struct kvm_steal_time {
	__u64 steal;
	__u32 version;
	__u32 flags;
	__u8  preempted; // KVM_VCPU_FLUSH_TLB and KVM_VCPU_PREEMPTED bits live here
	__u8  u8_pad[3];
	__u32 pad[11];
};
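The bits carried in preempted are defined in arch/x86/include/uapi/asm/kvm_para.h:

#define KVM_VCPU_PREEMPTED          (1 << 0)
#define KVM_VCPU_FLUSH_TLB          (1 << 1)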

Where it is checked and consumed (in KVM):

if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
    record_steal_time(vcpu);
        if (st_preempted & KVM_VCPU_FLUSH_TLB)
            // flush this vCPU's TLB
            kvm_vcpu_flush_tlb_guest(vcpu);

Where it is set (in the guest kernel):

// In guest kernel, trigger flush tlb
flush_tlb_mm_range
    flush_tlb_multi
        __flush_tlb_multi
            PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
                // sets KVM_VCPU_FLUSH_TLB for each preempted target vCPU
                kvm_flush_tlb_multi

KVM_VCPU_PREEMPTED Bit

Same as KVM_VCPU_FLUSH_TLB, this is another single bit in the preempted field of the struct kvm_steal_time shown above.

When the vCPU is scheduled out, this flag is set in the steal_time area shared by the host and the guest, and it is cleared on the next VM entry. A set bit tells the guest that the vCPU is currently scheduled out, so the guest can use the PV TLB flush path for it instead of sending an IPI.
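The PV TLB flush path is not the only reader of this bit: the guest's vcpu_is_preempted() (used by the PV spinlock and scheduler code) checks the same byte. The x86 guest-side helper in arch/x86/kernel/kvm.c is roughly:

__visible bool __kvm_vcpu_is_preempted(long cpu)
{
	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

	/* Non-zero means the host scheduled this vCPU out. */
	return !!(src->preempted & KVM_VCPU_PREEMPTED);
}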

The place to set this bit:

kvm_sched_out
    kvm_arch_vcpu_put
        kvm_steal_time_set_preempted
            // set KVM_VCPU_PREEMPTED on steal time
            copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted))
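The clear side is the next VM entry: record_steal_time() fetches and zeroes the whole preempted byte atomically, which is also the moment a pending KVM_VCPU_FLUSH_TLB request is consumed. A simplified sketch (the real arch/x86/kvm/x86.c code performs the xchg on guest memory through the mapped steal-time cache):

/* Simplified: atomically read and clear the shared byte, then honor
 * any TLB flush the guest queued while this vCPU was scheduled out. */
u8 st_preempted = xchg(&st->preempted, 0);

if (st_preempted & KVM_VCPU_FLUSH_TLB)
	kvm_vcpu_flush_tlb_guest(vcpu);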

The original patch series, with the benchmark below:

https://lore.kernel.org/all/1513128784-5924-1-git-send-email-wanpeng.li@hotmail.com/

Tested on a 2-socket Xeon Gold 6142 @ 2.6GHz (32 cores, 64 threads, i.e. 64 pCPUs); each VM has 64 vCPUs.

ebizzy -M 
              vanilla    optimized     boost
1VM            46799       48670         4%
2VM            23962       42691        78%
3VM            16152       37539       132%

pv_ops.mmu.flush_tlb_multi

1. The guest initiates a TLB flush request by setting the KVM_VCPU_FLUSH_TLB bit

This function is only called by:

flush_tlb_mm_range
    flush_tlb_multi
static inline void __flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info)
{
	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
}

The default callback is native_flush_tlb_multi, used when the kernel runs on bare metal (or the PV feature is unavailable). In the KVM PV TLB flush scenario, the callback is kvm_flush_tlb_multi.

kvm_flush_tlb_multi

This function tries to set the KVM_VCPU_FLUSH_TLB bit for each target vCPU:

  • If the vCPU's KVM_VCPU_PREEMPTED bit is set, we can piggyback on its next VM entry to flush the TLB, so the PV method applies: just set the KVM_VCPU_FLUSH_TLB bit and drop the CPU from the IPI mask.
  • If it is not set (or the cmpxchg loses the race because the vCPU just woke up), the vCPU is running, so fall back to the legacy method and send it an IPI.

// Guest kernel
static void kvm_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info)
{
	u8 state;
	int cpu;
	struct kvm_steal_time *src;
	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);

	cpumask_copy(flushmask, cpumask);
	for_each_cpu(cpu, flushmask) {
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED)) {
            // set the KVM_VCPU_FLUSH_TLB flag
			if (try_cmpxchg(&src->preempted, &state, state | KVM_VCPU_FLUSH_TLB))
                // PV request queued; drop this CPU from the IPI mask
				__cpumask_clear_cpu(cpu, flushmask);
		}
	}

	native_flush_tlb_multi(flushmask, info);
}

2. Host KVM handles the TLB flush request (KVM_VCPU_FLUSH_TLB)

KVM_REQ_STEAL_UPDATE

The KVM_REQ_STEAL_UPDATE request bit for a vCPU is set when:

  • the guest kernel writes the MSR MSR_KVM_STEAL_TIME;
  • kvm_arch_vcpu_load is called, i.e. the vCPU is scheduled back in.

Because the guest kernel writes MSR_KVM_STEAL_TIME only once, when it registers the steal-time area, the usual trigger is the second condition: the vCPU is scheduled back in, kvm_arch_vcpu_load raises KVM_REQ_STEAL_UPDATE, and on the next vcpu_enter_guest KVM checks whether KVM_VCPU_FLUSH_TLB is set:

vcpu_enter_guest
    if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
        record_steal_time
            // here, the KVM_VCPU_FLUSH_TLB bit is found set
            if (st_preempted & KVM_VCPU_FLUSH_TLB)
                kvm_vcpu_flush_tlb_guest(vcpu);
                    vmx_flush_tlb_guest
                        vmx_get_current_vpid
                        vpid_sync_context
                            vpid_sync_vcpu_single
                                vmx_asm2(invvpid, "r"(ext), "m"(operand), ext, vpid, gva);

pv_ops.mmu.tlb_remove_table

// These four functions are called when a page-table page at the
// corresponding level is freed, e.g. ___pmd_free_tlb for a PMD:
___pte_free_tlb
___pmd_free_tlb
___pud_free_tlb
___p4d_free_tlb
    paravirt_tlb_remove_table
        PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);

The original (default) callback is tlb_remove_page; after kvm_guest_init installs the PV ops, the callback is tlb_remove_table.

tlb_remove_page

This function mainly adds the physical page to a batching structure; when the batch reaches its maximum size (checked in __tlb_remove_page_size), the accumulated pages are freed in one go.

Its call chain:

tlb_remove_page
    tlb_remove_page_size
        if __tlb_remove_page_size
            tlb_flush_mmu
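This chain corresponds to the following code in include/asm-generic/tlb.h (roughly; details differ across kernel versions):

static inline void tlb_remove_page_size(struct mmu_gather *tlb,
					struct page *page, int page_size)
{
	/* __tlb_remove_page_size() queues the page and returns true
	 * when the batch is full, which triggers an immediate flush. */
	if (__tlb_remove_page_size(tlb, page, page_size))
		tlb_flush_mmu(tlb);
}

static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
{
	return tlb_remove_page_size(tlb, page, PAGE_SIZE);
}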

tlb_flush_mmu

Frees the physical pages accumulated so far.

tlb_remove_table

Batches page-table pages: when the batch reaches MAX_TABLE_BATCH, tlb_table_flush is invoked to free the previously accumulated physical pages that held the page directories at each level.

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;
    //...
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}

tlb_table_flush

Frees the previously accumulated physical pages that held the page directories at each level.
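For reference, the flush in mm/mmu_gather.c looks roughly like this; note that the batch is handed to RCU rather than freed immediately:

static void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		tlb_table_invalidate(tlb);
		/* Free the batched page-table pages only after an RCU
		 * grace period, so lockless walkers stay safe. */
		call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}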

Why change from tlb_remove_page to tlb_remove_table?